Before You Begin
Make sure you have already completed Cluster Startup and Scaling and have 5 nodes running locally.
Step 1. Set up load balancing
In this module, you'll run a sample workload to simulate multiple client connections. Each node is an equally suitable SQL gateway for the load, but it's always recommended to spread requests evenly across nodes. You'll use the open-source HAProxy load balancer to do that here.
In a new terminal, install HAProxy.
If you're on a Mac and use Homebrew, run:
copy$ brew install haproxy
If you're using Linux and use apt-get, run:
copy$ sudo apt-get install haproxy
Run the
cockroach gen haproxy
command, specifying the port of any node:copy$ cockroach gen haproxy \ --insecure \ --host=localhost \ --port=26257
This command generates an
haproxy.cfg
file automatically configured to work with the nodes of your running cluster.In
haproxy.cfg
, changebind :26257
tobind :26000
. This changes the port on which HAProxy accepts requests to a port that is not already in use by a node.copysed -i.saved 's/^ bind :26257/ bind :26000/' haproxy.cfg
Start HAProxy, with the
-f
flag pointing to thehaproxy.cfg
file:copy$ haproxy -f haproxy.cfg &
Step 2. Run a sample workload
Now that you have a load balancer running in front of your cluster, use the YCSB workload built into CockroachDB to simulate multiple client connections, each performing mixed read/write workloads.
In a new terminal, load the initial
ycsb
schema and data, pointing it at HAProxy's port:copy$ cockroach workload init ycsb \ 'postgresql://root@localhost:26000?sslmode=disable'
Run the
ycsb
workload, pointing it at HAProxy's port:copy$ cockroach workload run ycsb \ --duration=20m \ --concurrency=3 \ --max-rate=1000 \ --splits=50 \ 'postgresql://root@localhost:26000?sslmode=disable'
This command initiates 3 concurrent client workloads for 20 minutes, but limits the total load to 1000 operations per second (since you're running everything on a single machine).
Also, the
--splits
flag tells the workload to manually split ranges a number of times. This is not something you'd normally do, but for the purpose of this training, it makes it easier to visualize the movement of data in the cluster.
Step 3. Check the workload
Initially, the workload creates a new database called ycsb
, creates a usertable
table in that database, and inserts a bunch of rows into the table. Soon, the load generator starts executing approximately 95% reads and 5% writes.
To check the SQL queries getting executed, go back to the Admin UI at http://localhost:8080, click Metrics on the left, and hover over the SQL Queries graph at the top:
To check the client connections from the load generator, select the SQL dashboard and hover over the SQL Connections graph:
You'll notice 3 client connections for the 3 concurrent workloads from the load generator. If you want to check that HAProxy balanced each connection to a different node, you can change the Graph dropdown from Cluster to each of the first three nodes. For each node, you'll see a single client connection.
To see more details about the
ycsb
database andusertable
table, click Databases in the upper left and then scroll down until you see ycsb:You can also view the schema of the
usertable
by clicking the table name:
Step 4. Simulate a single node failure
When a node fails, the cluster waits for the node to remain offline for 5 minutes by default before considering it dead, at which point the cluster automatically repairs itself by re-replicating any of the replicas on the down nodes to other available nodes.
In a new terminal, reduce the amount of time the cluster waits before considering a node dead to the minimum allowed of 1 minute and 15 seconds:
copy$ cockroach sql \ --insecure \ --host=localhost:26000 \ --execute="SET CLUSTER SETTING server.time_until_store_dead = '1m15s';"
Then use the
cockroach quit
command to stop node 5:copy$ cockroach quit \ --insecure \ --host=localhost:26261
Step 5. Check load continuity and cluster health
Go back to the Admin UI, click Metrics on the left, and verify that the cluster as a whole continues serving data, despite one of the nodes being unavailable and marked as Suspect:
This shows that when all ranges are replicated 3 times (the default), the cluster can tolerate a single node failure because the surviving nodes have a majority of each range's replicas (2/3).
Step 6. Watch the cluster repair itself
Scroll down to the Replicas per Node graph:
Because you reduced the time it takes for the cluster to consider the down node dead, after 1 minute or so, you'll see the replica count on nodes 1 through 4 increase. This shows the cluster repairing itself by re-replicating missing replicas.
Step 7. Prepare for two simultaneous node failures
At this point, the cluster has recovered and is ready to handle another failure. However, the cluster cannot handle two near-simultaneous failures in this configuration. Failures are "near-simultaneous" if they are closer together than the server.time_until_store_dead
setting plus the time taken for the number of replicas on the dead node to drop to zero. If two failures occurred in this configuration, some ranges would become unavailable until one of the nodes recovers.
To be able to tolerate 2 of 5 nodes failing simultaneously without any service interruption, ranges must be replicated 5 times.
Restart node 5, using the same command you used to start the node initially:
copy$ cockroach start \ --insecure \ --store=node5 \ --listen-addr=localhost:26261 \ --http-addr=localhost:8084 \ --join=localhost:26257,localhost:26258,localhost:26259 \ --background
In a new terminal, use the
ALTER RANGE ... CONFIGURE ZONE
command to change the cluster'sdefault
replication factor to 5:copy$ cockroach sql --execute="ALTER RANGE default CONFIGURE ZONE USING num_replicas=5;" --insecure --host=localhost:26000
Back in the Admin UI Overview dashboard, watch the Replicas per Node graph to see how the replica count increases and evens out across all 5 nodes:
This shows the cluster up-replicating so that each range has 5 replicas, one on each node.
Step 8. Simulate two simultaneous node failures
Use the
cockroach quit
command to stop nodes 4 and 5:copy$ cockroach quit --insecure --host=localhost:26260
copy$ cockroach quit --insecure --host=localhost:26261
Step 9. Check load continuity and cluster health
Like before, go to the Admin UI, click Metrics on the left, and verify that the cluster as a whole continues serving data, despite 2 nodes being offline:
This shows that when all ranges are replicated 5 times, the cluster can tolerate 2 simultaneous node outages because the surviving nodes have a majority of each range's replicas (3/5).
To verify this further, use the
cockroach sql
command to count the number of rows in theycsb.usertable
table and verify that it is still serving reads:copy$ cockroach sql \ --insecure \ --host=localhost:26257 \ --execute="SELECT count(*) FROM ycsb.usertable;"
count +-------+ 10000 (1 row)
And writes:
copy$ cockroach sql \ --insecure \ --host=localhost:26257 \ --execute="INSERT INTO ycsb.usertable VALUES ('asdf', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);"
copy$ cockroach sql \ --insecure \ --host=localhost:26257 \ --execute="SELECT count(*) FROM ycsb.usertable;"
count +-------+ 10001 (1 row)
Step 10. Clean up
In the next module, you'll start a new cluster from scratch, so take a moment to clean things up.
Stop all CockroachDB nodes, HAProxy, and the YCSB load generator:
copy$ pkill -9 cockroach haproxy ycsb
This simplified shutdown process is only appropriate for a lab/evaluation scenario.
Remove the nodes' data directories and the HAProxy config:
copy$ rm -rf node1 node2 node3 node4 node5 haproxy.cfg
What's Next?
Locality and Replication Zones