
Test Database Resilience with the CockroachDB Fault Tolerance Demo

Published on April 8, 2026


    Key Takeaways

    • Test database fault tolerance by triggering a real AZ failure within your cluster

    • CockroachDB maintains availability during the failure event

    • The demo measures QPS and latency to validate resilience under disruption


    The CockroachDB fault tolerance demo lets you trigger a real availability zone (AZ) failure within your cluster and observe how the system maintains availability, performance, and data consistency under disruption.

    CockroachDB Cloud Advanced customers can now trigger a live AZ failure on their cluster from their Cloud Console and watch the database keep running. The Fault Tolerance Demo is available in Public Preview.

    Most database administrators are familiar with the “primary/secondary” model: when a primary node goes down, a replica is promoted, connections reset, and the cluster is eventually restored through a failback. In that architecture, failover is a disruptive event that drops connections and can temporarily impact availability.

    CockroachDB distributes nodes across AZs, and every node actively participates in reads and writes. When one zone fails, the nodes in that zone go offline and the remaining zones keep serving traffic. When the AZ recovers, replicas rebalance automatically.
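The availability math behind this is simple majority quorum. A minimal sketch (the function and its parameters are illustrative, not CockroachDB internals): with 3-way replication and one replica per AZ, each range keeps a Raft majority as long as only one AZ is down.

```python
# Illustrative sketch: why a 3-AZ, 3-way-replicated cluster survives one AZ loss.
# A range stays available while a majority of its replicas can still vote.

def quorum_survives(replication_factor: int, failed_replicas: int) -> bool:
    """True if the surviving replicas still form a majority (Raft quorum)."""
    surviving = replication_factor - failed_replicas
    return surviving > replication_factor // 2

# One AZ down with replication factor 3: 2 of 3 replicas remain -> quorum holds.
print(quorum_survives(3, 1))  # True
# Two AZs down: 1 of 3 replicas remains -> quorum lost.
print(quorum_survives(3, 2))  # False
```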

    The fault tolerance demo makes this observable. Rather than only describing what happens during an AZ failure, it triggers a real availability zone failure in your cluster and walks through each step in real time, with live metrics and clear narration as events unfold.

    How to run the fault tolerance demo

    This demo is designed to help you test and visualize database resilience in a controlled environment before production.

    Open the Cloud Console, navigate to the Overview Page for your cluster, and select Actions > Run Fault tolerance demo. The demo takes 10-15 minutes end-to-end.

    NOTE: Be sure to run this demo on a dedicated test or staging cluster, not production! The failure it triggers is real.

    Before launching, the system will verify:

    1. Your cluster has at least three nodes, all healthy

    2. The cluster’s CPU utilization is below 30%

    3. The cluster is not in a locked state (e.g., mid-maintenance, mid-upgrade, mid-creation)

    4. You have Cluster Operator or Cluster Admin role assigned for your cluster

    5. Another fault tolerance demo is not running on the cluster
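The checks above can be sketched as a simple validation pass. This is a hypothetical illustration of the preflight logic; the field names, role strings, and structure are assumptions, not the Cloud Console's actual API.

```python
# Hypothetical sketch of the demo's preflight checks. All names and
# thresholds mirror the checklist in the post, not a real API.
from dataclasses import dataclass

@dataclass
class ClusterState:
    healthy_nodes: int
    cpu_utilization: float   # 0.0 - 1.0
    locked: bool             # mid-maintenance, mid-upgrade, or mid-creation
    user_role: str
    demo_running: bool

def preflight_errors(c: ClusterState) -> list:
    """Return the list of reasons the demo cannot launch (empty = go)."""
    errors = []
    if c.healthy_nodes < 3:
        errors.append("cluster needs at least three healthy nodes")
    if c.cpu_utilization >= 0.30:
        errors.append("CPU utilization must be below 30%")
    if c.locked:
        errors.append("cluster is locked (maintenance, upgrade, or creation)")
    if c.user_role not in ("Cluster Operator", "Cluster Admin"):
        errors.append("requires Cluster Operator or Cluster Admin role")
    if c.demo_running:
        errors.append("another fault tolerance demo is already running")
    return errors
```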

    Two operational notes:

    • If you want to cancel and restart the demo, wait a few minutes before trying again.

    • Cleanup of the temporary database continues for a few minutes after the demo ends.

    Try the Fault Tolerance Demo in the Cloud Console

    How the fault tolerance demo triggers an availability zone failure in your cluster

    The CockroachDB fault tolerance demo spins up a temporary database and runs a TPC-C workload against your cluster – concurrent inserts, updates, and reads that give the cluster meaningful traffic before the disruption starts. Baseline QPS and latency are captured first, so you have a clear before-and-after. This allows you to validate database performance during failover by comparing baseline and disruption metrics in real time.
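The two baseline numbers are straightforward to reason about. A minimal sketch of how they're typically computed (the function names and sampling method are assumptions for illustration, not the demo's internals): QPS from a query-counter delta over a window, and p99 latency by nearest rank over a latency sample.

```python
# Illustrative sketch of the baseline metrics the demo captures.
# Names and method are assumptions, not CockroachDB internals.
import math

def qps(count_start: int, count_end: int, seconds: float) -> float:
    """Queries per second: counter delta over the sampling window."""
    return (count_end - count_start) / seconds

def p99(latencies_ms: list) -> float:
    """99th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

# 90,000 queries over a 60-second window -> 1,500 QPS.
print(qps(0, 90_000, 60.0))  # 1500.0
# 98 fast requests plus two slow ones: p99 reflects the tail, not the outlier.
print(p99([5.0] * 98 + [9.0, 40.0]))  # 9.0
```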

    Then it blocks network communication to all nodes in one availability zone. This disruption is real, and those nodes go offline. An interactive explainer steps through each event as it happens — node offline, leader re-elected, data rebalancing underway — and you can click through to see active databases and live SQL activity throughout. Logs, node status, and performance metrics stream in the console during the disruption window.

    The disruption automatically recovers after 10 minutes, followed by a full failover and recovery summary. Here’s what that looks like on a real run:

    Post-failure failover and recovery summary

    In this run, QPS held steady in the 1,470–1,960 range through the entire failure window. p99 latency remained essentially flat, indicating consistent performance even during the disruption. 

    Note before you start: During the disruption phase, QPS may briefly appear to drop significantly on the chart. This is a metrics artifact, not a loss of availability. Nodes in the affected AZ stop reporting metrics, so the chart temporarily loses data points from that zone. The remaining nodes continue serving traffic normally throughout the failure.
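The artifact is easy to reproduce on paper: the chart sums QPS as reported per node, so when the nodes in the failed AZ stop reporting, their contribution simply vanishes from the sum. A small sketch (node names and numbers are made up for illustration):

```python
# Sketch of the charting artifact described above: summing per-node QPS
# dips when a node stops reporting, even though surviving nodes keep
# serving traffic. Node names and values are illustrative.

def chart_qps(per_node_qps: dict) -> float:
    """Sum only the nodes that reported a sample (None = no report)."""
    return sum(v for v in per_node_qps.values() if v is not None)

before = {"n1": 600.0, "n2": 600.0, "n3": 600.0}  # all AZs reporting
during = {"n1": 600.0, "n2": 600.0, "n3": None}   # n3's AZ cut off

print(chart_qps(before))  # 1800.0
print(chart_qps(during))  # 1200.0 -- an apparent 33% drop, purely from
                          # the missing data points, while n1 and n2
                          # continue serving (and soon absorb) the load
```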

    Test additional database failure scenarios

    If you want to explore CockroachDB’s behavior across a wider range of failure scenarios, the Performance under Adversity Benchmark defines six levels of failure severity – from routine disruptions like schema changes or backups all the way to regional outages. Results are published in a live dashboard you can explore directly.

    Try CockroachDB Today

    Spin up your first CockroachDB Cloud cluster in minutes. Start with $400 in free credits. Or get a free 30-day trial of CockroachDB Enterprise on self-hosted environments.


    David Bressler is Staff Product Marketer for Cockroach Labs. He has worked in 26 countries, is an accomplished public speaker, and graduated with distinction with an MBA from NYU.

    Ayushi Jain is a Staff Product Manager at Cockroach Labs, where she focuses on the cloud user experience and helping developers get up and running quickly. With over 10 years in enterprise SaaS, she’s passionate about making powerful infrastructure feel simple and accessible.

    Application Resilience