
Outages Observer: Learning from 2025's Top Outages to Build Unbreakable Systems in 2026

Published on December 17, 2025



    Key Takeaways

    • 2025 exposed structural fragility across clouds, networks, and core services.

    • AI-era traffic will amplify outage impact and stress every dependency.

    • 2026 demands architectures built for full-region failure and continuous availability.


    If 2024 was the year outage anxiety crept into boardrooms, 2025 was the year fragility became impossible to ignore.

    The world watched nervously over the last 12 months as cloud platforms, telecom networks, security systems, payment rails, productivity suites, and even national power grids buckled under pressure. There were different causes, vendors, and industries, but the same outcome: widespread disruption that exposed how brittle our digital infrastructure can be.

    What organizations once dismissed as "rare exceptions" now looks like unmistakable evidence of a systemic resilience problem. The data backs it up: According to Cockroach Labs' State of Resilience 2025 report, which surveyed 1,000 senior technology executives worldwide, 93% worry about downtime's impact on their business, and 100% experienced outage-related revenue loss this year. Per-incident losses ranged from $10,000 to over $1 million, with larger enterprises reporting average losses of nearly $500,000 per outage.

    Perhaps most revealing: Only 20% of executives feel fully prepared to respond to outages, even as their organizations endure an average 86 hours of downtime annually. Fifty-five percent report outages at least weekly.

    Global IT isn't "occasionally unreliable" anymore. It's structurally fragile, and 2025 made that crystal clear.

    When Critical Infrastructure Failed, Everything Else Followed

    Outages in 2025 spanned every layer of the digital ecosystem, revealing just how interdependent modern infrastructure has become.

    The cloud concentration risk crystallized

    • Google Cloud & Workspace instability (January): Multi-hour disruptions rippled through dependent SaaS providers, highlighting how tightly coupled the ecosystem has become.

    • Microsoft 365 global outage (March): Millions lost access to email, collaboration tools, and internal workflows, demonstrating how operational continuity can hinge on a single SaaS provider.

    • Google Cloud global disruption (June): Another broad Google incident underscored the structural risk of concentration in one hyperscaler.

    • AWS us-east-1 outage (October): A DNS-initiated cascade took down DynamoDB, EC2, NLB, and numerous AWS services, sending major apps offline for hours. It reinforced that us-east-1 remains a global single point of failure.

    • Azure + Xbox + Outlook + Microsoft 365 outage (October 29): A massive Azure configuration issue propagated across tightly coupled services, impacting both enterprise and consumer ecosystems.

    The network layer proved more fragile than assumed

    • Verizon wireless outage (August): Phones across major U.S. metros fell to “SOS only,” disrupting logistics, emergency communications, payments, and authentication flows. These connectivity failures showed how quickly real-world operations can grind to a halt.

    Middleware chokepoints became systemic vulnerabilities

    • Cloudflare edge network outage (November 18): A single configuration error disrupted ~20% of global web traffic, affecting platforms from X to global SaaS tools.

    • Cloudflare WAF rule push incident (December 5): A mitigation update intended to improve security instead knocked out 28% of traffic. It was a sharp reminder that shared middleware layers can magnify even small changes into global incidents.

    The data and productivity layer faltered at scale

    • Google Drive/Docs/Sheets outage (November 12): Hours of downtime froze knowledge work across U.S. businesses, underscoring how mission-critical productivity layers have become.

    • Iberian Peninsula blackout (April): A grid failure shut down telecoms, payments, airports, and Internet access across three countries, evidence that even cloud-native architectures inherit risk from their underlying physical infrastructure.

    Security and financial control planes created new risk categories

    • SentinelOne console outage (May): Endpoint agents continued running, but SOC teams lost visibility for nearly seven hours. This “flying blind” scenario exposed how brittle centralized security control planes can be in practice.

    • Venmo payments outage (December 3): An hours-long payment processing failure left users unable to authenticate or send money, revealing how even well-scaled consumer fintech systems can falter under seasonal load and dependency spikes.


    RELATED: Outages Observer: When A Region or Cloud Fails, Resilience Must Be Automatic


    The Root Causes Behind 2025's Failures

    Across this year's incidents, we can see clear structural and operational issues that go beyond individual vendor mistakes.

    Overdependence on single regions and single providers

    AWS us-east-1 continues to operate as a global single point of failure: October’s outage made that clear as thousands of businesses experienced cascading breakdowns. Despite years of multi-cloud rhetoric, most organizations remain tightly coupled to a single region within a single cloud. Even multi-AZ deployments often hide centralized dependencies such as load balancers, DNS, or control planes that only reveal themselves when everything goes down at once.

    Fragile middleware layers with concentrated risk

    Cloudflare’s two major 2025 outages underscored how DNS, CDN, identity, and security layers have become shared bottlenecks where a single misconfiguration can disrupt huge portions of the Internet. These dependencies are so deeply embedded in modern architectures that most companies can’t operate without them, yet few treat the edge itself as a potential failure domain.

    Legacy data architectures not built for region-level failure

    A defining theme of 2025 was how many organizations discovered their data layers couldn’t withstand the failures their applications were supposedly architected to survive. Multi-region correctness remains an Achilles' heel: databases that work only when conditions are ideal, and lose consistency the moment real failure occurs. The gap between “we have replicas” and “we stay correct during regional loss” proved dangerously wide this year.
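
    To make the gap between “we have replicas” and “we stay correct” concrete, here is a minimal, purely illustrative sketch of quorum-acknowledged writes: a write only counts as committed once a majority of regions acknowledge it, so losing one region cannot erase committed data. The region names and classes below are hypothetical, and real consensus-based databases (including CockroachDB) add leader election, replication logs, coordinated reads, and recovery on top of this idea.

```python
# Toy model of quorum-acknowledged writes across regions. Illustrative only:
# real consensus systems add leader election, replication logs, quorum reads,
# and recovery. Region names are placeholders.

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]


class QuorumStore:
    def __init__(self, regions):
        self.replicas = {r: {} for r in regions}    # per-region copies of the data
        self.available = {r: True for r in regions}

    def write(self, key, value):
        """Commit only if a majority of regions can acknowledge the write."""
        acks = [r for r, up in self.available.items() if up]
        quorum = len(self.replicas) // 2 + 1
        if len(acks) < quorum:
            raise RuntimeError(f"write rejected: {len(acks)} acks, need {quorum}")
        for region in acks:
            self.replicas[region][key] = value
        return acks

    def read(self, key):
        """Read from any surviving replica (real systems also coordinate reads)."""
        for region, up in self.available.items():
            if up and key in self.replicas[region]:
                return self.replicas[region][key]
        raise KeyError(key)


store = QuorumStore(REGIONS)
store.write("order:42", "paid")           # acknowledged by all three regions
store.available["us-east-1"] = False      # simulate a full-region failure
print(store.read("order:42"))             # still "paid": quorum survives one region
```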

    Inadequate change management and poor blast-radius modeling

    From Azure’s October 29 outage to Cloudflare’s December WAF incident, 2025 made it clear that misconfigured control planes and well-intentioned mitigation steps can spiral into global disruption. The issue isn’t human error, but immature blast-radius modeling. Teams push changes without fully understanding their dependency surface, and too few organizations have the tooling to simulate how configuration shifts behave under real traffic conditions.
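
    One hedge against blast-radius surprises is to gate every configuration change behind a small canary slice and an automatic comparison against control traffic before promoting it. The sketch below is a simplified illustration, not a production rollout system; the thresholds, sample sizes, and metric source are assumptions you would replace with data from your own observability stack.

```python
# Minimal canary gate: apply a configuration change to a small traffic slice,
# compare its error rate against control traffic, and roll back automatically
# if the canary regresses. Metric source and thresholds are placeholders.
import random


def error_rate(sample_size, failure_prob):
    """Stand-in for a real metrics query against your observability stack."""
    errors = sum(random.random() < failure_prob for _ in range(sample_size))
    return errors / sample_size


def canary_gate(apply_change, rollback, max_regression=0.005, sample=10_000):
    baseline = error_rate(sample, failure_prob=0.001)   # control traffic
    apply_change(percent=1)                             # 1% blast radius first
    canary = error_rate(sample, failure_prob=0.001)     # canary traffic
    if canary - baseline > max_regression:
        rollback()
        return False
    apply_change(percent=100)                           # safe to promote
    return True


if __name__ == "__main__":
    promoted = canary_gate(
        apply_change=lambda percent: print(f"routing {percent}% of traffic to new config"),
        rollback=lambda: print("regression detected, rolling back"),
    )
    print("promoted" if promoted else "rolled back")
```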

    Limited visibility into true dependency chains

    The Venmo, Google Docs, and SentinelOne outages highlighted how deeply interconnected modern systems are, and how little visibility most teams have into those connections. First-order dependencies are usually understood, but second- and third-order links remain invisible until failure forces an unpleasant reveal.
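
    A lightweight way to make second- and third-order links visible is to record each service's first-order dependencies and walk the graph transitively. The services and edges below are hypothetical; in practice the edges would come from a service catalog, distributed traces, or infrastructure-as-code manifests.

```python
# Walk a declared dependency graph to surface the transitive (second- and
# third-order) dependencies that incident reviews usually miss. Service names
# are hypothetical; real edges come from service catalogs, traces, or IaC.
from collections import deque

FIRST_ORDER = {
    "checkout": ["payments-api", "identity"],
    "payments-api": ["primary-db", "fraud-scoring"],
    "fraud-scoring": ["feature-store"],
    "identity": ["identity-provider"],
    "feature-store": ["object-storage"],
}


def transitive_dependencies(service):
    """Breadth-first walk returning every downstream dependency and its depth."""
    seen, queue, depths = {service}, deque([(service, 0)]), {}
    while queue:
        node, depth = queue.popleft()
        for dep in FIRST_ORDER.get(node, []):
            if dep not in seen:
                seen.add(dep)
                depths[dep] = depth + 1
                queue.append((dep, depth + 1))
    return depths


for dep, depth in sorted(transitive_dependencies("checkout").items(), key=lambda x: x[1]):
    print(f"order {depth}: {dep}")
```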

    According to Cockroach Labs’ State of Resilience 2025 report, 95% of executives acknowledge structural weaknesses, yet fewer than one-third perform regular failover testing, and only one in three has a coordinated response plan. The result: enterprises aren’t just running fragile systems, they’re relying on fragile operational practices that leave them exposed even when their architectures look robust on paper.

    Why Downtime Tolerance Evaporated in 2025

    Customer expectations, regulatory requirements, and business realities converged this year to eliminate any remaining tolerance for downtime.

    • Customers expect continuous availability: Users have zero tolerance for downtime, especially when alternatives are only a tap away. The December 3 Venmo outage during peak holiday shopping wasn't just an inconvenience — it was a breach of trust that sent customers to competitors. Even brief interruptions now have lasting consequences. Continuous availability has become the minimum expectation, not a differentiator.

    • Regulators are tightening operational resilience requirements: With mandates like DORA and NIS2 taking effect, operational resilience is shifting from best practice to legal obligation. Yet 79% of executives acknowledge they’re not prepared to meet these standards. Regulators no longer accept “unexpected outage” as an excuse; they expect documented failover processes and routine testing. Compliance now requires architectural proof that critical systems can withstand regional failure.

    • The business costs are compounding: Outages don’t just drain revenue: They create weeks of operational fallout. Teams report heavier workloads, rising burnout, and technical debt introduced under pressure as they scramble to restore service. Leadership is increasingly forced to answer to boards who view repeated downtime as a sign of systemic weakness. “Sorry for the inconvenience” no longer lands as an apology; it lands as negligence.

    AI Will Exploit Every Weakness in 2026

    How does the impact of major outages look when viewed through the lens of our increasing AI dependence? The upward surge of LLMs adds another wrinkle.

    Humans generate discrete, predictable workloads; AI systems, by contrast, generate continuous, high-velocity, parallel workloads that test infrastructure in fundamentally different ways. Autonomous agents will make API calls at volumes that humans never could, overwhelm authentication and identity layers with relentless verification requests, and stress data planes with constant reads and writes. These massive demands amplify every system dependency that was previously manageable under human-scale traffic.

    During the November Cloudflare outage, traffic rerouting alone strained edge capacity as systems attempted to fail over to alternative paths. Now imagine that same scenario under 2026 traffic loads, when AI-driven transaction patterns multiply baseline demand by orders of magnitude. The compound effect will be severe: more requests hitting more dependencies at higher concurrency, with faster failure propagation and wider blast radii when something inevitably breaks.

    AI doesn't replace existing traffic, but compounds it. Organizations unprepared for this exponential load increase will discover that architectures which survived 2025 are dangerously inadequate for what's coming. If systems are going to weather 2026, they must be built from the start to handle AI-era load patterns.


    RELATED: Ideal isn’t real: Stress testing CockroachDB’s resilience


    What Leaders Must Do to Prepare for 2026

    While 2025 was a warning, 2026 will be a full-on escalation. Enterprises can start the New Year right with a resolution to be ready! Here's what leaders need to prioritize immediately to build systems that can survive what's coming:

    Design for full-region failure, not just zonal redundancy

    Multi-availability-zone deployments aren't enough. Data architects should assume that an entire region will disappear, because several effectively did in 2025. This means architecting systems where: 

    • no single region holds state that can't be instantly accessed elsewhere 

    • control planes don't create hidden couplings back to a single geography

    • failover happens automatically without manual intervention or extended recovery windows
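
    As a minimal sketch of that last point, the snippet below routes traffic to the first healthy regional endpoint in priority order, with no operator in the loop. The endpoint URLs and health-check path are placeholders, and a production implementation would typically live in DNS, a global load balancer, or a service mesh rather than in application code.

```python
# Health-checked failover across regional endpoints, in priority order.
# Endpoint URLs and the /healthz path are placeholders; in production this
# logic typically lives in DNS, a global load balancer, or a service mesh.
import urllib.error
import urllib.request

REGIONAL_ENDPOINTS = [
    "https://api.us-east-1.example.com",
    "https://api.eu-west-1.example.com",
    "https://api.ap-southeast-1.example.com",
]


def healthy(endpoint, timeout=2):
    """Return True if the region answers its health check in time."""
    try:
        with urllib.request.urlopen(f"{endpoint}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def pick_endpoint():
    """First healthy region wins; no operator intervention required."""
    for endpoint in REGIONAL_ENDPOINTS:
        if healthy(endpoint):
            return endpoint
    raise RuntimeError("all regions unhealthy; page the on-call")


if __name__ == "__main__":
    print("routing traffic to", pick_endpoint())
```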

    Diversify critical dependencies across providers and regions

    Reduce exposure to any single cloud region, network provider, CDN, DNS system, identity provider, or control plane. Dependency diversity equals resilience. That doesn't mean abandoning primary providers; it means ensuring that when they fail, your systems have alternative paths that maintain availability without sacrificing correctness.

    The organizations that weathered 2025's outages best were those that had anticipated failure and built redundancy into every critical dependency.
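
    As one concrete example of dependency diversity, the sketch below resolves a critical hostname through several independent public resolvers so a single DNS provider outage doesn't take name resolution down with it. It assumes the third-party dnspython package, and the resolver IPs and hostname are illustrative.

```python
# Resolve a critical hostname through independent public resolvers so a single
# DNS provider outage does not become an application outage.
# Requires dnspython (pip install dnspython); resolver IPs are examples.
import dns.exception
import dns.resolver

RESOLVER_POOLS = [
    ["1.1.1.1", "1.0.0.1"],   # Cloudflare
    ["8.8.8.8", "8.8.4.4"],   # Google Public DNS
    ["9.9.9.9"],              # Quad9
]


def resolve_with_fallback(hostname):
    for nameservers in RESOLVER_POOLS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = nameservers
        resolver.lifetime = 2.0               # fail fast, then try the next pool
        try:
            answer = resolver.resolve(hostname, "A")
            return [record.to_text() for record in answer]
        except dns.exception.DNSException:
            continue
    raise RuntimeError(f"all resolver pools failed for {hostname}")


if __name__ == "__main__":
    print(resolve_with_fallback("example.com"))
```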

    Modernize the data layer for global consistency and survivability

    Architectures built around monolithic or region-bound databases cannot survive AI-era load patterns or multi-region failure modes. This is where distributed SQL naturally enters the conversation. 

    CockroachDB has demonstrated through its Performance under Adversity benchmark and real-world behavior during cloud-wide incidents that distributed architectures can maintain correctness, consistency, and availability even during regional outages. During the October AWS incident, CockroachDB Cloud clusters worldwide continued processing business-critical workloads without interruption. This wasn’t because of luck, but because the architecture was designed from the ground up to survive exactly these failure scenarios.
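
    At the application level, one practice that pairs well with a distributed data layer is client-side transaction retry. CockroachDB surfaces retryable serialization conflicts as SQLSTATE 40001, and a simple retry loop like the hedged sketch below keeps workloads correct instead of surfacing transient errors to users. The connection string, accounts table, and backoff values are placeholders, not a drop-in implementation.

```python
# Client-side transaction retry for CockroachDB (PostgreSQL wire protocol).
# CockroachDB reports retryable serialization conflicts as SQLSTATE 40001;
# retrying with backoff keeps the workload correct instead of surfacing
# transient errors. The DSN and the accounts table are placeholders.
import time

import psycopg2
import psycopg2.errorcodes

DSN = "postgresql://user:password@your-cluster-host:26257/defaultdb?sslmode=verify-full"


def run_transaction(conn, txn_fn, max_retries=5):
    for attempt in range(1, max_retries + 1):
        try:
            with conn:                          # commit on success, rollback on error
                with conn.cursor() as cur:
                    return txn_fn(cur)
        except psycopg2.Error as err:
            if err.pgcode == psycopg2.errorcodes.SERIALIZATION_FAILURE:   # 40001
                time.sleep(0.1 * 2 ** attempt)  # exponential backoff, then retry
                continue
            raise
    raise RuntimeError("transaction gave up after retries")


def transfer(cur):
    cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")


if __name__ == "__main__":
    conn = psycopg2.connect(DSN)
    try:
        run_transaction(conn, transfer)
    finally:
        conn.close()
```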

    Stress-test systems against AI-driven load patterns

    Conventional load testing won't be sufficient. Leaders must simulate autonomous agent traffic, surge concurrency, parallelized workflows, and high churn on reads and writes. Systems that perform well under traditional load testing may collapse under AI-era patterns where thousands of agents simultaneously hammer the same endpoints with correlated requests. Testing must evolve to match the workloads that are coming, not the workloads we're accustomed to handling.
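
    A starting point for this kind of testing is to simulate agent-style traffic directly: many concurrent clients issuing correlated bursts against the same endpoint while you watch error rates and saturation. The sketch below assumes the third-party aiohttp package, and the target URL, concurrency, and burst size are placeholders to tune against a staging environment.

```python
# Simulate agent-style traffic: many concurrent clients issuing correlated
# bursts of requests against the same endpoint while you watch error rates.
# Requires aiohttp (pip install aiohttp); URL, concurrency, and burst size
# are placeholders to tune against a staging environment, not production.
import asyncio

import aiohttp

TARGET = "https://staging.example.com/api/orders"   # placeholder endpoint
AGENTS = 500                                        # concurrent synthetic agents
BURST = 20                                          # correlated requests per agent


async def agent(session):
    statuses = []
    for _ in range(BURST):
        try:
            async with session.get(TARGET) as resp:
                statuses.append(resp.status)
        except (aiohttp.ClientError, asyncio.TimeoutError):
            statuses.append("error")
    return statuses


async def main():
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(*(agent(session) for _ in range(AGENTS)))
    total = AGENTS * BURST
    failures = sum(status != 200 for statuses in results for status in statuses)
    print(f"{total} requests, {failures} failures ({failures / total:.1%})")


if __name__ == "__main__":
    asyncio.run(main())
```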

    Implement continuity playbooks and run them regularly

    Resilience isn't documentation; it's a discipline. Organizations need documented failover procedures, regular chaos testing, and latency drills that build confidence and preparedness.

    The State of Resilience 2025 report found that fewer than one-third of organizations conduct regular failover testing, which explains why so many teams were caught flat-footed when outages struck. The companies that responded effectively to 2025’s worst outages had already run the playbook dozens of times in controlled environments.

    Elevate resilience to a board-level priority

    This mirrors the cybersecurity industry's evolution in the 2010s, when organizations formalized the Chief Information Security Officer role as the stakes became sky-high. 

    Now we're seeing the next evolution: The Chief Resilience Officer, a role already in place at companies like CrowdStrike, is tasked with owning operational continuity end-to-end. Boards are realizing that resilience isn't a technical detail, but a governance obligation that requires executive ownership, clear accountability, and regular reporting on preparedness.

    The 2026 Mandate: Build unbreakable systems!

    The digital world is entering a phase of unprecedented load and interdependence. Outages will definitely continue, but who will survive them?

    Companies that modernize their architectures now will gain meaningful advantages in reliability, customer trust, operational scale, and regulatory readiness. Companies that don't will remain vulnerable to the next inevitable outage, whether it's a cloud provider misconfiguration, a telecom failure, a middleware meltdown, or an AI-driven traffic surge that overwhelms systems already strained to their breaking point.

    That’s why resilience has grown from a feature to a prerequisite. Reliability is a must-have for an AI-intensive era where downtime compounds faster, costs more, and damages trust in ways that take years to rebuild.

    As we speed into 2026, the winners won't be those who recover fastest from outages. They'll be the ones who never crash in the first place.

    That's the north star guiding Cockroach Labs’ Outages Observer series. Join us as we continue exploring what global disruptions reveal about data resilience, and how modern systems must evolve to meet the demands of a constant uptime world.

    How will your enterprise improve its performance under adversity in 2026? Visit here to talk to an expert.

    Try CockroachDB Today

    Spin up your first CockroachDB Cloud cluster in minutes. Start with $400 in free credits. Or get a free 30-day trial of CockroachDB Enterprise on self-hosted environments.


    David Weiss is Senior Technical Content Marketer for Cockroach Labs. In addition to data, his deep content portfolio includes cloud, SaaS, cybersecurity, and crypto/blockchain.