Chronological Overview
Nov 2024: GCP Vertex Gemini (19 hr)
Late 2024: Azure China North 3 (50 hr)
Jun 12, 2025: GCP Global IAM Crash
Jul–Aug 2025: Azure East US Capacity
Oct 20, 2025: AWS US-EAST-1 DNS Cascade
Oct 29, 2025: Azure Front Door Global
Nov 18, 2025: Cloudflare Bot Mgmt Crash
Dec 5, 2025: Cloudflare WAF Rule Bug

Major Outages in Detail

Google Cloud · Nov 13–18, 2024 · Moderate
Duration: ~19 Hours
Region: Asia Multi-Region
Services: Vertex Gemini API
Root Cause: API Endpoint Issue

Google Cloud Vertex Gemini API — Extended Degradation

Beginning November 13, 2024, Google Cloud's Vertex Gemini API experienced an elevated rate of errors and degraded performance in several Asia-region endpoints. The incident persisted for nearly 19 hours, making it one of the longer individual service outages in the measurement period. The impact was scoped to specific AI model endpoints rather than general infrastructure, but for developers and businesses relying on Gemini 1.5 Flash and Gemini 1.5 Pro via Vertex AI, the experience was near-total unavailability of those capabilities.

Google's official incident report cited increased error rates when customers accessed the Vertex Gemini API global endpoint. An API endpoint configuration issue prevented requests from routing correctly to healthy backend capacity. The incident was compounded by the multi-region propagation of the faulty state, requiring careful, staged remediation rather than a simple rollback.

AI Developers · Enterprise ML Pipelines · Startups on Vertex AI

While not a consumer-facing outage in the traditional sense, this disruption hit California's dense AI startup and developer community hard. The Bay Area and Los Angeles are home to hundreds of AI product companies building on Vertex AI and Gemini APIs. Developers reported workflow interruptions lasting multiple hours, delayed production deployments, and the need to rapidly implement fallback paths to alternative AI providers, as sketched below.
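A minimal sketch of that kind of fallback path, assuming a hypothetical ProviderError and placeholder provider adapters rather than the actual Vertex AI or third-party SDKs:

```python
# Minimal sketch of a provider-fallback wrapper. The provider callables here
# are placeholders; real code would wrap the Vertex AI SDK and an alternative
# provider's SDK behind the same interface.

import time

class ProviderError(Exception):
    """Raised when a provider cannot serve the request."""

def call_with_fallback(prompt, providers, retries_per_provider=2, backoff_s=0.5):
    """Try each provider in order, retrying transient failures with backoff."""
    last_error = None
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(prompt)
            except ProviderError as err:
                last_error = err
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"all providers failed; last error: {last_error}")

# Hypothetical provider adapters, stand-ins for real SDK calls.
def vertex_gemini(prompt):
    raise ProviderError("503 from Vertex Gemini global endpoint")

def alternative_provider(prompt):
    return f"fallback answer to: {prompt}"

if __name__ == "__main__":
    source, answer = call_with_fallback(
        "Summarize today's incident report.",
        providers=[("vertex-gemini", vertex_gemini),
                   ("alternative", alternative_provider)],
    )
    print(source, "->", answer)
```

Keeping both adapters behind one call_with_fallback interface means the failover logic can be exercised in tests without depending on any single provider being up.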

Microsoft Azure · Late 2024 · Moderate
Duration: ~50 Hours
Region: China North 3
Services: Networking, Compute
Root Cause: Networking Config Change

Azure China North 3 — Longest Outage of the Measurement Period

In late 2024, Microsoft Azure's China North 3 region suffered the single longest outage recorded in the August 2024–August 2025 research dataset — a 50-hour disruption that raised Azure's average outage duration significantly. A networking configuration change caused connectivity issues, prolonged timeouts, connection drops, and resource allocation failures across multiple Azure services.

Loss of indexing data in the Azure PubSub service — which the networking control plane uses to communicate between control entities and agents on individual hosts — meant that networking configuration updates were never delivered to agents, effectively paralyzing the control plane for the affected region. Notably, services configured to be "zonally redundant" still experienced multi-zone impact because the VNet integration itself depended on the broken control plane.
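The dangerous property of this failure mode is that agents keep running silently on stale configuration. A minimal sketch, with a hypothetical agent class and an invented staleness threshold (nothing here reflects Azure's actual internals), of how an agent can surface that condition:

```python
# Sketch of a host agent that alarms when control-plane config updates stop
# arriving. The update feed and thresholds are hypothetical.

import time

class NetworkAgent:
    def __init__(self, max_staleness_s=300):
        self.max_staleness_s = max_staleness_s
        self.config_version = 0
        self.last_update_ts = time.time()

    def on_config_update(self, version, config):
        """Called whenever the control plane delivers a new config."""
        self.config_version = version
        self.last_update_ts = time.time()

    def is_stale(self):
        """True if no update has arrived within the staleness window."""
        return (time.time() - self.last_update_ts) > self.max_staleness_s

agent = NetworkAgent(max_staleness_s=300)
agent.last_update_ts -= 600   # simulate 10 minutes with no control-plane updates
if agent.is_stale():
    # Surfacing staleness explicitly turns a silent failure into a visible one.
    print(f"ALERT: config v{agent.config_version} is stale; control plane may be down")
```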

Enterprise Customers (China Operations) · Multinational Corporations · California Firms with China Offices

The geographic scope was limited to Azure's China region, meaning direct impact on California consumers was minimal. However, California-headquartered companies with operations or customers in China faced multi-day interruptions to cloud-hosted services. The 50-hour duration exceeded most SLA windows, raising uncomfortable questions about recovery guarantees for enterprises relying on a single regional deployment.

Google Cloud · Jun 12, 2025 · Critical
Duration: ~7.5 Hours Total
Region: Global
Services Affected: 76+ GCP Products
Root Cause: Null Pointer in Service Control

Google Cloud Global IAM / Service Control Crash — "A Single Bug Took Down the Internet"

On June 12, 2025, large swaths of the internet went dark starting at approximately 10:49 AM PDT. The incident originated in Google Cloud's Service Control system — the component responsible for authorizing every API request across Google's infrastructure. A null pointer exception caused Service Control binaries to crash globally, rejecting API requests with 503 errors and making 76 Google Cloud services unavailable or severely degraded across North America, Europe, Asia, Africa, and the Middle East.

The outage triggered immediate cascading failures across the internet ecosystem. Spotify, Discord, Snapchat, OpenAI (ChatGPT), Cloudflare, Twitch, Cursor, Replit, Fitbit, and Character.AI all experienced outages. Even Gmail, Google Drive, Google Calendar, Google Meet, and Google Docs went down for users requiring new authentication tokens. Full recovery didn't complete until 6:18 PM PDT.

On May 29, 2025, Google engineers deployed a new code change to Service Control to add additional quota policy checks. The code lacked proper error handling and was not protected by a feature flag. It lay dormant until June 12, when a routine policy change written to the regional Spanner tables that Service Control reads inadvertently contained blank fields. Those blank fields triggered the dormant code path, producing null pointer exceptions that cascaded globally within seconds. A "herd effect" then slowed recovery: as engineers restarted Service Control tasks, the restart storm itself overloaded the underlying Spanner infrastructure.
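A schematic illustration of the two safeguards the incident report identifies as missing, a feature flag around the new path and defensive handling of blank policy fields; the field names and flag store are invented for illustration:

```python
# Illustration of the two missing safeguards: a feature flag around new logic
# and defensive handling of blank/missing policy fields. Field names and the
# flag store are invented for illustration only.

FEATURE_FLAGS = {"extra_quota_checks": True}  # enabled here to exercise the new path

def legacy_quota_check(policy: dict) -> bool:
    return True  # stand-in for the pre-existing, known-good path

def check_quota_policy(policy: dict) -> bool:
    """Return True if the request passes quota policy checks."""
    if not FEATURE_FLAGS["extra_quota_checks"]:
        return legacy_quota_check(policy)

    quota_field = policy.get("quota_limit")   # may be blank in a bad policy row
    if quota_field is None:
        # Fail safe: log and fall back instead of crashing the binary.
        print("WARN: blank quota_limit field; falling back to legacy check")
        return legacy_quota_check(policy)
    return int(quota_field) > 0

# A policy row with a blank field, like the June 12 trigger, degrades
# gracefully here instead of raising an unhandled exception.
print(check_quota_policy({"quota_limit": None}))
```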

Gmail / Google Workspace · Spotify · Discord · Snapchat · OpenAI / ChatGPT · Cloudflare Services · AI Dev Tools

For California, the impact was deep and personal. The timing (mid-morning PDT) hit at peak productivity hours. Millions of Californians found Gmail, Google Drive, and Google Meet inaccessible during business hours. Tech companies across Silicon Valley lost access to collaborative tools, AI developer tooling (Cursor, Replit), and their own products built on GCP.

California Impact

Google Workspace is the dominant productivity suite for California school districts, healthcare providers, and tech companies. School platforms integrated with Google services were disrupted mid-day. The Bay Area's massive concentration of AI startups using Google Cloud and Gemini APIs suffered acute productivity loss during the roughly 7.5-hour outage.

Microsoft Azure · Late Jul – Aug 5, 2025 · Major
Duration: ~7–10 Days
Region: Azure East US
Services Affected: VM Allocation, Compute
Root Cause: Capacity Exhaustion (AI Demand)

Azure East US — Compute Capacity Crisis Driven by AI Demand

In late July 2025, Azure's East US region began experiencing widespread allocation failures when customers attempted to create or update virtual machines. The root cause was not a technical fault in the traditional sense, but rather a surge in AI-driven compute demand that outstripped Microsoft's available capacity. Customers saw allocation failure errors and could not spin up new cloud resources. Microsoft reported the problem officially resolved by August 5, though some users continued to experience problems days later.

The AI infrastructure boom of 2025 created extraordinary pressure on data center capacity. GPU instances and high-memory VM families in Azure's busiest US region were effectively sold out or over-subscribed. This represented a new category of cloud "outage" — not a failure of software or hardware, but of forecasting and provisioning. The event highlighted a structural tension that became a defining theme of 2025 cloud infrastructure: rapid AI adoption was outpacing the physical realities of data center expansion, power procurement, and hardware supply chains.
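For customers, the practical mitigation is client-side: widen the allocation search across VM families, zones, or regions before failing a deployment. A minimal sketch with a hypothetical try_allocate stand-in (not the real Azure SDK):

```python
# Sketch of widening the allocation search space when a region's preferred SKU
# is exhausted. try_allocate is a hypothetical stand-in, not the Azure SDK.

class AllocationFailure(Exception):
    pass

def try_allocate(region: str, sku: str) -> str:
    if region == "eastus":
        raise AllocationFailure(f"no capacity for {sku} in {region}")
    return f"vm-{region}-{sku}"

def allocate_with_fallback(regions, skus):
    """Try every (region, SKU) pair in preference order."""
    errors = []
    for region in regions:
        for sku in skus:
            try:
                return try_allocate(region, sku)
            except AllocationFailure as err:
                errors.append(str(err))
    raise RuntimeError("all allocation attempts failed: " + "; ".join(errors))

print(allocate_with_fallback(regions=["eastus", "westus2"],
                             skus=["Standard_ND96", "Standard_NC24"]))
```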

AI Startups · Enterprise Dev Teams · Cloud-Native Businesses

California technology companies attempting to scale AI workloads on Azure during this period found their deployments blocked. Companies building AI products on Azure — including many healthcare AI firms, fintech startups, and enterprise software vendors headquartered in California — were unable to expand capacity at a critical time. The event foreshadowed the October Azure outage and reinforced the case for multi-cloud strategies.

Amazon Web Services · Oct 19–20, 2025 · Critical
Duration: ~14–15 Hours
Region: US-EAST-1 (Global Impact)
Services Affected: 141+ AWS Services
Root Cause: DNS Race Condition in DynamoDB

AWS US-EAST-1 — "The Outage That Shook the Internet"

On the night of October 19 into the early morning of October 20, 2025 — starting around midnight Pacific Time — Amazon Web Services' US-EAST-1 region experienced what became one of the largest and most consequential cloud outages in history. The disruption lasted approximately 14–15 hours. At its peak, Downdetector recorded over 17 million user reports from more than 60 countries — a 970% spike above the normal baseline — making it one of the largest internet disruptions ever tracked.

The cascade swept through the digital economy: Snapchat, Slack, Atlassian (Jira, Confluence), Netflix, Disney+, Hulu, Prime Video, Coinbase, Venmo, Roblox, Fortnite, Ring doorbells, Pokémon GO, Duolingo, Alexa, Signal, smart beds, and UK government services (HMRC) all went down or severely degraded. Medical practices couldn't access patient records. Law firms lost access to documents for time-sensitive court filings.

DynamoDB's DNS records are maintained by an automated DNS management system with two components: a Planner, which generates endpoint plans, and Enactors, which apply them. A rare timing race condition caused an Enactor to apply a stale, already-superseded DNS plan for DynamoDB's regional endpoint. Cleanup automation then deleted what it believed were orphaned records — but these were actually the correct, current records pointing DynamoDB to real servers. The result: an empty DNS record for DynamoDB in US-EAST-1.
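One standard guard against this class of stale-write race is to version every plan and refuse to apply or clean up anything older than what is already live. A minimal sketch with invented names (DnsPlan, apply_plan); this is not AWS's implementation:

```python
# Sketch of a monotonic version guard against applying a stale plan.
# Names (DnsPlan, apply_plan) are invented for illustration; this is not AWS code.

from dataclasses import dataclass, field

@dataclass
class DnsPlan:
    version: int
    records: dict   # e.g. {"dynamodb.us-east-1": ["10.0.0.1", "10.0.0.2"]}

@dataclass
class EndpointState:
    applied_version: int = -1
    records: dict = field(default_factory=dict)

def apply_plan(state: EndpointState, plan: DnsPlan) -> bool:
    """Apply a plan only if it is strictly newer than the one already live."""
    if plan.version <= state.applied_version:
        # Stale plan: reject instead of overwriting newer records, so later
        # cleanup never mistakes current records for orphans.
        return False
    state.records = dict(plan.records)
    state.applied_version = plan.version
    return True

state = EndpointState()
apply_plan(state, DnsPlan(version=7, records={"dynamodb.us-east-1": ["10.0.0.1"]}))
print(apply_plan(state, DnsPlan(version=5, records={})))  # stale -> False, records kept
print(state.records)
```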

Because 141 AWS services internally depend on DynamoDB, the failure cascaded rapidly. Even after DNS was restored at approximately 9:25 AM UTC, state inconsistencies that had accumulated during roughly three hours of lost lease management in EC2's internal workflow manager (DWFM), whose droplet leases depend on DynamoDB, continued to cause EC2 failures until 8:50 PM UTC — over 11 additional hours.

Snapchat · Slack · Netflix / Streaming · Jira / Atlassian · Venmo · Ring · Canvas (Education) · Coinbase

The October 20 AWS outage was the single most disruptive cloud event within the research period for California residents. Snapchat — headquartered in Santa Monica — received approximately 3 million outage reports. Millions of Californians found home security systems unresponsive. UC Riverside students and staff were among those who couldn't access Canvas, a learning management system used by 50% of college students in North America.

California-Specific Impact

The outage began around midnight PT and was still unresolved when the Pacific business day started. By morning, millions of workers discovered their Slack workspaces unreachable, Ring doorbells unresponsive, and Netflix unavailable. Canvas outages disrupted instruction at UC Riverside and multiple California community colleges. The economic cost was estimated at over $1 billion globally, with California's dense tech and enterprise sector absorbing a disproportionate share.

Microsoft Azure · Oct 29, 2025 · Critical
Duration: ~8.5 Hours
Region: Global
Services Affected: M365, Xbox, Azure Portal, Airlines
Root Cause: Azure Front Door Config Error

Azure Front Door Global Failure — "Front Door Locked Everyone Out"

Just nine days after the catastrophic AWS outage, Microsoft Azure experienced its own major global failure. Beginning at approximately 3:45 PM UTC (8:45 AM PT) on October 29, 2025, an inadvertent configuration change to Azure Front Door — Microsoft's global content delivery and application delivery network — caused cascading failures affecting services on every continent. The outage lasted approximately 8 hours and 20 minutes.

The affected services read like a catalog of modern digital life: Microsoft 365 (Outlook, Teams, OneDrive), Xbox Live, Minecraft, the Azure Portal, Microsoft Copilot, Microsoft Entra ID, Azure SQL Database, Azure Databricks, Azure Healthcare APIs, and the websites of Costco, Starbucks, and Alaska Airlines. Over 30,000 reports were logged in the first hour.

A tenant configuration change was processed simultaneously by two different versions of Azure's control plane software running in parallel. Both versions produced slightly different interpretations of the same configuration update, resulting in an invalid payload being propagated to Azure Front Door edge nodes globally. Because AFD sits in the client handshake path and fronts identity issuance, the configuration failure prevented clients from completing TLS handshakes and obtaining authentication tokens — effectively locking users out of services even when backend origin servers were healthy.
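A common defense against divergent control-plane renderings is to validate the generated payload against a schema and push it to a small canary set before global propagation. A minimal sketch with an invented schema and node names, not Azure's actual pipeline:

```python
# Sketch of validating a rendered config payload and canarying it on a few
# edge nodes before global rollout. Schema and node names are invented.

REQUIRED_KEYS = {"tenant_id", "routes", "tls"}

def validate_payload(payload: dict) -> None:
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"invalid payload, missing keys: {sorted(missing)}")
    if not payload["routes"]:
        raise ValueError("invalid payload: empty route table")

def deploy(payload: dict, canary_nodes, all_nodes, push):
    """Validate, push to canaries, and only then push everywhere."""
    validate_payload(payload)
    for node in canary_nodes:
        push(node, payload)            # stop here if any canary rejects it
    for node in all_nodes:
        push(node, payload)

def push(node, payload):
    print(f"pushed config for {payload['tenant_id']} to {node}")

deploy({"tenant_id": "contoso", "routes": ["default"], "tls": {"min": "1.2"}},
       canary_nodes=["edge-canary-1"], all_nodes=["edge-1", "edge-2"], push=push)
```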

Microsoft 365 / Outlook · Teams · Xbox Live · Alaska Airlines · Starbucks · Costco · Healthcare APIs

For California, the timing was particularly acute — hitting during the mid-morning business window on the West Coast. Organizations running on Microsoft 365 (schools, government agencies, healthcare providers, enterprises) lost access to email, video conferencing, and file collaboration. Alaska Airlines faced disruptions at Los Angeles and San Francisco hubs.

California-Specific Impact

Millions of California students, teachers, healthcare workers, and government employees using Microsoft 365 lost productivity access mid-morning. The outage came just nine days after the AWS outage, creating a compounding "cloud fatigue" among IT professionals who had barely recovered from the prior incident.

Cloudflare · Nov 18, 2025 · Critical
Duration: ~5.6 Hours (core: ~3 hrs)
Region: Global
Services Affected: X, ChatGPT, Spotify, Claude, More
Root Cause: Oversized Bot Config File Crash

Cloudflare Global CDN Failure — "Worst Outage Since 2019"

On November 18, 2025, at 11:20 UTC (3:20 AM PT), Cloudflare's global network began returning HTTP 500 errors for a significant fraction of all web traffic passing through its infrastructure. The outage affected major platforms including X (Twitter), ChatGPT (OpenAI), Anthropic's Claude, Spotify, Canva, League of Legends, Dropbox, Shopify, Coinbase, and thousands of other sites. The major impact period lasted approximately three hours. This was described as Cloudflare's most severe service disruption since 2019.

At 11:05 UTC, a database permissions change was deployed to Cloudflare's ClickHouse cluster. This inadvertently caused a query used by the Bot Management system to return duplicate rows, inflating the Bot Management feature file from ~60 features to more than 200. When the oversized file propagated to Cloudflare's proxy servers globally during their regular five-minute refresh cycle, it exceeded a hardcoded limit on the number of features the proxy preallocates memory for. Cloudflare's core proxy software (written in Rust) called Result::unwrap() on an Err value when the file failed to load — causing a panic and crashing the proxy process.
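The proximate bug was treating a malformed input file as a fatal error. The usual alternative, sketched here in Python rather than Rust and with an invented file format, is to validate the refreshed file and keep serving with the last known-good version when validation fails:

```python
# Sketch of loading a refreshed feature file defensively: validate its size
# and keep the last known-good version on failure instead of crashing.
# The 200-feature limit mirrors the limit described above; the file format
# itself is invented for illustration.

MAX_FEATURES = 200

class FeatureStore:
    def __init__(self):
        self.features = {}          # last known-good feature set

    def refresh(self, new_features: dict) -> bool:
        if len(new_features) > MAX_FEATURES:
            # Fail open on the update, not on the traffic: keep serving with
            # the previous file and raise an alert instead of panicking.
            print(f"WARN: rejected feature file with {len(new_features)} features "
                  f"(limit {MAX_FEATURES}); keeping last known-good version")
            return False
        self.features = dict(new_features)
        return True

store = FeatureStore()
store.refresh({f"feat_{i}": 0.5 for i in range(60)})    # normal file accepted
store.refresh({f"feat_{i}": 0.5 for i in range(450)})   # oversized file rejected
print(len(store.features))                              # still 60
```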

A circular dependency compounded user pain: Cloudflare's own dashboard used Turnstile (their CAPTCHA service) for login, but Turnstile itself ran through the broken proxy layer — making it impossible for most Cloudflare customers to log in to manage their own configurations during the outage.

X (Twitter) · ChatGPT · Anthropic Claude · Spotify · Canva · Shopify · Banking Interfaces · Transit Systems

Cloudflare powers an estimated 20%+ of all internet traffic globally. For Californians, the 3:20 AM start time meant the worst of the outage occurred during pre-dawn and early morning hours, with many services still degraded at the start of the business day. This was the third major cloud disruption within five weeks, creating an extraordinary period of infrastructure instability.

California-Specific Impact

California-headquartered AI companies, including Anthropic and OpenAI (both in San Francisco), were directly impacted, as was Shopify's California seller ecosystem. The compounding of three major outages in five weeks drove California-based IT teams toward emergency multi-cloud and CDN diversification planning.

Cloudflare · Dec 5, 2025 · Major
Duration: ~25 Minutes
Region: Global (28% of traffic)
Services Affected: LinkedIn, Zoom, Canva, Shopify
Root Cause: WAF Rule Deployment Bug

Cloudflare — Second Outage in Three Weeks

On December 5, 2025, at 08:47 UTC, Cloudflare's network experienced a second major outage, just 17 days after the devastating November 18 incident. This time the disruption lasted only approximately 25 minutes and affected approximately 28% of Cloudflare's HTTP traffic, but the breadth was still significant: LinkedIn, Zoom, Canva, and Shopify were among the confirmed impacted platforms. The incident was resolved at 09:12 UTC when engineers reverted the triggering change.

While attempting to rapidly mitigate a newly discovered industry-wide vulnerability in React Server Components, Cloudflare's security team deployed a change to WAF rule testing tooling. The change contained a Lua exception bug: when the WAF rules module callback was removed, the code attempted to index a nil field named 'execute' — causing an HTTP 500 to be returned for all affected requests. Cloudflare had pledged after the November 18 outage to implement safeguards preventing single updates from causing widespread impact, but those safeguards had not yet been fully deployed.
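The trigger was code that assumed a module callback was always present. A Python analogue of the defensive lookup, with an invented module and callback name:

```python
# Python analogue of guarding a possibly-missing module callback instead of
# assuming it exists. Module and callback names are invented for illustration.

def run_waf_rules(request, module):
    execute = getattr(module, "execute", None)
    if not callable(execute):
        # Callback removed or not yet loaded: skip this module and log,
        # rather than failing the whole request with a 500.
        print("WARN: WAF rules module has no execute(); skipping")
        return request
    return execute(request)

class EmptyModule:      # simulates the module after its callback was removed
    pass

print(run_waf_rules({"path": "/login"}, EmptyModule()))
```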

LinkedIn · Zoom · Canva · Shopify

Although brief, the outage disrupted Zoom video calls and LinkedIn networking activity — both heavily used by California's professional and business communities — though the 08:47 UTC start fell in the overnight hours Pacific Time. The incident reinforced that Cloudflare had not yet solved its deployment safety problem, further eroding trust in single-CDN architectures.

Patterns & Lessons

The Common Thread: Good Change, Global Blast Radius

Every major outage in this period shared a striking pattern: a change intended to be beneficial (a security improvement, a quota update, a configuration optimization) was deployed to global infrastructure without adequate staged rollout or error containment. Because modern cloud control planes replicate changes globally in seconds, a bad update has the same propagation speed as a good one. The centralization that makes cloud efficient is also what amplifies failure blast radius to a planetary scale.
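The corrective pattern is to stage propagation: expose a change to a small slice of the fleet, verify health, and only then widen. A minimal sketch of percentage-based staged rollout with a health gate; the stage fractions, health check, and deploy hook are illustrative:

```python
# Sketch of a staged rollout: widen exposure only while health checks pass.
# Stage percentages, the health check, and the deploy call are illustrative.

import random

STAGES = [0.01, 0.05, 0.25, 1.00]     # fraction of fleet per stage

def healthy(error_rate: float, threshold: float = 0.02) -> bool:
    return error_rate < threshold

def staged_rollout(change_id: str, deploy, measure_error_rate):
    for fraction in STAGES:
        deploy(change_id, fraction)
        if not healthy(measure_error_rate()):
            deploy(change_id, 0.0)     # roll back before the blast radius grows
            return False
    return True

ok = staged_rollout(
    "config-2025-10-29",
    deploy=lambda cid, frac: print(f"{cid}: deployed to {frac:.0%} of fleet"),
    measure_error_rate=lambda: random.uniform(0.0, 0.01),
)
print("rollout complete" if ok else "rolled back")
```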

California as Ground Zero for Economic Impact

California's unique position — home to AWS, Google, Meta, Snap, and the nation's densest concentration of cloud-dependent businesses — means every major cloud outage hits the state disproportionately. The October 2025 AWS outage disrupted Snap (Santa Monica HQ), affected UC Riverside students, paralyzed Silicon Valley engineering teams, and rendered Ring smart home devices offline for millions of California households. The state's tech economy, worth over $1 trillion annually, is deeply exposed to infrastructure concentration risk.

The Hidden Dependencies Problem

A recurring theme: organizations that believed they had multi-AZ or multi-region redundancy discovered their architectures still had hidden dependencies on single control-plane components. AWS's October outage proved multi-AZ deployments provided no protection when the underlying DNS and DynamoDB control plane failed. Azure's outage showed that even healthy origin servers couldn't be reached when the "Front Door" routing layer failed. True resilience requires mapping these hidden dependencies — a practice most organizations had not completed.

The $1B+ Cost of Concentration Risk

The October 20, 2025 AWS outage alone was estimated to have caused over $1 billion in global economic loss, with the US facing $458 million in losses per hour of nationwide internet outage. IT downtime averaged $14,056 per minute in 2024. The three-cloud oligopoly (AWS 30%, Azure 20%, GCP 13%) controlling 63% of cloud infrastructure means that when one provider has a major incident, the economic consequences are no longer just that provider's business — they are a systemic risk to the global economy.

Education Disruption — An Underreported Impact

The October 2025 AWS outage exposed a vulnerability in American education: Canvas, used by 50% of college students in North America, went offline mid-semester, preventing assignment submission and access to course materials. UC Riverside and institutions across California were affected. This reflects a broader trend of educational infrastructure migrating to cloud platforms without adequate fallback plans. For K-12 and higher education institutions in California, any major cloud outage is now also an educational disruption.

The Path to Resilience

Organizations that survived these outages with minimal impact had prepared: multi-cloud or multi-CDN routing, automated DNS failover tested in advance, independent monitoring (not relying solely on vendor status pages), and documented incident response playbooks. The industry consensus that emerged from 2025's cascade of failures: true availability requires treating your cloud provider as a component that will fail, not as an infallible foundation. For California policymakers, the concentration of critical infrastructure on a handful of providers represents an emerging systemic risk requiring governance attention.
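One concrete piece of that posture is an independent probe that checks your own critical endpoints from outside the provider rather than trusting the vendor status page. A minimal sketch using only Python's standard library; the URLs and alerting hook are placeholders:

```python
# Sketch of an independent availability probe using only the standard library.
# URLs, timeouts, and the alert hook are placeholders for illustration.

import urllib.request
import urllib.error

ENDPOINTS = [
    "https://example.com/healthz",        # your own app, not the vendor status page
    "https://api.example.com/healthz",
]

def probe(url: str, timeout_s: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def run_checks(endpoints):
    failures = [url for url in endpoints if not probe(url)]
    if failures:
        print(f"ALERT: {len(failures)} endpoint(s) down: {failures}")  # page on-call here
    else:
        print("all endpoints healthy")

if __name__ == "__main__":
    run_checks(ENDPOINTS)
```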

Five Structural Trends

Trend 1: Frequency accelerated sharply in late 2025
Three major incidents in five weeks (Oct–Nov 2025).
Trend 2: Longer outages cluster at GCP and Azure
Measured by duration per incident: the Azure China North 3 outage (~50 hr) and the Azure East US capacity crisis (~7–10 days) were the longest.
Trend 3: "Good change, global blast" dominates root causes
5 of 8 incidents were caused by config/code changes with no staged rollout or feature flag.
Config/code change, no flag: 5
Capacity exhaustion: 1
Race condition / timing: 1
API endpoint misconfiguration: 1
Trend 4: AWS Oct 2025 dwarfs every other incident in user impact
Measured by peak Downdetector reports: the AWS outage generated 17 million, a 970% spike above baseline.
Trend 5: Three failure patterns recur across all providers — often simultaneously
Every incident exhibited at least one pattern, and several exhibited more than one at once. The Cloudflare Nov 2025 outage combined a global configuration blast with the circular dependency trap while roughly 20% of internet traffic was degraded for 5+ hours.
Pattern A: Good change → global blast
① Config deployed globally in seconds, not hours
② Latent bug activates worldwide simultaneously
③ Control plane crashes → auth, DNS, routing broken
④ Cascades to all dependent services
Examples: GCP Jun 2025 · AWS Oct 2025 · CF Nov 2025 · Azure Oct 2025
Pattern B: Routing fails → overload cascade
① Ingress or routing layer fails
② Traffic concentrates on remaining healthy nodes
③ Healthy nodes fail under load
④ State inconsistency lingers hours past DNS restore
Examples: AWS Oct 2025 (EC2 + ELB) · Azure Oct 2025 (AFD nodes)
Pattern C: Circular dependency trap
① Fix requires the broken service to work first
② Automation tools also fail → manual recovery only
③ Customer dashboards inaccessible (CF: Turnstile on broken proxy)
④ MTTR extends to 5–15 hrs regardless of fast detection
Examples: GCP Jun 2025 · CF Nov 2025 · AWS Oct 2025
Detection speed (MTTD): ~2 min. AWS and GCP both detected their faults within 2 minutes; detection is now fast.
Recovery time (MTTR): 7–15 hrs. Distributed state debt accumulates during the outage and must be manually cleared after the fix.
Key Insight: MTTR has decoupled from MTTD. Fixing the root cause no longer ends the outage — state recovery takes hours longer.