Executive Summary

Between September 2024 and September 2026, California experienced eight major commercial cloud outages — events that collectively disrupted the education, healthcare, communications, financial services, and public safety infrastructure of the most populous and economically productive state in the nation. These were not edge cases. They were the predictable consequence of concentrating critical digital infrastructure in the hands of three private corporations: Amazon Web Services, Microsoft Azure, and Google Cloud, which together control over 63 percent of global cloud computing capacity.

The data is unambiguous. The October 20, 2025 AWS outage generated 17 million user reports across 60 countries — 6.3 million from the United States alone. University of California students lost access to course systems. Ring doorbells went dark. Financial transactions failed. Engineering teams across Silicon Valley lost their tools for an entire business day. The October 29 Azure outage followed nine days later, disrupting Microsoft 365, Xbox, and Alaska Airlines operations out of LAX and SFO. Twenty days after that, Cloudflare took down an estimated 20 percent of all global internet traffic, including ChatGPT, Anthropic’s Claude, Spotify, and transit systems.

Three major disruptions in five weeks. Every one of them caused by a routine, well-intentioned infrastructure change deployed without adequate staged rollout, feature flags, or blast-radius containment.

This is not a technology problem. It is a governance problem — and CalCompute is the governance solution.

The CalCompute Initiative, established under SB-53, exists precisely because private cloud providers have no structural obligation to prioritize California’s public interest over their own operational and commercial priorities. This analysis documents why that matters, quantifies what it has already cost Californians, and articulates what a resilient public computing alternative must deliver.


I. The Outage Record: Two Years of Evidence

The following eight incidents constitute the documented major commercial cloud outages affecting California between September 4, 2024, and September 26, 2026. Each incident is assessed for its root cause, its duration, and its specific consequences for California residents and institutions.

Incident                            Date            Provider              Duration      Severity
GCP Vertex Gemini API               Nov 2024        Google Cloud          ~19 hrs       Moderate
Azure China North 3                 Late 2024       Microsoft Azure       ~50 hrs       Moderate
GCP Global IAM / Service Control    Jun 12, 2025    Google Cloud          ~7.5 hrs      Critical
Azure East US Capacity Crisis       Jul–Aug 2025    Microsoft Azure       ~7–10 days    Major
AWS US-EAST-1 DNS Cascade           Oct 20, 2025    Amazon Web Services   ~14.5 hrs     Critical
Azure Front Door Global Failure     Oct 29, 2025    Microsoft Azure       ~8.5 hrs      Critical
Cloudflare Global CDN Crash         Nov 18, 2025    Cloudflare            ~5.6 hrs      Critical
Cloudflare WAF Rule Bug             Dec 5, 2025     Cloudflare            ~25 min       Major

The AWS October 20, 2025 Outage: A Case Study in Planetary-Scale Fragility

No single event in the measurement period better illustrates the structural risk posed by commercial cloud concentration than the AWS US-EAST-1 outage of October 20, 2025. Beginning at midnight Pacific Time, a DNS race condition between two internal components of Amazon’s distributed workflow system caused the deletion of valid DNS records for DynamoDB, Amazon’s foundational database service. Because 141 AWS services depend on DynamoDB, the failure cascaded globally within minutes.
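
To make the failure mechanism concrete, the sketch below simulates this class of race condition: two redundant automation workers apply versioned DNS plans out of order, and a cleanup pass then deletes the records the newer plan installed. Every name and structure in it is a hypothetical simplification for exposition, not AWS's actual implementation.

```python
# Illustrative sketch (hypothetical names): how a race between redundant DNS
# automation workers can delete valid records. Not AWS's actual code.
import threading, time

dns_table = {}          # authoritative records, keyed by service name
applied_version = 0     # highest plan version the system believes is live
lock = threading.Lock()

def apply_plan(version: int, records: dict, delay: float) -> None:
    """An 'enactor' applies a versioned plan after some processing delay."""
    global applied_version
    time.sleep(delay)                      # the slow worker is holding a stale plan
    with lock:
        dns_table.update(records)          # stale plan overwrites the newer records
        applied_version = version          # version bookkeeping now points backwards

def cleanup(current_version: int) -> None:
    """Garbage-collect state belonging to plans older than the current one."""
    with lock:
        if applied_version < current_version:
            dns_table.clear()              # "stale" state removed, including live records

# Enactor A applies plan v2 immediately; enactor B applies the older plan v1 late.
a = threading.Thread(target=apply_plan, args=(2, {"dynamodb": "10.0.0.2"}, 0.0))
b = threading.Thread(target=apply_plan, args=(1, {"dynamodb": "10.0.0.1"}, 0.1))
a.start(); b.start(); a.join(); b.join()

cleanup(current_version=2)                 # cleanup sees v1 < v2 and wipes the table
print(dns_table)                           # {} : the service's valid DNS records are gone
```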

The recovery timeline makes the structural problem plain. AWS engineers detected the root cause within two minutes. They fixed the DNS records by approximately 9:25 AM UTC. But the outage continued for eleven more hours — until 8:50 PM UTC — because three hours of lost DynamoDB lease management had created state inconsistencies throughout EC2 that had to be manually resolved, server by server, zone by zone.

For California, the consequences were tangible and widespread. Snapchat — headquartered in Santa Monica — received approximately three million outage reports, the single most-affected platform globally. The Canvas learning management system, used by half of all college students in North America, went offline, disrupting instruction at UC Riverside and dozens of California community colleges mid-semester. Ring security cameras — installed in millions of California homes — became unresponsive. Slack, the communications backbone of Silicon Valley’s engineering culture, was inaccessible for most of the Pacific business day.

California Impact: AWS October 20, 2025
  • 6.3 million U.S. user reports — the largest recorded incident by volume
  • Snapchat (HQ: Santa Monica): ~3 million reports; platform unavailable for most of the day
  • Canvas LMS offline: UC Riverside, California community college students unable to access coursework
  • Ring smart home devices unresponsive across millions of California households
  • Slack, Jira, Confluence, Zoom unavailable for Silicon Valley engineering teams
  • Venmo and Coinbase disrupted — financial services inaccessible
  • Estimated global economic loss exceeding $1 billion; California absorbing disproportionate share

The GCP June 12, 2025 Outage: When Authentication Fails Everything

The Google Cloud Service Control failure of June 12, 2025 demonstrated a different but equally instructive failure mode: the cascading collapse of a foundational authentication and authorization service. A null pointer exception in Service Control binaries — triggered by a blank-field condition in production Spanner database tables — caused the service to crash globally within seconds. Because every API request across Google’s infrastructure passes through Service Control, 76 Google Cloud products became unavailable simultaneously.
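
The sketch below illustrates this failure class under stated assumptions (the data structures and field names are invented, not Google's): a single policy row with a blank field crashes a check that runs in front of every request, and a simple guard shows the kind of containment that would confine the damage to the one malformed policy.

```python
# Illustrative sketch (hypothetical structures): one malformed policy row
# crashing a check that sits in front of every API request. Not Google's code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuotaPolicy:
    project: str
    limit: Optional[int]        # a blank field in the source table arrives as None

def check_request_unsafe(policy: QuotaPolicy, usage: int) -> bool:
    # Crashes on the malformed row: None is not comparable to an int.
    return usage < policy.limit

def check_request_guarded(policy: QuotaPolicy, usage: int) -> bool:
    # Containment: treat the one malformed policy as invalid instead of
    # failing every request that passes through the checker.
    if policy.limit is None:
        return True             # fail open for this policy only, or route it to review
    return usage < policy.limit

bad_policy = QuotaPolicy(project="example", limit=None)

try:
    check_request_unsafe(bad_policy, usage=10)
except TypeError as exc:
    print(f"unguarded checker crashed: {exc}")

print("guarded checker result:", check_request_guarded(bad_policy, usage=10))
```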

The outage struck California at peak business hours — 10:49 AM Pacific Time — and lasted until 6:18 PM. Gmail, Google Drive, Google Meet, and Google Docs became inaccessible for users requiring new authentication tokens. Google Workspace, the dominant productivity platform for California school districts, hospitals, and public agencies, failed en masse during the workday. OpenAI’s ChatGPT and Anthropic’s Claude both experienced disruptions due to their dependencies on Google infrastructure. Spotify, Discord, Snapchat, Twitch, and Shopify — companies collectively employing tens of thousands in California — lost service availability.

The circular dependency trap that extended the outage is particularly relevant to CalCompute’s design philosophy. Google’s authentication system was the failed service — and restoring it required issuing new authentication credentials. The system needed to work before it could be fixed. Manual intervention, staged partial restores, and careful Spanner load management were required to recover — a process that took over seven hours from a detection time measured in minutes.

The Five-Week Cascade: October–November 2025

The most alarming pattern in the two-year dataset is the clustering of critical outages in a five-week window: AWS on October 20, Azure on October 29, and Cloudflare on November 18. Each incident was independently caused, yet they shared a common structural origin: infrastructure changes deployed at global scope without adequate safety mechanisms.

The Azure Front Door failure of October 29 occurred when a tenant configuration change was simultaneously processed by two different versions of Azure’s control plane software running in parallel — a dual-version processing state that should have been architecturally impossible. The resulting invalid payload, propagated to Azure Front Door edge nodes globally, prevented TLS handshakes and blocked authentication token issuance. Microsoft 365, Teams, Xbox Live, Azure SQL, and the websites of Alaska Airlines, Starbucks, and Costco all failed. The outage hit at 9:45 AM Pacific Time — directly coinciding with the California business day — and lasted eight hours and twenty minutes.
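
The following sketch, using entirely hypothetical schema and field names, illustrates how two control-plane versions processing the same tenant change can emit a payload the edge fleet cannot load. It is an illustration of the failure class, not a reconstruction of Azure's software.

```python
# Illustrative sketch (hypothetical schema): two control-plane versions process
# the same tenant change and emit incompatible payloads. Not Azure's actual code.
REQUIRED_EDGE_FIELDS = {"tenant", "routes", "tls_profile"}

def serialize_v2(change: dict) -> dict:
    # Newer control plane writes the full edge configuration.
    return {"tenant": change["tenant"], "routes": change["routes"],
            "tls_profile": change.get("tls_profile", "default")}

def serialize_v1(change: dict) -> dict:
    # Older control plane predates the tls_profile field entirely.
    return {"tenant": change["tenant"], "routes": change["routes"]}

def edge_node_load(payload: dict) -> None:
    missing = REQUIRED_EDGE_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"invalid payload, missing fields: {missing}")

change = {"tenant": "contoso", "routes": ["/*"]}

# Both versions race to process the same change; whichever result propagates
# last is what every edge node worldwide tries to load.
for payload in (serialize_v2(change), serialize_v1(change)):
    try:
        edge_node_load(payload)
        print("edge accepted payload")
    except ValueError as exc:
        print("edge rejected payload:", exc)
```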

Nine days after the AWS outage, California’s IT managers, still barely recovered, were already managing the Azure incident. Twenty days later, Cloudflare’s global CDN failure — caused by an oversized bot management configuration file crashing proxy processes worldwide — took down X, ChatGPT, Claude, Spotify, Canva, and banking interfaces for five and a half hours. The compounding effect on organizational trust, business continuity planning, and public confidence in digital infrastructure was severe and lasting.

The Five-Week Cascade: Key Numbers
  • Oct 20 — AWS: 17,000,000 user reports globally; ~$1B+ economic loss
  • Oct 29 — Azure: 30,000+ outage reports in first hour; 8.5-hour duration
  • Nov 18 — Cloudflare: ~20% of all internet traffic disrupted; 5.6-hour duration
  • Dec 5 — Cloudflare (again): Same root cause category; safeguards not yet deployed
  • Total critical incidents in 5 weeks: 3, plus 1 major recurrence


II. Five Structural Trends in the Outage Record

Taken individually, each outage can be explained, apologized for, and improved upon. Taken together, they reveal five structural trends that no individual post-incident improvement program can address — because they are properties of the architecture, not defects in the implementations.

Trend 1: The “Good Change, Global Blast” Failure Dominates

Five of the eight documented incidents share an identical triggering pattern: a well-intentioned, operationally routine change was deployed to global infrastructure simultaneously, without feature flags, without staged progressive rollout, and without blast-radius containment. The change in the AWS case was a DNS cleanup job. In the GCP case, a quota policy update. In the Azure case, a configuration change processed by two software versions simultaneously. In both Cloudflare cases, security response deployments.

This is not a pattern of recklessness. All three cloud providers employ thousands of world-class reliability engineers. The pattern persists because the same technical property that makes centralized cloud infrastructure efficient — instantaneous global state replication — also makes it instantaneously vulnerable to global failures. There is no architectural difference between a good config and a bad config from the propagation system’s perspective. Both reach every data center on earth in seconds.
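
What blast-radius containment looks like in practice can be sketched briefly. The pipeline below is a minimal illustration under assumed names (WAVES, staged_rollout, and the health-check hooks are invented for exposition): each change widens in progressively larger waves, and a failed health check rolls back everything touched instead of letting the change propagate further.

```python
# Illustrative sketch (hypothetical pipeline): progressive rollout with a
# blast-radius cap and automatic rollback. Not any provider's real tooling.
from typing import Callable, Sequence

WAVES = [0.01, 0.05, 0.25, 1.00]     # cumulative fraction of capacity allowed per stage

def staged_rollout(zones: Sequence[str],
                   apply_change: Callable[[str], None],
                   roll_back: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> bool:
    """Deploy zone by zone, never exceeding the current wave's capacity cap."""
    done: list[str] = []
    for cap in WAVES:
        budget = max(1, int(cap * len(zones)))
        for zone in zones[len(done):budget]:
            apply_change(zone)
            done.append(zone)
            if not healthy(zone):
                for z in reversed(done):   # automatic rollback of everything touched
                    roll_back(z)
                return False
        # (a real pipeline would also bake/soak here before widening the wave)
    return True

ok = staged_rollout(
    zones=[f"zone-{i}" for i in range(20)],
    apply_change=lambda z: print("applied to", z),
    roll_back=lambda z: print("rolled back", z),
    healthy=lambda z: z != "zone-3",       # simulate a failure partway through
)
print("rollout succeeded:", ok)            # False: the bad change never left wave three
```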

Private cloud providers have financial incentives to move fast and deploy broadly. Their competitive advantage depends on rapid feature iteration. CalCompute’s design brief, by contrast, places reliability and public accountability above deployment velocity. A sovereign public infrastructure can — and must — adopt deployment practices that prioritize safety over speed, because its stakeholders are not investors but the California public.

Trend 2: Outage Frequency Accelerated as AI Infrastructure Demand Surged

The twelve months from October 2024 to October 2025 saw a sharp acceleration in both the frequency and severity of cloud outages. The Azure East US capacity crisis of July 2025 — in which compute allocation failed for seven to ten days simply because AI-driven demand had outrun Microsoft’s provisioning capacity — is the clearest symptom of a systemic tension that defined the period.

The AI infrastructure boom created extraordinary pressure on cloud control planes: new services, new scaling events, new API integrations, and new hardware classes (GPU clusters, TPUs, specialized inference accelerators) were being integrated into production systems faster than traditional reliability disciplines could accommodate. The providers were, in effect, retrofitting safety systems onto moving vehicles.

For CalCompute, this trend is directly relevant. One of the Initiative’s core mandates is the provision of sovereign AI computing infrastructure for California public institutions. This analysis demonstrates that procuring such capacity from commercial providers — the alternative to CalCompute — exposes California to exactly the failure modes that have already disrupted public institutions, including UC campuses and K-12 school districts, during this period.

Trend 3: Recovery Time Has Decoupled From Detection Time

A critical and underappreciated finding from this dataset is the systematic decoupling of Mean Time to Detect (MTTD) from Mean Time to Recover (MTTR). In both the AWS and GCP critical incidents, engineers detected the failure within approximately two minutes of the outage beginning. In the AWS case, the deleted DNS records were restored within roughly three hours. Yet the outage continued for eleven additional hours.

The reason is distributed state debt: modern cloud systems accumulate inconsistency during an outage that does not automatically self-correct when the triggering cause is resolved. EC2 lease management inconsistencies, load balancer health check failures, session state drift, and database replication lag must all be individually resolved after the fact. Detection is now nearly instantaneous. Recovery is not.

This has a direct policy implication. Service level agreements negotiated with commercial cloud providers are typically structured around uptime percentages and incident detection commitments. They do not capture recovery duration adequately. A provider that detects a problem in two minutes and takes fourteen hours to fully restore service has technically met its detection SLA while causing catastrophic disruption. CalCompute’s service design and public accountability framework must explicitly address MTTR, not merely MTTD.
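
The distinction can be made concrete with a short calculation. The timestamps below are hypothetical but shaped like the AWS timeline described earlier; the point is that detection time, cause-resolution time, and full-recovery time are three different numbers, and an SLA written around only the first looks excellent even when the last is catastrophic.

```python
# Illustrative sketch: separating detection, mitigation, and full recovery.
# Timestamps are hypothetical, shaped like the incident timelines above.
from datetime import datetime

def hours(a: str, b: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 3600

incident = {
    "start":           "2025-10-20 07:00",   # outage begins (UTC)
    "detected":        "2025-10-20 07:02",   # failure detected
    "cause_resolved":  "2025-10-20 09:25",   # DNS records restored
    "fully_recovered": "2025-10-20 20:50",   # state debt cleared, service normal
}

mttd = hours(incident["start"], incident["detected"])
fix  = hours(incident["start"], incident["cause_resolved"])
mttr = hours(incident["start"], incident["fully_recovered"])

print(f"time to detect:        {mttd:.2f} h")   # ~0.03 h
print(f"time to fix the cause: {fix:.2f} h")    # ~2.4 h
print(f"time to full recovery: {mttr:.2f} h")   # ~13.8 h, the metric most SLAs miss
```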

Trend 4: Circular Dependencies Make Critical Outages Self-Reinforcing

Three of the eight incidents demonstrated a particularly dangerous property: the service most needed to fix the outage was itself a victim of the outage. In the GCP Service Control failure, the authentication system required to issue repair credentials was the same authentication system that had failed. In Cloudflare’s November outage, the customer-facing dashboard required the Turnstile CAPTCHA service — which ran on the broken proxy layer — making it impossible for Cloudflare’s own customers to log in and manage their configurations during the incident.

In the AWS case, the state recovery required DynamoDB operations — performed against the same DynamoDB infrastructure that had just experienced a multi-hour failure and was still in an inconsistent state. Engineers were repairing the ship’s hull from inside the ship.

These are not design oversights that will be corrected in the next release. They are emergent properties of systems that have grown to depend on themselves at every layer. CalCompute’s architecture must be designed from first principles to avoid this failure mode — maintaining out-of-band management channels, air-gapped control planes, and recovery tooling that does not depend on production-tier services.
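
One way to make that requirement auditable is sketched below, with entirely hypothetical service names: each service declares what it needs in order to be recovered, and any cycle in that graph is flagged, because a cycle means some service's recovery path runs through itself.

```python
# Illustrative sketch (hypothetical service names): flag circular recovery
# dependencies in a declared "what do I need in order to recover?" graph.
from typing import Dict, List

recovery_needs: Dict[str, List[str]] = {
    "auth":       ["config-db"],
    "config-db":  ["auth"],          # circular: fixing config-db needs auth, and vice versa
    "dashboard":  ["auth", "cdn"],
    "cdn":        [],
    "mgmt-plane": [],                # out-of-band: depends on nothing in production
}

def find_cycles(graph: Dict[str, List[str]]) -> List[List[str]]:
    cycles, path, visiting, done = [], [], set(), set()

    def visit(node: str) -> None:
        if node in done:
            return
        if node in visiting:                      # back-edge: a recovery loop
            cycles.append(path[path.index(node):] + [node])
            return
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, []):
            visit(dep)
        path.pop()
        visiting.discard(node)
        done.add(node)

    for node in graph:
        visit(node)
    return cycles

for cycle in find_cycles(recovery_needs):
    print("circular recovery dependency:", " -> ".join(cycle))
```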

Trend 5: California Absorbs Disproportionate Economic and Social Harm

No state is more exposed to commercial cloud failure than California. This is not coincidental — it is the direct consequence of the state’s role as the geographic headquarters of American technology. Google, Amazon, Apple, Meta, Snap, Cloudflare, and dozens of the companies most affected by these outages are based in California. The state’s $4 trillion economy is more deeply integrated with cloud infrastructure than any comparable jurisdiction on earth.

When AWS fails, Snapchat fails — and Snapchat is in Santa Monica. When GCP fails, Google Workspace fails — and the overwhelming majority of California’s K-12 school districts, UC campuses, and state agency offices run Google Workspace. When Cloudflare fails, Claude and ChatGPT fail — and the Bay Area’s AI economy, now the most economically significant in the world, depends on those services.

The harm is not merely economic. The October AWS outage disrupted Ring security cameras across millions of California households at a time when wildfire-season anxiety about home monitoring was at its annual peak. The GCP outage disrupted medical record access. The Azure outage disrupted Alaska Airlines operations at LAX and SFO, affecting travelers. Public safety, educational continuity, and economic productivity are all implicated. California cannot treat these as acceptable externalities of private infrastructure decisions made in boardrooms with no accountability to state residents.


III. The CalCompute Response: From Evidence to Architecture

The foregoing analysis is not presented as a critique of commercial cloud providers’ engineering competence. AWS, Google, and Microsoft employ the most capable infrastructure engineers in the world. The failures documented here are structural — they arise from the intersection of scale, speed, and commercial incentive — and cannot be engineered away within the current private-provider model. The policy response must match the structural diagnosis.

What the Outage Record Demands of CalCompute

The eight incidents examined here collectively specify a set of non-negotiable requirements for any public computing alternative that California builds. These requirements did not come from theoretical risk analysis — they came from observed failures that have already harmed Californians:

  1. Staged, blast-radius-limited deployment pipelines — The dominant failure mode across five of eight incidents was a single change deployed globally without containment. CalCompute’s operational protocols must mandate progressive deployment with automatic rollback triggers, with no single change permitted to affect more than a defined percentage of capacity simultaneously.

  2. Out-of-band control plane independence — Three incidents demonstrated that recovery tooling cannot depend on production services. CalCompute must maintain a physically separate management infrastructure capable of operating and recovering the production environment even when production services are entirely unavailable.

  3. MTTR-explicit service accountability — SLAs must be written around time-to-full-recovery, not merely time-to-detection or time-to-root-cause-resolution. The eleven-hour recovery tail of the AWS October incident — after a three-hour fix — must be a separately measured and accountable metric.

  4. Elimination of circular recovery dependencies — CalCompute’s architecture review process must explicitly map and eliminate scenarios in which the service needed to recover a failure is itself affected by that failure. This is an architectural requirement, not an operational one.

  5. Sovereignty over AI compute capacity — The Azure East US capacity crisis demonstrated that commercial providers will prioritize aggregate demand over individual customers when capacity is scarce. California’s public institutions — UC campuses, state agencies, K-12 districts — cannot be subject to allocation failure when AI compute demand surges. CalCompute must maintain dedicated, ring-fenced capacity for public-interest workloads.

  6. Geographic and jurisdictional resilience — Every critical incident in this dataset originated in infrastructure controlled by a private corporation operating under federal and international commercial law, with no obligation to California’s public interest, regardless of where that corporation happens to be headquartered. CalCompute’s sovereign mandate means its operational continuity decisions are made by and for California, not by the shareholders of a commercial provider.

The “But Commercial Cloud Is Cheaper” Objection

The most common objection raised against CalCompute investment is that commercial cloud infrastructure is cheaper than building sovereign capacity. The outage record shows that the true cost comparison is considerably more complex than unit pricing suggests.

The October 2025 AWS outage alone caused estimated economic losses exceeding $1 billion globally. Published third-party estimates put the cost of a nationwide internet outage in the United States at approximately $458 million per hour. IT downtime costs averaged $14,056 per minute across enterprises in 2024. When UC Riverside students could not submit assignments, when California hospitals could not access records, when state agency workers lost their productivity tools, when millions of Ring cameras went offline — these costs do not appear in cloud provider billing statements. They are externalized to California taxpayers, institutions, and residents.

A full cost accounting of commercial cloud dependency must include: direct productivity losses during outages, the cost of redundancy systems required to hedge against outage risk, the cost of legal and regulatory exposure when service failures affect sensitive data, the economic multiplier effect of tech-sector downtime in an economy as cloud-dense as California’s, and the unquantifiable cost of eroding public trust in digital government services.

The question is not whether CalCompute can match commercial cloud pricing. The question is whether California can afford the true, fully-loaded cost of commercial cloud dependency.

Serving California’s Public Interest Specifically

The outage record reveals specific California public-interest failures that a CalCompute-hosted alternative would have been structured to prevent or mitigate. Three deserve particular emphasis.

Education. The Canvas LMS outage during the October AWS incident — affecting UC Riverside and dozens of California community colleges mid-semester — represents the kind of critical public-sector disruption that CalCompute is designed to address. Educational institutions procuring cloud services from commercial providers accept a structural dependency on infrastructure that has no obligation to prioritize their continuity. A CalCompute-hosted educational infrastructure tier would be explicitly designed and resourced with educational availability as a first-order requirement, not a commercial afterthought.

Healthcare. The GCP June 2025 outage, which began at 10:49 AM on a weekday, disrupted medical record access systems across California. Healthcare providers operating on commercial cloud platforms have no path to escalate their criticality when a provider experiences a global control-plane failure. CalCompute’s public accountability structure creates exactly that path — healthcare institutions would be recognized stakeholders in CalCompute’s capacity planning and incident response protocols, not anonymous line items on a commercial usage report.

Smart infrastructure and public safety. Ring cameras, smart traffic systems, and emergency notification platforms all experienced disruptions during one or more of the documented outages. As California continues to build out smart city infrastructure and AI-assisted emergency management systems, the dependency on commercial cloud providers for these public safety applications creates unacceptable risk. Critical public safety systems must run on infrastructure with a direct public accountability obligation.


IV. Implications for Policy and Implementation

The evidence assembled in this analysis supports several concrete policy conclusions that bear on the CalCompute Initiative’s implementation trajectory as of September 2026.

Priority 1: Public-Sector Dependency Mapping

The most urgent immediate step is a comprehensive audit of California public-sector dependencies on commercial cloud infrastructure. Which state agencies, school districts, UC and CSU campuses, hospital systems, and public safety organizations run mission-critical workloads on AWS, Azure, or GCP? Which of those workloads experienced disruption during any of the eight documented outages? The outage record analyzed here provides a retroactive framework for this mapping — but prospective risk assessment requires systematic inventory.

CalCompute should commission this dependency mapping as a precondition for prioritized migration planning. Workloads serving the highest-criticality public functions — emergency management, public health records, educational assessment systems, and financial aid disbursement — should be the first candidates for CalCompute-hosted alternatives.

Priority 2: SLA Reform for Interim Commercial Contracts

Until CalCompute capacity is sufficient to serve public-sector workloads at scale, California’s public institutions will continue to rely on commercial cloud providers. The outage record analyzed here demonstrates that existing commercial SLAs are inadequate to protect public-sector interests.

The state should, through the Department of General Services and CalOES, establish revised minimum SLA requirements for commercial cloud contracts serving state agencies and educational institutions. These requirements should include: time-to-full-recovery metrics (not merely uptime percentages), mandatory public disclosure of incidents affecting California public-sector workloads, graduated financial remedies proportional to the public impact of outages, and explicit redundancy requirements for workloads designated as critical public infrastructure.
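
As a purely hypothetical illustration of what "graduated financial remedies proportional to public impact" could look like in contract language, the sketch below keys remedies to time-to-full-recovery and to the criticality tier of the affected workload. The tiers, percentages, and multipliers are invented for exposition and are not proposed figures.

```python
# Purely hypothetical remedy schedule: all numbers are invented for exposition,
# not proposed contract terms. Keyed to full recovery time, not detection time.
CRITICALITY_MULTIPLIER = {"routine": 1.0, "essential": 2.0, "critical-public-safety": 4.0}
REMEDY_TIERS = [          # (full-recovery hours exceeded, % of monthly fees credited)
    (1,   5.0),
    (4,  15.0),
    (8,  30.0),
    (12, 60.0),
]

def remedy_percent(hours_to_full_recovery: float, criticality: str) -> float:
    base = 0.0
    for threshold, pct in REMEDY_TIERS:
        if hours_to_full_recovery > threshold:
            base = pct                     # keep the highest tier exceeded
    return min(100.0, base * CRITICALITY_MULTIPLIER[criticality])

# Example: a 14-hour full recovery, comparable to the October AWS incident.
print(remedy_percent(14.0, "critical-public-safety"))   # 100.0 (capped)
print(remedy_percent(14.0, "routine"))                   # 60.0
```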

Priority 3: CalCompute Architecture Must Embody Trend-Derived Lessons

The five structural trends identified in Section II should be treated as binding architectural constraints on CalCompute’s technical design, not aspirational objectives. Specifically:

  • The deployment pipeline architecture must be specified and independently audited before any production workload is onboarded.
  • Out-of-band control plane independence must be demonstrated under simulated full-production failure conditions before CalCompute declares operational readiness.
  • MTTR accountability must be built into CalCompute’s performance framework from inception — not bolted on after the first incident.
  • The AI compute sovereignty function — ring-fenced capacity for UC and state agency AI workloads — must be treated as a day-one requirement, not a future expansion, given the surge in AI-related cloud demand documented in Trend 2.

Priority 4: Public Transparency as a Structural Differentiator

One of CalCompute’s most significant potential differentiators from commercial providers is its public accountability obligation. When AWS suffers a major outage, its public-facing status page lags behind the actual incident. When Google writes a post-incident report, it controls the framing, the timeline, and which technical details to disclose. When Cloudflare repeats the same failure category seventeen days after pledging to prevent it, there is no external accountability mechanism.

CalCompute should commit to a public incident reporting framework that includes: real-time status reporting with granular service-by-service accuracy, post-incident reports released within 72 hours of resolution, quarterly reliability reports reviewed by a public advisory body, and a formal process by which California public institutions can report service quality concerns. This framework would transform CalCompute from a technically differentiated alternative into a demonstrably accountable one — and would set a standard that commercial providers may ultimately be compelled to match.
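
A minimal sketch of what a machine-readable public incident report might contain appears below; the field names are illustrative only and do not represent an adopted CalCompute standard. The essential point is that full-recovery time and public-sector impact are first-class, mandatory fields.

```python
# Hypothetical schema sketch for the public reporting framework described above;
# field names are illustrative, not an adopted CalCompute standard.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PublicIncidentReport:
    incident_id: str
    services_affected: List[str]
    start_utc: str
    detected_utc: str
    fully_recovered_utc: str                 # MTTR is first-class, not an afterthought
    root_cause_summary: str
    public_institutions_affected: List[str] = field(default_factory=list)
    corrective_actions: List[str] = field(default_factory=list)
    published_within_72h: bool = True
```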


Conclusion: The Evidence Is In

For two years, commercial cloud failures have visited measurable harm on California residents, students, workers, and institutions. The pattern is not one of random bad luck. It is the systematic consequence of concentrating critical public-interest infrastructure in private providers who are structurally optimized for deployment velocity and commercial scale, not for the kind of deliberate, accountable, public-service reliability that California’s 39 million residents deserve.

The CalCompute Initiative is not a reaction to these failures. It was conceived in anticipation of them — grounded in the recognition that California’s digital future cannot be entrusted entirely to infrastructure whose priorities it does not control. What this two-year record of outages provides is not the rationale for CalCompute, but the empirical vindication of it.

The five structural trends documented here — the global blast radius of routine changes, the acceleration of AI-driven infrastructure instability, the decoupling of detection from recovery, the self-reinforcing circularity of critical failures, and the disproportionate burden borne by California — are not trends that will self-correct. They are endemic to the current model. CalCompute offers California a different model: one built on public accountability, sovereign design, and the explicit prioritization of California’s public interest.

Every hour the October 2025 AWS outage ran, California bore costs — in lost productivity, in disrupted education, in degraded public safety, in eroded trust — that never appeared on anyone's AWS bill. CalCompute changes the calculus.

The evidence is in. The case is made. The work of implementation is what California now owes its public.

— ◆ —

CalCompute Policy Analysis  ·  Published September 26, 2026  ·  CalCompute Initiative Policy Team

This analysis draws on official post-incident reports from AWS, Google Cloud, and Microsoft Azure; Cloudflare incident blog posts; ThousandEyes network intelligence data; Downdetector/Ookla incident tracking; and published coverage from AP, TechCrunch, CIO, and DemandSage. All economic estimates are drawn from publicly available third-party analyses. This document does not constitute legal advice.