The Biggest AWS Outages in History: A Complete Timeline and What Engineers Can Learn
A definitive timeline of the biggest AWS outages from 2011 to 2025, including S3, DynamoDB, Lambda, and us-east-1 incidents. Root causes, blast radius, and durable lessons for every engineering team.
Why AWS Outages Dominate the Internet's Worst Days
AWS runs an estimated 32% of the global public cloud market, which means a major AWS incident isn't just one company's problem — it's an internet-scale event. When us-east-1 has a bad day, so do Slack, Disney+, Coinbase, Robinhood, Reddit, and tens of thousands of SaaS businesses that silently depend on the region for control-plane operations.
The pattern repeats because us-east-1 is the default region in countless SDKs, Terraform modules, and tutorials. Global AWS services like IAM, Route 53 health checks, and billing also anchor in us-east-1, so a regional incident can look like a global one even for workloads running in other regions.
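The default-region trap is easy to reproduce. Many SDKs and tools resolve a region by falling back through explicit configuration, environment variables, and finally a hard-coded default of us-east-1. A minimal sketch of that fallback chain (the function name and exact precedence are illustrative, not copied from any specific SDK):

```python
import os

def resolve_region(explicit=None):
    """Mimic the fallback chain many AWS SDKs and tools use:
    explicit argument, then environment, then a hard-coded
    default of us-east-1."""
    return (
        explicit
        or os.environ.get("AWS_REGION")
        or os.environ.get("AWS_DEFAULT_REGION")
        or "us-east-1"  # the silent default that concentrates workloads
    )

# With nothing configured, every workload quietly lands in us-east-1:
os.environ.pop("AWS_REGION", None)
os.environ.pop("AWS_DEFAULT_REGION", None)
print(resolve_region())             # us-east-1
print(resolve_region("eu-west-1"))  # an explicit pin wins
```

The fix is the last line: pin the region explicitly in code and infrastructure templates so a copied tutorial or unset environment variable can't silently route you into the busiest region on the internet.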
Understanding the history of AWS outages is not academic. It tells you which failure modes are likely to repeat, which dependencies are load-bearing for your own stack, and where to invest in redundancy before the next incident forces the decision.
The Most Significant AWS Incidents, 2011–2025
April 2011 — EBS re-mirroring storm, us-east-1. A network change triggered a cascade of EBS volumes attempting to re-mirror simultaneously. Reddit, Quora, and Foursquare went dark for most of a workday. Duration: ~4 days for full recovery. Lesson: control-plane congestion compounds faster than data-plane failure.
February 2017 — S3 typo outage, us-east-1. An engineer removing capacity from a billing subsystem mistakenly took down a larger set of servers. S3 was unavailable for 4 hours and took most of the AWS Console with it — including the Service Health Dashboard, which ironically couldn't update because it relied on S3. Lesson: your status page must not depend on the system it reports on.
November 2020 — Kinesis outage, us-east-1. An OS file-descriptor limit was hit after a capacity addition, degrading Kinesis for 9 hours and cascading into Cognito, CloudWatch, EventBridge, and dozens of downstream services. Lesson: shared infrastructure multiplies blast radius invisibly.
December 2021 — us-east-1 network device failure. Automated scaling activity triggered abnormal network behavior in the internal AWS network, impacting the AWS Console, API Gateway, Fargate, Lambda, and EC2 APIs for over 7 hours. Lesson: the AWS Console and its APIs share fate — a runbook that requires Console access during a us-east-1 incident is not a runbook.
June 2023 — Lambda and API Gateway degradation. A capacity-management subsystem caused elevated error rates for Lambda invocations and API Gateway responses for roughly 3 hours in us-east-1. Lesson: serverless does not mean zero-dependency.
October 2025 — DynamoDB throughput throttling, multiple regions. Issues in a shared metadata layer led to elevated DynamoDB error rates, affecting even teams whose data plane was explicitly multi-region. Lesson: multi-region redundancy doesn't help when the control plane is shared.
The Patterns That Keep Repeating
Across 15 years of incidents, a few patterns dominate. First, us-east-1 is disproportionately represented — not because the region is poorly operated, but because it carries more traffic, more global service dependencies, and more customer workloads than any other region. If you can run outside us-east-1, you should.
Second, the failure mode is almost always a control-plane or shared-infrastructure problem rather than a raw hardware failure. Capacity additions, configuration changes, and throttling subsystems are the usual triggers. This means engineering teams can't protect themselves purely by adding more AWS — they need to reduce coupling to AWS control planes during an incident.
Third, the AWS Service Health Dashboard is systematically late. Across the incidents above, customer-impacting degradation was detectable via synthetic probes and community chatter 15 to 90 minutes before AWS acknowledged the incident on its status page. PulsAPI users consistently see earlier signals because our monitoring combines real probe data, community reports, and vendor status feeds in parallel.
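A synthetic probe doesn't need to be elaborate; the point is to classify real request outcomes instead of waiting for a vendor status feed. A minimal sketch of the classification step, where the thresholds (500 ms latency SLO, 5% error budget, 50% hard-down cutoff) are illustrative assumptions, not AWS-published values:

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    status_code: int   # HTTP status returned by the probed endpoint
    latency_ms: float  # observed round-trip time

def classify(results, latency_slo_ms=500.0, error_budget=0.05):
    """Label a window of probe results as healthy/degraded/down.
    Thresholds are illustrative; tune them to your own SLOs."""
    if not results:
        return "unknown"
    n = len(results)
    errors = sum(1 for r in results if r.status_code >= 500)
    slow = sum(1 for r in results if r.latency_ms > latency_slo_ms)
    if errors / n > 0.5:
        return "down"
    if errors / n > error_budget or slow / n > error_budget:
        return "degraded"
    return "healthy"

window = [ProbeResult(200, 120), ProbeResult(200, 130),
          ProbeResult(503, 900), ProbeResult(200, 110)]
print(classify(window))  # degraded: 25% errors exceeds the 5% budget
```

Run a window like this per dependency per region, and the "degraded" signal fires on the first bad minutes of an incident rather than on the status-page update that follows it.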
Building an AWS Resilience Posture That Actually Holds
The durable lessons from AWS outage history translate into a small set of concrete practices. Avoid us-east-1 as your default region when you have a choice. Treat IAM, Route 53, and billing as us-east-1 dependencies regardless of where your workload runs, and assume they can fail together. Do not rely on the AWS Console to run an incident — your runbook must work via AWS CLI, Terraform, or a pre-configured break-glass path.
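The break-glass path is only real if it is verified before the incident. A small sketch of a preflight check you might run on a schedule, with the checks themselves (CLI on PATH, credentials file present) as illustrative examples to adapt to your own runbook:

```python
import shutil
from pathlib import Path

def breakglass_preflight(cli="aws",
                         creds=Path.home() / ".aws" / "credentials"):
    """Return a list of problems that would block a Console-free
    incident response. An empty list means the break-glass path is
    ready. Checks are illustrative; extend with your own runbook's
    requirements (pre-fetched tokens, pinned Terraform state, etc.)."""
    problems = []
    if shutil.which(cli) is None:
        problems.append(f"{cli} CLI not on PATH")
    if not Path(creds).exists():
        problems.append(f"no credentials file at {creds}")
    return problems

# Run periodically from CI or cron — not for the first time mid-incident:
for problem in breakglass_preflight():
    print("BLOCKED:", problem)
```

The design point is the cadence: a runbook dependency you last exercised six months ago is a dependency you don't have.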
Subscribe to AWS components individually in PulsAPI rather than treating 'AWS' as a single entity. S3 us-east-1 being degraded has nothing to do with EC2 us-west-2 being healthy. PulsAPI tracks more than 200 individual AWS components across every region, so the alerts you receive are scoped to the dependencies you actually use.
Finally, pre-write your customer communication templates for AWS incidents specifically. When us-east-1 goes down, your customers will see outages across dozens of SaaS tools simultaneously — they don't need another vague 'we're investigating' post. They need a clear statement of which of your features are affected, which are not, and what they should do in the meantime. Teams that publish within 5 minutes of a major AWS incident consistently earn trust; teams that publish 40 minutes in consistently lose it.
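Pre-writing the template means the only work left during an incident is filling in the blanks. A minimal sketch using Python's stdlib `string.Template` (the fields and wording are a hypothetical starting point, not a prescribed format):

```python
from string import Template
from datetime import datetime, timezone

# Pre-written AWS-incident template; only the fields are filled at
# publish time, so the post can go out within minutes.
AWS_INCIDENT_TEMPLATE = Template(
    "[$time UTC] We are aware of an ongoing AWS incident ($scope).\n"
    "Affected features: $affected.\n"
    "Unaffected features: $unaffected.\n"
    "What you should do: $action.\n"
)

def render(scope, affected, unaffected, action, now=None):
    now = now or datetime.now(timezone.utc)
    return AWS_INCIDENT_TEMPLATE.substitute(
        time=now.strftime("%H:%M"),
        scope=scope,
        affected=", ".join(affected),
        unaffected=", ".join(unaffected),
        action=action,
    )

print(render("us-east-1, S3",
             affected=["file uploads"],
             unaffected=["login", "dashboards"],
             action="existing files remain readable; retry uploads later"))
```

Note what the template forces you to state: scope, affected, unaffected, and a concrete customer action — exactly the four things a vague 'we're investigating' post omits.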
About the Author
Marcus leads product at PulsAPI, where he focuses on making operational awareness effortless for engineering teams. Previously at Datadog and PagerDuty.
Start monitoring your stack
Aggregate real-time operational data from every service your stack depends on into a single dashboard. Free for up to 10 services.