Engineering · March 25, 2026 · 8 min read · By James Okafor

Cloud Outage Report: Which Services Had the Most Downtime in Q1 2026

PulsAPI analyzed 1,240 incidents across 292 cloud services in Q1 2026. Here's which services had the most outages, the longest MTTR, and the worst SLA compliance.

Methodology

This report covers January 1 through March 31, 2026 (Q1 2026). Data is sourced from PulsAPI's monitoring of 292 cloud services, which polls official vendor status pages every 60 seconds and records status transitions with millisecond-precision timestamps. All metrics reflect what was publicly reported by vendors on their status pages — we do not include incidents detected solely by our crawlers or by community reports that were never officially acknowledged.

We categorized incidents by severity (Degraded Performance, Partial Outage, Major Outage) and measured uptime as the percentage of the 90-day period in which a service showed Operational status. MTTR (Mean Time to Recovery) is measured per incident, from the first public status transition away from Operational to the transition back to Operational.

Methodology note: uptime percentages reflect vendor-reported status only. Actual experienced uptime may differ, for example when performance degrades but not enough for the vendor to post a status change. These figures are the same metrics that would apply to vendor SLA credit claims.
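As a rough illustration of the definitions above, the following Python sketch derives per-incident durations, uptime, and median recovery time from a list of status transitions. The `Transition` shape and all function names are assumptions made for this example, not PulsAPI's actual implementation. It also assumes the service is Operational at the start of the period and that all transitions fall inside it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional

OPERATIONAL = "Operational"


@dataclass(frozen=True)
class Transition:
    """One status change seen on a vendor status page (hypothetical shape)."""
    at: datetime
    status: str  # e.g. "Operational", "Degraded Performance", "Major Outage"


def incident_durations(transitions: List[Transition],
                       period_end: datetime) -> List[timedelta]:
    """One duration per incident: first non-Operational status until recovery."""
    durations = []
    started = None
    for t in sorted(transitions, key=lambda t: t.at):
        if t.status != OPERATIONAL and started is None:
            started = t.at  # incident opens
        elif t.status == OPERATIONAL and started is not None:
            durations.append(t.at - started)  # incident closes
            started = None
        # severity changes inside an open incident do not start a new one
    if started is not None:
        # Incident still open at the end of the period: count it up to the boundary.
        durations.append(period_end - started)
    return durations


def uptime_percent(transitions: List[Transition],
                   period_start: datetime,
                   period_end: datetime) -> float:
    """Share of the period spent in Operational status, as a percentage."""
    total = period_end - period_start
    down = sum(incident_durations(transitions, period_end), timedelta())
    return 100.0 * (1 - down / total)


def median_mttr_minutes(transitions: List[Transition],
                        period_end: datetime) -> Optional[float]:
    """Median incident duration in minutes, or None if there were no incidents."""
    minutes = [d.total_seconds() / 60 for d in incident_durations(transitions, period_end)]
    return median(minutes) if minutes else None


def downtime_hours(uptime_pct: float, period_days: int = 90) -> float:
    """Convert an uptime percentage into hours of non-Operational time."""
    return (1 - uptime_pct / 100) * period_days * 24
```

For instance, `downtime_hours(99.9)` gives roughly 2.16 hours for a 90-day quarter, which is a quick way to sanity-check any uptime figure against its downtime equivalent.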

By the Numbers: Q1 2026 Summary

Across 292 monitored services in Q1 2026, PulsAPI recorded 1,240 status incidents. Of these, 61% were Degraded Performance events, 29% were Partial Outages, and 10% were Major Outages. The average uptime across all monitored services was 99.71%, which means the average service spent approximately 6.3 hours of the 90-day quarter (2,160 hours) in a sub-operational state.

The median MTTR across all incidents was 47 minutes. Major Outages had a median MTTR of 2 hours 14 minutes. The fastest-recovering category was Degraded Performance events at 22 minutes median MTTR, which typically reflects self-healing infrastructure or quick rollbacks.

Category breakdown: Cloud Infrastructure (AWS, GCP, Azure) had the lowest average incident rate but the longest MTTRs when incidents occurred — a reflection of their operational complexity. Communication services (Twilio, SendGrid) had the highest incident frequency but generally short MTTRs. AI/ML services showed the highest variance, with most days fully operational but occasional deep degradation events.

Standout Reliability Performers

Fastly was the standout CDN performer in Q1 2026, recording 99.98% uptime with zero Major Outage events. Cloudflare followed at 99.96%. Both companies invest heavily in distributed infrastructure and have demonstrated consistent reliability across multiple quarters of PulsAPI monitoring.

In the payments category, Braintree and Adyen posted 99.97% and 99.95% uptime respectively — outperforming their larger competitor Stripe, which recorded 99.89% due to two Partial Outage events in February and March affecting Dashboard and API services.

GitHub achieved 99.94% uptime despite two visible degraded performance events affecting Actions and API — better than its trailing-twelve-month average and reflecting improvements the team made to their build infrastructure in late 2025.

Where Engineering Teams Should Focus Redundancy Efforts

Based on Q1 2026 data, the services with the highest combined 'risk score' (incident frequency × MTTR × user dependency) are email delivery APIs, third-party authentication providers, and AI/ML inference APIs. These categories are high-dependency, have moderate-to-high incident rates, and often lack easy failover options — making them the most dangerous single points of failure for modern SaaS products.
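The risk score described above is a simple product, so it is easy to compute for your own dependency list. The sketch below is one possible way to do it in Python. The `ServiceStats` fields, the 0 to 1 dependency weight, and the sample values are all hypothetical placeholders for illustration, not figures from this report.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceStats:
    """Per-dependency inputs to the risk score (hypothetical shape)."""
    name: str
    incidents_per_quarter: int
    median_mttr_minutes: float
    dependency_weight: float  # assumed scale: 0.0 (optional) to 1.0 (product is down without it)


def risk_score(s: ServiceStats) -> float:
    """incident frequency x MTTR x user dependency."""
    return s.incidents_per_quarter * s.median_mttr_minutes * s.dependency_weight


def rank_by_risk(services: List[ServiceStats]) -> List[ServiceStats]:
    """Highest-risk dependencies first."""
    return sorted(services, key=risk_score, reverse=True)


# Placeholder inputs, made up for this example.
stack = [
    ServiceStats("cdn", incidents_per_quarter=1, median_mttr_minutes=20, dependency_weight=0.5),
    ServiceStats("email-api", incidents_per_quarter=6, median_mttr_minutes=40, dependency_weight=0.9),
    ServiceStats("auth-provider", incidents_per_quarter=3, median_mttr_minutes=60, dependency_weight=1.0),
]

for s in rank_by_risk(stack):
    print(f"{s.name}: {risk_score(s):.0f}")
```

The absolute numbers mean little on their own; the ranking is what tells you where a fallback is most worth building.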

Email delivery is the most commonly overlooked of these. SendGrid, Mailgun, and Postmark collectively serve the majority of transactional email for SaaS products, and each recorded multiple degraded performance events in Q1. Building a simple fallback that switches providers on delivery failure, or that queues emails locally during outages, is one of the highest-ROI reliability investments a growing SaaS can make.
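A provider fallback with a local queue can be quite small. The following Python sketch shows one possible shape, assuming each provider is wrapped in a plain function that raises a `DeliveryError` on failure. `FailoverMailer`, `DeliveryError`, and the sender callables are hypothetical names for this example and do not correspond to any vendor SDK.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple


class DeliveryError(Exception):
    """Raised by a sender when the provider rejects or cannot accept a message."""


# A sender is any callable that delivers one message or raises DeliveryError.
Sender = Callable[[dict], None]


class FailoverMailer:
    """Try providers in order; queue the message locally if all of them fail."""

    def __init__(self, senders: List[Tuple[str, Sender]]) -> None:
        self._senders = list(senders)
        self.pending: Deque[dict] = deque()

    def send(self, message: dict) -> Optional[str]:
        """Return the name of the provider that accepted the message, or None if queued."""
        for name, sender in self._senders:
            try:
                sender(message)
                return name
            except DeliveryError:
                continue  # this provider is failing, try the next one
        self.pending.append(message)
        return None

    def flush(self) -> int:
        """Retry each queued message once; return how many were delivered."""
        delivered = 0
        for _ in range(len(self.pending)):
            message = self.pending.popleft()
            # send() puts the message back on the queue if every provider still fails
            if self.send(message) is not None:
                delivered += 1
        return delivered
```

In practice the queue would need to be durable (a database table or a message broker) rather than in-memory, and `flush()` would run on a timer or when the provider's status returns to Operational.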

For teams using AI/ML APIs as core product functionality (OpenAI, Anthropic, Google Gemini), Q1 data confirms that treating these as potentially unavailable is prudent architecture. The services themselves are relatively new, have complex infrastructure, and are under enormous load growth. Plan for degraded-mode experiences rather than assuming availability.
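One way to plan for a degraded-mode experience is to wrap the inference call so that failures fall through to a cheaper local behavior, and to stop calling the provider for a short cooldown after repeated failures. The Python sketch below illustrates the idea. `DegradedModeClient`, its parameters, and the `primary` and `fallback` callables are assumptions for this example, not part of any provider's SDK.

```python
import time
from typing import Callable, Tuple


class DegradedModeClient:
    """Call a primary function; fall back (and briefly stop retrying) when it fails."""

    def __init__(self,
                 primary: Callable[[str], str],
                 fallback: Callable[[str], str],
                 max_failures: int = 3,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._primary = primary
        self._fallback = fallback
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._skip_until = float("-inf")  # primary is allowed from the start

    def call(self, prompt: str) -> Tuple[str, bool]:
        """Return (result, degraded). degraded is True when the fallback answered."""
        now = self._clock()
        if now >= self._skip_until:
            try:
                # Real code should catch only the provider's timeout/availability errors.
                result = self._primary(prompt)
            except Exception:
                self._failures += 1
                if self._failures >= self._max_failures:
                    # Too many consecutive failures: skip the primary for a while.
                    self._skip_until = now + self._cooldown
                    self._failures = 0
            else:
                self._failures = 0
                return result, False
        return self._fallback(prompt), True
```

The `degraded` flag lets the UI tell users that they are seeing a reduced experience, for example a cached answer or a "try again shortly" message, instead of an unexplained error.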

About the Author

James Okafor, CTO

James is CTO of PulsAPI. Before PulsAPI he was a staff engineer at a Series C infrastructure company where third-party outages were a constant operational pain. He started PulsAPI to solve the problem once and for all.
