Understanding SLA Metrics: MTTR, Uptime, and Incident Response
What do 99.9% and 99.99% uptime actually mean? A practical guide to SLA metrics every engineering team should track.
Sofia is a senior infrastructure engineer at PulsAPI who specializes in on-call tooling and incident response automation. She has worked in SRE roles at cloud-native companies for over eight years.
The Nines of Uptime
SLA uptime percentages look similar but mean vastly different things in practice. 99% uptime allows for 87.6 hours of downtime per year — over 3.5 days. 99.9% (three nines) reduces that to 8.76 hours. 99.99% (four nines) means only 52.6 minutes of downtime per year. And 99.999% (five nines) allows just 5.26 minutes.
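The figures above follow from simple arithmetic, sketched below. The function name `allowed_downtime_minutes` is illustrative, and the calculation assumes a 365-day year (8,760 hours), which is what the numbers in this article use.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def allowed_downtime_minutes(uptime_percent: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100 - uptime_percent) / 100 * period_hours * 60


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime -> {minutes / 60:.2f} hours "
          f"({minutes:.2f} minutes) of downtime per year")
```

Passing a different `period_hours` (for example `30 * 24` for a monthly SLA) gives the budget for that period instead.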
Most cloud providers advertise three to four nines of uptime, but the reality often differs. Partial outages, degraded performance, and regional incidents can reduce effective uptime below advertised SLAs. The only way to know for sure is to measure it yourself.
PulsAPI tracks uptime on a per-service, per-component basis over rolling 90-day windows. This gives you granular data that matches your actual experience, not the provider's marketing page.
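As a rough illustration of what a rolling-window measurement involves, here is a minimal sketch. It is not PulsAPI's implementation. The function name `rolling_uptime_percent` and the input format (a list of outage start and end times for one component) are assumptions, and outages are assumed not to overlap each other.

```python
from datetime import datetime, timedelta


def rolling_uptime_percent(outages, now, window_days=90):
    """Uptime percentage for one component over the window ending at `now`.

    `outages` is a list of (start, end) datetime pairs, assumed non-overlapping.
    """
    window_start = now - timedelta(days=window_days)
    downtime = timedelta()
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, now)
        if clipped_end > clipped_start:
            downtime += clipped_end - clipped_start
    window = now - window_start
    return 100 * (1 - downtime / window)


# Hypothetical outage: 2 h 9 min 36 s, which is 0.1% of a 90-day window.
now = datetime(2024, 4, 1)
outages = [(datetime(2024, 3, 1, 0, 0, 0), datetime(2024, 3, 1, 2, 9, 36))]
print(f"{rolling_uptime_percent(outages, now):.3f}%")
```

Running the same calculation per service and per component is what gives the granular view: one degraded component does not have to drag down the number for the whole provider.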
MTTR: Mean Time to Recovery
MTTR measures how long, on average, it takes for a service to recover from an incident. A provider with 99.9% uptime and a 10-minute average MTTR has a very different reliability profile than one with 99.9% uptime and a 4-hour MTTR, even though the total downtime is about the same: the first spreads it across many short incidents, while the second concentrates it in a few long ones.
Low MTTR indicates that a provider has mature incident response processes: fast detection, clear communication, and effective remediation. High MTTR suggests systemic issues with how they handle outages.
When evaluating vendors, MTTR is often more actionable than raw uptime percentage. A service that has frequent but quickly resolved blips may be more reliable in practice than one with rare but hours-long outages.
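To make the comparison concrete, the sketch below computes MTTR for two hypothetical providers with identical total downtime. The incident durations are invented for illustration and the helper name `mttr_minutes` is an assumption.

```python
from statistics import mean


def mttr_minutes(incident_durations):
    """Mean time to recovery: the average incident duration, in minutes."""
    if not incident_durations:
        raise ValueError("at least one incident is required")
    return mean(incident_durations)


# Hypothetical data: both providers were down for 240 minutes in total.
frequent_blips = [10] * 24   # 24 incidents of 10 minutes each
rare_outage = [240]          # a single 4-hour outage

for name, durations in (("frequent blips", frequent_blips),
                        ("rare outage", rare_outage)):
    print(f"{name}: total downtime {sum(durations)} min, "
          f"MTTR {mttr_minutes(durations)} min")
```

Both providers report the same uptime percentage, but a caller that retries for a few minutes would ride out every incident from the first provider and none from the second.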
Incident Frequency and Severity
Beyond uptime and MTTR, the frequency and severity distribution of incidents tells you about a provider's operational maturity. A service that has one major outage per year is in a fundamentally different category than one with weekly degraded performance events.
PulsAPI categorizes incidents by severity: Operational, Degraded, Partial Outage, Major Outage, and Maintenance. Tracking the distribution of these over time reveals patterns. Does a provider tend to have issues during deployments? Are outages clustered around specific regions?
This pattern analysis turns reactive monitoring into proactive risk management. If you see a provider trending toward more frequent degraded states, that's an early signal to evaluate backup options before a major outage hits.
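One simple way to look for such a trend is to count incidents per severity per month. The sketch below uses the severity labels listed above; the incident log is hypothetical, and the helper names `severity_by_month` and `monthly_counts` are assumptions rather than part of any PulsAPI API.

```python
from collections import Counter, defaultdict
from datetime import date

# Severity labels as listed in this article.
SEVERITIES = ("Operational", "Degraded", "Partial Outage",
              "Major Outage", "Maintenance")


def severity_by_month(incidents):
    """Count incidents per severity label for each (year, month) bucket."""
    buckets = defaultdict(Counter)
    for day, severity in incidents:
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity!r}")
        buckets[(day.year, day.month)][severity] += 1
    return dict(buckets)


def monthly_counts(buckets, severity):
    """Counts of one severity per month, in chronological order.

    Months with no incidents at all are absent from `buckets` and are skipped.
    """
    return [buckets[month][severity] for month in sorted(buckets)]


# Hypothetical incident log for one provider.
incident_log = [
    (date(2024, 1, 9), "Degraded"),
    (date(2024, 2, 3), "Degraded"),
    (date(2024, 2, 20), "Degraded"),
    (date(2024, 3, 1), "Degraded"),
    (date(2024, 3, 12), "Degraded"),
    (date(2024, 3, 15), "Partial Outage"),
    (date(2024, 3, 28), "Degraded"),
]

buckets = severity_by_month(incident_log)
degraded_trend = monthly_counts(buckets, "Degraded")
print(degraded_trend)
```

A month-over-month rise in the "Degraded" counts is the kind of early signal described above. Grouping by region or by deployment window instead of by month would follow the same pattern.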
Using SLA Data in Vendor Reviews
Armed with 90 days of objective SLA data, you can turn vendor review meetings from anecdotal discussions into data-driven evaluations. You can present uptime trends, MTTR comparisons, and incident frequency charts that show exactly how each dependency has performed.
For enterprise contracts, this data is leverage. If a vendor's SLA guarantees 99.99% uptime but your monitoring shows 99.95%, you may have grounds for service credits and contract renegotiation, depending on how the contract defines and measures downtime.
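The gap between a guaranteed and a measured percentage is easier to discuss when expressed in minutes. This is a minimal sketch under the assumption of a 90-day measurement window; the function name `sla_shortfall_minutes` is illustrative, and real contracts define their own measurement periods and exclusions.

```python
def sla_shortfall_minutes(guaranteed_percent: float,
                          measured_percent: float,
                          window_days: int = 90) -> float:
    """Minutes of downtime beyond the SLA budget over the window (0 if within budget)."""
    window_minutes = window_days * 24 * 60
    budget = (100 - guaranteed_percent) / 100 * window_minutes
    actual = (100 - measured_percent) / 100 * window_minutes
    return max(0.0, actual - budget)


shortfall = sla_shortfall_minutes(guaranteed_percent=99.99, measured_percent=99.95)
print(f"Downtime beyond the SLA budget: {shortfall:.2f} minutes over 90 days")
```

For the 99.99% versus 99.95% example, the budget over 90 days is about 13 minutes, while the measured downtime is about 65 minutes.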
For startups and growing teams, SLA data helps prioritize where to invest in redundancy. If your payment processor has been rock-solid but your email delivery service has had monthly issues, you know where to build fallbacks first.