SLA & SLO · April 3, 2026 · 9 min read · By James Okafor

SLA vs SLO vs SLI: The Definitive Guide for Platform Engineers

Confused by the acronym soup of reliability engineering? This guide demystifies SLAs, SLOs, and SLIs with real-world examples, calculation formulas, and a calculator to find your error budget before it finds you.

Defining the Terms

SLIs are the metrics you measure, SLOs are the targets you aim for, and SLAs are the contracts you sign with customers. Understanding the distinction is the first step toward effective reliability engineering.

Service Level Indicators (SLIs) are the raw measurements: request success rate, latency P99, error rate percentage, throughput. SLIs answer 'what is the system doing right now?' Service Level Objectives (SLOs) are the internal targets you set for those indicators: 'our API must return 99.9% of requests successfully.' SLOs answer 'what should the system be doing?' Service Level Agreements (SLAs) are the external commitments made to customers, often with financial consequences for breach: 'we guarantee 99.9% uptime or provide a service credit.' SLAs answer 'what have we promised?'

The hierarchy matters: SLIs inform SLOs, SLOs should be stricter than SLAs, and the gap between your SLO and SLA is your safety margin. If your SLA promises 99.9% and your SLO targets 99.95%, you have a buffer of 0.05 percentage points, roughly 21.6 minutes over a 30-day window, to absorb incidents without breaching customer commitments.

Choosing the Right SLIs for Your Service

Not every metric is a good SLI. The Google SRE Book identifies four golden signals: latency, traffic, errors, and saturation. Of these, latency and error rate are typically the most user-relevant and make the best foundation for SLIs.

For an API service, strong SLIs include: request success rate (percentage of requests returning non-5xx responses), latency P50 and P95 (median and 95th percentile response times), and availability (percentage of time the service is successfully serving requests). Avoid SLIs that are too granular (individual endpoint latency) or too far removed from what users experience (server CPU utilization) to be meaningful to customers.
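As a rough illustration of the SLIs above, the sketch below computes success rate and latency percentiles from a window of request records. The `RequestRecord` type, the sample data, and the choice of the nearest-rank percentile method are assumptions made for this example; a production system would more likely read these values from its metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


def success_rate(records):
    """Percentage of requests that did not return a 5xx status."""
    if not records:
        raise ValueError("no requests in the window")
    ok = sum(1 for r in records if r.status < 500)
    return 100.0 * ok / len(records)


def latency_percentile(records, pct):
    """Nearest-rank percentile of latency, in milliseconds."""
    if not records:
        raise ValueError("no requests in the window")
    ordered = sorted(r.latency_ms for r in records)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Ten made-up requests: latencies 10..100 ms, the slowest one a 503.
sample = [RequestRecord(200, float(ms)) for ms in range(10, 100, 10)]
sample.append(RequestRecord(503, 100.0))

print(success_rate(sample))            # 90.0
print(latency_percentile(sample, 50))  # 50.0
print(latency_percentile(sample, 95))  # 100.0
```

With only ten samples the P95 lands on the slowest request, which is one reason percentile SLIs are usually computed over much larger windows.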

For third-party dependencies, you often have limited visibility into their internal SLIs — but you can derive SLIs from your own experience. Track the error rate of calls to each dependency, and the latency distribution of those calls. These derived SLIs tell you how your vendors are performing from your perspective, which is what actually matters for your own SLAs.
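One way to derive these dependency SLIs is to record the outcome of every outbound call and aggregate per vendor. The sketch below assumes calls are logged as `(dependency_name, succeeded)` pairs; the dependency names and the sample data are hypothetical.

```python
from collections import defaultdict


def dependency_error_rates(calls):
    """Return the error rate (in percent) per dependency.

    `calls` is an iterable of (dependency_name, succeeded) tuples.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for name, succeeded in calls:
        totals[name] += 1
        if not succeeded:
            failures[name] += 1
    return {name: 100.0 * failures[name] / totals[name] for name in totals}


# Hypothetical call log: four calls to "payments" (one failed), two to "auth".
call_log = [
    ("payments", True),
    ("payments", True),
    ("payments", False),
    ("payments", True),
    ("auth", True),
    ("auth", True),
]

rates = dependency_error_rates(call_log)
print(rates["payments"])  # 25.0
print(rates["auth"])      # 0.0
```

The same grouping works for latency: store the call duration alongside the outcome and compute percentiles per dependency.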

Setting SLOs That Drive the Right Behavior

SLOs should be set based on customer needs, not engineering comfort. The right question is not 'what can we achieve?' but 'what level of reliability do our customers actually need?' For a real-time payments API, customers may need 99.99% reliability. For a batch analytics service, 99.5% might be perfectly acceptable.

SLOs also need to be practical to operate. An SLO so aggressive that it's breached every sprint creates a culture of constant firefighting. An SLO so lenient it's never at risk provides no pressure to improve. The right SLO is one that is occasionally at risk, creating healthy urgency around reliability investments without constant alarm.

Use error budgets, the complement of your SLO, as a management tool. A 99.9% availability SLO over a 30-day window leaves 43.2 minutes of error budget (0.1% of 43,200 minutes). When you've spent 80% of your budget, reliability work gets prioritized over feature work. When you have 100% of your budget remaining, you can take on more deployment risk. This turns reliability from a constant tension into a quantified, manageable resource.
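The budget arithmetic and the 80% rule can be sketched in a few lines. This assumes a time-based availability SLO measured in minutes of downtime; the function names and the sample downtime figures are illustrative, not part of any particular tool.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of downtime the SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_spent(downtime_minutes, slo_percent, window_days=30):
    """Fraction of the error budget already consumed (1.0 = fully spent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return downtime_minutes / budget


def reliability_first(downtime_minutes, slo_percent, threshold=0.8, window_days=30):
    """True once the spent fraction reaches the policy threshold."""
    return budget_spent(downtime_minutes, slo_percent, window_days) >= threshold


print(round(error_budget_minutes(99.9), 1))  # 43.2
print(reliability_first(36.0, 99.9))         # True  (about 83% spent)
print(reliability_first(10.0, 99.9))         # False (about 23% spent)
```

For a request-based SLO the same idea applies with failed requests in place of downtime minutes and total requests in place of window minutes.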

Applying SLAs to Third-Party Vendor Evaluation

Most vendors your product depends on publish an SLA. These SLAs matter for two reasons: they define the baseline reliability you can expect (and plan architecture around), and they create financial accountability through service credits when the vendor falls short.

PulsAPI tracks uptime and latency for 278+ cloud services, giving you objective data to compare against vendor SLA claims. If your payment processor claims 99.99% uptime but PulsAPI's 90-day data shows 99.91%, you have grounds for service credit claims and contract renegotiation. More importantly, you have data to make architecture decisions: a vendor delivering 99.91% against a 99.99% SLA claim needs a fallback or redundancy layer.
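The gap between 99.99% and 99.91% looks small as a percentage, so it can help to convert both figures into downtime. The sketch below uses the 90-day window and the two percentages from the example above; the helper name is made up, and real SLAs often measure over a calendar month with their own exclusions.

```python
def downtime_minutes(uptime_percent, window_days):
    """Minutes of downtime implied by an uptime percentage over the window."""
    return window_days * 24 * 60 * (100.0 - uptime_percent) / 100.0


claimed = downtime_minutes(99.99, 90)   # what the SLA allows
observed = downtime_minutes(99.91, 90)  # what was measured

print(round(claimed, 2))             # 12.96
print(round(observed, 2))            # 116.64
print(round(observed / claimed, 1))  # 9.0
```

In this example the vendor was down about nine times longer than its SLA allows, which is the kind of figure that supports a fallback or redundancy decision.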

When building your own SLA commitments to customers, account for your vendor SLA stack. If your service depends on AWS (99.99% SLA), Stripe (99.99% SLA), and Auth0 (99.99% SLA), your achievable uptime ceiling — assuming independent failures — is approximately 99.97%. You cannot reliably promise customers 99.99% if your dependencies can't collectively deliver it. PulsAPI's historical uptime data gives you the empirical foundation to set realistic, defensible SLAs.
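The 99.97% ceiling comes from multiplying the individual availabilities. A minimal sketch, assuming every dependency must be up for your service to work and that failures are independent (the same assumption as above):

```python
import math


def composite_availability(sla_percents):
    """Upper bound on availability for serial, independent dependencies."""
    return 100.0 * math.prod(p / 100.0 for p in sla_percents)


# SLA figures taken from the example in the text.
stack = {"AWS": 99.99, "Stripe": 99.99, "Auth0": 99.99}
ceiling = composite_availability(stack.values())

print(f"{ceiling:.2f}%")  # 99.97%
```

Correlated failures and redundancy both change this number, so treat it as a planning estimate rather than a guarantee.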

About the Author

James Okafor, CTO
