SLA & SLO · March 22, 2026 · 6 min read · By James Okafor

Calculating Your Error Budget: A Step-by-Step Workbook

An error budget is the difference between 100% uptime and your SLO target — and it's the key to balancing feature velocity with reliability. This workbook provides formulas, real examples, and a decision framework.

Spending Your Budget

An error budget is meant to be spent on releases, risk-taking, and innovation. If you consistently have budget left over, you might be moving too slowly. If you burn through it, you need to prioritize reliability.

The error budget is the mathematical link between your SLO and your engineering velocity. It quantifies how much 'unreliability' your customers will accept before your SLO is breached, and turns that quantity into a management tool that governs the tradeoff between shipping new features and maintaining existing stability.

This workbook walks through the formulas, shows three real examples at different SLO levels, and provides a decision framework for what to do when your budget is at risk.

Step 1: Calculate Your Error Budget

The formula is simple: Error Budget = 1 - SLO target. For a 99.9% SLO, your error budget is 0.1% — meaning 0.1% of all interactions can fail without breaching the SLO. Convert this to time: over a 30-day period (43,200 minutes), a 99.9% SLO allows 43.2 minutes of downtime. Over a year, that's 8.76 hours.

Calculate error budgets for each time window you track. Most teams use 30-day rolling windows for operational decisions and 90-day windows for SLA reporting. A 99.95% SLO gives you 21.6 minutes per month; 99.99% gives you just 4.3 minutes. The tighter the SLO, the less room for planned risk-taking.
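The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names are made up for this example, and it treats the budget as minutes of full downtime.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Fraction of requests (or time) allowed to fail: 1 - SLO target."""
    return 1.0 - slo_percent / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: float) -> float:
    """Convert the budget fraction into minutes of full downtime for a window."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo_percent) * window_minutes


# Print the budget for each SLO level and tracking window mentioned above.
for slo in (99.9, 99.95, 99.99):
    for days in (30, 90):
        minutes = allowed_downtime_minutes(slo, days)
        print(f"SLO {slo}% over {days} days: {minutes:.1f} minutes")
```

Running the same function with a 365-day window and dividing by 60 gives the yearly figure in hours.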

For third-party dependencies, your available error budget is the residual after accounting for the vendor's SLA. If your own SLO is 99.9% and Stripe's SLA is also 99.9%, Stripe alone could consume your entire budget in a single month of exactly-SLA performance. This is why tracking vendor uptime with PulsAPI and using that data to set realistic internal SLOs matters: your error budget must account for the reliability of every dependency, not just your own code.
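One rough way to estimate that residual is to subtract each vendor's allowed failure from your own budget. The sketch below rests on two simplifying assumptions that are not from the text above: every request covered by your SLO passes through each listed vendor, and vendor failures never overlap (the worst case, where failures simply add up). The function name is hypothetical.

```python
def residual_budget_fraction(slo_percent: float, vendor_sla_percents: list[float]) -> float:
    """Budget left for your own failures after each vendor spends its full SLA allowance.

    Worst-case assumption: vendor failures do not overlap, so they add up.
    A negative result means the vendors alone can breach your SLO.
    """
    own_budget = 1.0 - slo_percent / 100.0
    vendor_allowance = sum(1.0 - sla / 100.0 for sla in vendor_sla_percents)
    return own_budget - vendor_allowance


# A 99.9% SLO with a single 99.9% vendor leaves essentially nothing for your own code.
print(f"{residual_budget_fraction(99.9, [99.9]):.4%}")
# With a 99.95% vendor, about half of the budget remains.
print(f"{residual_budget_fraction(99.9, [99.95]):.4%}")
```

In practice vendors rarely run at exactly their SLA, so measured uptime gives a more realistic input than the contractual number.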

Step 2: Track Consumption Against Your Budget

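Error budget consumption is calculated from your SLI data. If your request success rate SLI shows 99.95% over the past 30 days against a 99.9% SLO, the observed failure rate is 0.05%, so you have consumed 0.05% / 0.1% = 50% of your monthly error budget. Half your budget is gone; half remains. (A success rate of 99.85% would mean a 0.15% failure rate, which is 150% of the budget and a breached SLO.)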
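As a sketch, consumption can be computed directly from request counts: divide the observed failure rate by the budget fraction. The function name and the request counts below are illustrative assumptions.

```python
def budget_consumed(total_requests: int, failed_requests: int, slo_percent: float) -> float:
    """Share of the error budget used so far (1.0 means fully spent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    budget_fraction = 1.0 - slo_percent / 100.0
    if budget_fraction <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    failure_rate = failed_requests / total_requests
    return failure_rate / budget_fraction


# Hypothetical month: 1,000,000 requests, 500 of them failed, 99.9% SLO.
consumed = budget_consumed(1_000_000, 500, 99.9)
print(f"consumed {consumed:.0%}, remaining {1 - consumed:.0%}")
```

A result above 1.0 means the failure rate has exceeded the budget and the SLO is breached for the window.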

Build a burn rate metric: how fast are you consuming your error budget relative to the period? A burn rate of 1 means you're consuming budget at exactly the rate that will exhaust it at period end. A burn rate of 2 means you'll exhaust it halfway through the period, a critical signal. Google's SRE workbook recommends paging at burn rates above 14.4 measured over 1 hour (2% of a 30-day budget consumed in that hour) and above 6 measured over 6 hours (5% consumed).
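A minimal sketch of the burn rate arithmetic, assuming a 30-day window and an error rate measured over some recent lookback period. The function names are hypothetical; they only illustrate the ratios involved.

```python
def burn_rate(observed_error_rate: float, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    budget_fraction = 1.0 - slo_percent / 100.0
    return observed_error_rate / budget_fraction


def hours_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Hours until a full budget is gone if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def budget_spent_fraction(rate: float, lookback_hours: float, window_days: float = 30) -> float:
    """Share of the window's budget consumed during the lookback at this burn rate."""
    return rate * lookback_hours / (window_days * 24)


# Hypothetical reading: 0.2% of requests failing against a 99.9% SLO.
rate = burn_rate(0.002, 99.9)
print(f"burn rate {rate:.1f}, budget gone in {hours_to_exhaustion(rate) / 24:.0f} days")
```

The same `budget_spent_fraction` helper can be used to translate any alert threshold into the share of budget it represents over its lookback window.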

Include third-party outage time in your error budget consumption tracking. If Stripe had a 90-minute partial outage that affected your checkout success rate, that incident consumed error budget — even though the root cause was external. This creates honest accounting: your SLO represents the reliability your customers experience, regardless of whether failures originate in your code or your dependencies.
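One way to keep this accounting honest is a simple incident ledger that records the source of each failure but charges all of them to the same budget. The sketch below assumes roughly uniform traffic, so a partial outage costs its duration multiplied by the fraction of requests that failed. The `Incident` type, the 20% impact figure, and the internal incident are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    source: str              # e.g. "internal" or a vendor name
    duration_minutes: float
    failed_fraction: float   # share of requests that failed during the incident


def consumed_minutes(incidents: list[Incident]) -> float:
    """Total budget spent, in equivalent minutes of full downtime."""
    return sum(i.duration_minutes * i.failed_fraction for i in incidents)


def consumed_by_source(incidents: list[Incident]) -> dict[str, float]:
    """Same total, broken down by where the failure originated."""
    totals: dict[str, float] = {}
    for i in incidents:
        totals[i.source] = totals.get(i.source, 0.0) + i.duration_minutes * i.failed_fraction
    return totals


incidents = [
    Incident("stripe", 90, 0.20),    # 90-minute partial outage, assumed 20% of checkouts failing
    Incident("internal", 10, 1.00),  # hypothetical 10-minute full outage from a bad deploy
]
print(f"total: {consumed_minutes(incidents):.1f} equivalent minutes")
print(consumed_by_source(incidents))
```

The per-source breakdown is useful for vendor reviews, while the total is what gets compared against the budget.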

Step 3: The Budget Decision Framework

Error budget status should drive engineering decisions on a rolling basis. Define three states and their operational policies: Budget Healthy (>50% remaining) — ship features normally, accept reasonable deployment risk, run experiments; Budget At Risk (20-50% remaining) — increase deployment caution, require staging validation for all changes, defer non-critical releases; Budget Exhausted (<20% remaining) — freeze feature releases, focus engineering effort on reliability work until budget recovers.
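The three states can be encoded as a small lookup so that dashboards and deploy tooling agree on the current policy. This is a sketch using the thresholds above; the enum and function names are assumptions, and the boundary values (exactly 50% or 20% remaining) are treated as At Risk.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "ship features normally, accept reasonable deployment risk, run experiments"
    AT_RISK = "increase deployment caution, require staging validation, defer non-critical releases"
    EXHAUSTED = "freeze feature releases, focus on reliability work until budget recovers"


def classify_budget(remaining_fraction: float) -> BudgetState:
    """Map the remaining share of the error budget (0.0 to 1.0) to a policy state."""
    if remaining_fraction > 0.5:
        return BudgetState.HEALTHY
    if remaining_fraction >= 0.2:
        return BudgetState.AT_RISK
    return BudgetState.EXHAUSTED


state = classify_budget(0.35)
print(f"{state.name}: {state.value}")
```

A negative remaining fraction (budget overspent) also falls into the Exhausted state.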

Enforcing these policies requires organizational buy-in. The error budget framework works best when product managers understand that a depleted budget is not an engineering problem — it's a product priority question. When the budget is exhausted, the decision to freeze features in favor of reliability is a business decision supported by data, not an engineering judgment call made in isolation.

Revisit your SLO targets annually. If 90% or more of your budget consistently goes unspent, your SLO may be too lenient: you are leaving reliability headroom on the table that could fund faster shipping. If your budget is consistently exhausted, your SLO may be set higher than your current infrastructure can reliably support. The right SLO is one that creates meaningful tension, not one that's either always comfortable or always in crisis.

About the Author

James Okafor, CTO
