When Third-Party Downtime Eats Your Error Budget: An SRE Guide
A 2-hour vendor outage can consume your entire monthly error budget without a single line of your code failing. Here's how SREs should account for, attribute, and respond to externally caused budget burns.
The Third-Party Error Budget Problem
Error budgets are designed to be a management tool: when you have budget, ship features and take risk; when budget is depleted, focus on reliability. This model breaks down in a specific and frustrating way when a vendor outage consumes your budget. A 90-minute Stripe partial outage can burn 50% of a 99.9% monthly error budget. Under a strict error budget policy, this would trigger a feature freeze — but the root cause was Stripe, not your code. Freezing feature development in response to a vendor outage you couldn't prevent feels both unfair and operationally counterproductive.
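As a back-of-the-envelope check on that claim, here is the arithmetic, assuming a 30-day month and, purely for illustration, that the partial outage fails roughly a quarter of requests while it lasts:

```python
# Rough error budget burn from a partial vendor outage.
# Assumptions (illustrative only): 30-day month, 99.9% availability SLO,
# and a 90-minute vendor incident failing ~25% of requests while it lasts.

MONTH_MINUTES = 30 * 24 * 60                  # 43,200 minutes in a 30-day month
SLO = 0.999                                   # 99.9% availability target
budget_minutes = MONTH_MINUTES * (1 - SLO)    # ~43.2 "fully down" minutes

outage_minutes = 90
failed_fraction = 0.25                        # assumed impact of the partial outage
burned_minutes = outage_minutes * failed_fraction   # 22.5 downtime-equivalent minutes

print(f"Monthly budget: {budget_minutes:.1f} min")
print(f"Burned by the incident: {burned_minutes:.1f} min "
      f"({burned_minutes / budget_minutes:.0%} of budget)")
# -> roughly half the monthly budget, without a single internal failure
```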
Most SRE teams that have operated error budgets for more than a year have encountered this tension. The standard guidance — all budget consumption is equal regardless of cause — creates perverse incentives: teams are penalized for depending on third-party services, and the error budget's signal value (your reliability needs improvement) is diluted when consumption comes from external causes beyond your control.
There is no universally agreed-upon resolution to this problem. Different organizations have landed on different approaches, each with legitimate tradeoffs. The key is making an explicit choice about how your team handles externally caused budget consumption — rather than leaving it ambiguous, which creates recurring frustration and inconsistent enforcement.
Three Approaches to External Budget Attribution
Approach 1: Full attribution regardless of cause. Your SLO is a promise to customers about their experience — if they experienced an outage, your budget burns, regardless of whether the cause was your code or your vendor. This approach preserves the integrity of the error budget as a customer experience signal and creates strong incentives to build resilience (circuit breakers, fallbacks, vendor redundancy) rather than relying on vendor reliability. The tradeoff: error budgets become a noisy signal for internal reliability improvement when a significant portion of consumption is externally caused.
Approach 2: Split attribution with separate budgets. Track two error budgets: an internal budget (consumption from your own incidents) and an external budget (consumption from third-party outages). SLO compliance is measured against the combined budget for customer-facing purposes, but engineering policy decisions (feature freezes, reliability sprints) are driven by internal budget consumption only. This approach preserves accurate customer-experience accounting without penalizing teams for vendor outages they can't prevent. The tradeoff: operational complexity, since you need reliable third-party attribution data to correctly classify each budget consumption event. A minimal classification sketch follows after the three approaches.
Approach 3: Error budget holidays for documented vendor outages. When a vendor outage is independently verified (through PulsAPI monitoring data and official vendor incident records), the team can petition for a budget holiday: a period during which consumption is not counted against the internal error budget. This requires a formal process and approval mechanism but gives teams relief from clearly external causes. Google's SRE Book discusses a related idea, noting that external dependencies may warrant separate treatment. The tradeoff: process overhead, and the potential for gaming if the approval mechanism isn't disciplined.
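To make Approach 2 concrete, here is a minimal sketch of split-budget accounting. The event shape and field names are assumptions for illustration; the only real requirement is that each consumption event carries a trusted internal-versus-external classification:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical shape of a budget consumption event; field names are assumptions,
# not the schema of any particular budget-tracking tool.
@dataclass
class BurnEvent:
    start: datetime
    end: datetime
    burned_minutes: float
    vendor_incident: bool   # set by whatever attribution process you trust

def split_budgets(events: list[BurnEvent]) -> dict[str, float]:
    """Approach 2: one customer-facing total, one internal figure for policy decisions."""
    total = sum(e.burned_minutes for e in events)
    internal = sum(e.burned_minutes for e in events if not e.vendor_incident)
    return {"total": total, "internal": internal, "external": total - internal}
```

A feature-freeze decision then looks only at the internal figure, while customer-facing SLO reporting still uses the total.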
Attribution Infrastructure: Knowing What Caused What
Any approach beyond full attribution requires reliable attribution data — a record of when your error budget consumption occurred and what caused it. Without this, split budgets and holiday petitions are based on memory and estimation rather than data. Building attribution infrastructure before you need it is significantly easier than reconstructing it after a contentious budget review.
The attribution data you need: a timestamped record of every period of elevated error rate or degraded availability, and a corresponding record of third-party service status during those periods. PulsAPI provides the second piece: a searchable incident history for every monitored service with precise timestamps. Your own APM or uptime monitoring provides the first. Cross-referencing these two datasets during your monthly error budget review tells you, with high confidence, which consumption events had a known third-party root cause.
Automate the cross-reference where possible. If your error budget tracking system has an API, you can write a simple script that queries your incident data for each budget consumption event and checks PulsAPI's incident history for the corresponding vendor at the corresponding time. This turns a 2-hour manual audit into a 5-minute automated report, and removes the subjectivity that makes attribution debates contentious.
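Here is a sketch of that cross-reference, assuming a hypothetical PulsAPI REST endpoint and a budget tracker that can export consumption events as dictionaries with start and end timestamps. The base URL, path, query parameters, and response fields below are illustrative assumptions, not documented API:

```python
import os
from datetime import datetime

import requests

# Hypothetical endpoint and field names for illustration only; check the real
# PulsAPI documentation and your budget tracker's export format before use.
PULSAPI_BASE = "https://api.pulsapi.example/v1"
PULSAPI_TOKEN = os.environ.get("PULSAPI_TOKEN", "")

def vendor_incidents(service: str, start: datetime, end: datetime) -> list[dict]:
    """Fetch vendor incidents overlapping [start, end) from the assumed PulsAPI endpoint."""
    resp = requests.get(
        f"{PULSAPI_BASE}/incidents",
        params={"service": service, "from": start.isoformat(), "to": end.isoformat()},
        headers={"Authorization": f"Bearer {PULSAPI_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("incidents", [])

def attribute(burn_events: list[dict], vendors: list[str]) -> list[dict]:
    """Tag each budget consumption event with any vendor incidents that overlap it."""
    report = []
    for event in burn_events:  # each event: {"start": datetime, "end": datetime, ...}
        overlapping = [
            {"vendor": vendor, "incident": incident}
            for vendor in vendors
            for incident in vendor_incidents(vendor, event["start"], event["end"])
        ]
        report.append({**event, "vendor_incidents": overlapping})
    return report
```

Events that come back with an empty vendor_incidents list are the ones that deserve internal scrutiny.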
Adjusting SLOs to Account for Vendor Reality
Regardless of which attribution approach you choose, externally caused budget consumption is a signal that your SLOs may not account for vendor reliability. If the budget behind your 99.9% SLO (roughly 43 minutes of downtime in a 30-day month) is routinely consumed by a vendor that itself carries a 99.9% uptime SLA, you have a mathematical problem: that vendor, performing exactly at their SLA, can consume your entire budget with a single SLA-compliant incident.
The architectural solution is to reduce your dependency on that vendor's uptime for your SLO achievement. Circuit breakers and fallbacks mean that a vendor's partial outage doesn't necessarily cause an SLO breach: your application degrades gracefully rather than failing completely. The goal is that a vendor degradation moves your checkout success rate from 100% to 98%, not from 100% to 0%. That 2% degradation against a 99% request success SLI has very different error budget implications than a complete checkout outage.
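To quantify the difference, here is the budget impact of the same 90-minute vendor incident under both failure modes, measured against that 99% request-success SLI and assuming uniform traffic:

```python
# Budget impact of one 90-minute vendor incident under two failure modes,
# against a request-based 99% checkout success SLI. Uniform traffic assumed.

MONTH_MINUTES = 30 * 24 * 60
SLI_TARGET = 0.99
budget = MONTH_MINUTES * (1 - SLI_TARGET)    # 432 failed-minute equivalents

incident_minutes = 90

graceful = incident_minutes * 0.02   # fallback holds success at 98%: 1.8 equivalents
hard_down = incident_minutes * 1.00  # checkout fully down: 90 equivalents

print(f"Graceful degradation: {graceful / budget:.1%} of monthly budget")   # ~0.4%
print(f"Complete outage:      {hard_down / budget:.1%} of monthly budget")  # ~20.8%
```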
Use PulsAPI's 90-day vendor uptime data to set realistic SLOs that account for the vendor reliability ceiling of your stack. If your two most critical vendors have historically delivered 99.92% and 99.95% uptime, your achievable SLO is bounded by the combined reliability of those dependencies — not by the uptime of your own infrastructure. Setting a 99.99% SLO against a vendor stack that historically delivers 99.87% combined reliability creates guaranteed SLO breaches regardless of your own engineering quality. Honest SLO setting, grounded in vendor historical data, is the foundation of a functional error budget program.
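For reference, the 99.87% ceiling above comes from treating the two vendors as serial dependencies with independent failures, which is an optimistic assumption since vendor incidents are often correlated:

```python
# Serial-dependency reliability ceiling: if a request needs every vendor to
# succeed and failures are independent, combined availability is the product
# of the individual availabilities. Treat this as an optimistic upper bound.

vendor_uptimes = [0.9992, 0.9995]   # 90-day historical figures for your critical vendors

combined = 1.0
for uptime in vendor_uptimes:
    combined *= uptime

print(f"Combined vendor ceiling: {combined:.4%}")   # ~99.87%
# A 99.99% SLO set above this ceiling guarantees breaches regardless of how
# reliable your own infrastructure is.
```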
About the Author
Sofia is a senior infrastructure engineer at PulsAPI who specialises in on-call tooling and incident response automation. She has worked in SRE roles at cloud-native companies for over eight years.
Start monitoring your stack
Aggregate real-time operational data from every service your stack depends on into a single dashboard. Free for up to 10 services.