DevOps · March 21, 2026 · 8 min read

How to Build an Incident Response Runbook for Third-Party Cloud Outages

Most incident runbooks only cover outages you cause. Here's a template for handling third-party vendor outages — from detection to customer communication to postmortem.

Sofia Andrade, Senior Infrastructure Engineer

Sofia is a senior infrastructure engineer at PulsAPI who specialises in on-call tooling and incident response automation. She has worked in SRE roles at cloud-native companies for over eight years.

Why Third-Party Runbooks Are Different

Standard incident runbooks assume you have control over the system that's failing. You can roll back a deployment, scale up infrastructure, or hotfix a bug. Third-party outages are fundamentally different: you have no control over resolution timeline, limited visibility into root cause, and must communicate to customers about a problem you didn't create and can't fix.

The psychological dynamic is also different. When your own code causes an incident, engineers feel urgency and ownership. When a vendor is down, there's often a temptation to wait and see — after all, there's nothing to fix on your end. That passive response typically prolongs customer impact unnecessarily, because the workarounds and communications that are possible don't happen fast enough.

A third-party outage runbook addresses both the process and the psychology. It defines in advance what actions are possible and expected, even when you can't fix the underlying problem, so your team acts confidently rather than waiting for clarity that may not come for hours.

Phase 1: Detection and Attribution (Target: Under 5 Minutes)

The first phase is confirming that the incident is third-party and identifying which vendor is affected. With PulsAPI configured, this should be automatic: an alert fires with the affected service name, severity, and component. Without automated monitoring, this phase is manual and typically takes 15 to 30 minutes of investigation.

Runbook step 1: Acknowledge the PulsAPI alert (or manually check PulsAPI if you don't have alerts configured). Confirm the service status on PulsAPI's service page — look at the vendor status, crawler data, and community signal together. If all three show problems, you have high confidence in attribution.
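The three-signal confidence check in step 1 can be sketched as a small helper. The field names and the community-report threshold below are assumptions, not PulsAPI's actual API shape — adapt them to whatever your monitoring returns:

```python
# Sketch of step 1: combine the three attribution signals (vendor status
# page, crawler data, community reports) into one confidence level.
# Field names and the report threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ServiceStatus:
    vendor_status: str       # vendor's own status page, e.g. "operational"
    crawler_degraded: bool   # crawler detected elevated errors
    community_reports: int   # community outage reports, last 15 minutes

def attribution_confidence(status: ServiceStatus) -> str:
    """High confidence only when all three signals agree on a problem."""
    signals = sum([
        status.vendor_status != "operational",
        status.crawler_degraded,
        status.community_reports >= 10,  # threshold is an assumption
    ])
    return {3: "high", 2: "medium"}.get(signals, "low")
```

Requiring all three signals before declaring "high" confidence protects against a single noisy source, such as a stale vendor status page.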

Runbook step 2: Determine the blast radius. Which of your product features depend on this service? Create a quick dependency map in your runbook for each critical vendor so this question takes seconds, not minutes. For example: 'Stripe outage affects checkout and subscription renewal flows. Users cannot complete purchases.'
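A per-vendor dependency map like the one step 2 describes can live directly in the runbook as data. The vendor names, features, and impact strings below are illustrative:

```python
# Hypothetical dependency map kept in the runbook so blast-radius lookup
# takes seconds, not minutes. Entries are illustrative examples.
DEPENDENCY_MAP = {
    "stripe": {
        "features": ["checkout", "subscription renewal"],
        "user_impact": "Users cannot complete purchases.",
    },
    "sendgrid": {
        "features": ["transactional email", "password reset"],
        "user_impact": "Emails are delayed; password resets may not arrive.",
    },
}

def blast_radius(vendor: str) -> str:
    """Return the one-line impact summary for step 2 of the runbook."""
    entry = DEPENDENCY_MAP.get(vendor.lower())
    if entry is None:
        return f"No dependency entry for {vendor} -- investigate manually."
    features = " and ".join(entry["features"])
    return f"{vendor.title()} outage affects {features}. {entry['user_impact']}"
```

The explicit fallback for unmapped vendors matters: an outage in a vendor you forgot to map is itself a runbook gap worth recording in the postmortem.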

Phase 2: Triage and Workaround (Target: Under 15 Minutes)

Once the outage is confirmed and attributed, the question is: what can we do about it? For many third-party outages, the answer is a combination of graceful degradation and proactive communication.

Runbook step 3: Activate graceful degradation if applicable. For payment processor outages, this might mean disabling checkout while displaying a helpful message. For email provider outages, it means queuing outbound emails locally. For authentication provider outages, it means keeping existing sessions alive while blocking new logins gracefully. Document these specific degradation modes in your runbook for each critical vendor.
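The degradation modes in step 3 can be pre-wired as a lookup from vendor to feature flag plus user-facing banner. This is a minimal sketch; the flag names, banner copy, and the `set_flag` callback are assumptions you would replace with your real feature-flag service:

```python
# Sketch: map each critical vendor to its pre-agreed degradation mode.
# Flag names and banner text are illustrative assumptions.
DEGRADATION_MODES = {
    "stripe": {
        "flag": "checkout_disabled",
        "banner": "Payments are temporarily unavailable. Please try again soon.",
    },
    "sendgrid": {
        "flag": "queue_outbound_email",
        "banner": None,  # silent degradation: emails queue locally
    },
    "auth0": {
        "flag": "block_new_logins",
        "banner": "Sign-in is temporarily unavailable; existing sessions are unaffected.",
    },
}

def activate_degradation(vendor: str, set_flag):
    """Enable the vendor's degradation flag; return the banner to display, if any."""
    mode = DEGRADATION_MODES[vendor]
    set_flag(mode["flag"], True)
    return mode["banner"]
```

Keeping the modes as data rather than ad-hoc code means the on-call engineer activates a decision that was made calmly in advance, instead of improvising under pressure.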

Runbook step 4: Draft the customer-facing status update. Don't wait until resolution to communicate — proactive transparency during an outage is one of the strongest signals of operational maturity to customers. A template: 'We're aware of an issue affecting [feature] due to an outage with our infrastructure provider. We're monitoring the situation and will update when resolved. Track updates at [your status page].'
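Because step 4 happens under time pressure, the template is worth codifying so the on-call engineer only fills in two blanks. A minimal sketch using the article's own wording:

```python
def draft_status_update(feature: str, status_page_url: str) -> str:
    """Fill the runbook's communication template from step 4."""
    return (
        f"We're aware of an issue affecting {feature} due to an outage with "
        f"our infrastructure provider. We're monitoring the situation and "
        f"will update when resolved. Track updates at {status_page_url}."
    )
```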

Phase 3: Resolution and Postmortem

Resolution for third-party outages is waiting — but waiting actively. Set a status page update cadence (every 30 minutes during the outage). Assign someone to monitor the vendor's status page and PulsAPI for the recovery signal. Define the recovery confirmation criteria: don't close the incident until PulsAPI shows the vendor as Operational and your own error rates have returned to baseline.
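The recovery confirmation criteria above can be automated as a simple polling loop that requires both signals — vendor operational and error rates at baseline — to hold for several consecutive checks before the incident may be closed. The probe callables and thresholds below are assumptions:

```python
import time

def confirm_recovery(vendor_operational, error_rate, baseline=0.01,
                     checks_required=3, interval_s=60, sleep=time.sleep):
    """Block until both recovery signals hold for `checks_required`
    consecutive checks. `vendor_operational` and `error_rate` are
    callables you supply (e.g. wrapping your monitoring APIs)."""
    consecutive = 0
    while consecutive < checks_required:
        if vendor_operational() and error_rate() <= baseline:
            consecutive += 1
        else:
            consecutive = 0  # any regression resets the streak
        if consecutive < checks_required:
            sleep(interval_s)
```

Requiring a streak of clean checks guards against the common pattern where a vendor flaps between degraded and operational during recovery.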

Runbook step 5: Post the resolution update. Acknowledge what happened, what the impact was for your users, and — if applicable — what changes you're making to reduce future impact. If the vendor's SLA breach was significant, note that you're filing for service credits.

Third-party postmortems differ from internal ones in a key way: your action items are about reducing your dependency or improving your resilience, not about fixing the root cause. Common outputs: implement fallback for critical vendor X, add circuit breakers to Y integration, evaluate alternative providers for Z, add vendor X to PulsAPI monitoring with PagerDuty escalation.
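Of the action items above, the circuit breaker is the most mechanical to implement. A minimal in-process sketch (not a production library; thresholds and the half-open behaviour are simplified assumptions):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    then allow a single trial call after a cooldown (half-open)."""

    def __init__(self, max_failures=5, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: skipping vendor call")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Wrapping each vendor integration in a breaker like this means the next outage fails fast instead of tying up threads on timeouts — exactly the kind of resilience improvement a third-party postmortem should produce.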
