Incidents · April 18, 2026 · 8 min read · By Sofia Andrade

Postmortem Template for Third-Party Outages: When It Wasn't Your Fault

Most postmortem templates assume you caused the outage. Here's a template built specifically for third-party incidents — covering attribution, blast radius, resilience gaps, and vendor accountability.

Why Third-Party Postmortems Are Different

Standard postmortem templates — timeline, root cause, contributing factors, action items — are designed for incidents you caused. The root cause analysis focuses on your code, your configuration, your deployment. Action items target your processes and your infrastructure. When a third-party vendor causes your incident, this template produces a frustrating and misleading output: the 'root cause' is 'Stripe had a partial outage,' the contributing factors are 'we depend on Stripe,' and the action item is 'Stripe shouldn't have outages.' None of this is useful.

Third-party postmortems require a different analytical frame. You didn't cause the outage, but you did control how quickly you detected it, how effectively you degraded, how promptly you communicated to customers, and what your architecture does to limit blast radius. These are the levers you control — and they're what your postmortem action items should target. A well-run third-party postmortem shifts the frame from 'what went wrong?' to 'how did our system respond, and how could it respond better?'

This distinction is also important for team morale and organizational learning. Blaming your engineering team for a Stripe outage is demoralizing and inaccurate. But treating the incident as purely external — 'nothing to learn here, it was their fault' — misses the genuine improvements available in your detection speed, degradation quality, and communication response. The third-party postmortem template below is designed to surface those improvements without misattributing responsibility.

The Third-Party Postmortem Template

Section 1: Incident Summary. Vendor affected, incident start and end time, incident type (partial outage / major outage / degraded performance), official vendor incident link, your first detection time, your first customer-facing communication time, total customer impact duration. This section is factual and draws primarily from PulsAPI's incident timeline and your own incident channel history.
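
If you keep postmortems in a repository, the summary fields above map naturally onto a small structured record. A minimal sketch in Python; the field names are our own illustration, not a PulsAPI schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IncidentSummary:
    """Section 1 of a third-party outage postmortem."""
    vendor: str                        # e.g. "Stripe"
    incident_type: str                 # "partial outage" | "major outage" | "degraded performance"
    vendor_incident_url: str           # link to the vendor's official incident page
    vendor_start: datetime             # incident start, from the vendor / PulsAPI timeline
    vendor_end: datetime               # incident end, from the vendor / PulsAPI timeline
    first_detection: datetime          # when your team first detected the incident
    first_customer_comms: Optional[datetime]  # first customer-facing update, if one was sent
    customer_impact_minutes: float     # total customer-facing impact duration
```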

Section 2: Timeline. A chronological sequence covering: T+0 (vendor incident begins — from PulsAPI data), T+X (first internal detection — from your alert or PulsAPI notification), T+X (incident declared, war room opened), T+X (graceful degradation activated, if applicable), T+X (first customer communication posted), T+X (vendor resolves incident — from PulsAPI), T+X (your degradation mode deactivated and full service restored), T+X (resolution notice posted to customers). Include the specific people involved at each step — not for blame assignment, but so the postmortem captures institutional knowledge about who did what.
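
One way to keep the T+X offsets honest is to record absolute timestamps and derive the offsets rather than hand-editing them. A minimal sketch with made-up entries:

```python
from datetime import datetime

# Illustrative entries only; replace with the real sequence from your incident channel.
vendor_start = datetime(2026, 4, 18, 14, 2)

timeline = [
    (datetime(2026, 4, 18, 14, 2), "Vendor incident begins (PulsAPI data)", "n/a"),
    (datetime(2026, 4, 18, 14, 6), "First internal alert fires", "on-call engineer"),
    (datetime(2026, 4, 18, 14, 9), "Incident declared, war room opened", "incident commander"),
    (datetime(2026, 4, 18, 14, 15), "Graceful degradation activated", "payments on-call"),
    (datetime(2026, 4, 18, 14, 21), "First customer communication posted", "support lead"),
]

for ts, event, person in timeline:
    offset_min = (ts - vendor_start).total_seconds() / 60
    print(f"T+{offset_min:>3.0f} min  {event}  ({person})")
```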

Section 3: Our Response Quality Assessment. Rate your team's response on four dimensions: Detection Speed (how long between vendor incident start and your internal detection?), Communication Speed (how long between detection and first customer-facing update?), Degradation Effectiveness (did your graceful degradation work as designed, or did users experience hard failures?), Resolution Confidence (did you verify restoration through your own synthetic checks before removing degradation mode?). Each dimension gets a 1-5 rating with supporting evidence. This section is the heart of the third-party postmortem — it shows where your response was strong and where it needs improvement.
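
Two of the four dimensions fall straight out of the timeline arithmetic; the other two require judgment and supporting evidence. A minimal sketch of the computable pair, using illustrative timestamps:

```python
from datetime import datetime

# Illustrative timestamps; substitute the real values from Section 2.
vendor_start = datetime(2026, 4, 18, 14, 2)
first_detection = datetime(2026, 4, 18, 14, 6)
first_customer_comms = datetime(2026, 4, 18, 14, 21)

detection_lag_min = (first_detection - vendor_start).total_seconds() / 60
communication_lag_min = (first_customer_comms - first_detection).total_seconds() / 60

print(f"Detection speed:     {detection_lag_min:.0f} min from vendor incident start to internal detection")
print(f"Communication speed: {communication_lag_min:.0f} min from detection to first customer update")

# Degradation Effectiveness and Resolution Confidence are rated 1-5 in the meeting,
# backed by evidence: error rates while degraded, synthetic check results at restoration.
```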

Action Items That Actually Improve Resilience

Section 4: Resilience Gap Analysis. Ask three questions for each critical dependency involved in this incident. First: did we have monitoring and alerting configured before this incident? If not, action item: set up PulsAPI monitoring with appropriate alert routing for this vendor. Second: did we have graceful degradation implemented for this dependency? If not, action item: design and implement a degradation mode for this specific vendor outage scenario. Third: is there a viable alternative vendor or redundancy architecture for this dependency? If not, action item: evaluate alternatives and assess the cost/complexity of a fallback implementation.
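
The three questions can be run as a per-dependency checklist, where every 'no' becomes an action item. A minimal sketch; the dependency names and answers are placeholders:

```python
# One row per critical dependency involved in the incident:
# (dependency, has_monitoring, has_degradation, has_fallback)
gap_matrix = [
    ("stripe-payments", True, True, False),
    ("sendgrid-email", True, False, False),
]

ACTIONS = {
    "has_monitoring": "Set up PulsAPI monitoring and alert routing for {dep}",
    "has_degradation": "Design and implement a degradation mode for {dep} outage scenarios",
    "has_fallback": "Evaluate alternative vendors / redundancy architecture for {dep}",
}

for dep, monitoring, degradation, fallback in gap_matrix:
    answers = {"has_monitoring": monitoring, "has_degradation": degradation, "has_fallback": fallback}
    for question, ok in answers.items():
        if not ok:
            print(f"ACTION ITEM: {ACTIONS[question].format(dep=dep)}")
```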

Section 5: Vendor Accountability. Document the vendor's SLA commitment, the actual uptime delivered during this incident, and whether the incident constitutes an SLA breach. If it does, note the service credit claim process and assign an owner to file the claim. For repeated incidents with the same vendor, this section should include a trend analysis: how many SLA breaches in the past 90 days, what is the pattern, and at what point does the breach history justify vendor evaluation or contract renegotiation?
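
Whether the incident constitutes a breach is arithmetic against the contract: compare delivered uptime over the SLA measurement window with the committed percentage. A minimal sketch, assuming a 30-day window and a 99.9% commitment; your vendor's actual terms (window, exclusions, credit tiers) may differ:

```python
# Assumed contract terms; substitute your vendor's actual SLA.
sla_committed_pct = 99.9
window_minutes = 30 * 24 * 60             # 30-day measurement window = 43,200 minutes

downtime_minutes = 95                      # vendor downtime affecting you in this window
delivered_pct = 100 * (window_minutes - downtime_minutes) / window_minutes
allowed_downtime = window_minutes * (1 - sla_committed_pct / 100)

print(f"Delivered uptime: {delivered_pct:.3f}% (SLA allows {allowed_downtime:.1f} min of downtime)")
if delivered_pct < sla_committed_pct:
    print("SLA breach: file the service credit claim and assign an owner.")
else:
    print("Within SLA: still record the incident for the 90-day trend analysis.")
```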

Section 6: Action Items with Owners and Deadlines. Every action item from sections 4 and 5 needs a single owner, a concrete deliverable (not 'investigate circuit breakers' but 'implement circuit breaker for Stripe API calls with 50% failure threshold and checkout degradation fallback'), and a deadline within 30 days. Track these action items in your engineering project management system alongside feature work — the operational value of a postmortem is entirely in action item completion, not in the document itself.
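
To make the 'concrete deliverable' wording tangible, here is roughly what the Stripe circuit-breaker example could look like. A minimal sketch, not production code; the 50% threshold, window size, cooldown, and the helpers (`stripe_charge`, `queue_for_degraded_checkout`, `VendorError`) are all illustrative assumptions the real action item would pin down:

```python
import time


class CircuitBreaker:
    """Opens once the failure rate over a sliding window of calls crosses a threshold."""

    def __init__(self, failure_threshold=0.5, window_size=20, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.window_size = window_size
        self.cooldown_seconds = cooldown_seconds
        self.results = []        # outcomes of the last N calls; True means success
        self.opened_at = None    # monotonic timestamp when the breaker opened

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let requests through again to probe whether the vendor recovered.
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record(self, success: bool) -> None:
        self.results = (self.results + [success])[-self.window_size:]
        failure_rate = self.results.count(False) / len(self.results)
        if len(self.results) >= self.window_size and failure_rate >= self.failure_threshold:
            self.opened_at = time.monotonic()
        elif success:
            self.opened_at = None


stripe_breaker = CircuitBreaker(failure_threshold=0.5)


def charge_customer(order):
    """Charge via Stripe, falling back to the degraded checkout path when the breaker is open."""
    if not stripe_breaker.allow_request():
        return queue_for_degraded_checkout(order)   # hypothetical degraded checkout fallback
    try:
        result = stripe_charge(order)               # hypothetical wrapper around the Stripe call
        stripe_breaker.record(True)
        return result
    except VendorError:                             # hypothetical vendor error type
        stripe_breaker.record(False)
        return queue_for_degraded_checkout(order)
```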

Running the Postmortem Meeting

Schedule the postmortem within 24 hours of incident resolution while memory is fresh. Circulate the template pre-filled with factual data (timeline, impact metrics) 2 hours before the meeting so participants arrive informed rather than spending the first 20 minutes reconstructing the timeline. The meeting itself should focus on the Response Quality Assessment and Resilience Gap Analysis — the sections that require discussion and judgment.

Use a blameless facilitator, ideally someone who was not directly involved in the incident response. The facilitator's job is to keep the discussion focused on systems and processes rather than individual decisions, to push back on 'root cause = vendor' reasoning that short-circuits useful analysis, and to ensure every identified gap produces a concrete, assigned action item. A third-party postmortem where the meeting ends with 'not much we can do, it was their fault' has failed regardless of how accurate that attribution is.

Share postmortem summaries with stakeholders outside engineering. Your support team benefits from knowing how you detected the incident, what communication was sent, and what the degradation experience looked like from a user perspective. Product managers benefit from understanding the resilience gaps that are now on the engineering backlog. Customer success teams benefit from having documentation they can share with enterprise customers who raise the incident at their next QBR. A postmortem shared widely is a trust-building document; one that lives in an engineering-only wiki is a missed opportunity.

About the Author

Sofia Andrade, Senior Infrastructure Engineer

Sofia is a senior infrastructure engineer at PulsAPI who specialises in on-call tooling and incident response automation. She has worked in SRE roles at cloud-native companies for over eight years.
