DevOps · March 19, 2026 · 7 min read

On-Call Best Practices: Setting Up Third-Party Outage Alerts That Actually Work

Most on-call setups only alert on your own infrastructure. Here's how to extend your alerting to cover the third-party services your stack depends on — without drowning in noise.

Sofia Andrade, Senior Infrastructure Engineer

Sofia is a senior infrastructure engineer at PulsAPI who specialises in on-call tooling and incident response automation. She has worked in SRE roles at cloud-native companies for over eight years.

The Blind Spot in Most On-Call Setups

A well-configured on-call setup alerts on high error rates, latency spikes, and infrastructure failures. But the most common cause of production incidents at SaaS companies isn't internal — it's third-party services. AWS degradation, Stripe API issues, GitHub Actions failures, Datadog outages. These external dependencies cause real user impact, but they typically don't show up in your internal monitoring until the symptoms (elevated error rates, timeouts) reach threshold — often several minutes after the root cause began.

The result is an on-call engineer who gets paged for a symptom (high error rate in the checkout service) and spends 15 minutes investigating code and infrastructure before discovering that the root cause is a Stripe partial outage. That 15 minutes is wasted debugging time, compounded stress, and delayed customer communication.

Extending your on-call setup to include third-party monitoring closes this gap. When a vendor outage is the cause, your on-call engineer should know that before the symptom alert fires — not after.

Tier Your Vendors by On-Call Criticality

Not every vendor dependency warrants paging your on-call engineer at 3 AM. A logging service degradation is annoying; a payment processor outage is a P1. Before configuring alerts, create a tiered dependency map with three levels: Tier 1 (page immediately — these services being down causes direct user-facing failures), Tier 2 (notify the team but don't page — degradation increases error risk but doesn't immediately break user flows), and Tier 3 (monitor on the dashboard — these services are used but not on critical paths).

Typical Tier 1 vendors for a SaaS product: your primary payment processor (Stripe, Braintree), your main database-as-a-service (if using RDS or equivalent), your authentication provider (Auth0, Clerk), and any third-party service that is a hard dependency on your core user journey.

Tier 2 might include: your email delivery service (SendGrid, Postmark), your CDN (Cloudflare, Fastly), your observability tools (Datadog, Sentry), and your CI/CD platform (GitHub Actions, CircleCI). Tier 3 would cover marketing tools, analytics, and any service you use for non-critical features.
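The tiered map above can be sketched as a small data structure. This is a minimal illustration — the vendor names mirror the article's examples, and the split will differ for your own stack:

```python
from enum import IntEnum

class Tier(IntEnum):
    PAGE = 1    # down = direct user-facing failure: page on-call immediately
    NOTIFY = 2  # degradation raises error risk: team channel, no page
    WATCH = 3   # non-critical path: dashboard visibility only

# Illustrative vendor-to-tier assignments, following the article's examples.
VENDOR_TIERS = {
    "stripe": Tier.PAGE,
    "auth0": Tier.PAGE,
    "aws-rds": Tier.PAGE,
    "sendgrid": Tier.NOTIFY,
    "cloudflare": Tier.NOTIFY,
    "github-actions": Tier.NOTIFY,
    "google-analytics": Tier.WATCH,
}

def tier_for(vendor: str) -> Tier:
    """Unknown vendors default to dashboard-only until someone triages them."""
    return VENDOR_TIERS.get(vendor, Tier.WATCH)
```

Defaulting unknown vendors to Tier 3 is a deliberate choice: a new dependency should never be able to page someone before the team has classified it.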

Alert Routing: Signal Without Noise

In PulsAPI, configure alert rules to match your tier structure. Tier 1 vendors should route to PagerDuty with 'critical' severity — this triggers your on-call rotation immediately. Tier 2 vendors should route to your team's Slack #incidents channel with 'warning' severity — visible and actionable but not paging. Tier 3 vendors should be visible on your PulsAPI dashboard without any active notifications.

Add severity filters to your Tier 1 rules: only page on Partial Outage or Major Outage status, not on Degraded Performance. A momentary degradation of 3% of Stripe API requests doesn't warrant waking someone up; a full API outage does. Use PulsAPI's status severity levels to draw this line precisely.
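The severity filter amounts to a single predicate. A sketch, using status names that follow the article's levels (PulsAPI's actual identifiers may differ):

```python
# Only these statuses justify waking someone up, and only for Tier 1 vendors.
PAGE_STATUSES = {"partial_outage", "major_outage"}

def should_page(tier: int, status: str) -> bool:
    """Degraded performance never pages; outages page only at Tier 1."""
    return tier == 1 and status in PAGE_STATUSES
```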

Maintenance windows deserve special handling. Vendor-scheduled maintenance appears in PulsAPI as a Maintenance status, not an outage. Configure your alert rules to suppress maintenance window notifications during off-hours, or route them to a separate low-priority channel so they don't pollute your incident history with planned events.
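The off-hours suppression rule can be expressed as a small check. This sketch assumes "off-hours" means 20:00 to 08:00 local time — adjust the boundaries to your rotation's working hours:

```python
from datetime import time

def suppress_maintenance(status: str, now: time) -> bool:
    """True if this notification is vendor maintenance arriving off-hours.

    Suppressed notifications can be dropped or rerouted to a low-priority
    channel so planned events don't pollute the incident history.
    """
    if status != "maintenance":
        return False
    # Off-hours window is assumed, not prescribed: 20:00-08:00 local.
    return now >= time(20, 0) or now < time(8, 0)
```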

Testing and Refining Your Alert Configuration

After initial setup, run a weekly review of your alert configuration for the first month. Look at every PulsAPI notification your team received: Was it actionable? Did it page at the right severity level? Did you miss any incidents that showed up as symptoms in your internal monitoring first? The review takes 15 minutes, and tuning the configuration week over week will dramatically reduce noise and improve signal quality.
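The weekly review boils down to one metric: what fraction of the notifications the team received were actually actionable? A sketch, where each record is a hypothetical (vendor, was_actionable) pair pulled from your alert history:

```python
def actionable_rate(alerts: list[tuple[str, bool]]) -> float:
    """Fraction of received alerts that were actionable; 1.0 for a quiet week."""
    if not alerts:
        return 1.0
    return sum(1 for _, actionable in alerts if actionable) / len(alerts)

# Illustrative week of alert history, not real data.
week = [("stripe", True), ("sendgrid", False), ("cloudflare", False), ("auth0", True)]
```

A rate that stays well below roughly 0.8 is a sign that some vendors should be demoted a tier or have stricter severity filters applied.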

Use PulsAPI's test functionality to verify each integration channel is connected and delivering correctly. A broken Slack webhook that was never tested is indistinguishable from a working one until you're in the middle of an incident and realize you missed alerts for the past two weeks.
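If you want to exercise the Slack leg of the pipeline yourself, a manual check is a few lines with the standard library. The webhook URL is a placeholder; the point is that a broken webhook raises an error here instead of failing silently mid-incident:

```python
import json
import urllib.request

def build_test_payload(channel_name: str) -> bytes:
    """A clearly-labeled test message so nobody mistakes it for a real alert."""
    return json.dumps({
        "text": f"[TEST] Alert delivery check for {channel_name} - safe to ignore."
    }).encode()

def send_test(webhook_url: str, channel_name: str) -> None:
    req = urllib.request.Request(
        webhook_url,
        data=build_test_payload(channel_name),
        headers={"Content-Type": "application/json"},
    )
    # Raises on a dead or misconfigured webhook - which is exactly the signal
    # you want to see now rather than during an incident.
    urllib.request.urlopen(req)
```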

Build a runbook entry for each Tier 1 vendor. When an alert fires for Stripe, your on-call engineer shouldn't have to figure out what to do from scratch. The runbook should list: the blast radius, the immediate actions to take, the customer communication template, and the escalation path if the outage extends beyond a defined duration.
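Structuring the runbook entry as data keeps the four required fields from being skipped. A sketch with illustrative Stripe values — the actions and template here are examples, not operational guidance:

```python
from dataclasses import dataclass

@dataclass
class Runbook:
    vendor: str
    blast_radius: str               # what breaks for users when this vendor is down
    immediate_actions: list[str]    # first steps, in order
    comms_template: str             # customer-facing message to adapt
    escalate_after_minutes: int     # outage duration that triggers escalation

stripe_runbook = Runbook(
    vendor="stripe",
    blast_radius="checkout and subscription renewals fail",
    immediate_actions=[
        "confirm the vendor outage before debugging internal code",
        "enable the payment-retry queue",                # hypothetical mitigation
        "post an update in the internal incident channel",
    ],
    comms_template="We're aware of a payment provider issue affecting checkout.",
    escalate_after_minutes=30,
)
```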
