Heartbeat vs Health Check Endpoints: Designing Signals That Actually Mean Something
A heartbeat says 'I'm alive'; a health check says 'I'm ready to serve traffic.' Most teams conflate them and end up with green dashboards over broken services. Here's how to design each one correctly.
Two Different Questions, Two Different Endpoints
A heartbeat answers 'is this process still running?' A health check answers 'should this process be receiving traffic right now?' These are not the same question, and a single endpoint that tries to answer both will answer neither well.
The conflation is so common that almost every framework ships a single /health endpoint by default and calls it done. That endpoint typically returns 200 OK as long as the HTTP server is up — which can be true moments before an OOM kill, throughout a database connection pool exhaustion, and while a deadlocked thread has stopped processing real requests. Green dashboards, broken service.
The fix is to separate the two concerns into two endpoints: /health/live for the heartbeat (Kubernetes calls this a liveness probe) and /health/ready for the readiness check (Kubernetes calls this a readiness probe). Each has a different consumer, different response criteria, and different consequences when it fails.
Heartbeat (Liveness): Designed to Be Killed
The heartbeat exists for one purpose: to tell an orchestrator whether to restart the process. Its only correct response is 'the process is alive enough to respond' — nothing more. It should not check downstream dependencies, it should not query the database, it should not validate cache health. If it fails, the consumer (Kubernetes, Nomad, ECS, your supervisor of choice) will kill and restart the process. So it should fail only when restarting is the right action.
A correct liveness endpoint is essentially: 'return 200 OK in under 50ms with no side effects.' The HTTP handler itself running is sufficient evidence. The handler can additionally check for clearly broken process state — a permanently deadlocked goroutine, a flag set by a watchdog timer that fired during a 60-second event loop stall — but the bar is 'something only a process restart can fix.'
Common mistakes: checking the database in the liveness endpoint (a database hiccup will trigger pod restarts that don't help), checking external APIs (a Stripe outage will cascade into a restart loop on every pod, making everything worse), or running expensive queries (a slow liveness probe under load can itself cause the timeouts it was meant to detect). The right mental model: a liveness probe is a self-destruct switch, and you want it to be conservative about pulling the trigger — it should fire only when a restart will actually fix something.
Health Check (Readiness): Designed to Drain Traffic
The health check exists for a different purpose: to tell a load balancer whether to route traffic to this instance. When it fails, the consumer pulls the instance out of rotation, but does not restart it — the assumption is that the instance is alive but temporarily unable to serve. This makes the readiness endpoint the right place for richer dependency checks.
A good readiness endpoint reflects what the service actually needs to do its job: database reachable and a representative query under 100ms, cache reachable, critical downstream services reachable (with appropriate timeouts), config loaded, warmup complete, no recent fatal errors. If any of those fail, the instance reports not-ready and traffic shifts to peers. The instance keeps running, can recover, and is added back when it's healthy.
Readiness checks are also where graceful shutdown happens. When a deploy or scaling event signals the process to terminate, the process should immediately start failing its readiness check while continuing to serve in-flight requests. The load balancer drains traffic from the instance, the instance finishes its current work, and then it exits. Without a readiness endpoint, every deploy involves a few seconds of customer-facing 502s as connections get dropped mid-flight.
External Probes Need a Third Endpoint
Liveness and readiness are operationally focused — they exist to coordinate with the orchestrator and the load balancer. They are usually not appropriate as the target of external monitoring probes. A readiness check that fails causes traffic to be drained, which is exactly what should happen in production — but it tells you nothing about the customer experience for the traffic that's still being served.
For external monitoring (PulsAPI, Pingdom, your own external probes), use a third endpoint: /health/external or, better, a real customer-facing endpoint with a known-stable test payload. This endpoint should exercise the actual code path a customer uses — not a special bypass — so that a regression in the real path is caught. Many teams use a synthetic order with a $0.01 amount, a pre-created test account login, or a dedicated 'canary' account that has a known set of data and predictable expected responses.
The full pattern for a production service: /health/live (cheap, returns 200 if the process is responsive), /health/ready (richer, returns 200 only if the instance is fit to serve), and a real user-flow endpoint or scripted transaction for external SLA-relevant probes. Each has its own consumer, its own failure semantics, and its own consequences. Building all three is the work of half a day and pays back the first time a deploy goes sideways at 3 AM.
If you're auditing an existing service and find a single /health endpoint doing all three jobs, the highest-leverage refactor is splitting liveness from readiness first. The cost is one new route handler; the benefit is that orchestrator-driven restarts stop firing during downstream incidents and stop turning a recoverable degradation into a cascading outage.
About the Author
Sofia builds observability tooling at PulsAPI. Previously at Datadog and Honeycomb working on metrics ingestion at scale.
Start monitoring your stack
Aggregate real-time operational data from every service your stack depends on into a single dashboard. Free for up to 10 services.