Last updated: May 23, 2026
Distributed Tracing Best Practices for Microservices in 2026
Distributed tracing best practices for microservices: span design, sampling, context propagation, third-party calls, and the pitfalls that make traces useless during incidents.
Why Tracing Is the Hardest Telemetry to Get Right
Metrics tell you something is slow. Logs tell you what an individual request did. Only distributed traces tell you which hop in a 12-service request path caused the slowness, and why a retry storm in one service is propagating into another. That is exactly why traces are the highest-value telemetry in a microservices system — and why they are also the easiest to instrument badly.
Bad traces have a shape engineers learn to recognise: huge gaps between spans where context propagation dropped, parents without children where outbound HTTP was not instrumented, and identical span names for thousands of different operations. When that is the telemetry available at 02:00, the incident takes an extra hour.
This guide is for engineering teams already emitting traces who want to lift the quality from technically present to operationally useful. It assumes OpenTelemetry as the instrumentation layer but the patterns apply to any tracing system.
Span Design: Name Operations, Not Implementations
A span name should describe the operation a developer cares about, not the framework function that happened to fire. HTTP server spans should be named after the route template (GET /orders/{id}), not the controller method. Database spans should carry the operation and table (SELECT orders), not the literal SQL string. Background job spans should be named after the job class, not the queue runner.
Use a small, finite set of span names per service. If you let span names include high-cardinality values like user IDs or request UUIDs, your tracing backend will struggle to aggregate, and dashboards built on span name become unusable. Put high-cardinality values into attributes, where backends are designed to query them.
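As a minimal sketch of that rule, the helper below maps a concrete request path back to its route template and moves the variable parts into attributes. The route table, the function name describe_http_span, and the app.* attribute keys are hypothetical and only illustrate the idea; in practice your web framework's instrumentation usually supplies the route template for you.

```python
import re

# Hypothetical route table for an orders service: (method, template) pairs.
ROUTES = [
    ("GET", "/orders/{id}"),
    ("POST", "/orders"),
    ("GET", "/orders/{id}/items/{item_id}"),
]


def _template_to_regex(template: str) -> re.Pattern:
    # Turn "/orders/{id}" into a regex with one named group per placeholder.
    parts = re.split(r"(\{\w+\})", template)
    pattern = "".join(
        f"(?P<{p[1:-1]}>[^/]+)" if p.startswith("{") else re.escape(p)
        for p in parts
    )
    return re.compile(f"^{pattern}$")


def describe_http_span(method: str, path: str, user_id: str):
    """Return (span_name, attributes) for an incoming request."""
    for route_method, template in ROUTES:
        match = _template_to_regex(template).match(path)
        if route_method == method and match:
            # The span name stays low-cardinality: method plus route template.
            attributes = {"http.route": template, "app.user_id": user_id}
            # Path parameters are high-cardinality, so they become attributes.
            attributes.update(
                {f"app.route.{key}": value for key, value in match.groupdict().items()}
            )
            return f"{method} {template}", attributes
    # Unknown paths share one name so they cannot inflate the set of span names.
    return f"{method} (unmatched)", {"app.user_id": user_id}
```

With this shape, a request for /orders/8812 and a request for /orders/9001 both aggregate under GET /orders/{id}, while the individual order IDs remain queryable as attributes.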
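Set semantic-convention attributes on every span where they apply: http.request.method, http.response.status_code, db.system.name, messaging.system, peer.service. (Older instrumentation emits http.method, http.status_code, and db.system for the first three; pick one generation and use it consistently.) The OpenTelemetry semantic conventions are not optional polish: they are what enables your backend to render service maps, group errors, and compute RED metrics (Rate, Errors, Duration) without custom per-service config.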
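To see why consistent names and attributes matter, here is a rough sketch of how a backend might derive RED metrics from finished spans. The span dictionaries and the red_metrics function are assumptions made for illustration, not any backend's real data model. The status code is read from http.response.status_code, the current stable attribute name; older instrumentation emits http.status_code instead, and a mix of the two would silently undercount errors here.

```python
from collections import defaultdict
from statistics import mean


def red_metrics(spans, window_seconds):
    """Group spans by (service, span name) and compute Rate, Errors, Duration."""
    groups = defaultdict(list)
    for span in spans:
        groups[(span["service"], span["name"])].append(span)

    report = {}
    for key, group in groups.items():
        # Count server errors using the shared status-code attribute.
        errors = sum(
            1
            for s in group
            if s["attributes"].get("http.response.status_code", 0) >= 500
        )
        report[key] = {
            "rate_per_s": len(group) / window_seconds,
            "error_ratio": errors / len(group),
            "mean_duration_ms": mean(s["duration_ms"] for s in group),
        }
    return report
```

The grouping only works because every span for the same operation carries the same name and the same attribute key, which is exactly what the conventions guarantee.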
Context Propagation Is the Whole Game
A trace is only useful if context propagates across every hop. Internal HTTP calls should propagate W3C Trace Context headers (traceparent and tracestate). Background jobs should carry the full trace context (trace ID, parent span ID, and sampling flag) through the message headers or queue payload, not just the trace ID, so the worker's spans attach to the right parent. When you deliver webhooks to third parties, you cannot expect the receiver to propagate context, but you can record an outbound span on your side and link it to the eventual callback when it arrives.
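For reference, a version-00 traceparent header has four lowercase hex fields: version, a 32-character trace ID, a 16-character parent span ID, and a 2-character flags byte. The sketch below parses that header and shows one way a queue producer and consumer might carry it. The message shape and the helper names (enqueue_job, start_job_span) are hypothetical; an OpenTelemetry SDK's propagators would normally do this for you.

```python
import re
import secrets

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return the fields of a version-00 traceparent header, or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the W3C specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent_header: str) -> str:
    """Keep the trace ID and sampled flag, mint a new span ID for the next hop."""
    ctx = parse_traceparent(parent_header)
    if ctx is None:
        raise ValueError("invalid traceparent")
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


def enqueue_job(body: dict, current_traceparent: str) -> dict:
    """Producer side: put the context in message headers next to the payload."""
    return {"headers": {"traceparent": current_traceparent}, "body": body}


def start_job_span(message: dict) -> str:
    """Consumer side: continue the producer's trace instead of starting a new one."""
    return child_traceparent(message["headers"]["traceparent"])
```

Keeping the context in headers rather than inside the job body means the business payload stays unchanged and any consumer can continue the trace without knowing the job's schema.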
The two places teams most often lose context: custom HTTP clients written before instrumentation was added, and message bus libraries that strip headers. Add a propagation test to your CI suite — send a request through every major service and assert that the trace ID matches end to end. This single check catches more broken instrumentation than any code review.
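The shape of such a propagation check can be sketched with simulated services. Everything here is a stand-in: make_service and broken_hops are hypothetical names, and a real CI test would send a request through deployed services and read the observed trace IDs from a test exporter or the tracing backend rather than from an in-memory list.

```python
import secrets


def make_service(name, downstream=None, forwards_headers=True):
    """Build a fake service that records the trace ID it saw, then calls downstream."""

    def handle(headers, observed):
        if "traceparent" in headers:
            trace_id = headers["traceparent"].split("-")[1]
        else:
            # No incoming context: the service starts a fresh trace.
            # This is exactly the failure the check needs to detect.
            trace_id = secrets.token_hex(16)
        observed.append((name, trace_id))
        if downstream is not None:
            outbound = {}
            if forwards_headers:
                outbound["traceparent"] = f"00-{trace_id}-{secrets.token_hex(8)}-01"
            downstream(outbound, observed)

    return handle


def broken_hops(observed, root_trace_id):
    """Return the names of hops whose trace ID differs from the root's."""
    return [hop for hop, trace_id in observed if trace_id != root_trace_id]


def run_chain(billing_forwards_headers):
    """Send one request through gateway -> orders -> billing -> ledger."""
    root = "4bf92f3577b34da6a3ce929d0e0e4736"
    ledger = make_service("ledger")
    billing = make_service("billing", ledger, forwards_headers=billing_forwards_headers)
    orders = make_service("orders", billing)
    gateway = make_service("gateway", orders)
    observed = []
    gateway({"traceparent": f"00-{root}-00f067aa0ba902b7-01"}, observed)
    return broken_hops(observed, root)
```

When billing uses a client that drops headers, the check reports ledger as the first service with a mismatched trace ID, which points directly at the outbound call that needs instrumenting.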
For third-party API calls (Stripe, OpenAI, Twilio, GitHub), record outbound spans even though the vendor cannot continue the trace. Pair them with PulsAPI status data and you can answer the question that matters in incidents: 'this slow checkout was a Stripe latency spike at 14:07, not our code.'
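One way to make that correlation concrete is to check whether an outbound span overlaps a known vendor incident window. The data classes below are a hypothetical shape invented for this sketch; they are not PulsAPI's actual API or data format, and the timestamps are made-up examples.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OutboundSpan:
    peer_service: str  # value of the peer.service attribute, e.g. "stripe"
    start: datetime
    end: datetime


@dataclass(frozen=True)
class VendorIncident:
    vendor: str
    start: datetime
    end: datetime


def overlapping_incidents(span, incidents):
    """Return incidents for the span's vendor whose window overlaps the span."""
    return [
        incident
        for incident in incidents
        if incident.vendor == span.peer_service
        # Two intervals overlap when each one starts before the other ends.
        and incident.start <= span.end
        and span.start <= incident.end
    ]
```

A non-empty result for a slow checkout span is a strong hint to look at the vendor first; an empty one sends you back to your own code sooner.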
Sampling Without Losing the Interesting Traces
At any meaningful scale, you cannot keep every trace. Head-based sampling — deciding at the root span whether to keep a trace — is simple but throws away the rare slow or failing requests that matter most. Tail-based sampling, run in the OpenTelemetry Collector, lets you keep every error trace, every trace above a latency threshold, and a small uniform sample of everything else.
A common starting policy that works for most SaaS teams: keep 100% of traces with error status, 100% of traces above the P99 latency for that service, 10% of traces from authenticated users, and 1% of everything else. Adjust the percentages once you see actual storage cost, but do not start with a single global sample rate.
Once you have more than a few services, make the sampling decision at the Collector, not the SDK. SDK-level sampling only works if every service applies the same rules; when they disagree, some services keep spans that others drop, and you end up with broken parent-child relationships in the backend.
FAQ: Distributed Tracing
How many spans per request are too many? There is no fixed limit, but many backends get more expensive or slower to query past a few hundred spans per trace. Aim for one span per meaningful operation, not one span per function call.
Should I trace database queries individually? Yes, with the operation and table as the span name (SELECT orders) and the parameterized query template as an attribute, never the literal SQL with values inlined. Skip queries inside hot loops or instrument them with a single parent span that records counts.
Can tracing replace logging? No. Traces show structure and timing across services; structured logs carry the detailed payloads, errors, and stack traces. They are complementary, and the best backends correlate them by trace ID.
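The hot-loop advice from the database question above can be sketched as follows. The span context manager and FINISHED_SPANS list are a toy stand-in for a tracing SDK, and load_prices, fetch_price, and the app.db.query_count attribute are hypothetical names used only for illustration.

```python
import time
from contextlib import contextmanager

FINISHED_SPANS = []  # stand-in for a span exporter


@contextmanager
def span(name, **attributes):
    """Record one span with a name, attributes, and a measured duration."""
    record = {"name": name, "attributes": dict(attributes)}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        FINISHED_SPANS.append(record)


def load_prices(item_ids, fetch_price):
    """Run one query per item, but emit a single span for the whole loop."""
    with span("SELECT prices (loop)") as current:
        prices = {}
        for item_id in item_ids:
            # Deliberately no span per iteration: the loop may run thousands of times.
            prices[item_id] = fetch_price(item_id)
        # The count preserves the information a per-query span would have carried.
        current["attributes"]["app.db.query_count"] = len(item_ids)
        return prices
```

The trace still shows how long the loop took and how many queries it issued, which is usually enough to spot an N+1 pattern without paying for thousands of near-identical spans.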
About the Author
Sofia is a senior infrastructure engineer at PulsAPI who specialises in on-call tooling and incident response automation. She has worked in SRE roles at cloud-native companies for over eight years.