OpenTelemetry Getting Started: A Practical Guide for SaaS Teams

A practical OpenTelemetry getting-started guide: what to instrument first, which SDKs to pick, how to ship to any backend, and the mistakes to avoid in production.

Why OpenTelemetry Won the Standard

OpenTelemetry, often shortened to OTel, is a CNCF project that standardises how applications emit traces, metrics, and logs. It was created by merging the OpenTracing and OpenCensus communities in 2019, and by 2026 it is the second most active CNCF project after Kubernetes. The reason it won is simple: it decouples instrumentation from backend choice. You instrument your code once, then send the data to Datadog, Honeycomb, Grafana, AWS X-Ray, or any combination, without rewriting a single span.

For engineering teams who have been burned by vendor-specific tracing agents — proprietary SDKs, hidden sampling rules, expensive lock-in — OTel is the off-ramp. It also has first-class language support for Go, Java, Python, JavaScript, Rust, .NET, Ruby, PHP, and several others, with stability guarantees on the tracing API since 2021.

This guide is for teams that are adopting OpenTelemetry for the first time and want to avoid the most common potholes during the first month.

Pick Your Three Components Before Writing Any Code

Every OpenTelemetry deployment has three moving parts: the SDK in your application, the Collector that receives and exports data, and the backend that stores and queries it. Decide all three before you start instrumenting: swapping the SDK or pointing services at a different Collector endpoint later means redeploying every service, whereas the backend can be changed in Collector configuration alone.

For the SDK, pick the official language SDK rather than a third-party wrapper unless you have a strong reason. The official SDKs follow the spec closely and ship semantic conventions for HTTP, database, and messaging spans out of the box. For the Collector, run the OpenTelemetry Collector as a sidecar or DaemonSet in Kubernetes — it gives you batching, retries, sampling, and the ability to fan out to multiple backends without touching application code.

For the backend, choose based on the questions you ask most. High-cardinality investigative queries point to Honeycomb or ClickHouse-based stores. Tight integration with existing dashboards points to Datadog, New Relic, or Grafana Tempo. Cost-sensitive teams often start with Tempo or AWS X-Ray and migrate later. The Collector makes that migration trivial — you change one config file, not 40 services.

What to Instrument First

Resist the urge to instrument everything in week one. The 80/20 rule for OTel adoption is to enable automatic instrumentation for HTTP servers, HTTP clients, database drivers, and your queue or messaging library. That alone will produce useful end-to-end traces for the majority of customer-facing requests with no code changes.

Then add custom spans by hand in four high-value places: the entry point of each background job, calls to critical third-party APIs (payments, auth, email), expensive business operations (PDF generation, search indexing, model inference), and any retry loop. These are the spans that explain incidents, and they are where automatic instrumentation tends to be silent.

Always set resource attributes — service.name, service.version, deployment.environment — at process start. Without them, traces from staging, canary, and production blur together, and every investigation begins with a filter dance you do not need.
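
One way to set these without touching application code is the OTEL_RESOURCE_ATTRIBUTES environment variable, which the SDKs read at start-up as a comma-separated list of key=value pairs. The sketch below only formats that string using the Python standard library. The service name, version, and environment values are hypothetical, and in most deployments the variable would be set in the container or deployment manifest rather than from Python.

```python
import os
from urllib.parse import quote


def resource_attributes_env(attrs: dict[str, str]) -> str:
    """Format attributes as a comma-separated key=value list.

    Values are percent-encoded so that a comma or equals sign inside a
    value cannot be mistaken for a separator.
    """
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in attrs.items())


# Hypothetical values, for illustration only.
attrs = {
    "service.name": "checkout-api",
    "service.version": "1.4.2",
    "deployment.environment": "staging",
}

env_value = resource_attributes_env(attrs)

# The variable must be in place before the SDK initialises;
# setting it here is only to keep the sketch self-contained.
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = env_value
print(env_value)
```

Keeping the three attributes in one place, whether a manifest or a helper like this, makes it harder for a service to ship without them.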

Sampling, Cost, and the Mistakes to Avoid

Trace data is cheap to generate and expensive to store. Head-based sampling, where the keep-or-drop decision is made when the trace starts, is fine for low-traffic services but pathological at scale: most SDKs default to keeping every trace, so you pay for billions of unimportant spans, and lowering the ratio means randomly dropping the rare slow request that matters most. Move to tail-based sampling in the Collector once you exceed roughly 1,000 requests per second, so you can sample by latency, error status, or trace attributes after the trace completes.
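
To illustrate the idea of deciding after the trace completes, here is a minimal sketch of a tail-sampling policy in plain Python. It is not the Collector's implementation, which is configured rather than coded. The CompletedTrace type, the 500 ms threshold, and the 5 percent baseline are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CompletedTrace:
    """Hypothetical summary of a finished trace."""
    trace_id: str        # 32 lowercase hex characters
    duration_ms: float   # end-to-end latency of the root span
    has_error: bool      # True if any span ended with an error status


def keep_trace(trace: CompletedTrace,
               latency_threshold_ms: float = 500.0,
               baseline_percent: int = 5) -> bool:
    """Decide whether to keep a trace once all of its spans are known."""
    # Always keep traces that contain an error.
    if trace.has_error:
        return True
    # Always keep traces slower than the threshold.
    if trace.duration_ms >= latency_threshold_ms:
        return True
    # Keep a small, deterministic share of everything else. Deriving the
    # bucket from the trace ID means the same trace always gets the same
    # decision, regardless of which process evaluates it.
    bucket = int(trace.trace_id[-8:], 16) % 100
    return bucket < baseline_percent


failed = CompletedTrace("0" * 24 + "00000063", duration_ms=20.0, has_error=True)
slow = CompletedTrace("0" * 24 + "00000063", duration_ms=900.0, has_error=False)
ordinary = CompletedTrace("0" * 24 + "00000063", duration_ms=20.0, has_error=False)

print(keep_trace(failed), keep_trace(slow), keep_trace(ordinary))
```

A head-based sampler cannot express the first two rules, because latency and error status are unknown when the trace starts.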

Common rollout mistakes include: instrumenting the framework but forgetting outbound HTTP clients (so traces show the calling service and then stop), enabling logs export before the backend can handle the volume (which doubles ingest costs overnight), and using the default OTLP gRPC endpoint without TLS in production. The OpenTelemetry documentation covers each, but they are easy to skip.
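
The first mistake is easier to see once you know what an instrumented HTTP client adds to each request. With the W3C Trace Context format, the caller sends a traceparent header carrying the trace ID and its own span ID; if the header is missing, the downstream service starts a new, unrelated trace. The sketch below builds and parses that header with the standard library only. The function names are hypothetical, the IDs are the example values from the W3C specification, and the parser is deliberately not a full validator.

```python
import re

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format the header an instrumented client attaches to outbound calls."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Return the propagated context, or None if the header is absent or malformed."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": (int(flags, 16) & 1) == 1,
    }


outbound_headers = {
    "traceparent": build_traceparent(
        "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"
    )
}

# Downstream service with the header: the trace continues.
continued = parse_traceparent(outbound_headers["traceparent"])
# Downstream service without it: nothing to attach to, so a new trace begins.
orphaned = parse_traceparent("")

print(outbound_headers["traceparent"])
print(continued, orphaned)
```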

Plan a cost ceiling before week one. Most teams that abandon OpenTelemetry do so because the bill from their backend grew faster than the value, not because the data was wrong. Tail-based sampling, sensible resource attributes, and Collector-side filtering keep both the bill and the noise low.
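
A cost ceiling is easier to plan with a back-of-the-envelope volume estimate. The sketch below is only arithmetic: apart from the 1,000 requests per second mentioned earlier, every input (20 spans per request, 10 percent of traces kept, 500 bytes per span, a 30-day month) is an assumed placeholder to replace with your own measurements.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an assumption


def monthly_spans(requests_per_second: int,
                  spans_per_request: int,
                  keep_percent: int) -> int:
    """Estimate the number of spans stored per month after sampling."""
    total = requests_per_second * spans_per_request * SECONDS_PER_MONTH
    return total * keep_percent // 100


def monthly_ingest_gb(span_count: int, bytes_per_span: int) -> float:
    """Convert a span count to gigabytes (10^9 bytes) of ingest."""
    return span_count * bytes_per_span / 1e9


# Placeholder inputs; measure your own before budgeting.
unsampled = monthly_spans(1000, spans_per_request=20, keep_percent=100)
sampled = monthly_spans(1000, spans_per_request=20, keep_percent=10)

print(f"unsampled: {unsampled:,} spans, {monthly_ingest_gb(unsampled, 500):,.0f} GB")
print(f"sampled:   {sampled:,} spans, {monthly_ingest_gb(sampled, 500):,.0f} GB")
```

Comparing the sampled figure against your backend's pricing before rollout shows whether the sampling rate or the span count per request needs to come down.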

FAQ: OpenTelemetry Adoption

Is OpenTelemetry production-ready? Yes. The tracing specification has been stable since 2021, metrics since 2022, and logs reached stable in 2023. Major SDKs are used by companies including Shopify, Microsoft, AWS, and Uber.

Do I need to drop my existing APM? No. The Collector can export to multiple backends at once, so most teams run OpenTelemetry alongside their existing APM for several months while they migrate dashboards and alerts.

How long does a first rollout take? A small team can get automatic HTTP and database tracing live in production for two or three services within a week, and full custom instrumentation for a typical SaaS app within four to six weeks.

About the Author

Sofia Andrade, Senior Infrastructure Engineer

Sofia is a senior infrastructure engineer at PulsAPI who specialises in on-call tooling and incident response automation. She has worked in SRE roles at cloud-native companies for over eight years.