Serverless Monitoring: How to Track AWS Lambda Reliability in Production

How to monitor AWS Lambda in production: cold starts, throttles, async failures, cost spikes, and how regional AWS status fits into the picture.

Why Lambda Monitoring Is Different From Server Monitoring

AWS Lambda inverts most of what traditional infrastructure monitoring assumes. There is no long-lived process to attach an agent to. There is no constant baseline of CPU or memory; usage spikes and falls to zero per invocation. There is no SSH access for live debugging. And the failure modes are different: cold starts, concurrent execution limits, async invocation drops, and event-source pressure all matter more than typical 'server is down' alerts.

Teams that bring server-thinking to Lambda end up with dashboards full of metrics that do not align with how the platform actually fails. They alert on average duration when the P99 is what causes user-visible incidents. They alert on errors when async invocation drops are silent. They miss cost regressions until the bill arrives.

This guide is for engineering teams running Lambda in production who want a monitoring setup that matches the platform — and that combines internal metrics with the AWS regional status data that ultimately controls whether the platform itself is healthy.

The Six Metrics That Actually Predict Lambda Problems

Track these six per function: Invocations (volume), Errors (function exceptions), Throttles (concurrency limit hits), Duration P95 and P99 (not the average, which hides tail latency), IteratorAge for stream-based triggers (Kinesis and DynamoDB Streams; Kafka sources such as MSK report OffsetLag instead), and AsyncEventsDropped for async invocations. Together they catch the large majority of Lambda incidents.
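As a rough illustration of how a few of these metrics could be turned into alarms, the sketch below builds keyword-argument dictionaries in the shape that CloudWatch's PutMetricAlarm API expects. The helper names (lambda_alarm, baseline_alarms), the alarm naming scheme, the zero thresholds, and the three-period evaluation window are assumptions for illustration, not recommendations from AWS.

```python
from typing import Any


def lambda_alarm(function_name: str, metric: str, threshold: float,
                 stat: str = "Sum", period_s: int = 60,
                 evaluation_periods: int = 3) -> dict[str, Any]:
    """Build keyword arguments for one CloudWatch alarm on a Lambda metric."""
    alarm: dict[str, Any] = {
        "AlarmName": f"{function_name}-{metric}-{stat}",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": metric,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Period": period_s,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Idle functions emit no data points; do not treat that as a breach.
        "TreatMissingData": "notBreaching",
    }
    # Percentiles (p95, p99) go in ExtendedStatistic; Sum, Average, and the
    # other standard statistics go in Statistic. The two are mutually exclusive.
    if stat.startswith("p"):
        alarm["ExtendedStatistic"] = stat
    else:
        alarm["Statistic"] = stat
    return alarm


def baseline_alarms(function_name: str, p99_ms: float) -> list[dict[str, Any]]:
    """Alarm definitions for four of the metrics named above."""
    return [
        lambda_alarm(function_name, "Errors", 0),
        lambda_alarm(function_name, "Throttles", 0),
        lambda_alarm(function_name, "Duration", p99_ms, stat="p99"),  # milliseconds
        lambda_alarm(function_name, "AsyncEventsDropped", 0),
    ]
```

Each dictionary could then be passed to a boto3 CloudWatch client as put_metric_alarm(**kwargs), assuming boto3 and suitable credentials are available. The function name and P99 threshold are yours to choose.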

Cold starts deserve a metric of their own. Lambda writes Init Duration to the REPORT log line of each cold-start invocation in CloudWatch Logs; ship that as a structured field and alert on P99 cold start above 3 seconds for user-facing functions. The fix is usually provisioned concurrency, but only after you confirm cold starts are the actual problem, because over-provisioning is a common cost regression.
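A minimal sketch of that extraction, assuming you already have the raw REPORT lines as strings (for example from a log subscription). The function names and the nearest-rank percentile method are illustrative choices. The 3-second threshold is the one suggested above.

```python
import re
from typing import Optional

# "Init Duration: 254.31 ms" only appears on REPORT lines for cold starts.
_INIT_RE = re.compile(r"Init Duration:\s*([\d.]+)\s*ms")


def init_duration_ms(log_line: str) -> Optional[float]:
    """Return the cold-start init time in ms, or None for warm invocations."""
    if not log_line.startswith("REPORT"):
        return None
    match = _INIT_RE.search(log_line)
    return float(match.group(1)) if match else None


def nearest_rank_p99(values: list[float]) -> Optional[float]:
    """99th percentile by the nearest-rank method; None if there is no data."""
    if not values:
        return None
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def cold_start_alert(init_times_ms: list[float], threshold_ms: float = 3000.0) -> bool:
    """True when the P99 cold start exceeds the threshold."""
    p99 = nearest_rank_p99(init_times_ms)
    return p99 is not None and p99 > threshold_ms
```

In practice the percentile would more likely be computed by your log or metrics backend; the point here is only that the cold-start signal is a field on the REPORT line, not a separate CloudWatch metric.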

For functions behind API Gateway or ALB, also track 5xx at the gateway layer, because Lambda errors do not always surface as function errors (init failures, timeout cascades, and integration errors all show up at the gateway).

Async Drops and Throttles: The Silent Failures

Async invocations (events from S3, SNS, EventBridge) fail silently: Lambda retries them in the background (twice by default), and once the retry budget or maximum event age runs out, they land in a dead-letter queue if one is configured, or vanish. Monitor AsyncEventsDropped on every async function and route the DLQ to a queue you actually inspect. Many production Lambda outages are not 'errors went up' but 'events stopped processing,' which looks like nothing at all in the function metrics.
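One way to make the retry budget explicit is Lambda's per-function event invoke configuration. The sketch below builds the arguments for the PutFunctionEventInvokeConfig API and uses an on-failure destination, which is a related mechanism to the dead-letter queue mentioned above rather than the same thing. The helper name, the one-hour event age, and the ARN in the usage note are assumptions.

```python
from typing import Any


def async_failure_config(function_name: str, failure_queue_arn: str,
                         max_retries: int = 2,
                         max_event_age_s: int = 3600) -> dict[str, Any]:
    """Build arguments for Lambda's PutFunctionEventInvokeConfig call."""
    # The service accepts 0 to 2 retries and an event age of 60 to 21600 seconds.
    if not 0 <= max_retries <= 2:
        raise ValueError("max_retries must be between 0 and 2")
    if not 60 <= max_event_age_s <= 21600:
        raise ValueError("max_event_age_s must be between 60 and 21600")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_s,
        # Events that exhaust retries or age out are sent here instead of vanishing.
        "DestinationConfig": {"OnFailure": {"Destination": failure_queue_arn}},
    }
```

With boto3 available, the result could be applied as lambda_client.put_function_event_invoke_config(**async_failure_config("thumbnailer", "arn:aws:sqs:us-east-1:123456789012:thumbnailer-failures")), where both the function name and the queue ARN are placeholders.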

Throttles indicate the function hit either its reserved concurrency limit or the account regional concurrency limit. The default account limit is 1,000 concurrent executions per region, which sounds high but is easy to hit during traffic spikes or backfill jobs. Track Throttles per function and alert on any sustained throttling: for synchronous callers it means requests are being rejected, and for async sources it means retries are piling up.
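To see why 1,000 is easy to hit, a back-of-the-envelope estimate helps: steady-state concurrency is roughly request rate multiplied by average duration. The sketch below encodes that arithmetic. The function names and the traffic figures in the usage note are made up for illustration.

```python
def estimated_concurrency(requests_per_second: float, avg_duration_ms: float) -> float:
    """Approximate concurrent executions: arrival rate times time in flight."""
    return requests_per_second * avg_duration_ms / 1000


def would_throttle(requests_per_second: float, avg_duration_ms: float,
                   concurrency_limit: int = 1000) -> bool:
    """True when the estimate exceeds the limit (default: the 1,000 account default)."""
    return estimated_concurrency(requests_per_second, avg_duration_ms) > concurrency_limit
```

For example, 500 requests per second at 400 ms each needs about 200 concurrent executions, while a backfill pushing 2,000 requests per second at 600 ms needs about 1,200 and would be throttled at the default limit. The estimate ignores burst behavior and any other functions sharing the same account limit.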

If you use provisioned concurrency, track ProvisionedConcurrencyUtilization. Sustained values above 90% mean you are about to spill into on-demand (and cold starts); values below 40% mean you are paying for capacity you do not use.
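The two thresholds above can be expressed as a small classifier. This sketch assumes utilization samples are fractions between 0 and 1, and it interprets "sustained" as every sample in the window crossing the threshold. The function name and return labels are hypothetical.

```python
def classify_pc_utilization(samples: list[float],
                            high: float = 0.90, low: float = 0.40) -> str:
    """Classify a window of ProvisionedConcurrencyUtilization samples (0.0 to 1.0)."""
    if not samples:
        return "no-data"
    if all(s > high for s in samples):
        return "near-spillover"      # about to fall back to on-demand and cold starts
    if all(s < low for s in samples):
        return "over-provisioned"    # paying for capacity that is not used
    return "ok"
```

Requiring every sample to cross the threshold is a deliberately conservative reading of "sustained". A real alarm might instead use a count such as "4 out of 5 periods".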

AWS Regional Status Is Part of Lambda Monitoring

Lambda is a regional service that depends on EC2 capacity, IAM, CloudWatch Logs, and (for VPC functions) VPC networking and ENI provisioning. When AWS reports a regional Lambda incident — or an incident in any of those underlying components — your function metrics will look broken even if your code is fine. Without provider status context, on-call ends up debugging code that has not changed.

Pair internal Lambda dashboards with PulsAPI's AWS component-level monitoring. When Errors and Duration spike together across multiple functions in the same region, the first place to look is whether AWS is reporting an incident on Lambda, EC2, IAM, or ENI provisioning for that region. Making that check first can save on-call an hour of triage on a typical Lambda incident.
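The "multiple functions, same region, Errors and Duration together" pattern can be sketched as a simple grouping step. The FunctionSignal type, the boolean spike flags, and the two-function minimum are assumptions. How a spike is detected, and how provider status is then fetched, are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FunctionSignal:
    """Hypothetical per-function summary produced by your own alerting."""
    name: str
    region: str
    errors_spiking: bool
    duration_spiking: bool


def regions_to_check(signals: list[FunctionSignal], min_functions: int = 2) -> list[str]:
    """Regions where several functions show Errors and Duration spiking together."""
    affected: dict[str, set[str]] = defaultdict(set)
    for signal in signals:
        if signal.errors_spiking and signal.duration_spiking:
            affected[signal.region].add(signal.name)
    # A single misbehaving function points at code; several at once point at the platform.
    return sorted(region for region, names in affected.items()
                  if len(names) >= min_functions)
```

Any region this returns is a prompt to look at provider status for Lambda, EC2, IAM, and ENI provisioning before opening the code.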

Build runbook entries for the three or four AWS components your Lambda fleet depends on most: Lambda itself, EC2 (capacity for cold starts), IAM (for assumed roles), and CloudWatch (for log ingestion). Each entry should describe how the function metrics look when that specific component is degraded, so on-call can recognise the pattern.

FAQ: Serverless Monitoring

Does CloudWatch alone cover Lambda monitoring? CloudWatch has the metrics, but the default dashboards do not surface tail latency, async drops, or cost regressions clearly. Most teams add a structured logging pipeline (Lambda Powertools, OpenTelemetry) and a third-party dashboard.

What about cost monitoring? Track Invocations × Average Duration × Memory together as a 'compute units' metric, which approximates the GB-seconds that drive the compute part of the bill. A sudden change in any factor signals a cost regression before AWS Budgets fires.
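The arithmetic behind that metric is short enough to write down. This sketch converts the three factors to GB-seconds and compares two periods. The function names and the 25% change threshold are illustrative assumptions, and the result is a usage figure, not a price.

```python
def compute_units_gb_seconds(invocations: int, avg_duration_ms: float,
                             memory_mb: int) -> float:
    """Invocations x average duration x memory, expressed in GB-seconds."""
    return invocations * avg_duration_ms * memory_mb / (1000 * 1024)


def cost_regression(previous: float, current: float, max_increase: float = 0.25) -> bool:
    """True when compute units grew by more than max_increase between two periods."""
    if previous <= 0:
        return current > 0
    return (current - previous) / previous > max_increase
```

For example, one million invocations at 200 ms on a 512 MB function come to 100,000 GB-seconds. Doubling the memory setting without a matching drop in duration doubles that figure, which this check would flag.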

How does PulsAPI fit in? PulsAPI tracks AWS Lambda, EC2, IAM, ENI, and CloudWatch at component and region level, so when your Lambda metrics misbehave, you know in seconds whether AWS is also reporting an incident.

About the Author

Marcus Webb, Head of Product

Marcus leads product at PulsAPI, where he focuses on making operational awareness effortless for engineering teams. Previously at Datadog and PagerDuty.