Kubernetes Cluster Monitoring: A Complete Guide for SRE Teams

What to monitor in a Kubernetes cluster, which metrics matter, how to detect control plane issues, and how to combine internal metrics with cloud provider status.

The Three Layers of a Kubernetes Cluster That Need Monitoring

A Kubernetes cluster is not one system — it is at least three, and each needs its own monitoring strategy. The control plane (API server, scheduler, controller manager, etcd) decides what should run. The data plane (nodes, kubelets, container runtime) runs it. The workload plane (deployments, pods, services, ingress) is what your application actually does. Monitoring all three is the difference between knowing a pod crashed and knowing why.

The most common failure mode in production clusters is not what teams instrument for. Pods crashing is loud and obvious. The quiet killers are: etcd disk latency creeping up, the API server being slow to respond, scheduler queue length growing, kubelet certificate rotations failing, or a CNI plugin silently dropping packets. None of these show up in a default Grafana dashboard.

This guide is for SRE and platform teams running Kubernetes in production — whether self-managed, EKS, GKE, or AKS — who want to move past 'CPU and memory dashboards' into operationally meaningful coverage.

Control Plane Metrics That Predict Outages

etcd is the single source of truth for a Kubernetes cluster, and its disk performance is the single most predictive metric for cluster health. Track etcd_disk_wal_fsync_duration_seconds at the 99th percentile. etcd's own guidance is that this value should stay under 10ms on healthy storage; if it stays above 25ms for a sustained period, the cluster will start to fail in ways that look unrelated. Also monitor etcd_server_leader_changes_seen_total: leadership flapping usually means network or disk pressure.
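To make the percentile check concrete, here is a minimal sketch of how a 99th percentile can be estimated from cumulative histogram buckets, using the same linear interpolation idea that Prometheus applies in histogram_quantile. The function names, the sample bucket counts, and the default threshold argument are assumptions for illustration; in practice you would normally let Prometheus do this calculation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) sorted by bound,
    ending with (math.inf, total_count).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


def fsync_is_slow(p99_seconds, threshold_seconds=0.025):
    """Hypothetical helper: compare the p99 with the 25ms threshold from the text."""
    return p99_seconds > threshold_seconds


# Illustrative bucket counts, not real cluster data.
sample_buckets = [
    (0.008, 900),
    (0.016, 980),
    (0.032, 998),
    (0.064, 1000),
    (math.inf, 1000),
]
p99 = histogram_quantile(0.99, sample_buckets)
```

With these sample counts the estimate lands just under 25ms, which is the kind of value that deserves attention before it becomes an outage. The word "sustained" matters: evaluate the result over several consecutive windows rather than paging on a single sample.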

API server latency matters for everything else, because every controller talks to it. Watch apiserver_request_duration_seconds for verbs LIST and WATCH against high-cardinality resources (pods, events). A slow LIST on pods will cascade into scheduler delays, controller backoff, and CI pipelines that look like they are hung.
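One way to keep these queries consistent across dashboards and alerts is to generate them from a small helper. The sketch below builds a PromQL expression for the 99th percentile of apiserver_request_duration_seconds, filtered by the verb and resource labels. The function name and the default 5m window are assumptions, not part of any Kubernetes or Prometheus API.

```python
def apiserver_p99_query(verb, resource, window="5m"):
    """Build a PromQL p99 latency query for one verb/resource pair."""
    selector = f'verb="{verb}",resource="{resource}"'
    return (
        "histogram_quantile(0.99, sum by (le) ("
        f"rate(apiserver_request_duration_seconds_bucket{{{selector}}}[{window}])))"
    )


# One query per verb/resource pair named in the text above.
queries = {
    (verb, resource): apiserver_p99_query(verb, resource)
    for verb in ("LIST", "WATCH")
    for resource in ("pods", "events")
}
```

WATCH requests are long-lived by design, so their durations are better read as a trend than compared against a fixed latency threshold; LIST is the verb where an absolute threshold is meaningful.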

Scheduler metrics often get ignored until pods sit in Pending for minutes. Track scheduler_pending_pods and scheduler_pod_scheduling_duration_seconds. If pending pods grow without resource pressure, look at affinity rules, taints, or a misconfigured PriorityClass.
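The triage rule in this paragraph can be written down as a small decision function. This is a sketch under stated assumptions: the function name, the input shape (a series of scheduler_pending_pods samples plus the fraction of free CPU and memory across the cluster), and the 20% headroom default are all hypothetical and should be tuned to your cluster.

```python
def diagnose_pending(pending_samples, cpu_free_fraction, mem_free_fraction, headroom=0.2):
    """Classify a pending-pod trend.

    pending_samples: scheduler_pending_pods values, oldest first.
    cpu_free_fraction / mem_free_fraction: unallocated share of cluster capacity (0..1).
    """
    growing = len(pending_samples) >= 2 and pending_samples[-1] > pending_samples[0]
    if not growing:
        return "ok"
    if min(cpu_free_fraction, mem_free_fraction) < headroom:
        # The cluster is simply full: scale nodes or lower requests.
        return "resource_pressure"
    # Capacity exists but pods still wait: inspect affinity rules, taints, PriorityClass.
    return "check_constraints"
```

For example, diagnose_pending([2, 5, 11], 0.45, 0.50) returns "check_constraints", which points the on-call engineer at scheduling constraints rather than at capacity.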

Node and Workload Signals That Catch Real Problems

Node-level monitoring goes beyond CPU and memory. Disk pressure (the DiskPressure node condition), PID pressure (PIDPressure), and kubelet evictions (the kubelet_evictions counter, labelled by eviction signal) cause silent pod kills that look like application crashes. Container runtime metrics (containerd or CRI-O) detect image pull stalls and broken garbage collection.

For workloads, the four golden signals — latency, traffic, errors, saturation — should be measured per deployment, not per pod. Pod-level metrics churn as pods scale; deployment-level metrics survive autoscaling and give you a clean SLO surface. RED metrics derived from a service mesh (Istio, Linkerd) or HTTP middleware are the easiest way to get there.
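As an illustration of why the aggregation level matters, the sketch below rolls per-pod request and error counts up to one error ratio per deployment. The input shape (dicts with deployment, pod, requests, and errors keys) is an assumption made for the example; a service mesh or metrics backend would normally do this grouping for you through labels.

```python
from collections import defaultdict


def error_ratio_by_deployment(pod_samples):
    """Aggregate per-pod counters into one error ratio per deployment."""
    totals = defaultdict(lambda: [0, 0])  # deployment -> [requests, errors]
    for sample in pod_samples:
        entry = totals[sample["deployment"]]
        entry[0] += sample["requests"]
        entry[1] += sample["errors"]
    # A deployment with no traffic reports 0.0 instead of dividing by zero.
    return {
        name: (errors / requests if requests else 0.0)
        for name, (requests, errors) in totals.items()
    }


# Illustrative data: pod names change as pods scale, the deployment key does not.
samples = [
    {"deployment": "checkout", "pod": "checkout-a", "requests": 100, "errors": 1},
    {"deployment": "checkout", "pod": "checkout-b", "requests": 300, "errors": 3},
    {"deployment": "search", "pod": "search-a", "requests": 0, "errors": 0},
]
ratios = error_ratio_by_deployment(samples)
```

The resulting ratio stays comparable before and after an autoscaling event, which is what makes it usable as an SLO input.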

Set up CrashLoopBackOff and ImagePullBackOff alerts but suppress them for the first 10 minutes after a rollout. Many teams page on rollout noise that resolves itself, and the on-call learns to ignore the alert. Better to define the alert around 'sustained CrashLoop for over X minutes' than to fire on every restart.
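The suppression rule can be expressed as a short predicate. This is a minimal sketch: the function name and arguments are hypothetical, the 10 minute grace period comes from the text above, and the 5 minute sustained window is only a placeholder for the "X minutes" you choose.

```python
from datetime import datetime, timedelta


def should_page(now, crashloop_since, last_rollout,
                grace=timedelta(minutes=10), sustained=timedelta(minutes=5)):
    """Decide whether a CrashLoopBackOff should page the on-call.

    crashloop_since: when the pod first entered CrashLoopBackOff, or None.
    last_rollout: when the most recent rollout started, or None.
    """
    if crashloop_since is None:
        return False
    if last_rollout is not None and now - last_rollout < grace:
        # Still inside the post-rollout grace period: suppress.
        return False
    # Page only when the condition has lasted at least the sustained window.
    return now - crashloop_since >= sustained


rollout = datetime(2024, 1, 1, 12, 0)
during_grace = should_page(rollout + timedelta(minutes=8),
                           rollout + timedelta(minutes=1), rollout)
after_grace = should_page(rollout + timedelta(minutes=12),
                          rollout + timedelta(minutes=1), rollout)
```

Here during_grace is False and after_grace is True: the same crash loop is ignored 8 minutes after the rollout and pages at 12 minutes, once it has outlived the grace period and the sustained window.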

Cloud Provider Status Is Part of Cluster Monitoring

A managed Kubernetes cluster is only as available as the cloud provider behind it. AWS EKS depends on AWS regional health for the control plane, EC2 for nodes, ELB or NLB for ingress, EBS for persistent volumes, and IAM for service accounts. GKE depends on GCE, Cloud Load Balancing, Persistent Disk, and IAM. AKS depends on Azure Virtual Machines, Azure Load Balancer, and Microsoft Entra ID (formerly Azure AD).

Internal Kubernetes dashboards will not tell you when AWS us-east-1 is reporting EBS issues, or when GCE has a regional control plane incident. You have to correlate cluster symptoms with provider status. PulsAPI monitors all three providers at component and region level, so when nodes flap or PVCs fail to attach, you have an answer in 30 seconds rather than 30 minutes.

Tie this together in your runbooks: every Kubernetes alert template should include a quick check of the underlying cloud provider's status for the affected region. The five seconds it takes to glance at PulsAPI saves engineers from chasing internal bugs that originate two layers down.
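A runbook step like this can be backed by a simple lookup from cluster symptom to the provider components worth checking. The sketch below is purely illustrative: the symptom names, the mapping, and the incident record shape are assumptions, and it does not represent the PulsAPI API or any provider's status feed format.

```python
# Hypothetical mapping from cluster symptom to provider components,
# based on the dependencies listed in this section.
SYMPTOM_COMPONENTS = {
    "pvc_attach_failure": {"EBS", "Persistent Disk"},
    "node_flapping": {"EC2", "GCE"},
    "ingress_errors": {"ELB", "NLB", "Cloud Load Balancing", "Azure Load Balancer"},
}


def matching_incidents(symptom, region, incidents):
    """Return provider incidents that could explain a cluster symptom.

    incidents: iterable of dicts with "region" and "component" keys
    (an assumed shape, not a real feed format).
    """
    components = SYMPTOM_COMPONENTS.get(symptom, set())
    return [
        incident for incident in incidents
        if incident["region"] == region and incident["component"] in components
    ]


# Illustrative incident records, not real data.
open_incidents = [
    {"region": "us-east-1", "component": "EBS", "title": "Degraded volume attach"},
    {"region": "us-west-2", "component": "EBS", "title": "Elevated API errors"},
    {"region": "us-east-1", "component": "IAM", "title": "Increased latency"},
]
hits = matching_incidents("pvc_attach_failure", "us-east-1", open_incidents)
```

In this example only the us-east-1 EBS incident is returned, which is the one entry an engineer debugging failed PVC attachments in that region needs to see first.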

FAQ: Kubernetes Cluster Monitoring

Do I need Prometheus to monitor Kubernetes? Practically yes — kube-state-metrics, the Prometheus node exporter, and the Prometheus Operator are the de facto standard. Managed alternatives (Datadog, New Relic, Grafana Cloud) build on the same metrics.

What is the most important single alert to set up first? Sustained API server 99th percentile latency above 1 second. It is the earliest indicator that something is going wrong cluster-wide.

How does third-party status monitoring fit in? Managed clusters depend on dozens of cloud provider components. PulsAPI normalises those status feeds so cluster operators can correlate internal symptoms with provider incidents in real time.

About the Author

Sofia Andrade, Senior Infrastructure Engineer

Sofia is a senior infrastructure engineer at PulsAPI who specialises in on-call tooling and incident response automation. She has worked in SRE roles at cloud-native companies for over eight years.