Grafana and Kubernetes

This lesson is for when you open Grafana (or similar) to check a Node.js service running on Kubernetes — typically with Prometheus scraping cAdvisor / kube-state-metrics / your app’s /metrics endpoint.

Prerequisite: Reading memory graphs (sawtooth vs staircase).

The panels you should have open

Panel	What it tells you
Container memory (working set)	What K8s counts against your limit; start here
Memory % of limit	How close you are to OOM kill
Pod restarts	Shows the container was killed and restarted; the termination reason confirms OOM
Request rate (RPS)	Separates leak from traffic growth
Node.js heap (if exported)	JS-specific view inside the container

Stack them vertically with a shared time axis and deploy annotations.

flowchart TD
  subgraph grafana [Grafana dashboard]
    mem[container_memory_working_set_bytes]
    pct[memory / limit %]
    rps[http_requests_total rate]
    restarts[kube_pod_container_status_restarts_total]
  end
  mem --> leak{Staircase at flat RPS?}
  rps --> leak
  pct --> oom{Approaching 100%?}
  restarts --> oom
  leak -->|yes| investigate[Reproduce locally]
  oom -->|yes| incident[OOMKill imminent or happening]

PromQL queries (copy-paste starting points)

Adjust namespace, pod, and container labels to match your setup. These assume the common kube-prometheus-stack label scheme.

Container memory — working set (primary chart)

Working set ≈ memory the kernel won’t easily reclaim (cAdvisor reports total usage minus inactive file cache). This is what matters for OOM.

container_memory_working_set_bytes{
  namespace="production",
  pod=~"my-api-.*",
  container="my-api"
}

Healthy: sawtooth or flat band. Leak: staircase — baseline rises over hours/days.
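
One way to make the sawtooth versus staircase distinction mechanical is to compare the lowest reading in successive time windows. The sketch below assumes evenly spaced working-set samples (in MB here) that you have already fetched from Prometheus; the function names, the window size, and the 10% tolerance are illustrative assumptions, not part of any library.

```javascript
// Lowest reading in each consecutive, non-overlapping window of samples.
function windowMinima(samples, windowSize) {
  const minima = [];
  for (let i = 0; i + windowSize <= samples.length; i += windowSize) {
    minima.push(Math.min(...samples.slice(i, i + windowSize)));
  }
  return minima;
}

// True when the last window's floor sits clearly above the first window's floor.
// tolerance is an assumed noise margin (10% by default); tune it per service.
function baselineIsRising(samples, windowSize, tolerance = 0.1) {
  const minima = windowMinima(samples, windowSize);
  if (minima.length < 2) return false;
  return minima[minima.length - 1] > minima[0] * (1 + tolerance);
}

// Sawtooth: troughs return to roughly the same level.
console.log(baselineIsRising([280, 320, 290, 330, 285, 325, 295], 3));
// Staircase: each window's floor is higher than the last.
console.log(baselineIsRising([200, 280, 360, 440, 510], 2));
```

Comparing window minima rather than raw values keeps normal GC peaks from being mistaken for growth.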

Memory as % of limit

100 *
  container_memory_working_set_bytes{namespace="production", pod=~"my-api-.*", container="my-api"}
/
  container_spec_memory_limit_bytes{namespace="production", pod=~"my-api-.*", container="my-api"}

Range	Meaning
< 60%	Comfortable headroom
60–90%	Watch trend: leak or need higher limit?
> 90% sustained	OOM kill likely soon
Hits 100% + restart	See Log signatures — exit 137
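
The percentage and the "sustained" qualifier can be sketched in a few lines. This assumes you already have working-set and limit values in bytes; the helper names are hypothetical, and a limit of 0 is treated here as "no limit set" so that it never produces a misleading ratio.

```javascript
const MiB = 1024 * 1024;

// Working set as a percentage of the memory limit, or null when no limit is set.
function percentOfLimit(workingSetBytes, limitBytes) {
  if (!(limitBytes > 0)) return null;
  return (100 * workingSetBytes) / limitBytes;
}

// "Sustained" here means every reading in the window is above the threshold.
function sustainedAbove(percentages, threshold = 90) {
  return (
    percentages.length > 0 &&
    percentages.every((p) => p !== null && p > threshold)
  );
}

// Three consecutive readings against an assumed 512 MiB limit.
const readings = [470, 475, 480].map((mb) => percentOfLimit(mb * MiB, 512 * MiB));
console.log(readings.map((p) => p.toFixed(1)), sustainedAbove(readings));
```

Requiring every sample in the window to exceed the threshold avoids paging on a single GC peak.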

Pod restarts (OOM confirmation)

increase(kube_pod_container_status_restarts_total{
  namespace="production",
  pod=~"my-api-.*",
  container="my-api"
}[1h])

Spikes every N hours while the memory chart climbs to the limit and resets → classic OOMKill loop.

Correlate with:

kubectl describe pod my-api-xxxxx -n production
# Last State: Terminated, Reason: OOMKilled, Exit Code: 137
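
Exit code 137 follows the common convention of 128 plus the signal number, and signal 9 is SIGKILL, which is what the kernel OOM killer sends. A small sketch of that arithmetic using Node's os.constants.signals table (the helper name is hypothetical):

```javascript
const os = require("os");

// Map a container exit code to the signal that ended the process, if any.
// Exit codes above 128 conventionally mean "killed by signal (code - 128)".
function signalNameFromExitCode(exitCode) {
  if (exitCode <= 128) return null;
  const signalNumber = exitCode - 128;
  const match = Object.entries(os.constants.signals).find(
    ([, number]) => number === signalNumber
  );
  return match ? match[0] : null;
}

console.log(signalNameFromExitCode(137));
```

SIGKILL alone does not prove an OOM kill; the Reason: OOMKilled field in the pod status is the confirmation.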

Request rate (traffic correlation)

sum(rate(http_requests_total{namespace="production", job="my-api"}[5m]))

Leak signal: memory slope ↑ while this line is flat.

Not a leak: memory and RPS rise together — may be normal caching or proportional load.
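
The "memory rises while traffic stays flat" check can be approximated offline with a least-squares slope over both series. This is a sketch under assumptions: both arrays are evenly spaced samples over the same window, and the 20% and 10% thresholds are placeholders to tune, not recommended values.

```javascript
// Least-squares slope of evenly spaced samples (units per sample step).
function slope(values) {
  const n = values.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < n; i++) {
    numerator += (i - meanX) * (values[i] - meanY);
    denominator += (i - meanX) ** 2;
  }
  return numerator / denominator;
}

// Fitted change across the whole window, relative to the series mean.
function relativeGrowth(values) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return mean === 0 ? 0 : (slope(values) * (values.length - 1)) / mean;
}

// Leak suspicion: memory trends up while request rate stays roughly flat.
function looksLikeLeak(memory, rps, memThreshold = 0.2, rpsThreshold = 0.1) {
  return (
    relativeGrowth(memory) > memThreshold &&
    Math.abs(relativeGrowth(rps)) < rpsThreshold
  );
}

console.log(looksLikeLeak([200, 280, 360, 440, 510], [100, 102, 99, 101, 100]));
console.log(looksLikeLeak([200, 280, 360, 440, 510], [100, 140, 180, 220, 255]));
```

Normalizing by the mean lets memory (bytes) and traffic (requests per second) be compared on the same relative scale.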

Node.js heap (if you export Prometheus metrics)

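Many Node apps export heap metrics with prom-client or OpenTelemetry. The name below is prom-client’s default; other exporters may name it differently: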

nodejs_heap_size_used_bytes{namespace="production", pod=~"my-api-.*"}

Compare with the container working set, which for most Node services tracks the process RSS used in the table below:

Pattern	Likely cause
Heap ↑, RSS ↑	JS object leak (see Node.js lessons)
Heap flat, RSS ↑	Buffers, native addons, `external` memory — streams, DB drivers
Both flat, restarts anyway	Limit set too low for legitimate baseline
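
The same comparison can be made from inside the process with process.memoryUsage(), which reports rss, heapUsed, and external in bytes. The sketch below maps two readings onto the rows of the table above; the helper names and the 20 MiB growth threshold are assumptions for illustration.

```javascript
// Capture the fields that matter for the heap-versus-RSS comparison.
function snapshotMemory() {
  const { rss, heapUsed, external } = process.memoryUsage();
  return { rss, heapUsed, external };
}

// Classify growth between two readings. minGrowthBytes is an assumed
// noise floor; anything below it is treated as "flat".
function classifyGrowth(before, after, minGrowthBytes = 20 * 1024 * 1024) {
  const heapUp = after.heapUsed - before.heapUsed > minGrowthBytes;
  const rssUp = after.rss - before.rss > minGrowthBytes;
  if (heapUp && rssUp) return "js-object-leak";
  if (!heapUp && rssUp) return "buffers-or-native";
  return "no-clear-growth";
}

// In a real service the two readings would be taken minutes or hours apart.
const before = snapshotMemory();
const after = snapshotMemory();
console.log(classifyGrowth(before, after));
```

When the result points at buffers or native memory, the external field in the same readings is a reasonable next number to look at.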

Chart shapes on K8s (annotated)

Healthy at steady traffic

xychart-beta
  title "Working set — healthy sawtooth"
  x-axis [00h, 04h, 08h, 12h, 16h, 20h, 24h]
  y-axis "Memory MB" 0 --> 512
  line [280, 320, 290, 330, 285, 325, 295]

GC and allocation cycles visible. Troughs return to roughly the same level.

JS heap leak (staircase before kill)

xychart-beta
  title "Working set — leak then OOMKill reset"
  x-axis [t1, t2, t3, t4, t5, t6, t7, t8]
  y-axis "Memory MB" 0 --> 512
  line [200, 280, 360, 440, 510, 220, 300, 380]

t1–t5: staircase climb toward resources.limits.memory
t5→t6: sharp drop — pod killed and replaced
t6–t8: staircase repeats on the new pod

In Grafana you’ll see the restart counter increment at the t5→t6 drop, and again each time the cycle repeats.

Post-deploy cache warmup (not necessarily a leak)

xychart-beta
  title "Step up after deploy then plateau"
  x-axis [before, deploy, +2h, +6h, +12h, +24h]
  y-axis "Memory MB" 0 --> 400
  line [180, 180, 260, 270, 268, 265]

Step at deploy, then flat. Validate over 48h — if it keeps climbing, it’s a leak.

A practical Grafana workflow

Time range: Last 24h (incident) or 7d (trend)
Add deploy annotations from your CD pipeline
Panel 1: container_memory_working_set_bytes per pod
Panel 2: Memory % of limit
Panel 3: rate(http_requests_total) or ingress RPS
Panel 4: Restart count
Ask:
- Is memory climbing while RPS is flat? → leak suspicion
- Did restarts coincide with memory hitting limit? → OOMKill
- Did it start right after a deploy? → bisect that release

Multi-pod: which replica is leaking?

topk(5,
  max by (pod) (
    container_memory_working_set_bytes{
      namespace="production",
      container="my-api"
    }
  )
)

One pod much higher than siblings → sticky session, uneven load, or per-pod state leak (in-memory cache without shared store).
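
To turn "much higher than siblings" into a check, one option is to compare each pod against the median of the replica set. A sketch, assuming a map of pod name to working set (MB here) taken from the query above; the pod names and the 1.5x ratio are hypothetical.

```javascript
// Median of a list of numbers (average of the two middle values if even).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Pods whose working set exceeds the replica median by the given ratio.
function findOutlierPods(workingSetByPod, ratio = 1.5) {
  const med = median(Object.values(workingSetByPod));
  return Object.entries(workingSetByPod)
    .filter(([, value]) => value > med * ratio)
    .map(([pod]) => pod);
}

console.log(
  findOutlierPods({ "my-api-a": 300, "my-api-b": 310, "my-api-c": 295, "my-api-d": 620 })
);
```

The median is used instead of the mean so that the outlier itself does not drag the reference value upward.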

Alert rules worth having

Leak trend (needs tuning per service):

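# Sketch: deriv() is the per-second slope (bytes/s) fitted over 6h.
# "> 0" fires on any growth and does not check traffic; set a threshold and pair it with a flat-RPS condition.
deriv(container_memory_working_set_bytes{container="my-api"}[6h]) > 0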

About to be OOMKilled:

container_memory_working_set_bytes{container="my-api"}
  / (container_spec_memory_limit_bytes{container="my-api"} > 0) > 0.9

Restart storm:

increase(kube_pod_container_status_restarts_total{container="my-api"}[15m]) > 2

What Grafana won’t tell you

Dashboards show that memory grows, not what retains it. Next steps:

Identify the deploy or code path (Reproduce reliably)
kubectl port-forward + node --inspect or heap snapshot in staging
Heap snapshot diffing
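
For the heap snapshot step, one low-friction option in staging is to have the process write a snapshot when it receives a signal. This is a minimal sketch, assuming the container has a writable working directory and that SIGUSR2 is not used for anything else in your app; the function name is hypothetical. v8.writeHeapSnapshot() is a built-in Node.js API, and it blocks the process while it runs.

```javascript
const v8 = require("v8");

// Write a .heapsnapshot file to the current working directory on demand.
// The call is synchronous, so expect a pause on large heaps.
function installHeapSnapshotSignal(signal = "SIGUSR2") {
  process.on(signal, () => {
    const file = v8.writeHeapSnapshot();
    console.log(`heap snapshot written to ${file}`);
  });
}

installHeapSnapshotSignal();
```

Taking one snapshot shortly after startup and another after memory has climbed gives you the two files needed for diffing in Chrome DevTools. Trigger it with kubectl exec and kill -USR2 against the Node process ID inside the container.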

Quick reference

Grafana signal	Likely issue	Next step
Staircase, flat RPS	Memory leak	Local repro + heap snapshot
Spike to limit, restart, repeat	OOMKill loop	Fix leak or raise limit temporarily
Step at deploy, then flat	Cache warmup	Monitor 48h
One pod outlier	Uneven state / load	Check routing, session affinity
RSS ↑, heap flat	Buffers / native	Streams, DB pool, `external` memory