Grafana and Kubernetes

This lesson is for when you open Grafana (or similar) to check a Node.js service running on Kubernetes — typically with Prometheus scraping cAdvisor / kube-state-metrics / your app’s /metrics endpoint.

Prerequisite: Reading memory graphs (sawtooth vs staircase).

| Panel | What it tells you |
| --- | --- |
| Container memory (working set) | What Kubernetes counts against your limit; start here |
| Memory % of limit | How close you are to an OOM kill |
| Pod restarts | Confirms the container was killed and restarted |
| Request rate (RPS) | Separates a leak from traffic growth |
| Node.js heap (if exported) | JS-specific view inside the container |

Stack them vertically with a shared time axis and deploy annotations.

flowchart TD
  subgraph grafana [Grafana dashboard]
    mem[container_memory_working_set_bytes]
    pct[memory / limit %]
    rps[http_requests_total rate]
    restarts[kube_pod_container_status_restarts_total]
  end
  mem --> leak{Staircase at flat RPS?}
  pct --> oom{Approaching 100%?}
  restarts --> oom
  leak -->|yes| investigate[Reproduce locally]
  oom -->|yes| incident[OOMKill imminent or happening]

PromQL queries (copy-paste starting points)

Adjust namespace, pod, and container labels to match your setup. These assume the common kube-prometheus-stack label scheme.

Container memory — working set (primary chart)

Working set ≈ memory the kernel won’t easily reclaim (cgroup usage minus inactive file cache). This is what matters for OOM.

container_memory_working_set_bytes{
  namespace="production",
  pod=~"my-api-.*",
  container="my-api"
}

Healthy: sawtooth or flat band. Leak: staircase — baseline rises over hours/days.
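
One rough way to tell the two shapes apart outside Grafana is to compare the trough early in a window with the trough late in it. The sketch below assumes you have exported working-set samples (in MB) for a single pod. The function names, the one-third split, and the 20% threshold are illustrative assumptions, not Grafana or Prometheus features.

```python
def baseline_growth(samples_mb):
    """Relative rise of the trough from the first third to the last third."""
    if len(samples_mb) < 2:
        raise ValueError("need at least two samples")
    k = max(1, len(samples_mb) // 3)  # size of the leading and trailing slices
    first_trough = min(samples_mb[:k])
    last_trough = min(samples_mb[-k:])
    if first_trough <= 0:
        raise ValueError("samples must be positive")
    return (last_trough - first_trough) / first_trough


def classify_shape(samples_mb, threshold=0.2):
    """'staircase' if the trough rose by more than `threshold`, else 'sawtooth or flat'."""
    if baseline_growth(samples_mb) > threshold:
        return "staircase"
    return "sawtooth or flat"


healthy = [280, 320, 290, 330, 285, 325, 295]  # troughs keep returning to ~280-295
leaking = [200, 280, 360, 440, 510]            # baseline never comes back down

print(classify_shape(healthy))  # sawtooth or flat
print(classify_shape(leaking))  # staircase
```

Comparing troughs rather than peaks keeps normal GC spikes from being mistaken for growth.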

Memory % of limit

100 *
container_memory_working_set_bytes{namespace="production", pod=~"my-api-.*", container="my-api"}
/
container_spec_memory_limit_bytes{namespace="production", pod=~"my-api-.*", container="my-api"}

| Range | Meaning |
| --- | --- |
| < 60% | Comfortable headroom |
| 60–85% | Watch the trend: leak, or does the limit need raising? |
| > 90% sustained | OOM kill likely soon |
| Hits 100% + restart | See Log signatures (exit 137) |
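
As a worked instance of the percentage above, here is a minimal sketch of the same arithmetic. It assumes that a container with no memory limit reports a limit of 0 (typical for cAdvisor), in which case the ratio is undefined rather than "safe". The function name is hypothetical.

```python
from typing import Optional

MIB = 1024 * 1024


def percent_of_limit(working_set_bytes: float, limit_bytes: float) -> Optional[float]:
    """Working set as a percentage of the memory limit, or None if no limit is set."""
    if limit_bytes <= 0:
        # Assumption: a limit of 0 means "no limit configured", so there is no ratio.
        return None
    return 100.0 * working_set_bytes / limit_bytes


print(percent_of_limit(460 * MIB, 512 * MIB))  # 89.84375
print(percent_of_limit(460 * MIB, 0))          # None
```

A 460 MiB working set against a 512 MiB limit sits just under the 90% row of the table.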

Pod restarts

increase(kube_pod_container_status_restarts_total{
  namespace="production",
  pod=~"my-api-.*",
  container="my-api"
}[1h])

Spikes every N hours while memory chart sawtooths up to the limit → classic OOMKill loop.

Correlate with:

kubectl describe pod my-api-xxxxx -n production
# Last State: Terminated, Reason: OOMKilled, Exit Code: 137

Request rate (RPS)

sum(rate(http_requests_total{namespace="production", job="my-api"}[5m]))

Leak signal: memory slope ↑ while this line is flat.

Not a leak: memory and RPS rise together — may be normal caching or proportional load.
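
The comparison of memory slope against request rate can be sketched as follows. This assumes two equally spaced series over the same window. The helper names and the 20% / 5% thresholds are illustrative assumptions to tune per service.

```python
def slope(values):
    """Least-squares slope per sample step."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def relative_growth(values):
    """Fitted change across the window as a fraction of the window mean."""
    mean = sum(values) / len(values)
    return slope(values) * (len(values) - 1) / mean


def diagnose(memory, rps, grow=0.2, flat=0.05):
    """Classify a window as leak suspicion, traffic-driven growth, or neither."""
    if relative_growth(memory) <= grow:
        return "no sustained growth"
    if relative_growth(rps) <= flat:
        return "leak suspicion"  # memory climbs while traffic is flat or falling
    return "memory tracks traffic"


memory_mb = [200, 240, 280, 320, 360]
print(diagnose(memory_mb, [100, 101, 99, 100, 100]))   # leak suspicion
print(diagnose(memory_mb, [100, 120, 140, 160, 180]))  # memory tracks traffic
```

Fitting a line instead of comparing first and last samples makes the result less sensitive to a single GC trough or spike at the window edge.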

Node.js heap (if you export Prometheus metrics)

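Many Node apps export metrics with prom-client or OpenTelemetry. The name below is prom-client’s default heap metric; OpenTelemetry exporters typically use different names: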

nodejs_heap_size_used_bytes{namespace="production", pod=~"my-api-.*"}

Compare with container working set:

| Pattern | Likely cause |
| --- | --- |
| Heap ↑, RSS ↑ | JS object leak (see Node.js lessons) |
| Heap flat, RSS ↑ | Buffers, native addons, external memory (streams, DB drivers) |
| Both flat, restarts anyway | Limit set too low for legitimate baseline |

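The table can be read as a small decision rule. The sketch below assumes heap and container memory samples covering the same window. `grew`, `likely_cause`, and the 20% threshold are hypothetical names and values chosen for illustration.

```python
def grew(samples, threshold=0.2):
    """True if the last sample exceeds the first by more than `threshold` (relative)."""
    return (samples[-1] - samples[0]) / samples[0] > threshold


def likely_cause(heap_mb, rss_mb, restarts=0):
    """Map heap and container memory trends onto the rows of the table."""
    heap_up, rss_up = grew(heap_mb), grew(rss_mb)
    if heap_up and rss_up:
        return "JS object leak"
    if rss_up:
        return "off-heap growth (buffers, native addons, external memory)"
    if not heap_up and restarts > 0:
        return "limit too low for baseline"
    return "no pattern from the table"


print(likely_cause([100, 150, 200], [250, 300, 360]))              # JS object leak
print(likely_cause([100, 102, 101], [250, 300, 360]))              # off-heap growth (...)
print(likely_cause([100, 102, 101], [250, 255, 252], restarts=3))  # limit too low for baseline
```
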
xychart-beta
  title "Working set — healthy sawtooth"
  x-axis [00h, 04h, 08h, 12h, 16h, 20h, 24h]
  y-axis "Memory MB" 0 --> 512
  line [280, 320, 290, 330, 285, 325, 295]

GC and allocation cycles visible. Troughs return to roughly the same level.

xychart-beta
  title "Working set — leak then OOMKill reset"
  x-axis [t1, t2, t3, t4, t5, t6, t7, t8]
  y-axis "Memory MB" 0 --> 512
  line [200, 280, 360, 440, 510, 220, 300, 380]
  • t1–t5: staircase climb toward resources.limits.memory
  • t5→t6: sharp drop — pod killed and replaced
  • t6–t8: staircase repeats on the new pod

In Grafana, the restart counter increments at the t5→t6 drop and again each time the cycle repeats.
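
If you only have the memory series, the reset points can be located by looking for sharp drops between consecutive samples. This is a minimal sketch; the function name and the 40% drop threshold are assumptions, and a real restart should still be confirmed against the restart counter.

```python
def find_resets(samples_mb, drop_fraction=0.4):
    """Indices where memory fell by more than `drop_fraction` versus the previous sample."""
    return [
        i
        for i in range(1, len(samples_mb))
        if samples_mb[i] < samples_mb[i - 1] * (1 - drop_fraction)
    ]


leak_then_kill = [200, 280, 360, 440, 510, 220, 300, 380]  # t1..t8 from the chart
healthy = [280, 320, 290, 330, 285, 325, 295]

print([f"t{i + 1}" for i in find_resets(leak_then_kill)])  # ['t6']
print(find_resets(healthy))                                # []
```

Ordinary GC troughs in the healthy series drop by roughly 10-15%, well under the threshold, so they are not reported as resets.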

Post-deploy cache warmup (not necessarily a leak)

xychart-beta
  title "Step up after deploy then plateau"
  x-axis [before, deploy, +2h, +6h, +12h, +24h]
  y-axis "Memory MB" 0 --> 400
  line [180, 180, 260, 270, 268, 265]

Step at deploy, then flat. Validate over 48h — if it keeps climbing, it’s a leak.
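
The "step, then flat" check can be approximated by asking whether the most recent samples are still rising. This is a sketch under stated assumptions: the function name, the three-sample tail, and the 5% tolerance are illustrative, and the tail should cover the validation period you care about (for example 48h).

```python
def still_climbing(samples_mb, tail=3, tolerance=0.05):
    """True if memory rose by more than `tolerance` (relative) across the last `tail` samples."""
    window = samples_mb[-tail:]
    return (window[-1] - window[0]) / window[0] > tolerance


warmup = [180, 180, 260, 270, 268, 265]  # step after deploy, then plateau
leak = [200, 280, 360, 440, 510]         # keeps climbing

print(still_climbing(warmup))  # False
print(still_climbing(leak))    # True
```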

  1. Time range: Last 24h (incident) or 7d (trend)
  2. Add deploy annotations from your CD pipeline
  3. Panel 1: container_memory_working_set_bytes per pod
  4. Panel 2: Memory % of limit
  5. Panel 3: rate(http_requests_total) or ingress RPS
  6. Panel 4: Restart count
  7. Ask:
    • Is memory climbing while RPS is flat? → leak suspicion
    • Did restarts coincide with memory hitting limit? → OOMKill
    • Did it start right after a deploy? → bisect that release

Highest-memory pods:

topk(5,
  max by (pod) (
    container_memory_working_set_bytes{
      namespace="production",
      container="my-api"
    }
  )
)

One pod much higher than siblings → sticky session, uneven load, or per-pod state leak (in-memory cache without shared store).
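
To make "much higher than siblings" concrete, one option is to compare each pod with the median of the group. The sketch below uses hypothetical pod names and values, and the 1.5× ratio is an assumption to adjust.

```python
from statistics import median


def outlier_pods(memory_by_pod, ratio=1.5):
    """Pods whose working set exceeds `ratio` times the median across pods."""
    typical = median(memory_by_pod.values())
    return sorted(pod for pod, mb in memory_by_pod.items() if mb > ratio * typical)


# Hypothetical snapshot of working set per pod, in MB.
pods = {"my-api-a": 300, "my-api-b": 310, "my-api-c": 295, "my-api-d": 520}

print(outlier_pods(pods))  # ['my-api-d']
```

The median is used rather than the mean so that the outlier itself does not pull the reference value upward.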

Leak trend (needs tuning per service):

# Per-second slope of a least-squares fit over 6h. Sketch only: this does not
# check traffic, and any positive slope matches, so raise the threshold above 0.
deriv(container_memory_working_set_bytes{container="my-api"}[6h]) > 0
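
Because deriv() returns a per-second slope in bytes, a threshold is easier to choose in MiB per hour and then convert. The helpers below are a hypothetical sketch of that arithmetic, plus a naive linear estimate of the time left before the limit is reached.

```python
MIB = 1024 * 1024


def mib_per_hour_to_bytes_per_second(mib_per_hour):
    """Convert a growth rate in MiB/hour to the bytes/second unit deriv() reports."""
    return mib_per_hour * MIB / 3600


def hours_until_limit(current_bytes, limit_bytes, slope_bytes_per_second):
    """Linear extrapolation of time to the limit; None if memory is not growing."""
    if slope_bytes_per_second <= 0:
        return None
    return (limit_bytes - current_bytes) / slope_bytes_per_second / 3600


threshold = mib_per_hour_to_bytes_per_second(10)
print(round(threshold, 2))                                            # 2912.71
print(round(hours_until_limit(360 * MIB, 512 * MIB, threshold), 1))   # 15.2
```

With these numbers, an alert on `deriv(...) > 2913` would fire for sustained growth above roughly 10 MiB/hour; the right value depends on the service.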

About to be OOMKilled:

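# "> 0" drops containers with no limit (reported as 0), which would otherwise divide to +Inf
container_memory_working_set_bytes{container="my-api"}
  / (container_spec_memory_limit_bytes{container="my-api"} > 0)
  > 0.9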

Restart storm:

increase(kube_pod_container_status_restarts_total{container="my-api"}[15m]) > 2

Dashboards show that memory grows, not what retains it. Next steps:

  1. Identify the deploy or code path (Reproduce reliably)
  2. kubectl port-forward + node --inspect or heap snapshot in staging
  3. Heap snapshot diffing

| Grafana signal | Likely issue | Next step |
| --- | --- | --- |
| Staircase, flat RPS | Memory leak | Local repro + heap snapshot |
| Spike to limit, restart, repeat | OOMKill loop | Fix leak or raise limit temporarily |
| Step at deploy, then flat | Cache warmup | Monitor 48h |
| One pod outlier | Uneven state / load | Check routing, session affinity |
| RSS ↑, heap flat | Buffers / native | Streams, DB pool, external memory |