Grafana and Kubernetes
This lesson is for when you open Grafana (or similar) to check a Node.js service running on Kubernetes — typically with Prometheus scraping cAdvisor / kube-state-metrics / your app’s /metrics endpoint.
Prerequisite: Reading memory graphs (sawtooth vs staircase).
The panels you should have open
| Panel | What it tells you |
|---|---|
| Container memory (working set) | Non-reclaimable memory counted against your limit; start here |
| Memory % of limit | How close you are to OOM kill |
| Pod restarts | Shows the container was killed and restarted; confirm the reason with `kubectl describe` |
| Request rate (RPS) | Separates leak from traffic growth |
| Node.js heap (if exported) | JS-specific view inside the container |
Stack them vertically with a shared time axis and deploy annotations.

    flowchart TD
      subgraph grafana [Grafana dashboard]
        mem[container_memory_working_set_bytes]
        pct[memory / limit %]
        rps[http_requests_total rate]
        restarts[kube_pod_container_status_restarts_total]
      end
      mem --> leak{Staircase at flat RPS?}
      pct --> oom{Approaching 100%?}
      restarts --> oom
      leak -->|yes| investigate[Reproduce locally]
      oom -->|yes| incident[OOMKill imminent or happening]

PromQL queries (copy-paste starting points)
Adjust namespace, pod, and container labels to match your setup. These assume the common kube-prometheus-stack label scheme.
Container memory — working set (primary chart)
Working set ≈ memory the kernel won’t easily reclaim. This is what matters for OOM.

    container_memory_working_set_bytes{
      namespace="production",
      pod=~"my-api-.*",
      container="my-api"
    }

Healthy: sawtooth or flat band. Leak: staircase, where the baseline rises over hours or days.
Memory as % of limit

    100 * container_memory_working_set_bytes{namespace="production", pod=~"my-api-.*", container="my-api"}
      / container_spec_memory_limit_bytes{namespace="production", pod=~"my-api-.*", container="my-api"}

| Range | Meaning |
|---|---|
| < 60% | Comfortable headroom |
| 60–90% | Watch the trend: is it a leak, or does the service need a higher limit? |
| > 90% sustained | OOM kill likely soon |
| Hits 100% + restart | See Log signatures — exit 137 |
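
The percentage and its ranges reduce to a little arithmetic. As a minimal sketch in Node.js, assuming you already have the two byte values in hand (for example from a Prometheus query), the hypothetical helpers `memoryPercentOfLimit` and `classifyHeadroom` below mirror the table. The 60 and 90 defaults are illustrative thresholds, not fixed rules; tune them per service.

```javascript
const MiB = 1024 * 1024;

// Hypothetical helper: percentage of the memory limit currently in use.
// A limit of 0 is treated as "no limit set" (an assumption about how the
// limit metric is reported), so no percentage can be computed.
function memoryPercentOfLimit(workingSetBytes, limitBytes) {
  if (!(limitBytes > 0)) return null;
  return (100 * workingSetBytes) / limitBytes;
}

// Hypothetical helper: map a percentage to the ranges in the table.
// watchAt and dangerAt are assumed defaults, not recommendations.
function classifyHeadroom(percent, { watchAt = 60, dangerAt = 90 } = {}) {
  if (percent === null) return "no limit set";
  if (percent >= 100) return "at limit";
  if (percent > dangerAt) return "OOM kill likely soon";
  if (percent >= watchAt) return "watch trend";
  return "comfortable headroom";
}

const percent = memoryPercentOfLimit(460 * MiB, 512 * MiB);
console.log(`${percent.toFixed(1)}% of limit: ${classifyHeadroom(percent)}`);
```

A single reading only tells you where you are; the "sustained" qualifier in the table still requires looking at the trend over time.
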
Pod restarts (OOM confirmation)

    increase(kube_pod_container_status_restarts_total{
      namespace="production",
      pod=~"my-api-.*",
      container="my-api"
    }[1h])

Spikes every N hours while the memory chart sawtooths up to the limit → classic OOMKill loop.
Correlate with:

    kubectl describe pod my-api-xxxxx -n production
    # Last State: Terminated, Reason: OOMKilled, Exit Code: 137

Request rate (traffic correlation)

    sum(rate(http_requests_total{namespace="production", job="my-api"}[5m]))

Leak signal: memory slope ↑ while this line is flat.
Not a leak: memory and RPS rise together — may be normal caching or proportional load.
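
To make "memory slope up while RPS is flat" concrete, here is a rough sketch that fits a least-squares line through each series (the same idea PromQL's `deriv()` is based on) and compares relative growth over the window. The function names, the `[timestampSeconds, value]` sample shape, and the 10% threshold are all assumptions for illustration, not part of any library.

```javascript
// Least-squares slope of [timestampSeconds, value] samples, in units per second.
function slopePerSecond(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanT = samples.reduce((sum, [t]) => sum + t, 0) / n;
  const meanV = samples.reduce((sum, [, v]) => sum + v, 0) / n;
  let num = 0;
  let den = 0;
  for (const [t, v] of samples) {
    num += (t - meanT) * (v - meanV);
    den += (t - meanT) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Fitted growth across the window, as a fraction of the mean value.
function relativeGrowth(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const duration = samples[n - 1][0] - samples[0][0];
  const mean = samples.reduce((sum, [, v]) => sum + v, 0) / n;
  return mean === 0 ? 0 : (slopePerSecond(samples) * duration) / mean;
}

// Hypothetical classifier; threshold = 0.1 means "more than 10% growth".
function classifyTrend(memorySamples, rpsSamples, threshold = 0.1) {
  const memGrowth = relativeGrowth(memorySamples);
  const rpsGrowth = relativeGrowth(rpsSamples);
  if (memGrowth <= threshold) return "stable";
  return rpsGrowth > threshold ? "grows with traffic" : "leak suspicion";
}

// Made-up hourly samples: memory in MB, request rate in RPS.
const memory = [[0, 200], [3600, 250], [7200, 300], [10800, 350]];
const flatRps = [[0, 100], [3600, 100], [7200, 100], [10800, 100]];
console.log(classifyTrend(memory, flatRps));
```

A linear fit is a blunt instrument: a single GC trough or traffic burst at the edge of the window can skew it, so treat the result as a prompt to look at the chart rather than a verdict.
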
Node.js heap (if you export Prometheus metrics)
Many Node apps export heap metrics with prom-client or OpenTelemetry:

    nodejs_heap_size_used_bytes{namespace="production", pod=~"my-api-.*"}

Compare with container working set:
| Pattern | Likely cause |
|---|---|
| Heap ↑, RSS ↑ | JS object leak (see Node.js lessons) |
| Heap flat, RSS ↑ | Buffers, native addons, external memory — streams, DB drivers |
| Both flat, restarts anyway | Limit set too low for legitimate baseline |
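
If you do not export heap metrics yet, the same heap-versus-RSS comparison can be made from inside the process with the built-in `process.memoryUsage()`. The sketch below takes a snapshot and classifies the change between two snapshots along the lines of the table; `memorySnapshot`, `classifyGrowth`, and the 50 MiB noise floor are hypothetical names and values chosen for illustration.

```javascript
const MiB = 1024 * 1024;

// process.memoryUsage() is a Node.js built-in; all values are in bytes.
function memorySnapshot() {
  const { rss, heapUsed, heapTotal, external, arrayBuffers } = process.memoryUsage();
  return { rss, heapUsed, heapTotal, external, arrayBuffers };
}

// Hypothetical classifier mirroring the table above.
// minGrowthBytes is an assumed noise floor, not a recommended value.
function classifyGrowth(before, after, minGrowthBytes = 50 * MiB) {
  const heapGrew = after.heapUsed - before.heapUsed > minGrowthBytes;
  const rssGrew = after.rss - before.rss > minGrowthBytes;
  if (heapGrew && rssGrew) return "JS object leak";
  if (rssGrew) return "buffers, native addons, or external memory";
  if (heapGrew) return "heap growth without RSS growth (inconclusive)";
  return "no significant growth";
}

const snapshot = memorySnapshot();
console.log(
  `rss=${(snapshot.rss / MiB).toFixed(1)} MiB ` +
    `heapUsed=${(snapshot.heapUsed / MiB).toFixed(1)} MiB ` +
    `external=${(snapshot.external / MiB).toFixed(1)} MiB`
);
```

In practice you would compare snapshots taken hours apart at similar traffic levels, since a single pair taken around a GC cycle can mislead.
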
Chart shapes on K8s (annotated)
Healthy at steady traffic

    xychart-beta
      title "Working set: healthy sawtooth"
      x-axis [00h, 04h, 08h, 12h, 16h, 20h, 24h]
      y-axis "Memory MB" 0 --> 512
      line [280, 320, 290, 330, 285, 325, 295]

GC and allocation cycles visible. Troughs return to roughly the same level.
JS heap leak (staircase before kill)

    xychart-beta
      title "Working set: leak then OOMKill reset"
      x-axis [t1, t2, t3, t4, t5, t6, t7, t8]
      y-axis "Memory MB" 0 --> 512
      line [200, 280, 360, 440, 510, 220, 300, 380]

- t1–t5: staircase climb toward `resources.limits.memory`
- t5→t6: sharp drop, because the container is OOM-killed and restarted
- t6–t8: staircase repeats after the restart
In Grafana you’ll see the restart counter increment at each kill (here, between t5 and t6).
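
The sharp drop is what separates a kill-and-reset from an ordinary GC trough. A minimal sketch of spotting it in a series of working-set samples follows, using the values from the chart above. `findResets` and the 40% drop threshold are assumptions for illustration; a real check should also confirm against the restart counter.

```javascript
// Return the indexes where the value fell by at least dropFraction
// relative to the previous sample. dropFraction = 0.4 is an assumed
// threshold meant to ignore normal GC troughs.
function findResets(series, dropFraction = 0.4) {
  const resets = [];
  for (let i = 1; i < series.length; i++) {
    const previous = series[i - 1];
    const current = series[i];
    if (previous > 0 && (previous - current) / previous >= dropFraction) {
      resets.push(i);
    }
  }
  return resets;
}

// Values from the two charts above, in MB.
const leakThenReset = [200, 280, 360, 440, 510, 220, 300, 380];
const healthySawtooth = [280, 320, 290, 330, 285, 325, 295];

console.log(findResets(leakThenReset));
console.log(findResets(healthySawtooth));
```

The healthy sawtooth never drops by more than about 14% between samples, so it produces no resets, while the leak series is flagged at the t5→t6 transition.
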
Post-deploy cache warmup (not necessarily a leak)

    xychart-beta
      title "Step up after deploy then plateau"
      x-axis [before, deploy, +2h, +6h, +12h, +24h]
      y-axis "Memory MB" 0 --> 400
      line [180, 180, 260, 270, 268, 265]

Step at deploy, then flat. Validate over 48h — if it keeps climbing, it’s a leak.
A practical Grafana workflow
- Time range: Last 24h (incident) or 7d (trend)
- Add deploy annotations from your CD pipeline
- Panel 1: `container_memory_working_set_bytes` per pod
- Panel 2: Memory % of limit
- Panel 3: `rate(http_requests_total)` or ingress RPS
- Panel 4: Restart count
- Ask:
  - Is memory climbing while RPS is flat? → leak suspicion
  - Did restarts coincide with memory hitting limit? → OOMKill
  - Did it start right after a deploy? → bisect that release
Multi-pod: which replica is leaking?

    topk(5,
      max by (pod) (
        container_memory_working_set_bytes{
          namespace="production",
          container="my-api"
        }
      )
    )

One pod much higher than its siblings → sticky sessions, uneven load, or a per-pod state leak (in-memory cache without a shared store).
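
"Much higher than siblings" can be pinned down by comparing each pod with the median of the group. The sketch below assumes you have already collected one working-set value per pod (for example from the query above); `findOutlierPods`, the pod names, and the 1.5× ratio are hypothetical.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Return the pods whose memory exceeds ratio x the group median.
// With fewer than three pods there is no meaningful "sibling" baseline.
function findOutlierPods(memoryByPod, ratio = 1.5) {
  const entries = Object.entries(memoryByPod);
  if (entries.length < 3) return [];
  const groupMedian = median(entries.map(([, bytes]) => bytes));
  return entries
    .filter(([, bytes]) => bytes > groupMedian * ratio)
    .map(([pod]) => pod);
}

// Made-up working-set values in MB for four replicas.
const workingSetMb = {
  "my-api-a": 300,
  "my-api-b": 310,
  "my-api-c": 295,
  "my-api-d": 620,
};
console.log(findOutlierPods(workingSetMb));
```

The median is used rather than the mean so that the outlier itself does not drag the baseline upward.
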
Alert rules worth having
Leak trend (needs tuning per service):

    # 6h linear growth while traffic flat (sketch; adjust thresholds)
    deriv(container_memory_working_set_bytes{container="my-api"}[6h]) > 0

About to be OOMKilled:

    # Containers without a limit typically report 0, which would divide to +Inf; skip them
    container_memory_working_set_bytes{container="my-api"}
      / (container_spec_memory_limit_bytes{container="my-api"} > 0)
      > 0.9

Restart storm:

    increase(kube_pod_container_status_restarts_total{container="my-api"}[15m]) > 2

What Grafana won’t tell you
Dashboards show that memory grows, not what retains it. Next steps:
- Identify the deploy or code path (Reproduce reliably)
- `kubectl port-forward` + `node --inspect` or heap snapshot in staging
- Heap snapshot diffing
Quick reference
| Grafana signal | Likely issue | Next step |
|---|---|---|
| Staircase, flat RPS | Memory leak | Local repro + heap snapshot |
| Spike to limit, restart, repeat | OOMKill loop | Fix leak or raise limit temporarily |
| Step at deploy, then flat | Cache warmup | Monitor 48h |
| One pod outlier | Uneven state / load | Check routing, session affinity |
| RSS ↑, heap flat | Buffers / native | Streams, DB pool, external memory |