Commit 2293a7c

Author: Siva Guruvareddiar
Message: docs: add AI inference platform observability on Kubernetes blueprint
Parent: 0892369

1 file changed

Lines changed: 231 additions & 0 deletions

@@ -0,0 +1,231 @@
---
title: AI Inference Platform Observability on Kubernetes
date: 2026-06-09
author: Siva Guruvareddiar (Amazon Web Services)
cSpell:ignore: DCGM dcgm Guruvareddiar prefill HBM KEDA keda ArgoCD argocd autoscaler sidecar mTLS VirtualService DestinationRule PeerAuthentication FastMCP LangGraph CrewAI MutatingWebhook ValidatingWebhook chargeback sigv4 minReplica maxReplica prune selfHeal consecutiveGatewayErrors maxEjectionPercent baseEjectionTime outlierDetection traceparent httpx
---

## Summary

This blueprint covers end-to-end observability for AI inference platforms running on Kubernetes. It addresses five challenges that general-purpose Kubernetes observability guidance does not resolve: GPU hardware opacity, context propagation gaps across multi-agent AI system boundaries, per-tenant cost attribution for shared GPU clusters, high-cardinality GPU metrics at scale, and observability for disaggregated inference.

**Target audience:** Platform engineers, MLOps teams, and SREs operating GPU-accelerated LLM inference workloads on managed Kubernetes (EKS, GKE, AKS).

**Applies when:**

- You run NVIDIA GPU-accelerated pods on Kubernetes and need visibility beyond what `kubectl top` and standard `node_exporter` provide.
- You operate multi-agent AI systems (LangGraph, CrewAI, Strands Agents) where a single user request fans out across multiple LLM calls, tool invocations, and retrieval steps.
- You need per-namespace or per-team cost attribution for GPU workloads in a shared cluster.

**Does not apply when:**

- You run single-GPU developer workstations or single-node inference deployments.
- You run CPU-only inference (ONNX Runtime, OpenVINO), where the GPU metrics layer is unnecessary.
- You run model training workloads, whose tracing and autoscaling patterns differ significantly from those of inference.

---

## Common challenges

**Challenge 1: GPU hardware opacity.** Kubernetes exposes GPU *allocation* (`nvidia.com/gpu: 1` in a pod spec) but not *utilization*. A pod holding an A100 may be running at 8% SM utilization or at 95%. Without hardware-level metrics from NVIDIA DCGM (Data Center GPU Manager), autoscaling signals and cost attribution reflect only what was allocated, not what was used. Standard `node_exporter` does not expose DCGM counters.

**Challenge 2: Context propagation gaps across agent boundaries.** Agent frameworks orchestrate sequences of LLM calls, tool invocations, and RAG retrieval steps. Without explicit OTel instrumentation, each step is an opaque HTTP call: failures are silent at the orchestration layer, latency cannot be attributed to individual steps, and multi-hop tool chains (agent → MCP server → external API) have no trace correlation. LLM SDKs do not propagate `traceparent` headers across boundaries by default.

**Challenge 3: Per-tenant cost attribution without label enforcement.** Attributing GPU cost to specific teams requires DCGM hardware metrics, Kubernetes namespace labels, and billing aggregations to agree. Kubernetes does not enforce label consistency at admission time by default. A single pod deployed without a `team=` label produces GPU metrics that cannot be attributed, which breaks chargeback for the entire namespace.

**Challenge 4: High-cardinality GPU metrics at scale.** DCGM can expose 200+ counters per GPU. Sampled once per second across 100 GPUs, that is approximately 1.2 million samples per minute (200 × 100 × 60). Without cardinality management at the collector, the Prometheus TSDB can exhaust its storage within weeks, and remote write to managed Prometheus services hits per-tenant ingestion limits.

**Challenge 5: Disaggregated inference observability.** Modern LLM serving systems split inference into separate prefill workers (KV-cache computation, SM-bound) and decode workers (token generation, HBM-bound). Autoscaling both worker types on the same metric is therefore incorrect. An observability pipeline that does not distinguish worker roles produces misleading dashboards and triggers the wrong autoscaling actions.

---

## General guidelines

### 1. GPU metrics pipeline: DCGM → OTel Collector → Prometheus backend

**Challenges addressed:** Challenge 1, Challenge 4, Challenge 5

Deploy the DCGM Exporter as a DaemonSet. Scrape its `/metrics` endpoint with an OTel Collector DaemonSet that enriches the metrics with resource attributes, filters them to control cardinality, and remote-writes them to a Prometheus-compatible backend.

```mermaid
51+
flowchart LR
52+
GPU["NVIDIA GPU Hardware"]
53+
DCGM["dcgm-exporter\n(DaemonSet, port 9400)"]
54+
COLL["OTel Collector\n(DaemonSet)"]
55+
PROM["Prometheus backend\n(AMP / Thanos / Mimir)"]
56+
GRAF["Grafana"]
57+
58+
GPU --> DCGM
59+
DCGM -->|"prometheusreceiver\n/metrics scrape"| COLL
60+
COLL -->|"filterprocessor\nk8sattributesprocessor\nresourcedetectionprocessor"| COLL
61+
COLL -->|"prometheusremotewriteexporter"| PROM
62+
PROM --> GRAF
63+
```
64+
Use `filterprocessor` to retain only the key DCGM counters listed below and drop the rest to control cardinality:

| Counter | Measures | Purpose |
|---|---|---|
| `DCGM_FI_PROF_GR_ENGINE_ACTIVE` | Graphics/compute engine activity ratio (0–1), used here as the SM utilization signal | Prefill autoscaling |
| `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` | HBM (framebuffer) used / free, in MiB | Decode autoscaling |
| `DCGM_FI_DEV_POWER_USAGE` | Power draw (watts) | Cost attribution |
| `DCGM_FI_DEV_GPU_TEMP` | Temperature (°C) | Thermal alerting |
| `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` | Tensor core utilization (0–1) | Model efficiency |

**Checklist:**

- [ ] DCGM Exporter DaemonSet deployed on GPU node pools only (use `nodeSelector` with `nvidia.com/gpu.present: "true"`)
- [ ] OTel Collector `k8sattributesprocessor` attached so every metric carries `k8s.namespace.name` and `k8s.pod.name`
- [ ] `filterprocessor` configured to retain ≤ 15 DCGM counters
- [ ] Remote write configured with a per-namespace label (`namespace`) for cost query compatibility

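The allow-list filtering in this section, and the sample-volume arithmetic behind Challenge 4, can be sketched in a few lines of Python. This is only an illustration of the logic: in the pipeline itself the filtering is done by the collector's `filterprocessor`, not by application code. The function names, the sample scrape text, and its label values are assumptions made for the example.

```python
"""Sketch: allow-list filtering of DCGM counters and its effect on sample volume."""

# Counters retained by the allow-list (taken from the table in this section).
KEEP = {
    "DCGM_FI_PROF_GR_ENGINE_ACTIVE",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_DEV_FB_FREE",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
}


def metric_name(line: str) -> str:
    """Return the metric name from one Prometheus exposition-format sample line."""
    # The name ends at the first '{' (start of labels) or space (start of value).
    for i, ch in enumerate(line):
        if ch in "{ ":
            return line[:i]
    return line


def filter_samples(exposition: str) -> list:
    """Keep only the sample lines whose metric name is on the allow-list."""
    kept = []
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        if metric_name(line) in KEEP:
            kept.append(line)
    return kept


def samples_per_minute(counters: int, gpus: int, interval_s: int = 1) -> int:
    """Samples produced per minute for a counter count and a sampling interval."""
    return counters * gpus * (60 // interval_s)


# Hypothetical scrape output; label names and values are illustrative only.
SCRAPE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 61
DCGM_FI_DEV_SM_CLOCK{gpu="0",Hostname="node-a"} 1410
DCGM_FI_DEV_POWER_USAGE{gpu="0",Hostname="node-a"} 287.5
"""

if __name__ == "__main__":
    for sample in filter_samples(SCRAPE):
        print(sample)
    print(samples_per_minute(200, 100))        # unfiltered, 1 s interval
    print(samples_per_minute(len(KEEP), 100))  # after the allow-list
```

With the six counters from the table, the same 100-GPU cluster sampled every second produces 36,000 samples per minute instead of roughly 1.2 million.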
### 2. KEDA autoscaling on DCGM signals

**Challenges addressed:** Challenge 1, Challenge 5

Use KEDA ScaledObjects with the Prometheus scaler to autoscale inference workers on actual GPU utilization rather than on proxy signals such as queue depth.

Prefill workers are SM-bound: scale them on SM utilization with a 70% threshold. Decode workers are HBM-bound: scale them on HBM pressure with an 80% threshold. The asymmetry is deliberate: prefill latency degrades sharply above 70% SM saturation, while 80% HBM pressure is a safe operational ceiling before OOM risk.

Set `minReplicaCount: 2` for both worker types. Loading model weights into GPU HBM takes 30 to 120 seconds for 7B+ parameter models, which makes scale-from-zero unacceptable for interactive inference.

**Checklist:**

- [ ] Separate ScaledObjects for prefill (SM threshold) and decode (HBM threshold)
- [ ] `minReplicaCount: 2` to avoid cold-start latency spikes
- [ ] `cooldownPeriod: 120` to prevent thrashing during bursty traffic. In KEDA, `cooldownPeriod` only governs scaling to zero, so with `minReplicaCount: 2` also set `advanced.horizontalPodAutoscalerConfig.behavior.scaleDown.stabilizationWindowSeconds` to damp scale-in
- [ ] KEDA `TriggerAuthentication` using workload identity (not hardcoded credentials) for Prometheus access

**Documentation:**

- [KEDA Prometheus scaler](https://keda.sh/docs/scalers/prometheus/)
- [CNCF KEDA PR #5315: AMP scaler reference implementation](https://github.com/kedacore/keda/pull/5315)

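As a rough sketch of the two ScaledObjects described above, the following Python builds both manifests as plain dictionaries and prints them as JSON, which `kubectl apply -f -` also accepts. The Prometheus address, deployment names, pod-name patterns, `maxReplicaCount`, and the 600-second stabilization window are assumptions for illustration and do not come from the blueprint. The thresholds are written as ratios (`0.7`, `0.8`) because the underlying queries return values between 0 and 1.

```python
import json

# Hypothetical in-cluster Prometheus address; replace with your own backend.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

# Hypothetical pod-name patterns. The `pod` label is assumed to be attached by
# dcgm-exporter's Kubernetes pod mapping.
PREFILL_QUERY = 'avg(DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod=~"prefill-.*"})'
DECODE_QUERY = (
    'sum(DCGM_FI_DEV_FB_USED{pod=~"decode-.*"}) / '
    '(sum(DCGM_FI_DEV_FB_USED{pod=~"decode-.*"})'
    ' + sum(DCGM_FI_DEV_FB_FREE{pod=~"decode-.*"}))'
)


def scaled_object(name: str, deployment: str, query: str, threshold: str) -> dict:
    """Build a KEDA ScaledObject manifest with one Prometheus trigger."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 2,   # keep warm replicas; model loading is slow
            "maxReplicaCount": 8,   # hypothetical ceiling
            "cooldownPeriod": 120,  # only applies when scaling to zero
            "advanced": {
                "horizontalPodAutoscalerConfig": {
                    # Damps scale-in between minReplicaCount and maxReplicaCount.
                    "behavior": {"scaleDown": {"stabilizationWindowSeconds": 600}}
                }
            },
            "triggers": [{
                "type": "prometheus",
                # Compare the query result to the threshold directly instead of
                # dividing it by the replica count (the AverageValue default).
                "metricType": "Value",
                "metadata": {
                    "serverAddress": PROMETHEUS_URL,
                    "query": query,
                    "threshold": threshold,
                },
            }],
        },
    }


PREFILL = scaled_object("prefill-sm", "prefill-worker", PREFILL_QUERY, "0.7")
DECODE = scaled_object("decode-hbm", "decode-worker", DECODE_QUERY, "0.8")

if __name__ == "__main__":
    print(json.dumps([PREFILL, DECODE], indent=2))
```

Keeping one builder for both worker types makes the deliberate asymmetry explicit: only the query and the threshold differ between prefill and decode.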
### 3. Admission webhooks for label enforcement

**Challenges addressed:** Challenge 3

Deploy a `ValidatingWebhookConfiguration` that rejects GPU pod creation if the `team=` label is absent. Use `failurePolicy: Fail` so that, if the webhook is unavailable, GPU pod creation fails visibly rather than silently letting unlabeled pods through and corrupting attribution data.

A complementary `MutatingWebhookConfiguration` injects the `cost-attribution/key` annotation (a composite of the namespace and the team label) at admission time, so downstream recording rules can enrich DCGM metrics without application code changes. Kubernetes runs mutating admission webhooks before validating ones, so the annotation is injected first and the validating webhook makes the final allow/deny decision.

```mermaid
flowchart TD
    POD["kubectl apply\n(GPU pod)"]
    MWH["MutatingWebhook"]
    VWH["ValidatingWebhook"]
    OK["Scheduled"]
    DENY["Denied\n(missing team= label)"]

    POD --> MWH
    MWH -->|"team= present:\ninject cost-attribution/key annotation"| VWH
    MWH -->|"team= absent:\nno mutation"| VWH
    VWH -->|"team= present"| OK
    VWH -->|"team= absent\nGPU requested"| DENY
```

**Checklist:**

- [ ] Webhook deployed with 3+ replicas spread across nodes using `podAntiAffinity`, to minimize the availability impact of `failurePolicy: Fail`
- [ ] `namespaceSelector` scoped to namespaces with the `gpu-chargeback: enabled` label, which avoids blocking system namespace pods
- [ ] Webhook certificate rotation automated via `cert-manager`
- [ ] Integration test: verify that a GPU pod without `team=` is rejected with a descriptive error message

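The decision logic of the two webhooks can be sketched as pure functions over an `AdmissionReview` body. This is a minimal illustration and not the reference implementation: it leaves out the HTTPS server and TLS setup, and the `<namespace>/<team>` format of the annotation value is an assumption (the blueprint only says the key is a composite of namespace and team label).

```python
import base64
import json

GPU_RESOURCE = "nvidia.com/gpu"


def requests_gpu(pod: dict) -> bool:
    """True if any container asks for an NVIDIA GPU in its limits or requests."""
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources", {})
        for section in ("limits", "requests"):
            if GPU_RESOURCE in (resources.get(section) or {}):
                return True
    return False


def _review(response: dict) -> dict:
    """Wrap a response in the admission.k8s.io/v1 AdmissionReview envelope."""
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}


def validate(review: dict) -> dict:
    """Validating webhook: deny GPU pods that lack a team label."""
    request = review["request"]
    pod = request["object"]
    labels = pod.get("metadata", {}).get("labels") or {}
    allowed = (not requests_gpu(pod)) or bool(labels.get("team"))
    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {
            "message": "GPU pods must carry a team= label for cost attribution"
        }
    return _review(response)


def mutate(review: dict) -> dict:
    """Mutating webhook: add the cost-attribution/key annotation to GPU pods."""
    request = review["request"]
    pod = request["object"]
    metadata = pod.get("metadata", {})
    team = (metadata.get("labels") or {}).get("team")
    response = {"uid": request["uid"], "allowed": True}
    if requests_gpu(pod) and team:
        key = f"{request['namespace']}/{team}"  # assumed composite format
        if metadata.get("annotations"):
            # In a JSON Pointer, "/" inside a key is escaped as "~1".
            patch = [{"op": "add",
                      "path": "/metadata/annotations/cost-attribution~1key",
                      "value": key}]
        else:
            patch = [{"op": "add", "path": "/metadata/annotations",
                      "value": {"cost-attribution/key": key}}]
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return _review(response)
```

Non-GPU pods pass through both functions unchanged, which matches the bypass row of the decision matrix in the appendix.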
### 4. OTel instrumentation for multi-agent AI systems

**Challenges addressed:** Challenge 2

Instrument each agent node and tool call boundary with OTel spans using the Python or JavaScript SDK. Two boundary types require explicit instrumentation:

**Agent node spans.** Wrap each LangGraph or CrewAI node with a span that records the node name, model ID, input token count, output token count, and number of tool calls. This makes per-node latency and token cost visible in traces.

**MCP server spans.** Instrument the [Model Context Protocol](https://modelcontextprotocol.io) server boundary rather than each tool individually. MCP decouples tool implementations from agent frameworks, so instrumenting the MCP layer provides uniform coverage regardless of which agent framework calls the tool.

For context propagation across HTTP boundaries, use the OpenTelemetry instrumentation libraries for `requests` or `httpx`, which inject `traceparent` headers automatically. MCP servers called via stdio transport have no HTTP headers, so propagate the context manually: call `opentelemetry.propagate.inject` on the sending side and `opentelemetry.propagate.extract` on the receiving side of each boundary.

**Checklist:**

- [ ] Span created for each agent node (`agent.node.<name>`) with `agent.input_tokens` and `agent.output_tokens` attributes
- [ ] Span created for each MCP tool call (`mcp.tool.<name>`) with a `k8s.namespace.name` attribute for cost correlation
- [ ] `traceparent` propagation verified end-to-end with a test trace spanning agent → MCP server → LLM API
- [ ] `SpanExporter` configured to send traces to the same OTel Collector used for GPU metrics, giving a unified backend

**Documentation:**

- [OpenTelemetry Python SDK](https://opentelemetry.io/docs/languages/python/)
- [Model Context Protocol specification](https://modelcontextprotocol.io/specification)

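To make the stdio case concrete, the sketch below shows what has to cross the boundary: a W3C `traceparent` value carried inside the JSON-RPC message itself. It uses only the standard library and builds the header by hand so the example stays self-contained. In a real service, `opentelemetry.propagate.inject` and `extract` would read and write the carrier instead. Placing the value under `params._meta` is an assumption made for this example, as are the helper names.

```python
import copy
import re
import secrets
from typing import Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new sampled trace with a random trace ID and span ID."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_of(traceparent: str) -> str:
    """Return a traceparent for a child span: same trace ID, new span ID."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def inject(message: dict, traceparent: str) -> dict:
    """Sender side: copy a JSON-RPC request and attach the trace context."""
    out = copy.deepcopy(message)
    out.setdefault("params", {}).setdefault("_meta", {})["traceparent"] = traceparent
    return out


def extract(message: dict) -> Optional[str]:
    """Receiver side: return the traceparent if present and well formed."""
    value = message.get("params", {}).get("_meta", {}).get("traceparent")
    if isinstance(value, str) and TRACEPARENT_RE.match(value):
        return value
    return None


if __name__ == "__main__":
    # Hypothetical tool call; the tool name comes from the reference layout.
    call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
            "params": {"name": "get_gpu_utilization", "arguments": {}}}
    agent_span = new_traceparent()
    on_the_wire = inject(call, agent_span)
    server_span = child_of(extract(on_the_wire))
    print(agent_span)
    print(server_span)
```

Both printed values share the same 32-character trace ID, which is what lets a tracing backend stitch the agent span and the MCP server span into one trace.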
### 5. Per-namespace cost attribution with recording rules

**Challenges addressed:** Challenge 3, Challenge 4

Define Prometheus recording rules that aggregate DCGM power metrics into cost-ready time series at rule-evaluation time rather than at query time. This avoids expensive query-time joins and keeps FinOps dashboards fast at scale.

The rule chain: GPU power per pod → average per namespace (joined with `kube_pod_labels` to attach the `team` label) → hourly cost estimate → 24-hour rolling total for chargeback reports.

Expose a Grafana variable `$team` backed by `label_values(namespace:gpu_cost_usd:rate1h, team)` so FinOps teams can filter to their own team without writing PromQL.

**Checklist:**

- [ ] Recording rules evaluate at a 60s interval; faster evaluation does not improve precision and increases backend load
- [ ] Cost-per-GPU-hour constant kept outside the aggregation rules rather than hardcoded in them (for example, as a separate constant-valued series that the cost rule multiplies by), so it can be updated as instance pricing changes. Prometheus external labels and Alertmanager annotations cannot be read from rule expressions, so they are not suitable for holding this value
- [ ] Alert on `namespace:gpu_cost_usd:sum24h > threshold` for budget guardrails
- [ ] `kube_pod_labels` federation enabled if metrics and label data are in separate Prometheus instances

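The arithmetic behind the rule chain can be illustrated outside Prometheus. The blueprint does not spell out how power is converted into cost, so this sketch assumes one plausible model: each namespace is charged a share of the cluster's hourly GPU bill proportional to its share of measured GPU power. The rate constant, the function names, and the sample figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical price; in the recording rules this constant is externalized.
COST_PER_GPU_HOUR_USD = 2.0


def namespace_hourly_cost(pod_power, gpu_count, rate=COST_PER_GPU_HOUR_USD):
    """Split one hour of cluster GPU cost across (namespace, team) pairs.

    pod_power: iterable of (namespace, team, average_watts_over_the_hour).
    Assumed model: cost share equals share of total measured GPU power.
    """
    watts = defaultdict(float)
    for namespace, team, pod_watts in pod_power:
        watts[(namespace, team)] += pod_watts
    total = sum(watts.values())
    if total == 0:
        return {}
    cluster_cost = gpu_count * rate  # USD for this hour
    return {key: cluster_cost * value / total for key, value in watts.items()}


def rolling_24h(hourly_costs):
    """Sum the most recent 24 hourly estimates (the chargeback total)."""
    return sum(hourly_costs[-24:])


if __name__ == "__main__":
    # Hypothetical per-pod averages for one hour on a 4-GPU cluster.
    pods = [
        ("search", "team-a", 300.0),
        ("search", "team-a", 100.0),
        ("ranking", "team-b", 400.0),
    ]
    for key, usd in namespace_hourly_cost(pods, gpu_count=4).items():
        print(key, round(usd, 2))
    print(rolling_24h([1.0] * 30))
```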
---

## Implementation

### Reference implementation

A complete reference implementation is available as an AWS Sample:

- **GitHub:** `github.com/aws-samples/ai-inference-reliability-lab`
  - `terraform/`: EKS cluster with GPU node groups, DCGM DaemonSet, OTel Collector
  - `manifests/`: KEDA ScaledObjects, ValidatingWebhookConfiguration, ArgoCD Applications
  - `mcp-server/`: FastMCP server with OTel-instrumented GPU observability tools (`get_gpu_utilization`, `get_agent_cost`)
  - `agents/`: LangGraph agent with OTel span decorators and MCP tool integration
  - `dashboards/`: Grafana dashboard JSON (FinOps, SRE, DevOps personas)

### Related resources

- [GPU Cost Attribution for Disaggregated LLM Inference with NVIDIA Dynamo](https://aws-observability.github.io/observability-best-practices/recipes/eks-gpu-cost-attribution/) (AWS Observability Best Practices portal)
- [End-to-End Observability for Multi-Agent AI Systems on Kubernetes](https://medium.com/@sivagurunath/end-to-end-observability-for-multi-agent-ai-systems-on-kubernetes-e4133dd111d6) (Medium, June 2026)
- [Per-Namespace GPU Cost Attribution on EKS with NVIDIA MIG](https://medium.com/@sivagurunath/per-namespace-gpu-cost-attribution-on-eks-with-nvidia-mig-9dde0f82b6e4) (Medium)
- [CNCF KEDA + AMP Prometheus scaler (PR #5315)](https://github.com/kedacore/keda/pull/5315) (merged, in production use)

### Relationship to existing blueprints

| Blueprint | Relationship |
|---|---|
| Infrastructure and Processes in Non-K8s Environments | Complementary: this blueprint covers Kubernetes-specific GPU workloads |

---

## Appendix

### DCGM metric → autoscaling signal mapping

| Metric | Threshold | Action | Worker type |
|---|---|---|---|
| `DCGM_FI_PROF_GR_ENGINE_ACTIVE` | > 70% (0.70) | Scale out | Prefill (SM-bound) |
| HBM pressure (`FB_USED / (FB_USED + FB_FREE)`) | > 80% (0.80) | Scale out | Decode (HBM-bound) |
| `DCGM_FI_PROF_GR_ENGINE_ACTIVE` | < 20% (0.20) for 10m | Scale in | Prefill (SM-bound) |
| HBM pressure (`FB_USED / (FB_USED + FB_FREE)`) | < 40% (0.40) for 10m | Scale in | Decode (HBM-bound) |

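The mapping above can be expressed as a small decision function, which is handy for unit-testing thresholds before encoding them in KEDA. This is a sketch only: in the deployed system KEDA and the HPA make the decision. The function names and the assumption of one sample per minute (so that ten samples cover the 10-minute window) are illustrative.

```python
# (scale-out threshold, scale-in threshold) per worker type, as ratios.
THRESHOLDS = {
    "prefill": (0.70, 0.20),  # DCGM_FI_PROF_GR_ENGINE_ACTIVE
    "decode": (0.80, 0.40),   # HBM pressure
}
SCALE_IN_WINDOW = 10  # samples; assumes one sample per minute


def hbm_pressure(fb_used: float, fb_free: float) -> float:
    """HBM pressure as FB_USED / (FB_USED + FB_FREE)."""
    return fb_used / (fb_used + fb_free)


def decide(worker_type: str, history: list) -> str:
    """Return 'scale_out', 'scale_in', or 'hold' for a signal history.

    history holds the signal values oldest first, newest last.
    """
    high, low = THRESHOLDS[worker_type]
    if not history:
        return "hold"
    if history[-1] > high:
        return "scale_out"
    window = history[-SCALE_IN_WINDOW:]
    # Scale in only after the signal has stayed low for the whole window.
    if len(window) == SCALE_IN_WINDOW and all(value < low for value in window):
        return "scale_in"
    return "hold"


if __name__ == "__main__":
    print(decide("prefill", [0.55, 0.75]))
    print(decide("decode", [hbm_pressure(20, 60)] * 10))
    print(decide("decode", [0.30] * 9))
```

The three calls cover one case each: a prefill worker above its scale-out threshold, a decode worker that stayed under its scale-in threshold for the full window, and a decode worker with too short a history to scale in.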
### OTel span naming conventions

| Component | Span name | Key attributes |
|---|---|---|
| LangGraph / CrewAI node | `agent.node.<node_name>` | `agent.model`, `agent.input_tokens`, `agent.output_tokens` |
| MCP tool call | `mcp.tool.<tool_name>` | `mcp.tool.name`, `k8s.namespace.name` |
| RAG retrieval step | `rag.retrieve` | `rag.query`, `rag.document_count` |
| LLM API call | `llm.completions` | `llm.model`, `llm.prompt_tokens`, `llm.completion_tokens` |

### Admission webhook decision matrix

| Condition | Webhook | Action | Reason |
|---|---|---|---|
| GPU pod, `team=` label present | Mutating | Inject `cost-attribution/key` annotation | Enrich for recording rules |
| GPU pod, `team=` label absent | Validating | Deny with descriptive error | Attribution impossible |
| Non-GPU pod | Both | Allow (bypass GPU check) | No attribution needed |
| Webhook unavailable | Validating (`failurePolicy: Fail`) | Block GPU pod creation | Prefer visible failure over silent attribution gaps |
