KEP-2837: Add instrumentation section and update GA criteria by ndixita · Pull Request #6180 · kubernetes/enhancements

ndixita · 2026-06-08T23:31:40Z

One-line PR description: Add instrumentation and update GA criteria

Issue link: Pod level resources #2837

ndixita · 2026-06-08T23:31:55Z

k8s-ci-robot · 2026-06-08T23:56:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ndixita
Once this PR has been reviewed and has the lgtm label, please assign mrunalp, soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

keps/prod-readiness/OWNERS
keps/sig-node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tallclair · 2026-06-09T16:16:02Z

+#### Admission & API Validation
+These metrics track feature adoption, user intent, and validation friction within the control plane.
+
+##### `pod_level_resources_admission_total`


Suggested change

-##### `pod_level_resources_admission_total`
+##### `kubelet_pod_level_resources_admission_total`

+1. This should be the main adoption metric (ALPHA, removed in a future release). I recommend dropping the other adoption related metrics listed below.

Let's make a note here that this metric is ALPHA, temporary, and will be removed in 2-3 releases.

tallclair · 2026-06-09T16:22:21Z

+These metrics track feature adoption, user intent, and validation friction within the control plane.
+
+##### `pod_level_resources_admission_total`
+Total number of pods processed during Kubelet admission, categorized by resource configuration strategy.


How do you anticipate using this? I wonder whether kube-state-metrics could meet the use case?

This is to monitor adoption and to see whether there's any friction in the new validation.

tallclair · 2026-06-09T16:30:15Z

+
+This metric is recorded as a counter.
+
+##### `pod_level_resources_validation_errors_total`


Do we have any generic metrics tracking validation errors? I wonder whether we could count validation errors by field path (stripping out specific identifiers like index or key), or if the cardinality would be too high?

/cc @jpbetz @yongruilin

Good question — we don't have a generic one today; the existing apiserver_validation_* metrics only track declarative-vs-handwritten parity.

I think the catch is cardinality: raw field paths are unbounded. We could strip subscripts (spec.containers[].resources.limits[]) or label by field.Error.Origin (the rule kind) instead.

That feels more like declarative-validation-framework infra than part of this KEP, though. Maybe a separate issue under sig/api-machinery? Keeping pod_level_resources_validation_errors_total here for now.
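To illustrate the subscript-stripping idea above, here is a minimal sketch. The function name `normalize_field_path` is hypothetical and not part of any Kubernetes package; it only shows how index and key subscripts could be collapsed before a path is used as a metric label. It assumes keys never contain a literal `]`.

```python
import re

# Matches any bracketed subscript such as [0] or [cpu].
# Assumption: keys do not themselves contain a closing bracket.
_SUBSCRIPT = re.compile(r"\[[^\]]*\]")


def normalize_field_path(path: str) -> str:
    """Replace every index or key subscript in a field path with an empty []."""
    return _SUBSCRIPT.sub("[]", path)


if __name__ == "__main__":
    print(normalize_field_path("spec.containers[0].resources.limits[cpu]"))
```

This bounds the label values per API field, but as the thread notes, the number of distinct fields across all APIs and versions may still be large.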

Cardinality is already high just for the individual fields across all versions of all APIs. If we then try to multiply that by any other labels, it gets completely out of control (even resources are risky this way... consider the CRD cases). Better for feature owners to identify what is actually needed and add metrics for just those needs.

tallclair · 2026-06-09T16:32:02Z

+
+This metric is recorded as a counter.
+
+##### `pod_level_resources_defaulting_total`


AFAIK we don't have any metrics measuring defaulting or validation decisions. I don't know if there's a technical reason for this or if we just haven't had a use case for it, but I'm hesitant to introduce the metrics here.

Yeah, I think we can skip this.

tallclair · 2026-06-09T16:32:45Z

+Tracks operation failures during the container lifecycle that are specific to the shared pod-level resource pool.
+
+##### `kubelet_pod_level_oom_kills_total`
+Total number of OOM kills triggered specifically because the shared pod-level memory pool was exhausted. This metric is crucial for identifying cases where a container was killed even if it was under its own individual limit, but the pod's aggregate limit was reached.


How will this be measured / detected?

If you agree that this is worth tracking, we could measure it by:

- fetching memory.events for all containers and the pod cgroup using cadvisorStatsProvider
- computing pod limit ooms = pod ooms - sum(container ooms)

IMO it is worth tracking, as pod-level resources allow resource sharing among the containers. So this metric could help us understand whether the pod-level limit is set correctly.
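The subtraction proposed above could look roughly like the sketch below. It parses the flat `key value` lines of cgroup v2 `memory.events` files. The function names are hypothetical, and the choice of the `oom` counter (rather than `oom_kill`) is an assumption, as is the premise that the pod cgroup's counter includes events from its container cgroups.

```python
from typing import Dict, List


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse the flat `key value` lines of a cgroup v2 memory.events file."""
    events: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def pod_limit_ooms(pod_events: str, container_events: List[str], key: str = "oom") -> int:
    """Estimate OOM events attributable to the pod-level limit.

    Implements: pod limit ooms = pod ooms - sum(container ooms).
    Assumption: the pod counter is hierarchical, so it already includes
    the events recorded in each container cgroup.
    """
    pod_total = parse_memory_events(pod_events).get(key, 0)
    container_total = sum(parse_memory_events(t).get(key, 0) for t in container_events)
    # Clamp at zero: the files are read at slightly different times,
    # so the difference can briefly go negative.
    return max(pod_total - container_total, 0)


if __name__ == "__main__":
    pod = "low 0\nhigh 0\nmax 12\noom 5\noom_kill 5\n"
    containers = ["oom 2\noom_kill 2\n", "oom 1\noom_kill 1\n"]
    print(pod_limit_ooms(pod, containers))
```

Because the result is derived from cumulative counters, a real implementation would also need to handle container restarts, which reset the per-container counts.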

tallclair · 2026-06-09T16:34:26Z

+#### Resource State (State Metrics)
+These metrics expose the current state of Pod-level resource requests and limits, primarily for consumption by kube-state-metrics and observability dashboards.
+
+##### `kube_pod_level_resources_requests`


I think we should drop `level` from these. Either kube_pod_resources_requests or kube_pod_spec_resources_requests would be more consistent with the other metrics.

tallclair · 2026-06-09T16:35:41Z

+- `namespace` - Namespace of the pod.
+- `uid` - Kubernetes UID of the pod.
+- `resource` - The resource type (e.g., `cpu`, `memory`).
+- `unit` - The unit of the resource (e.g., `core`, `bytes`).


The container level metrics also include a node label.

natasha41575 · 2026-06-09T21:29:27Z

+- `pod` - Name of the pod.
+- `namespace` - Namespace of the pod.
+- `uid` - Kubernetes UID of the pod.


These labels strike me as having extremely high cardinality. Or is there a reason we don't need to worry about that here?

+1. See my comment above about kube-state-metrics. I don't think we need these metrics at all.

jpbetz · 2026-06-11T15:09:24Z

+This section outlines the final list of metrics for the Pod-Level Resources feature, excluding Resource Manager extensions. These metrics are designed to provide deep observability into admission control, scheduling efficiency, and Kubelet-level execution.
+
+#### Admission & API Validation
+These metrics track feature adoption, user intent, and validation friction within the control plane.


@richabanker how are metrics for feature adoption handled? Do we keep them in alpha and then deprecate them after 2-3 releases? That would be my preference. If so, let's make a note here about that plan.

K8s already exposes the kubernetes_feature_enabled metric, which shows whether a feature is enabled or disabled in a component. Could that be used to track "feature adoption", if all we care about is whether this feature was on or off?

We want to track how many pods have pod-level resources set. Even after enabling the feature gate, it is required to explicitly set resources at pod-level in the spec to use this functionality. Does that make sense?

Then we would never want to deprecate this metric, even when the feature graduates to GA, right?

Can you track this with kube-state-metrics, or does it have to be a metric? If we can use kube-state-metrics, that's usually better for this sort of thing.

Oh, taking a step back: adding a permanent metric just to parse a resource's spec and capture the presence/absence of a field would be an anti-pattern. We generally want metrics that help determine the operational health (health / availability / latency) of features, basically something actionable for cluster admins.

And regarding KSM, agreed that if we want to expose metrics about a resource's spec/status, KSM would be the way to go. But since pod.spec.resources is not natively tracked in existing KSM metrics today, you'd have to add a new metric in the KSM repo (similar to the kube_pod_level_resources_requests being proposed in this PR) that parses this new field and converts it to a metric. Currently KSM only exposes these pod resource metrics: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md

Another way to use KSM for this would be to add a label to the pod metadata with the value you want to track, then configure KSM with --metric-labels-allowlist=pods=[<label>], which makes the existing kube_pod_labels KSM metric show how many pods have this label set. Also cc @dgrisonnet to confirm whether that's the right way to go about feature adoption metrics that depend on a resource's spec.

jpbetz · 2026-06-11T15:20:21Z

+This metric is recorded as a counter.
+
+#### Resource State (State Metrics)
+These metrics expose the current state of Pod-level resource requests and limits, primarily for consumption by kube-state-metrics and observability dashboards.


kube-state-metrics is informer based, not metric based.

Let's drop the metrics listed here for that purpose. If we need this information, we should open a PR against kube-state-metrics after this feature merges.

Specifically, I recommend dropping:

kube_pod_level_resource_requests

kube_pod_level_resource_limits

jpbetz · 2026-06-11T15:26:11Z

+- `pod` - Name of the pod.
+- `namespace` - Namespace of the pod.
+- `uid` - Kubernetes UID of the pod.


+1. See my comment above about kube-state-metrics. I don't think we need these metrics at all.

jpbetz · 2026-06-11T15:41:02Z

+#### The Kubelet (Execution Phase)
+Tracks operation failures during the container lifecycle that are specific to the shared pod-level resource pool.
+
+##### `kubelet_pod_level_oom_kills_total`


Suggested change

-##### `kubelet_pod_level_oom_kills_total`
+##### `kubelet_pod_oom_kills_total`

? (I don't think we need the feature name here, the fact that it's a pod oom is enough)

jpbetz · 2026-06-11T15:42:25Z

+
+This metric is recorded as a counter.
+
+#### The Kubelet (Execution Phase)


Recommend adding a pod CPU throttling metric. Maybe pod_cpu_cfs_throttled_seconds_total?
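As a rough sketch of where such a value could come from: on cgroup v2, a cgroup's `cpu.stat` file reports `throttled_usec`, which can be converted to seconds for a `*_seconds_total` counter. Reading the pod cgroup's `cpu.stat` is an assumption here, not something the comment specifies, and the function name is hypothetical.

```python
def throttled_seconds(cpu_stat: str) -> float:
    """Return cumulative throttled time in seconds from cgroup v2 cpu.stat text."""
    for line in cpu_stat.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "throttled_usec":
            # cpu.stat reports microseconds; the metric name implies seconds.
            return int(parts[1]) / 1_000_000
    # No throttling data, e.g. no CPU limit is set on this cgroup.
    return 0.0


if __name__ == "__main__":
    sample = "usage_usec 100\nnr_periods 10\nnr_throttled 3\nthrottled_usec 2500000\n"
    print(throttled_seconds(sample))
```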

jpbetz · 2026-06-11T15:49:26Z

+#### Admission & API Validation
+These metrics track feature adoption, user intent, and validation friction within the control plane.
+
+##### `pod_level_resources_admission_total`


+1. This should be the main adoption metric (ALPHA, removed in a future release). I recommend dropping the other adoption related metrics listed below.

Let's make a note here that this metric is ALPHA, temporary, and will be removed in 2-3 releases.

Signed-off-by: Dixita <ndixita@google.com>

k8s-ci-robot added the cncf-cla: yes label Jun 8, 2026

k8s-ci-robot requested review from dchen1107 and mrunalp June 8, 2026 23:31

k8s-ci-robot added the kind/kep, sig/node, and size/M labels Jun 8, 2026

k8s-ci-robot assigned tallclair Jun 8, 2026

ndixita force-pushed the plr-ga branch from a9330ba to 76ac5a6 Compare June 8, 2026 23:56

ndixita force-pushed the plr-ga branch from 76ac5a6 to 2b1eda3 Compare June 9, 2026 01:28

k8s-ci-robot added size/L and removed size/M labels Jun 9, 2026

tallclair reviewed Jun 9, 2026


k8s-ci-robot requested review from jpbetz and yongruilin June 9, 2026 16:36

natasha41575 reviewed Jun 9, 2026


ndixita force-pushed the plr-ga branch from 2b1eda3 to a15bd9a Compare June 10, 2026 00:51

jpbetz reviewed Jun 11, 2026


whtssub mentioned this pull request Jun 11, 2026

Pod level resources #2837

Open

23 tasks

ndixita force-pushed the plr-ga branch from 5ac2ecb to dc580d1 Compare June 12, 2026 20:26

k8s-ci-robot added size/M and removed size/L labels Jun 12, 2026

Add instrumentation section and update GA criteria

929371e

Signed-off-by: Dixita <ndixita@google.com>

ndixita force-pushed the plr-ga branch from dc580d1 to 929371e Compare June 12, 2026 20:29


7 participants
