
Commit a15bd9a

Add instrumentation section and update GA criteria
Signed-off-by: Dixita <ndixita@google.com>
1 parent b4454a8 commit a15bd9a

3 files changed

Lines changed: 115 additions & 8 deletions

File tree

keps/prod-readiness/sig-node/2837.yaml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,4 +2,6 @@ kep-number: 2837
22
alpha:
33
approver: "@jpbetz"
44
beta:
5+
approver: "@soltysh"
6+
stable:
57
approver: "@soltysh"

keps/sig-node/2837-pod-level-resource-spec/README.md

Lines changed: 109 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,17 @@
3939
- [[Scoped for Beta] HPA](#scoped-for-beta-hpa)
4040
- [Cluster Autoscaler](#cluster-autoscaler)
4141
- [VPA](#vpa)
42+
- [Instrumentation](#instrumentation)
43+
- [Admission &amp; API Validation](#admission--api-validation)
44+
- [<code>pod_level_resources_admission_total</code>](#pod_level_resources_admission_total)
45+
- [<code>pod_level_resources_defaulting_total</code>](#pod_level_resources_defaulting_total)
46+
- [<code>pod_level_resources_validation_errors_total</code>](#pod_level_resources_validation_errors_total)
47+
- [The Kubelet (Execution Phase)](#the-kubelet-execution-phase)
48+
- [<code>kubelet_pod_level_oom_kills_total</code>](#kubelet_pod_level_oom_kills_total)
49+
- [Resource State (State Metrics)](#resource-state-state-metrics)
50+
- [<code>kube_pod_level_resources_requests</code>](#kube_pod_level_resources_requests)
51+
- [<code>kube_pod_level_resources_limits</code>](#kube_pod_level_resources_limits)
52+
- [Regression Monitoring (Existing Metrics)](#regression-monitoring-existing-metrics)
4253
- [Test Plan](#test-plan)
4354
- [Unit tests](#unit-tests)
4455
- [Integration tests](#integration-tests)
@@ -69,6 +80,7 @@
6980
- [[Future KEP Consideration in 1.35] Topology Manager](#future-kep-consideration-in-135-topology-manager)
7081
- [[Future KEP Consideration in collaboration with sig-autoscaling] VPA](#future-kep-consideration-in-collaboration-with-sig-autoscaling-vpa)
7182
- [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey)
83+
- [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey-1)
7284
<!-- /toc -->
7385

7486

@@ -78,17 +90,17 @@
7890
Items marked with (R) are required *prior to targeting to a milestone / release*.
7991

8092
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
81-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
82-
- [ ] (R) Design details are appropriately documented
93+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
94+
- [x] (R) Design details are appropriately documented
8395
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
8496
- [ ] e2e Tests for all Beta API Operations (endpoints)
8597
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
8698
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
87-
- [ ] (R) Graduation criteria is in place
99+
- [x] (R) Graduation criteria is in place
88100
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
89101
- [ ] (R) Production readiness review completed
90102
- [ ] (R) Production readiness review approved
91-
- [ ] "Implementation History" section is up-to-date for milestone
103+
- [x] "Implementation History" section is up-to-date for milestone
92104
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
93105
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
94106

@@ -1272,6 +1284,77 @@ type RecommendedPodResources struct {
12721284
Note: Detailed KEP design is owned and being worked on by
12731285
sig-autoscaling: [#7571](https://github.com/kubernetes/autoscaler/issues/7571)
12741286
1287+
### Instrumentation
1288+
1289+
This section lists the final set of metrics for the Pod-Level Resources feature, excluding Resource Manager extensions. The new metrics cover admission and API validation, Kubelet execution, and resource state; existing scheduler and node metrics are listed separately for regression monitoring.
1290+
1291+
#### Admission & API Validation
1292+
These metrics track feature adoption, user intent, and validation friction within the control plane.
1293+
1294+
##### `pod_level_resources_admission_total`
1295+
Total number of pods processed during Kubelet admission, categorized by resource configuration strategy.
1296+
1297+
Labels:
1298+
- `config_mode` - Possible values: `container_level`, `pod_level_only`, `pod_and_container_level`.
1299+
- `status` - Possible values: `admitted`, `rejected`.
1300+
- `qos_class` - Possible values: `guaranteed`, `burstable`, `best_effort`.
1301+
1302+
This metric is recorded as a counter.
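As a rough illustration of how the `config_mode` label could be derived, the sketch below classifies a pod spec represented as a plain dictionary and tallies it in an in-memory counter. The helper names (`has_resources`, `config_mode`, `record_admission`) are hypothetical, and treating a pod with no resources set anywhere as `container_level` is an assumption; the KEP text does not say how such pods are labeled.

```python
from collections import Counter


def has_resources(resources):
    """Return True if a resources stanza sets any request or limit."""
    resources = resources or {}
    return bool(resources.get("requests")) or bool(resources.get("limits"))


def config_mode(pod_spec):
    """Map a pod spec (plain dict) to a `config_mode` label value."""
    pod_level = has_resources(pod_spec.get("resources"))
    container_level = any(
        has_resources(c.get("resources")) for c in pod_spec.get("containers", [])
    )
    if pod_level and container_level:
        return "pod_and_container_level"
    if pod_level:
        return "pod_level_only"
    # Assumption: pods without pod-level resources, including pods that set
    # no resources at all, are counted as `container_level`.
    return "container_level"


# In-memory stand-in for the counter, keyed by (config_mode, status, qos_class).
admission_total = Counter()


def record_admission(pod_spec, status, qos_class):
    """Increment the stand-in counter for one admission decision."""
    admission_total[(config_mode(pod_spec), status, qos_class)] += 1
```

A real implementation would register the counter with the component's metrics library rather than a `Counter`; the dictionary form is used here only to keep the label logic visible.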
1303+
1304+
##### `pod_level_resources_defaulting_total`
1305+
Total number of times pod-level resource fields were automatically populated by API server defaulting logic.
1306+
1307+
Labels:
1308+
- `resource` - Possible values: `cpu`, `memory`, `hugepages`.
1309+
- `strategy` - Possible values: `request_from_container_requests`, `request_from_pod_limit`, `hugepage_limit_from_container_limits`.
1310+
1311+
This metric is recorded as a counter.
1312+
1313+
##### `pod_level_resources_validation_errors_total`
1314+
Total number of pods rejected specifically due to validation errors in `pod.spec.resources`.
1315+
1316+
Labels:
1317+
- `reason` - Possible values: `sum_mismatch`, `negative_value`, `unsupported_resource`, `request_limit_mismatch`, `forbidden_field`.
1318+
1319+
This metric is recorded as a counter.
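The sketch below shows one way a validation failure could be mapped to a `reason` value. It covers only three of the listed reasons, and the meaning given to each is an assumption inferred from the label names: `request_limit_mismatch` is taken to mean a pod-level request above the pod-level limit, and `sum_mismatch` is taken to mean that aggregated container requests exceed the pod-level request. The function name and the use of plain numbers instead of Kubernetes quantities are illustrative.

```python
def validation_reason(pod_requests, pod_limits, container_requests):
    """Return a `reason` label value, or None if these checks pass.

    pod_requests / pod_limits: dicts of resource name -> number.
    container_requests: list of such dicts, one per container.
    All values for a given resource are assumed to share one unit.
    """
    # Any negative quantity is rejected first.
    for values in (pod_requests, pod_limits, *container_requests):
        if any(v < 0 for v in values.values()):
            return "negative_value"

    # Assumed meaning: a pod-level request must not exceed the pod-level limit.
    for resource, request in pod_requests.items():
        limit = pod_limits.get(resource)
        if limit is not None and request > limit:
            return "request_limit_mismatch"

    # Assumed meaning: summed container requests must fit in the pod request.
    for resource, request in pod_requests.items():
        total = sum(c.get(resource, 0) for c in container_requests)
        if total > request:
            return "sum_mismatch"

    return None
```

The order of the checks determines which reason is reported when several apply; that ordering is a choice made for this sketch, not something the KEP specifies.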
1320+
1321+
#### The Kubelet (Execution Phase)
1322+
This metric tracks operational failures during the container lifecycle that are specific to the shared pod-level resource pool.
1323+
1324+
##### `kubelet_pod_level_oom_kills_total`
1325+
Total number of OOM kills triggered specifically because the shared pod-level memory pool was exhausted. This metric identifies cases where a container was killed while still under its own limit because the pod's aggregate limit was reached.
1326+
1327+
This metric is recorded as a counter.
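The condition described above can be written as a small predicate. This is a simplified sketch with a hypothetical function name; how the Kubelet actually attributes an OOM kill to the pod-level pool is not described in this section, so the comparison below is an assumption based only on the prose.

```python
def is_pod_level_oom(container_usage, container_limit, pod_usage, pod_limit):
    """Return True if an OOM kill is attributable to the pod-level memory pool.

    All values are in bytes; a limit of None means "no limit set".
    """
    if pod_limit is None:
        # Without a pod-level limit there is no shared pool to exhaust.
        return False
    under_own_limit = container_limit is None or container_usage < container_limit
    pod_pool_exhausted = pod_usage >= pod_limit
    return under_own_limit and pod_pool_exhausted
```

For example, a container using 300 MiB against a 512 MiB container limit in a pod that has reached its 1 GiB pod-level limit would count, while a container that hit its own 512 MiB limit would not.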
1328+
1329+
#### Resource State (State Metrics)
1330+
These metrics expose the current state of Pod-level resource requests and limits, primarily for consumption by kube-state-metrics and observability dashboards.
1331+
1332+
##### `kube_pod_level_resources_requests`
1333+
Exposes the value of `pod.spec.resources.requests` for each resource type.
1334+
1335+
Labels:
1336+
- `pod` - Name of the pod.
1337+
- `namespace` - Namespace of the pod.
1338+
- `uid` - Kubernetes UID of the pod.
1339+
- `resource` - The resource type (e.g., `cpu`, `memory`).
1340+
- `unit` - The unit of the resource (e.g., `core`, `bytes`).
1341+
1342+
This metric is recorded as a gauge.
1343+
1344+
##### `kube_pod_level_resources_limits`
1345+
Exposes the value of `pod.spec.resources.limits` for each resource type.
1346+
1347+
Labels: Same as `kube_pod_level_resources_requests`.
1348+
1349+
This metric is recorded as a gauge.
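To make the label set concrete, the sketch below renders gauge samples in the Prometheus text exposition format from a pod represented as a plain dictionary. The function name `render_samples` and the `UNITS` table are hypothetical, quantities are assumed to be already converted to plain numbers (cores, bytes), label values are not escaped, and the empty `unit` for resources other than `cpu` and `memory` is an assumption.

```python
# Units taken from the examples in the label list; other resources are not
# covered by the KEP text and fall back to an empty unit in this sketch.
UNITS = {"cpu": "core", "memory": "bytes"}


def render_samples(metric, pod, field):
    """Render one exposition line per resource in pod.spec.resources[field].

    field is "requests" or "limits". Values must already be plain numbers.
    """
    meta = pod["metadata"]
    values = pod["spec"].get("resources", {}).get(field, {})
    lines = []
    for resource in sorted(values):
        labels = {
            "namespace": meta["namespace"],
            "pod": meta["name"],
            "uid": meta["uid"],
            "resource": resource,
            "unit": UNITS.get(resource, ""),
        }
        label_text = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines.append(f"{metric}{{{label_text}}} {values[resource]}")
    return lines
```

Calling `render_samples("kube_pod_level_resources_limits", pod, "limits")` produces the corresponding limit samples with the same labels.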
1350+
1351+
#### Regression Monitoring (Existing Metrics)
1352+
While not new, the following metrics must be monitored to ensure no regressions in scheduling or node stability occur after adopting pod-level resource specifications.
1353+
1354+
- `schedule_attempts_total{result="error|unschedulable"}`: Monitored to detect spikes in unschedulable pods due to potential bugs in the new resource requirement calculation.
1355+
- `node_collector_evictions_total`: Monitored to ensure that the new pod eviction ranking logic based on pod-level requests behaves as expected and does not lead to an increase in unintended evictions.
1356+
- `started_pods_errors_total` / `started_containers_errors_total`: Monitored to detect failures in the Kubelet's ability to create and configure the pod-level cgroups and container sandboxes.
1357+
12751358
### Test Plan
12761359
12771360
[X] I/we understand the owners of the involved components may require updates to
@@ -1343,7 +1426,6 @@ feature gate and by setting the new `resources` fields in PodSpec at Pod level.
13431426
13441427
#### GA (stable)
13451428
1346-
* VPA integration of feature moved to beta.
13471429
* No major bugs reported for 3 months.
13481430
* Pod Level Resources Support With In Place Pod Vertical Scaling KEP is past alpha.
13491431
* User feedback (ideally from at least two distinct users) is green
@@ -1718,6 +1800,12 @@ Pick one more of these and delete the rest.
17181800
17191801
- [X] Metrics
17201802
- Metric name:
1803+
- `pod_level_resources_admission_total`: Total number of pods processed during Kubelet admission, categorized by resource configuration strategy.
1804+
- `pod_level_resources_defaulting_total`: Total number of times pod-level resource fields were automatically populated by API server defaulting logic.
1805+
- `pod_level_resources_validation_errors_total`: Total number of pods rejected specifically due to validation errors in `pod.spec.resources`.
1806+
- `kubelet_pod_level_oom_kills_total`: Total number of OOM kills triggered specifically because the shared pod-level memory pool was exhausted.
1807+
- `kube_pod_level_resources_requests`: Exposes the value of `pod.spec.resources.requests` for each resource type.
1808+
- `kube_pod_level_resources_limits`: Exposes the value of `pod.spec.resources.limits` for each resource type.
17211809
- `apiserver_rejected_requests` will indicate any failures (`Bad Request` code=400) related to translation of new `resources` field in PodSpec.
17221810
- `schedule_attempts_total{result="error|unschedulable"}`
17231811
- `node_collector_evictions_total`: to check whether pod-level resource settings are causing more pod evictions than normal
@@ -1927,6 +2015,7 @@ resource specs.
19272015
[#4678](https://github.com/kubernetes/enhancements/pull/4678)
19282016
- **2025-06-18:** Revised KEP for Beta
19292017
- **2026-01-27:** Revised KEP for 1.36 to include fixes for issues 135082 and 136120.
2018+
- **2026-06-08:** Revised KEP for GA in 1.37.
19302019
19312020
## Drawbacks
19322021
@@ -2089,6 +2178,21 @@ recommendations have been proposed and require further discussion.
20892178
20902179
#### [Scoped for GA] User Experience Survey
20912180
2181+
Before promoting the feature to GA, we plan to conduct a UX survey to
2182+
understand user expectations for setting various combinations of requests and
2183+
limits at both the pod and container levels. This will help us gather use cases
2184+
for different combinations, enabling us to enhance the feature's usability. If we
2185+
identify the need for significant changes to the defaulting logic based on this
2186+
feedback, we'll release another Beta version of Pod-Level Resources to
2187+
incorporate those adjustments.

#### [Future KEP Consideration in collaboration with sig-autoscaling] VPA
2188+
Pod-Level Resources allows pod-level limits to be greater than aggregated container
2189+
limits to allow the containers to share idle resources among each other.
2190+
Integrating this functionality with VPA necessitates the development of a complex
2191+
new recommendation algorithm. Concepts such as proportionate pod and container level
2192+
recommendations have been proposed and require further discussion.
2193+
2194+
#### [Scoped for GA] User Experience Survey
2195+
20922196
Before promoting the feature to GA, we plan to conduct a UX survey to
20932197
understand user expectations for setting various combinations of requests and
20942198
limits at both the pod and container levels. This will help us gather use cases

keps/sig-node/2837-pod-level-resource-spec/kep.yaml

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
title: KEP Template
1+
title: Pod Level Resource Specifications
22
kep-number: 2837
33
authors:
44
- "@ndixita"
@@ -25,17 +25,18 @@ see-also: []
2525
replaces: []
2626

2727
# The target maturity stage in the current dev cycle for this KEP.
28-
stage: beta
28+
stage: stable
2929

3030
# The most recent milestone for which work toward delivery of this KEP has been
3131
# done. This can be the current (upcoming) milestone, if it is being actively
3232
# worked on.
33-
latest-milestone: "v1.36"
33+
latest-milestone: "v1.37"
3434

3535
# The milestone at which this feature was, or is targeted to be, at each stage.
3636
milestone:
3737
alpha: "v1.33"
3838
beta: "v1.34"
39+
stable: "v1.37"
3940

4041
# The following PRR answers are required at alpha release
4142
# List the feature gate name and the components for which it must be enabled
