
Commit 929371e

Add instrumentation section and update GA criteria
Signed-off-by: Dixita <ndixita@google.com>
1 parent b4454a8 commit 929371e

3 files changed

Lines changed: 77 additions & 10 deletions

File tree

keps/prod-readiness/sig-node/2837.yaml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,4 +2,6 @@ kep-number: 2837
22
alpha:
33
approver: "@jpbetz"
44
beta:
5-
approver: "@soltysh"
5+
approver: "@soltysh"
6+
stable:
7+
approver: "@jpbetz"

keps/sig-node/2837-pod-level-resource-spec/README.md

Lines changed: 70 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,13 @@
3939
- [[Scoped for Beta] HPA](#scoped-for-beta-hpa)
4040
- [Cluster Autoscaler](#cluster-autoscaler)
4141
- [VPA](#vpa)
42+
- [Instrumentation](#instrumentation)
43+
- [Admission & API Validation](#admission--api-validation)
44+
- [<code>kubelet_pod_level_resources_admission_total</code>](#kubelet_pod_level_resources_admission_total)
45+
- [The Kubelet (Execution Phase)](#the-kubelet-execution-phase)
46+
- [<code>kubelet_pod_oom_kills_total</code>](#kubelet_pod_oom_kills_total)
47+
- [<code>pod_cpu_cfs_throttled_seconds_total</code>](#pod_cpu_cfs_throttled_seconds_total)
48+
- [Regression Monitoring (Existing Metrics)](#regression-monitoring-existing-metrics)
4249
- [Test Plan](#test-plan)
4350
- [Unit tests](#unit-tests)
4451
- [Integration tests](#integration-tests)
@@ -69,6 +76,7 @@
6976
- [[Future KEP Consideration in 1.35] Topology Manager](#future-kep-consideration-in-135-topology-manager)
7077
- [[Future KEP Consideration in collaboration with sig-autoscaling] VPA](#future-kep-consideration-in-collaboration-with-sig-autoscaling-vpa)
7178
- [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey)
79+
- [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey-1)
7280
<!-- /toc -->
7381

7482

@@ -78,17 +86,17 @@
7886
Items marked with (R) are required *prior to targeting to a milestone / release*.
7987

8088
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
81-
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
82-
- [ ] (R) Design details are appropriately documented
89+
- [x] (R) KEP approvers have approved the KEP status as `implementable`
90+
- [x] (R) Design details are appropriately documented
8391
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
8492
- [ ] e2e Tests for all Beta API Operations (endpoints)
8593
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
8694
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
87-
- [ ] (R) Graduation criteria is in place
95+
- [x] (R) Graduation criteria is in place
8896
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
8997
- [ ] (R) Production readiness review completed
9098
- [ ] (R) Production readiness review approved
91-
- [ ] "Implementation History" section is up-to-date for milestone
99+
- [x] "Implementation History" section is up-to-date for milestone
92100
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
93101
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
94102

@@ -1272,6 +1280,45 @@ type RecommendedPodResources struct {
12721280
Note: Detailed KEP design is owned and being worked on by
12731281
sig-autoscaling: [#7571](https://github.com/kubernetes/autoscaler/issues/7571)
12741282
1283+
### Instrumentation
1284+
1285+
This section outlines the final list of metrics for the Pod-Level Resources feature, excluding Resource Manager extensions. These metrics are designed to provide deep observability into admission control, scheduling efficiency, and Kubelet-level execution.
1286+
1287+
#### Admission & API Validation
1288+
These metrics track feature adoption, user intent, and validation friction at admission time.
1289+
1290+
##### `kubelet_pod_level_resources_admission_total`
1291+
Total number of pods processed during Kubelet admission, categorized by resource configuration strategy.
1292+
1293+
**Note:** This metric is **ALPHA** and temporary. It is intended to track feature adoption while the feature is new and is scheduled to be removed 2-3 releases after the Pod-Level Resources feature reaches General Availability (GA).
1294+
1295+
Labels:
1296+
- `config_mode` - Possible values: `container_level`, `pod_level_only`, `pod_and_container_level`.
1297+
- `status` - Possible values: `admitted`, `rejected`.
1298+
- `qos_class` - Possible values: `guaranteed`, `burstable`, `best_effort`.
1299+
1300+
This metric is recorded as a counter.
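
The label scheme above can be illustrated with a minimal Go sketch. This is an assumption-laden illustration, not the Kubelet implementation: `configMode`, `admissionLabels`, and `admissionCounter` are hypothetical names, a plain map stands in for a real metrics library, and treating a pod with no resources set as `container_level` is an assumption the KEP text does not state.

```go
package main

import "fmt"

// admissionLabels mirrors the three labels proposed for
// kubelet_pod_level_resources_admission_total (hypothetical type).
type admissionLabels struct {
	ConfigMode string
	Status     string
	QOSClass   string
}

// configMode derives the config_mode label value. Both inputs are
// hypothetical: podLevel is true when the pod sets pod-level resources,
// containerLevel is true when at least one container sets its own.
// Assumption: a pod with neither falls back to "container_level".
func configMode(podLevel, containerLevel bool) string {
	switch {
	case podLevel && containerLevel:
		return "pod_and_container_level"
	case podLevel:
		return "pod_level_only"
	default:
		return "container_level"
	}
}

// admissionCounter is a stand-in for a labeled counter: one
// monotonically increasing value per distinct label combination.
type admissionCounter map[admissionLabels]int

// observe increments the series matching one admission decision.
func (c admissionCounter) observe(podLevel, containerLevel, admitted bool, qos string) {
	status := "rejected"
	if admitted {
		status = "admitted"
	}
	c[admissionLabels{
		ConfigMode: configMode(podLevel, containerLevel),
		Status:     status,
		QOSClass:   qos,
	}]++
}

func main() {
	c := admissionCounter{}
	c.observe(true, false, true, "guaranteed")
	c.observe(true, true, false, "burstable")
	for labels, n := range c {
		fmt.Printf("%+v = %d\n", labels, n)
	}
}
```

With three `config_mode` values, two `status` values, and three `qos_class` values, the counter has at most 18 series per node, which keeps cardinality bounded.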
1301+
1302+
#### The Kubelet (Execution Phase)
1303+
Tracks operation failures during the container lifecycle that are specific to the shared pod-level resource pool.
1304+
1305+
##### `kubelet_pod_oom_kills_total`
1306+
Total number of OOM kills triggered specifically because the shared pod-level memory pool was exhausted. This metric identifies cases where a container was killed while still under its own limit because the pod's aggregate limit was reached.
1307+
1308+
This metric is recorded as a counter.
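
The condition this metric targets can be shown with a small worked example in Go. It is a simplified sketch under stated assumptions: the actual OOM decision is made by the kernel's cgroup memory controller, not by arithmetic like this, and `container`, `podPoolExhausted`, and the byte figures are hypothetical.

```go
package main

import "fmt"

const mi int64 = 1 << 20 // one mebibyte in bytes

// container holds hypothetical per-container memory figures in bytes.
type container struct {
	Name  string
	Usage int64
	Limit int64 // 0 means no container-level limit
}

// podPoolExhausted reports whether aggregate usage has reached the
// pod-level limit while every container is still within its own limit.
func podPoolExhausted(podLimit int64, cs []container) bool {
	var total int64
	for _, c := range cs {
		if c.Limit > 0 && c.Usage > c.Limit {
			// A container-level limit would explain the kill instead.
			return false
		}
		total += c.Usage
	}
	return podLimit > 0 && total >= podLimit
}

func main() {
	cs := []container{
		{Name: "app", Usage: 250 * mi, Limit: 300 * mi},
		{Name: "sidecar", Usage: 150 * mi, Limit: 300 * mi},
	}
	// Each container is under 300Mi, but together they use 400Mi.
	fmt.Println(podPoolExhausted(400*mi, cs))
}
```

Here neither container exceeds its 300Mi limit, yet their combined 400Mi usage reaches the 400Mi pod-level limit, which is the case a per-container OOM metric alone would not explain.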
1309+
1310+
##### `pod_cpu_cfs_throttled_seconds_total`
1311+
Total time in seconds that containers in a pod were throttled due to exceeding pod-level CPU limits. This metric helps identify pods that are consistently hitting their aggregate CPU limits.
1312+
1313+
This metric is recorded as a counter.
1314+
1315+
#### Regression Monitoring (Existing Metrics)
1316+
While not new, the following metrics must be monitored to ensure no regressions in scheduling or node stability occur after adopting pod-level resource specifications.
1317+
1318+
- `schedule_attempts_total{result=~"error|unschedulable"}`: Monitored to detect spikes in unschedulable pods due to potential bugs in the new resource requirement calculation.
1319+
- `node_collector_evictions_total`: Monitored to ensure that the new pod eviction ranking logic based on pod-level requests behaves as expected and does not lead to an increase in unintended evictions.
1320+
- `started_pods_errors_total` / `started_containers_errors_total`: Monitored to detect failures in the Kubelet's ability to create and configure the pod-level cgroups and container sandboxes.
1321+
12751322
### Test Plan
12761323
12771324
[X] I/we understand the owners of the involved components may require updates to
@@ -1343,7 +1390,6 @@ feature gate and by setting the new `resources` fields in PodSpec at Pod level.
13431390
13441391
#### GA (stable)
13451392
1346-
* VPA integration of feature moved to beta.
13471393
* No major bugs reported for 3 months.
13481394
* Pod Level Resources Support With In Place Pod Vertical Scaling KEP is past alpha.
13491395
* User feedback (ideally from at least two distinct users) is green
@@ -1718,7 +1764,9 @@ Pick one more of these and delete the rest.
17181764
17191765
- [X] Metrics
17201766
- Metric name:
1721-
- `apiserver_rejected_requests` will indicate any failures (`Bad Request` code=400) related to translation of new `resources` field in PodSpec.
1767+
- `kubelet_pod_level_resources_admission_total` (**ALPHA, Temporary**): Total number of pods processed during Kubelet admission, categorized by resource configuration strategy. Scheduled for removal 2-3 releases after GA.
1768+
- `kubelet_pod_oom_kills_total`: Total number of OOM kills triggered specifically because the shared pod-level memory pool was exhausted.
1769+
- `pod_cpu_cfs_throttled_seconds_total`: Total time in seconds that containers in a pod were throttled due to exceeding pod-level CPU limits.
17221770
- `schedule_attempts_total{result=~"error|unschedulable"}`
17231771
- `node_collector_evictions_total`: to check whether a pod-level resource setting is causing more pods to be evicted than normal
17241772
- `started_pods_errors_total`: exposed by kubelet to check whether an unusually large number of pods are failing
@@ -1927,6 +1975,7 @@ resource specs.
19271975
[#4678](https://github.com/kubernetes/enhancements/pull/4678)
19281976
- **2025-06-18:** Revised KEP for Beta
19291977
- **2026-01-27:** Revised KEP for 1.36 to include fixes for issues 135082 and 136120.
1978+
- **2026-06-08:** Revised KEP for GA in 1.37.
19301979
19311980
## Drawbacks
19321981
@@ -2089,6 +2138,21 @@ recommendations have been proposed and require further discussion.
20892138
20902139
#### [Scoped for GA] User Experience Survey
20912140
2141+
Before promoting the feature to GA, we plan to conduct a UX survey to
2142+
understand user expectations for setting various combinations of requests and
2143+
limits at both the pod and container levels. This will help us gather use cases
2144+
for different combinations, enabling us to enhance the feature's usability. If we
2145+
identify the need for significant changes to the defaulting logic based on this
2146+
feedback, we'll release another Beta version of Pod-Level Resources to
2147+
incorporate those adjustments.

#### [Future KEP Consideration in collaboration with sig-autoscaling] VPA
2148+
Pod-Level Resources allows pod-level limits to be greater than aggregated container
2149+
limits to allow the containers to share idle resources among each other.
2150+
Integrating this functionality with VPA necessitates the development of a complex
2151+
new recommendation algorithm. Concepts such as proportionate pod and container level
2152+
recommendations have been proposed and require further discussion.
2153+
2154+
#### [Scoped for GA] User Experience Survey
2155+
20922156
Before promoting the feature to GA, we plan to conduct a UX survey to
20932157
understand user expectations for setting various combinations of requests and
20942158
limits at both the pod and container levels. This will help us gather use cases

keps/sig-node/2837-pod-level-resource-spec/kep.yaml

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
title: KEP Template
1+
title: Pod Level Resource Specifications
22
kep-number: 2837
33
authors:
44
- "@ndixita"
@@ -25,17 +25,18 @@ see-also: []
2525
replaces: []
2626

2727
# The target maturity stage in the current dev cycle for this KEP.
28-
stage: beta
28+
stage: stable
2929

3030
# The most recent milestone for which work toward delivery of this KEP has been
3131
# done. This can be the current (upcoming) milestone, if it is being actively
3232
# worked on.
33-
latest-milestone: "v1.36"
33+
latest-milestone: "v1.37"
3434

3535
# The milestone at which this feature was, or is targeted to be, at each stage.
3636
milestone:
3737
alpha: "v1.33"
3838
beta: "v1.34"
39+
stable: "v1.37"
3940

4041
# The following PRR answers are required at alpha release
4142
# List the feature gate name and the components for which it must be enabled
