|
39 | 39 | - [[Scoped for Beta] HPA](#scoped-for-beta-hpa) |
40 | 40 | - [Cluster Autoscaler](#cluster-autoscaler) |
41 | 41 | - [VPA](#vpa) |
| 42 | + - [Instrumentation](#instrumentation) |
| 43 | + - [Admission & API Validation](#admission--api-validation) |
| 44 | + - [<code>kubelet_pod_level_resources_admission_total</code>](#kubelet_pod_level_resources_admission_total) |
| 45 | + - [The Kubelet (Execution Phase)](#the-kubelet-execution-phase) |
| 46 | + - [<code>kubelet_pod_oom_kills_total</code>](#kubelet_pod_oom_kills_total) |
| 47 | + - [<code>pod_cpu_cfs_throttled_seconds_total</code>](#pod_cpu_cfs_throttled_seconds_total) |
| 48 | + - [Regression Monitoring (Existing Metrics)](#regression-monitoring-existing-metrics) |
42 | 49 | - [Test Plan](#test-plan) |
43 | 50 | - [Unit tests](#unit-tests) |
44 | 51 | - [Integration tests](#integration-tests) |
|
69 | 76 | - [[Future KEP Consideration in 1.35] Topology Manager](#future-kep-consideration-in-135-topology-manager) |
70 | 77 | - [[Future KEP Consideration in collaboration with sig-autoscaling] VPA](#future-kep-consideration-in-collaboration-with-sig-autoscaling-vpa) |
71 | 78 | - [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey) |
| 79 | + - [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey-1) |
72 | 80 | <!-- /toc --> |
73 | 81 |
|
74 | 82 |
|
|
78 | 86 | Items marked with (R) are required *prior to targeting to a milestone / release*. |
79 | 87 |
|
80 | 88 | - [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
81 | | -- [ ] (R) KEP approvers have approved the KEP status as `implementable` |
82 | | -- [ ] (R) Design details are appropriately documented |
| 89 | +- [x] (R) KEP approvers have approved the KEP status as `implementable` |
| 90 | +- [x] (R) Design details are appropriately documented |
83 | 91 | - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) |
84 | 92 | - [ ] e2e Tests for all Beta API Operations (endpoints) |
85 | 93 | - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) |
86 | 94 | - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free |
87 | | -- [ ] (R) Graduation criteria is in place |
| 95 | +- [x] (R) Graduation criteria is in place |
88 | 96 | - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) |
89 | 97 | - [ ] (R) Production readiness review completed |
90 | 98 | - [ ] (R) Production readiness review approved |
91 | | -- [ ] "Implementation History" section is up-to-date for milestone |
| 99 | +- [x] "Implementation History" section is up-to-date for milestone |
92 | 100 | - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
93 | 101 | - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
94 | 102 |
|
@@ -1272,6 +1280,45 @@ type RecommendedPodResources struct { |
1272 | 1280 | Note: Detailed KEP design is owned and being worked on by |
1273 | 1281 | sig-autoscaling: [#7571](https://github.com/kubernetes/autoscaler/issues/7571) |
1274 | 1282 |
|
| 1283 | +### Instrumentation |
| 1284 | +
|
| 1285 | +This section outlines the final list of metrics for the Pod-Level Resources feature, excluding Resource Manager extensions. The new metrics cover Kubelet admission and Kubelet-level execution (OOM kills and CPU throttling); existing scheduler, eviction, and pod-start metrics are reused to monitor for regressions. |
| 1286 | +
|
| 1287 | +#### Admission & API Validation |
| 1288 | +These metrics track feature adoption, user intent, and validation friction at admission time. |
| 1289 | +
|
| 1290 | +##### `kubelet_pod_level_resources_admission_total` |
| 1291 | +Total number of pods processed during Kubelet admission, categorized by resource configuration strategy. |
| 1292 | +
|
| 1293 | +**Note:** This metric is **ALPHA** and temporary. It is intended to track feature adoption while the feature is new and is scheduled to be removed 2-3 releases after the Pod-Level Resources feature reaches General Availability (GA). |
| 1294 | +
|
| 1295 | +Labels: |
| 1296 | +- `config_mode` - Possible values: `container_level`, `pod_level_only`, `pod_and_container_level`. |
| 1297 | +- `status` - Possible values: `admitted`, `rejected`. |
| 1298 | +- `qos_class` - Possible values: `guaranteed`, `burstable`, `best_effort`. |
| 1299 | +
|
| 1300 | +This metric is recorded as a counter. |
| 1301 | +
|
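
As an illustration of how the three labels above combine into one counter series per label set, here is a minimal Go sketch. It uses only the standard library; the `podResources`, `admissionLabels`, and `admissionCounter` types and the `configMode` function are hypothetical names invented for this example and are not the Kubelet's actual code. In particular, treating a pod that sets no resources at all as `container_level` is an assumption, since the text above does not say how that case is labeled.

```go
package main

import "fmt"

// podResources is a hypothetical, simplified view of a pod spec: it records
// only whether resources are set at each level, not their values.
type podResources struct {
	podLevelSet   bool   // true if pod-level requests or limits are set
	containersSet []bool // one entry per container; true if it sets resources
}

// admissionLabels mirrors the three labels listed for the metric.
type admissionLabels struct {
	configMode string
	status     string
	qosClass   string
}

// configMode derives the config_mode label value.
// Assumption: a pod that sets no resources anywhere is counted as
// "container_level"; the KEP text does not specify that case.
func configMode(p podResources) string {
	anyContainer := false
	for _, set := range p.containersSet {
		if set {
			anyContainer = true
			break
		}
	}
	switch {
	case p.podLevelSet && anyContainer:
		return "pod_and_container_level"
	case p.podLevelSet:
		return "pod_level_only"
	default:
		return "container_level"
	}
}

// admissionCounter stands in for a labeled counter: one monotonically
// increasing value per distinct label combination.
type admissionCounter map[admissionLabels]uint64

func (c admissionCounter) inc(l admissionLabels) { c[l]++ }

func main() {
	counter := admissionCounter{}
	pod := podResources{podLevelSet: true, containersSet: []bool{false, true}}
	counter.inc(admissionLabels{
		configMode: configMode(pod),
		status:     "admitted",
		qosClass:   "burstable",
	})
	for labels, n := range counter {
		fmt.Printf("%+v = %d\n", labels, n)
	}
}
```

With three `config_mode` values, two `status` values, and three `qos_class` values, the label set is bounded at 18 series per node, which keeps cardinality low for a temporary adoption metric.
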
| 1302 | +#### The Kubelet (Execution Phase) |
| 1303 | +Tracks operation failures during the container lifecycle that are specific to the shared pod-level resource pool. |
| 1304 | +
|
| 1305 | +##### `kubelet_pod_oom_kills_total` |
| 1306 | +Total number of OOM kills triggered specifically because the shared pod-level memory pool was exhausted. This metric is crucial for identifying cases where a container was killed while still under its own individual limit because the pod's aggregate limit was reached. |
| 1307 | +
|
| 1308 | +This metric is recorded as a counter. |
| 1309 | +
|
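
The condition described above (a container under its own limit, killed because the pod's aggregate limit was reached) can be written as a small predicate. The Go sketch below is illustrative only: `memState` and `isPodLevelOOM` are hypothetical names, a limit of `0` is assumed to mean "no limit set", and how the Kubelet actually attributes an OOM kill to the pod-level pool is not specified in this section.

```go
package main

import "fmt"

// memState is a hypothetical snapshot of memory usage and limits, in bytes,
// taken at the time of an OOM kill. A limit of 0 means "no limit set".
type memState struct {
	containerUsage uint64
	containerLimit uint64
	podUsage       uint64
	podLimit       uint64
}

// isPodLevelOOM reports whether the kill is attributable to the shared
// pod-level pool: the container had not reached its own limit (or had none),
// while the pod as a whole had reached the pod-level limit.
func isPodLevelOOM(m memState) bool {
	containerUnderOwnLimit := m.containerLimit == 0 || m.containerUsage < m.containerLimit
	podLimitReached := m.podLimit != 0 && m.podUsage >= m.podLimit
	return containerUnderOwnLimit && podLimitReached
}

func main() {
	const mib = 1 << 20
	// Container uses 300 MiB of a 512 MiB limit, but the pod has hit its
	// 1 GiB pod-level limit because sibling containers used the rest.
	m := memState{
		containerUsage: 300 * mib,
		containerLimit: 512 * mib,
		podUsage:       1024 * mib,
		podLimit:       1024 * mib,
	}
	fmt.Println("pod-level OOM:", isPodLevelOOM(m))
}
```
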
| 1310 | +##### `pod_cpu_cfs_throttled_seconds_total` |
| 1311 | +Total time in seconds that containers in a pod were throttled due to exceeding pod-level CPU limits. This metric helps identify pods that are consistently hitting their aggregate CPU limits. |
| 1312 | +
|
| 1313 | +This metric is recorded as a counter. |
| 1314 | +
|
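
One plausible source for a throttled-seconds value is the `throttled_usec` field of a cgroup v2 `cpu.stat` file, read from the pod-level cgroup. The Go sketch below assumes that source, which this section does not confirm; `throttledSeconds` is a hypothetical helper, and the sample content in `main` is made up for illustration.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// throttledSeconds extracts the throttled_usec field from the contents of a
// cgroup v2 cpu.stat file and converts it from microseconds to seconds.
func throttledSeconds(cpuStat string) (float64, error) {
	sc := bufio.NewScanner(strings.NewReader(cpuStat))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "throttled_usec" {
			usec, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, fmt.Errorf("parse throttled_usec: %w", err)
			}
			return float64(usec) / 1e6, nil
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, errors.New("throttled_usec not found in cpu.stat")
}

func main() {
	// Made-up sample in the key/value layout used by cpu.stat.
	sample := "usage_usec 5000000\n" +
		"user_usec 3000000\n" +
		"system_usec 2000000\n" +
		"nr_periods 100\n" +
		"nr_throttled 12\n" +
		"throttled_usec 2500000\n"

	secs, err := throttledSeconds(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("throttled for %.3f seconds\n", secs)
}
```

Because `throttled_usec` is cumulative, exposing it as a counter lets consumers compute a rate over any window rather than relying on a point-in-time value.
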
| 1315 | +#### Regression Monitoring (Existing Metrics) |
| 1316 | +While not new, the following metrics must be monitored to ensure no regressions in scheduling or node stability occur after adopting pod-level resource specifications. |
| 1317 | +
|
| 1318 | +- `schedule_attempts_total{result="error|unschedulable"}`: Monitored to detect spikes in unschedulable pods due to potential bugs in the new resource requirement calculation. |
| 1319 | +- `node_collector_evictions_total`: Monitored to ensure that the new pod eviction ranking logic based on pod-level requests behaves as expected and does not lead to an increase in unintended evictions. |
| 1320 | +- `started_pods_errors_total` / `started_containers_errors_total`: Monitored to detect failures in the Kubelet's ability to create and configure the pod-level cgroups and container sandboxes. |
| 1321 | +
|
1275 | 1322 | ### Test Plan |
1276 | 1323 |
|
1277 | 1324 | [X] I/we understand the owners of the involved components may require updates to |
@@ -1343,7 +1390,6 @@ feature gate and by setting the new `resources` fields in PodSpec at Pod level. |
1343 | 1390 |
|
1344 | 1391 | #### GA (stable) |
1345 | 1392 |
|
1346 | | -* VPA integration of feature moved to beta. |
1347 | 1393 | * No major bugs reported for 3 months. |
1348 | 1394 | * Pod Level Resources Support With In Place Pod Vertical Scaling KEP is past alpha. |
1349 | 1395 | * User feedback (ideally from at least two distinct users) is green |
@@ -1718,7 +1764,9 @@ Pick one more of these and delete the rest. |
1718 | 1764 |
|
1719 | 1765 | - [X] Metrics |
1720 | 1766 | - Metric name: |
1721 | | - - `apiserver_rejected_requests` will indicate any failures (`Bad Request` code=400) related to translation of new `resources` field in PodSpec. |
| 1767 | + - `kubelet_pod_level_resources_admission_total` (**ALPHA, Temporary**): Total number of pods processed during Kubelet admission, categorized by resource configuration strategy. Scheduled for removal 2-3 releases after GA. |
| 1768 | + - `kubelet_pod_oom_kills_total`: Total number of OOM kills triggered specifically because the shared pod-level memory pool was exhausted. |
| 1769 | + - `pod_cpu_cfs_throttled_seconds_total`: Total time in seconds that containers in a pod were throttled due to exceeding pod-level CPU limits. |
1722 | 1770 | - `schedule_attempts_total{result="error|unschedulable"}` |
1723 | 1771 | - `node_collector_evictions_total`: to check whether a pod-level resource setting is causing more pod evictions than normal |
1724 | 1772 | - `started_pods_errors_total`: exposed by the kubelet, to check whether an unusually large number of pods are failing to start |
@@ -1927,6 +1975,7 @@ resource specs. |
1927 | 1975 | [#4678](https://github.com/kubernetes/enhancements/pull/4678) |
1928 | 1976 | - **2025-06-18:** Revised KEP for Beta |
1929 | 1977 | - **2026-01-27:** Revised KEP for 1.36 to include fixes for issues 135082 and 136120. |
| 1978 | +- **2026-06-08:** Revised KEP for GA in 1.37. |
1930 | 1979 |
|
1931 | 1980 | ## Drawbacks |
1932 | 1981 |
|
@@ -2089,6 +2138,21 @@ recommendations have been proposed and require further discussion. |
2089 | 2138 |
|
2090 | 2139 | #### [Scoped for GA] User Experience Survey |
2091 | 2140 |
|
| 2141 | +Before promoting the feature to GA, we plan to conduct a UX survey to |
| 2142 | +understand user expectations for setting various combinations of requests and |
| 2143 | +limits at both the pod and container levels. This will help us gather use cases |
| 2144 | +for different combinations, enabling us to enhance the feature's usability. If we |
| 2145 | +identify the need for significant changes to the defaulting logic based on this |
| 2146 | +feedback, we'll release another Beta version of Pod-Level Resources to |
| 2147 | +incorporate those adjustments. |
| | +
|
| | +#### [Future KEP Consideration in collaboration with sig-autoscaling] VPA |
| | +
|
| 2148 | +Pod-Level Resources allows pod-level limits to be greater than aggregated container |
| 2149 | +limits to allow the containers to share idle resources among each other. |
| 2150 | +Integrating this functionality with VPA necessitates the development of a complex |
| 2151 | +new recommendation algorithm. Concepts such as proportionate pod and container level |
| 2152 | +recommendations have been proposed and require further discussion. |
| 2153 | +
|
| 2154 | +#### [Scoped for GA] User Experience Survey |
| 2155 | +
|
2092 | 2156 | Before promoting the feature to GA, we plan to conduct a UX survey to |
2093 | 2157 | understand user expectations for setting various combinations of requests and |
2094 | 2158 | limits at both the pod and container levels. This will help us gather use cases |
|