|
39 | 39 | - [[Scoped for Beta] HPA](#scoped-for-beta-hpa) |
40 | 40 | - [Cluster Autoscaler](#cluster-autoscaler) |
41 | 41 | - [VPA](#vpa) |
| 42 | + - [Instrumentation](#instrumentation) |
| 43 | + - [Admission & API Validation](#admission--api-validation) |
| 44 | + - [<code>pod_level_resources_admission_total</code>](#pod_level_resources_admission_total) |
| 45 | + - [<code>pod_level_resources_defaulting_total</code>](#pod_level_resources_defaulting_total) |
| 46 | + - [<code>pod_level_resources_validation_errors_total</code>](#pod_level_resources_validation_errors_total) |
| 47 | + - [The Kubelet (Execution Phase)](#the-kubelet-execution-phase) |
| 48 | + - [<code>kubelet_pod_level_oom_kills_total</code>](#kubelet_pod_level_oom_kills_total) |
| 49 | + - [Resource State (State Metrics)](#resource-state-state-metrics) |
| 50 | + - [<code>kube_pod_level_resources_requests</code>](#kube_pod_level_resources_requests) |
| 51 | + - [<code>kube_pod_level_resources_limits</code>](#kube_pod_level_resources_limits) |
| 52 | + - [Regression Monitoring (Existing Metrics)](#regression-monitoring-existing-metrics) |
42 | 53 | - [Test Plan](#test-plan) |
43 | 54 | - [Unit tests](#unit-tests) |
44 | 55 | - [Integration tests](#integration-tests) |
|
69 | 80 | - [[Future KEP Consideration in 1.35] Topology Manager](#future-kep-consideration-in-135-topology-manager) |
70 | 81 | - [[Future KEP Consideration in collaboration with sig-autoscaling] VPA](#future-kep-consideration-in-collaboration-with-sig-autoscaling-vpa) |
71 | 82 | - [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey) |
| 83 | + - [[Scoped for GA] User Experience Survey](#scoped-for-ga-user-experience-survey-1) |
72 | 84 | <!-- /toc --> |
73 | 85 |
|
74 | 86 |
|
|
78 | 90 | Items marked with (R) are required *prior to targeting to a milestone / release*. |
79 | 91 |
|
80 | 92 | - [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
81 | | -- [ ] (R) KEP approvers have approved the KEP status as `implementable` |
82 | | -- [ ] (R) Design details are appropriately documented |
| 93 | +- [x] (R) KEP approvers have approved the KEP status as `implementable` |
| 94 | +- [x] (R) Design details are appropriately documented |
83 | 95 | - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) |
84 | 96 | - [ ] e2e Tests for all Beta API Operations (endpoints) |
85 | 97 | - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) |
86 | 98 | - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free |
87 | | -- [ ] (R) Graduation criteria is in place |
| 99 | +- [x] (R) Graduation criteria is in place |
88 | 100 | - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) |
89 | 101 | - [ ] (R) Production readiness review completed |
90 | 102 | - [ ] (R) Production readiness review approved |
91 | | -- [ ] "Implementation History" section is up-to-date for milestone |
| 103 | +- [x] "Implementation History" section is up-to-date for milestone |
92 | 104 | - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
93 | 105 | - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
94 | 106 |
|
@@ -1272,6 +1284,77 @@ type RecommendedPodResources struct { |
1272 | 1284 | Note: Detailed KEP design is owned and being worked on by |
1273 | 1285 | sig-autoscaling: [#7571](https://github.com/kubernetes/autoscaler/issues/7571) |
1274 | 1286 |
|
| 1287 | +### Instrumentation |
| 1288 | +
|
| 1289 | +This section outlines the final list of metrics for the Pod-Level Resources feature, excluding Resource Manager extensions. The new metrics cover API server defaulting and validation, Kubelet admission, and Kubelet-level execution; existing scheduler and node metrics are reused to monitor for regressions. |
| 1290 | +
|
| 1291 | +#### Admission & API Validation |
| 1292 | +These metrics track feature adoption, user intent, and validation friction during API server defaulting and validation and during Kubelet admission. |
| 1293 | +
|
| 1294 | +##### `pod_level_resources_admission_total` |
| 1295 | +Total number of pods processed during Kubelet admission, categorized by resource configuration strategy. |
| 1296 | +
|
| 1297 | +Labels: |
| 1298 | +- `config_mode` - Possible values: `container_level`, `pod_level_only`, `pod_and_container_level`. |
| 1299 | +- `status` - Possible values: `admitted`, `rejected`. |
| 1300 | +- `qos_class` - Possible values: `guaranteed`, `burstable`, `best_effort`. |
| 1301 | +
|
| 1302 | +This metric is recorded as a counter. |
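
As an illustration of how the `config_mode` label could be derived, here is a minimal Go sketch. The `ResourceSpec` type and `ConfigMode` function are hypothetical stand-ins, not Kubernetes API types or Kubelet code. The sketch also assumes that a pod with no resources set at either level is reported as `container_level`, since the label list above has no separate value for that case.

```go
package main

import "fmt"

// ResourceSpec is a simplified, hypothetical stand-in for the requests and
// limits set on a pod or a container; it is not the real Kubernetes API type.
type ResourceSpec struct {
	Requests map[string]string
	Limits   map[string]string
}

// isSet reports whether any request or limit is specified.
func (r ResourceSpec) isSet() bool {
	return len(r.Requests) > 0 || len(r.Limits) > 0
}

// ConfigMode returns the config_mode label value for a pod.
// Assumption: a pod with no resources set anywhere falls back to
// "container_level" because no dedicated label value is listed for it.
func ConfigMode(pod ResourceSpec, containers []ResourceSpec) string {
	containerSet := false
	for _, c := range containers {
		if c.isSet() {
			containerSet = true
			break
		}
	}
	switch {
	case pod.isSet() && containerSet:
		return "pod_and_container_level"
	case pod.isSet():
		return "pod_level_only"
	default:
		return "container_level"
	}
}

func main() {
	pod := ResourceSpec{Limits: map[string]string{"memory": "1Gi"}}
	containers := []ResourceSpec{{}, {Requests: map[string]string{"cpu": "100m"}}}
	fmt.Println(ConfigMode(pod, containers))
}
```

A real implementation would additionally attach the `status` and `qos_class` labels when incrementing the counter.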
| 1303 | +
|
| 1304 | +##### `pod_level_resources_defaulting_total` |
| 1305 | +Total number of times pod-level resource fields were automatically populated by API server defaulting logic. |
| 1306 | +
|
| 1307 | +Labels: |
| 1308 | +- `resource` - Possible values: `cpu`, `memory`, `hugepages`. |
| 1309 | +- `strategy` - Possible values: `request_from_container_requests`, `request_from_pod_limit`, `hugepage_limit_from_container_limits`. |
| 1310 | +
|
| 1311 | +This metric is recorded as a counter. |
| 1312 | +
|
| 1313 | +##### `pod_level_resources_validation_errors_total` |
| 1314 | +Total number of pods rejected specifically due to validation errors in `pod.spec.resources`. |
| 1315 | +
|
| 1316 | +Labels: |
| 1317 | +- `reason` - Possible values: `sum_mismatch`, `negative_value`, `unsupported_resource`, `request_limit_mismatch`, `forbidden_field`. |
| 1318 | +
|
| 1319 | +This metric is recorded as a counter. |
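
The following Go sketch shows one way a validation failure might be mapped to a `reason` label value. The `Quantities` type, the `ValidationReason` function, and the mapping of each check to a reason are assumptions made for illustration; the actual checks live in API server validation and may differ. Only three of the five reasons are modelled.

```go
package main

import "fmt"

// Quantities holds a request and a limit for one resource in a single integer
// unit (for example millicores or bytes). In this sketch 0 means "not set".
type Quantities struct {
	Request int64
	Limit   int64
}

// ValidationReason returns the reason label for the first failed check, or an
// empty string if every check passes. The check-to-reason mapping is assumed.
func ValidationReason(pod Quantities, containers []Quantities) string {
	if pod.Request < 0 || pod.Limit < 0 {
		return "negative_value"
	}
	if pod.Limit > 0 && pod.Request > pod.Limit {
		return "request_limit_mismatch"
	}
	// Assumed meaning of sum_mismatch: the aggregated container requests
	// exceed the pod-level request.
	var sum int64
	for _, c := range containers {
		sum += c.Request
	}
	if pod.Request > 0 && sum > pod.Request {
		return "sum_mismatch"
	}
	return ""
}

func main() {
	pod := Quantities{Request: 500, Limit: 1000}
	containers := []Quantities{{Request: 300}, {Request: 300}}
	fmt.Println(ValidationReason(pod, containers))
}
```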
| 1320 | +
|
| 1321 | +#### The Kubelet (Execution Phase) |
| 1322 | +Tracks operation failures during the container lifecycle that are specific to the shared pod-level resource pool. |
| 1323 | +
|
| 1324 | +##### `kubelet_pod_level_oom_kills_total` |
| 1325 | +Total number of OOM kills triggered specifically because the shared pod-level memory pool was exhausted. This metric identifies cases where a container was killed while still under its own memory limit because the pod's aggregate memory limit was reached. |
| 1326 | +
|
| 1327 | +This metric is recorded as a counter. |
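
To make the distinction concrete, the Go sketch below classifies an OOM kill as pod-level when the pod's memory limit was reached while the killed container was still under its own limit. `IsPodLevelOOM` and its parameters are hypothetical; how the Kubelet actually attributes OOM events (for example from cgroup events) is not specified here.

```go
package main

import "fmt"

// IsPodLevelOOM reports whether an OOM kill should be attributed to the
// pod-level memory limit rather than to the container's own limit.
// All values are in bytes; a limit of 0 means "not set" in this sketch.
func IsPodLevelOOM(containerUsage, containerLimit, podUsage, podLimit int64) bool {
	if podLimit == 0 || podUsage < podLimit {
		// No pod-level limit, or the pod-level pool was not exhausted.
		return false
	}
	// The pod-level pool is exhausted; attribute the kill to it only if the
	// container had not also reached its own limit.
	return containerLimit == 0 || containerUsage < containerLimit
}

func main() {
	const mi = 1 << 20
	// The container uses 100Mi of its 200Mi limit, but the pod as a whole
	// has reached its 512Mi limit.
	fmt.Println(IsPodLevelOOM(100*mi, 200*mi, 512*mi, 512*mi))
}
```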
| 1328 | +
|
| 1329 | +#### Resource State (State Metrics) |
| 1330 | +These metrics expose the current state of Pod-level resource requests and limits, primarily for consumption by kube-state-metrics and observability dashboards. |
| 1331 | +
|
| 1332 | +##### `kube_pod_level_resources_requests` |
| 1333 | +Exposes the value of `pod.spec.resources.requests` for each resource type. |
| 1334 | +
|
| 1335 | +Labels: |
| 1336 | +- `pod` - Name of the pod. |
| 1337 | +- `namespace` - Namespace of the pod. |
| 1338 | +- `uid` - Kubernetes UID of the pod. |
| 1339 | +- `resource` - The resource type (e.g., `cpu`, `memory`). |
| 1340 | +- `unit` - The unit of the resource (e.g., `core`, `bytes`). |
| 1341 | +
|
| 1342 | +This metric is recorded as a gauge. |
| 1343 | +
|
| 1344 | +##### `kube_pod_level_resources_limits` |
| 1345 | +Exposes the value of `pod.spec.resources.limits` for each resource type. |
| 1346 | +
|
| 1347 | +Labels: Same as `kube_pod_level_resources_requests`. |
| 1348 | +
|
| 1349 | +This metric is recorded as a gauge. |
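
For reference, the Go sketch below renders one sample of these gauges in the Prometheus text exposition format, using the labels listed above. The `PodResourceSample` type, the `ExpositionLine` function, and the example pod values are hypothetical; kube-state-metrics has its own generators and this is not its code.

```go
package main

import (
	"fmt"
	"strconv"
)

// PodResourceSample is a hypothetical flattened view of one entry in
// pod.spec.resources.requests or pod.spec.resources.limits.
type PodResourceSample struct {
	Pod, Namespace, UID string
	Resource, Unit      string
	Value               float64
}

// ExpositionLine renders the sample in the Prometheus text exposition format.
// %q is used for label values; it matches Prometheus escaping for the plain
// ASCII values used in this sketch.
func ExpositionLine(metric string, s PodResourceSample) string {
	return fmt.Sprintf("%s{pod=%q,namespace=%q,uid=%q,resource=%q,unit=%q} %s",
		metric, s.Pod, s.Namespace, s.UID, s.Resource, s.Unit,
		strconv.FormatFloat(s.Value, 'g', -1, 64))
}

func main() {
	s := PodResourceSample{
		Pod: "web-0", Namespace: "default", UID: "abc-123",
		Resource: "cpu", Unit: "core", Value: 0.5,
	}
	fmt.Println(ExpositionLine("kube_pod_level_resources_requests", s))
}
```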
| 1350 | +
|
| 1351 | +#### Regression Monitoring (Existing Metrics) |
| 1352 | +While not new, the following metrics must be monitored to ensure no regressions in scheduling or node stability occur after adopting pod-level resource specifications. |
| 1353 | +
|
| 1354 | +- `schedule_attempts_total{result=~"error|unschedulable"}`: Monitored to detect spikes in unschedulable pods due to potential bugs in the new resource requirement calculation. |
| 1355 | +- `node_collector_evictions_total`: Monitored to ensure that the new pod eviction ranking logic based on pod-level requests behaves as expected and does not lead to an increase in unintended evictions. |
| 1356 | +- `started_pods_errors_total` / `started_containers_errors_total`: Monitored to detect failures in the Kubelet's ability to create and configure the pod-level cgroups and container sandboxes. |
| 1357 | +
|
1275 | 1358 | ### Test Plan |
1276 | 1359 |
|
1277 | 1360 | [X] I/we understand the owners of the involved components may require updates to |
@@ -1343,7 +1426,6 @@ feature gate and by setting the new `resources` fields in PodSpec at Pod level. |
1343 | 1426 |
|
1344 | 1427 | #### GA (stable) |
1345 | 1428 |
|
1346 | | -* VPA integration of feature moved to beta. |
1347 | 1429 | * No major bugs reported for 3 months. |
1348 | 1430 | * Pod Level Resources Support With In Place Pod Vertical Scaling KEP is past alpha. |
1349 | 1431 | * User feedback (ideally from at least two distinct users) is green |
@@ -1718,6 +1800,12 @@ Pick one more of these and delete the rest. |
1718 | 1800 |
|
1719 | 1801 | - [X] Metrics |
1720 | 1802 | - Metric name: |
| 1803 | + - `pod_level_resources_admission_total`: Total number of pods processed during Kubelet admission, categorized by resource configuration strategy. |
| 1804 | + - `pod_level_resources_defaulting_total`: Total number of times pod-level resource fields were automatically populated by API server defaulting logic. |
| 1805 | + - `pod_level_resources_validation_errors_total`: Total number of pods rejected specifically due to validation errors in `pod.spec.resources`. |
| 1806 | + - `kubelet_pod_level_oom_kills_total`: Total number of OOM kills triggered specifically because the shared pod-level memory pool was exhausted. |
| 1807 | + - `kube_pod_level_resources_requests`: Exposes the value of `pod.spec.resources.requests` for each resource type. |
| 1808 | + - `kube_pod_level_resources_limits`: Exposes the value of `pod.spec.resources.limits` for each resource type. |
1721 | 1809 | - `apiserver_rejected_requests` will indicate any failures (`Bad Request` code=400) related to translation of new `resources` field in PodSpec. |
1722 | 1810 | - `schedule_attempts_total{result=~"error|unschedulable"}` |
1723 | 1811 | - `node_collector_evictions_total`: to check whether a pod-level resource setting is causing more pod evictions than normal |
@@ -1927,6 +2015,7 @@ resource specs. |
1927 | 2015 | [#4678](https://github.com/kubernetes/enhancements/pull/4678) |
1928 | 2016 | - **2025-06-18:** Revised KEP for Beta |
1929 | 2017 | - **2026-01-27:** Revised KEP for 1.36 to include fixes for issues 135082 and 136120. |
| 2018 | +- **2026-06-08:** Revised KEP for GA in 1.37. |
1930 | 2019 |
|
1931 | 2020 | ## Drawbacks |
1932 | 2021 |
|
@@ -2089,6 +2178,21 @@ recommendations have been proposed and require further discussion. |
2089 | 2178 |
|
2090 | 2179 | #### [Scoped for GA] User Experience Survey |
2091 | 2180 |
|
| 2181 | +Before promoting the feature to GA, we plan to conduct a UX survey to |
| 2182 | +understand user expectations for setting various combinations of requests and |
| 2183 | +limits at both the pod and container levels. This will help us gather use cases |
| 2184 | +for different combinations, enabling us to enhance the feature's usability. If we |
| 2185 | +identify the need for significant changes to the defaulting logic based on this |
| 2186 | +feedback, we'll release another Beta version of Pod-Level Resources to |
| 2187 | +incorporate those adjustments. |
| | +#### [Future KEP Consideration in collaboration with sig-autoscaling] VPA |
| 2188 | +Pod-Level Resources allows pod-level limits to be greater than aggregated container |
| 2189 | +limits to allow the containers to share idle resources among each other. |
| 2190 | +Integrating this functionality with VPA necessitates the development of a complex |
| 2191 | +new recommendation algorithm. Concepts such as proportionate pod and container level |
| 2192 | +recommendations have been proposed and require further discussion. |
| 2193 | +
|
| 2194 | +#### [Scoped for GA] User Experience Survey |
| 2195 | +
|
2092 | 2196 | Before promoting the feature to GA, we plan to conduct a UX survey to |
2093 | 2197 | understand user expectations for setting various combinations of requests and |
2094 | 2198 | limits at both the pod and container levels. This will help us gather use cases |
|