Which component are you using?:
/area cluster-autoscaler
What version of the component are you using?:
v1.35.3-gke.2190000, and also in the latest (commit 18bcb5e)
Component version:
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.35.3-gke.2190000
What environment is this in?:
- GKE, Standard cluster
- an autoscaling GPU node pool with 0 min nodes and 8 max nodes, each node with 8 GPUs
- Kueue configured with TAS
- a stream of mixed 8-GPU and 1-GPU jobs, managed by Kueue
- the stream of jobs varies over time, so the node pool scales up and down
- AdmissionCheck and ProvisioningRequest are set to the class best-effort-atomic-scale-up.autoscaling.x-k8s.io, to make TAS work correctly with autoscaling
- Currently the cluster has 7 nodes ready, with 8 GPUs in use and 48 GPUs free
- A batch of 256 1-GPU jobs arrives.
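For readers trying to reconstruct the environment, the Kueue side of a setup like the one described above might look roughly like the following. This is only a sketch: the object names (`dws-check`, `dws-config`) are hypothetical, the `kueue.x-k8s.io/v1beta1` API version is an assumption that depends on the installed Kueue release, and the reporter's actual manifests are not shown in this issue.

```yaml
# Hypothetical names; not the reporter's actual configuration.
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: dws-check            # assumed name
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config         # assumed name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config           # assumed name
spec:
  # Provisioning class named in the environment description.
  provisioningClassName: best-effort-atomic-scale-up.autoscaling.x-k8s.io
```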
What did you expect to happen?:
- Kueue reserves 48 Workloads in cluster-queue and admits 48 jobs
- In a few minutes 48 new jobs are running, no GPUs remain unused
What happened instead?:
- After 3 minutes only 26 jobs are running, 22 GPUs remain unused
- Kueue reserved 48 Workloads in cluster-queue and created 48 ProvisioningRequests (all Accepted True).
- 26 ProvisioningRequests have Provisioned True, while 22 have it False with reason: CapacityIsNotFound, message: Capacity is not found, CA will try to find it later.
- After 14 minutes: 38 PR Provisioned True, 10 Provisioned False
- After 22 minutes: 43 PR Provisioned True, 5 Provisioned False
- After 35 minutes: all 48 PR Provisioned True, all 48 jobs are running, no GPUs remain unused
How to reproduce it (as minimally and precisely as possible):
- Start with a node that has room for 2 CPUs.
- Create a best-effort-atomic PR for one 1-CPU pod; let it reach Provisioned=True.
- Create a pod for this PR, requesting 1 CPU.
- Create another best-effort-atomic PR for one 1-CPU pod.
- The new PR reports no capacity even though one CPU is idle.
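The steps above could be expressed with manifests along these lines. This is a minimal sketch rather than the exact objects used: all names (`one-cpu-template`, `pr-1`, `pr-2`, `pod-1`) are hypothetical, and the `autoscaling.x-k8s.io/v1` API version and the pod annotation keys are assumptions that may differ between cluster-autoscaler releases. The documents are meant to be applied one at a time, in order, waiting for `pr-1` to report Provisioned=True before creating the pod.

```yaml
# Step 0: pod template shared by both ProvisioningRequests (hypothetical name).
apiVersion: v1
kind: PodTemplate
metadata:
  name: one-cpu-template
  namespace: default
template:
  spec:
    restartPolicy: Never
    containers:
    - name: main
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "1"
---
# Step 1: first best-effort-atomic PR for one 1-CPU pod.
apiVersion: autoscaling.x-k8s.io/v1   # assumed API version
kind: ProvisioningRequest
metadata:
  name: pr-1
  namespace: default
spec:
  provisioningClassName: best-effort-atomic-scale-up.autoscaling.x-k8s.io
  podSets:
  - count: 1
    podTemplateRef:
      name: one-cpu-template
---
# Step 2: pod that consumes pr-1 (create after pr-1 is Provisioned=True).
apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  namespace: default
  annotations:
    # Annotation keys are assumptions; check the ProvisioningRequest docs
    # for the release in use.
    autoscaling.x-k8s.io/consume-provisioning-request: pr-1
    autoscaling.x-k8s.io/provisioning-class-name: best-effort-atomic-scale-up.autoscaling.x-k8s.io
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "1"
---
# Step 3: second PR; per the report, this one gets CapacityIsNotFound
# even though one CPU on the node is still idle.
apiVersion: autoscaling.x-k8s.io/v1   # assumed API version
kind: ProvisioningRequest
metadata:
  name: pr-2
  namespace: default
spec:
  provisioningClassName: best-effort-atomic-scale-up.autoscaling.x-k8s.io
  podSets:
  - count: 1
    podTemplateRef:
      name: one-cpu-template
```

The Provisioned condition of each request can then be inspected with `kubectl describe` on the ProvisioningRequest object.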