best-effort-atomic-scale-up ProvisioningRequest wrongly reports CapacityIsNotFound when capacity is really present #9805

@cvgenesis

Which component are you using?:

/area cluster-autoscaler

What version of the component are you using?:

v1.35.3-gke.2190000, and also in the latest (commit 18bcb5e)

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.35.3-gke.2190000

What environment is this in?:

  • GKE, Standard cluster
  • an autoscaling GPU node pool with 0 min nodes and 8 max nodes, each node with 8 GPUs
  • Kueue configured with TAS
  • a stream of mixed 8-GPU and 1-GPU jobs, managed by Kueue
  • the stream of jobs varies over time, so the node pool scales up and down
  • AdmissionCheck and ProvisioningRequest are set to the class best-effort-atomic-scale-up.autoscaling.x-k8s.io so that TAS works correctly with autoscaling
  • Currently the cluster has 7 nodes ready, with 8 GPUs in use and 48 GPUs free
  • A batch of 256 1-GPU jobs arrives.

What did you expect to happen?:

  1. Kueue reserves 48 Workloads in cluster-queue and admits 48 jobs
  2. In a few minutes 48 new jobs are running, no GPUs remain unused
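
For reference, the expected number of 48 follows directly from the environment figures above. A minimal sketch of that arithmetic, counting only the nodes that are already ready (the variable names are mine, not anything from Kueue or Cluster Autoscaler):

```python
# Figures taken from the environment description in this issue.
GPUS_PER_NODE = 8
ready_nodes = 7
gpus_in_use = 8
batch_size = 256  # incoming 1-GPU jobs

# Free GPUs on the nodes that are already ready (no scale-up considered).
free_gpus = ready_nodes * GPUS_PER_NODE - gpus_in_use

# Each job needs exactly one GPU, so at most one job per free GPU can start.
expected_admitted = min(batch_size, free_gpus)

print(f"free GPUs: {free_gpus}, jobs expected to start: {expected_admitted}")
```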

What happened instead?:

  1. After 3 minutes only 26 jobs are running, 22 GPUs remain unused
  2. Kueue reserved 48 Workloads in cluster-queue and created 48 ProvisioningRequests (all Accepted True).
  3. 26 ProvisioningRequests have Provisioned True, while 22 have it False with reason: CapacityIsNotFound, message: Capacity is not found, CA will try to find it later.
  4. After 14 minutes: 38 PRs Provisioned True, 10 Provisioned False
  5. After 22 minutes: 43 PRs Provisioned True, 5 Provisioned False
  6. After 35 minutes: all 48 PR Provisioned True, all 48 jobs are running, no GPUs remain unused
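
To make the timeline above easier to read, here is a small sketch that tabulates it. It assumes every ProvisioningRequest maps to exactly one 1-GPU job, so each PR that is not yet Provisioned=True corresponds to one idle GPU; the numbers are the ones reported above, nothing is measured by this script.

```python
TOTAL_PRS = 48  # ProvisioningRequests created by Kueue, one per 1-GPU job

# (minutes since the batch arrived, PRs with Provisioned=True), as observed above
observations = [(3, 26), (14, 38), (22, 43), (35, 48)]


def idle_gpus(provisioned_true: int, total: int = TOTAL_PRS) -> int:
    """GPUs left unused because their PR is still reported as CapacityIsNotFound."""
    return total - provisioned_true


rows = [(minutes, done, idle_gpus(done)) for minutes, done in observations]

print(f"{'minute':>6} {'Provisioned=True':>17} {'idle GPUs':>10}")
for minutes, done, idle in rows:
    print(f"{minutes:>6} {done:>17} {idle:>10}")
```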

How to reproduce it (as minimally and precisely as possible):

  1. Start with a node that has room for 2 CPUs.
  2. Create a best-effort-atomic-scale-up PR for one 1-CPU pod and let it reach Provisioned=True.
  3. Create a 1-CPU pod that consumes this PR.
  4. Create another best-effort-atomic-scale-up PR for one 1-CPU pod.
  5. The new PR reports that no capacity is found, even though one CPU is idle.
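
The accounting I would expect for these steps can be sketched as below. This is only a toy model of the expected result, with a hypothetical `Node` class; it is not how Cluster Autoscaler computes capacity and it does not try to explain where the wrong answer comes from.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a node: allocatable CPUs vs. CPUs requested by pods."""
    allocatable_cpu: int
    requested_cpu: int = 0

    def free_cpu(self) -> int:
        return self.allocatable_cpu - self.requested_cpu

    def fits(self, cpu: int) -> bool:
        return cpu <= self.free_cpu()


node = Node(allocatable_cpu=2)   # step 1: room for 2 CPUs

first_pr_fits = node.fits(1)     # step 2: first PR should reach Provisioned=True
node.requested_cpu += 1          # step 3: the 1-CPU pod for that PR is scheduled
second_pr_fits = node.fits(1)    # step 4: second PR for another 1-CPU pod

# Expected: one CPU is still free, so the second PR should also be Provisioned=True.
print(first_pr_fits, second_pr_fits, node.free_cpu())
```

In the actual cluster the second PR is instead reported as CapacityIsNotFound (step 5).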

Labels: area/cluster-autoscaler, kind/bug, needs-triage