cluster-autoscaler: allow MachinePool scale-down when selected node has no matching Machine #9681

@mnaissiou

Description

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

cluster-autoscaler v1.33.4

Component version:

What k8s version are you using (kubectl version)?:

Client Version: v1.33.0
Server Version: v1.33.3

What environment is this in?:
Cluster API-based environment using MachinePools on Azure.
cloudProvider: clusterapi
Azure AKS v1.33.3

What did you expect to happen?:
When cluster-autoscaler selects a node belonging to a Cluster API MachinePool for scale-down, I expect scale-down to still work even if no backing Machine object can be resolved for that node, by falling back to MachinePool replica decrement.

In other words:

  • If a matching Machine exists, use the normal targeted deletion path.
  • If no matching Machine exists but the node belongs to a MachinePool, decrease the MachinePool replica count instead of blocking scale-down.
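
The expected behavior above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual cluster-autoscaler code: the `Machine` and `MachinePool` types, `findMachine`, and `scaleDownNode` are hypothetical names invented for this sketch, and the real targeted deletion and `/scale` update are replaced by an in-memory replica counter.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine and MachinePool are simplified, hypothetical stand-ins for the
// Cluster API objects; they exist only for this sketch.
type Machine struct {
	Name       string
	ProviderID string
}

type MachinePool struct {
	Name     string
	Replicas int
	Machines []Machine // Machines that can be resolved; may be incomplete
}

var errNoReplicas = errors.New("machine pool already has zero replicas")

// findMachine returns the Machine backing the given provider ID, if any.
func findMachine(pool *MachinePool, providerID string) (Machine, bool) {
	for _, m := range pool.Machines {
		if m.ProviderID == providerID {
			return m, true
		}
	}
	return Machine{}, false
}

// scaleDownNode removes one node from the pool and reports which path it took.
func scaleDownNode(pool *MachinePool, providerID string) (string, error) {
	if pool.Replicas <= 0 {
		return "", errNoReplicas
	}
	if m, ok := findMachine(pool, providerID); ok {
		// Normal path: a real implementation would mark m for deletion here.
		pool.Replicas--
		return "targeted:" + m.Name, nil
	}
	// Fallback path: no Machine resolved, so only decrement the replica count.
	pool.Replicas--
	return "replica-decrement", nil
}

func main() {
	pool := &MachinePool{
		Name:     "pool-a",
		Replicas: 3,
		Machines: []Machine{{Name: "m-1", ProviderID: "id-1"}},
	}
	path, err := scaleDownNode(pool, "id-2") // id-2 has no Machine
	fmt.Println(path, pool.Replicas, err)
}
```

In this sketch the fallback cannot choose which instance the infrastructure provider removes; it only lowers the desired replica count after the node has been drained.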

What happened instead?:

Without the fallback, scale-down was blocked whenever the selected node belonged to a MachinePool but no corresponding Machine could be found for it.

After applying a local patch implementing a fallback to replica decrement, scale-down succeeded.

Relevant logs:

I... nodegroup <pool-name> has 3 nodes: [<provider-id-1> <provider-id-2> <provider-id-3>]
W... No Machine found for node "<provider-id-2>" in MachinePool "MachinePool/<namespace>/<pool-name>", falling back to replica decrement only
I... Event(...): type: 'Normal' reason: 'ScaleDown' Scale-down: node <node-name> removed with drain

This suggests that for some MachinePool-based implementations, requiring a resolvable Machine for the selected node prevents a valid scale-down operation.
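
To illustrate the missing Node -> Machine mapping, here is a small sketch that assumes the lookup is keyed by provider ID, as the warning log line suggests. The `machineRef` type, the two helper functions, and the `azure://vm-N` provider IDs are all made up for this example and do not come from the cluster-autoscaler code base.

```go
package main

import "fmt"

// machineRef is a hypothetical, simplified view of a Machine object.
type machineRef struct {
	Name       string
	ProviderID string
}

// indexByProviderID builds a provider ID -> Machine name lookup table.
// Machines without a provider ID are skipped.
func indexByProviderID(machines []machineRef) map[string]string {
	idx := make(map[string]string, len(machines))
	for _, m := range machines {
		if m.ProviderID != "" {
			idx[m.ProviderID] = m.Name
		}
	}
	return idx
}

// unresolvedNodes returns the node provider IDs that have no matching
// Machine, in input order.
func unresolvedNodes(nodeProviderIDs []string, idx map[string]string) []string {
	var missing []string
	for _, id := range nodeProviderIDs {
		if _, ok := idx[id]; !ok {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// Three nodes in the node group, but only two resolvable Machines.
	machines := []machineRef{
		{Name: "m-1", ProviderID: "azure://vm-1"},
		{Name: "m-3", ProviderID: "azure://vm-3"},
	}
	nodes := []string{"azure://vm-1", "azure://vm-2", "azure://vm-3"}
	fmt.Println(unresolvedNodes(nodes, indexByProviderID(machines))) // prints [azure://vm-2]
}
```

A node that shows up in the unresolved list is exactly the case where a strict "Machine required" check would block scale-down.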

How to reproduce it (as minimally and precisely as possible):

  1. Deploy cluster-autoscaler with the Cluster API provider enabled.
  2. Use a Cluster API MachinePool-backed node group.
  3. Ensure the autoscaler can discover the MachinePool and read its /scale subresource successfully.
  4. Create a situation where:
    • a node in the MachinePool becomes a valid scale-down candidate
    • but cluster-autoscaler cannot resolve that specific node to a backing Machine object
  5. Observe that scale-down is blocked unless a fallback to MachinePool replica decrement is implemented.

In our case, once fallback-to-replica-decrement logic was added, the autoscaler was able to:

  • drain the selected node
  • reduce the MachinePool size
  • complete scale-down successfully

Anything else we need to know?:

Additional anonymized observations:

  • Management cluster access was working correctly.
  • MachinePool/scale GET requests succeeded.
  • Node group discovery succeeded.
  • The issue was not caused by authentication failures.
  • Earlier in the investigation, some scale-down attempts were also legitimately prevented by workload constraints (CPU requests / PodDisruptionBudget), but once a node became removable, the remaining issue was specifically the missing Node -> Machine mapping.

Example logs showing management-cluster access and successful nodegroup discovery:

I... discovered node group: MachinePool/<namespace>/<pool-name> (min: 1, max: 3, replicas: 3)
I... GET ... /apis/cluster.x-k8s.io/v1beta1/namespaces/<namespace>/machinepools/<pool-name>/scale

Labels: area/cluster-autoscaler, area/provider/cluster-api, kind/bug, needs-triage