Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
cluster-autoscaler v1.33.4
Component version:
What k8s version are you using (kubectl version)?:
Client Version: v1.33.0
Server Version: v1.33.3
What environment is this in?:
Cluster API-based environment using MachinePools on Azure.
cloudProvider: clusterapi
Azure AKS v1.33.3
What did you expect to happen?:
When cluster-autoscaler selects a node belonging to a Cluster API MachinePool for scale-down, I expect scale-down to still work even if no backing Machine object can be resolved for that node, by falling back to MachinePool replica decrement.
In other words:
- If a matching Machine exists, use the normal targeted deletion path.
- If no matching Machine exists, but the node belongs to a MachinePool, decrease the MachinePool replica count instead of blocking scale-down.
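To make the expected behavior concrete, here is a minimal sketch of the decision logic in Go. It is not the actual cluster-autoscaler code or my local patch: the `Machine` and `MachinePool` types, `findMachine`, `scaleDownNode`, and the `MinSize` guard are all hypothetical, simplified stand-ins used only to illustrate the two paths.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine and MachinePool are simplified, hypothetical stand-ins for the
// Cluster API objects; they are NOT the real cluster-autoscaler types.
type Machine struct {
	Name       string
	ProviderID string
}

type MachinePool struct {
	Name     string
	Replicas int
	MinSize  int       // assumption: the pool has a configured minimum size
	Machines []Machine // may be empty if the provider exposes no Machine objects
}

// ScaleDownAction records which path was taken for a node.
type ScaleDownAction string

const (
	TargetedDelete   ScaleDownAction = "targeted-machine-delete"
	ReplicaDecrement ScaleDownAction = "replica-decrement"
)

// ErrAtMinSize is returned when the pool cannot shrink any further.
var ErrAtMinSize = errors.New("machine pool already at minimum size")

// findMachine returns the Machine whose provider ID matches, if any.
func findMachine(pool *MachinePool, providerID string) (Machine, bool) {
	for _, m := range pool.Machines {
		if m.ProviderID == providerID {
			return m, true
		}
	}
	return Machine{}, false
}

// scaleDownNode picks the targeted path when a Machine can be resolved and
// otherwise falls back to a plain replica decrement instead of failing.
func scaleDownNode(pool *MachinePool, providerID string) (ScaleDownAction, error) {
	if pool.Replicas <= pool.MinSize {
		return "", ErrAtMinSize
	}
	action := ReplicaDecrement
	if _, ok := findMachine(pool, providerID); ok {
		// In a real provider the Machine would be marked for deletion here;
		// this sketch only records which path was chosen.
		action = TargetedDelete
	}
	pool.Replicas--
	return action, nil
}

func main() {
	// A pool with three replicas and no resolvable Machine objects.
	pool := &MachinePool{Name: "pool-a", Replicas: 3, MinSize: 1}
	action, err := scaleDownNode(pool, "provider-id-2")
	fmt.Println(action, pool.Replicas, err)
}
```

The point of the sketch is only that a missing Machine selects a different path rather than returning an error. Whether the real fix should also verify that the drained node is the instance the infrastructure provider actually removes is a separate question that this example does not address.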
What happened instead?:
Without the fallback, scale-down was blocked whenever the selected node belonged to a MachinePool but no corresponding Machine could be found for it.
After applying a local patch implementing a fallback to replica decrement, scale-down succeeded.
Relevant logs:
I... nodegroup <pool-name> has 3 nodes: [<provider-id-1> <provider-id-2> <provider-id-3>]
W... No Machine found for node "<provider-id-2>" in MachinePool "MachinePool/<namespace>/<pool-name>", falling back to replica decrement only
I... Event(...): type: 'Normal' reason: 'ScaleDown' Scale-down: node <node-name> removed with drain
This suggests that for some MachinePool-based implementations, requiring a resolvable Machine for the selected node prevents a valid scale-down operation.
How to reproduce it (as minimally and precisely as possible):
- Deploy cluster-autoscaler with the Cluster API provider enabled.
- Use a Cluster API MachinePool-backed node group.
- Ensure the autoscaler can discover the MachinePool and read its /scale subresource successfully.
- Create a situation where:
  - a node in the MachinePool becomes a valid scale-down candidate
  - but cluster-autoscaler cannot resolve that specific node to a backing Machine object
- Observe that scale-down is blocked unless a fallback to MachinePool replica decrement is implemented.
In our case, once fallback-to-replica-decrement logic was added, the autoscaler was able to:
- drain the selected node
- reduce the MachinePool size
- complete scale-down successfully
Anything else we need to know?:
Additional anonymized observations:
- Management cluster access was working correctly.
- MachinePool/scale GET requests succeeded.
- Node group discovery succeeded.
- The issue was not caused by authentication failures.
- Earlier in the investigation, some scale-down attempts were also legitimately prevented by workload constraints (CPU requests / PodDisruptionBudget), but once a node became removable, the remaining issue was specifically the missing Node -> Machine mapping.
Example logs showing management-cluster access and successful nodegroup discovery:
I... discovered node group: MachinePool/<namespace>/<pool-name> (min: 1, max: 3, replicas: 3)
I... GET ... /apis/cluster.x-k8s.io/v1beta1/namespaces/<namespace>/machinepools/<pool-name>/scale