A customer's user base increased and they noticed unused nodes not being scaled down. The cluster-autoscaler logs show patterns like:
2022-11-07T20:57:07.240907279Z I1107 20:57:07.240885 1 static_autoscaler.go:492] ip-112-69-165-28.us-west-1.compute.internal is unneeded since 2022-11-07 19:15:55.669078933 +0000 UTC m=+2844575.632124653 duration 1h41m7.067752862s
2022-11-07T20:57:07.240934601Z I1107 20:57:07.240894 1 static_autoscaler.go:503] Scale down status: unneededOnly=true lastScaleUpTime=2022-11-07 20:54:51.136809083 +0000 UTC m=+2850511.099854771 lastScaleDownDeleteTime=2022-11-04 09:57:14.258673522 +0000 UTC m=+2551854.221719218 lastScaleDownFailTime=2022-11-01 04:40:41.548130457 +0000 UTC m=+2273661.511176156 scaleDownForbidden=true isDeleteInProgress=false scaleDownInCooldown=true
They manually removed many unneeded nodes and restarted the cluster-autoscaler pod, but eventually noticed the scaleDownInCooldown=true flag in the logs and the continued lack of scale-down activity.
The logs indicate there are unneeded nodes but they are not being scaled down due to
scaleDownInCooldown=true. Reasons for this behavior are described in the cluster-autoscaler's FAQ at https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-nodes-with-low-utilization-but-they-are-not-scaled-down-why . In this case the following was true due to the increase in user activity:
there was a scale-up in the last 10 min (configurable by --scale-down-delay-after-add).
The default value for --scale-down-delay-after-add is 10 minutes. Particular nodes at the customer had been unneeded for longer than the default ten minutes, but so many runs and workspaces were being started that scale-ups were occurring every 8 or 9 minutes, making the "cooldown" phase almost permanent because the ten-minute delay never elapsed.
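One way to check whether scale-ups are outpacing the delay is to measure the gap between consecutive scale-up events in the autoscaler logs. The sketch below is a minimal example using illustrative sample log lines and GNU date; the exact log message format, deployment name, and namespace are assumptions and should be adjusted to the cluster at hand.

```shell
#!/bin/sh
# Sketch: measure the gap between consecutive scale-up events.
# In a real cluster the timestamps would come from something like:
#   kubectl -n kube-system logs deploy/cluster-autoscaler | grep 'Scale-up'
# The two lines below are illustrative samples, not real output.
cat <<'EOF' > /tmp/ca-scaleups.log
2022-11-07T20:45:51Z Scale-up: setting group ng-1 size to 12
2022-11-07T20:54:51Z Scale-up: setting group ng-1 size to 13
EOF

# Print the gap in minutes between each pair of consecutive scale-ups.
prev=""
while read -r ts _; do
  cur=$(date -d "$ts" +%s)   # GNU date; on BSD/macOS use: date -j -f ...
  [ -n "$prev" ] && echo "gap: $(( (cur - prev) / 60 )) minutes"
  prev=$cur
done < /tmp/ca-scaleups.log
# For the sample above this prints: gap: 9 minutes
```

If the typical gap is consistently shorter than --scale-down-delay-after-add, the cooldown will rarely, if ever, expire.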
scaleDownInCooldown relates to a "cooldown" phase in which the autoscaler avoids scaling nodes down while it believes scale-ups are "currently" needed. The delay may need to be tuned to find a happy medium based on cluster activity.
Per the above FAQ, the fix was to set --scale-down-delay-after-add to 5 minutes rather than the default 10. The value was determined through experimentation and by assessing how frequently users were triggering scale-ups (based on the autoscaler logs).
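On a self-managed cluster-autoscaler, this flag is typically passed in the container's command in the Deployment spec. The fragment below is a sketch; the container name, surrounding flags, and deployment layout are assumptions about a typical setup, not the customer's exact manifest.

```yaml
# Hypothetical fragment of the cluster-autoscaler Deployment spec;
# only the relevant flag is shown.
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --scale-down-delay-after-add=5m
```

Editing the Deployment (e.g. `kubectl -n kube-system edit deploy cluster-autoscaler`, assuming that namespace and name) rolls the pod and picks up the new value.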
The scaleDownForbidden=true flag can be set when the cluster-autoscaler thinks some pod should be scheduled but might not be schedulable for a variety of reasons. This is normally a transitory state as workloads come and go, but if it persists it is generally due to something like capacity problems, incorrect labeling, or GPU operators not starting.
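To monitor these flags over time without reading full log lines, the "Scale down status" entries can be filtered down to just the scaleDown* fields. The sketch below works on a sample line copied from the logs above; in a real cluster the input would come from the autoscaler pod's logs.

```shell
#!/bin/sh
# Sketch: extract the scaleDown* flags from a "Scale down status" log line.
# In a real cluster, pipe in something like:
#   kubectl -n kube-system logs deploy/cluster-autoscaler | grep 'Scale down status'
line='Scale down status: unneededOnly=true scaleDownForbidden=true isDeleteInProgress=false scaleDownInCooldown=true'
echo "$line" | grep -o 'scaleDown[A-Za-z]*=[a-z]*'
# Prints:
# scaleDownForbidden=true
# scaleDownInCooldown=true
```

Watching these two flags together helps distinguish a permanent cooldown (frequent scale-ups) from a stuck scaleDownForbidden state (unschedulable pods).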
Version/Environment (if relevant):
No specific version