Version/Environment:
This applies to Domino versions < 5.3.0.
Issue:
Autoscaling doesn't appear to be working correctly: some compute nodes are unexpectedly terminated, and as a result, active compute cluster workspaces are shut down while users are working in them.
Root Cause:
This is a known issue where execution pods for on-demand clusters (such as Spark, Ray, Dask, and MPI) are evicted by the Kubernetes cluster autoscaler. The compute cluster pods are missing the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation that tells the autoscaler not to evict them, so they are treated as safe to remove when nodes are scaled down.
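To check whether pods in your deployment are affected, you can list the execution pods and inspect their annotations. Below is a minimal sketch using the Python Kubernetes client; the `domino-compute` namespace is an assumption and may differ in your deployment.

```python
# Sketch: list pods in the compute namespace and report which ones
# are missing the autoscaler eviction-protection annotation.
# Assumes the compute namespace is "domino-compute" and that your
# kubeconfig points at the Domino cluster; adjust as needed.
from kubernetes import client, config

ANNOTATION = "cluster-autoscaler.kubernetes.io/safe-to-evict"

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("domino-compute").items:
    annotations = pod.metadata.annotations or {}
    value = annotations.get(ANNOTATION)
    if value != "false":
        # Pods without safe-to-evict=false can be evicted on scale-down.
        print(f"{pod.metadata.name}: {ANNOTATION}={value!r} (evictable)")
```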
Resolution:
This bug is resolved in Domino version 5.3. The fix adds the annotation so that the autoscaler no longer evicts compute cluster-related execution pods, for example when scaling down nodes. For reference: DOM-38970
Notes/Information:
As a temporary workaround until your deployment is upgraded to Domino 5.3 or later, you can increase the Minimum Capacity value in the AWS console (Auto Scaling Groups > Instances > Minimum Capacity) so that the affected nodes are not scaled down.
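If you prefer to apply the workaround programmatically rather than through the console, a sketch using boto3 is below. The Auto Scaling Group name, region, and minimum size are placeholders; substitute the values for the group that backs your compute node pool.

```python
# Sketch: raise the minimum capacity of the node pool's Auto Scaling
# Group so the cluster autoscaler cannot scale those nodes away.
# "my-domino-compute-asg", the region, and MinSize=4 are hypothetical
# values; replace them with your own.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-domino-compute-asg",
    MinSize=4,  # keep at least this many nodes running
)
```

Remember to revert the minimum capacity to its original value after upgrading, since a raised minimum keeps nodes running (and billed) even when they are idle.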