Version/Environment (if relevant):
This applies to Domino versions < 5.3.0.
Autoscaling doesn't appear to be working correctly. Some compute nodes are unexpectedly terminated. As a result, active compute cluster workspaces get shut down while users are working on them.
This is a known issue: execution pods for on-demand compute clusters (Spark, Ray, Dask, and MPI) are evicted by the Kubernetes cluster autoscaler because those pods are missing the annotation that prevents the autoscaler from evicting them.
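For context, the Kubernetes cluster autoscaler skips pods that carry the `safe-to-evict` annotation set to `"false"`. A sketch of what the missing annotation looks like on a pod (the pod name here is hypothetical; the actual names are set by Domino):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-worker-example    # hypothetical name for illustration
  annotations:
    # Tells the Kubernetes cluster autoscaler not to evict this pod
    # when scaling a node group down.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```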
This bug is resolved in Domino version 5.3. The fix adds the annotation so the autoscaler no longer stops compute cluster-related execution pods, for example during scale-down. For reference: DOM-38970
As a temporary workaround until your deployment is upgraded to Domino 5.3 or later, you can increase the minimum capacity of the relevant node group in AWS: Auto Scaling Groups > Instances > Minimum Capacity.
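The same change can also be made with the AWS CLI using `update-auto-scaling-group`; this is a sketch in which the group name and size are placeholders for your deployment:

```shell
# Raise the minimum capacity so the autoscaler cannot scale the group
# below the nodes currently running compute cluster pods.
# "domino-compute-asg" and the size are placeholders; substitute the
# Auto Scaling Group name and capacity appropriate for your deployment.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name domino-compute-asg \
  --min-size 3
```

Remember to lower the minimum capacity again after upgrading, since keeping extra nodes running incurs cost.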