Version:
Not version specific.
Issue:
If you have switched a workspace to a GPU hardware tier and the workspace will not start (it is stuck in the Assigning state), check the execution logs for an error like the following:
pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector, 2 max node group size reached, 2 Insufficient nvidia.com/gpu
This message means that five nodes were in the wrong Availability Zone (AZ) or otherwise did not match the pod's node selector, two node groups have already reached their maximum size and cannot scale up, and two nodes do not have a free GPU available. So what does that mean, and what can you do about it?
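If you or your admin have direct access to the cluster, the same scheduling events can be pulled from Kubernetes. The following is a minimal sketch using the official Kubernetes Python client; the namespace name domino-compute is an assumption and may differ in your deployment.

# Sketch: list Pending pods and their scheduling events.
# Assumes the `kubernetes` Python client and kubeconfig access to the cluster.
# The namespace "domino-compute" is an assumption; adjust for your deployment.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()
namespace = "domino-compute"

pending = v1.list_namespaced_pod(namespace, field_selector="status.phase=Pending")
for pod in pending.items:
    print(f"Pending pod: {pod.metadata.name}")
    events = v1.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={pod.metadata.name}",
    )
    for ev in events.items:
        # Look for messages like "pod didn't trigger scale-up: ..."
        print(f"  [{ev.reason}] {ev.message}")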
Root Cause:
Workspaces in Domino create persistent volumes (PVs). A PV is not recreated over time; it is created once and remains available for the lifetime of your workspace, so you do not lose data when the workspace is stopped and restarted. Cloud providers use Availability Zones (AZs) to provide redundancy and efficiency. Your volume is created in one of the AZs and will never move to another.
An execution whose PV is in zone b, for example, can only be placed on host instances in that same AZ. When you switch the hardware tier of your execution, whether to GPU or CPU, the execution will not start if no host instance of that tier is available in the AZ where your PV was created.
So why isn't one always available? There are a few reasons. Hardware tiers are configured in all available zones to minimize this affinity issue, but there are maximums per AZ. If all of a hardware tier's instances in zone b are already in use, your existing execution with a PV in zone b will not start. GPU tiers are more sensitive to this issue because local admins usually set smaller maximums for them due to cost. Additionally, cloud providers do not always have large numbers of GPU hosts available in every zone, and sometimes have none. Very often a GPU host also allows only one execution to run on it at a time. These factors make hitting maximums and PV affinity problems more common with GPU tiers.
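For admins who want to confirm which AZ a given volume is pinned to, the zone is recorded in the PV's node affinity. Below is a minimal sketch, again assuming the Kubernetes Python client; the topology label keys it checks are common conventions and can vary depending on your storage (CSI) driver.

# Sketch: report the Availability Zone each persistent volume is pinned to.
# Assumes the `kubernetes` Python client; the zone keys below are common
# conventions and may differ depending on your CSI driver.
from kubernetes import client, config

ZONE_KEYS = {
    "topology.kubernetes.io/zone",             # standard topology label
    "topology.ebs.csi.aws.com/zone",           # AWS EBS CSI driver
    "failure-domain.beta.kubernetes.io/zone",  # legacy label
}

config.load_kube_config()
v1 = client.CoreV1Api()

for pv in v1.list_persistent_volume().items:
    zones = set()
    affinity = pv.spec.node_affinity
    if affinity and affinity.required:
        for term in affinity.required.node_selector_terms:
            for expr in term.match_expressions or []:
                if expr.key in ZONE_KEYS:
                    zones.update(expr.values or [])
    claim = pv.spec.claim_ref.name if pv.spec.claim_ref else "<unbound>"
    print(f"{pv.metadata.name}  claim={claim}  zone={', '.join(zones) or 'unknown'}")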
Resolution:
So what can you do if your workspace is stuck?
- Create a new workspace. This is typically the best and easiest route on a busy system. It creates a new persistent volume in whichever AZ has a host instance available. If you see "2 max node group size reached", it means the maximums have been reached in two zones. For some larger tiers that may be all the zones there are, but there are usually three or more, so you may need to wait and try again later when there is more availability. But first, try a new workspace.
- Switch to a different hardware tier. Typically the smaller the tier, the better the availability, so try something else if the sizing suits your workload.
- If you had unsaved data and need the existing workspace to start, switch the hardware tier to something with current availability and sync your work. The tier you use for this can be any size; the intent is just to start the workspace and sync everything. Once your work is saved to your Git repos and/or to the local Domino File System, you can start a new workspace. Again, a new workspace has the best chance of succeeding on a busy system.
- Talk to your local admin and let them know that you are struggling to start executions due to the unavailability of hardware tiers. They may be able to increase the maximums available for the deployment and mitigate this issue to a degree (one way an admin can check current GPU capacity per zone is sketched after this list).
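For admins, one quick way to gauge GPU capacity is to group the cluster's nodes by zone and sum their allocatable GPUs. This is a minimal sketch, assuming the Kubernetes Python client, the standard topology.kubernetes.io/zone label, and the nvidia.com/gpu resource name exposed by the NVIDIA device plugin; note that it shows allocatable GPUs per zone, not how many are currently free.

# Sketch: show GPU capacity per Availability Zone.
# Assumes the `kubernetes` Python client, the standard zone label, and the
# `nvidia.com/gpu` extended resource registered by the NVIDIA device plugin.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

gpus_per_zone = defaultdict(int)
nodes_per_zone = defaultdict(int)

for node in v1.list_node().items:
    zone = node.metadata.labels.get("topology.kubernetes.io/zone", "unknown")
    allocatable = node.status.allocatable or {}
    gpus = int(allocatable.get("nvidia.com/gpu", "0"))
    if gpus:
        nodes_per_zone[zone] += 1
        gpus_per_zone[zone] += gpus

for zone in sorted(gpus_per_zone):
    print(f"{zone}: {nodes_per_zone[zone]} GPU node(s), "
          f"{gpus_per_zone[zone]} allocatable GPU(s)")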
Notes/Information:
Additional information on autoscaler issues is available here: How to Troubleshoot Autoscaling (ASG) Issues