Problem:
When GPU-based runs or workspaces cause a new AWS node to be auto-scaled, the node can hit a specific failure during initialization that leaves the workspace/run UI stuck on 'Assigning' indefinitely.
Pod events show:
2022-05-03 12:44:27 : pod triggered scale-up: [{ddl-nvidia_worker_a 0->1 (max: 2)}]
2022-05-03 01:16:28 : pod triggered scale-up: [{ddl-nvidia_worker_a 0->1 (max: 2)}]
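If you are unsure where to find these events, one way to check is with kubectl (assuming cluster access; the namespace and pod name below are placeholders):
$ kubectl get pods -n <compute-namespace>
$ kubectl describe pod <run-pod-name> -n <compute-namespace>
The Events section of the describe output will include the scale-up trigger shown above, and the run/workspace pod typically remains Pending while the new node fails to join.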
Background/symptoms:
On April 27, 2022, NVIDIA rotated the signing key used for its CUDA and GPU driver repositories; Domino-managed Rancher/AWS deployments rely on these repositories.
https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772
Workloads that require additional nodes in GPU node pools cannot be scheduled, because the new nodes fail to provision: the initialization scripts defined in their launch templates or launch configurations fail.
This problem can manifest in various ways depending on how you provision and initialize your nodes.
In Domino-managed deployments, the bootstrap logs of newly provisioned nodes in /tmp/domino-bootstrap.log show the following error, indicating that the new signing key is not trusted:
GPG error: http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
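As a quick one-off check on an already-provisioned node (this is not the permanent fix, which belongs in your initialization scripts as described under Resolution below), you can fetch the new key manually and confirm that the error clears:
$ sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/3bf863cc.pub
$ sudo apt-get update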
Testing/validation:
To test, in addition to triggering scale-up through workspace/run activity, you can run the following command and then validate that the new node has joined the Kubernetes cluster, either through the Domino UI or with kubectl get nodes:
$ aws autoscaling update-auto-scaling-group --auto-scaling-group-name [deploy_id]-nvidia_worker_a --desired-capacity 1
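For example, assuming kubectl access, you can watch for the new node to register and become Ready, then return the group to its previous size (assuming the desired capacity was originally 0):
$ kubectl get nodes --watch
$ aws autoscaling update-auto-scaling-group --auto-scaling-group-name [deploy_id]-nvidia_worker_a --desired-capacity 0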
Resolution:
If you witness this problem and your deployment is Domino-managed, please contact support@dominodatalab.com. If not, read on.
One option is to upgrade your nodes' OS to a version (in Ubuntu's case, Ubuntu 20.04) that moves from the NVIDIA CUDA repositories to the upstream Ubuntu repositories for managing the NVIDIA driver.
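On Ubuntu 20.04, for instance, the driver can be installed from the distribution's own repositories; the exact commands below are illustrative and the right driver package depends on your GPU and image:
$ sudo apt-get update
$ sudo apt-get install -y ubuntu-drivers-common
$ sudo ubuntu-drivers autoinstall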
Alternatively you can modify your existing GPU autoscaling groups' initialization scripts or directives to use the new NVIDIA signing key.
For example, Domino-managed deployments often use launch templates or launch configurations with user-data (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html), but your scripts may instead be baked into your AMIs or live elsewhere.
To correct the user-data in Domino-managed deployments, we needed to replace:
apt-key adv --fetch-keys "$nvidia_repo/7fa2af80.pub"
with:
apt-key adv --fetch-keys "$nvidia_repo/3bf863cc.pub"
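For context, a minimal sketch of how that line typically sits in a user-data fragment (the variable value, sources.list entry, and install step here are illustrative, not the exact Domino bootstrap script):
nvidia_repo="http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64"
apt-key adv --fetch-keys "$nvidia_repo/3bf863cc.pub"    # new key (A4B469963BF863CC)
echo "deb $nvidia_repo /" > /etc/apt/sources.list.d/cuda.list
apt-get update && apt-get install -y cuda-drivers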
Note: If you are using launch templates or launch configurations similar to ours, we've attached a script (asg_update.sh) that updates the user-data, swapping the specific .pub key referenced by the apt-key command. It is attached only to help inform your own approach to the update.
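We do not reproduce asg_update.sh here, but for the launch-template case the same idea might look roughly like the sketch below (the argument handling and the assumption that user-data is plain base64-encoded shell are illustrative; launch configurations are immutable, so for those you would instead create a replacement configuration and point the ASG at it):
#!/usr/bin/env bash
# Illustrative sketch only: patch a launch template's user-data to use the new NVIDIA key.
set -euo pipefail
lt_name="$1"    # e.g. [deploy_id]-nvidia_worker_a

# Fetch the latest version's user-data (base64-encoded).
userdata=$(aws ec2 describe-launch-template-versions \
  --launch-template-name "$lt_name" --versions '$Latest' \
  --query 'LaunchTemplateVersions[0].LaunchTemplateData.UserData' --output text)

# Decode, swap the old key for the new one, and re-encode.
new_userdata=$(echo "$userdata" | base64 --decode | sed 's/7fa2af80\.pub/3bf863cc.pub/g' | base64 -w0)

# Create a new template version carrying the patched user-data.
aws ec2 create-launch-template-version \
  --launch-template-name "$lt_name" \
  --source-version '$Latest' \
  --launch-template-data "{\"UserData\":\"$new_userdata\"}"
# If your ASG pins a specific template version (or Default), update it to the new version afterwards.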