Version: All Domino versions
Issue: When starting a GPU workspace, the following (or a similar) error can be observed in the events.json log file of the support bundle.
"description" : "Error: failed to start container \"run\":
Error response from daemon: OCI runtime create failed: container_linux.go:346:
starting container process caused \"process_linux.go:449: container init caused \\\
"process_linux.go:432: running prestart hook 0 caused \\\\\\\
"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli:
requirement error: unsatisfied condition: cuda>=11.0,
please update your driver to a newer version, or use an earlier cuda container\\\\\\\\n
\\\\\\\"\\\"\": unknown",
Root cause: This error indicates that the Kubernetes node your container is trying to start on has a different CUDA/driver version than the one installed in your container image. The CUDA version in the container must be supported by the driver on the node; in this example the container requires CUDA >= 11.0, which the node's driver cannot satisfy.
Troubleshooting: To check the exact versions running on a node, log in to it and run "nvidia-smi". This prints the available GPU hardware along with the driver and CUDA versions.
Example:
[root@host ~]# nvidia-smi
Fri Sep 10 10:07:00 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:17.0 Off |                    0 |
| N/A   36C    P0    42W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:18.0 Off |                    0 |
| N/A   36C    P0    44W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
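If you only need the version numbers (for example, to compare several nodes), nvidia-smi can also be queried in a more script-friendly form. The command below is a sketch that assumes nvidia-smi is on the node's PATH; the query fields shown are standard nvidia-smi options, and the output shown is illustrative, matching the node above (note that the CUDA version still comes from the default nvidia-smi banner):
[root@host ~]# nvidia-smi --query-gpu=name,driver_version --format=csv
name, driver_version
Tesla V100-SXM2-16GB, 410.104
Tesla V100-SXM2-16GB, 410.104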
Once you have the versions running on your node, you can inspect the versions present in your Docker image. Since the workspace fails when this environment is started on a GPU hardware tier, you can instead start a workspace on a non-GPU hardware tier using the same environment and check the exact versions installed in the image. Because the environment is running on a non-GPU hardware tier, the NVIDIA driver itself will not be loaded, so you check the versions by listing the installed packages.
Example:
ubuntu@run-<ID>:/mnt$ dpkg -l | egrep -i "nvidia|driver|cuda"
ii cuda-compat-11-0 450.51.06-1 amd64 CUDA Compatibility Platform
ii cuda-cudart-11-0 11.0.221-1 amd64 CUDA Runtime native Libraries
ii cuda-libraries-11-0 11.0.3-1 amd64 CUDA Libraries 11.0 meta-package
ii cuda-nvrtc-11-0 11.0.221-1 amd64 NVRTC native runtime libraries
ii cuda-nvtx-11-0 11.0.167-1 amd64 NVIDIA Tools Extension
ii cups 2.2.7-1ubuntu2.8 amd64 Common UNIX Printing System(tm) - PPD/driver support, web interface
ii cups-core-drivers 2.2.7-1ubuntu2.8 amd64 Common UNIX Printing System(tm) - driverless printing
ii cups-filters-core-drivers 1.20.2-0ubuntu3.1 amd64 OpenPrinting CUPS Filters - Driverless printing
ii libasound2-data 1.1.3-5ubuntu0.5 all Configuration files and profiles for ALSA drivers
hi libcudnn8 8.0.2.39-1+cuda11.0 amd64 cuDNN runtime libraries
ii libcusolver-11-0 10.6.0.245-1 amd64 CUDA solver native runtime libraries
hi libnccl2 2.7.8-1+cuda11.0 amd64 NVIDIA Collectives Communication Library (NCCL) Runtim
As we can see in this example, the image contains CUDA 11.0 packages (and the 450-series cuda-compat package), whereas the node in the nvidia-smi output above only has driver 410.104 with CUDA 10.0. The container therefore requires a newer driver than the node provides, which is what causes the error.
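Depending on how CUDA was installed in the image, the version can often also be read directly inside the non-GPU workspace. The commands below are a sketch: nvcc is only present if the CUDA toolkit compiler is installed in the image, and the plain-text version file only exists in older CUDA images (newer images ship a version.json instead):
ubuntu@run-<ID>:/mnt$ nvcc --version
ubuntu@run-<ID>:/mnt$ cat /usr/local/cuda/version.txt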
Resolution: Make sure the image you are running matches the driver and CUDA versions available on your GPU nodes. You can use one of the Domino-provided CUDA images here:
https://quay.io/repository/domino/base?tab=tags&tag=cuda
If you need a different version, you can edit the environment and install the exact versions required.
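For example, to match the node shown above (driver 410.104, which supports CUDA 10.0), an environment that uses a custom base image could be rebuilt on a CUDA 10.0 image. The Dockerfile snippet below is only a sketch; the base image tag is an assumption and should be replaced with whichever Domino-provided or NVIDIA image matches the versions on your GPU nodes:
# Hypothetical base image for a custom environment; the tag is an example only.
# Pick the CUDA release that matches the driver reported by nvidia-smi on your nodes.
FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04
# ...install the rest of the environment (Python, R, libraries, etc.) here...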