Problem
If you run the following TensorFlow test (or similar) in a Jupyter notebook to check whether TensorFlow sees any GPUs...
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))
try:
    device_name = tf.test.gpu_device_name()
    if device_name != '/device:GPU:0':
        raise SystemError('GPU device not found')
    print('Found GPU at: {}'.format(device_name))
except SystemError:
    print('GPU device not found')
print('TensorFlow version: ', tf.__version__)
...and the GPU devices are not being found, you may be missing some libraries from your compute environment.
[]
GPU device not found
TensorFlow version:  2.4.0
Troubleshooting
Try moving to a terminal session, as the Jupyter notebook can mask some errors from you. As a first step, execute 'nvidia-smi' in the terminal.
ubuntu@run-623a4ca33902662cbe298791-2hqn6:/mnt$ nvidia-smi
Tue Mar 22 22:25:25 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P0    22W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
This confirms that the GPUs are at least visible to the driver, and shows the CUDA version in use, in this case 11.0.
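If you only need the device name and driver version rather than the full table, nvidia-smi's query mode can print just those fields. This is a sketch with a fallback for machines where the driver is not installed at all (the fallback message text is our own):

```shell
# Print GPU name and driver version in CSV form; fall back with a
# message if the NVIDIA driver (and hence nvidia-smi) is absent.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
  echo "nvidia-smi not found: no NVIDIA driver on this machine"
fi
```

Note the CUDA version itself only appears in the default nvidia-smi banner, not in the --query-gpu fields.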
If you do not see any GPUs here or see other failures such as
Failed to initialize NVML: Driver/library version mismatch
you might not be using an appropriate environment for Domino 4.x. Check Domino documentation for links to appropriate base images that support GPU in Domino 4.x.
If you are already using an appropriate base image, start a Python session in your terminal and import tensorflow...
>>> import tensorflow
2022-03-22 22:52:57.961895: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/oracle/instantclient_12_1:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server:
2022-03-22 22:52:57.961963: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
If you see this error, or a missing cuDNN library, you may just need to install CUDA and/or cuDNN in your environment. In Domino 4.x the nvidia-docker runtime passes GPU drivers and CUDA versions through directly from the executor instance, so most packages do not need CUDA installed in the environment itself. TensorFlow, however, expects these libraries to be present locally in the environment.
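You can check whether the dynamic loader can actually resolve the libraries TensorFlow tries to dlopen at import time. A minimal sketch (the specific library names libcudart.so.11.0 and libcudnn.so.8 match the CUDA 11 / cuDNN 8 versions discussed here):

```shell
# Ask the loader cache whether the CUDA runtime and cuDNN libraries
# that TensorFlow dlopens are resolvable on this machine.
for lib in libcudart.so.11.0 libcudnn.so.8; do
  if ldconfig -p 2>/dev/null | grep -q "$lib"; then
    echo "$lib: found"
  else
    echo "$lib: NOT found"
  fi
done
```

If either line reports NOT found, install the corresponding package as shown below.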
To install CUDA 11 use the following syntax...
# wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run --no-check-certificate
# sudo sh cuda_11.0.2_450.51.05_linux.run --silent --toolkit
To install cuDNN use...
sudo apt update && sudo apt install -y libcudnn8
To adapt these to be added directly to the dockerfile definition of your environment use the following...
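A sketch of what those Dockerfile instructions might look like, built from the same installer URL and package used above (the /tmp path and USER root line are our assumptions about a typical Domino environment build, which runs as root):

```dockerfile
USER root

# Install the CUDA 11.0 toolkit at image build time
# (same runfile installer as the terminal commands above)
RUN wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run \
        --no-check-certificate -O /tmp/cuda_11.0.2.run && \
    sh /tmp/cuda_11.0.2.run --silent --toolkit && \
    rm /tmp/cuda_11.0.2.run

# Install cuDNN 8 from the apt repository
RUN apt-get update && apt-get install -y libcudnn8
```

Rebuild the environment and re-run the TensorFlow test above to confirm the GPU is now found.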