We often see confusing errors when users are first getting set up to use GPU machines. Here are a few examples:
Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Failed to initialize NVML: Driver/library version mismatch
14 node(s) didn't match node selector, 20 Insufficient nvidia.com/gpu.
First, check that the necessary devices and drivers are installed, then check that their versions are compatible. The commands below collect the information you'll need:
Validating the Device
ls /dev | grep nvidia - Look for nvidia0
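lspci | grep -i nvidia - Confirm that the GPU device is visible on the Host. (The -i flag matters because lspci usually prints the vendor name in capitals, as "NVIDIA".)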
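The two device checks above can be combined into one small script. This is a minimal sketch, not an official tool: the function names `check_device_nodes` and `check_pci` are made up for illustration, and it assumes bash is available.

```shell
#!/usr/bin/env bash
# Sketch: report whether NVIDIA device nodes and PCI devices are visible.
# Function names are illustrative, not part of any NVIDIA or Domino tooling.

check_device_nodes() {
    # Look for numbered device nodes such as /dev/nvidia0.
    # The directory is a parameter so the check can be tried on a test folder.
    local dev_dir="${1:-/dev}"
    local node
    for node in "$dev_dir"/nvidia[0-9]*; do
        if [ -e "$node" ]; then
            echo "device nodes: found"
            return 0
        fi
    done
    echo "device nodes: missing"
}

check_pci() {
    # lspci may be absent in minimal containers, so guard the call.
    if ! command -v lspci >/dev/null 2>&1; then
        echo "pci: lspci not installed"
    elif lspci | grep -qi nvidia; then
        echo "pci: NVIDIA device found"
    else
        echo "pci: no NVIDIA device"
    fi
}

check_device_nodes
check_pci
```

If the PCI check finds a device but the device nodes are missing, the hardware is likely present while the driver is not loaded, which points to the driver checks in the next section.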
Validating the Driver
cat /proc/driver/nvidia/version (Note: This always reports the driver on the Host, even when you run it within a container)
apt list --installed | grep nvidia (Note: This will show the driver installed in the Container)
pip freeze | grep tensorflow (Make sure that tensorflow-gpu is installed and not the CPU version)
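docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi (Runs nvidia-smi inside a CUDA container to confirm that containers can reach the GPU through the NVIDIA runtime)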
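To compare the driver checks above side by side, the following sketch prints the kernel-module driver version next to the version that nvidia-smi reports. It assumes the usual layout of /proc/driver/nvidia/version, where the "NVRM version" line contains the driver version as the first dotted number. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the host kernel-module driver version and the version
# reported by nvidia-smi. Function names are illustrative only.

kernel_driver_version() {
    # Print the first dotted version number on the "NVRM version" line,
    # or "unknown" if the file is unreadable or has no such line.
    local version_file="${1:-/proc/driver/nvidia/version}"
    local v=""
    if [ -r "$version_file" ]; then
        v=$(grep -m1 'NVRM version' "$version_file" \
            | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' \
            | head -n1) || v=""
    fi
    echo "${v:-unknown}"
}

smi_driver_version() {
    # Print nothing when nvidia-smi is missing or fails
    # (for example on a driver/library version mismatch).
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1
    fi
}

kernel=$(kernel_driver_version)
smi=$(smi_driver_version) || smi=""
echo "kernel module driver: $kernel"
echo "nvidia-smi driver:    ${smi:-unavailable}"
```

A kernel module version with an "unavailable" nvidia-smi line is consistent with the "Driver/library version mismatch" error, although other causes (such as nvidia-smi not being installed in the container) produce the same output.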
Use the following matrices to verify whether the versions present on your machine are compatible with each other. If not, install compatible versions in the environment you are using.
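Once you have looked up a minimum driver version in the matrices, the comparison itself can be scripted. The sketch below assumes GNU `sort -V` is available. The helper name `version_ge` and both version numbers are placeholders for illustration; substitute the installed version from your machine and the required version from the matrix.

```shell
#!/usr/bin/env bash
# Sketch: check that an installed driver version meets a required minimum.
# The default version numbers are placeholders, not values from the matrices.

version_ge() {
    # Succeeds when version $1 is greater than or equal to version $2.
    # sort -V orders version strings component by component.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

installed="${1:-470.57.02}"   # placeholder: your installed driver version
required="${2:-450.80.02}"    # placeholder: minimum taken from the matrix

if version_ge "$installed" "$required"; then
    echo "driver $installed meets minimum $required"
else
    echo "driver $installed is older than required $required"
fi
```

Save the script under any name and pass the two versions as arguments, installed version first.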
NVIDIA Compatibility Matrix:
TensorFlow Compatibility Matrix:
Here's a link to our admin docs that discuss GPU installation: https://admin.dominodatalab.com/en/latest/environments/environment-caching-eks.html#install-nvidia-docker-2-0-gpu-amis-only