mxnet: CUDA Check failed: e == cudaSuccess (803 vs. 0) : system has unsupported display driver / cuda driver combination


The MXNet and NVIDIA forums and GitHub have not given me useful answers, and they do not have a community as good as Stack Overflow's, so I am asking this question here.

Host:

Linux XYZ 5.15.0-76-generic #83~20.04.1-Ubuntu SMP Wed Jun 21 20:23:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Host Nvidia Driver Version:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 26%   35C    P8     6W /  75W |    265MiB /  4096MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1063      G   /usr/lib/xorg/Xorg                 92MiB |
|    0   N/A  N/A      1338      G   /usr/bin/gnome-shell               26MiB |
|    0   N/A  N/A      2078      G   /usr/lib/firefox/firefox          144MiB |
+-----------------------------------------------------------------------------+

The CUDA toolkit version on the host is 11.6:

USER@XYZ:~$ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0

Problem:

I installed nvidia-container-toolkit to run GPU-enabled containers and pulled the mxnet/python:1.9.1_gpu_cu112_py3 Docker image. To check whether MXNet can use my GPU, I started a container:

docker run -it --runtime=nvidia --gpus all mxnet/python:1.9.1_gpu_cu112_py3 /bin/bash

Then I checked the NVIDIA driver version inside the container to make sure it runs with GPU access:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 26%   34C    P8     6W /  75W |    278MiB /  4096MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

This shows CUDA version 11.2. Inside the container, I then checked MXNet by running the following in Python 3.7:

import mxnet as mx
mx.context.num_gpus()

But I got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/mxnet/context.py", line 275, in num_gpus
    check_call(_LIB.MXGetGPUCount(ctypes.byref(count)))
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../include/mxnet/base.h", line 458
CUDA: Check failed: e == cudaSuccess (803 vs. 0) : system has unsupported display driver / cuda driver combination

So MXNet could not identify my GPU.

It seems that there is a mismatch between the CUDA versions on the host and in the container.
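The difference can be made explicit by extracting the version fields from the two nvidia-smi banner lines quoted above. This is only a minimal sketch: the helper name `parse_smi_banner` is hypothetical, and the regular expression assumes the banner layout shown in this post.

```python
import re

def parse_smi_banner(text):
    """Return (driver_version, cuda_version) found in an nvidia-smi banner line."""
    m = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text)
    if m is None:
        raise ValueError("no nvidia-smi banner found")
    return m.group(1), m.group(2)

# Banner lines copied from the host and container outputs in this post.
host = "| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |"
container = "| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.2     |"

host_driver, host_cuda = parse_smi_banner(host)
ctr_driver, ctr_cuda = parse_smi_banner(container)

print(host_driver == ctr_driver)  # True: both report driver 510.108.03
print(host_cuda == ctr_cuda)      # False: 11.6 on the host, 11.2 in the container
```

The driver version is identical in both outputs; only the reported CUDA version differs.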

By searching, I found a workaround for the error "system has unsupported display driver / cuda driver combination": I removed every libcuda.so file and its symlinks from the directory /usr/local/cuda-11.2/compat/ in the container, and MXNet then worked:

Python 3.7.13 (default, Apr 24 2022, 01:04:09) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> mx.context.num_gpus()
1
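The removal step described above can be sketched as follows. The helper name `remove_compat_libcuda` and its `dry_run` parameter are hypothetical, the directory is the one named in this post, and the sketch has not been run against the image itself. By default it only lists what it would delete.

```python
from pathlib import Path

def remove_compat_libcuda(compat_dir, dry_run=True):
    """List (and, if dry_run is False, delete) libcuda.so* entries in compat_dir."""
    targets = sorted(Path(compat_dir).glob("libcuda.so*"))
    if not dry_run:
        for path in targets:
            path.unlink()  # removes regular files and symlinks alike
    return [p.name for p in targets]

if __name__ == "__main__":
    # Directory taken from the question; the dry run only reports candidates.
    print(remove_compat_libcuda("/usr/local/cuda-11.2/compat/"))
```

Calling it with `dry_run=False` inside the container performs the deletion; the change is lost when the container is recreated from the image.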

My question:

After removing libcuda.so in the container, the CUDA version reported in the container is 11.6:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 26%   35C    P8     6W /  75W |    282MiB /  4096MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

It seems that nvidia-container-toolkit binds libcuda.so from the host into the container. That makes my container depend on the host, which I do not want.

  1. How can I remove this dependency?

  2. Is there any way to edit the nvidia-container-toolkit configuration file so that it does not bind libcuda.so from the host?

  3. Is this problem specific to the MXNet framework?

Thanks in advance.
