CUDA driver initialization failed, you might not have a CUDA gpu


I don't have sudo access, and contacting the sysadmin takes a non-trivial amount of time.

Here is the output of nvcc -V:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

Output of nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:1C:00.0 Off |                  Off |
| 30%   32C    P8             19W /  300W |      23MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00000000:1E:00.0 Off |                  Off |
| 30%   33C    P8             20W /  300W |      11MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A6000               Off |   00000000:3D:00.0 Off |                  Off |
| 30%   32C    P8             27W /  300W |      11MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A6000               Off |   00000000:3E:00.0 Off |                  Off |
| 30%   34C    P8             25W /  300W |      11MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A6000               Off |   00000000:3F:00.0 Off |                 Off* |
|ERR!   49C    P5            ERR! /  300W |      11MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX A6000               Off |   00000000:40:00.0 Off |                  Off |
| 30%   31C    P8              6W /  300W |      11MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX A6000               Off |   00000000:41:00.0 Off |                  Off |
| 30%   31C    P8             16W /  300W |      11MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX A6000               Off |   00000000:5E:00.0 Off |                  Off |
| 30%   29C    P8              6W /  300W |      11MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4216      G   /usr/libexec/Xorg                               9MiB |
|    0   N/A  N/A      4466      G   /usr/bin/gnome-shell                            4MiB |
|    1   N/A  N/A      4216      G   /usr/libexec/Xorg                               4MiB |
|    2   N/A  N/A      4216      G   /usr/libexec/Xorg                               4MiB |
|    3   N/A  N/A      4216      G   /usr/libexec/Xorg                               4MiB |
|    4   N/A  N/A      4216      G   /usr/libexec/Xorg                               4MiB |
|    5   N/A  N/A      4216      G   /usr/libexec/Xorg                               4MiB |
|    6   N/A  N/A      4216      G   /usr/libexec/Xorg                               4MiB |
|    7   N/A  N/A      4216      G   /usr/libexec/Xorg                               4MiB |
+-----------------------------------------------------------------------------------------+
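In the table above, GPU 4 is the only device reporting ERR! for fan speed and power draw. Whether that device is related to the initialization failure is an assumption; the output does not confirm it. A minimal sketch for pulling the flagged indices out of saved nvidia-smi text follows. The helper name gpus_reporting_err is hypothetical, and the parsing assumes the table layout shown here.

```python
import re

# Matches the first row of a GPU entry, e.g. "|   4  NVIDIA RTX A6000 ...".
GPU_ROW = re.compile(r"^\|\s+(\d+)\s+\S")


def gpus_reporting_err(smi_text):
    """Return sorted GPU indices whose table rows contain 'ERR!'."""
    flagged = set()
    current = None
    for line in smi_text.splitlines():
        if "Processes:" in line:
            break  # the process table reuses the "| <index> ..." shape
        match = GPU_ROW.match(line)
        if match:
            current = int(match.group(1))
        if "ERR!" in line and current is not None:
            flagged.add(current)
    return sorted(flagged)


# Trimmed sample in the same layout as the nvidia-smi output above.
SAMPLE = """\
|   3  NVIDIA RTX A6000               Off |   00000000:3E:00.0 Off |                  Off |
| 30%   34C    P8             25W /  300W |      11MiB /  49140MiB |      0%      Default |
|   4  NVIDIA RTX A6000               Off |   00000000:3F:00.0 Off |                 Off* |
|ERR!   49C    P5            ERR! /  300W |      11MiB /  49140MiB |      0%      Default |
| Processes:                                                                              |
|    4   N/A  N/A      4216      G   /usr/libexec/Xorg                               4MiB |
"""

print(gpus_reporting_err(SAMPLE))
```

The loop stops at the "Processes:" line because process rows also start with a GPU index and would otherwise be mistaken for device rows.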

When I try to run:

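import torch

# Check whether PyTorch can initialize CUDA and see a GPU.
cuda_available = torch.cuda.is_available()
print("CUDA Available:", cuda_available)
if cuda_available:
    # CUDA runtime PyTorch was built against, and the cuDNN build number.
    print("CUDA version:", torch.version.cuda)
    print("cuDNN version:", torch.backends.cudnn.version())
else:
    print("CUDA not available")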

I get the following warning, and CUDA is reported as unavailable:

/home/user_name/anaconda3/envs/llm2/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
CUDA Available: False
CUDA not available
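Because CUDA reads CUDA_VISIBLE_DEVICES when it initializes, each candidate device set has to be tried in a fresh process. The following is a rough diagnostic sketch, not a confirmed fix: it reruns the availability check in a subprocess with a chosen device list. The names probe and PROBE_CODE are hypothetical, and excluding GPU 4 is only a guess based on the ERR! entries in the nvidia-smi table.

```python
import os
import subprocess
import sys

# Code run in the child process; it needs PyTorch installed in this environment.
PROBE_CODE = "import torch; print(torch.cuda.is_available())"


def probe(visible_devices, code=PROBE_CODE):
    """Run `code` in a fresh interpreter with CUDA_VISIBLE_DEVICES set."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible_devices)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip(), result.stderr.strip()


if __name__ == "__main__":
    # Try GPU 0 alone, then every GPU except index 4.
    for devices in ["0", "0,1,2,3,5,6,7"]:
        rc, out, err = probe(devices)
        print(f"CUDA_VISIBLE_DEVICES={devices}: exit={rc} stdout={out!r}")
```

If the check still reports False for every subset, the failure is probably not tied to a single device, and the driver state itself would need the sysadmin's attention.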

Is it possible to fix this error without sudo access? The two possible solutions I can see are:

  1. Update the drivers
  2. Build PyTorch for CUDA 12.4 from source

IIRC, both of these require sudo access.
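On the second option, it may help to know that prebuilt PyTorch packages typically ship with, or pull in, their own CUDA runtime inside the environment. The system nvcc release is therefore not what PyTorch runs against, and installing a different build into a user-owned conda environment does not need sudo. Updating the driver itself does need administrator rights. A small sketch for checking which runtime the installed build targets, to compare against the "CUDA Version: 12.4" ceiling reported by nvidia-smi (describe_torch is a hypothetical helper name):

```python
import importlib


def describe_torch():
    """Report the installed PyTorch build and the CUDA runtime it targets."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"torch": None, "cuda_runtime": None}
    return {
        "torch": torch.__version__,
        # None here means a build without CUDA support (for example CPU-only).
        "cuda_runtime": torch.version.cuda,
    }


print(describe_torch())
```

Reading these attributes does not initialize CUDA, so the check works even while torch.cuda.is_available() returns False.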
