Slurm Python script with mpi4py fails on GPU nodes

51 views Asked by user23395953 At 13 February 2024 at 08:57

I am getting errors in Python scripts that use MPI through mpi4py. From my searches so far, the error messages are unfortunately fairly generic and uninformative.

One script fails with:

g05:rank3.python: Failed to modify UD QP to INIT on mlx5_3: Operation not permitted

Another outputs:

An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun. This could be caused by a number of factors, including an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).

The Slurm script loads the following modules:

ml CuPy/12.3.0-foss-2023a-CUDA-12.1.1 PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 matplotlib/3.7.2-gfbf-2023a mpi4py/3.1.4-gompi-2023a

This is followed by an mpiexec python command. The Python scripts perform collective communication. There is one variant with blocking calls and one with nonblocking calls, and both produce the same errors.
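For illustration only, here is a minimal sketch of what such a script might look like. It is not the actual script: the function names, the `--gpu` flag, and the array size are hypothetical, and the GPU path assumes an MPI build that can read CUDA device buffers directly.

```python
# Minimal reproducer sketch: one blocking and one nonblocking collective.
# All names here (pick_backend, --gpu, n=4) are illustrative assumptions.
import sys

import numpy as np

try:
    from mpi4py import MPI
except ImportError:  # mpi4py module not loaded in this environment
    MPI = None

try:
    import cupy as cp
except ImportError:  # CPU-only environment
    cp = None


def pick_backend(use_gpu):
    """Return cupy when requested and importable, otherwise numpy."""
    if use_gpu and cp is not None:
        return cp
    return np


def make_buffers(xp, n):
    """Allocate send/receive buffers with the chosen array library."""
    send = xp.ones(n, dtype=xp.float64)
    recv = xp.empty_like(send)
    if xp is not np:
        # Assumption: MPI reads device memory directly, so the kernel that
        # fills `send` must have finished before the collective starts.
        cp.cuda.get_current_stream().synchronize()
    return send, recv


def allreduce_blocking(comm, xp, n=4):
    """Blocking collective: sum an array of ones over all ranks."""
    send, recv = make_buffers(xp, n)
    comm.Allreduce(send, recv, op=MPI.SUM)
    return recv


def allreduce_nonblocking(comm, xp, n=4):
    """Nonblocking variant: start the collective, then wait on the request."""
    send, recv = make_buffers(xp, n)
    req = comm.Iallreduce(send, recv, op=MPI.SUM)
    req.Wait()
    return recv


def main():
    if MPI is None:
        print("mpi4py is not importable; load the mpi4py module first")
        return
    comm = MPI.COMM_WORLD
    xp = pick_backend("--gpu" in sys.argv)
    for fn in (allreduce_blocking, allreduce_nonblocking):
        result = fn(comm, xp)
        if comm.Get_rank() == 0:
            print(fn.__name__, xp.__name__, float(result[0]))


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as repro.py, this would be launched with mpiexec python repro.py for the NumPy path and mpiexec python repro.py --gpu for the CuPy path, which keeps the two cases identical apart from the array library.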

However, there are no issues when I run the same scripts without GPUs, replacing CuPy with NumPy.

The sysadmin and I tried different combinations of specific GPU nodes as well as different library combinations, but we were ultimately unable to resolve the issue. All suggestions are welcome.

Thank you


There are 0 answers
