I am getting errors in Python scripts that use MPI (via mpi4py) for parallel computing. From my searches, the error messages are unfortunately fairly generic and uninformative.
One script fails with:
g05:rank3.python: Failed to modify UD QP to INIT on mlx5_3: Operation not permitted
Another outputs:
An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun. This could be caused by a number of factors, including an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
The Slurm script loads the following modules:
ml CuPy/12.3.0-foss-2023a-CUDA-12.1.1 PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 matplotlib/3.7.2-gfbf-2023a mpi4py/3.1.4-gompi-2023a
This is followed by an mpiexec python command. The Python scripts perform collective communication; there is one variant using blocking calls and one using nonblocking calls, and both produce the same errors.
However, there are no issues when I run the same scripts without GPUs, replacing CuPy with NumPy.
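For reference, here is a minimal sketch of the pattern described above, not the actual scripts. It assumes the collective is an Allreduce and that the only difference between the GPU and CPU runs is the array module; the function names, the array size, and the choice of Allreduce are all illustrative assumptions. Passing CuPy arrays directly to mpi4py also assumes a CUDA-aware MPI build.

```python
def _sync(xp):
    # CuPy launches kernels asynchronously, so make sure the send buffer is
    # ready before MPI reads it. NumPy has no `cuda` attribute; skip it there.
    if hasattr(xp, "cuda"):
        xp.cuda.get_current_stream().synchronize()


def allreduce_blocking(comm, xp, n=4):
    """Blocking variant: sum an array across all ranks."""
    send = xp.arange(n, dtype=xp.float64)
    recv = xp.empty_like(send)
    _sync(xp)
    comm.Allreduce(send, recv)  # default reduction op is SUM
    return recv


def allreduce_nonblocking(comm, xp, n=4):
    """Nonblocking variant: start the reduction, then wait for it."""
    send = xp.arange(n, dtype=xp.float64)
    recv = xp.empty_like(send)
    _sync(xp)
    req = comm.Iallreduce(send, recv)
    req.Wait()
    return recv


def main():
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py not available; nothing to run")
        return
    try:
        import cupy as xp  # GPU run (the failing case)
    except ImportError:
        try:
            import numpy as xp  # CPU run (the working case)
        except ImportError:
            print("neither CuPy nor NumPy available; nothing to run")
            return

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    # With identical inputs on every rank, element i of each result
    # should equal i times the number of ranks.
    print(rank, xp.__name__, "blocking:", allreduce_blocking(comm, xp))
    print(rank, xp.__name__, "nonblocking:", allreduce_nonblocking(comm, xp))


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as sketch.py, it would be launched the same way as the real scripts, for example with mpiexec python sketch.py inside the Slurm job.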
The sysadmin and I tried different combinations of specific GPU nodes as well as library combinations, but were ultimately unable to resolve the issue. All suggestions are welcome.
Thank you