MPI Ubuntu -- mpirun hangs


I am trying to build a cluster from two Ubuntu servers. I installed MPI by running:

sudo apt install libopenmpi-dev

I can ssh between both servers without a password and have created a shared NFS directory between the two servers. The issue appears when I try to run a simple command to check whether MPI is working. For example, on my master node, if I run:

   
   mpirun -np 2 hostname
   

or just

   mpirun

the command hangs indefinitely without any error message.

I read that the hang might be caused by my firewall, so I disabled it:


   sudo ufw status
   Status: inactive

but the problem remains.

I followed a solution proposed elsewhere and ran:

   strace -f -- mpirun -np 1 localhost

The program hangs at:

   _flags=0}, 0) = 936
   recvmsg(9, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1668671375, pid=1310460}, 0}, iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
   close(9) = 0
   socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 9
   connect(9, {sa_family=AF_INET, sin_port=htons(6006), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
   getsockname(9, {sa_family=AF_INET, sin_port=htons(52429), sin_addr=inet_addr("127.0.0.1")}, [28->16]) = 0
   close(9) = 0
   socket(AF_INET6, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 9
   connect(9, {sa_family=AF_INET6, sin6_port=htons(6006), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
   getsockname(9, {sa_family=AF_INET6, sin6_port=htons(42893), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [28]) = 0
   close(9) = 0
   socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 9
   setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   setsockopt(9, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
   connect(9, {sa_family=AF_INET6, sin6_port=htons(6006), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
   getpeername(9, {sa_family=AF_INET6, sin6_port=htons(6006), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [124->28]) = 0
   uname({sysname="Linux", nodename="dcilda1872", ...}) = 0
   access("/home/e177338/.Xauthority", R_OK) = 0
   openat(AT_FDCWD, "/home/e177338/.Xauthority", O_RDONLY) = 10
   fstat(10, {st_mode=S_IFREG|0600, st_size=1120, ...}) = 0
   read(10, "\1\0\0\ndcilda1872\0\00213\0\22MIT-MAGIC-CO"..., 4096) = 1120
   read(10, "", 4096) = 0
   close(10) = 0
   fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
   fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   fcntl(9, F_SETFD, FD_CLOEXEC) = 0
   poll([{fd=9, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=9, revents=POLLOUT}])
   writev(9, [{iov_base="l\0\v\0\0\0\0\0\0\0\0\0", iov_len=12}, {iov_base="", iov_len=0}], 2) = 12
   recvfrom(9, 0x558bd3c91e30, 8, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
   poll([{fd=9, events=POLLIN}], 1, -1
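
One reading of this trace, offered as a guess rather than a diagnosis: the last calls connect to port 6006 on localhost and read .Xauthority, which looks like an X11 client handshake (X11 over TCP conventionally uses port 6000 plus the display number), and the final poll is waiting for a reply that never arrives. The sketch below only maps the port to a display number and prints the current DISPLAY for comparison. The helper name x11_display_for_port is made up for illustration, and whether DISPLAY is really the cause here is an assumption.

```shell
# Hypothetical helper: map a TCP port seen in strace to an X11 display.
# Assumption: X11 over TCP listens on port 6000 + display number.
x11_display_for_port() {
    port=$1
    if [ "$port" -ge 6000 ]; then
        echo ":$((port - 6000))"
    else
        echo "not an X11 port"
    fi
}

# Port taken from the connect() calls in the trace above.
x11_display_for_port 6006    # prints ":6"

# Show what this shell currently exports, for comparison.
echo "DISPLAY=${DISPLAY:-<unset>}"
```

If the two values match (for example a stale DISPLAY left over from an `ssh -X` session), a cheap experiment is to retry with the variable removed, e.g. `env -u DISPLAY mpirun -np 2 hostname`, and see whether the hang goes away.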


Would you have any idea what is causing this?

Thanks in advance :)
