MPI-App hangs when using mpirun -np 1 per host with multi-host setup #13522

@TroyMitchell911

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.9

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

root@localhost:~/ompi# git submodule status
a84ed686ae84fb6a4b251b29b75ecc38f4621ad9 3rd-party/openpmix (v5.0.9)
2e893392405afd914717a2c077accf1c1ec9ee55 3rd-party/prrte (v3.0.12)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)

Please describe the system on which you are running

  • Operating system/version: Ubuntu 25.04
  • Computer hardware: RISC-V
  • Network type: TCP/IP

Details of the problem

I have a simple MPI test program as follows:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    char hostname[256];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(hostname, sizeof(hostname));

    printf("Hello from rank %d of %d on %s\n", rank, size, hostname);

    if (rank == 0) {
        int send_data = 123;
        int recv_data;

        MPI_Send(&send_data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&recv_data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("Rank 0 received: %d from rank 1\n", recv_data);

    } else if (rank == 1) {
        int send_data = 456;
        int recv_data;

        MPI_Recv(&recv_data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&send_data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);

        printf("Rank 1 received: %d from rank 0\n", recv_data);
    }
    MPI_Finalize();
    return 0;
}

My initial attempt was to run MPI with UCX enabled. The goal was to launch one process on host A and one on host B. The command I used was:

mpirun --allow-run-as-root -np 2 --hostfile hosts.txt \
    -x UCX_TLS=self,posix,tcp --mca pml ucx --mca pml_ucx_tls any \
    --mca pml_ucx_devices any ./test

The hostfile was:

10.0.90.205 slots=1
10.0.90.212 slots=1

The program hangs after rank 1 prints its message; rank 0 never reports receiving the reply, and no further output appears:

root@localhost:~# mpirun --allow-run-as-root -np 2 --hostfile hosts.txt ./test
[1763452883.721634] [a:487463:0]     ucp_context.c:2339 UCX  WARN  UCP API version is incompatible: required >= 1.20, actual 1.19.0 (loaded from /opt/ucx/lib/libucp.so.0)
[1763452883.759193] [b:469680:0]     ucp_context.c:2339 UCX  WARN  UCP API version is incompatible: required >= 1.20, actual 1.19.0 (loaded from /opt/ucx/lib/libucp.so.0)
Hello from rank 0 of 2 on a
Hello from rank 1 of 2 on b
Rank 1 received: 123

I also tried running without UCX:

mpirun --allow-run-as-root -np 2 --hostfile hosts.txt ./test

The UCX warnings still appear (coming from UCX being present in the environment), and the behavior is identical: rank 1 prints its received value and then the program hangs.

Next, I changed the hostfile to:

10.0.90.205 slots=2
10.0.90.212 slots=2

And ran with four processes:

mpirun --allow-run-as-root -np 4 --hostfile hosts.txt \
    -x UCX_TLS=self,posix,tcp --mca pml ucx --mca pml_ucx_tls any \
    --mca pml_ucx_devices any ./test

This time the program runs as expected and all ranks print their messages.

Running without UCX also works correctly when using four processes.

Given these results, the issue does not appear to depend on whether UCX is enabled. The hang occurs only when exactly two processes are launched across two nodes, one per node.
