
OpenMPI 4.0.1 crashing ... #6981

@miroi

Description


Hello,

My Open MPI applications are crashing on our cluster; we do not know whether this is due to an old Linux kernel. Here is the info:

Open MPI was installed as:

milias@login.grid.umb.sk:~/bin/openmpi-4.0.1_suites/openmpi-4.0.1_Intel14_GNU6.3g++/ ../configure --prefix=$PWD CXX=g++ CC=icc F77=ifort FC=ifort

with g++ 6.3 and ifort/icc 14.01.
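One thing worth ruling out is the compiler mix itself: building with CXX=g++ alongside Intel C/Fortran is supported, but it ties the installation to a specific GNU C++ runtime. A hedged sketch of an all-Intel configure, assuming Intel's icpc C++ compiler is available (the command below is illustrative, not the reporter's actual invocation):

```shell
# Hypothetical all-Intel rebuild (sketch): icpc replaces g++ so the
# resulting wrappers do not depend on a particular libstdc++ version.
../configure --prefix=$PWD CC=icc CXX=icpc F77=ifort FC=ifort
make -j8 all install
```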
milias@comp04:~/. uname -a
Linux comp04 2.6.32-754.2.1.el6.x86_64 #1 SMP Tue Jul 10 13:23:59 CDT 2018 x86_64 x86_64 x86_64 GNU/Linux
milias@comp04:~/. mpirun --version
mpirun (Open MPI) 4.0.1

Error when running the application:

 Running mpirun -np 8 /home/milias/Work/qch/software/lammps/lammps_stable/src/lmp_mpi -in in.melt
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_segment.c at line 196
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 538
[comp04:20835] PMIX ERROR: ERROR in file dstore_base.c at line 2414
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_segment.c at line 196
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 538
[comp04:20835] PMIX ERROR: ERROR in file dstore_base.c at line 2414
[comp04:20853] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20853] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20858] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20858] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20854] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20854] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20852] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20852] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20856] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20856] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20855] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20855] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20857] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20857] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20851] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20851] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20857] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20854] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20858] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[comp04:20852] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20851] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20855] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[61382,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[comp04:20835] 7 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[comp04:20835] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[comp04:20835] 7 more processes have sent help message help-orte-runtime / orte_init:startup:internal-failure
[comp04:20835] 7 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[comp04:20835] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
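The gds_ds12_lock_pthread errors above come from PMIx's ds12 shared-memory datastore, whose robust pthread locks may not work on a kernel as old as 2.6.32. A hedged diagnostic sketch, assuming the ds12 lock failure is the root cause (the hash component trades shared-memory storage for per-process storage):

```shell
# Tell PMIx to skip the ds12/ds21 shared-memory datastore components
# and use the hash component instead (PMIX_MCA_* is the standard PMIx
# environment-variable form of an MCA parameter).
export PMIX_MCA_gds=hash
mpirun -np 8 /home/milias/Work/qch/software/lammps/lammps_stable/src/lmp_mpi -in in.melt
```

If the job then starts, that would point at the old-kernel/glibc robust-lock support rather than the application itself.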

and on the comp04 node the g++ version is lower:

milias@comp04:~/. mpiCC --version
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
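Since the login node has g++ 6.3 but comp04 resolves mpiCC to g++ 4.4.7, it may help to confirm which compilers the wrappers actually invoke on the compute node. A sketch using standard Open MPI wrapper options:

```shell
# On comp04: show the underlying compiler command each wrapper would run
mpicc --showme:command
mpiCC --showme:command
# And the compilers this Open MPI installation was built with
ompi_info | grep -i compiler
```

If the wrapper on comp04 reports a different g++ than the one used at configure time, loading the matching compiler module (or fixing PATH) on the compute nodes would be the first thing to try.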
