Hello,
my Open MPI applications are crashing on our cluster; we do not know if this is due to an old Linux kernel. Here is the info:
Open MPI was installed as:
milias@login.grid.umb.sk:~/bin/openmpi-4.0.1_suites/openmpi-4.0.1_Intel14_GNU6.3g++/. ./configure --prefix=$PWD CXX=g++ CC=icc F77=ifort FC=ifort
with g++ 6.3 and ifort/icc 14.01.
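For completeness, the presumed full build sequence was roughly as follows (a sketch; only the configure line above is known, the make steps are assumed):
./configure --prefix=$PWD CXX=g++ CC=icc F77=ifort FC=ifort
make -j4       # parallel build; -j4 is an arbitrary choice
make install   # installs into the --prefix directory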
milias@comp04:~/. uname -a
Linux comp04 2.6.32-754.2.1.el6.x86_64 #1 SMP Tue Jul 10 13:23:59 CDT 2018 x86_64 x86_64 x86_64 GNU/Linux
milias@comp04:~/. mpirun --version
mpirun (Open MPI) 4.0.1
Error when running the application:
mpirun -np 8 /home/milias/Work/qch/software/lammps/lammps_stable/src/lmp_mpi -in in.melt
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_segment.c at line 196
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 538
[comp04:20835] PMIX ERROR: ERROR in file dstore_base.c at line 2414
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_segment.c at line 196
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 538
[comp04:20835] PMIX ERROR: ERROR in file dstore_base.c at line 2414
[comp04:20853] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20853] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20858] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20858] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20854] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20854] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20852] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20852] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20856] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20856] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20855] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20855] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20857] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20857] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20851] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20851] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[comp04:20857] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[comp04:20854] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[comp04:20858] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[comp04:20852] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[comp04:20851] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[comp04:20855] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[61382,1],0]
Exit code: 1
--------------------------------------------------------------------------
[comp04:20835] 7 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[comp04:20835] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[comp04:20835] 7 more processes have sent help message help-orte-runtime / orte_init:startup:internal-failure
[comp04:20835] 7 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[comp04:20835] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
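Since the failure happens already inside MPI_Init, a trivial MPI program should reproduce it without LAMMPS. This is a minimal check one could run (a sketch, assuming mpicc from the same Open MPI installation is first in PATH):
cat > mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* If the PMIx dstore is the culprit, this call aborts like LAMMPS does. */
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d initialized fine\n", rank);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_hello.c -o mpi_hello
mpirun -np 2 ./mpi_hello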
On the comp04 node, the g++ version is lower:
milias@comp04:~/. mpiCC --version
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
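The gds_ds12_lock_pthread errors point at the PMIx shared-memory dstore failing to set up its pthread locks, which can be fragile on a kernel as old as this 2.6.32 one. A possible experiment (an assumption on my part: that the PMIx bundled with Open MPI 4.0.1 honors PMIX_MCA_* environment variables) is to switch PMIx to its hash component, which avoids the shared-memory segment entirely:
# Assumption: the internal PMIx reads PMIX_MCA_* from the environment;
# "hash" bypasses the ds12/ds21 shared-memory dstore that is erroring out.
export PMIX_MCA_gds=hash
mpirun -np 8 /home/milias/Work/qch/software/lammps/lammps_stable/src/lmp_mpi -in in.melt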