[ET Trac] [Einstein Toolkit] #1476: Use mvapich2 instead of impi on Stampede
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Wed Dec 4 07:41:42 CST 2013
#1476: Use mvapich2 instead of impi on Stampede
-----------------------+----------------------------------------------------
Reporter: eschnett | Owner:
Type: defect | Status: review
Priority: major | Milestone:
Component: Other | Version: development version
Resolution: | Keywords:
-----------------------+----------------------------------------------------
Changes (by hinder):
* status: new => review
Comment:
I have had endless problems with Intel MPI, both on Stampede and on Datura
in the past. We now use OpenMPI on Datura. On Stampede, I recently tried
to start a 256-core job using Intel MPI, and got the error
{{{
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(658).................:
MPID_Init(195)........................: channel initialization failed
MPIDI_CH3_Init(104)...................:
dapl_rc_setup_all_connections_20(1272): generic failure with errno =
671092751
MPID_nem_dapl_get_from_bc(1239).......: Missing port or invalid host/port
description in business card
}}}
immediately after the ibrun command. This is intermittent; repeating the
same run worked fine. Looking back through my email history, I see that I
also had this error:
{{{
[59:c453-703][../../dapl_poll_rc.c:1360] Intel MPI fatal error:
ofa-v2-mlx4_0-1 DTO operation posted for [1:c411-103] completed with
error. status=0x8. cookie=0x0
Assertion failed in file ../../dapl_poll_rc.c at line 1362: 0
internal ABORT - process 59
}}}
which would cause the run to abort after several hours of runtime;
Roland has also seen this failure with SpEC.
In contrast, I have been running production runs with mvapich2 on Stampede
for many months, and have never had any such problems. As Erik pointed
out, mvapich2 is also the system default on Stampede.
I have just tested Intel MPI and mvapich2 with qc0-mclachlan, and with a
higher-resolution version which uses about 80% of the memory on 256 cores.
The speed on 32 cores (low resolution) and on 256 cores (high resolution)
appears to be similar for the two MPI implementations.
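For anyone who wants to repeat the comparison, a submission of this form
should work (a sketch only: the simulation names, the high-resolution
parameter file name and the walltimes are placeholders, not the exact
values I used):
{{{
# Low-resolution test on 32 cores, high-resolution test on 256 cores
sim create-submit qc0-lowres  --parfile qc0-mclachlan.par       --procs 32  --walltime 2:00:00
sim create-submit qc0-highres --parfile qc0-mclachlan-hires.par --procs 256 --walltime 12:00:00
}}}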
Since I have had so many problems with Intel MPI, I suggest that we change
the simfactory default to mvapich2. The only reported problem with
mvapich2 that I can find is that Roland saw its memory usage increase over
time with SpEC; since we have not seen the same with Cactus, I don't think
this should influence the decision.
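Concretely, changing the default would only touch the Stampede machine
entry in simfactory; a minimal sketch, assuming the new files are called
stampede-mvapich2.cfg and stampede-mvapich2.run (the actual file names may
differ):
{{{
# simfactory/mdb/machines/stampede.ini (relevant keys only)
optionlist = stampede-mvapich2.cfg
runscript  = stampede-mvapich2.run
}}}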
I have a tested optionlist and runscript ready to commit. I have also
made sure that you can switch to Intel MPI simply by selecting the
corresponding optionlist and runscript; both the compilation and the run
are independent of the module command, so the Intel MPI module does not
need to be loaded in envsetup (it doesn't do anything that isn't already
taken care of in the optionlist and runscript).
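For illustration, the run-time side of such a switch lives entirely in the
runscript; here is a minimal sketch of an mvapich2 runscript, assuming
simfactory's usual @...@ placeholder substitution (the affinity setting is
my assumption about what hybrid MPI/OpenMP runs need, not necessarily what
is in the script I will commit):
{{{
#! /bin/bash
# Sketch of a Stampede mvapich2 runscript; simfactory replaces the
# @...@ placeholders when it writes the job script.
export OMP_NUM_THREADS=@NUM_THREADS@
# Disable mvapich2's internal CPU affinity so that the OpenMP threads of
# each process are not all pinned to one core (assumption for hybrid runs).
export MV2_ENABLE_AFFINITY=0
# ibrun is TACC's MPI launcher; it takes the process count and host list
# from the job environment, so no -np option is needed.
ibrun @EXECUTABLE@ -L 3 @PARFILE@
}}}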
I have also run the ET test suite with mvapich2, and get only the expected
failures (ADM etc).
OK to change the default in simfactory to mvapich2?
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1476#comment:3>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit