[ET Trac] [Einstein Toolkit] #1476: Use mvapich2 instead of impi on Stampede

Einstein Toolkit trac-noreply at einsteintoolkit.org
Wed Dec 4 07:41:42 CST 2013


#1476: Use mvapich2 instead of impi on Stampede
-----------------------+----------------------------------------------------
  Reporter:  eschnett  |       Owner:                     
      Type:  defect    |      Status:  review             
  Priority:  major     |   Milestone:                     
 Component:  Other     |     Version:  development version
Resolution:            |    Keywords:                     
-----------------------+----------------------------------------------------
Changes (by hinder):

  * status:  new => review


Comment:

 I have had endless problems with Intel MPI in the past, both on Stampede
 and on Datura.  We now use OpenMPI on Datura.  On Stampede, I recently
 tried to start a 256-core job using Intel MPI, and got the error

 {{{
 Fatal error in MPI_Init: Other MPI error, error stack:
 MPIR_Init_thread(658).................:
 MPID_Init(195)........................: channel initialization failed
 MPIDI_CH3_Init(104)...................:
 dapl_rc_setup_all_connections_20(1272): generic failure with errno = 671092751
 MPID_nem_dapl_get_from_bc(1239).......: Missing port or invalid host/port description in business card
 }}}

 immediately after the ibrun command.  This is intermittent; repeating the
 same run worked fine.  Looking back through my email history, I see that I
 also had this error:

 {{{
 [59:c453-703][../../dapl_poll_rc.c:1360] Intel MPI fatal error: ofa-v2-mlx4_0-1 DTO operation posted for [1:c411-103] completed with error. status=0x8. cookie=0x0
 Assertion failed in file ../../dapl_poll_rc.c at line 1362: 0
 internal ABORT - process 59
 }}}

 This error would abort the run after several hours of runtime; Roland has
 seen the same failure with SpEC.
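
 For what it's worth, failures like the first one occur in MPI_Init itself
 and are independent of Cactus, so they can be probed with a minimal MPI
 program.  The sketch below is illustrative, not part of the proposed
 changes:

 {{{
 /* Minimal MPI smoke test: if MPI_Init fails intermittently here,
    the problem lies in the MPI stack or fabric, not in Cactus. */
 #include <mpi.h>
 #include <stdio.h>

 int main(int argc, char **argv)
 {
   int rank, size;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   MPI_Barrier(MPI_COMM_WORLD);  /* force actual connection setup */
   if (rank == 0)
     printf("MPI_Init OK on %d ranks\n", size);
   MPI_Finalize();
   return 0;
 }
 }}}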

 In contrast, I have been running production runs with mvapich on Stampede
 for many months, and have never had any such problems.  As Erik pointed
 out, mvapich is also the system default on Stampede.

 I have just tested Intel MPI and mvapich with qc0-mclachlan, and with a
 higher-resolution version that uses about 80% of the memory on 256 cores.
 The speed on 32 cores (low resolution) and on 256 cores (high resolution)
 appears to be similar for the two MPI implementations.
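
 For reference, a comparison like this can be driven through simfactory by
 selecting the MPI stack at build time.  The invocation below is only a
 sketch; the optionlist name and the parameter-file path are illustrative:

 {{{
 # Sketch (names illustrative): build with the mvapich optionlist,
 # then submit the 256-core benchmark.
 sim build --machine stampede --optionlist stampede-mvapich2.cfg
 sim create-submit qc0-bench --parfile par/qc0-mclachlan.par \
     --procs 256 --walltime 2:00:00
 }}}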

 Since I have had so many problems with Intel MPI, I suggest that we change
 the simfactory default to mvapich.  The only reported problem I can find
 is that Roland saw memory usage increase over time with SpEC; since we
 have not seen the same with Cactus, I don't think this should influence
 the decision.

 I have a tested optionlist and runscript ready to commit.  I have also
 made sure that you can switch to Intel MPI just by selecting the
 corresponding optionlist and runscript: both compilation and running are
 independent of the module command, so the Intel MPI module does not need
 to be loaded in envsetup (it doesn't do anything that isn't already taken
 care of in the optionlist and runscript).
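
 Concretely, the proposed default change would amount to something like the
 following in simfactory's machine database; this is a sketch, and the
 optionlist/runscript file names are illustrative (the real names are
 whatever the committed files are called):

 {{{
 # simfactory/mdb/machines/stampede.ini (sketch; file names illustrative)
 [stampede]
 optionlist = stampede-mvapich2.cfg
 runscript  = stampede-mvapich2.run
 # To use Intel MPI instead, select e.g. stampede-impi.cfg and
 # stampede-impi.run; no module changes in envsetup are required.
 }}}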

 I have also run the ET testsuite with mvapich, and get only the expected
 failures (ADM etc.).

 OK to change the default in simfactory to mvapich?

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1476#comment:3>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

