[ET Trac] [Einstein Toolkit] #1547: issues with stampede

Einstein Toolkit trac-noreply at einsteintoolkit.org
Thu Feb 20 14:11:44 CST 2014


#1547: issues with stampede
----------------------------+-----------------------------------------------
 Reporter:  jchsma@…        |       Owner:            
     Type:  defect          |      Status:  new       
 Priority:  major           |   Milestone:            
Component:  Other           |     Version:  ET_2013_11
 Keywords:                  |  
----------------------------+-----------------------------------------------
 Over the past few months, I have been using RIT's LazEv code with only
 minor hiccups on Stampede (chiefly the unreproducible 'dapl_conn_rc'
 crashes that other Stampede users are surely familiar with).  That
 checkout was of the previous release, ET_2013_05, compiled with Intel
 MPI.  Most of the jobs I ran took advantage of some symmetry, and I was
 able to run on 12-16 nodes at about 50-60% memory usage.

 After the fix for the sync issue was backported, I checked out the new
 release, ET_2013_11, and immediately ran into problems.  The first issue
 was with run performance and LoopControl, which we sorted out with the
 mailing list's help.  The second was with crashes and checkpointing.  With
 both the Intel MPI and MVAPICH2 configurations, the code would hang about
 50% of the time when dumping a periodic checkpoint, and 100% of the time
 when dumping a termination checkpoint.  Further, the crashes seemed more
 frequent, and I couldn't get a simulation to run for a full 24 hours
 without failing (either by stalling on a checkpoint or crashing outright).
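
 For reference, the checkpointing behaviour described above is controlled
 by the standard IOUtil/IOHDF5 parameters; the hangs occur with settings
 along these lines (parameter names are from the standard thorns, the
 values here are only illustrative, not the exact ones from my runs):

 ```
 # Periodic HDF5 checkpoints plus a checkpoint on termination
 IOHDF5::checkpoint           = yes
 IO::checkpoint_dir           = "checkpoints"
 IO::checkpoint_every         = 1024   # iterations between checkpoints
 IO::checkpoint_on_terminate  = yes    # this one hangs 100% of the time
 ```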

 So, I checked out a clean version of the toolkit, with only standard
 toolkit thorns and none of RIT's thorns, and compiled it with both the
 Intel MPI and MVAPICH2 configurations in SimFactory.

 In both cases, I can run the 'qc0-mclachlan.par' file to completion with
 no issues.  So I edited the qc0 parfile to update the grid, remove the
 symmetries, and update the initial data to match my test parameter file.
 I ran the job on 20 nodes, and with either configuration I was not able
 to run it to completion on any of my numerous attempts.  The Intel MPI
 runs die with the standard, unhelpful "dapl_conn_rc" error at random
 times in the evolution, and the MVAPICH2 runs die with:

 [c431-903.stampede.tacc.utexas.edu:mpispawn_7][readline] Unexpected End-
 Of-File on file descriptor 6. MPI process died?
 [c431-903.stampede.tacc.utexas.edu:mpispawn_7][mtpmi_processops] Error
 while reading PMI socket. MPI process died?
 [c431-903.stampede.tacc.utexas.edu:mpispawn_7][child_handler] MPI process
 (rank: 15, pid: 106620) terminated with signal 9 -> abort job
 [c429-501.stampede.tacc.utexas.edu:mpirun_rsh][process_mpispawn_connection]
 mpispawn_7 from node c431-903 aborted: Error while reading a PMI socket
 (4)
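
 For anyone trying to reproduce this, removing the symmetries from the
 qc0 parfile amounts to edits along these lines (drop ReflectionSymmetry
 and RotatingSymmetry180 from the ActiveThorns list and extend the domain
 to cover all octants; the extents below are illustrative, not my actual
 test values):

 ```
 # Full domain, no bitant/rotating symmetry
 CoordBase::xmin = -120.00
 CoordBase::ymin = -120.00
 CoordBase::zmin = -120.00
 CoordBase::xmax =  120.00
 CoordBase::ymax =  120.00
 CoordBase::zmax =  120.00
 ```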

 The Intel MPI jobs died with the same dapl_conn_rc error at run times of
 2 hours, 8 hours, and 21 hours.  I also had one job that hung and did not
 exit until it was killed by the queue manager.  The MVAPICH2 jobs died at
 around 3 hours and 8 hours with the error above.

 We've been in contact with TACC and they said it was a Cactus issue, so I
 am sending this report.

 Attached is the parameter file I used for the tests.  It should work
 with a stock ET_2013_11 checkout.

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1547>
Einstein Toolkit <http://einsteintoolkit.org>