[ET Trac] [Einstein Toolkit] #1547: issues with stampede
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Thu Feb 20 14:11:44 CST 2014
#1547: issues with stampede
----------------------------+-----------------------------------------------
Reporter: jchsma@… | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: Other | Version: ET_2013_11
Keywords: |
----------------------------+-----------------------------------------------
Over the past few months, I have been using RIT's LazEv code on Stampede
with only minor hiccups (notably the unreproducible 'dapl_conn_rc'
crashes that I'm sure other Stampede users are familiar with). That
checkout was of the previous release, ET_2013_05, and was compiled with
Intel MPI. Most of the jobs I ran took advantage of some symmetry, and I
was able to run on 12-16 nodes at about 50-60% memory usage.
After the fix for the sync issue was backported, I checked out the new
release, ET_2013_11, and immediately ran into problems. The first issue
was with run performance and LoopControl, which we sorted out with the
mailing list's help. The second was with crashes and checkpointing. With
both the Intel MPI and MVAPICH2 configurations, the code would hang about
50% of the time when dumping a regular checkpoint, and 100% of the time
when dumping a termination checkpoint. Furthermore, the crashes seemed
more frequent, and I couldn't get a simulation to run for a full 24 hours
without failing (either by stalling on a checkpoint or otherwise).
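For reference, checkpointing in these runs goes through the usual
CarpetIOHDF5/IOUtil parameters; a sketch of the relevant settings
(illustrative values, not necessarily those in the attached parfile):

  ActiveThorns = "CarpetIOHDF5"             # IOUtil is already active
  IOHDF5::checkpoint          = "yes"       # write HDF5 checkpoints
  IO::checkpoint_dir          = $parfile
  IO::checkpoint_ID           = "yes"       # checkpoint after initial data
  IO::checkpoint_every        = 2048        # periodic checkpoints (~50% hangs)
  IO::checkpoint_keep         = 2
  IO::checkpoint_on_terminate = "yes"       # the termination checkpoint (100% hangs)
  IO::recover                 = "autoprobe"
  IO::recover_dir             = $parfile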
So, I checked out a clean version of the toolkit with only toolkit
thorns, removing any thorns specific to RIT, and compiled it with both
the Intel MPI and MVAPICH2 configurations in simfactory.
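Roughly, the two builds and the test runs were set up along these lines
(the configuration and option-list names below are placeholders, not the
exact files on Stampede):

  # one configuration per MPI stack
  ./simfactory/bin/sim build et2013_11-impi \
      --machine stampede --thornlist manifest/einsteintoolkit.th \
      --optionlist stampede-impi.cfg
  ./simfactory/bin/sim build et2013_11-mvapich2 \
      --machine stampede --thornlist manifest/einsteintoolkit.th \
      --optionlist stampede-mvapich2.cfg

  # submit the unmodified example parfile
  ./simfactory/bin/sim create-submit qc0-test \
      --configuration et2013_11-impi --parfile par/qc0-mclachlan.par \
      --procs 320 --walltime 24:00:00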
In both cases, I can run the 'qc0-mclachlan.par' file to completion with
no issues. So I edited the qc0 parfile to update the grid, remove the
symmetries, and update the initial data to match my test parameter file;
roughly, the edits were of the kind sketched below.
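To give an idea of the changes (illustrative values only, not the exact
contents of the attached parfile):

  # full 3D domain: the ReflectionSymmetry/RotatingSymmetry thorns and
  # their parameters were removed, and the box extended to both sides
  CoordBase::xmin = -120.0
  CoordBase::xmax = +120.0
  CoordBase::ymin = -120.0
  CoordBase::ymax = +120.0
  CoordBase::zmin = -120.0
  CoordBase::zmax = +120.0
  CarpetRegrid2::num_centres = 2      # one refinement centre per puncture

  # initial data: TwoPunctures parameters matched to my test binary
  TwoPunctures::par_b       = 5.0
  TwoPunctures::par_m_plus  = 0.5
  TwoPunctures::par_m_minus = 0.5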
I ran the job on 20 nodes, and with either configuration I was unable to
run it to completion in any of my numerous attempts. The Intel MPI runs
die with the standard, unhelpful "dapl_conn_rc" error at random times in
the evolution, and the MVAPICH2 runs die with:
  [c431-903.stampede.tacc.utexas.edu:mpispawn_7][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
  [c431-903.stampede.tacc.utexas.edu:mpispawn_7][mtpmi_processops] Error while reading PMI socket. MPI process died?
  [c431-903.stampede.tacc.utexas.edu:mpispawn_7][child_handler] MPI process (rank: 15, pid: 106620) terminated with signal 9 -> abort job
  [c429-501.stampede.tacc.utexas.edu:mpirun_rsh][process_mpispawn_connection] mpispawn_7 from node c431-903 aborted: Error while reading a PMI socket (4)
The Intel MPI jobs died with the same dapl_conn_rc error at run times of
2 hours, 8 hours, and 21 hours; I also had one job that hung and did not
exit until it was killed by the queue manager. The MVAPICH2 jobs died at
around 3 hours and 8 hours with the error above.
We have been in contact with TACC, and they said it was a Cactus issue,
so I am sending this report.
Attached is the parameter file I used for these tests; it should work
with a stock ET_2013_11 checkout.
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1547>
Einstein Toolkit <http://einsteintoolkit.org>