[ET Trac] [Einstein Toolkit] #534: Checkpointing fails on LoneStar
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Fri Sep 9 22:57:10 CDT 2011
#534: Checkpointing fails on LoneStar
---------------------+------------------------------------------------------
Reporter: hinder | Owner: eschnett
Type: defect | Status: new
Priority: major | Milestone:
Component: Carpet | Version:
Resolution: | Keywords: CarpetIOHDF5
---------------------+------------------------------------------------------
Comment (by knarf):
Looking closer at my runs, I see that they sometimes (about once every one
or two days of runtime) stall for a while. I just caught one of these
stalls, and attaching gdb revealed that the processes were stuck in MPI
calls within reductions called from IOASCII. Typically the runs resume
after some time, but the "lost" time is clearly visible in an M/h plot.
I am not sure whether it is I/O or MPI that hangs; both are involved in
your case and mine. I don't seem to have problems accessing the files,
though. Maybe it's worth trying Open MPI after all. Ian: would you be
interested in providing such an option list?
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/534#comment:13>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit