[ET Trac] [Einstein Toolkit] #1547: issues with stampede
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Fri May 2 09:07:11 CDT 2014
#1547: issues with stampede
-----------------------------+----------------------------------------------
Reporter:   jchsma@…         | Owner:
Type:       defect           | Status:    new
Priority:   major            | Milestone:
Component:  Other            | Version:   ET_2013_11
Resolution:                  | Keywords:
-----------------------------+----------------------------------------------
Comment (by hinder):
I am also having similar problems on Stampede. I am using mvapich2. I
posted a summary to the mailing list
(http://lists.einsteintoolkit.org/pipermail/users/2014-May/003580.html),
and I include it here for reference.
I've had jobs die while checkpointing, and others hang for no apparent
reason. These might be separate problems. The checkpointing issue
occurred when I submitted several jobs and they all started
checkpointing at the same time after 3 hours. The hang happened after a
few hours of evolution, with GDB reporting:
{{{
MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
296 for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)
}}}
Unfortunately I didn't ask for a backtrace. I've been in touch with
support, and they said the jobs dying while checkpointing coincided with
the filesystems being hit hard by my jobs, which makes sense, but they
didn't see any problems in their logs, and they have no idea about the
hang. I reran the job that hung and it ran fine.
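
If the hang recurs, a full backtrace from the stuck process would be more
useful than the single frame above. The following is only a minimal sketch
of one way to build the GDB command for that. The helper name
gdb_backtrace_command is hypothetical, and it assumes gdb is available on
the compute node and that the PID of the hung process is already known.

```python
import shlex
import sys


def gdb_backtrace_command(pid, gdb="gdb"):
    """Return the argument list for a non-interactive GDB run that attaches
    to process `pid` and prints a backtrace for every thread.

    `gdb_backtrace_command` is an illustrative helper, not part of any
    toolkit. The flags used are standard GDB options:
      -p PID    attach to a running process
      -batch    run the given commands, then exit
      -ex CMD   execute a GDB command
    """
    return [gdb, "-p", str(int(pid)), "-batch", "-ex", "thread apply all bt"]


if __name__ == "__main__":
    # Print one shell-quoted command per PID given on the command line,
    # so it can be inspected before being run on the node.
    for arg in sys.argv[1:]:
        print(shlex.join(gdb_backtrace_command(arg)))
```

The command is printed rather than executed so that it can be checked and
then run by hand on the node where the process is hanging.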
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1547#comment:3>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit