[ET Trac] [Einstein Toolkit] #1547: issues with stampede

Einstein Toolkit trac-noreply at einsteintoolkit.org
Fri May 2 09:07:11 CDT 2014


#1547: issues with stampede
-----------------------------+----------------------------------------------
  Reporter:  jchsma@…        |       Owner:            
      Type:  defect          |      Status:  new       
  Priority:  major           |   Milestone:            
 Component:  Other           |     Version:  ET_2013_11
Resolution:                  |    Keywords:            
-----------------------------+----------------------------------------------

Comment (by hinder):

 I am having similar problems on Stampede.  I am using mvapich2.  I
 posted a summary to the mailing list
 (http://lists.einsteintoolkit.org/pipermail/users/2014-May/003580.html),
 and I include it here for reference.

   I've had jobs die while checkpointing, and I've also had jobs hang for
 no apparent reason.  These might be separate problems.  The checkpointing
 issue occurred when I submitted several jobs and they all started
 checkpointing at the same time after 3 hours.  The hang happened after a
 few hours of evolution, with GDB reporting

 {{{
 MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
   at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
 296         for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)
 }}}

   Unfortunately I didn't ask for a backtrace.  I've been in touch with
 support, and they said that the jobs dying while checkpointing coincided
 with the filesystems being hit hard by my jobs, which makes sense.
 However, they didn't see any problems in their logs, and they have no
 idea about the mysterious hang.  I repeated the hanging job and it ran
 fine.
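
 For a future hang, a full backtrace from the stuck process could be
 collected by attaching GDB non-interactively.  The following is only a
 minimal sketch: the helper names gdb_backtrace_command and
 collect_backtrace are hypothetical, and it assumes that gdb is available on
 the compute node where the hung rank is running and that the user is
 allowed to attach to the process.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a GDB command line that attaches to `pid`, prints the
    backtrace of every thread, and exits without prompting."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def collect_backtrace(pid, timeout=60):
    """Run GDB against a live process and return its combined output.

    Assumption: this is run on the same node as the process, by a user
    with permission to attach to it.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr
```

 Calling collect_backtrace with the process ID of a hung MPI rank would
 return the text of all thread backtraces, which could then be attached to
 the ticket.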

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1547#comment:3>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

