[Users] Stampede

Ian Hinder ian.hinder at aei.mpg.de
Fri May 2 04:55:44 CDT 2014


Hi all,

Has anyone run into problems recently with Cactus jobs on Stampede?  I've had jobs die while checkpointing, and others hang for no apparent reason.  These might be separate problems.  The checkpointing failure occurred when I submitted several jobs and they all started checkpointing at the same time, 3 hours into the run.  The hang happened after a few hours of evolution, with GDB reporting

> MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
>   at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
> 296	    for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)

Unfortunately I didn't ask for a backtrace. I'm using mvapich2.  I've been in touch with support. They said the jobs dying while checkpointing coincided with the filesystems being hit hard by my jobs, which makes sense, but they didn't see any problems in their logs, and they have no explanation for the hang.  I reran the job that hung and it ran fine.

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
