<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Hi all,<div><br></div><div>Has anyone run into problems recently with Cactus jobs on Stampede? I've had jobs die while checkpointing, and others hang for no apparent reason. These might be separate problems. The checkpointing issue occurred when I submitted several jobs and they all started checkpointing at the same time, after 3 hours. The hang happened after a few hours of evolution, with GDB reporting</div><div><br></div><div><blockquote type="cite">MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)<br> at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296<br>296<span class="Apple-tab-span" style="white-space: pre;">        </span> for (; i &lt; mv2_MPIDI_CH3I_RDMA_Process.polling_group_size;<br> ++i)</blockquote><br></div><div>Unfortunately, I didn't ask for a backtrace. I'm using mvapich2. I've been in touch with support, and they said the failures during checkpointing coincided with the filesystems being hit hard by my jobs, which makes sense, but they didn't see any problems in their logs, and they have no idea what caused the hang. I reran the job that hung and it ran fine.</div><div><br><div apple-content-edited="true">
<div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>-- </div><div>Ian Hinder</div><div><a href="http://numrel.aei.mpg.de/people/hinder">http://numrel.aei.mpg.de/people/hinder</a></div></div>
</div>
<br></div></body></html>