Hi

I have been having problems running on Stampede for a long time. I
couldn't get the latest stable ET to run because it would die during
checkpointing, so I had to backtrack to the Orsted version
(unfortunately, that has a bug in the way the grid is set up, causing
some of the intermediate levels to span both black holes and wasting a
lot of memory). Even with Orsted, stalling is a real issue. Currently,
my "solution" is to run for 4 hours at a time. This would have been OK
on Lonestar or Ranger, because when I chained a bunch of runs, the next
in line would start almost right away, but on Stampede the delay is
quite substantial. I believe Jim Healy opened a ticket concerning the
RIT issues with running ET on Stampede.
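
(By "chaining" I mean submitting each 4-hour segment with a SLURM
dependency on the previous one, roughly as in the sketch below;
run.slurm stands in for the actual submit script, and each segment is
assumed to recover from the most recent checkpoint when it starts.)

    # rough sketch: chain six 4-hour segments with SLURM job dependencies
    jobid=$(sbatch run.slurm | awk '{print $NF}')        # first segment
    for i in $(seq 2 6); do
        # each segment starts only after the previous one has ended (or died)
        jobid=$(sbatch --dependency=afterany:${jobid} run.slurm | awk '{print $NF}')
    done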

On 05/02/2014 05:55 AM, Ian Hinder wrote:
> Hi all,
>
> Has anyone run into problems recently with Cactus jobs on Stampede?
> I've had jobs die when checkpointing, and also hang mysteriously for
> no apparent reason. These might be separate problems. The
> checkpointing issue occurred when I submitted several jobs and they
> all started checkpointing at the same time after 3 hours. The hang
> happened after a few hours of evolution, with GDB reporting
>
>     MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
>         at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
>     296         for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)
>
> Unfortunately I didn't ask for a backtrace. I'm using mvapich2. I've
> been in touch with support, and they said the dying while
> checkpointing coincided with the filesystems being hit hard by my
> jobs, which makes sense, but they didn't see any problems in their
> logs, and they have no idea about the mysterious hang. I repeated the
> hanging job and it ran fine.
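
If it hangs again, one way to see what is going on without killing the
job is to attach gdb to the stuck rank on its compute node and dump
every thread. A rough sketch, where cactus_sim stands in for whatever
the executable is actually called and <PID> is the process ID that
pgrep reports:

    # find the PID of the hung executable on its compute node
    pgrep -u $USER cactus_sim
    # attach, dump backtraces of all threads, then detach and leave the job running
    gdb -p <PID> -batch -ex 'thread apply all bt' -ex detach > backtrace.txt 2>&1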

> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder