<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On 02 May 2014, at 14:08, Yosef Zlochower &lt;<a href="mailto:yosef@astro.rit.edu">yosef@astro.rit.edu</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
<div bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Hi<br>
<br>
I have been having problems running on Stampede for a long time. I
couldn't get the latest<br>
stable ET to run because during checkpointing, it would die. </div></div></blockquote><div><br></div><div>OK that's very interesting. Has something changed in the code related to how checkpoint files are written?</div><div><br></div><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000"><div class="moz-cite-prefix">I had
to backtrack to <br>
the Orsted version (unfortunately, that has a bug in the way the
grid is set up, causing some of the<br>
intermediate levels to span both black holes, wasting a lot of
memory). </div></div></blockquote><div><br></div><div>That bug should have been fixed in a backport; are you sure you are checking out the branch and not the tag? In any case, it can be worked around by setting CarpetRegrid2::min_fraction = 1, assuming this is the same bug I am thinking of (<a href="http://cactuscode.org/pipermail/users/2013-January/003290.html">http://cactuscode.org/pipermail/users/2013-January/003290.html</a>)</div><div><br></div><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000"><div class="moz-cite-prefix">Even with<br>
Orsted, stalling is a real issue. Currently, my "solution" is to
run for 4 hours at a time.<br>
This would have been OK on Lonestar or Ranger,<br>
because when I chained a bunch of runs, the next in line would
start<br>
almost right away, but on Stampede the delay is quite substantial.
I believe Jim Healy opened<br>
a ticket concerning the RIT issues with running ET on Stampede.<br></div></div></blockquote><div><br></div><div>I think this is the ticket: <a href="https://trac.einsteintoolkit.org/ticket/1547">https://trac.einsteintoolkit.org/ticket/1547</a>. I will add my information there. The current queue wait time on Stampede is more than a day, so splitting runs into 3-hour chunks is not feasible, as you say.</div><div><br></div><div>I'm starting to think it might be a code problem as well. So the summary is:</div><div><br></div><div><span class="Apple-tab-span" style="white-space:pre">        </span>– Checkpointing causes jobs to die with code versions after Oersted</div><div><span class="Apple-tab-span" style="white-space:pre">        </span>– All versions eventually lead to hung jobs after a few hours</div><div><br></div><div>Since Stampede is the major "capability" resource in XSEDE, we should put some effort into making sure the ET can run properly there.</div></div><div apple-content-edited="true">
<div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>-- </div><div>Ian Hinder</div><div><a href="http://numrel.aei.mpg.de/people/hinder">http://numrel.aei.mpg.de/people/hinder</a></div></div>
</div>
<br></body></html>