<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On 02 May 2014, at 16:57, Yosef Zlochower <<a href="mailto:yosef@astro.rit.edu">yosef@astro.rit.edu</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">On 05/02/2014 10:07 AM, Ian Hinder wrote:<br><blockquote type="cite"><br>On 02 May 2014, at 14:08, Yosef Zlochower <<a href="mailto:yosef@astro.rit.edu">yosef@astro.rit.edu</a><br><<a href="mailto:yosef@astro.rit.edu">mailto:yosef@astro.rit.edu</a>>> wrote:<br><br><blockquote type="cite">Hi<br><br>I have been having problems running on Stampede for a long time. I<br>couldn't get the latest<br>stable ET to run because during checkpointing, it would die.<br></blockquote><br>OK that's very interesting. Has something changed in the code related<br>to how checkpoint files are written?<br><br><blockquote type="cite">I had to backtrack to<br>the Orsted version (unfortunately, that has a bug in the way the grid<br>is set up, causing some of the<br>intermediate levels to span both black holes, wasting a lot of memory).<br></blockquote><br>That bug should have been fixed in a backport; are you sure you are<br>checking out the branch and not the tag? In any case, it can be worked<br>around by setting CarpetRegrid2::min_fraction = 1, assuming this is the<br>same bug I am thinking of<br>(<a href="http://cactuscode.org/pipermail/users/2013-January/003290.html">http://cactuscode.org/pipermail/users/2013-January/003290.html</a>)<br></blockquote><br>I was using an old executable so it wouldn't have had the backport<br>fix.<br><br><blockquote type="cite"><br><blockquote type="cite">Even with<br>Orsted, stalling is a real issue. 
Currently, my "solution" is to run<br>for 4 hours at a time.<br>This would have been OK on Lonestar or Ranger,<br> because when I chained a bunch of runs, the next in line would start<br>almost right away, but on Stampede the delay is quite substantial. I<br>believe Jim Healy opened<br>a ticket concerning the RIT issues with running ET on Stampede.<br></blockquote><br>I think this is the ticket:<br><a href="https://trac.einsteintoolkit.org/ticket/1547">https://trac.einsteintoolkit.org/ticket/1547</a>. I will add my information<br>there. The current queue wait time on Stampede is more than a day, so<br>splitting into 3-hour chunks is not feasible, as you say.<br><br>I'm starting to think it might be a code problem as well. So the<br>summary is:<br><br>– Checkpointing causes jobs to die with code versions after Oersted<br>– All versions lead to eventual hung jobs after a few hours<br><br>Since Stampede is the major "capability" resource in XSEDE, we should<br>put some effort into making sure the ET can run properly there.<br></blockquote><br>We find issues with runs stalling on our local cluster too. The hardware<br>setup is similar to Stampede (Intel Nehalem with QDR IB and Open MPI on<br>top of a proprietary IB library). There's no guarantee that the issues<br>are the same, but we can try to run some tests locally (note that we<br>have no issues with runs failing to checkpoint).<br></blockquote><div><br></div><div>I resubmitted, and the new job hangs later on. gdb says it is in CarpetIOScalar while doing output of a maximum reduction. I've disabled this and resubmitted.</div></div><br><div apple-content-edited="true">
<div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>-- </div><div>Ian Hinder</div><div><a href="http://numrel.aei.mpg.de/people/hinder">http://numrel.aei.mpg.de/people/hinder</a></div></div>
</div>
<br></body></html>