[Users] Stampede
Ian Hinder
ian.hinder at aei.mpg.de
Fri May 2 09:07:59 CDT 2014
On 02 May 2014, at 14:08, Yosef Zlochower <yosef at astro.rit.edu> wrote:
> Hi
>
> I have been having problems running on Stampede for a long time. I couldn't get the latest
> stable ET to run because during checkpointing, it would die.
OK that's very interesting. Has something changed in the code related to how checkpoint files are written?
> I had to backtrack to
> the Orsted version (unfortunately, that has a bug in the way the grid is set up, causing some of the
> intermediate levels to span both black holes, wasting a lot of memory).
That bug should have been fixed in a backport; are you sure you are checking out the branch and not the tag? In any case, it can be worked around by setting CarpetRegrid2::min_fraction = 1, assuming this is the same bug I am thinking of (http://cactuscode.org/pipermail/users/2013-January/003290.html)
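For reference, a minimal sketch of that workaround as it would appear in a Cactus parameter file. It assumes CarpetRegrid2 is already listed in ActiveThorns and that the rest of the grid setup is left unchanged; it is untested against this particular configuration:

```
# Workaround sketch: only accept bounding boxes that are fully covered by
# points needing refinement, so intermediate levels are not merged into one
# box spanning both black holes (assumes this is the bug described above).
CarpetRegrid2::min_fraction = 1
```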
> Even with
> Orsted, stalling is a real issue. Currently, my "solution" is to run for 4 hours at a time.
> This would have been OK on Lonestar or Ranger,
> because when I chained a bunch of runs, the next in line would start
> almost right away, but on Stampede the delay is quite substantial. I believe Jim Healy opened
> a ticket concerning the RIT issues with running ET on Stampede.
I think this is the ticket: https://trac.einsteintoolkit.org/ticket/1547. I will add my information there. The current queue wait time on Stampede is more than a day, so splitting into 4-hour chunks is not feasible, as you say.
I'm starting to think it might be a code problem as well. So the summary is:
– Checkpointing causes jobs to die with code versions after Oersted
– All versions lead to eventual hung jobs after a few hours
Since Stampede is the major "capability" resource in XSEDE, we should put some effort into making sure the ET can run properly there.
--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder