[Users] Stampede
Ian Hinder
ian.hinder at aei.mpg.de
Fri May 2 11:15:02 CDT 2014
On 02 May 2014, at 16:57, Yosef Zlochower <yosef at astro.rit.edu> wrote:
> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>>
>> On 02 May 2014, at 14:08, Yosef Zlochower <yosef at astro.rit.edu> wrote:
>>
>>> Hi
>>>
>>> I have been having problems running on Stampede for a long time. I
>>> couldn't get the latest stable ET to run because it would die during
>>> checkpointing.
>>
>> OK that's very interesting. Has something changed in the code related
>> to how checkpoint files are written?
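For context, checkpointing in a Cactus/Carpet run is typically set up with a parameter-file fragment roughly like the one below; this is a generic sketch with placeholder values, not the actual parameter file from these runs. The reported crash happens while these checkpoint files are being written.

    # Generic checkpoint/recovery settings (placeholder values)
    IO::checkpoint_dir          = "checkpoints"
    IO::checkpoint_every        = 8192         # iterations between checkpoints
    IO::checkpoint_on_terminate = "yes"        # also checkpoint when the job ends
    IO::recover                 = "autoprobe"  # recover automatically if a checkpoint exists
    IO::recover_dir             = "checkpoints"
    CarpetIOHDF5::checkpoint    = "yes"        # write checkpoint files via CarpetIOHDF5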
>>
>>> I had to backtrack to
>>> the Oersted version (unfortunately, that has a bug in the way the grid
>>> is set up, causing some of the
>>> intermediate levels to span both black holes, wasting a lot of memory).
>>
>> That bug should have been fixed in a backport; are you sure you are
>> checking out the branch and not the tag? In any case, it can be worked
>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>> same bug I am thinking of
>> (http://cactuscode.org/pipermail/users/2013-January/003290.html)
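A minimal sketch of that workaround as it would appear in the parameter file; only the min_fraction setting is the actual fix, and the comment reflects my understanding of what it does:

    # Work around the Oersted regridding bug: only merge refinement boxes
    # when the merged box is (essentially) fully covered by the originals
    CarpetRegrid2::min_fraction = 1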
>
> I was using an old executable so it wouldn't have had the backport
> fix.
>
>>
>>> Even with
>>> Oersted, stalling is a real issue. Currently, my "solution" is to run
>>> for 4 hours at a time.
>>> This would have been OK on Lonestar or Ranger,
>>> because when I chained a bunch of runs, the next in line would start
>>> almost right away, but on Stampede the delay is quite substantial. I
>>> believe Jim Healy opened
>>> a ticket concerning the RIT issues with running the ET on Stampede.
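For reference, chaining short jobs like this on a SLURM system such as Stampede can be done with job dependencies; a rough sketch, where "run.sbatch" is a hypothetical batch script that recovers from the latest checkpoint:

    # Submit a chain of 4-hour jobs; each starts only after the previous one finishes
    jobid=$(sbatch run.sbatch | awk '{print $4}')
    for i in 2 3 4 5; do
        jobid=$(sbatch --dependency=afterany:${jobid} run.sbatch | awk '{print $4}')
    done

The catch, as discussed below, is that each chained job still has to sit in the queue, which on Stampede currently adds a substantial delay per chunk.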
>>
>> I think this is the ticket:
>> https://trac.einsteintoolkit.org/ticket/1547. I will add my information
>> there. The current queue wait time on stampede is more than a day, so
>> splitting into 3-hour chunks is not feasible, as you say.
>>
>> I'm starting to think it might be a code problem as well. So the
>> summary is:
>>
>> – Checkpointing causes jobs to die with code versions after Oersted
>> – All versions lead to eventual hung jobs after a few hours
>>
>> Since Stampede is the major "capability" resource in XSEDE, we should
>> put some effort into making sure the ET can run properly there.
>
> We find issues with runs stalling on our local cluster too. The hardware
> setup is similar to Stampede (Intel Nehalem with QDR IB and Open MPI on
> top of a proprietary IB library). There's no guarantee that the issues
> are the same, but we can try to run some tests locally (note that we
> have no issues with runs failing to checkpoint).
I resubmitted, and the new job hangs later on. gdb says it is in CarpetIOScalar while doing output of a maximum reduction. I've disabled this and resubmitted.
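Disabling it amounts to removing "maximum" from the scalar output reductions; a hypothetical parameter-file fragment (the variable list and output interval are illustrative, not taken from the actual run):

    # Scalar output with the maximum reduction removed
    IOScalar::outScalar_every      = 256
    IOScalar::outScalar_reductions = "minimum norm1 norm2"   # "maximum" dropped
    IOScalar::outScalar_vars       = "ADMBase::lapse"        # illustrative variable list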
--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder