[Users] Stampede

Mon Jul 28 05:14:18 CDT 2014

On 14 Jul 2014, at 22:14, Yosef Zlochower <yosef at astro.rit.edu> wrote:

> I tried a run on Stampede today and it died during checkpoint with the
> error
> " send desc error
> send desc error
> [0] Abort: Got completion with error 12, vendor code=81, dest rank=
> at line 892 in file ../../ofa_poll.c"
> 
> Have you been having success with production runs on Stampede?

I have seen errors when several runs checkpoint at the same time, as can happen if many jobs start simultaneously and dump a checkpoint after 3 hours. According to TACC support, there was nothing unusual in the system logs.  I thought it would be useful to add a "random" delay to the checkpoint code.  For example, in addition to telling it to checkpoint every 3 hours, you could say "checkpoint every 3 hours, plus a random number between -20 and +20 minutes".
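A minimal sketch of the jittered interval, in Python for brevity. The function name and the idea of seeding from a job ID are assumptions for illustration and are not part of any existing Cactus thorn. All processes of one job must agree on the interval, so the sketch derives the jitter from a seed that every process shares; drawing on one process and broadcasting would work equally well.

```python
import random


def next_checkpoint_interval(base_hours=3.0, jitter_minutes=20.0, seed=None):
    """Return a checkpoint interval in seconds: base plus uniform jitter.

    Hypothetical helper. `seed` should be a value shared by all processes
    of a job (for example an integer job ID) so that every process computes
    the same interval, while different jobs get different intervals.
    """
    rng = random.Random(seed)
    # Uniform offset in [-jitter_minutes, +jitter_minutes].
    jitter = rng.uniform(-jitter_minutes, jitter_minutes)
    return base_hours * 3600.0 + jitter * 60.0
```

With the defaults this gives an interval between 2 h 40 min and 3 h 20 min, so jobs that start together no longer all checkpoint in the same minute.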

The error message above suggests something to do with communication ("send desc").  Checkpointing itself shouldn't do any MPI communication, should it?  Does it perform consistency checks across processes, or otherwise do communication?  I also saw freezes during scalar reduction output (see quoted text below).  Maybe some of the processes are taking much longer to checkpoint than others, and the ones which finish time out while trying to communicate?  Maybe adding a barrier after checkpointing would make this clearer?
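To illustrate what a barrier after checkpointing would show, here is a toy model in Python that uses threads as stand-ins for MPI processes. It is not Cactus or MPI code, and the function name is made up. In a real run the equivalent would be to time an MPI_Barrier with MPI_Wtime right after the checkpoint is written. The time each process spends in the barrier is how far ahead of the slowest writer it finished.

```python
import threading
import time


def barrier_waits(write_times, timeout=5.0):
    """Simulate processes whose checkpoint writes take write_times[i] seconds.

    Returns the seconds each simulated process waited at the barrier.
    An entry stays None if the barrier timed out for that process.
    """
    n = len(write_times)
    barrier = threading.Barrier(n)
    waits = [None] * n

    def process(i):
        time.sleep(write_times[i])  # stand-in for writing the checkpoint file
        start = time.monotonic()
        try:
            barrier.wait(timeout=timeout)  # stand-in for MPI_Barrier
        except threading.BrokenBarrierError:
            return  # timed out: some process never arrived
        waits[i] = time.monotonic() - start

    threads = [threading.Thread(target=process, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waits
```

If the fast processes report long waits here, the skew in checkpoint write times is large, which would fit the idea that they later time out in the next communication call.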

> 
> 
> On 05/02/2014 12:15 PM, Ian Hinder wrote:
>> 
>> On 02 May 2014, at 16:57, Yosef Zlochower <yosef at astro.rit.edu> wrote:
>> 
>>> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>>>> 
>>>> On 02 May 2014, at 14:08, Yosef Zlochower <yosef at astro.rit.edu> wrote:
>>>> 
>>>>> Hi
>>>>> 
>>>>> I have been having problems running on Stampede for a long time. I
>>>>> couldn't get the latest stable ET to run because it would die during
>>>>> checkpointing.
>>>> 
>>>> OK that's very interesting.  Has something changed in the code related
>>>> to how checkpoint files are written?
>>>> 
>>>>> I had to backtrack to
>>>>> the Orsted version (unfortunately, that has a bug in the way the grid
>>>>> is set up, causing some of the
>>>>> intermediate levels to span both black holes, wasting a lot of memory).
>>>> 
>>>> That bug should have been fixed in a backport; are you sure you are
>>>> checking out the branch and not the tag?  In any case, it can be worked
>>>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>>>> same bug I am thinking of
>>>> (http://cactuscode.org/pipermail/users/2013-January/003290.html)
>>> 
>>> I was using an old executable so it wouldn't have had the backport
>>> fix.
>>> 
>>>> 
>>>>> Even with Orsted, stalling is a real issue. Currently, my "solution" is
>>>>> to run for 4 hours at a time. This would have been OK on Lonestar or
>>>>> Ranger, because when I chained a bunch of runs, the next in line would
>>>>> start almost right away, but on Stampede the delay is quite substantial.
>>>>> I believe Jim Healy opened a ticket concerning the RIT issues with
>>>>> running ET on Stampede.
>>>> 
>>>> I think this is the ticket:
>>>> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
>>>> there.  The current queue wait time on Stampede is more than a day, so
>>>> splitting into 3-hour chunks is not feasible, as you say.
>>>> 
>>>> I'm starting to think it might be a code problem as well.  So the
>>>> summary is:
>>>> 
>>>> – Checkpointing causes jobs to die with code versions after Oersted
>>>> – All versions lead to eventual hung jobs after a few hours
>>>> 
>>>> Since Stampede is the major "capability" resource in XSEDE, we should
>>>> put some effort into making sure the ET can run properly there.
>>> 
>>> We find issues with runs stalling on our local cluster too. The hardware
>>> setup is similar to Stampede (Intel Nehalem with QDR IB and Open MPI on
>>> top of a proprietary IB library). There's no guarantee that the issues
>>> are the same, but we can try to run some tests locally (note that we
>>> have no issues with runs failing to checkpoint).
>> 
>> I resubmitted, and the new job hangs later on.  gdb says it is in
>> CarpetIOScalar while doing output of a maximum reduction.  I've disabled
>> this and resubmitted.
>> 
>> --
>> Ian Hinder
>> http://numrel.aei.mpg.de/people/hinder
>> 
> 
> 
> -- 
> Dr. Yosef Zlochower
> Center for Computational Relativity and Gravitation
> Associate Professor
> School of Mathematical Sciences
> Rochester Institute of Technology
> 85 Lomb Memorial Drive
> Rochester, NY 14623
> 
> Office:74-2067
> Phone: +1 585-475-6103
> 
> yosef at astro.rit.edu
> 

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder