[Users] Stampede

Yosef Zlochower yosef at astro.rit.edu
Mon Jul 14 15:14:14 CDT 2014

I tried a run on Stampede today and it died during a checkpoint with:

    send desc error
    send desc error
    [0] Abort: Got completion with error 12, vendor code=81, dest rank=
      at line 892 in file ../../ofa_poll.c

Have you been having success running production runs on Stampede?
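For context, the crash happens while the run is writing a routine checkpoint. A minimal checkpoint setup in a Cactus parameter file looks roughly like this (the directory name and interval here are illustrative, not taken from the failing run):

```
# Checkpointing via CarpetIOHDF5 -- values are illustrative only
IO::checkpoint_dir          = "checkpoints"
IO::checkpoint_ID           = "yes"
IO::checkpoint_every        = 1024
IO::checkpoint_on_terminate = "yes"
IOHDF5::checkpoint          = "yes"
```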

On 05/02/2014 12:15 PM, Ian Hinder wrote:
> On 02 May 2014, at 16:57, Yosef Zlochower <yosef at astro.rit.edu
> <mailto:yosef at astro.rit.edu>> wrote:
>> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>>> On 02 May 2014, at 14:08, Yosef Zlochower <yosef at astro.rit.edu
>>> <mailto:yosef at astro.rit.edu>
>>> <mailto:yosef at astro.rit.edu>> wrote:
>>>> Hi
>>>> I have been having problems running on Stampede for a long time. I
>>>> couldn't get the latest stable ET release to run: it would die
>>>> during checkpointing.
>>> OK that's very interesting.  Has something changed in the code related
>>> to how checkpoint files are written?
>>>> I had to backtrack to
>>>> the Oersted version (unfortunately, that has a bug in the way the grid
>>>> is set up, causing some of the
>>>> intermediate levels to span both black holes, wasting a lot of memory).
>>> That bug should have been fixed in a backport; are you sure you are
>>> checking out the branch and not the tag?  In any case, it can be worked
>>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>>> same bug I am thinking of
>>> (http://cactuscode.org/pipermail/users/2013-January/003290.html)
>> I was using an old executable so it wouldn't have had the backport
>> fix.
>>>> Even with
>>>> Oersted, stalling is a real issue. Currently, my "solution" is to run
>>>> for 4 hours at a time.
>>>> This would have been OK on Lonestar or Ranger,
>>>> because when I chained a bunch of runs, the next in line would start
>>>> almost right away, but on Stampede the delay is quite substantial. I
>>>> believe Jim Healy opened
>>>> a ticket concerning the RIT issues with running ET on stampede.
>>> I think this is the ticket:
>>> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
>>> there.  The current queue wait time on Stampede is more than a day, so
>>> splitting into 3-hour chunks is not feasible, as you say.
>>> I'm starting to think it might be a code problem as well.  So the
>>> summary is:
>>> – Checkpointing causes jobs to die with code versions after Oersted
>>> – All versions lead to eventual hung jobs after a few hours
>>> Since Stampede is the major "capability" resource in XSEDE, we should
>>> put some effort into making sure the ET can run properly there.
>> We find issues with runs stalling on our local cluster too. The hardware
>> setup is similar to Stampede (Intel Nehalem with QDR IB and Open MPI on
>> top of a proprietary IB library). There's no guarantee that the issues
>> are the same, but we can try to run some tests locally (note that we
>> have no issues with runs failing to checkpoint).
> I resubmitted, and the new job hangs later on.  gdb says it is in
> CarpetIOScalar while doing output of a maximum reduction.  I've disabled
> this and resubmitted.
> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder
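
Incidentally, the chained-restart pattern discussed above can be automated
with SLURM job dependencies on Stampede. A rough sketch follows; run.sbatch
is a hypothetical submit script, and sbatch is stubbed out with an echo so
the chain can be previewed without a scheduler:

```shell
#!/bin/sh
# Chain three 4-hour segments: each job starts only after the previous one
# finishes (afterany fires on success or failure, so a crashed segment still
# hands off to the next restart).  DRYRUN previews the commands; drop it and
# substitute the job IDs that sbatch --parsable prints to submit for real.
DRYRUN="echo"
$DRYRUN sbatch --parsable run.sbatch
$DRYRUN sbatch --parsable --dependency=afterany:JOBID1 run.sbatch
$DRYRUN sbatch --parsable --dependency=afterany:JOBID2 run.sbatch
```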

Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Associate Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Phone: +1 585-475-6103

yosef at astro.rit.edu

