[Users] Stampede

Yosef Zlochower yosef at astro.rit.edu
Fri May 2 09:57:09 CDT 2014


On 05/02/2014 10:07 AM, Ian Hinder wrote:
>
> On 02 May 2014, at 14:08, Yosef Zlochower <yosef at astro.rit.edu> wrote:
>
>> Hi
>>
>> I have been having problems running on Stampede for a long time. I
>> couldn't get the latest
>> stable ET to run because during checkpointing, it would die.
>
> OK that's very interesting.  Has something changed in the code related
> to how checkpoint files are written?
>
>> I had to backtrack to
>> the Orsted version (unfortunately, that has a bug in the way the grid
>> is set up, causing some of the
>> intermediate levels to span both black holes, wasting a lot of memory).
>
> That bug should have been fixed in a backport; are you sure you are
> checking out the branch and not the tag?  In any case, it can be worked
> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
> same bug I am thinking of
> (http://cactuscode.org/pipermail/users/2013-January/003290.html)

I was using an old executable so it wouldn't have had the backport
fix.

>
>> Even with
>> Orsted, stalling is a real issue. Currently, my "solution" is to run
>> for 4 hours at a time.
>> This would have been OK on Lonestar or Ranger,
>> because when I chained a bunch of runs, the next in line would start
>> almost right away, but on Stampede the delay is quite substantial. I
>> believe Jim Healy opened
>> a ticket concerning the RIT issues with running ET on Stampede.
>
> I think this is the ticket:
> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
> there.  The current queue wait time on stampede is more than a day, so
> splitting into 3 hour chunks is not feasible, as you say.
>
> I'm starting to think it might be a code problem as well.  So the
> summary is:
>
> – Checkpointing causes jobs to die with code versions after Oersted
> – All versions lead to eventual hung jobs after a few hours
>
> Since Stampede is the major "capability" resource in Xsede, we should
> put some effort into making sure the ET can run properly there.

We see runs stalling on our local cluster too. The hardware
setup is similar to Stampede (Intel Nehalem with QDR IB and Open MPI on
top of a proprietary IB library). There's no guarantee that the issues
are the same, but we can try to run some tests locally (note that we
have no issues with runs failing to checkpoint).

> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder
>


-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Associate Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office:74-2067
Phone: +1 585-475-6103

yosef at astro.rit.edu

