Hi

I have been having problems running on Stampede for a long time. I
couldn't get the latest stable ET to run because it would die during
checkpointing, so I had to backtrack to the Orsted version
(unfortunately, that has a bug in the way the grid is set up, causing
some of the intermediate levels to span both black holes and wasting a
lot of memory). Even with Orsted, stalling is a real issue. Currently,
my "solution" is to run for 4 hours at a time. This would have been OK
on Lonestar or Ranger, because when I chained a bunch of runs, the next
in line would start almost right away, but on Stampede the delay is
quite substantial. I believe Jim Healy opened a ticket concerning the
RIT issues with running ET on Stampede.
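
(By "chaining" I mean submitting each 4-hour segment with a SLURM
dependency on the previous one, roughly as in the sketch below;
run.slurm stands in for the actual submit script, and each segment is
assumed to recover from the most recent checkpoint when it starts.)

    # rough sketch: chain six 4-hour segments with SLURM job dependencies
    jobid=$(sbatch run.slurm | awk '{print $NF}')        # first segment
    for i in $(seq 2 6); do
        # each segment starts only after the previous one has ended (or died)
        jobid=$(sbatch --dependency=afterany:${jobid} run.slurm | awk '{print $NF}')
    done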

On 05/02/2014 05:55 AM, Ian Hinder wrote:
> Hi all,
>
> Has anyone run into problems recently with Cactus jobs on Stampede?
> I've had jobs die when checkpointing, and also hang mysteriously for
> no apparent reason. These might be separate problems. The
> checkpointing issue occurred when I submitted several jobs and they
> all started checkpointing at the same time after 3 hours. The hang
> happened after a few hours of evolution, with GDB reporting
>
>     MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
>         at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
>     296         for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)
>
> Unfortunately I didn't ask for a backtrace. I'm using mvapich2. I've
> been in touch with support, and they said the dying while
> checkpointing coincided with the filesystems being hit hard by my
> jobs, which makes sense, but they didn't see any problems in their
> logs, and they have no idea about the mysterious hang. I repeated the
> hanging job and it ran fine.
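
If it hangs again, one way to see what is going on without killing the
job is to attach gdb to the stuck rank on its compute node and dump
every thread. A rough sketch, where cactus_sim stands in for whatever
the executable is actually called and <PID> is the process ID that
pgrep reports:

    # find the PID of the hung executable on its compute node
    pgrep -u $USER cactus_sim
    # attach, dump backtraces of all threads, then detach and leave the job running
    gdb -p <PID> -batch -ex 'thread apply all bt' -ex detach > backtrace.txt 2>&1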

> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder