<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">On 07/28/2012 07:38 AM, Jakob Hansen
      wrote:<br>
    </div>
    <blockquote
cite="mid:CAKOdkk+-aKums3m97=qDFGzNfwg2CkrKNjtNE4EfULDwScFf3Q@mail.gmail.com"
      type="cite">Hi all, <br>
      <br>
      Thanks for the fast replies ^^<br>
      <br>
      <div class="gmail_quote">
        <div class="im">2012/7/27 Yosef Zlochower <span dir="ltr">&lt;<a
              moz-do-not-send="true" href="mailto:yosef@astro.rit.edu"
              target="_blank">yosef@astro.rit.edu</a>&gt;</span><br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            Hi,<br>
            <br>
            &nbsp;SphericalHarmonicDecomp should not be writing output at the<br>
            same time as a checkpoint. Are you using an NFS mount? </blockquote>
        </div>
        <div><br>
          No, we're using a Lustre filesystem.<br>
          &nbsp;<br>
          <br>
        </div>
        <div class="im">
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            I noticed<br>
            issues with NFS servers becoming unresponsive (due to a
            large number<br>
            of blocking io operations) during a checkpoint. Perhaps
            right after<br>
            a checkpoint, the server is still too busy.<br>
            <br>
          </blockquote>
        </div>
        <div><br>
          Indeed, this happens just after checkpointing, but not at
          every checkpoint and not for all output. I have seen this
          error twice, once in each of two different simulations. In
          each case it affected metric_obs_0_Decomp.h5 right after
          checkpointing:<br>
          <br>
          Case 1 : Output from ascii_output &gt; gxx.asc :<br>
          2.7627600000000001e+02 -7.1882256506489145e-05
          1.5746413856874709e-05<br>
          2.7640800000000002e+02 -7.1881310717781865e-05
          1.5759232641588166e-05<br>
          &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nan 0.0000000000000000e+00
          0.0000000000000000e+00<br>
          2.7667200000000003e+02 -7.1875929421365471e-05
          1.5784711415784759e-05<br>
          2.7680400000000003e+02 -7.1871563990008602e-05
          1.5797542778149928e-05<br>
          <br>
          In this case there was a checkpoint at time 276.408 : INFO
          (CarpetIOHDF5): Dumping periodic checkpoint at iteration
          268032, simulation time 276.408<br>
          <br>
        </div>
      </div>
    </blockquote>
    The NaN is just the way ascii_output lets you know it couldn't read
    the data.<br>
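    <br>
    To locate such rows quickly, here is a minimal Python sketch. It is not part<br>
    of ascii_output, and the name find_unreadable_rows is made up for this<br>
    example. It assumes whitespace-separated columns with the time in the first<br>
    column, as in the gxx.asc excerpts quoted in this thread, and reports each<br>
    NaN row together with the last valid time before it:<br>
    <pre>
```python
import math


def find_unreadable_rows(lines):
    """Return (row_index, last_valid_time) for every row whose time column is NaN."""
    hits = []
    last_valid_time = None
    for index, line in enumerate(lines):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        time_value = float(fields[0])  # float("nan") parses without error
        if math.isnan(time_value):
            hits.append((index, last_valid_time))
        else:
            last_valid_time = time_value
    return hits


# Rows copied from Case 1 in this thread.
sample = """\
2.7627600000000001e+02 -7.1882256506489145e-05 1.5746413856874709e-05
2.7640800000000002e+02 -7.1881310717781865e-05 1.5759232641588166e-05
nan 0.0000000000000000e+00 0.0000000000000000e+00
2.7667200000000003e+02 -7.1875929421365471e-05 1.5784711415784759e-05
""".splitlines()

for row, previous_time in find_unreadable_rows(sample):
    print("row", row, "is unreadable; last valid time was", previous_time)
```
</pre>
    The reported times can then be compared with the "Dumping periodic<br>
    checkpoint" lines in the log.<br>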
    <br>
    <blockquote
cite="mid:CAKOdkk+-aKums3m97=qDFGzNfwg2CkrKNjtNE4EfULDwScFf3Q@mail.gmail.com"
      type="cite">
      <div class="gmail_quote">
        <div><br>
          Case 2: Output from ascii_output &gt; gxx.asc :<br>
          3.2010000000000002e+02 -6.7572912816444132e-05
          2.1144268516557760e-05<br>
          3.2023200000000003e+02 -6.7570803118733184e-05
          2.1156692762978387e-05<br>
          &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nan 0.0000000000000000e+00
          0.0000000000000000e+00<br>
          3.2049600000000004e+02 -6.7568973330671772e-05
          2.1182079214383173e-05<br>
          3.2062800000000004e+02 -6.7569118827631724e-05
          2.1194791416536239e-05<br>
          <br>
          With a checkpoint at time 320.232 :&nbsp; INFO (CarpetIOHDF5):
          Dumping periodic checkpoint at iteration 310528, simulation
          time 320.232<br>
          <br>
          <br>
          Also, in both cases only the metric_obs_0_Decomp.h5 file was
          affected; the files for the other detection radii, _1 and
          _2, had all their data.<br>
          &nbsp;<br>
          <br>
        </div>
      </div>
    </blockquote>
    Do the Caltech fixes help? If they don't, your IO system may be<br>
    saturated. In that case, a crude workaround would be to add a delay<br>
    after each checkpoint, giving the IO system time to work through<br>
    its backlog of IO requests.<br>
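    <br>
    As a rough illustration of the delay idea, here is a Python sketch. It is not<br>
    a patch for SphericalHarmonicDecomp, and it swaps the fixed delay for a<br>
    retry with a growing wait, which is a variation of my own. The names<br>
    retry_io and flaky_open are hypothetical, and flaky_open only simulates a<br>
    busy file system:<br>
    <pre>
```python
import time


def retry_io(operation, attempts=5, first_delay=1.0, sleep=time.sleep):
    """Call operation(); on OSError wait and retry, doubling the wait each time."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay)
            delay *= 2


# Simulated I/O call (an assumption for the demo): fails twice, then succeeds.
calls = {"count": 0}


def flaky_open():
    calls["count"] += 1
    if calls["count"] != 3:
        raise OSError("file system busy")
    return "opened"


waits = []  # record the requested waits instead of really sleeping
result = retry_io(flaky_open, sleep=waits.append)
print(result, waits)  # prints: opened [1.0, 2.0]
```
</pre>
    Passing the sleep function as a parameter keeps the sketch testable<br>
    without real waiting; in production use the default time.sleep applies.<br>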
    <blockquote
cite="mid:CAKOdkk+-aKums3m97=qDFGzNfwg2CkrKNjtNE4EfULDwScFf3Q@mail.gmail.com"
      type="cite">
      <div class="gmail_quote">
        <div>&lt;snip&gt;<br>
          <br>
          <br>
          <br>
          <div class="gmail_quote">2012/7/28 Erik Schnetter <span
              dir="ltr">&lt;<a moz-do-not-send="true"
                href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>&gt;</span><br>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">Jakob
              <div class="im">
                <div><br>
                </div>
                <div>I have not heard about such a problem before.</div>
                <div><br>
                </div>
              </div>
              <div class="im">
                <div>When an HDF5 file is not properly closed, its
                  content may be corrupted. (This will be addressed in
                  the next major release.) There may be two reasons for
                  this: either the file is not closed (which would be an
                  error in the code), or there is a write error (e.g.
                  you run out of disk space). The latter is the major
                  reason for people encountering corrupted HDF5 files.
                  Since you don't see error messages, this is either not
                  the case, or these HDF5 output routines suppress these
                  errors.</div>
                <div><br>
                </div>
                <div>The thorn&nbsp;SphericalHarmonicDecomp implements its
                  own HDF5 output routines and does not use Cactus. I
                  see that it uses a non-standard way to determine
                  whether the file exists, and that it does not check
                  for errors when writing or closing. I think that HDF5
                  errors should cause prominent warnings in stdout and
                  stderr (did you check?), and if you don't see these,
                  the writing should have succeeded.</div>
                <div><br>
                </div>
              </div>
            </blockquote>
            <div><br>
              The errors I see on stderr are the ones I mentioned in my
              first mail:
              <div class="im"><br>
                <br>
                &nbsp;<br>
                HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) thread
                0:<br>
                #000: H5F.c line 1509 in H5Fopen(): unable to open file<br>
                major: File accessability<br>
                minor: Unable to open file<br>
                #001: H5F.c line 1300 in H5F_open(): unable to read
                superblock<br>
                major: File accessability<br>
                minor: Read failed<br>
                <br>
              </div>
              .... etc.<br>
              <br>
              stdout does not produce any errors or warnings related to
              this.<br>
              <br>
              <br>
            </div>
            <div class="im">
              <blockquote class="gmail_quote" style="margin:0 0 0
                .8ex;border-left:1px #ccc solid;padding-left:1ex">
                <div>You mention checkpointing. Are you experiencing
                  these problems right after recovery, i.e. during the
                  first SphericalHarmonicDecomp HDF5 output afterwards?
                </div>
              </blockquote>
            </div>
            <div><br>
              No, this happened during the first run, not related to
              recovery.<br>
              &nbsp;</div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div class="im">
                <div>In this case, did you maybe switch to a new
                  directory where this file doesn't exist?</div>
                <div><br>
                </div>
              </div>
              <div class="im">
                <div>If not, then it may be the non-standard way in
                  which the code determines whether the file already
                  exists, combined with something that may be special
                  about your file system.</div>
                <div><br>
                </div>
                <div>(The "standard" way operates as follows: open the
                  file as if it existed; if this fails, open it by
                  creating it. The code works differently: it opens the
                  file as a binary file. If this fails, the HDF5 file is
                  created; if it succeeds, the file is closed and
                  reopened as an HDF5 file. Maybe the quick
                  closing-then-reopening causes problems?)</div>
                <div><br>
                </div>
              </div>
              <div>-erik</div>
            </blockquote>
          </div>
          <br>
          <br>
          <br>
          <div class="gmail_quote">
            <div class="im">2012/7/28 Roland Haas <span dir="ltr">&lt;<a
                  moz-do-not-send="true"
                  href="mailto:roland.haas@physics.gatech.edu"
                  target="_blank">roland.haas@physics.gatech.edu</a>&gt;</span><br>
              <blockquote class="gmail_quote" style="margin:0 0 0
                .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello
                all,<br>
                <div><br>
                  &gt;&gt; I have not heard about such a problem before.<br>
                </div>
                I believe Nick Taylor at Caltech had similar issues.
                Bela has since<br>
                fixed some bugs but had trouble actually committing them
                (he just saw<br>
                your emails). I'll grab his changes and commit them.<br>
                <br>
              </blockquote>
            </div>
            <div><br>
              Sounds interesting, I'm looking forward to applying the
              changes and seeing if the problem disappears.<br>
              <br>
              Cheers,<br>
              Jakob</div>
          </div>
        </div>
      </div>
      <br>
    </blockquote>
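    <br>
    For reference, the "standard" open-or-create sequence Erik describes in the<br>
    quoted text can be sketched in Python as follows. This uses plain binary<br>
    files from the standard library rather than the HDF5 API, and the helper<br>
    names open_or_create, open_rw and create_rw are made up for the example.<br>
    As a deliberate narrowing, it only falls back to creating the file when<br>
    the file is actually missing:<br>
    <pre>
```python
import os
import tempfile


def open_or_create(path, open_existing, create_new):
    """Open path as an existing file; create it only if it is missing."""
    try:
        return open_existing(path), "opened"
    except FileNotFoundError:
        return create_new(path), "created"


def open_rw(path):
    # "r+b" raises FileNotFoundError when the file does not exist.
    return open(path, "r+b")


def create_rw(path):
    # "w+b" creates the file (and would truncate an existing one).
    return open(path, "w+b")


with tempfile.TemporaryDirectory() as tmp:
    # A plain placeholder file; the name only mirrors the one in this thread.
    target = os.path.join(tmp, "metric_obs_0_Decomp.h5")
    handle, first = open_or_create(target, open_rw, create_rw)
    handle.close()
    handle, second = open_or_create(target, open_rw, create_rw)
    handle.close()

print(first, second)  # prints: created opened
```
</pre>
    Falling back to creation on any open failure, rather than only on a<br>
    missing file, would truncate an existing file whenever the open fails for<br>
    a transient reason, so the narrower exception seems the safer choice here.<br>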
    <br>
    <br>
    <pre class="moz-signature" cols="72">-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Assistant Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office: 74-2067
Phone: +1 585-475-6103

<a class="moz-txt-link-abbreviated" href="mailto:yosef@astro.rit.edu">yosef@astro.rit.edu</a>

</pre>
  </body>
</html>