<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">I wonder if this hack may help.<br>
      <br>
      In SphericalHarmonicDecomp_DumpMetric<br>
      add a blocking IO operation before the Decompose3D calls.<br>
      Perhaps something like:<br>
      <br>
      {<br>
      &nbsp;&nbsp; /* Use out_dir if it is non-empty, otherwise fall back to io_out_dir. */<br>
      &nbsp;&nbsp; const char *outdir = *out_dir ? out_dir : io_out_dir;<br>
      &nbsp;&nbsp; char filename[BUFFSIZE];<br>
      &nbsp;&nbsp; snprintf(filename, sizeof filename,<br>
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "%s/obs_%d_test_io_ready", outdir, obs);<br>
      &nbsp;&nbsp; /* Blocking append to a small marker file. */<br>
      &nbsp;&nbsp; FILE *file = fopen(filename, "a");<br>
      &nbsp;&nbsp; assert(file);<br>
      &nbsp;&nbsp; fprintf(file, "test_if_ready\n");<br>
      &nbsp;&nbsp; fflush(file);<br>
      &nbsp;&nbsp; fclose(file);<br>
      }<br>
      <br>
      If that doesn't help, then perhaps you can set
      SphericalHarmonicDecomp<br>
      to abort the run when this happens.<br>
      <br>
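      A minimal sketch of the same probe with a few retries, in case a
      single attempt is too fragile. The helper name probe_io_ready and
      its max_tries and delay_s parameters are made up for illustration
      and are not part of SphericalHarmonicDecomp; it uses only stdio
      and sleep() from unistd.h:<br>
      <pre>
```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of the thorn): append a marker line to
   `path`, retrying up to `max_tries` times and sleeping `delay_s` seconds
   between attempts. Returns 0 on success, -1 if every attempt failed. */
int probe_io_ready(const char *path, int max_tries, unsigned delay_s)
{
  assert(path != NULL && max_tries > 0);
  for (int attempt = 0; attempt < max_tries; ++attempt) {
    FILE *file = fopen(path, "a");
    if (file != NULL) {
      /* Check every step; a failed write, flush, or close counts as failure. */
      int ok = fprintf(file, "test_if_ready\n") > 0;
      ok = (fflush(file) == 0) && ok;
      ok = (fclose(file) == 0) && ok;
      if (ok)
        return 0;
    }
    /* Do not sleep after the final attempt. */
    if (attempt + 1 < max_tries)
      sleep(delay_s);
  }
  return -1;
}
```
</pre>
      A caller could treat a return value of -1 as the signal to abort
      the run instead of continuing with a file it cannot write.<br>
      <br>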
      On 07/29/2012 04:31 PM, Yosef Zlochower wrote:<br>
    </div>
    <blockquote cite="mid:50159DBD.2030004@astro.rit.edu" type="cite">
      <div class="moz-cite-prefix">On 07/28/2012 07:38 AM, Jakob Hansen
        wrote:<br>
      </div>
      <blockquote
cite="mid:CAKOdkk+-aKums3m97=qDFGzNfwg2CkrKNjtNE4EfULDwScFf3Q@mail.gmail.com"
        type="cite">Hi all, <br>
        <br>
        Thanks for the fast replies ^^<br>
        <br>
        <div class="gmail_quote">
          <div class="im">2012/7/27 Yosef Zlochower <span dir="ltr">&lt;<a
                moz-do-not-send="true" href="mailto:yosef@astro.rit.edu"
                target="_blank">yosef@astro.rit.edu</a>&gt;</span><br>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex"> Hi,<br>
              <br>
              &nbsp;SphericalHarmonicDecomp should not be writing output at
              the<br>
              same time as a checkpoint. Are you using an NFS mount? </blockquote>
          </div>
          <div><br>
            No, we're using a Lustre filesystem.<br>
            &nbsp;<br>
            <br>
          </div>
          <div class="im">
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex"> I
              noticed<br>
              issues with NFS servers becoming unresponsive (due to a
              large number<br>
              of blocking io operations) during a checkpoint. Perhaps
              right after<br>
              a checkpoint, the server is still too busy.<br>
              <br>
            </blockquote>
          </div>
          <div><br>
            Indeed, this happens just after checkpointing, but not at
            every checkpoint and not for all output. I experienced this
            error in two different simulations, once in each. In both
            cases it affected metric_obs_0_Decomp.h5 right after
            checkpointing:<br>
            <br>
            Case 1 : Output from ascii_output &gt; gxx.asc :<br>
            2.7627600000000001e+02 -7.1882256506489145e-05
            1.5746413856874709e-05<br>
            2.7640800000000002e+02 -7.1881310717781865e-05
            1.5759232641588166e-05<br>
            &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nan 0.0000000000000000e+00
            0.0000000000000000e+00<br>
            2.7667200000000003e+02 -7.1875929421365471e-05
            1.5784711415784759e-05<br>
            2.7680400000000003e+02 -7.1871563990008602e-05
            1.5797542778149928e-05<br>
            <br>
            In this case there was a checkpoint at time 276.408 : INFO
            (CarpetIOHDF5): Dumping periodic checkpoint at iteration
            268032, simulation time 276.408<br>
            <br>
          </div>
        </div>
      </blockquote>
      The NaN is just the way ascii_output lets you know it couldn't
      read the data.<br>
      <br>
      <blockquote
cite="mid:CAKOdkk+-aKums3m97=qDFGzNfwg2CkrKNjtNE4EfULDwScFf3Q@mail.gmail.com"
        type="cite">
        <div class="gmail_quote">
          <div><br>
            Case 2: Output from ascii_output &gt; gxx.asc :<br>
            3.2010000000000002e+02 -6.7572912816444132e-05
            2.1144268516557760e-05<br>
            3.2023200000000003e+02 -6.7570803118733184e-05
            2.1156692762978387e-05<br>
            &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nan 0.0000000000000000e+00
            0.0000000000000000e+00<br>
            3.2049600000000004e+02 -6.7568973330671772e-05
            2.1182079214383173e-05<br>
            3.2062800000000004e+02 -6.7569118827631724e-05
            2.1194791416536239e-05<br>
            <br>
            With a checkpoint at time 320.232 :&nbsp; INFO (CarpetIOHDF5):
            Dumping periodic checkpoint at iteration 310528, simulation
            time 320.232<br>
            <br>
            <br>
            Also, in both cases it only affected the
            metric_obs_0_Decomp.h5 file, the other detection radius
            files, _1 and _2, had all data.<br>
            &nbsp;<br>
            <br>
          </div>
        </div>
      </blockquote>
      Do the Caltech fixes help? If they don't, it may be<br>
      that your IO system is saturated. A crude workaround would then
      be<br>
      to add a delay after each checkpoint to give the IO system time to
      process its<br>
      backlog of IO requests.<br>
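      A rough sketch of such a delay, assuming the code records the
      wall-clock time at which the last checkpoint finished. The names
      remaining_grace and wait_after_checkpoint and the grace_s
      parameter are invented for illustration; nothing like this exists
      in the thorn as far as I know:<br>
      <pre>
```c
#include <assert.h>
#include <time.h>
#include <unistd.h>

/* Seconds still to wait so that at least `grace_s` seconds separate the
   end of the last checkpoint (`checkpoint_done`) from `now`. */
unsigned remaining_grace(time_t checkpoint_done, time_t now, unsigned grace_s)
{
  double elapsed = difftime(now, checkpoint_done);
  if (elapsed < 0.0)
    elapsed = 0.0;            /* clock went backwards: wait the full period */
  if (elapsed >= (double)grace_s)
    return 0;
  return grace_s - (unsigned)elapsed;
}

/* Block until the grace period after the checkpoint has passed. */
void wait_after_checkpoint(time_t checkpoint_done, unsigned grace_s)
{
  unsigned left = remaining_grace(checkpoint_done, time(NULL), grace_s);
  while (left > 0)
    left = sleep(left);       /* sleep() returns the unslept seconds */
}
```
</pre>
      The output routine would call wait_after_checkpoint just before
      opening its HDF5 file. How long the grace period needs to be is a
      guess and would have to be tuned per machine.<br>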
      <blockquote
cite="mid:CAKOdkk+-aKums3m97=qDFGzNfwg2CkrKNjtNE4EfULDwScFf3Q@mail.gmail.com"
        type="cite">
        <div class="gmail_quote">
          <div>&lt;snip&gt;<br>
            <br>
            <br>
            <br>
            <div class="gmail_quote">2012/7/28 Erik Schnetter <span
                dir="ltr">&lt;<a moz-do-not-send="true"
                  href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>&gt;</span><br>
              <blockquote class="gmail_quote" style="margin:0 0 0
                .8ex;border-left:1px #ccc solid;padding-left:1ex">Jakob
                <div class="im">
                  <div><br>
                  </div>
                  <div>I have not heard about such a problem before.</div>
                  <div><br>
                  </div>
                </div>
                <div class="im">
                  <div>When an HDF5 file is not properly closed, its
                    content may be corrupted. (This will be addressed in
                    the next major release.) There may be two reasons
                    for this: either the file is not closed (which would
                    be an error in the code), or there is a write error
                    (e.g. you run out of disk space). The latter is the
                    major reason for people encountering corrupted HDF5
                    files. Since you don't see error messages, this is
                    either not the case, or these HDF5 output routines
                    suppress these errors.</div>
                  <div><br>
                  </div>
                  <div>The thorn&nbsp;SphericalHarmonicDecomp implements its
                    own HDF5 output routines and does not use Cactus. I
                    see that it uses a non-standard way to determine
                    whether the file exists, and that it does not check
                    for errors when writing or closing. I think that
                    HDF5 errors should cause prominent warnings in
                    stdout and stderr (did you check?), and if you don't
                    see these, the writing should have succeeded.</div>
                  <div><br>
                  </div>
                </div>
              </blockquote>
              <div><br>
                The errors I see on stderr are the ones I mentioned in
                my first mail :
                <div class="im"><br>
                  <br>
                  &nbsp;<br>
                  HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1)
                  thread 0:<br>
                  #000: H5F.c line 1509 in H5Fopen(): unable to open
                  file<br>
                  major: File accessability<br>
                  minor: Unable to open file<br>
                  #001: H5F.c line 1300 in H5F_open(): unable to read
                  superblock<br>
                  major: File accessability<br>
                  minor: Read failed<br>
                  <br>
                </div>
                .... etc.<br>
                <br>
                stdout does not produce any errors or warnings related
                to this.<br>
                <br>
                <br>
              </div>
              <div class="im">
                <blockquote class="gmail_quote" style="margin:0 0 0
                  .8ex;border-left:1px #ccc solid;padding-left:1ex">
                  <div>You mention checkpointing. Are you experiencing
                    these problems right after recovery, i.e. during the
                    first SphericalHarmonicDecomp HDF5 output
                    afterwards? </div>
                </blockquote>
              </div>
              <div><br>
                No, this happened during the first run, not related to
                recovery.<br>
                &nbsp;</div>
              <blockquote class="gmail_quote" style="margin:0 0 0
                .8ex;border-left:1px #ccc solid;padding-left:1ex">
                <div class="im">
                  <div>In this case, did you maybe switch to a new
                    directory where this file doesn't exist?</div>
                  <div><br>
                  </div>
                </div>
                <div class="im">
                  <div>If not, then it may be the non-standard way in
                    which the code determines whether the file already
                    exists, combined with something that may be special
                    about your file system.</div>
                  <div><br>
                  </div>
                  <div>(The "standard" way operates as follows: open the
                    file as if it existed; if this fails, open it by
                    creating it. The code works differently: it opens
                    the file as a binary file. If this fails, the HDF5
                    file is created; if it succeeds, the file is closed
                    and re-opened as an HDF5 file. Maybe the quick
                    closing-then-reopening causes problems?)</div>
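                  <div>A minimal sketch of that open-then-create order,
                    written with plain stdio so that it stands alone.
                    The function name open_or_create and the created
                    flag are made up for illustration. With HDF5 the two
                    steps would presumably be H5Fopen followed by
                    H5Fcreate, which is not shown here.</div>
                  <pre>
```c
#include <assert.h>
#include <stdio.h>

/* Open `path` for update if it already exists; otherwise create it.
   `*created` is set to 1 when the file had to be created, else 0.
   Returns NULL if both attempts fail. */
FILE *open_or_create(const char *path, int *created)
{
  assert(path != NULL && created != NULL);
  *created = 0;
  /* First attempt: open as if the file existed ("r+b" never creates). */
  FILE *file = fopen(path, "r+b");
  if (file == NULL) {
    /* Fallback: create the file. */
    file = fopen(path, "w+b");
    *created = (file != NULL);
  }
  return file;
}
```
</pre>
                  <div>There is only one handle per call, so no
                    close-then-reopen sequence is involved.</div>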
                  <div><br>
                  </div>
                </div>
                <div>-erik</div>
              </blockquote>
            </div>
            <br>
            <br>
            <br>
            <div class="gmail_quote">
              <div class="im">2012/7/28 Roland Haas <span dir="ltr">&lt;<a
                    moz-do-not-send="true"
                    href="mailto:roland.haas@physics.gatech.edu"
                    target="_blank">roland.haas@physics.gatech.edu</a>&gt;</span><br>
                <blockquote class="gmail_quote" style="margin:0 0 0
                  .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello

                  all,<br>
                  <div><br>
                    &gt;&gt; I have not heard about such a problem
                    before.<br>
                  </div>
                  I believe Nick Taylor at Caltech had similar issues.
                  Bela has since<br>
                  fixed some bugs but had trouble actually committing
                  them (he just saw<br>
                  your emails). I'll grab his changes and commit them.<br>
                  <br>
                </blockquote>
              </div>
              <div><br>
                Sounds interesting, I'm looking forward to applying the
                changes and seeing whether the problem disappears.<br>
                <br>
                Cheers,<br>
                Jakob</div>
            </div>
          </div>
        </div>
        <br>
        <fieldset class="mimeAttachmentHeader"></fieldset>
        <br>
        <pre wrap="">_______________________________________________
Users mailing list
<a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:Users@einsteintoolkit.org">Users@einsteintoolkit.org</a>
<a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://lists.einsteintoolkit.org/mailman/listinfo/users">http://lists.einsteintoolkit.org/mailman/listinfo/users</a>
</pre>
      </blockquote>
      <br>
      <br>
      <pre class="moz-signature" cols="72">-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Assistant Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office:74-2067
Phone: +1 585-475-6103

<a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:yosef@astro.rit.edu">yosef@astro.rit.edu</a>

CONFIDENTIALITY NOTE: The information transmitted, including
attachments, is intended only for the person(s) or entity to which it
is addressed and may contain confidential and/or privileged material.
Any review, retransmission, dissemination or other use of, or taking
of any action in reliance upon this information by persons or entities
other than the intended recipient is prohibited. If you received this
in error, please contact the sender and destroy any copies of this
information.
</pre>
    </blockquote>
  </body>
</html>