<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">I wonder if this hack may help.<br>
<br>
In SphericalHarmonicDecomp_DumpMetric<br>
add a blocking IO operation before the Decompose3D calls.<br>
Perhaps something like:<br>
<br>
{<br>
/* use the thorn's own output directory if set, else the global one */<br>
const char *outdir = *out_dir ? out_dir : io_out_dir;<br>
char filename[BUFFSIZE];<br>
snprintf(filename, sizeof filename,<br>
"%s/obs_%d_test_io_ready", outdir, obs);<br>
FILE *file = fopen(filename, "a");<br>
assert (file);<br>
fprintf(file, "test_if_ready\n");<br>
fflush(file); <br>
fclose(file);<br>
}<br>
<br>
If that doesn't help, then perhaps you can set
SphericalHarmonicDecomp<br>
to abort the run when this happens. <br>
<br>
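A minimal sketch of the same probe as a standalone function, assuming plain C with no Cactus headers; the name io_ready_probe and the fixed 1024-byte buffer are made up for illustration. It returns a status instead of asserting, so the caller can decide whether to abort:<br>

```c
#include <stdio.h>

/* Hypothetical helper (not part of SphericalHarmonicDecomp): append one line
 * to a small marker file and report whether every step succeeded.
 * Returns 0 on success, -1 on any failure. */
static int io_ready_probe(const char *outdir, int obs)
{
    char filename[1024];
    int n = snprintf(filename, sizeof filename,
                     "%s/obs_%d_test_io_ready", outdir, obs);
    if (n < 0 || (size_t)n >= sizeof filename)
        return -1;                      /* path did not fit in the buffer */

    FILE *file = fopen(filename, "a");
    if (!file)
        return -1;                      /* could not open or create the file */

    int ok = fprintf(file, "test_if_ready\n") > 0;
    ok = (fflush(file) == 0) && ok;     /* push the data out of the stdio buffer */
    ok = (fclose(file) == 0) && ok;     /* fclose can also report write errors */
    return ok ? 0 : -1;
}
```

A caller could then do something like: if (io_ready_probe(outdir, obs) != 0) abort(); (in the thorn itself, presumably a level-0 CCTK_WARN instead).<br>
<br>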
On 07/29/2012 04:31 PM, Yosef Zlochower wrote:<br>
</div>
<blockquote cite="mid:50159DBD.2030004@astro.rit.edu" type="cite">
<div class="moz-cite-prefix">On 07/28/2012 07:38 AM, Jakob Hansen
wrote:<br>
</div>
<blockquote
cite="mid:CAKOdkk+-aKums3m97=qDFGzNfwg2CkrKNjtNE4EfULDwScFf3Q@mail.gmail.com"
type="cite">Hi all, <br>
<br>
Thanks for the fast replies ^^<br>
<br>
<div class="gmail_quote">
<div class="im">2012/7/27 Yosef Zlochower <span dir="ltr">&lt;<a
moz-do-not-send="true" href="mailto:yosef@astro.rit.edu"
target="_blank">yosef@astro.rit.edu</a>&gt;</span><br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"> Hi,<br>
<br>
SphericalHarmonicDecomp should not be writing output at
the<br>
same time as a checkpoint. Are you using an NFS mount? </blockquote>
</div>
<div><br>
No, we're using Lustre filesystem.<br>
<br>
<br>
</div>
<div class="im">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"> I
noticed<br>
issues with NFS servers becoming unresponsive (due to a
large number<br>
of blocking io operations) during a checkpoint. Perhaps
right after<br>
a checkpoint, the server is still too busy.<br>
<br>
</blockquote>
</div>
<div><br>
Well, indeed this happens just after checkpointing, however
not at every checkpoint and not for all output. I
experienced this error twice on two different simulations,
once for each simulation. In each case it affected the
metric_obs_0_Decomp.h5 right after checkpointing :<br>
<br>
Case 1 : Output from ascii_output > gxx.asc :<br>
2.7627600000000001e+02 -7.1882256506489145e-05
1.5746413856874709e-05<br>
2.7640800000000002e+02 -7.1881310717781865e-05
1.5759232641588166e-05<br>
nan 0.0000000000000000e+00
0.0000000000000000e+00<br>
2.7667200000000003e+02 -7.1875929421365471e-05
1.5784711415784759e-05<br>
2.7680400000000003e+02 -7.1871563990008602e-05
1.5797542778149928e-05<br>
<br>
In this case there was a checkpoint at time 276.408 : INFO
(CarpetIOHDF5): Dumping periodic checkpoint at iteration
268032, simulation time 276.408<br>
<br>
</div>
</div>
</blockquote>
The NaN is just the way ascii_output lets you know it couldn't
read the data.<br>
<br>
<blockquote
cite="mid:CAKOdkk+-aKums3m97=qDFGzNfwg2CkrKNjtNE4EfULDwScFf3Q@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<div><br>
Case 2: Output from ascii_output > gxx.asc :<br>
3.2010000000000002e+02 -6.7572912816444132e-05
2.1144268516557760e-05<br>
3.2023200000000003e+02 -6.7570803118733184e-05
2.1156692762978387e-05<br>
nan 0.0000000000000000e+00
0.0000000000000000e+00<br>
3.2049600000000004e+02 -6.7568973330671772e-05
2.1182079214383173e-05<br>
3.2062800000000004e+02 -6.7569118827631724e-05
2.1194791416536239e-05<br>
<br>
With a checkpoint at time 320.232 : INFO (CarpetIOHDF5):
Dumping periodic checkpoint at iteration 310528, simulation
time 320.232<br>
<br>
<br>
Also, in both cases it only affected the
metric_obs_0_Decomp.h5 file, the other detection radius
files, _1 and _2, had all data.<br>
<br>
<br>
</div>
</div>
</blockquote>
Do the Caltech fixes help? If that doesn't work, then it may be<br>
that your IO system is saturated. Perhaps then a crude workaround
would be<br>
to put a delay in after a checkpoint to give the IO system time to
process its<br>
backlog of IO requests.<br>
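One possible shape for such a workaround, as a sketch only: instead of a single fixed delay, retry the open a few times with a pause in between. The helper name fopen_retry and its parameters are hypothetical, and stdio is used here as a stand-in for the HDF5 open call:<br>

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: try to open `path` up to `max_tries` times,
 * sleeping `delay_sec` seconds between failed attempts.
 * Returns the open stream, or NULL if every attempt failed. */
static FILE *fopen_retry(const char *path, const char *mode,
                         int max_tries, unsigned delay_sec)
{
    for (int attempt = 1; attempt <= max_tries; ++attempt) {
        FILE *f = fopen(path, mode);
        if (f)
            return f;
        fprintf(stderr, "open of %s failed (attempt %d of %d)\n",
                path, attempt, max_tries);
        if (attempt < max_tries)
            sleep(delay_sec);           /* give the IO system time to catch up */
    }
    return NULL;
}
```

The number of tries and the delay would have to be tuned to the file system; nothing here is specific to Lustre or NFS.<br>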
<blockquote
cite="mid:CAKOdkk+-aKums3m97=qDFGzNfwg2CkrKNjtNE4EfULDwScFf3Q@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<div>&lt;snip&gt;<br>
<br>
<br>
<br>
<div class="gmail_quote">2012/7/28 Erik Schnetter <span
dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>&gt;</span><br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Jakob
<div class="im">
<div><br>
</div>
<div>I have not heard about such a problem before.</div>
<div><br>
</div>
</div>
<div class="im">
<div>When an HDF5 file is not properly closed, its
content may be corrupted. (This will be addressed in
the next major release.) There may be two reasons
for this: either the file is not closed (which would
be an error in the code), or there is a write error
(e.g. you run out of disk space). The latter is the
major reason for people encountering corrupted HDF5
files. Since you don't see error messages, this is
either not the case, or these HDF5 output routines
suppress these errors.</div>
<div><br>
</div>
<div>The thorn SphericalHarmonicDecomp implements its
own HDF5 output routines and does not use Cactus. I
see that it uses a non-standard way to determine
whether the file exists, and that it does not check
for errors when writing or closing. I think that
HDF5 errors should cause prominent warnings in
stdout and stderr (did you check?), and if you don't
see these, the writing should have succeeded.</div>
<div><br>
</div>
</div>
</blockquote>
<div><br>
The errors I see on stderr are the ones I mentioned in
my first mail :
<div class="im"><br>
<br>
<br>
HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1)
thread 0:<br>
#000: H5F.c line 1509 in H5Fopen(): unable to open
file<br>
major: File accessability<br>
minor: Unable to open file<br>
#001: H5F.c line 1300 in H5F_open(): unable to read
superblock<br>
major: File accessability<br>
minor: Read failed<br>
<br>
</div>
.... etc.<br>
<br>
stdout does not produce any errors or warnings related
to this.<br>
<br>
<br>
</div>
<div class="im">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>You mention checkpointing. Are you experiencing
these problems right after recovery, i.e. during the
first SphericalHarmonicDecomp HDF5 output
afterwards? </div>
</blockquote>
</div>
<div><br>
No, this happened during the first run, not related to
recovery.<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">
<div>In this case, did you maybe switch to a new
directory where this file doesn't exist?</div>
<div><br>
</div>
</div>
<div class="im">
<div>If not, then it may be the non-standard way in
which the code determines whether the file already
exists, combined with something that may be special
about your file system.</div>
<div><br>
</div>
<div>(The "standard" way operates as follows: open the
file as if it existed; if this fails, open it by
creating it. The code works differently: it opens
the file as a binary file. If this fails, the HDF5
file is created; if it succeeds, the file is closed
and re-opened as an HDF5 file. Maybe the quick
closing-then-reopening causes problems?)</div>
<div><br>
</div>
</div>
<div>-erik</div>
</blockquote>
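For reference, the "standard" open-or-create pattern described above could look roughly like this. It is a sketch with stdio standing in for HDF5 (with HDF5 the two calls would presumably be H5Fopen followed by H5Fcreate); open_or_create is a hypothetical name, not something in the thorn:<br>

```c
#include <stdio.h>

/* Stand-in using stdio rather than HDF5: open an existing file for
 * update; only if that fails, create it. *created is set to 1 when the
 * file had to be created and to 0 otherwise. */
static FILE *open_or_create(const char *path, int *created)
{
    FILE *f = fopen(path, "r+b");   /* succeeds only if the file already exists */
    *created = 0;
    if (!f) {
        f = fopen(path, "w+b");     /* fall back to creating the file */
        *created = (f != NULL);
    }
    return f;
}
```

This avoids the open, close, then re-open sequence, since the first successful open is the one that gets used.<br>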
</div>
<br>
<br>
<br>
<div class="gmail_quote">
<div class="im">2012/7/28 Roland Haas <span dir="ltr">&lt;<a
moz-do-not-send="true"
href="mailto:roland.haas@physics.gatech.edu"
target="_blank">roland.haas@physics.gatech.edu</a>&gt;</span><br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hello
all,<br>
<div><br>
>> I have not heard about such a problem
before.<br>
</div>
I believe Nick Taylor at Caltech had similar issues.
Bela has since<br>
fixed some bugs but had trouble actually committing
them (he just saw<br>
your emails). I'll grab his changes and commit them.<br>
<br>
</blockquote>
</div>
<div><br>
Sounds interesting, I'm looking forward to applying the
changes and seeing whether the problem disappears.<br>
<br>
Cheers,<br>
Jakob</div>
</div>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Users mailing list
<a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:Users@einsteintoolkit.org">Users@einsteintoolkit.org</a>
<a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://lists.einsteintoolkit.org/mailman/listinfo/users">http://lists.einsteintoolkit.org/mailman/listinfo/users</a>
</pre>
</blockquote>
<br>
<br>
<pre class="moz-signature" cols="72">--
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Assistant Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623
Office:74-2067
Phone: +1 585-475-6103
<a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:yosef@astro.rit.edu">yosef@astro.rit.edu</a>
CONFIDENTIALITY NOTE: The information transmitted, including
attachments, is intended only for the person(s) or entity to which it
is addressed and may contain confidential and/or privileged material.
Any review, retransmission, dissemination or other use of, or taking
of any action in reliance upon this information by persons or entities
other than the intended recipient is prohibited. If you received this
in error, please contact the sender and destroy any copies of this
information.
</pre>
</blockquote>
</body>
</html>