<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On 28 Jul 2012, at 13:38, Jakob Hansen wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">Hi all, <br><br>Thanks for the fast replies ^^<br><br><div class="gmail_quote"><div class="im">2012/7/27 Yosef Zlochower <span dir="ltr"><<a href="mailto:yosef@astro.rit.edu" target="_blank">yosef@astro.rit.edu</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi,<br>
<br>
SphericalHarmonicDecomp should not be writing output at the<br>
same time as a checkpoint. Are you using an NFS mount? </blockquote></div><div><br>No, we're using a Lustre filesystem.<br></div></div></blockquote><div><br></div><div>I noticed similar problems on our Lustre filesystem several months ago. The symptom was that shortly (a few minutes) after successfully writing a correct HDF5 checkpoint file, some other output would fail. In my case, the failing output was usually HDF5. The error messages looked different, though. The only unusual thing about my run was that I was doing quite frequent 3D HDF5 output, so I was stressing the system more than usual.</div><div><br></div><br><blockquote type="cite"><div class="gmail_quote"><div> <br><br></div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I noticed<br>
issues with NFS servers becoming unresponsive (due to a large number<br>
of blocking io operations) during a checkpoint. Perhaps right after<br>
a checkpoint, the server is still too busy.<br>
<br></blockquote></div><div><br>Well, this does indeed happen just after
checkpointing, but not at every checkpoint and not for all output. I
have seen this error twice, once in each of two different simulations.
In each case it affected the metric_obs_0_Decomp.h5 file
right after checkpointing:<br>
<br>Case 1 : Output from ascii_output > gxx.asc :<br>2.7627600000000001e+02 -7.1882256506489145e-05 1.5746413856874709e-05<br>2.7640800000000002e+02 -7.1881310717781865e-05 1.5759232641588166e-05<br> nan 0.0000000000000000e+00 0.0000000000000000e+00<br>
2.7667200000000003e+02 -7.1875929421365471e-05 1.5784711415784759e-05<br>2.7680400000000003e+02 -7.1871563990008602e-05 1.5797542778149928e-05<br><br>In
this case there was a checkpoint at time 276.408 : INFO (CarpetIOHDF5):
Dumping periodic checkpoint at iteration 268032, simulation time
276.408<br>
<br><br>Case 2: Output from ascii_output > gxx.asc :<br>3.2010000000000002e+02 -6.7572912816444132e-05 2.1144268516557760e-05<br>3.2023200000000003e+02 -6.7570803118733184e-05 2.1156692762978387e-05<br> nan 0.0000000000000000e+00 0.0000000000000000e+00<br>
3.2049600000000004e+02 -6.7568973330671772e-05 2.1182079214383173e-05<br>3.2062800000000004e+02 -6.7569118827631724e-05 2.1194791416536239e-05<br><br>With
a checkpoint at time 320.232 : INFO (CarpetIOHDF5): Dumping periodic
checkpoint at iteration 310528, simulation time 320.232<br>
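<div><br>Rows like these can be located automatically in a long ASCII file. Below is a minimal sketch, assuming whitespace-separated columns with the time in the first column and comment lines starting with '#'. The function name find_nan_rows is hypothetical, and the sample rows are copied from Case 1:</div>
<pre>
```python
import math


def find_nan_rows(lines):
    """Return (line_number, text) for every data row whose first column is NaN."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not fields or fields[0].startswith("#"):
            continue
        try:
            time_value = float(fields[0])
        except ValueError:
            continue  # first column is not numeric, so this is not a data row
        if math.isnan(time_value):
            bad.append((number, line.strip()))
    return bad


# Sample rows copied from Case 1.
sample = [
    "2.7627600000000001e+02 -7.1882256506489145e-05 1.5746413856874709e-05",
    "2.7640800000000002e+02 -7.1881310717781865e-05 1.5759232641588166e-05",
    "nan 0.0000000000000000e+00 0.0000000000000000e+00",
    "2.7667200000000003e+02 -7.1875929421365471e-05 1.5784711415784759e-05",
]

print(find_nan_rows(sample))
```
</pre>
<div>Run over the whole gxx.asc file, this gives the line numbers of the bad rows, and the row just before each one gives the time to compare against the checkpoint messages in the log.</div>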
<br><br>Also, in both cases it only affected the metric_obs_0_Decomp.h5 file, the other detection radius files, _1 and _2, had all data.<br> <br><br><snip><br><br><br><br><div class="gmail_quote">2012/7/28 Erik Schnetter <span dir="ltr"><<a href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Jakob<div class="im"><div><br></div><div>I have not heard about such a problem before.</div><div><br></div></div><div class="im">
<div>When
an HDF5 file is not properly closed, its content may be corrupted.
(This will be addressed in the next major release.) There may be two
reasons for this: either the file is not closed (which would be an error
in the code), or there is a write error (e.g. you run out of disk
space). The latter is the major reason for people encountering corrupted
HDF5 files. Since you don't see error messages, this is either not the
case, or these HDF5 output routines suppress these errors.</div>
<div><br></div><div>The thorn SphericalHarmonicDecomp implements its own
HDF5 output routines and does not use Cactus. I see that it uses a
non-standard way to determine whether the file exists, and that it does
not check for errors when writing or closing. I think that HDF5 errors
should cause prominent warnings in stdout and stderr (did you check?),
and if you don't see these, the writing should have succeeded.</div>
<div><br></div></div></blockquote><div><br>The errors I see on stderr are the ones I mentioned in my first mail :<div class="im"><br><br> <br>HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) thread 0:<br>#000: H5F.c line 1509 in H5Fopen(): unable to open file<br>
major: File accessability<br>
minor: Unable to open file<br>#001: H5F.c line 1300 in H5F_open(): unable to read superblock<br>major: File accessability<br>minor: Read failed<br><br></div>.... etc.<br><br>stdout does not produce any errors or warnings related to this.<br>
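<div><br>Since the failing call reports that the superblock cannot be read, one quick check is whether the file on disk still begins with the HDF5 format signature. Below is a minimal sketch using only the Python standard library. The function name looks_like_hdf5 is hypothetical, and it inspects offset 0 only, whereas the HDF5 format also allows the superblock to start at offsets 512, 1024, 2048, and so on:</div>
<pre>
```python
# The 8-byte format signature that begins an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at `path` starts with the HDF5 signature.

    Only offset 0 is checked; files with a user block are not recognised.
    """
    try:
        with open(path, "rb") as handle:
            header = handle.read(len(HDF5_SIGNATURE))
    except OSError:
        # Missing or unreadable file: treat it as "not a valid HDF5 file".
        return False
    return header == HDF5_SIGNATURE
```
</pre>
<div>A False result for metric_obs_0_Decomp.h5 would suggest that the start of the file was never written or has been overwritten. A True result only says the signature is intact, not that the rest of the file is valid.</div>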
<br><br></div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>You mention checkpointing. Are you experiencing
these problems right after recovery, i.e. during the first
SphericalHarmonicDecomp HDF5 output afterwards? </div></blockquote></div><div><br>No, this happened during the first run, not related to recovery.<br> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">
<div>In this case, did you
maybe switch to a new directory where this file doesn't exist?</div>
<div><br></div></div><div class="im"><div>If not, then it may be the non-standard way in which
the code determines whether the file already exists, combined with
something that may be special about your file system.</div><div><br></div><div>(The
"standard" way operates as follows: open the file as if it existed; if
this fails, open it by creating it. The code works differently: it opens
the file as a binary file. If this fails, the HDF5 file is created; if it
succeeds, the file is closed and re-opened as an HDF5 file. Maybe the
quick closing-then-reopening causes problems?)</div>
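<div><br>The control flow of the "standard" way can be sketched with plain files rather than HDF5. This is an assumption made so that the example needs only the Python standard library; with the HDF5 C API the two steps would correspond to H5Fopen and H5Fcreate. The name open_or_create is hypothetical:</div>
<pre>
```python
import os
import tempfile


def open_or_create(path):
    """Open `path` for update if it exists, otherwise create it.

    Returns (handle, created).
    """
    try:
        # Step 1: open the file as if it existed.
        return open(path, "r+b"), False
    except FileNotFoundError:
        # Step 2: only if that fails, create it.
        # Mode "x" raises instead of truncating if the file appeared meanwhile.
        return open(path, "x+b"), True


# Usage: the first call creates the file, the second call finds it.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
handle, created = open_or_create(path)
handle.write(b"first")
handle.close()

handle, created_again = open_or_create(path)
handle.close()

print(created, created_again)  # True False
```
</pre>
<div>The point of this ordering is that the file is opened exactly once per output, with no separate existence test and no close-then-reopen step in between.</div>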
<div><br></div></div><div>-erik</div></blockquote></div><br><br><br><div class="gmail_quote"><div class="im">2012/7/28 Roland Haas <span dir="ltr"><<a href="mailto:roland.haas@physics.gatech.edu" target="_blank">roland.haas@physics.gatech.edu</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello all,<br>
<div><br>
>> I have not heard about such a problem before.<br>
</div>I believe Nick Taylor at Caltech had similar issues. Bela has since<br>
fixed some bugs but had trouble actually committing them (he just saw<br>
your emails). I'll grab his changes and commit them.<br>
<br></blockquote></div><div><br>Sounds interesting, I'm looking forward to applying the changes and seeing if the problem disappears.<br><br>Cheers,<br>Jakob</div></div></div></div>
_______________________________________________<br>Users mailing list<br><a href="mailto:Users@einsteintoolkit.org">Users@einsteintoolkit.org</a><br><a href="http://lists.einsteintoolkit.org/mailman/listinfo/users">http://lists.einsteintoolkit.org/mailman/listinfo/users</a><br></blockquote></div><br><div>
<div>-- </div><div>Ian Hinder</div><div><a href="http://numrel.aei.mpg.de/people/hinder">http://numrel.aei.mpg.de/people/hinder</a></div>
</div>
<br></body></html>