[Users] Error to write PittNull during checkpointing

Jakob Hansen jakobidetsortehul at gmail.com
Sat Jul 28 06:38:41 CDT 2012


Hi all,

Thanks for the fast replies ^^

2012/7/27 Yosef Zlochower <yosef at astro.rit.edu>

> Hi,
>
>  SphericalHarmonicDecomp should not be writing output at the
> same time as a checkpoint. Are you using an NFS mount?


No, we're using the Lustre filesystem.


> I noticed
> issues with NFS servers becoming unresponsive (due to a large number
> of blocking io operations) during a checkpoint. Perhaps right after
> a checkpoint, the server is still too busy.
>
>
Well, indeed this happens just after checkpointing, though not at every
checkpoint and not for all output. I have seen this error twice, once in
each of two different simulations. In both cases it affected the
metric_obs_0_Decomp.h5 file right after checkpointing:

Case 1: Output from ascii_output > gxx.asc:
2.7627600000000001e+02 -7.1882256506489145e-05 1.5746413856874709e-05
2.7640800000000002e+02 -7.1881310717781865e-05 1.5759232641588166e-05
                 nan 0.0000000000000000e+00 0.0000000000000000e+00
2.7667200000000003e+02 -7.1875929421365471e-05 1.5784711415784759e-05
2.7680400000000003e+02 -7.1871563990008602e-05 1.5797542778149928e-05

In this case there was a checkpoint at time 276.408: INFO (CarpetIOHDF5):
Dumping periodic checkpoint at iteration 268032, simulation time 276.408


Case 2: Output from ascii_output > gxx.asc:
3.2010000000000002e+02 -6.7572912816444132e-05 2.1144268516557760e-05
3.2023200000000003e+02 -6.7570803118733184e-05 2.1156692762978387e-05
                 nan 0.0000000000000000e+00 0.0000000000000000e+00
3.2049600000000004e+02 -6.7568973330671772e-05 2.1182079214383173e-05
3.2062800000000004e+02 -6.7569118827631724e-05 2.1194791416536239e-05

With a checkpoint at time 320.232: INFO (CarpetIOHDF5): Dumping periodic
checkpoint at iteration 310528, simulation time 320.232


Also, in both cases only the metric_obs_0_Decomp.h5 file was affected;
the other detection-radius files, _1 and _2, had all their data.


<snip>



2012/7/28 Erik Schnetter <schnetter at cct.lsu.edu>

> Jakob
>
> I have not heard about such a problem before.
>
> When an HDF5 file is not properly closed, its content may be corrupted.
> (This will be addressed in the next major release.) There may be two
> reasons for this: either the file is not closed (which would be an error in
> the code), or there is a write error (e.g. you run out of disk space). The
> latter is the major reason for people encountering corrupted HDF5 files.
> Since you don't see error messages, this is either not the case, or these
> HDF5 output routines suppress these errors.
>
> The thorn SphericalHarmonicDecomp implements its own HDF5 output routines
> and does not use Cactus. I see that it uses a non-standard way to determine
> whether the file exists, and that it does not check for errors when writing
> or closing. I think that HDF5 errors should cause prominent warnings in
> stdout and stderr (did you check?), and if you don't see these, the writing
> should have succeeded.
>
>
The errors I see on stderr are the ones I mentioned in my first mail:



HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) thread 0:
#000: H5F.c line 1509 in H5Fopen(): unable to open file
major: File accessability
minor: Unable to open file
#001: H5F.c line 1300 in H5F_open(): unable to read superblock
major: File accessability
minor: Read failed

.... etc.

stdout does not produce any errors or warnings related to this.
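
(For reference, regarding the missing checks on writing and closing: this
is roughly how I would picture explicit error checks on the write path. It
is purely a sketch of mine, not the thorn's actual code; the helper name
and the handles are placeholders, and only the HDF5 calls themselves are
real API.)

/* Illustration only -- not the thorn's actual code.  The helper name
 * and the handles are placeholders; only the HDF5 calls are real API. */
#include <hdf5.h>
#include <stdio.h>

static int write_and_close_checked(hid_t file, hid_t dataset,
                                   const double *buffer)
{
  /* Flag a failed write loudly instead of letting it pass silently. */
  herr_t status = H5Dwrite(dataset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                           H5P_DEFAULT, buffer);
  if (status < 0)
    fprintf(stderr, "WARNING: H5Dwrite failed; output file may be corrupt\n");

  /* A file that is not closed cleanly can end up unreadable, so check
   * the close calls as well. */
  if (H5Dclose(dataset) < 0 || H5Fclose(file) < 0) {
    fprintf(stderr, "WARNING: could not close HDF5 file cleanly\n");
    return -1;
  }
  return (status < 0) ? -1 : 0;
}

With something like this, a failed write would at least leave a clear
warning instead of only the HDF5-DIAG trace above.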


You mention checkpointing. Are you experiencing these problems right after
> recovery, i.e. during the first SphericalHarmonicDecomp HDF5 output
> afterwards?
>

No, this happened during the first run, not related to recovery.


>  In this case, did you maybe switch to a new directory where this file
> doesn't exist?
>
> If not, then it may be the non-standard way in which the code determines
> whether the file already exists, combined with something that may be
> special about your file system.
>
> (The "standard" way operates as follows: open the file as if it existed;
> if this fails, open it by creating it. The code works differently: it opens
> the file as binary file. If this fails, the HDF5 file is created; if it
> succeeds, the file is closed and re-opened as HDF5 file. Maybe the quick
> closing-then-reopening causes problems?)
>
> -erik
>
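
For what it's worth, this is how I picture the two strategies you describe,
side by side. It is just a sketch of mine for discussion, not the
SphericalHarmonicDecomp code; the function names are placeholders, and only
the HDF5 and libc calls are real API.

/* Sketch only -- not the SphericalHarmonicDecomp code.  The function
 * names are placeholders; only the HDF5/libc calls are real API. */
#include <hdf5.h>
#include <stdio.h>

/* "Standard" way: try to open the file with HDF5 directly and create it
 * only if that fails.  The H5E_BEGIN_TRY block silences the expected
 * "unable to open file" complaint when the file does not exist yet. */
static hid_t open_or_create_standard(const char *name)
{
  hid_t file;
  H5E_BEGIN_TRY {
    file = H5Fopen(name, H5F_ACC_RDWR, H5P_DEFAULT);
  } H5E_END_TRY;
  if (file < 0)
    file = H5Fcreate(name, H5F_ACC_EXCL, H5P_DEFAULT, H5P_DEFAULT);
  return file;
}

/* The behaviour described above: probe for existence by opening the
 * file as a plain binary file, then close it and immediately reopen
 * it through HDF5. */
static hid_t open_or_create_probe(const char *name)
{
  FILE *probe = fopen(name, "rb");
  if (probe == NULL)
    return H5Fcreate(name, H5F_ACC_EXCL, H5P_DEFAULT, H5P_DEFAULT);
  fclose(probe);
  return H5Fopen(name, H5F_ACC_RDWR, H5P_DEFAULT);
}

If the quick close-then-reopen is indeed what occasionally fails on Lustre
(which could match the "unable to read superblock" trace I posted above),
the first variant never touches the file outside of HDF5 and might
sidestep that.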



2012/7/28 Roland Haas <roland.haas at physics.gatech.edu>

> Hello all,
>
> >> I have not heard about such a problem before.
> I believe Nick Taylor at Caltech had similar issues. Bela has since
> fixed some bugs but had trouble actually committing them (he just saw
> your emails). I'll grab his changes and commit them.
>
>
Sounds interesting; I'm looking forward to applying the changes and seeing
if the problem disappears.

Cheers,
Jakob

