[Users] Error to write PittNull during checkpointing

Yosef Zlochower yosef at astro.rit.edu
Mon Jul 30 14:01:34 CDT 2012


On 07/28/2012 07:38 AM, Jakob Hansen wrote:
> Hi all,
>
> Thanks for the fast replies ^^

I just committed a new version of SphericalHarmonicDecomp that checks
for I/O errors in the HDF5 files. There is a new option,
SphericalHarmonicDecomp::action_on_hdf5_error, which you can set to
"abort" to kill the run on I/O errors. Killing the run isn't
ideal, but at least you won't have missing timesteps.
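
For output that has already been written, a small script can flag the
timesteps that were dropped. The following is only a minimal sketch: it
assumes the ASCII layout quoted below (time in the first column, "nan"
in that column where the read failed), and the function name
find_missing_timesteps is made up for this example; it is not part of
the thorn.

```python
import math


def find_missing_timesteps(lines):
    """Return (line_number, last_good_time) for each row whose time is NaN."""
    missing = []
    last_good_time = None
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not fields or fields[0].startswith("#"):
            continue
        try:
            t = float(fields[0])
        except ValueError:
            continue  # not a data row
        if math.isnan(t):
            missing.append((lineno, last_good_time))
        else:
            last_good_time = t
    return missing


# Rows copied from the gxx.asc excerpt quoted in this thread.
sample = """\
2.7627600000000001e+02 -7.1882256506489145e-05 1.5746413856874709e-05
2.7640800000000002e+02 -7.1881310717781865e-05 1.5759232641588166e-05
                  nan 0.0000000000000000e+00 0.0000000000000000e+00
2.7667200000000003e+02 -7.1875929421365471e-05 1.5784711415784759e-05
""".splitlines()

for lineno, last_good_time in find_missing_timesteps(sample):
    print(f"line {lineno}: NaN time, last good time {last_good_time}")
```

To scan a real file, pass an open file object instead of the sample
lines, e.g. find_missing_timesteps(open("gxx.asc")).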


>
> 2012/7/27 Yosef Zlochower <yosef at astro.rit.edu>
>
>     Hi,
>
>       SphericalHarmonicDecomp should not be writing output at the
>     same time as a checkpoint. Are you using an NFS mount?
>
>
> No, we're using a Lustre filesystem.
>
>
>     I noticed
>     issues with NFS servers becoming unresponsive (due to a large number
>     of blocking io operations) during a checkpoint. Perhaps right after
>     a checkpoint, the server is still too busy.
>
>
> Indeed, this happens just after checkpointing, but not at every
> checkpoint and not for all output. I have seen this error twice, once
> in each of two different simulations. In each case it affected
> metric_obs_0_Decomp.h5 right after checkpointing:
>
> Case 1 : Output from ascii_output > gxx.asc :
> 2.7627600000000001e+02 -7.1882256506489145e-05 1.5746413856874709e-05
> 2.7640800000000002e+02 -7.1881310717781865e-05 1.5759232641588166e-05
>                   nan 0.0000000000000000e+00 0.0000000000000000e+00
> 2.7667200000000003e+02 -7.1875929421365471e-05 1.5784711415784759e-05
> 2.7680400000000003e+02 -7.1871563990008602e-05 1.5797542778149928e-05
>
> In this case there was a checkpoint at time 276.408 : INFO
> (CarpetIOHDF5): Dumping periodic checkpoint at iteration 268032,
> simulation time 276.408
>
>
> Case 2: Output from ascii_output > gxx.asc :
> 3.2010000000000002e+02 -6.7572912816444132e-05 2.1144268516557760e-05
> 3.2023200000000003e+02 -6.7570803118733184e-05 2.1156692762978387e-05
>                   nan 0.0000000000000000e+00 0.0000000000000000e+00
> 3.2049600000000004e+02 -6.7568973330671772e-05 2.1182079214383173e-05
> 3.2062800000000004e+02 -6.7569118827631724e-05 2.1194791416536239e-05
>
> With a checkpoint at time 320.232 :  INFO (CarpetIOHDF5): Dumping
> periodic checkpoint at iteration 310528, simulation time 320.232
>
>
> Also, in both cases it only affected the metric_obs_0_Decomp.h5 file;
> the files for the other detection radii, _1 and _2, had all their data.
>
>
> <snip>
>
>
>
> 2012/7/28 Erik Schnetter <schnetter at cct.lsu.edu>
>
>     Jakob
>
>     I have not heard about such a problem before.
>
>     When an HDF5 file is not properly closed, its content may be
>     corrupted. (This will be addressed in the next major release.) There
>     may be two reasons for this: either the file is not closed (which
>     would be an error in the code), or there is a write error (e.g. you
>     run out of disk space). The latter is the major reason for people
>     encountering corrupted HDF5 files. Since you don't see error
>     messages, this is either not the case, or these HDF5 output routines
>     suppress these errors.
>
>     The thorn SphericalHarmonicDecomp implements its own HDF5 output
>     routines and does not use Cactus. I see that it uses a non-standard
>     way to determine whether the file exists, and that it does not check
>     for errors when writing or closing. I think that HDF5 errors should
>     cause prominent warnings in stdout and stderr (did you check?), and
>     if you don't see these, the writing should have succeeded.
>
>
> The errors I see on stderr are the ones I mentioned in my first mail :
>
>
>
> HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) thread 0:
> #000: H5F.c line 1509 in H5Fopen(): unable to open file
> major: File accessability
> minor: Unable to open file
> #001: H5F.c line 1300 in H5F_open(): unable to read superblock
> major: File accessability
> minor: Read failed
>
> .... etc.
>
> stdout does not produce any errors or warnings related to this.
>
>
>     You mention checkpointing. Are you experiencing these problems right
>     after recovery, i.e. during the first SphericalHarmonicDecomp HDF5
>     output afterwards?
>
>
> No, this happened during the first run, not related to recovery.
>
>     In this case, did you maybe switch to a new directory where this
>     file doesn't exist?
>
>     If not, then it may be the non-standard way in which the code
>     determines whether the file already exists, combined with something
>     that may be special about your file system.
>
>     (The "standard" way operates as follows: open the file as if it
>     existed; if this fails, open it by creating it. The code works
>     differently: it opens the file as a binary file. If this fails, the
>     HDF5 file is created; if it succeeds, the file is closed and
>     reopened as an HDF5 file. Maybe the quick closing-then-reopening
>     causes problems?)
>
>     -erik
>
>
>
>
> 2012/7/28 Roland Haas <roland.haas at physics.gatech.edu>
>
>     Hello all,
>
>      >> I have not heard about such a problem before.
>     I believe Nick Taylor at Caltech had similar issues. Bela has since
>     fixed some bugs but had trouble actually committing them (he just saw
>     your emails). I'll grab his changes and commit them.
>
>
> Sounds interesting. I'm looking forward to applying the changes and
> seeing whether the problem disappears.
>
> Cheers,
> Jakob
>
>
> _______________________________________________
> Users mailing list
> Users at einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/users
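
The "standard" open-then-create sequence Erik describes in the quoted
text above can be sketched as follows. This uses Python's built-in file
API on a plain file purely as a stand-in; it is not the HDF5 API and
not the thorn's code, and the name open_or_create is hypothetical.

```python
import os
import tempfile


def open_or_create(path):
    """Open `path` for update if it exists, otherwise create it.

    Returns (file_object, created).
    """
    try:
        # Step 1: open the file as if it already existed.
        return open(path, "r+b"), False
    except FileNotFoundError:
        # Step 2: the open failed, so create the file. Mode "x" refuses
        # to overwrite a file that appeared in the meantime.
        return open(path, "x+b"), True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dat")

    # First call: the file does not exist yet, so it is created.
    f, created = open_or_create(path)
    f.write(b"first")
    f.close()

    # Second call: the file exists, so it is opened and appended to.
    f, created_again = open_or_create(path)
    f.seek(0, os.SEEK_END)
    f.write(b" second")
    f.close()

    with open(path, "rb") as check:
        contents = check.read()

print(created, created_again, contents)
```

The point of this ordering is that there is only one open attempt per
output step, with no separate existence probe followed by a close and
an immediate reopen.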


-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Assistant Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office:74-2067
Phone: +1 585-475-6103

yosef at astro.rit.edu