[Users] Error writing PittNull during checkpointing

Ian Hinder ian.hinder at aei.mpg.de
Thu Aug 2 07:59:33 CDT 2012


On 28 Jul 2012, at 13:38, Jakob Hansen wrote:

> Hi all, 
> 
> Thanks for the fast replies ^^
> 
> 2012/7/27 Yosef Zlochower <yosef at astro.rit.edu>
> Hi,
> 
>  SphericalHarmonicDecomp should not be writing output at the
> same time as a checkpoint. Are you using an NFS mount?
> 
> No, we're using Lustre filesystem.

I noticed similar problems on our Lustre filesystem several months ago.  The symptom was that, shortly (a few minutes) after a correct HDF5 checkpoint file had been written successfully, some other output would fail; in my case, that output was usually HDF5.  The error messages look different, though.  The only unusual thing about my run was that I was doing quite frequent 3D HDF5 output, so I was stressing the system more than usual.


>  
> 
> I noticed
> issues with NFS servers becoming unresponsive (due to a large number
> of blocking io operations) during a checkpoint. Perhaps right after
> a checkpoint, the server is still too busy.
> 
> 
> Well, indeed this happens just after checkpointing, though not at every checkpoint and not for all output. I have seen this error twice, once in each of two different simulations. In both cases it affected metric_obs_0_Decomp.h5 right after checkpointing:
> 
> Case 1 : Output from ascii_output > gxx.asc :
> 2.7627600000000001e+02 -7.1882256506489145e-05 1.5746413856874709e-05
> 2.7640800000000002e+02 -7.1881310717781865e-05 1.5759232641588166e-05
>                  nan 0.0000000000000000e+00 0.0000000000000000e+00
> 2.7667200000000003e+02 -7.1875929421365471e-05 1.5784711415784759e-05
> 2.7680400000000003e+02 -7.1871563990008602e-05 1.5797542778149928e-05
> 
> In this case there was a checkpoint at time 276.408 : INFO (CarpetIOHDF5): Dumping periodic checkpoint at iteration 268032, simulation time 276.408
> 
> 
> Case 2: Output from ascii_output > gxx.asc :
> 3.2010000000000002e+02 -6.7572912816444132e-05 2.1144268516557760e-05
> 3.2023200000000003e+02 -6.7570803118733184e-05 2.1156692762978387e-05
>                  nan 0.0000000000000000e+00 0.0000000000000000e+00
> 3.2049600000000004e+02 -6.7568973330671772e-05 2.1182079214383173e-05
> 3.2062800000000004e+02 -6.7569118827631724e-05 2.1194791416536239e-05
> 
> With a checkpoint at time 320.232 :  INFO (CarpetIOHDF5): Dumping periodic checkpoint at iteration 310528, simulation time 320.232
> 
> 
> Also, in both cases only the metric_obs_0_Decomp.h5 file was affected; the files for the other detection radii, _1 and _2, had all their data.
>  
> 
> <snip>
> 
> 
> 
> 2012/7/28 Erik Schnetter <schnetter at cct.lsu.edu>
> Jakob
> 
> I have not heard about such a problem before.
> 
> When an HDF5 file is not properly closed, its content may be corrupted. (This will be addressed in the next major release.) There may be two reasons for this: either the file is not closed (which would be an error in the code), or there is a write error (e.g. you run out of disk space). The latter is the major reason for people encountering corrupted HDF5 files. Since you don't see error messages, this is either not the case, or these HDF5 output routines suppress these errors.
> 
> The thorn SphericalHarmonicDecomp implements its own HDF5 output routines and does not use the Cactus I/O infrastructure. I see that it uses a non-standard way to determine whether the file exists, and that it does not check for errors when writing or closing. I think that HDF5 errors should cause prominent warnings on stdout and stderr (did you check?), and if you don't see these, the writing should have succeeded.
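A minimal sketch of the kind of error checking Erik is describing, with plain C stdio standing in for the HDF5 calls (with real HDF5, H5Dwrite and H5Fclose return negative values on failure and would be checked the same way); the function name and paths are illustrative only:

```c
#include <stdio.h>

/* Hypothetical sketch: every write and close is checked, mirroring the
 * checks the thorn reportedly omits.  A short write (e.g. disk full) or
 * a failing close -- where buffered data may only reach the disk -- is
 * reported to the caller instead of being silently ignored. */
static int write_checked(const char *path, const double *data, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;                    /* could not open: report it */

    size_t written = fwrite(data, sizeof(double), n, fp);
    if (written != n) {               /* short write, e.g. out of space */
        fclose(fp);
        return -1;
    }

    if (fclose(fp) != 0)              /* close itself can fail */
        return -1;

    return 0;
}
```

With this pattern, a run that fills up the disk fails loudly at the write or close call rather than leaving a silently corrupted file behind.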
> 
> 
> The errors I see on stderr are the ones I mentioned in my first mail :
> 
> 
>  
> HDF5-DIAG: Error detected in HDF5 (1.8.5-patch1) thread 0:
>   #000: H5F.c line 1509 in H5Fopen(): unable to open file
>     major: File accessability
>     minor: Unable to open file
>   #001: H5F.c line 1300 in H5F_open(): unable to read superblock
>     major: File accessability
>     minor: Read failed
> 
> .... etc.
> 
> stdout does not produce any errors or warnings related to this.
> 
> 
> You mention checkpointing. Are you experiencing these problems right after recovery, i.e. during the first SphericalHarmonicDecomp HDF5 output afterwards?
> 
> No, this happened during the first run, not related to recovery.
>  
> In this case, did you maybe switch to a new directory where this file doesn't exist?
> 
> If not, then it may be the non-standard way in which the code determines whether the file already exists, combined with something that may be special about your file system.
> 
> (The "standard" way operates as follows: open the file as if it existed; if this fails, create it instead. The code works differently: it first opens the file as a binary file. If this fails, the HDF5 file is created; if it succeeds, the file is closed and re-opened as an HDF5 file. Maybe the quick closing-then-reopening causes problems?)
> 
> -erik
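The two open strategies Erik contrasts can be sketched as follows, again with plain stdio standing in for H5Fopen/H5Fcreate; the helper names are hypothetical and not from the thorn's source:

```c
#include <stdio.h>

/* "Standard" way: try to open for update; only create if that fails.
 * The analogue in HDF5 would be H5Fopen(H5F_ACC_RDWR), falling back
 * to H5Fcreate on a negative return. */
static FILE *open_standard(const char *path)
{
    FILE *fp = fopen(path, "r+b");    /* open existing file */
    if (fp == NULL)
        fp = fopen(path, "w+b");      /* did not exist: create it */
    return fp;
}

/* The thorn's way, as Erik describes it: probe existence with a
 * separate binary open, close the probe handle, then immediately
 * re-open -- the quick close-then-reopen he suspects may interact
 * badly with some file systems. */
static FILE *open_probe_then_reopen(const char *path)
{
    FILE *probe = fopen(path, "rb");
    if (probe == NULL)
        return fopen(path, "w+b");    /* does not exist: create */
    fclose(probe);                    /* exists: close the probe ... */
    return fopen(path, "r+b");        /* ... and re-open at once */
}
```

On a networked file system such as Lustre or NFS, the probe's close and the immediate re-open are two separate operations whose effects may not yet be globally visible, which is one plausible way the second open could see a stale or unreadable file.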
> 
> 
> 
> 2012/7/28 Roland Haas <roland.haas at physics.gatech.edu>
> Hello all,
> 
> >> I have not heard about such a problem before.
> I believe Nick Taylor at Caltech had similar issues. Bela has since
> fixed some bugs but had trouble actually committing them (he just saw
> your emails). I'll grab his changes and commit them.
> 
> 
> Sounds interesting; I'm looking forward to applying the changes and seeing if the problem disappears.
> 
> Cheers,
> Jakob
> _______________________________________________
> Users mailing list
> Users at einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/users

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
