[Users] Cactus & HDF5: checkpoint recovery failure with ?
ian.hinder at aei.mpg.de
Tue Feb 2 03:30:29 CST 2016
On 1 Feb 2016, at 21:39, Bernard Kelly <physicsbeany at gmail.com> wrote:
> Hi all. I'm having checkpoint/recovery issues with a particular simulation:
> An initial short run stopped some time after iteration 32000, leaving
> me with checkpoints at it 30000 & 32000. I found I couldn't recover
> from the later of these, but as the earlier one *did* allow recovery,
> I didn't worry too much about it.
> Now the recovered run went until some time after it 124000. I again
> have two sets of checkpoint data, from it 122000 and 124000. *Neither*
> of these work. I could imagine the later one being corrupted somehow
> because of disk space issues, but both?
> In each case, the error output in the STDERR consists of multiple
> instances of the message below.
> * Is this likely due to file corruption?
> * What's the best way to check CarpetIOHDF5 files for corruption?
There is a tool called h5check (Google for "h5check"):
> • h5check: A tool to check the validity of an HDF5 file.
> The HDF5 Format Checker, h5check, is a validation tool for verifying that an HDF5 file is encoded according to the HDF5 File Format Specification. Its purpose is to ensure data model integrity and long-term compatibility between evolving versions of the HDF5 library.
> Note that h5check is designed and implemented without any use of the HDF5 Library.
> Given a file, h5check scans through the encoded content, verifying it against the defined library format. If it finds any non-compliance, h5check prints the error and the reason behind the non-compliance; if possible, it continues the scanning. If h5check does not find any non-compliance, it prints an approval statement upon completion.
> By default, the file is verified against the latest version of the file format, but the format version can be specified.
I have used this successfully in the past.
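Besides h5check (run as e.g. `h5check file.h5`), another quick way to find out *which* dataset is broken is to walk the file and attempt to read every dataset, since a corrupt one will usually raise an error on read. A minimal sketch, assuming the h5py package is installed and using a placeholder filename:

```python
# Sketch: attempt to read every dataset in an HDF5 checkpoint file and
# report the ones that cannot be read.  Assumes h5py is available;
# "checkpoint.h5" below is a placeholder filename.
import h5py

def find_unreadable_datasets(path):
    """Return the names of datasets that raise an error when read."""
    bad = []
    with h5py.File(path, "r") as f:
        def check(name, obj):
            # visititems calls this for every group and dataset in the file
            if isinstance(obj, h5py.Dataset):
                try:
                    obj[()]  # force a full read of the raw data
                except Exception:
                    bad.append(name)
        f.visititems(check)
    return bad

if __name__ == "__main__":
    for name in find_unreadable_datasets("checkpoint.h5"):
        print("unreadable dataset:", name)
```

This only catches datasets whose data cannot be decoded; damage to the file's internal metadata may make `h5py.File` itself fail to open, which is exactly the case h5check is designed to diagnose.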
> * Can I do anything about this particular run, apart from start
> (again) from the "good" 30000 checkpoint?
If the file is corrupt, then I doubt it. You might be able to add debugging code to work out which dataset is corrupt, and if it is not an important one, you might be able to create a new HDF5 file with a corrected version. But this is a lot of work, and if there is more than one corrupt dataset, it's unlikely to be practical. It's probably much more realistic to just repeat the run. However, if you got corruption twice already, I suspect you will get it again, so it would be a good idea to checkpoint more frequently.
The only legitimate reason for the files being corrupted is if you ran out of disk space during write (and this is only legitimate because HDF5 does not support journaled writing, which is disappointing in 2016). If that happened, I would expect to see evidence of it in stdout/stderr, which you didn't see. If you don't have abort_on_io_errors set, then Cactus would have happily continued on after the HDF5 disk write failed, but I think it would have crashed if it couldn't write the error message to stdout/stderr, so I don't think you ran out of disk space. If the files are corrupt, it would either be a problem with the filesystem, or a bug in Cactus.
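To make a failed write fatal rather than silently continued past, and to keep more fallback checkpoints around, something along these lines could go in the parameter file (a sketch; parameter names are from Cactus's IOUtil thorn, and the values are illustrative):

```
IO::abort_on_io_errors = "yes"   # stop instead of continuing after a failed write
IO::checkpoint_every   = 10000   # checkpoint more frequently
IO::checkpoint_keep    = 3       # keep several older checkpoints as fallbacks
```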
You might want to run a filesystem-checking program to see whether the corruption can be reproduced in a test case, or ask the system admins to do so.