[Users] Cactus & HDF5: checkpoint recovery failure with ?
physicsbeany at gmail.com
Tue Feb 2 21:48:11 CST 2016
Thanks, Ian. I got hold of h5check (it's not part of the default HDF5
installation) and ran it. Each of the troublesome checkpoints does,
indeed, contain at least one or two "non-compliant" files.
Irritating, but I suppose that answers my question.
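For the record, a cruder first-pass check that doesn't need h5check is
to try opening and walking each file with h5py; grossly corrupted files
usually fail outright. A rough, untested sketch (the file pattern is
just a placeholder, not the real checkpoint names):

    import glob
    import h5py

    # Example pattern; adjust to the actual Carpet checkpoint file names.
    for path in sorted(glob.glob("checkpoint.chkpt.it_124000.file_*.h5")):
        try:
            with h5py.File(path, "r") as f:
                f.visititems(lambda name, obj: None)  # walk the whole tree
            print("opened OK :", path)
        except (OSError, RuntimeError) as err:
            print("suspect   :", path, "--", err)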
Roland, thanks for the abort_on_io_errors suggestion (though it might
not have helped here, given the lack of warnings).
I guess I'll be starting from the older checkpoint, then.
On 2 February 2016 at 04:30, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
> On 1 Feb 2016, at 21:39, Bernard Kelly <physicsbeany at gmail.com> wrote:
> Hi all. I'm having checkpoint/recovery issues with a particular simulation:
> An initial short run stopped some time after iteration 32000, leaving
> me with checkpoints at it 30000 & 32000. I found I couldn't recover
> from the later of these, but as the earlier one *did* allow recovery,
> I didn't worry too much about it.
> The recovered run then ran until some time after iteration 124000. I again
> have two sets of checkpoint data, from iterations 122000 and 124000. *Neither*
> of these work. I could imagine the later one being corrupted somehow
> because of disk space issues, but both?
> In each case, the error output in STDERR consists of multiple
> instances of the message below.
> * Is this likely due to file corruption?
> * What's the best way to check CarpetIOHDF5 files for corruption?
> Hi Bernard,
> There is a tool called h5check (Google for "What's the best way to check
> HDF5 files for corruption"):
> • h5check: A tool to check the validity of an HDF5 file.
> The HDF5 Format Checker, h5check, is a validation tool for verifying that an
> HDF5 file is encoded according to the HDF5 File Format Specification. Its
> purpose is to ensure data model integrity and long-term compatibility
> between evolving versions of the HDF5 library.
> Note that h5check is designed and implemented without any use of the HDF5
> library.
> Given a file, h5check scans through the encoded content, verifying it
> against the defined library format. If it finds any non-compliance, h5check
> prints the error and the reason behind the non-compliance; if possible, it
> continues the scanning. If h5check does not find any non-compliance, it
> prints an approval statement upon completion.
> By default, the file is verified against the latest version of the file
> format, but the format version can be specified.
> I have used this successfully in the past.
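> To check a whole set of checkpoint files in one go, a rough sketch in
> Python (assuming h5check is on your PATH, and assuming it exits non-zero
> when it finds non-compliance; if not, grep its output instead):
>
>     import glob
>     import subprocess
>
>     # Example file pattern; adjust to the actual checkpoint names.
>     for path in sorted(glob.glob("checkpoint.chkpt.it_124000.file_*.h5")):
>         result = subprocess.run(["h5check", path],
>                                 capture_output=True, text=True)
>         status = "OK" if result.returncode == 0 else "NON-COMPLIANT"
>         print(status, path)
>         if result.returncode != 0:
>             print(result.stdout)
>             print(result.stderr)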
> * Can I do anything about this particular run, apart from start
> (again) from the "good" 30000 checkpoint?
> If the file is corrupt, then I doubt it. You might be able to add debugging
> code to work out which dataset is corrupt, and if it is not an important
> one, you might be able to create a new HDF5 file with a corrected version.
> But this is a lot of work, and if there is more than one corrupt dataset,
> it's unlikely to be practical. It's probably much more realistic to just
> repeat the run. However, since you have already seen corruption twice, I
> suspect you will see it again, so it would be a good idea to checkpoint
> more frequently.
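> If you do want to try the dataset-by-dataset approach mentioned above, a
> very rough, untested sketch (Python + h5py, with placeholder file names)
> would be to read each dataset and copy the ones that survive into a new file:
>
>     import h5py
>
>     bad = []
>     with h5py.File("checkpoint.it_124000.h5", "r") as src, \
>          h5py.File("checkpoint.it_124000.salvaged.h5", "w") as dst:
>
>         def copy_if_readable(name, obj):
>             if isinstance(obj, h5py.Dataset):
>                 try:
>                     data = obj[()]             # force a full read
>                 except (OSError, RuntimeError):
>                     bad.append(name)           # unreadable: record and skip
>                     return
>                 ds = dst.create_dataset(name, data=data)
>                 for k, v in obj.attrs.items():  # preserve dataset attributes
>                     ds.attrs[k] = v
>
>         src.visititems(copy_if_readable)
>
>     print("unreadable datasets:", bad)
>
> Whether the resulting file is actually recoverable by CarpetIOHDF5 then
> depends on which datasets were lost, so treat this only as a diagnostic.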
> The only legitimate reason for the files being corrupted is if you ran out
> of disk space during write (and this is only legitimate because HDF5 does
> not support journaled writing, which is disappointing in 2016). If that
> happened, I would expect to see evidence of it in stdout/stderr, which you
> didn't see. If you don't have abort_on_io_errors set, then Cactus would
> have happily continued on after the HDF5 disk write failed, but I think it
> would have crashed if it couldn't write the error message to stdout/stderr,
> so I don't think you ran out of disk space. If the files are corrupt, it
> would either be a problem with the filesystem, or a bug in Cactus.
> You might want to run a filesystem-checking program, or ask the system
> admins to do so, and see whether the problem can be reproduced in a test case.
> Ian Hinder