[Users] Cactus & HDF5: checkpoint recovery failure with ?

Bernard Kelly physicsbeany at gmail.com
Mon Feb 1 16:04:14 CST 2016


Thanks, Frank.

I've looked through the combined STDOUT & STDERR file from the run
that generated the 122000 & 124000 checkpoints, and it looks totally
normal around those times; no complaints.

As I have N separate checkpoint files for each time step (this is
faster than writing one large file), running h5ls over all of them
takes a bit of time. I'm running it now ...
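
For the record, here's roughly the loop I'm using; the file-name
pattern is only illustrative, since the exact chunk naming depends on
the checkpoint settings:

    # Check each checkpoint chunk; h5ls exits nonzero if the file's
    # structure is unreadable, which makes a quick first corruption test.
    for f in checkpoint.chkpt.it_122000.file_*.h5; do
        h5ls -r "$f" > /dev/null || echo "suspect: $f"
    done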

This machine's nobackup filesystem has been pretty flaky in recent
weeks. I'd have been unlucky to be hit by it at both 122000 *and*
124000 (a few hours later), but it's not impossible.

Beany

On 1 February 2016 at 15:56, Frank Loeffler <knarf at cct.lsu.edu> wrote:
> Hi,
>
> On Mon, Feb 01, 2016 at 03:39:25PM -0500, Bernard Kelly wrote:
>> In each case, the error output in the STDERR consists of multiple
>> instances of the message below.
>>
>> * Is this likely due to file corruption?
>
> I would think so. One way to check would be to use another HDF5 tool to
> look at the file.
>
>> * What's the best way to check CarpetIOHDF5 files for corruption?
>
> I don't know of a 'best' way, but I would first try 'h5ls' and see
> whether that already has problems. If it succeeds, 'h5dump' is a
> quick-and-dirty next step: dump the complete file to /dev/null and see
> whether h5dump complains about anything.
>
>> * Can I do anything about this particular run, apart from start
>> (again) from the "good" 30000 checkpoint?
>
> Assuming it is HDF5 file corruption, most likely not. Depending on how
> desperate you are, you could try to see which parts are affected and
> whether the remaining 'good' parts are sufficient to restart your
> particular simulation. I wouldn't have high hopes, though.
>
> Something else I would like to know: do you still have the stdout/err
> of the run that produced these files? Did Carpet complain during the
> checkpoint write? If so, it might be fine for the run to continue (as
> it apparently did), but it shouldn't delete the last-good checkpoint
> file - maybe it should even delete the known-to-be-bad attempt instead.
>
> Frank
>
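
P.S. A minimal sketch of the deeper check Frank suggests (the file
name is illustrative): h5dump actually reads every dataset, so it can
catch corruption that a bare h5ls listing misses.

    # Dump all datasets and discard the output; any read errors
    # h5dump hits will still show up on stderr.
    h5dump checkpoint.chkpt.it_122000.file_0.h5 > /dev/null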

