[Users] Cactus & HDF5: checkpoint recovery failure with ?

Frank Loeffler knarf at cct.lsu.edu
Mon Feb 1 14:56:34 CST 2016


Hi,

On Mon, Feb 01, 2016 at 03:39:25PM -0500, Bernard Kelly wrote:
> In each case, the error output in the STDERR consists of multiple
> instances of the message below.
> 
> * Is this likely due to file corruption?

I would think so. A way to see would be to use another HDF5 tool to look
at the file.

> * What's the best way to check CarpetIOHDF5 files for corruption?

I don't know of a 'best' way, but I would first try 'h5ls' and see
whether that already has problems. If it succeeds, 'h5dump' might be a
quick-and-dirty solution: dump the complete file to /dev/null and see if
h5dump complains about anything.
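
The two-step check above could be scripted roughly as follows. This is
only a sketch, not anything Cactus or Carpet provides: the function names
run_tool and check_hdf5_file are made up for illustration, and it assumes
that 'h5ls' and 'h5dump' are on your PATH and that a non-zero exit status
means the tool complained about the file.

```python
import subprocess


def run_tool(command, path):
    """Run an HDF5 command-line tool on `path`, discarding its normal output.

    Returns (ok, stderr_text). Assumption: a non-zero exit status means
    the tool complained about the file.
    """
    try:
        result = subprocess.run(
            list(command) + [path],
            stdout=subprocess.DEVNULL,  # same idea as redirecting to /dev/null
            stderr=subprocess.PIPE,
            text=True,
        )
    except FileNotFoundError:
        return False, "tool not found: " + command[0]
    return result.returncode == 0, result.stderr.strip()


def check_hdf5_file(path, list_cmd=("h5ls", "-r"), dump_cmd=("h5dump",)):
    """Try the cheap recursive listing first, then the full dump.

    Returns (stage, message): stage is "ok", or the name of the step
    that failed ("list" or "dump"), with the tool's stderr as message.
    """
    ok, err = run_tool(list_cmd, path)
    if not ok:
        return "list", err
    ok, err = run_tool(dump_cmd, path)
    if not ok:
        return "dump", err
    return "ok", ""
```

You would call check_hdf5_file() once per checkpoint file and look at
anything that does not come back as "ok". A clean result only says that
the tools could read the file; it does not prove the data are what
Carpet meant to write.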

> * Can I do anything about this particular run, apart from start
> (again) from the "good" 30000 checkpoint?

Assuming it is HDF5 file corruption, most likely not. Depending on how
desperate you are, you could try to see which parts are affected, and
whether the remaining 'good' parts are sufficient to restart your
particular simulation. I wouldn't have high hopes though.

Something else that I would like to know: do you still have the
stdout/stderr of the run that produced these files? Did Carpet complain
during the checkpoint write? If so, it might be fine for it to continue
(as it apparently did), but it shouldn't delete the last-good checkpoint
file; it could perhaps even ensure that by deleting the known-to-be-bad
attempt instead.

Frank
