[Users] Cactus & HDF5: checkpoint recovery failure with ?

Bernard Kelly physicsbeany at gmail.com
Mon Feb 1 14:39:25 CST 2016


Hi all. I'm having checkpoint/recovery issues with a particular simulation:

An initial short run stopped some time after iteration 32000, leaving
me with checkpoints at it 30000 & 32000. I found I couldn't recover
from the later of these, but as the earlier one *did* allow recovery,
I didn't worry too much about it.

Now the recovered run went until some time after it 124000. I again
have two sets of checkpoint data, from it 122000 and 124000. *Neither*
of these works. I could imagine the later one being corrupted somehow
because of disk space issues, but both?

In each case, the error output in STDERR consists of multiple
instances of the message below.

* Is this likely due to file corruption?

* What's the best way to check CarpetIOHDF5 files for corruption?

* Can I do anything about this particular run, apart from starting
(again) from the "good" 30000 checkpoint?

Thanks,

Bernard

----------------------------------------------------------------------------------------
HDF5-DIAG: Error detected in HDF5 (1.8.8) thread 0:
  #000: H5Gdeprec.c line 875 in H5Gget_objinfo(): cannot stat object
    major: Invalid arguments to routine
    minor: Unable to initialize object
  #001: H5Gdeprec.c line 1002 in H5G_get_objinfo(): name doesn't exist
    major: Symbol table
    minor: Object already exists
  #002: H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #003: H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #004: H5Gdeprec.c line 926 in H5G_get_objinfo_cb(): unable to get object info
    major: Object header
    minor: Can't get value
  #005: H5O.c line 2789 in H5O_get_info(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #006: H5O.c line 1682 in H5O_protect(): unable to load object header
    major: Object header
    minor: Unable to protect metadata
  #007: H5AC.c line 1322 in H5AC_protect(): H5C_protect() failed.
    major: Object cache
    minor: Unable to protect metadata
  #008: H5C.c line 3567 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #009: H5C.c line 7957 in H5C_load_entry(): unable to load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #010: H5Ocache.c line 275 in H5O_load(): bad object header version number
    major: Object header
    minor: Wrong version number
WARNING level 1 from host r421i1n0.p4.nas.nasa.gov process 20
  while executing schedule bin CCTK_RECOVER_VARIABLES, routine IOUtil::IOUtil_RecoverGH
  in thorn CarpetIOHDF5, file /nobackupp8/bjkelly1/codes/Cactus.ET_2015_05/configs/vacuum/build/CarpetIOHDF5/Input.cc:1235:
  -> HDF5 call 'H5Gget_objinfo (group, objectname, 0, &object_info)' returned error code -1
--------------------------------------------------------------------------------------------------

