[Users] Error with CarpetIOHDF5 Checkpoint Iteration

Roland Haas rhaas at illinois.edu
Wed Oct 2 11:31:48 CDT 2024


Hello Wei,

I am afraid I have never seen this particular error. Some items in an
HDF5 file are size limited, e.g. an "attribute" must not be larger than
64k or so. This may be similar.

Of the CarpetIOHDF5 options, I would not try to change any of the
chunking or similar options away from the defaults since they may
not be well tested.

Could you perhaps attach the full parameter file (*.par file) and the
full stdout and stderr log file to the email? Also the backtrace.txt
file might be useful to see exactly what triggers the error.

Similar error messages can be found on the web, e.g.

https://github.com/HDFGroup/hdf5/issues/3762

though the solution suggested there seems to be to modify and recompile
HDF5 itself.

The suggestion

H5Pset_libver_bounds(fapl, H5F_LIBVER_V110, H5F_LIBVER_V110);

is doable at runtime though. It would seem that it needs to be done when
writing to the file, i.e. in CarpetIOHDF5. I think there is already some
fapl (file access property list) set up when creating a file, so this
would be straightforward to add.

Note that none of this may matter if one uses the default setup, which
writes one file per MPI rank. I would also (strongly) suggest not
appending to HDF5 files (if you restart from a checkpoint). HDF5 files
are fragile while open, so if the code crashes while writing, this can
easily corrupt the whole file and make it unreadable.

I am somewhat surprised that the issue happens in H5C_load_entry, i.e.
while loading, and not in a write function.

Yours,
Roland


On Tue, 1 Oct 2024 19:35:20 -0400, Wei Sun wrote:
> Hi Roland,
> 
> Thank you for your clarification. I think I missed the true error, so the
> actual error is here:
> 
> *WARNING level 1 from host* x1005c1s1b0n1h0.chn.perlmutter.nersc.gov,
> process 0 in thorn CarpetIOHDF5, file
> /Cactus/configs/sim/build/CarpetIOHDF5/Output.cc:442: -> Values for
> DISTRIB=CONSTANT grid variable 'TERMINATIONTRIGGER::watchminutes'
> (timelevel 0) differ between processors 0 and 383; only the array from
> processor 0 will be stored.
> *cactus_sim: H5C.c:6732: H5C_load_entry: Assertion entry->size <
> ((size_t)(32 * 1024 * 1024)) failed.*
> Rank 0 with PID 1727399 received signal 6.
> Writing backtrace to /Multipole/backtrace.0.txt
> srun: error: nid004365: task 0: Aborted.
> 
> Then I modified my parameter file by setting IO::out_mode = "proc" to make
> the checkpoint.h5 file chunked, which fixed the aborted issue. However,
> this also changed my 3D output HDF5 file to chunked format as well.
> 
> Even after adding out_unchunked='yes' in the out_vars parameter, for
> example:
> CarpetIOHDF5::out3D_vars = "HydroBase::rho{out_unchunked='yes'}", it
> doesn't work.
> 
> Is it possible that I didn’t write it in the correct way?
> 
> 
> Thank you,
> 
> Wei

-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .