[Users] question about checkpoints and number of procs

Mon Apr 15 08:19:13 CDT 2024

Hello Luciano,

> I'm trying to restart a simulation with a different number of processors
> than the original run. Is there something in particular I need to do to
> make it work? When I do, it gets stuck for hours reading the checkpoints.
> The checkpoint is distributed in a number of files corresponding to the
> original number of procs I used, should I recombine them in a particular
> way?

The issue is that when the number of MPI ranks changes, the data needs
to be redistributed. Right now this means that each MPI rank opens
every single checkpoint file to look for its data, which can overwhelm
the file system.
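
As a rough back-of-the-envelope sketch of why this scales badly: assuming
one checkpoint file per original rank and that every new rank opens every
file, the number of file opens grows as the product of the two rank
counts. The function name and the rank counts below are made up for
illustration only.

```python
def total_file_opens(old_ranks: int, new_ranks: int) -> int:
    """Estimate file opens during recovery.

    Assumption: one checkpoint file per original rank, and each new
    rank opens every one of those files to search for its data.
    """
    return old_ranks * new_ranks


# Hypothetical example: checkpoint written on 512 ranks, restarted on 768.
print(total_file_opens(512, 768))  # 393216
```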

A quick workaround is often to set:

CarpetIOHDF5::open_one_input_file_at_a_time = "yes"

which reduces IO contention.

If that is still too slow (this has happened only with many hundreds of
MPI ranks though), then you can try the hacked version of CarpetIOHDF5
in the branch rhaas/map, which contains a helper script that you can
run offline to parse all information in the checkpoint files into a
"map" file. At checkpoint recovery time the MPI ranks then read in the
map file, which tells them exactly where they need to look for their
data. This significantly reduces IO issues. It is not user friendly
though: it was an emergency hack and will most likely require some
trial and error to get right, so setting the parameter above would
be my first attempt.
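
To illustrate the general idea of a "map" file only (this is not the
actual script or file format from the rhaas/map branch, which may work
quite differently), here is a minimal Python sketch. It assumes the
offline scan has already produced a listing of which datasets live in
which file; the file names, dataset names and the JSON format are all
hypothetical.

```python
import json


def build_map(file_contents):
    """Invert {file name: [dataset names]} into {dataset name: file name}."""
    index = {}
    for fname, datasets in file_contents.items():
        for ds in datasets:
            index[ds] = fname
    return index


# Hypothetical result of scanning the checkpoint files once, offline.
scan = {
    "checkpoint.file_0.h5": ["rho c=0", "rho c=1"],
    "checkpoint.file_1.h5": ["rho c=2", "rho c=3"],
}

# Offline step: serialise the map (here as JSON text, an assumed format).
map_text = json.dumps(build_map(scan), indent=2)

# Recovery step: a rank loads the map and opens only the files it needs.
lookup = json.loads(map_text)
needed = ["rho c=1", "rho c=2"]  # datasets this hypothetical rank owns
files_to_open = sorted({lookup[ds] for ds in needed})
print(files_to_open)
```

The point of the design is that the expensive scan over all files happens
once, rather than once per MPI rank at recovery time.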

Yours,
Roland

-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .