[Users] question about checkpoints and number of procs

Luciano Combi lcombi at perimeterinstitute.ca
Mon Apr 15 12:04:11 CDT 2024


I see, thanks, Roland!

As a matter of fact, I already had that option activated; otherwise it
would just give me a memory error.

I'm thinking of restarting the simulation with OpenMP enabled to speed up
the process. Do you think it will help? Otherwise, I will try your hack.

Cheers.
Luciano


On Mon, Apr 15, 2024 at 10:19 AM Roland Haas <rhaas at illinois.edu> wrote:

> Hello Luciano,
>
> > I'm trying to restart a simulation with a different number of processors
> > than the original run. Is there something in particular I need to do to
> > make it work? When I do, it gets stuck for hours reading the checkpoints.
> > The checkpoint is distributed in a number of files corresponding to the
> > original number of procs I used, should I recombine them in a particular
> > way?
>
> The issue is that when the number of MPI ranks changes, the data needs
> to be reorganized. Right now this means that each MPI rank opens every
> single file to look for its data, which can overwhelm the file system.
>
> A quick workaround is often to set:
>
> CarpetIOHDF5::open_one_input_file_at_a_time = "yes"
>
> which reduces IO contention.
>
> If that is still too slow (this has happened only with many hundreds of
> MPI ranks though), then you can try the hacked version of CarpetIOHDF5
> in the branch rhaas/map, which contains a helper script that you can
> run offline to parse all information in the checkpoint files into a
> "map" file. At checkpoint recovery time the MPI ranks then read in the
> map file, which tells them exactly where they need to look for their
> data. This significantly reduces IO issues. It is not user friendly
> though: it was an emergency hack and will most likely require some
> trial and error to get right, so setting the parameter above would
> be my first attempt.
>
> Yours,
> Roland
>
> --
> My email is as private as my paper mail. I therefore support encrypting
> and signing email messages. Get my PGP key from http://pgp.mit.edu .
>
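
A toy sketch of the "map" idea described in the quoted message may help show
why it cuts down on file opens. It is not the helper script from the
rhaas/map branch and does not read HDF5: plain dicts stand in for checkpoint
files, and every name in it (build_map, piece_0, checkpoint.file_0.h5, ...)
is hypothetical. It only counts how many files get opened in each scheme.

```python
# Toy model only: plain dicts stand in for checkpoint files; no HDF5 is read.
# All names here are hypothetical and not taken from CarpetIOHDF5.

def build_map(files):
    """Offline step: scan every file once, record which file holds each piece."""
    return {piece: fname for fname, pieces in files.items() for piece in pieces}

def opens_without_map(rank_needs, files):
    """Every rank opens every file to search for its data."""
    return len(rank_needs) * len(files)

def opens_with_map(rank_needs, piece_map):
    """Every rank opens only the files the map points it to."""
    return sum(len({piece_map[p] for p in needs}) for needs in rank_needs)

# Assumed checkpoint written by 4 ranks, one file per rank.
old_files = {f"checkpoint.file_{i}.h5": [f"piece_{i}"] for i in range(4)}
# Assumed restart on 2 ranks, each needing two of the old pieces.
new_rank_needs = [["piece_0", "piece_1"], ["piece_2", "piece_3"]]

piece_map = build_map(old_files)
print(opens_without_map(new_rank_needs, old_files))  # 2 ranks * 4 files
print(opens_with_map(new_rank_needs, piece_map))     # 2 files per rank
```

In this toy case the count drops from 8 opens to 4. Without a map the count
grows as (new ranks) x (old files), which is consistent with the remark
above that the problem shows up with many hundreds of MPI ranks.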