[Users] Avoiding writing one checkpoint file per MPI process

Roland Haas rhaas at illinois.edu
Fri Oct 14 11:23:04 CDT 2022


Hello Lorenzo,

If the issue is really "use MPI-IO" and not "use *parallel* file
access", then you can give the "rhaas/mpiio" branch of Carpet a try.

It introduces a new parameter, CarpetIOHDF5::user_MPIIO, that makes
Carpet instruct HDF5 to use MPI-IO instead of plain POSIX/Unix file IO
calls. This is *not* parallel IO (i.e. there is still the same number of
output files and each process still acts independently), but it would
bring in e.g. any awareness of Lustre that MPI-IO might have and POSIX
IO does not.
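
For reference, a minimal parameter-file sketch (assuming the parameter
name given above and that the rhaas/mpiio branch is checked out; please
verify spelling and default against CarpetIOHDF5's param.ccl on that
branch):

  # Requires the rhaas/mpiio branch of Carpet; the parameter name is as
  # described above and should be checked against the branch's param.ccl.
  CarpetIOHDF5::user_MPIIO = yes   # route HDF5 file access through MPI-IO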

This would technically address the issue that Frontera support seems to
have brought up, though I feel that it does not really address anything
(i.e. there is still no parallel IO at all).

Yours,
Roland

> Hello Roland, all,
> Would it be possible for you to have a quick look at the parameter file I
> am using (attached) to check whether there is anything manifestly
> wrong/unsafe/unrecommended with checkpointing or with the other I/O options?
> If there are any issues, I can then take care of them and report back to
> the Frontera people.
> 
> Thank you very much in advance,
> Lorenzo
> 
> On Thu, 6 Oct 2022 at 15:29, Roland Haas <rhaas at illinois.edu> wrote:
> 
> > Hello Lorenzo,
> >
> > TACC saved a bit of money on the IO system on Frontera :-) and thus
> > they now need to fix bugs in documentation.
> >
> > Yours,
> > Roland
> >  
> > > Hi Roland,
> > > thank you, your suggestions are very useful. I was running one process
> > > per core on more than 200 cores, so that may be part of the issue. Also,
> > > I will try the one_file_per_group or one_file_per_rank options to reduce
> > > the performance impact.
> > >
> > > The cluster I'm running on is Frontera, and the guidelines to manage I/O
> > > operations properly on it are here
> > > <https://portal.tacc.utexas.edu/tutorials/managingio> in case people are
> > > interested. I will follow them as closely as I can to avoid similar
> > > problems in the future.
> > >
> > > Thank you very much again,
> > > Lorenzo
> > >
> > > On Thu, 6 Oct 2022 at 12:52, Roland Haas <rhaas at illinois.edu> wrote:
> > >  
> > > > Hello Lorenzo,
> > > >
> > > > Unfortunately, Carpet will always write one checkpoint file per MPI
> > > > rank; there is no way to change that.
> > > >
> > > > As you learned, the option out_proc_every only affects out3D_vars
> > > > output (and possibly out_vars 3D output) but never checkpoints.
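> > > >
> > > > For reference, the checkpoint-related parameters look like the sketch
> > > > below (the values shown are placeholders, not recommendations); none of
> > > > them offers an out_proc_every-style grouping:
> > > >
> > > >   IOHDF5::checkpoint          = yes            # write HDF5 checkpoint files
> > > >   IO::checkpoint_dir          = "checkpoints"  # directory for checkpoint files
> > > >   IO::checkpoint_every        = 1024           # iterations between checkpoints
> > > >   IO::checkpoint_on_terminate = yes            # also checkpoint at termination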
> > > >
> > > > In my opinion, it should be impossible to stress the file system of a
> > > > reasonably provisioned cluster with checkpoints. Even when running
> > > > on 32k MPI ranks (and 4k nodes) on BW, checkpoint recovery was very
> > > > quick (1 min or so) and barely made a blip on the system monitoring
> > > > radar. Any cluster with sufficiently many nodes to run at scale with
> > > > 1 file per rank (for a sane number of ranks, i.e. using some OpenMP
> > > > threads per rank) should have a file system capable of taking
> > > > checkpoints. Of course, 1 rank per core is no longer "sane" once you go
> > > > beyond a couple hundred cores.
> > > >
> > > > Now, writing 1 file per output variable and per MPI rank may be a
> > > > different thing...
> > > > In that case out_proc_every should help with out3D_vars. I would also
> > > > suggest one_file_per_group or even one_file_per_rank for this (see
> > > > CarpetIOHDF5's param.ccl), which will have less of a performance impact
> > > > (no extra communication) than out_proc_every != 1; see the sketch below.
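> > > >
> > > > A minimal sketch of that (parameter name as mentioned above; check
> > > > CarpetIOHDF5's param.ccl for the exact spelling and defaults):
> > > >
> > > >   # Sketch only -- group regular HDF5 output into one file per variable
> > > >   # group instead of one file per variable (one_file_per_rank would go
> > > >   # further; see param.ccl). Verify names and defaults there.
> > > >   IOHDF5::one_file_per_group = yes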
> > > >
> > > > If the issue is opening many files (again, only for out3D_vars regular
> > > > output), then you may also see benefits from the different options in:
> > > >
> > > > https://bitbucket.org/eschnett/carpet/pull-requests/34
> > > >
> > > > https://bitbucket.org/einsteintoolkit/tickets/issues/2364
> > > >
> > > > Yours,
> > > > Roland
> > > >  
> > > > > Hello,
> > > > > In order to avoid stressing the filesystem on the cluster I'm
> > > > > running on, it was suggested that I avoid writing one
> > > > > output/checkpoint file per MPI process and instead collect data
> > > > > from multiple processes before outputting/checkpointing happens.
> > > > > I found that the combination of parameters
> > > > >
> > > > > IO::out_mode       = "np"
> > > > > IO::out_proc_every = 8
> > > > >
> > > > > does the job for output files, but I still have one checkpoint file
> > > > > per process. Is there a similar parameter, or combination of
> > > > > parameters, that can be used for checkpoint files?
> > > > >
> > > > > Thank you very much,
> > > > > Lorenzo Ennoggi  
> > > >
> > > >
> > > >
> > > > --
> > > > My email is as private as my paper mail. I therefore support encrypting
> > > > and signing email messages. Get my PGP key from http://keys.gnupg.net.
> > > >  
> >
> >
> >
> > --
> > My email is as private as my paper mail. I therefore support encrypting
> > and signing email messages. Get my PGP key from http://keys.gnupg.net.
> >  



-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://keys.gnupg.net.