[Users] Stopped making output

Roland Haas rhaas at illinois.edu
Fri Apr 9 12:09:46 CDT 2021


Hello all,

we have had issues with a bug in OpenMPI related to vader on "slow" systems.

See:

https://bitbucket.org/einsteintoolkit/tickets/issues/2287/add-openmpi-env-vars-to-notebook-to-avoid

for the ET ticket explaining this (the slow system being the tutorial
VM) and the OpenMPI ticket here:

https://github.com/open-mpi/ompi/issues/6568
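
As a rough illustration only (the exact settings recommended in the two
tickets above should be taken from the tickets themselves): OpenMPI reads
MCA parameters from environment variables named OMPI_MCA_<parameter>, so
vader can be avoided or reconfigured without touching the run command.
The helper name openmpi_env and its arguments are hypothetical, and the
two parameter choices shown are assumptions rather than the tickets'
prescribed fix.

```python
import os


def openmpi_env(base_env=None, exclude_vader=True, single_copy="none"):
    """Return a copy of base_env with OpenMPI MCA settings added.

    Hypothetical helper: the parameter choices are assumptions, not the
    settings prescribed by the linked tickets.
    """
    env = dict(os.environ if base_env is None else base_env)
    if exclude_vader:
        # "^vader" asks OpenMPI to consider every BTL except vader.
        env["OMPI_MCA_btl"] = "^vader"
    elif single_copy is not None:
        # Alternative: keep vader but change its single-copy mechanism.
        env["OMPI_MCA_btl_vader_single_copy_mechanism"] = single_copy
    return env


if __name__ == "__main__":
    # Show only the OpenMPI-related variables that would be set.
    for key, value in sorted(openmpi_env({}).items()):
        print(f"{key}={value}")
```

The resulting dictionary can be passed as the env argument of
subprocess.run when launching the simulation. Which transport OpenMPI
then uses for on-node traffic depends on the installed version.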

Yours,
Roland

> On Fri, Apr 9, 2021 at 12:43 PM Hee Il Kim <heeilkim at gmail.com> wrote:
> >
> > Thanks Erik.
> >
> >
> > On Fri, Apr 9, 2021, 23:45 Erik Schnetter <schnetter at cct.lsu.edu> wrote:  
> >>
> >> Hee Il
> >>
> >> Yes, that has happened to me several times. Usually, the problem is
> >> either MPI or I/O.  
> >
> >
> > > Have you ever experienced this under UCX?
> 
> No, but I think UCX and MPI are about equivalent in this context.
> 
> -erik
> 
> >> It might be that there is a file system problem, and one process is
> >> trying to write to a file, but is blocked indefinitely. The other
> >> processes then also stop making progress since they wait on
> >> communication.
> >>
> >> It could also be that there is an MPI problem, either caused by a
> >> problem in the code, or by an error in the system, that makes MPI
> >> hang.  
> >
> >
> > > I don't think I have seen the issue when using the 'sm' BTL. At least, vader was used for all the problematic runs.
> >  
> >>
> >> In both cases, restarting from a checkpoint might solve the problem.
> >> If the problem is reproducible, then it would make sense to dig deeper
> >> to find out what's wrong, and whether there is a work-around (e.g.
> >> changing the grid structure a bit to avoid triggering the bug).
> >>
> >> -erik  
> >
> >
> > > Yes, restarting could solve the issue.
> >
> > Hee Il
> >
> >  
> >>
> >>
> >>
> >> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <heeilkim at gmail.com> wrote:  
> >> >
> >> > Hi,
> >> >
> >> > This might not be an ET issue, but have you ever seen ET runs stop producing any output (even stdout) while the processes are still running?
> >> >
> >> > I have seen this issue on both new and old NVMe storage with various versions of OpenMPI. It happened in runs lasting more than a day.
> >> >
> >> > Oh, not all the processes are running. One process is in Dl state, so all output stopped. Do you have any hints on this issue? There are no specific limits set for the files. Other read/write tasks on the disks are OK.
> >> >
> >> > Thanks for your help in advance.
> >> >
> >> > Hee Il
> >>
> >>
> >>
> >> --
> >> Erik Schnetter <schnetter at cct.lsu.edu>
> >> http://www.perimeterinstitute.ca/personal/eschnetter/
> 
> 
> 


-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .