[Users] Stopped making output

Hee Il Kim heeilkim at gmail.com
Fri Apr 9 11:43:19 CDT 2021


Thanks Erik.


On Fri, Apr 9, 2021, 23:45 Erik Schnetter <schnetter at cct.lsu.edu> wrote:

> Hee Il
>
> Yes, that has happened to me several times. Usually, the problem is
> either MPI or I/O.
>

Have you ever experienced this under UCX?


> It might be that there is a file system problem, and one process is
> trying to write to a file, but is blocked indefinitely. The other
> processes then also stop making progress since they wait on
> communication.
>
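
One way to check for the blocked-writer scenario described above is to look for processes in uninterruptible sleep (state D, shown as "Dl" by ps for a multi-threaded process). The following is only a minimal sketch, assuming a Linux system with /proc mounted; the function names are made up for illustration, and /proc/<pid>/wchan may show just "0" or be unreadable without sufficient privileges.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, right = line.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def find_blocked_processes(proc_root="/proc"):
    """List (pid, comm, wchan) for processes whose state is D."""
    blocked = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return blocked  # no /proc here (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # process exited or its stat file was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"  # kernel wait channel not available
        blocked.append((pid, comm, wchan))
    return blocked


if __name__ == "__main__":
    for pid, comm, wchan in find_blocked_processes():
        print(f"{pid}\t{comm}\t{wchan}")
```

If one rank shows up here while the others are busy-waiting in MPI, that would be consistent with the I/O explanation; the wait channel can hint at which kernel call the process is stuck in.
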
> It could also be that there is an MPI problem, either caused by a
> problem in the code, or by an error in the system, that makes MPI
> hang.
>

I don't think I have seen the issue when using the 'sm' BTL. At least, vader
was used for all the problematic runs.


> In both cases, restarting from a checkpoint might solve the problem.
> If the problem is reproducible, then it would make sense to dig deeper
> to find out what's wrong, and whether there is a work-around (e.g.
> changing the grid structure a bit to avoid triggering the bug).
>
> -erik
>

Yes, restarting could solve the issue.
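
Since the hang only shows up after more than a day, it may help to detect it automatically. Below is a rough sketch of a watchdog that reports when nothing under an output directory has been modified for a given time. The function names, the directory argument, and the idle threshold are all hypothetical; this is not part of ET. Note that if the file system itself is hung, the stat calls made here could block as well.

```python
import os
import sys
import time


def newest_mtime(directory):
    """Return the latest modification time of any file under directory.

    Returns 0.0 if the directory contains no readable files.
    """
    newest = 0.0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                mtime = os.path.getmtime(os.path.join(root, name))
            except OSError:
                continue  # file vanished between listing and stat
            newest = max(newest, mtime)
    return newest


def is_stalled(directory, max_idle_seconds, now=None):
    """True if no file under directory changed within max_idle_seconds."""
    if now is None:
        now = time.time()
    return (now - newest_mtime(directory)) > max_idle_seconds


if __name__ == "__main__":
    # Hypothetical usage: python watchdog.py <output_dir> <idle_seconds>
    out_dir = sys.argv[1]
    limit = float(sys.argv[2])
    if is_stalled(out_dir, limit):
        print(f"no output in {out_dir} for more than {limit:.0f} s")
        sys.exit(1)
    print("output is still being written")
```

Run periodically (e.g. from cron), a non-zero exit status could be used to send a notification, so the run can be stopped and restarted from the last checkpoint. The threshold should be longer than the longest normal gap between outputs.
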

Hee Il



>
>
> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <heeilkim at gmail.com> wrote:
> >
> > Hi,
> >
> > This might not be an ET issue, but have you ever seen ET runs stop
> > producing any output (even stdout) while the processes are still
> > running?
> >
> > I have seen this issue on both new and old NVMe storage with various
> > versions of OpenMPI. It happened after more than a day of running.
> >
> > Oh, not all the processes are running. One process is in the Dl state,
> > so all output has stopped. Do you have any hints on this issue? There
> > are no specific limits set for the files. Other read/write tasks on the
> > disks are OK.
> >
> > Thanks for your help in advance.
> >
> > Hee Il
> >
> > _______________________________________________
> > Users mailing list
> > Users at einsteintoolkit.org
> > http://lists.einsteintoolkit.org/mailman/listinfo/users
>
>
>
> --
> Erik Schnetter <schnetter at cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>