[Users] Stopped making output
Hee Il Kim
heeilkim at gmail.com
Fri Apr 9 11:43:19 CDT 2021
On Fri, Apr 9, 2021, 23:45 Erik Schnetter <schnetter at cct.lsu.edu> wrote:
> Hee Il
> Yes, that has happened to me several times. Usually, the problem is
> either MPI or I/O.
Have you ever seen this when running with UCX?
> It might be that there is a file system problem, and one process is
> trying to write to a file, but is blocked indefinitely. The other
> processes then also stop making progress since they wait on
> It could also be that there is an MPI problem, either caused by a
> problem in the code, or by an error in the system, that makes MPI
I don't think I have seen the issue when using the 'sm' BTL. At least,
vader was used in all the problematic runs.
> In both cases, restarting from a checkpoint might solve the problem.
> If the problem is reproducible, then it would make sense to dig deeper
> to find out what's wrong, and whether there is a work-around (e.g.
> changing the grid structure a bit to avoid triggering the bug).
Yes, restarting could solve the issue.
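
Since restarting from a checkpoint is the suggested remedy, it can help to notice a stall early. The following is a minimal sketch, not part of the Einstein Toolkit: it treats a run as stalled when no file under its output directory has been modified for some time. The function names, the directory argument, and the threshold are all hypothetical and would need adapting to the actual run.

```python
import os
import time


def newest_mtime(directory):
    """Return the most recent modification time of any file under directory.

    Returns 0.0 if the directory contains no readable files.
    """
    newest = 0.0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                newest = max(newest, os.path.getmtime(os.path.join(root, name)))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return newest


def is_stalled(directory, max_age_seconds, now=None):
    """Return True if nothing under directory changed in max_age_seconds.

    An empty directory counts as stalled, because newest_mtime() is 0.0.
    The optional `now` argument exists only to make the check testable.
    """
    now = time.time() if now is None else now
    return now - newest_mtime(directory) > max_age_seconds
```

A check such as `is_stalled("path/to/output", 3600)` could be run periodically from a separate job. Note that on a file system that is itself hung, the check may block as well, so it is best run from a node that is not affected.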
> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <heeilkim at gmail.com> wrote:
> > Hi,
> > Though it might not be an ET issue, have you ever seen ET runs stop
> > producing any output (even stdout), even though the processes are
> > still running?
> > I have seen this issue on both new and old NVMe storage with various
> > versions of OpenMPI. It happened after more than a day of running.
> > Oh, not all the processes are running. One process is in the Dl state,
> > so all output stopped. Do you have any hints on this issue? There are
> > no specific limits set for the files. Other read/write tasks on the
> > disks are OK.
> > Thanks for your help in advance.
> > Hee Il
> > _______________________________________________
> > Users mailing list
> > Users at einsteintoolkit.org
> > http://lists.einsteintoolkit.org/mailman/listinfo/users
> Erik Schnetter <schnetter at cct.lsu.edu>
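
The original report mentions one process stuck in the Dl state, which in `ps` output means uninterruptible sleep (D) in a multi-threaded process (l). As a rough sketch for finding such processes on a Linux node, the code below reads the state field from `/proc/<pid>/stat`. The function names are hypothetical, and the script only identifies blocked processes; it does not say why they are blocked.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    head, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(head), comm, tail.split()[0]


def find_blocked(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep (state D)."""
    blocked = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the entry is unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in find_blocked():
        print(pid, comm)
```

Running this on each compute node while the job is hung would show which rank is blocked. A process that stays in state D across repeated checks is usually waiting inside the kernel, often on I/O.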