[Users] Stopped making output

Erik Schnetter schnetter at cct.lsu.edu
Fri Apr 9 09:45:44 CDT 2021


Hee Il

Yes, that has happened to me several times. Usually, the problem is
either MPI or I/O.

It might be that there is a file system problem, and one process is
trying to write to a file, but is blocked indefinitely. The other
processes then also stop making progress since they wait on
communication.
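
One way to check for this on a Linux node is to look for processes in
uninterruptible sleep (state "D" in /proc/<pid>/stat), which is the state
a process is in while blocked inside the kernel, typically on I/O. The
following is only a minimal sketch, not part of the Einstein Toolkit; the
function names are made up for illustration, and /proc/<pid>/wchan may be
empty or unreadable depending on the kernel and your permissions.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')'.
    left, right = line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command, wchan) for processes in uninterruptible sleep."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not a Linux system, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # the process exited meanwhile, or is unreadable
        if state != "D":
            continue
        try:
            # Kernel function the process is sleeping in, if exposed.
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = ""
        found.append((pid, comm, wchan))
    return found


if __name__ == "__main__":
    for pid, comm, wchan in blocked_processes():
        print(pid, comm, wchan or "(wchan unavailable)")
```

A process that shows up here on every invocation over several minutes is
probably stuck rather than just busy; short-lived "D" states are normal
during ordinary I/O.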

It could also be that there is an MPI problem, either caused by a
problem in the code, or by an error in the system, that makes MPI
hang.
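
To see where each process is waiting, it can help to collect stack traces
from the hung processes. The sketch below runs gdb in batch mode against
given process IDs; it assumes gdb is installed on the compute node and
that you are allowed to attach to your own processes. The helper names
are hypothetical. Attaching to a process that is stuck in the kernel can
itself block, hence the timeout.

```python
import subprocess
import sys


def gdb_backtrace_command(pid):
    """Build a gdb command line that prints all thread backtraces of pid."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtrace(pid, timeout=60):
    """Return gdb's output for pid, or a short note if gdb timed out."""
    try:
        result = subprocess.run(
            gdb_backtrace_command(pid),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "gdb did not finish within %d s for pid %d\n" % (timeout, pid)
    return result.stdout


if __name__ == "__main__":
    # Usage: python backtraces.py PID [PID ...]
    for arg in sys.argv[1:]:
        print("=== pid %s ===" % arg)
        print(dump_backtrace(int(arg)))
```

As a rough guide: if one process is inside a write or other file system
call while the others sit in MPI calls, that points to I/O; if all of
them sit in MPI calls, that points to MPI or to the code's communication
pattern.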

In both cases, restarting from a checkpoint might solve the problem.
If the problem is reproducible, then it would make sense to dig deeper
to find out what's wrong, and whether there is a work-around (e.g.
changing the grid structure a bit to avoid triggering the bug).

-erik



On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <heeilkim at gmail.com> wrote:
>
> Hi,
>
> This might not be an ET issue, but have you ever seen ET runs stop producing any output (even stdout), even though the processes are still running?
>
> I have seen this issue on both new and old NVMe storage, with various versions of OpenMPI. It happened in runs lasting more than a day.
>
> Oh, not all the processes are running. One process is in the Dl state, so all output stopped. Do you have any hints on this issue? There are no specific limits set for the files. Other read/write tasks on the disks are OK.
>
> Thanks for your help in advance.
>
> Hee Il



-- 
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/

