<div dir="auto"><div><div data-smartmail="gmail_signature">Thanks Erik.<br> </div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Apr 9, 2021, 23:45 Erik Schnetter <<a href="mailto:schnetter@cct.lsu.edu" rel="noreferrer noreferrer" target="_blank">schnetter@cct.lsu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hee Il<br>
<br>
Yes, that has happened to me several times. Usually, the problem is<br>
either MPI or I/O.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Ever experienced under UCX?</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
> It might be that there is a file system problem, and one process is
> trying to write to a file, but is blocked indefinitely. The other
> processes then also stop making progress since they wait on
> communication.
>
> It could also be that there is an MPI problem, either caused by a
> problem in the code, or by an error in the system, that makes MPI
> hang.

I don't think I have seen the issue when I use the 'sm' BTL; at least, vader was the BTL in use for all of the problematic runs.
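
For what it's worth, here is a rough sketch of how one could check for the I/O-blocking scenario Erik describes: list the ranks on a node that are sitting in uninterruptible (D) sleep. This is only an illustration, not part of the toolkit; it assumes psutil is installed, and the executable name "cactus_sim" is just a placeholder for whatever your build is called:

    # check_dstate.py -- hypothetical helper, not part of the Einstein Toolkit.
    # Lists processes that are in uninterruptible (D) sleep; a rank stuck here
    # while writing output would block every other rank that waits on MPI.
    import psutil

    EXE_HINT = "cactus_sim"  # assumed executable name; change to match your build

    def stuck_processes(hint=EXE_HINT):
        stuck = []
        for proc in psutil.process_iter(["pid", "name", "status"]):
            name = proc.info["name"] or ""
            if hint in name and proc.info["status"] == psutil.STATUS_DISK_SLEEP:
                stuck.append((proc.info["pid"], name))
        return stuck

    if __name__ == "__main__":
        for pid, name in stuck_processes():
            print(f"PID {pid} ({name}) is in D state, i.e. blocked in the kernel")

Any rank reported by this is blocked inside the kernel, usually on a file system call, which would match the picture of all the other ranks waiting on MPI.
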
> In both cases, restarting from a checkpoint might solve the problem.
> If the problem is reproducible, then it would make sense to dig deeper
> to find out what's wrong, and whether there is a work-around (e.g.
> changing the grid structure a bit to avoid triggering the bug).
>
> -erik

Yes, restarting could solve the issue.
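
Until the root cause is understood, a crude stall detector can at least tell me when it is time to kill the job and restart from the last checkpoint. Again just a sketch with made-up values: the log file name and the 30-minute threshold are assumptions, not anything the toolkit provides:

    # stall_watch.py -- hypothetical watchdog, not part of the Einstein Toolkit.
    # Flags the run as stalled if its stdout log has not been touched for a
    # while, which is the symptom here: all output stops at once.
    import os
    import sys
    import time

    LOG_FILE = "simulation.out"  # assumed name of the run's stdout file
    MAX_SILENCE = 30 * 60        # seconds of silence before we call it a stall

    def is_stalled(path=LOG_FILE, max_silence=MAX_SILENCE):
        try:
            age = time.time() - os.path.getmtime(path)
        except OSError:
            return False  # log file missing (run not started yet); nothing to judge
        return age > max_silence

    if __name__ == "__main__":
        if is_stalled():
            print(f"{LOG_FILE}: no output for over {MAX_SILENCE} s; "
                  "consider restarting from the last checkpoint")
            sys.exit(1)
        print(f"{LOG_FILE}: still being written")

Something like this could run from cron or the batch system, since checking the mtime of the stdout file is enough: when the run hangs, all output (including stdout) stops at once.
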

Hee Il

> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <heeilkim@gmail.com> wrote:
> >
> > Hi,
> >
> > Though it might not be an issue with ET itself: have you ever seen ET runs stop producing any output (not even stdout), even though the processes are still running?
> >
> > I have seen this issue on both new and old NVMe storage with various versions of OpenMPI. It happened after more than a day of running.
> >
> > Oh, actually not all the processes are running: one process is in the 'Dl' state (uninterruptible sleep), so all output has stopped. Do you have any hints on this issue? There are no specific limits set for the files, and other read/write tasks on the disks are fine.
> >
> > Thanks in advance for your help.
> >
> > Hee Il
>
> --
> Erik Schnetter <schnetter@cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/