[Users] Stopped making output
Erik Schnetter
schnetter at cct.lsu.edu
Fri Apr 9 11:55:32 CDT 2021
On Fri, Apr 9, 2021 at 12:43 PM Hee Il Kim <heeilkim at gmail.com> wrote:
>
> Thanks Erik.
>
>
> On Fri, Apr 9, 2021, 23:45 Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>>
>> Hee Il
>>
>> Yes, that has happened to me several times. Usually, the problem is
>> either MPI or I/O.
>
>
> Have you ever experienced this under UCX?
No, but I think UCX and MPI are about equivalent in this context.
-erik
>> It might be that there is a file system problem, and one process is
>> trying to write to a file, but is blocked indefinitely. The other
>> processes then also stop making progress since they wait on
>> communication.
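To check the I/O case on a Linux compute node, one option is to look for processes stuck in uninterruptible sleep (state "D" in ps; the "l" in "Dl" only marks the process as multi-threaded). The following is a minimal sketch, assuming a standard /proc file system; the helper names parse_stat_line and find_uninterruptible are made up for illustration and are not part of the Einstein Toolkit or of MPI.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the last ')'.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def find_uninterruptible(proc_root="/proc"):
    """List (pid, command) for processes currently in state 'D'."""
    blocked = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return blocked  # no procfs available (e.g. not a Linux system)
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directory names are process IDs
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

A process that is listed on every pass over several minutes is a likely candidate for a blocked write. Reading /proc/<pid>/wchan for that process (or /proc/<pid>/stack, which normally requires root) may then hint at which kernel call it is waiting in.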
>>
>> It could also be that there is an MPI problem, either caused by a
>> problem in the code, or by an error in the system, that makes MPI
>> hang.
>
>
> I don't think I have seen the issue when using the 'sm' BTL. At least, vader was used in all the problematic runs.
>
>>
>> In both cases, restarting from a checkpoint might solve the problem.
>> If the problem is reproducible, then it would make sense to dig deeper
>> to find out what's wrong, and whether there is a work-around (e.g.
>> changing the grid structure a bit to avoid triggering the bug).
>>
>> -erik
>
>
> Yes, restarting could solve the issue.
>
> Hee Il
>
>
>>
>>
>>
>> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <heeilkim at gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Though this might not be an ET issue, have you ever seen ET runs stop producing any output (even stdout) while the processes are still running?
>> >
>> > I have seen this issue on both new and old NVMe storage with various versions of OpenMPI. It happened in runs lasting more than a day.
>> >
>> > Oh, not all the processes are running. One process is in the Dl state, so all output stopped. Do you have any hints on this issue? There are no specific limits set for the files. Other read/write tasks on the disks are fine.
>> >
>> > Thanks for your help in advance.
>> >
>> > Hee Il
>> >
>> > _______________________________________________
>> > Users mailing list
>> > Users at einsteintoolkit.org
>> > http://lists.einsteintoolkit.org/mailman/listinfo/users
>>
>>
>>
>> --
>> Erik Schnetter <schnetter at cct.lsu.edu>
>> http://www.perimeterinstitute.ca/personal/eschnetter/
--
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/