[Users] Einstein Toolkit Meeting Reminder

Yosef Zlochower yosef at astro.rit.edu
Mon Jul 28 08:41:13 CDT 2014


On 07/28/2014 06:30 AM, Ian Hinder wrote:
>
> On 21 Jul 2014, at 21:52, Frank Loeffler <knarf at cct.lsu.edu> wrote:
>
>> - Yosef: has problems on stampede, trouble with MPI and checkpointing.
>>   Suggestion: attach a debugger to see where it hangs. Something might
>>   have changed in recent ET releases that makes the issues worse now, but
>>   it is not clear what is to blame. We might only now be triggering a
>>   problem with the stampede system.
>
> If it is a hang, then I would look at the IOScalar reduction output.  I had a hang, and attached GDB, and got the following backtrace on 2-May-2014:

My difficulty is that short runs never stall. A run that checkpointed
every few iterations worked fine. Long runs (> 4 hours) do stall, but it
is hard to catch the stall before the system kills the run (not so easy
when the queue time was up to 12 hours). I found that increasing the
number of nodes from 16 to 20 seems to "fix" the problem. The drawback
is that 20 nodes is not ideal for scaling (the runs use about 12GB per
node).
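
One way to catch a stall in time might be a small watchdog that polls a
file the run updates regularly and reports when the file stops changing.
The following is only a minimal sketch under stated assumptions: the
function names, the watched path and the idle threshold are all
hypothetical, and nothing in it is specific to Cactus, Carpet or
stampede.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return the seconds elapsed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(path, max_idle_seconds, now=None):
    """Return True if `path` has not been modified within the threshold."""
    return seconds_since_update(path, now) > max_idle_seconds


def watch(path, max_idle_seconds, poll_seconds=60):
    """Block until `path` stops being updated, then return the idle time.

    `path` is assumed to be a file the simulation rewrites or appends to
    at regular intervals (hypothetical; choose one that fits the run).
    """
    while not is_stalled(path, max_idle_seconds):
        time.sleep(poll_seconds)
    return seconds_since_update(path)
```

For example, watch("run.out", 1800) would return once the hypothetical
file run.out has gone 30 minutes without an update, at which point one
could attach to a hung process with gdb -p <pid> while the job is still
alive.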


>
>> MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x79c8c50, vc_req=0x1, receiving=-2076533752, is_blocking=1)
>>      at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:920
>> 920	    if (type == T_CHANNEL_NO_ARRIVE) {
>> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 gsl-intel13-1.15-1.x86_64 libgcc-4.4.7-4.el6.x86_64 libibmad-1.3.8-1.x86_64 libibumad-1.3.7-1.x86_64 libibverbs-1.1.4-1.24.gb89d4d7.x86_64 libmlx4-1.0.1-1.20.g6771d22.x86_64 libmthca-1.0.5-0.1.gbe5eef3.x86_64 librdmacm-1.0.15-1.x86_64 libstdc++-4.4.7-4.el6.x86_64 nss-softokn-freebl-3.14.3-9.el6.x86_64 zlib-1.2.3-29.el6.x86_64
>> (gdb) bt
>> #0  MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x79c8c50, vc_req=0x1, receiving=-2076533752, is_blocking=1)
>>      at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:920
>> #1  0x00002b9c83f01539 in MPIDI_CH3I_read_progress (vc_pptr=0x79c8c50, v_ptr=0x1, rdmafp_found=0x2b9c843a9c08, is_blocking=1)
>>      at src/mpid/ch3/channels/mrail/src/rdma/ch3_read_progress.c:148
>> #2  0x00002b9c83eff011 in cm_handle_pending_send (is_blocking=127700048, state=0x1)
>>      at src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:249
>> #3  MPIDI_CH3I_Progress (is_blocking=127700048, state=0x1) at src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:244
>> #4  0x00002b9c84035e9a in PMPI_Waitany (count=127700048, array_of_requests=0x1, index=0x2b9c843a9c08, status=0x1)
>>      at src/mpi/pt2pt/waitany.c:198
>> #5  0x00002b9c83fef027 in MPIR_Reduce_knomial_MV2 (sendbuf=0x79c8c50, recvbuf=0x1, count=1, datatype=1275069445, op=127410992,
>>      root=-800453280, comm_ptr=0x7b87320, errflag=0x7fffd04a1178) at src/mpi/coll/reduce_osu.c:985
>> #6  0x00002b9c83fef8ef in MPIR_Reduce_two_level_helper_MV2 (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1,
>>      op=127410992, root=-800453280, comm_ptr=0x7b86d68, errflag=0x2b9c83ff041b) at src/mpi/coll/reduce_osu.c:1267
>> #7  0x00002b9c83ff041b in MPIR_Reduce_MV2 (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, op=127410992,
>>      root=-800453280, comm_ptr=0x0, errflag=0x2b9c83f8a206) at src/mpi/coll/reduce_osu.c:1484
>> #8  0x00002b9c83f8a206 in MPIR_Reduce_impl (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, op=127410992,
>>      root=-800453280, comm_ptr=0x7fffd04a4ea0, errflag=0x2b9c83f8c01e) at src/mpi/coll/reduce.c:1029
>> #9  0x00002b9c83f8c01e in PMPI_Reduce (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, op=127410992, root=-800453280,
>>      comm=-800436576) at src/mpi/coll/reduce.c:1216
>> #10 0x0000000000b76bf7 in CarpetReduce::Finalise (cgh=0x79c8c50, proc=1, num_outvals=-2076533752, outvals=0x1, outtype=127410992,
>>      myoutvals=0x7fffd04a0d60, mycounts=0x1, red=0xb79231)
>>      at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetReduce/src/reduce.cc:954
>> #11 0x0000000000b79231 in CarpetReduce::ReduceGVs (cgh=0x79c8c50, proc=1, num_outvals=-2076533752, outtype=1, outvals=0x7982330,
>>      num_invars=-800453280, invars=0x9ae12b0, red=0x7fffd04a16b0, igrid=0)
>>      at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetReduce/src/reduce.cc:1584
>> #12 0x0000000000b7ad8d in CarpetReduce::maximum_GVs (cgh=0x79c8c50, proc=1, num_outvals=-2076533752, outtype=1, outvals=0x7982330,
>>      num_invars=-800453280, invars=0x200000000)
>>      at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetReduce/src/reduce.cc:1622
>> #13 0x00000000006eded0 in CCTK_Reduce (GH=0x79c8c50, proc=1, operation_handle=-2076533752, num_out_vals=1, type_out_vals=127410992,
>>      out_vals=0x7fffd04a0d60, num_in_fields=102) at /work/00915/hinder/Cactus/src/comm/Reduction.c:429
>> #14 0x00000000006d2f5a in CarpetIOScalar::OutputVarAs (cctkGH=0x79c8c50, varname=0x1 <Address 0x1 out of bounds>,
>>      alias=0x2b9c843a9c08 "\001", out_reductions=0x1 <Address 0x1 out of bounds>)
>>      at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetIOScalar/src/ioscalar.cc:461
>> #15 0x00000000006d1f0c in CarpetIOScalar::TriggerOutput (cctkGH=0x79c8c50, vindex=1)
>>      at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetIOScalar/src/ioscalar.cc:731
>> #16 0x00000000006d1cc3 in CarpetIOScalar::OutputGH (cctkGH=0x79c8c50)
>>      at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetIOScalar/src/ioscalar.cc:170
>> #17 0x0000000000bd2bee in Carpet::OutputGH (cctkGH=0x79c8c50)
>>      at /work/00915/hinder/Cactus/arrangements/Carpet/Carpet/src/OutputGH.cc:56
>> #18 0x0000000000bc81c2 in Carpet::CallAnalysis (cctkGH=0x79c8c50)
>>      at /work/00915/hinder/Cactus/arrangements/Carpet/Carpet/src/Evolve.cc:767
>> #19 0x0000000000bc70e3 in Carpet::Evolve (fc=0x79c8c50)
>>      at /work/00915/hinder/Cactus/arrangements/Carpet/Carpet/src/Evolve.cc:80
>> #20 0x0000000000575bc1 in main (argc=4, argv=0x7fffd04a5a28) at /work/00915/hinder/Cactus/src/main/flesh.cc:84
>>
>
>
> I would have been using Carpet 2ede3e9156b91bb5061802cdf9e50dca2bb2114d (01-Apr-2014) plus some local commits, so somewhere on the trunk between ET_2013_11 (Noether) and ET_2014_05 (Wheeler).
>


-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Associate Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office:74-2067
Phone: +1 585-475-6103

yosef at astro.rit.edu


