[Users] Einstein Toolkit Meeting Reminder

Ian Hinder ian.hinder at aei.mpg.de
Mon Jul 28 05:30:06 CDT 2014


On 21 Jul 2014, at 21:52, Frank Loeffler <knarf at cct.lsu.edu> wrote:

> - Josef: has problems on Stampede, trouble with MPI and checkpointing.
>  Suggestion: attach a debugger to see where it hangs. Something might
>  have changed in recent ET releases that makes the issues worse now, but
>  it is not clear what is to blame. We might just now be triggering a
>  problem with the Stampede system.

If it is a hang, I would look at the IOScalar reduction output. I had a hang myself, attached GDB to the hung process, and got the following backtrace on 2-May-2014:

> MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x79c8c50, vc_req=0x1, receiving=-2076533752, is_blocking=1)
>     at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:920
> 920	    if (type == T_CHANNEL_NO_ARRIVE) {
> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 gsl-intel13-1.15-1.x86_64 libgcc-4.4.7-4.el6.x86_64 libibmad-1.3.8-1.x86_64 libibumad-1.3.7-1.x86_64 libibverbs-1.1.4-1.24.gb89d4d7.x86_64 libmlx4-1.0.1-1.20.g6771d22.x86_64 libmthca-1.0.5-0.1.gbe5eef3.x86_64 librdmacm-1.0.15-1.x86_64 libstdc++-4.4.7-4.el6.x86_64 nss-softokn-freebl-3.14.3-9.el6.x86_64 zlib-1.2.3-29.el6.x86_64
> (gdb) bt
> #0  MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x79c8c50, vc_req=0x1, receiving=-2076533752, is_blocking=1)
>     at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:920
> #1  0x00002b9c83f01539 in MPIDI_CH3I_read_progress (vc_pptr=0x79c8c50, v_ptr=0x1, rdmafp_found=0x2b9c843a9c08, is_blocking=1)
>     at src/mpid/ch3/channels/mrail/src/rdma/ch3_read_progress.c:148
> #2  0x00002b9c83eff011 in cm_handle_pending_send (is_blocking=127700048, state=0x1)
>     at src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:249
> #3  MPIDI_CH3I_Progress (is_blocking=127700048, state=0x1) at src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:244
> #4  0x00002b9c84035e9a in PMPI_Waitany (count=127700048, array_of_requests=0x1, index=0x2b9c843a9c08, status=0x1)
>     at src/mpi/pt2pt/waitany.c:198
> #5  0x00002b9c83fef027 in MPIR_Reduce_knomial_MV2 (sendbuf=0x79c8c50, recvbuf=0x1, count=1, datatype=1275069445, op=127410992, 
>     root=-800453280, comm_ptr=0x7b87320, errflag=0x7fffd04a1178) at src/mpi/coll/reduce_osu.c:985
> #6  0x00002b9c83fef8ef in MPIR_Reduce_two_level_helper_MV2 (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, 
>     op=127410992, root=-800453280, comm_ptr=0x7b86d68, errflag=0x2b9c83ff041b) at src/mpi/coll/reduce_osu.c:1267
> #7  0x00002b9c83ff041b in MPIR_Reduce_MV2 (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, op=127410992, 
>     root=-800453280, comm_ptr=0x0, errflag=0x2b9c83f8a206) at src/mpi/coll/reduce_osu.c:1484
> #8  0x00002b9c83f8a206 in MPIR_Reduce_impl (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, op=127410992, 
>     root=-800453280, comm_ptr=0x7fffd04a4ea0, errflag=0x2b9c83f8c01e) at src/mpi/coll/reduce.c:1029
> #9  0x00002b9c83f8c01e in PMPI_Reduce (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, op=127410992, root=-800453280, 
>     comm=-800436576) at src/mpi/coll/reduce.c:1216
> #10 0x0000000000b76bf7 in CarpetReduce::Finalise (cgh=0x79c8c50, proc=1, num_outvals=-2076533752, outvals=0x1, outtype=127410992, 
>     myoutvals=0x7fffd04a0d60, mycounts=0x1, red=0xb79231)
>     at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetReduce/src/reduce.cc:954
> #11 0x0000000000b79231 in CarpetReduce::ReduceGVs (cgh=0x79c8c50, proc=1, num_outvals=-2076533752, outtype=1, outvals=0x7982330, 
>     num_invars=-800453280, invars=0x9ae12b0, red=0x7fffd04a16b0, igrid=0)
>     at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetReduce/src/reduce.cc:1584
> #12 0x0000000000b7ad8d in CarpetReduce::maximum_GVs (cgh=0x79c8c50, proc=1, num_outvals=-2076533752, outtype=1, outvals=0x7982330, 
>     num_invars=-800453280, invars=0x200000000)
>     at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetReduce/src/reduce.cc:1622
> #13 0x00000000006eded0 in CCTK_Reduce (GH=0x79c8c50, proc=1, operation_handle=-2076533752, num_out_vals=1, type_out_vals=127410992, 
>     out_vals=0x7fffd04a0d60, num_in_fields=102) at /work/00915/hinder/Cactus/src/comm/Reduction.c:429
> #14 0x00000000006d2f5a in CarpetIOScalar::OutputVarAs (cctkGH=0x79c8c50, varname=0x1 <Address 0x1 out of bounds>, 
>     alias=0x2b9c843a9c08 "\001", out_reductions=0x1 <Address 0x1 out of bounds>)
>     at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetIOScalar/src/ioscalar.cc:461
> #15 0x00000000006d1f0c in CarpetIOScalar::TriggerOutput (cctkGH=0x79c8c50, vindex=1)
>     at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetIOScalar/src/ioscalar.cc:731
> #16 0x00000000006d1cc3 in CarpetIOScalar::OutputGH (cctkGH=0x79c8c50)
>     at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetIOScalar/src/ioscalar.cc:170
> #17 0x0000000000bd2bee in Carpet::OutputGH (cctkGH=0x79c8c50)
>     at /work/00915/hinder/Cactus/arrangements/Carpet/Carpet/src/OutputGH.cc:56
> #18 0x0000000000bc81c2 in Carpet::CallAnalysis (cctkGH=0x79c8c50)
>     at /work/00915/hinder/Cactus/arrangements/Carpet/Carpet/src/Evolve.cc:767
> #19 0x0000000000bc70e3 in Carpet::Evolve (fc=0x79c8c50)
>     at /work/00915/hinder/Cactus/arrangements/Carpet/Carpet/src/Evolve.cc:80
> #20 0x0000000000575bc1 in main (argc=4, argv=0x7fffd04a5a28) at /work/00915/hinder/Cactus/src/main/flesh.cc:84
> 


At the time I would have been using Carpet commit 2ede3e9156b91bb5061802cdf9e50dca2bb2114d (01-Apr-2014) plus some local commits, i.e. somewhere on the trunk between ET_2013_11 (Noether) and ET_2014_05 (Wheeler).
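
For a hang like this it can help to collect a backtrace from every rank and check whether they are all waiting in the same collective. What follows is only a minimal Python sketch of that comparison, not something that was run for the report above. It assumes the GDB "bt" output of each rank has been saved as text, and the helper names (parse_frames, mpi_entry_point, group_ranks) are hypothetical. It treats the outermost frame whose name starts with MPI_ or PMPI_ as the MPI call made by the application, which is an assumption about how the MPI library names its public entry points.

```python
import re
from collections import defaultdict

# Matches GDB frame lines, with or without a leading "> " quote marker, e.g.
#   "#9  0x00002b9c83f8c01e in PMPI_Reduce (sendbuf=..."
#   "#0  MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=..."
FRAME_RE = re.compile(
    r"^\s*(?:>\s*)?#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?([\w:~<>]+) \("
)


def parse_frames(backtrace):
    """Return the function names of a GDB backtrace, innermost frame first."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(2))
    return frames


def mpi_entry_point(frames):
    """Return (mpi_call, caller) for the outermost MPI_/PMPI_ frame.

    Walks from the outermost frame (e.g. main) inwards, so library-internal
    calls such as PMPI_Waitany below the entry point are ignored.
    Returns None if the backtrace contains no MPI call.
    """
    for i in range(len(frames) - 1, -1, -1):
        if frames[i].startswith(("MPI_", "PMPI_")):
            caller = frames[i + 1] if i + 1 < len(frames) else None
            return frames[i], caller
    return None


def group_ranks(backtraces):
    """Group ranks by the MPI call they are blocked in.

    backtraces maps rank number -> backtrace text.
    """
    groups = defaultdict(list)
    for rank, text in sorted(backtraces.items()):
        groups[mpi_entry_point(parse_frames(text))].append(rank)
    return dict(groups)


# Three frames abbreviated from the backtrace above (arguments truncated).
sample = """\
#4  0x00002b9c84035e9a in PMPI_Waitany (count=127700048, ...)
#9  0x00002b9c83f8c01e in PMPI_Reduce (sendbuf=0x79c8c50, ...)
#10 0x0000000000b76bf7 in CarpetReduce::Finalise (cgh=0x79c8c50, ...)
"""
print(mpi_entry_point(parse_frames(sample)))
# prints ('PMPI_Reduce', 'CarpetReduce::Finalise')
```

If group_ranks shows some ranks in PMPI_Reduce called from CarpetReduce::Finalise and others in a different call, or in no MPI call at all, that would suggest the ranks diverged before the reduction rather than a problem inside the MPI library itself.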

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
