[Users] Einstein Toolkit Meeting Reminder
Ian Hinder
ian.hinder at aei.mpg.de
Mon Jul 28 05:30:06 CDT 2014
On 21 Jul 2014, at 21:52, Frank Loeffler <knarf at cct.lsu.edu> wrote:
> - Josef: has problems on Stampede, trouble with MPI and checkpointing
> suggestion: attach a debugger to see where it hangs. Something might
> have changed in recent ET releases that makes the issues worse now, but
> it is not clear what is to blame. We might only now be triggering a
> problem with the Stampede system.
If it is a hang, I would look at the IOScalar reduction output. I had a hang on 2-May-2014, attached GDB, and got the following backtrace, which shows the process waiting inside an MPI reduction called from CarpetIOScalar:
> MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x79c8c50, vc_req=0x1, receiving=-2076533752, is_blocking=1)
> at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:920
> 920 if (type == T_CHANNEL_NO_ARRIVE) {
> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 gsl-intel13-1.15-1.x86_64 libgcc-4.4.7-4.el6.x86_64 libibmad-1.3.8-1.x86_64 libibumad-1.3.7-1.x86_64 libibverbs-1.1.4-1.24.gb89d4d7.x86_64 libmlx4-1.0.1-1.20.g6771d22.x86_64 libmthca-1.0.5-0.1.gbe5eef3.x86_64 librdmacm-1.0.15-1.x86_64 libstdc++-4.4.7-4.el6.x86_64 nss-softokn-freebl-3.14.3-9.el6.x86_64 zlib-1.2.3-29.el6.x86_64
> (gdb) bt
> #0 MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x79c8c50, vc_req=0x1, receiving=-2076533752, is_blocking=1)
> at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:920
> #1 0x00002b9c83f01539 in MPIDI_CH3I_read_progress (vc_pptr=0x79c8c50, v_ptr=0x1, rdmafp_found=0x2b9c843a9c08, is_blocking=1)
> at src/mpid/ch3/channels/mrail/src/rdma/ch3_read_progress.c:148
> #2 0x00002b9c83eff011 in cm_handle_pending_send (is_blocking=127700048, state=0x1)
> at src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:249
> #3 MPIDI_CH3I_Progress (is_blocking=127700048, state=0x1) at src/mpid/ch3/channels/mrail/src/rdma/ch3_progress.c:244
> #4 0x00002b9c84035e9a in PMPI_Waitany (count=127700048, array_of_requests=0x1, index=0x2b9c843a9c08, status=0x1)
> at src/mpi/pt2pt/waitany.c:198
> #5 0x00002b9c83fef027 in MPIR_Reduce_knomial_MV2 (sendbuf=0x79c8c50, recvbuf=0x1, count=1, datatype=1275069445, op=127410992,
> root=-800453280, comm_ptr=0x7b87320, errflag=0x7fffd04a1178) at src/mpi/coll/reduce_osu.c:985
> #6 0x00002b9c83fef8ef in MPIR_Reduce_two_level_helper_MV2 (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1,
> op=127410992, root=-800453280, comm_ptr=0x7b86d68, errflag=0x2b9c83ff041b) at src/mpi/coll/reduce_osu.c:1267
> #7 0x00002b9c83ff041b in MPIR_Reduce_MV2 (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, op=127410992,
> root=-800453280, comm_ptr=0x0, errflag=0x2b9c83f8a206) at src/mpi/coll/reduce_osu.c:1484
> #8 0x00002b9c83f8a206 in MPIR_Reduce_impl (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, op=127410992,
> root=-800453280, comm_ptr=0x7fffd04a4ea0, errflag=0x2b9c83f8c01e) at src/mpi/coll/reduce.c:1029
> #9 0x00002b9c83f8c01e in PMPI_Reduce (sendbuf=0x79c8c50, recvbuf=0x1, count=-2076533752, datatype=1, op=127410992, root=-800453280,
> comm=-800436576) at src/mpi/coll/reduce.c:1216
> #10 0x0000000000b76bf7 in CarpetReduce::Finalise (cgh=0x79c8c50, proc=1, num_outvals=-2076533752, outvals=0x1, outtype=127410992,
> myoutvals=0x7fffd04a0d60, mycounts=0x1, red=0xb79231)
> at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetReduce/src/reduce.cc:954
> #11 0x0000000000b79231 in CarpetReduce::ReduceGVs (cgh=0x79c8c50, proc=1, num_outvals=-2076533752, outtype=1, outvals=0x7982330,
> num_invars=-800453280, invars=0x9ae12b0, red=0x7fffd04a16b0, igrid=0)
> at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetReduce/src/reduce.cc:1584
> #12 0x0000000000b7ad8d in CarpetReduce::maximum_GVs (cgh=0x79c8c50, proc=1, num_outvals=-2076533752, outtype=1, outvals=0x7982330,
> num_invars=-800453280, invars=0x200000000)
> at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetReduce/src/reduce.cc:1622
> #13 0x00000000006eded0 in CCTK_Reduce (GH=0x79c8c50, proc=1, operation_handle=-2076533752, num_out_vals=1, type_out_vals=127410992,
> out_vals=0x7fffd04a0d60, num_in_fields=102) at /work/00915/hinder/Cactus/src/comm/Reduction.c:429
> #14 0x00000000006d2f5a in CarpetIOScalar::OutputVarAs (cctkGH=0x79c8c50, varname=0x1 <Address 0x1 out of bounds>,
> alias=0x2b9c843a9c08 "\001", out_reductions=0x1 <Address 0x1 out of bounds>)
> at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetIOScalar/src/ioscalar.cc:461
> #15 0x00000000006d1f0c in CarpetIOScalar::TriggerOutput (cctkGH=0x79c8c50, vindex=1)
> at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetIOScalar/src/ioscalar.cc:731
> #16 0x00000000006d1cc3 in CarpetIOScalar::OutputGH (cctkGH=0x79c8c50)
> at /work/00915/hinder/Cactus/arrangements/Carpet/CarpetIOScalar/src/ioscalar.cc:170
> #17 0x0000000000bd2bee in Carpet::OutputGH (cctkGH=0x79c8c50)
> at /work/00915/hinder/Cactus/arrangements/Carpet/Carpet/src/OutputGH.cc:56
> #18 0x0000000000bc81c2 in Carpet::CallAnalysis (cctkGH=0x79c8c50)
> at /work/00915/hinder/Cactus/arrangements/Carpet/Carpet/src/Evolve.cc:767
> #19 0x0000000000bc70e3 in Carpet::Evolve (fc=0x79c8c50)
> at /work/00915/hinder/Cactus/arrangements/Carpet/Carpet/src/Evolve.cc:80
> #20 0x0000000000575bc1 in main (argc=4, argv=0x7fffd04a5a28) at /work/00915/hinder/Cactus/src/main/flesh.cc:84
>
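For reference, a backtrace like the one above can also be collected non-interactively on a compute node. The following is only a minimal sketch in Python, assuming gdb is on the PATH and that you are allowed to attach to the process; the helper names gdb_backtrace_command and collect_backtrace are made up for illustration and are not part of Cactus or Carpet.

```python
import subprocess


def gdb_backtrace_command(pid, all_threads=True):
    """Build a gdb command line that attaches to pid and prints a backtrace.

    Hypothetical helper for illustration only.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    # "thread apply all bt" prints every thread; plain "bt" prints only the
    # currently selected thread.
    bt = "thread apply all bt" if all_threads else "bt"
    # -p attaches to a running process; -batch makes gdb exit once the
    # -ex commands have run, which also detaches from the process.
    return ["gdb", "-p", str(pid), "-batch", "-ex", bt]


def collect_backtrace(pid, timeout=60):
    """Run gdb against pid and return whatever it printed to stdout."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Calling collect_backtrace(pid) for the Cactus process on each node, and comparing which frames the different ranks are stuck in, is one way to see whether all of them have reached the same reduction.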
At the time I would have been using Carpet 2ede3e9156b91bb5061802cdf9e50dca2bb2114d (01-Apr-2014) plus some local commits, i.e. somewhere on the trunk between ET_2013_11 (Noether) and ET_2014_05 (Wheeler).
--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder