[ET Trac] [Einstein Toolkit] #534: Checkpointing fails on LoneStar

Einstein Toolkit trac-noreply at einsteintoolkit.org
Fri Aug 26 07:15:41 CDT 2011


#534: Checkpointing fails on LoneStar
--------------------------+-------------------------------------------------
 Reporter:  hinder        |       Owner:  eschnett
     Type:  defect        |      Status:  new     
 Priority:  major         |   Milestone:          
Component:  Carpet        |     Version:          
 Keywords:  CarpetIOHDF5  |  
--------------------------+-------------------------------------------------
 Hi,

 I am starting to run a production simulation on LoneStar, but find that
 checkpointing never completes.  I see:

   INFO (CarpetIOHDF5): ---------------------------------------------------------
   INFO (CarpetIOHDF5): Dumping periodic checkpoint at iteration 9876, simulation time 18.5175
   INFO (CarpetIOHDF5): ---------------------------------------------------------

 on stdout, and there is nothing on stderr.  The checkpoint files are
 partially written:

   -rw------- 1 hinder G-25181 61M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_0.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_10.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_11.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_12.h5
   -rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_13.h5
   -rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_14.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_15.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_16.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_17.h5
   -rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_18.h5
   -rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_19.h5
   -rw------- 1 hinder G-25181 60M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_1.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_20.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_21.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_22.h5
   -rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_23.h5
   -rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_24.h5
   -rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_25.h5
   -rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_26.h5
   -rw------- 1 hinder G-25181 57M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_27.h5
   -rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_28.h5
   -rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_29.h5
   -rw------- 1 hinder G-25181 60M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_2.h5
   -rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_30.h5
   -rw------- 1 hinder G-25181 57M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_31.h5
   -rw------- 1 hinder G-25181 59M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_3.h5
   -rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_4.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_5.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_6.h5
   -rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_7.h5
   -rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_8.h5
   -rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_9.h5

 and invalid:

   c334-106$ h5ls checkpoint.chkpt.tmp.it_9876.file_0.h5
   checkpoint.chkpt.tmp.it_9876.file_0.h5: unable to open file
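
 For triaging many temporary files at once without the HDF5 tools, a
 minimal sketch along the following lines may help.  It only looks for the
 8-byte HDF5 format signature at the offsets where a superblock may start
 (0, 512, 1024, ...).  The function names and the default glob pattern are
 hypothetical and not part of Cactus or CarpetIOHDF5.  A file that passes
 can still be unreadable if its metadata was never flushed, so a pass here
 does not show that the checkpoint is usable.

```python
import glob

# The 8-byte HDF5 format signature. The superblock may start at offset 0
# or at 512, 1024, 2048, ... (successive doublings).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path, max_offset=1 << 20):
    """Return True if the HDF5 signature is found at a legal superblock offset.

    max_offset is an arbitrary cut-off for this sketch, not an HDF5 limit.
    """
    with open(path, "rb") as f:
        offset = 0
        while offset <= max_offset:
            f.seek(offset)
            if f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Candidate offsets are 0, 512, 1024, 2048, ...
            offset = 512 if offset == 0 else offset * 2
    return False


def report(pattern="checkpoint.chkpt.tmp.it_*.file_*.h5"):
    """Print one status line per file matching the (hypothetical) pattern."""
    for name in sorted(glob.glob(pattern)):
        status = "signature ok" if has_hdf5_signature(name) else "NO HDF5 signature"
        print(f"{name}: {status}")
```

 Calling report() in the checkpoint directory prints one line per file.
 Files without a signature were certainly not written far enough to be
 opened; files with one still need to be checked with h5ls.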

 The job and the processes are all still running.  Logging into one of the
 nodes and attaching gdb to the Cactus process yields:

   0x00002b8eb3f6a287 in MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff50b96cb0, vbuf_ptr=0x7fff50b96cb8) at ibv_channel_manager.c:367
   367               if (*head && vc->mrail.rfp.p_RDMA_recv != vc->mrail.rfp.p_RDMA_recv_tail)
   (gdb) bt
   #0  0x00002b8eb3f6a287 in MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff50b96cb0, vbuf_ptr=0x7fff50b96cb8) at ibv_channel_manager.c:367
   #1  0x00002b8eb3f0426c in MPIDI_CH3I_read_progress (vc_pptr=0x7fff50b96cb0, v_ptr=0x7fff50b96cb8, is_blocking=259164064) at ch3_read_progress.c:130
   #2  0x00002b8eb3f023dd in MPIDI_CH3I_Progress (is_blocking=1354329264, state=0x7fff50b96cb8) at ch3_progress.c:206
   #3  0x00002b8eb3f6852c in MPIC_Wait (request_ptr=0x7fff50b96cb0) at helper_fns.c:518
   #4  0x00002b8eb3f67e9c in MPIC_Recv (buf=0x7fff50b96cb0, count=1354329272, datatype=259164064, source=30, tag=0, comm=-788556288, status=0x1)
     at helper_fns.c:76
   #5  0x00002b8eb3eee47e in MPIR_Bcast_OSU (buffer=0x7fff50b96cb0, count=1354329272, datatype=259164064, root=30, comm_ptr=0x0) at bcast_osu.c:283
   #6  0x00002b8eb3eed0c6 in PMPI_Bcast (buffer=0x7fff50b96cb0, count=1354329272, datatype=259164064, root=30, comm=0) at bcast.c:1274
   #7  0x0000000000c12fc4 in CarpetIOHDF5::WriteVarChunkedParallel (cctkGH=0x7fff50b96cb0, outfile=1354329272, io_bytes=@0xf7287a0, request=0x1e,
     called_from_checkpoint=false, indexfile=-788556288, $q8=<value optimized out>, $q9=<value optimized out>, $r0=<value optimized out>,
     $r1=<value optimized out>, $r2=<value optimized out>, $r3=<value optimized out>)
     at /work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/Output.cc:519
   #8  0x0000000000bf3f7f in CarpetIOHDF5::Checkpoint (cctkGH=0x7fff50b96cb0, called_from=1354329272, $W3=<value optimized out>, $W4=<value optimized out>)
     at /work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/CarpetIOHDF5.cc:973
   #9  0x0000000000bf360b in CarpetIOHDF5::CarpetIOHDF5_EvolutionCheckpoint (cctkGH=0x7fff50b96cb0)
     at /work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/CarpetIOHDF5.cc:186
   #10 0x0000000000413a5f in CCTK_CallFunction (function=0x7fff50b96cb0, fdata=0x7fff50b96cb8, data=0xf7287a0)
     at /work/00915/hinder/Cactus/llama/src/main/ScheduleInterface.c:291
   #11 0x00000000011c2eb1 in Carpet::CallFunction (function=0x7fff50b96cb0, attribute=0x7fff50b96cb8, data=0xf7287a0, $01=<value optimized out>,
     $04=<value optimized out>, $05=<value optimized out>) at /work/00915/hinder/Cactus/llama/arrangements/Carpet/Carpet/src/CallFunction.cc:135
   #12 0x0000000000418dea in CCTKi_ScheduleCallFunction (function=0x7fff50b96cb0, attribute=0x7fff50b96cb8, data=0xf7287a0)
     at /work/00915/hinder/Cactus/llama/src/main/ScheduleInterface.c:2826
   #13 0x000000000041bc26 in CCTKi_DoScheduleTraverse (group_name=0x7fff50b96cb0 "", item_entry=0x7fff50b96cb8, item_exit=0xf7287a0, while_check=0x1e,
     if_check=0, function_process=0x2aabd0ff9600, data=0x7fff50b97848) at /work/00915/hinder/Cactus/llama/src/schedule/ScheduleTraverse.c:158
   #14 0x0000000000414f19 in CCTK_ScheduleTraverse (where=0x7fff50b96cb0 "", GH=0x7fff50b96cb8, CallFunction=0xf7287a0)
     at /work/00915/hinder/Cactus/llama/src/main/ScheduleInterface.c:812
   #15 0x000000000116f6c0 in Carpet::CallAnalysis (cctkGH=0x7fff50b96cb0, $=2=<value optimized out>)
     at /work/00915/hinder/Cactus/llama/arrangements/Carpet/Carpet/src/Evolve.cc:556
   #16 0x000000000116e755 in Carpet::Evolve (fc=0x7fff50b96cb0, $<1=<value optimized out>)
     at /work/00915/hinder/Cactus/llama/arrangements/Carpet/Carpet/src/Evolve.cc:81
   #17 0x000000000040ccc5 in main (argc=4, argv=0x7fff50b98888) at /work/00915/hinder/Cactus/llama/src/main/flesh.cc:84

 I'm not sure why CarpetIOHDF5 is performing MPI calls.  This is with the
 stable version of Carpet.

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/534>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit