[ET Trac] [Einstein Toolkit] #534: Checkpointing fails on LoneStar
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Fri Aug 26 07:15:41 CDT 2011
#534: Checkpointing fails on LoneStar
--------------------------+-------------------------------------------------
 Reporter:  hinder        |       Owner:  eschnett
     Type:  defect        |      Status:  new
 Priority:  major         |   Milestone:
Component:  Carpet        |     Version:
 Keywords:  CarpetIOHDF5  |
--------------------------+-------------------------------------------------
Hi,
I am starting to run on LoneStar, but find that I cannot checkpoint. This
is for a production simulation. I see:
INFO (CarpetIOHDF5): ---------------------------------------------------------
INFO (CarpetIOHDF5): Dumping periodic checkpoint at iteration 9876, simulation time 18.5175
INFO (CarpetIOHDF5): ---------------------------------------------------------
on stdout, and there is nothing on stderr. The checkpoint files are
partially written:
-rw------- 1 hinder G-25181 61M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_0.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_10.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_11.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_12.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_13.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_14.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_15.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_16.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_17.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_18.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_19.h5
-rw------- 1 hinder G-25181 60M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_1.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_20.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_21.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_22.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_23.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_24.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_25.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_26.h5
-rw------- 1 hinder G-25181 57M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_27.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_28.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_29.h5
-rw------- 1 hinder G-25181 60M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_2.h5
-rw------- 1 hinder G-25181 58M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_30.h5
-rw------- 1 hinder G-25181 57M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_31.h5
-rw------- 1 hinder G-25181 59M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_3.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_4.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_5.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_6.h5
-rw------- 1 hinder G-25181 47M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_7.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_8.h5
-rw------- 1 hinder G-25181 46M Aug 25 15:58 checkpoint.chkpt.tmp.it_9876.file_9.h5
and invalid:
c334-106$ h5ls checkpoint.chkpt.tmp.it_9876.file_0.h5
checkpoint.chkpt.tmp.it_9876.file_0.h5: unable to open file
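The two checks above (which files exist, and whether they open) can be scripted so they are easy to repeat. The following is a minimal Python sketch, not part of Cactus or HDF5: `summarize` and `has_hdf5_signature` are hypothetical helper names, the count of 32 files is inferred from the listing, and reading `.tmp` in the name as "not yet finalized" is an assumption. The signature test only looks at the first 8 bytes, so a file can pass it and still be rejected by h5ls.

```python
import re
from pathlib import Path

# Name pattern taken from the listing; the optional ".tmp" part is assumed
# to mark a checkpoint that has not been finalized yet.
CHKPT_RE = re.compile(r"^checkpoint\.chkpt(\.tmp)?\.it_(\d+)\.file_(\d+)\.h5$")

# First 8 bytes of an HDF5 file that has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def summarize(names, expected_files):
    """Group checkpoint file names by iteration and list missing indices."""
    summary = {}
    for name in names:
        m = CHKPT_RE.match(name)
        if m is None:
            continue  # not a checkpoint file
        tmp, iteration, index = m.group(1), int(m.group(2)), int(m.group(3))
        entry = summary.setdefault(iteration, {"indices": set(), "tmp": False})
        entry["indices"].add(index)
        entry["tmp"] = entry["tmp"] or tmp is not None
    for entry in summary.values():
        entry["missing"] = sorted(set(range(expected_files)) - entry["indices"])
    return summary


def has_hdf5_signature(path):
    """Return True if the file starts with the HDF5 signature.

    This is a necessary condition only; it does not prove the file is valid.
    """
    with open(path, "rb") as f:
        return f.read(8) == HDF5_SIGNATURE


if __name__ == "__main__":
    files = sorted(Path(".").glob("checkpoint.chkpt*.h5"))
    names = [p.name for p in files]
    # 32 is the number of files in the listing (file_0 .. file_31).
    for iteration, entry in summarize(names, expected_files=32).items():
        state = "temporary" if entry["tmp"] else "final"
        print(f"it {iteration}: {len(entry['indices'])} files ({state}), "
              f"missing {entry['missing']}")
    for p in files:
        if not has_hdf5_signature(p):
            print(f"{p.name}: no HDF5 signature at offset 0")
```

Run in the checkpoint directory, this prints one line per iteration and flags any file that does not even start with the HDF5 signature.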
The job and the processes are all still running. Logging into one of the
nodes and attaching gdb to the Cactus process yields:
0x00002b8eb3f6a287 in MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff50b96cb0, vbuf_ptr=0x7fff50b96cb8) at ibv_channel_manager.c:367
367         if (*head && vc->mrail.rfp.p_RDMA_recv != vc->mrail.rfp.p_RDMA_recv_tail)
(gdb) bt
#0  0x00002b8eb3f6a287 in MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff50b96cb0, vbuf_ptr=0x7fff50b96cb8) at ibv_channel_manager.c:367
#1  0x00002b8eb3f0426c in MPIDI_CH3I_read_progress (vc_pptr=0x7fff50b96cb0, v_ptr=0x7fff50b96cb8, is_blocking=259164064) at ch3_read_progress.c:130
#2  0x00002b8eb3f023dd in MPIDI_CH3I_Progress (is_blocking=1354329264, state=0x7fff50b96cb8) at ch3_progress.c:206
#3  0x00002b8eb3f6852c in MPIC_Wait (request_ptr=0x7fff50b96cb0) at helper_fns.c:518
#4  0x00002b8eb3f67e9c in MPIC_Recv (buf=0x7fff50b96cb0, count=1354329272, datatype=259164064, source=30, tag=0, comm=-788556288, status=0x1)
    at helper_fns.c:76
#5  0x00002b8eb3eee47e in MPIR_Bcast_OSU (buffer=0x7fff50b96cb0, count=1354329272, datatype=259164064, root=30, comm_ptr=0x0) at bcast_osu.c:283
#6  0x00002b8eb3eed0c6 in PMPI_Bcast (buffer=0x7fff50b96cb0, count=1354329272, datatype=259164064, root=30, comm=0) at bcast.c:1274
#7  0x0000000000c12fc4 in CarpetIOHDF5::WriteVarChunkedParallel (cctkGH=0x7fff50b96cb0, outfile=1354329272, io_bytes=@0xf7287a0, request=0x1e, called_from_checkpoint=false, indexfile=-788556288, $q8=<value optimized out>, $q9=<value optimized out>, $r0=<value optimized out>, $r1=<value optimized out>, $r2=<value optimized out>, $r3=<value optimized out>)
    at /work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/Output.cc:519
#8  0x0000000000bf3f7f in CarpetIOHDF5::Checkpoint (cctkGH=0x7fff50b96cb0, called_from=1354329272, $W3=<value optimized out>, $W4=<value optimized out>)
    at /work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/CarpetIOHDF5.cc:973
#9  0x0000000000bf360b in CarpetIOHDF5::CarpetIOHDF5_EvolutionCheckpoint (cctkGH=0x7fff50b96cb0)
    at /work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/CarpetIOHDF5.cc:186
#10 0x0000000000413a5f in CCTK_CallFunction (function=0x7fff50b96cb0, fdata=0x7fff50b96cb8, data=0xf7287a0)
    at /work/00915/hinder/Cactus/llama/src/main/ScheduleInterface.c:291
#11 0x00000000011c2eb1 in Carpet::CallFunction (function=0x7fff50b96cb0, attribute=0x7fff50b96cb8, data=0xf7287a0, $01=<value optimized out>, $04=<value optimized out>, $05=<value optimized out>)
    at /work/00915/hinder/Cactus/llama/arrangements/Carpet/Carpet/src/CallFunction.cc:135
#12 0x0000000000418dea in CCTKi_ScheduleCallFunction (function=0x7fff50b96cb0, attribute=0x7fff50b96cb8, data=0xf7287a0)
    at /work/00915/hinder/Cactus/llama/src/main/ScheduleInterface.c:2826
#13 0x000000000041bc26 in CCTKi_DoScheduleTraverse (group_name=0x7fff50b96cb0 "", item_entry=0x7fff50b96cb8, item_exit=0xf7287a0, while_check=0x1e, if_check=0, function_process=0x2aabd0ff9600, data=0x7fff50b97848)
    at /work/00915/hinder/Cactus/llama/src/schedule/ScheduleTraverse.c:158
#14 0x0000000000414f19 in CCTK_ScheduleTraverse (where=0x7fff50b96cb0 "", GH=0x7fff50b96cb8, CallFunction=0xf7287a0)
    at /work/00915/hinder/Cactus/llama/src/main/ScheduleInterface.c:812
#15 0x000000000116f6c0 in Carpet::CallAnalysis (cctkGH=0x7fff50b96cb0, $=2=<value optimized out>)
    at /work/00915/hinder/Cactus/llama/arrangements/Carpet/Carpet/src/Evolve.cc:556
#16 0x000000000116e755 in Carpet::Evolve (fc=0x7fff50b96cb0, $<1=<value optimized out>)
    at /work/00915/hinder/Cactus/llama/arrangements/Carpet/Carpet/src/Evolve.cc:81
#17 0x000000000040ccc5 in main (argc=4, argv=0x7fff50b98888) at /work/00915/hinder/Cactus/llama/src/main/flesh.cc:84
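A trace this long is easier to compare across processes when reduced to frame number, function, and source location. The following is a minimal Python sketch for doing that with pasted gdb output; `parse_backtrace` is a hypothetical helper (not a gdb feature), the two regular expressions are assumptions about the `bt` layout shown here, and the sample is an abridged copy of frames #6 and #7 with most arguments removed. It tolerates the line wrapping that mail clients introduce.

```python
import re

# A frame starts with "#N  0xADDR in FUNCTION (". Pasted output is often
# re-wrapped, so the whitespace between tokens may contain newlines.
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s*\(")
# Source location "at FILE:LINE"; the path may have wrapped onto the next line.
LOCATION_RE = re.compile(r"\bat\s+(\S+):(\d+)")


def parse_backtrace(text):
    """Return a list of (frame_number, function, file, line) tuples."""
    starts = list(FRAME_RE.finditer(text))
    frames = []
    for i, m in enumerate(starts):
        # A frame's text runs up to the start of the next frame.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        locations = LOCATION_RE.findall(text[m.end():end])
        if locations:
            path, line = locations[-1][0], int(locations[-1][1])
        else:
            path, line = None, None  # frame without source information
        frames.append((int(m.group(1)), m.group(2), path, line))
    return frames


# Abridged sample: frames #6 and #7 from the trace, arguments shortened.
SAMPLE = """\
#6  0x00002b8eb3eed0c6 in PMPI_Bcast (buffer=0x7fff50b96cb0,
count=1354329272) at bcast.c:1274
#7  0x0000000000c12fc4 in CarpetIOHDF5::WriteVarChunkedParallel
(cctkGH=0x7fff50b96cb0)
at
/work/00915/hinder/Cactus/llama/arrangements/Carpet/CarpetIOHDF5/src/Output.cc:519
"""

if __name__ == "__main__":
    for number, function, path, line in parse_backtrace(SAMPLE):
        print(f"#{number} {function} at {path}:{line}")
```

Applied to traces from several processes, the reduced form makes it simple to check whether they are all waiting at the same source line.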
I'm not sure why CarpetIOHDF5 is performing MPI calls. This is with the
stable version of Carpet.
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/534>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit