[ET Trac] [Einstein Toolkit] #1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate jobs that hang
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Mon Mar 9 05:48:55 CDT 2015
#1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate
jobs that hang
------------------------------------+---------------------------------------
Reporter: dradice@… | Owner: dradice@…
Type: enhancement | Status: reviewed_ok
Priority: optional | Milestone:
Component: EinsteinToolkit thorn | Version: development version
Resolution: | Keywords:
------------------------------------+---------------------------------------
Comment (by rhaas):
All are fine with me. All my patches were just suggestions and an
illustration of what I had in mind.
I did not put the heartbeat file into the running directory since normally
all Cactus output should go into {{{out_dir}}}, and I ignored the fact that
this actually makes it hard for a script to read the heartbeat file.
Outputting to file more frequently should work fine if the walltime spent
on the number of iterations to wait is much smaller than the walltime
between checkpoints (or, if one checkpoints every so many iterations, if
that number is a divisor of the {{{checkpoint_every}}} parameter). The code
as written by me actually has a bug: it uses the last time the
{{{timestamp}}} variable was updated for {{{then}}}, but it should instead
use the last time the output file was written to. I am a bit wary about
saying that it is not expensive to write an int to a text file. I agree
that writing the int once the file is open is fast. I worry that opening
the file is actually slow (it may take several seconds) on large Lustre
file systems where the (single) metadata server is the bottleneck.
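To illustrate the fix for the bug described above, here is a minimal
sketch in Python (not the thorn's actual code) of throttling heartbeat
output by the time of the last file write rather than the last timestamp
update. The names HeartbeatWriter, maybe_write and min_interval are
hypothetical and chosen only for this example; last_write plays the role
of {{{then}}}.

```python
import time


class HeartbeatWriter:
    """Write a heartbeat file at most once per `min_interval` seconds.

    Hypothetical helper for illustration only; not part of WatchDog.
    """

    def __init__(self, path, min_interval, clock=time.monotonic):
        self.path = path
        self.min_interval = min_interval
        self.clock = clock        # injectable so the logic can be tested
        self.last_write = None    # plays the role of `then`

    def maybe_write(self, iteration):
        """Write `iteration` to the file if enough time has passed.

        Returns True if the file was written, False if the call was skipped.
        """
        now = self.clock()
        if (self.last_write is not None
                and now - self.last_write < self.min_interval):
            # Too soon since the last write: skip the (possibly slow) open.
            return False
        with open(self.path, "w") as f:
            f.write(f"{iteration}\n")
        # Update the reference time only after an actual write, not on
        # every call; updating it on every call is the bug described above.
        self.last_write = now
        return True
```

Skipping the open() call entirely on most iterations is what keeps the
load on the metadata server low; the write itself is cheap once the file
is open.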
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1751#comment:7>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit