[ET Trac] [Einstein Toolkit] #1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate jobs that hang

Einstein Toolkit trac-noreply at einsteintoolkit.org
Mon Mar 9 05:48:55 CDT 2015


#1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate
jobs that hang
------------------------------------+---------------------------------------
  Reporter:  dradice@…              |       Owner:  dradice@…          
      Type:  enhancement            |      Status:  reviewed_ok        
  Priority:  optional               |   Milestone:                     
 Component:  EinsteinToolkit thorn  |     Version:  development version
Resolution:                         |    Keywords:                     
------------------------------------+---------------------------------------

Comment (by rhaas):

 All are fine with me. All my patches were just suggestions and an
 illustration of what I had in mind.

 I did not put the heartbeat file into the running directory since normally
 all Cactus output should go into {{{out_dir}}}, and I ignored the fact that
 this actually makes it hard for a script to read the heartbeat file.

 Outputting to file more frequently should work fine if the number of
 iterations to wait corresponds to much less than the walltime between
 checkpoints (or, if one checkpoints every so many iterations, is a divisor
 of the {{{checkpoint_every}}} parameter). The code as written by me actually
 has a bug: it uses the last time the {{{timestamp}}} variable was updated
 for {{{then}}}, but it should instead use the last time the output file was
 written to. I am a bit wary of saying that it is not expensive to write an
 int to a text file. I agree that writing the int once the file is open is
 fast. I worry that opening the file is actually slow (it may take several
 seconds) on large Lustre file systems where the (single) metadata server is
 the bottleneck.
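
 To illustrate the fix described above, here is a minimal sketch (in Python
 rather than the thorn's own language, and not the actual WatchDog code) of
 a heartbeat writer that measures the elapsed time against the last time the
 file was actually written, so the file is opened at most once per interval.
 The class name {{{Heartbeat}}}, the {{{min_interval}}} argument and the file
 name are assumptions made up for this example.

```python
import os
import tempfile
import time


class Heartbeat:
    """Rate-limited heartbeat writer (illustrative sketch only)."""

    def __init__(self, path, min_interval):
        self.path = path
        self.min_interval = min_interval  # minimum seconds between file writes
        self.last_write = None            # time of the last write to the file

    def beat(self, now=None):
        """Record a heartbeat; return True if the file was written."""
        if now is None:
            now = time.time()
        # Compare against the last file write, not the last call to beat().
        # Comparing against the last call would never trigger a write when
        # beat() is called more often than min_interval.
        if self.last_write is not None and now - self.last_write < self.min_interval:
            return False
        # Opening the file is the potentially slow step, so it happens
        # only when a write is actually due.
        with open(self.path, "w") as f:
            f.write(f"{int(now)}\n")
        self.last_write = now
        return True


if __name__ == "__main__":
    # Hypothetical usage: call beat() once per iteration of the main loop.
    demo_path = os.path.join(tempfile.mkdtemp(), "heartbeat.txt")
    hb = Heartbeat(demo_path, min_interval=2.0)
    for _ in range(5):
        hb.beat()
        time.sleep(0.5)
```

 With this structure the cost of opening the file is bounded by the chosen
 interval rather than by how often the heartbeat routine is called.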

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1751#comment:7>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

