[ET Trac] [Einstein Toolkit] #1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate jobs that hang

Einstein Toolkit trac-noreply at einsteintoolkit.org
Sun Mar 8 17:17:35 CDT 2015


#1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate
jobs that hang
------------------------------------+---------------------------------------
  Reporter:  dradice@…              |       Owner:  dradice@…          
      Type:  enhancement            |      Status:  reviewed_ok        
  Priority:  optional               |   Milestone:                     
 Component:  EinsteinToolkit thorn  |     Version:  development version
Resolution:                         |    Keywords:                     
------------------------------------+---------------------------------------

Comment (by dradice@…):

 Hello, thank you for the additional patches. However, there are some things
 I would like to change in them:

 * The "everything is fine" message is very useful to see when (if) the
 watchdog thread is actually executed by the OS. Moving it to the main
 thread defeats its purpose in my opinion.
 * We might want to write the heartbeat output much more frequently than
 every half of the check time: large jobs might "hang" for a long time
 while checkpointing. I would suggest writing to the heartbeat file more
 often, say every few iterations (writing an int to a text file is really
 not that expensive anyway).
 * I would also move the heartbeat file to the running directory, so that
 the runscript does not need to know the name of the output directory.
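
 To make the last two points concrete, here is a rough sketch of what I have
 in mind. It is written in Python only for illustration and is not the thorn's
 code; the file name `heartbeat.txt`, the interval of 4 iterations, the
 one-hour timeout, and both function names are assumptions made up for this
 example.

```python
import os
import time

# Hypothetical names and values, chosen only for this sketch.
HEARTBEAT_FILE = "heartbeat.txt"  # relative path: lands in the running directory
HEARTBEAT_EVERY = 4               # write every few iterations


def write_heartbeat(iteration, path=HEARTBEAT_FILE, every=HEARTBEAT_EVERY):
    """Write the iteration number to the heartbeat file every `every` iterations.

    Returns True if the file was written, False if this iteration was skipped.
    """
    if iteration % every != 0:
        return False
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(f"{iteration}\n")
    # Rename over the old file so a reader never sees a half-written heartbeat.
    os.replace(tmp_path, path)
    return True


def heartbeat_is_stale(path=HEARTBEAT_FILE, timeout_s=3600.0, now=None):
    """Runscript side: True if the heartbeat is missing or older than timeout_s."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True
    if now is None:
        now = time.time()
    return (now - mtime) > timeout_s
```

 Because the path is relative, the runscript only has to look in the
 directory it started the job from. The timeout on the runscript side should
 still be chosen longer than the slowest expected checkpoint.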

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1751#comment:6>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

