[ET Trac] [Einstein Toolkit] #1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate jobs that hang

Einstein Toolkit trac-noreply at einsteintoolkit.org
Fri Mar 6 23:17:54 CST 2015


#1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate
jobs that hang
------------------------------------+---------------------------------------
  Reporter:  dradice@…              |       Owner:  dradice@…          
      Type:  enhancement            |      Status:  assigned           
  Priority:  optional               |   Milestone:                     
 Component:  EinsteinToolkit thorn  |     Version:  development version
Resolution:                         |    Keywords:                     
------------------------------------+---------------------------------------

Comment (by dradice@…):

 Replying to [comment:2 rhaas]:

 Thank you for reviewing my patch.

 > * the thorn needs documentation in the standard documentation.tex file,
 this can be very short, I would think that the description in the pull
 request plus the usual boilerplate text will be sufficient

 I updated the pull request: the patch now includes the documentation as
 well.

 > * the thorn has no test case however since it would have to test for an
 abort, a test case may be hard to design

 I wouldn't know how to create a unit test for the WatchDog thorn. However,
 the code is simple enough that it is easy to check that it operates as
 intended. Whether calling "abort()" results in the run being cleared from
 the queuing system without leaving "zombies" is a much more delicate
 issue, because it depends on the reason why the run is hanging in the
 first place and on the details of the specific machine. I do not see a way
 to really test this. The WatchDog thorn is meant to avoid burning
 allocations on dead jobs, but I do not think one can guarantee that it
 will always work: users should use it at their own risk.
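
 To make the "simple enough to check by inspection" point concrete, here
 is a minimal sketch of the general heartbeat pattern a watchdog thread can
 follow. It is not the thorn's actual source: the names watchdog_state,
 watchdog_beat, watchdog_is_stale and watchdog_main are hypothetical, and
 using check_every as both the polling interval and the timeout is an
 assumption made for brevity.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical names throughout; this is NOT the WatchDog thorn's code. */
typedef struct {
  pthread_mutex_t lock;  /* protects last_beat */
  time_t last_beat;      /* last time the main thread reported progress */
  unsigned check_every;  /* seconds between checks (assumed also the timeout) */
} watchdog_state;

/* Pure decision: has the main thread been silent for longer than timeout? */
static int watchdog_is_stale(time_t last_beat, time_t now, unsigned timeout)
{
  return difftime(now, last_beat) > (double)timeout;
}

/* Called by the main thread, e.g. once per iteration. */
static void watchdog_beat(watchdog_state *st)
{
  pthread_mutex_lock(&st->lock);
  st->last_beat = time(NULL);
  pthread_mutex_unlock(&st->lock);
}

/* Body of the secondary thread: sleep, check the heartbeat, abort if stale. */
static void *watchdog_main(void *arg)
{
  watchdog_state *st = arg;
  for (;;) {
    sleep(st->check_every);
    pthread_mutex_lock(&st->lock);
    time_t last = st->last_beat;
    pthread_mutex_unlock(&st->lock);
    if (watchdog_is_stale(last, time(NULL), st->check_every)) {
      /* Plain stdio only: no Cactus calls from the secondary thread. */
      fprintf(stderr, "WatchDog: no progress detected, aborting\n");
      abort();
    }
    fprintf(stderr, "WatchDog: everything is fine\n");
  }
  return NULL;
}
```

 In such a sketch the thread would be started once with
 pthread_create(&tid, NULL, watchdog_main, &st), and the main thread would
 call watchdog_beat(&st) from a regularly scheduled routine. Whether
 abort() then actually frees the allocation still depends on the machine,
 as discussed above.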

 > * the thorn uses fprintf(stderr, ...) for both error and informational
 messages. For informational messages (the "Everything is fine" message) it
 could use CCTK_VInfo() in the main thread's ANALYSIS routine. The warnings
 cannot be changed since they are emitted by the secondary thread and
 Cactus is not thread safe.

 "Everything is fine" is also a message from the watchdog thread, so, like
 the warnings, it cannot be relayed using the CCTK_* functions.

 > * reading the man-page for asctime_r I do not think that the explicit
 zero termination in line 32 and 49 is required since asctime null
 terminates its output (which is also guaranteed to fit within 26
 characters).

 The explicit zero is not there for termination: I am using it to remove
 the trailing newline character.
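
 For illustration, this is roughly what stripping the newline looks like.
 asctime_r writes a string of the form "Thu Jan  1 00:00:00 1970\n", so the
 newline has to be overwritten if the timestamp is to be embedded in a
 longer message. The helper name format_timestamp is hypothetical and the
 code is a sketch, not a copy of the lines in the patch.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: write a timestamp for t into buf WITHOUT the
 * trailing newline that asctime_r appends. buf must hold at least 26
 * bytes. Returns buf, or NULL on failure. Uses POSIX localtime_r and
 * asctime_r, which are safe to call from a secondary thread. */
static char *format_timestamp(time_t t, char buf[26])
{
  struct tm tm;
  if (localtime_r(&t, &tm) == NULL)
    return NULL;
  if (asctime_r(&tm, buf) == NULL)
    return NULL;
  /* Overwrite the first '\n' (if any) with the string terminator. */
  buf[strcspn(buf, "\n")] = '\0';
  return buf;
}
```

 A message can then be printed on a single line, for example with
 fprintf(stderr, "[%s] WatchDog: ...\n", format_timestamp(time(NULL), buf)).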

 > * if possible, the thorn should check for the presence of PTHREADS,
 currently since PTHREADS is a Cactus extras, the only way to do so seems
 to be at compile time via:
 > {{{
 > #ifndef CCTK_PTHREADS
 > #error "WATCHDOG requires PTHREADS. Please enable PTHREADS=yes in your option list."
 > #endif
 > }}}

 This is fixed in the new version of the pull request.

 > * it may be interesting to make check_every STEERABLE=ALWAYS by
 resetting it inside the ANALYSIS routine (and protecting access to it by
 the mutex).

 This is certainly possible, but it seems an unneeded complication to me.
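
 For reference, the suggested change would amount to something like the
 following sketch: the main thread copies the (possibly steered) parameter
 into a shared variable under a mutex, and the watchdog thread reads it the
 same way before each sleep. All names here (interval_lock, check_interval,
 set_check_interval, get_check_interval) and the initial value are
 hypothetical.

```c
#include <pthread.h>

/* Hypothetical shared state; names are not taken from the thorn. */
static pthread_mutex_t interval_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned check_interval = 3600; /* seconds; arbitrary initial value */

/* Main thread: publish the current value of the parameter. */
static void set_check_interval(unsigned seconds)
{
  pthread_mutex_lock(&interval_lock);
  check_interval = seconds;
  pthread_mutex_unlock(&interval_lock);
}

/* Watchdog thread: read the interval before each sleep. */
static unsigned get_check_interval(void)
{
  pthread_mutex_lock(&interval_lock);
  unsigned seconds = check_interval;
  pthread_mutex_unlock(&interval_lock);
  return seconds;
}
```

 Note that a new value would only take effect after the sleep already in
 progress has finished.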

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1751#comment:3>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

