[ET Trac] [Einstein Toolkit] #1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate jobs that hang
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Fri Mar 6 23:17:54 CST 2015
#1751: [Pull request: CactusUtils/WatchDog] new thorn to automatically terminate
jobs that hang
------------------------------------+---------------------------------------
Reporter: dradice@… | Owner: dradice@…
Type: enhancement | Status: assigned
Priority: optional | Milestone:
Component: EinsteinToolkit thorn | Version: development version
Resolution: | Keywords:
------------------------------------+---------------------------------------
Comment (by dradice@…):
Replying to [comment:2 rhaas]:
Thank you for reviewing my patch.
> * the thorn needs documentation in the standard documentation.tex file,
>   this can be very short, I would think that the description in the pull
>   request plus the usual boilerplate text will be sufficient
I updated the pull request: now my patch contains the documentation as
well.
> * the thorn has no test case however since it would have to test for an
>   abort, a test case may be hard to design
I do not know how to create a unit test for the WatchDog thorn. However,
the code is simple enough that it is easy to check that it operates as
intended. Whether calling "abort()" results in the run being cleared from
the queuing system without leaving "zombies" is a much more delicate
issue, because it depends on the reason why the run is hanging in the
first place and on the details of the specific machine. I do not see a way
to really test this. The WatchDog thorn is meant to avoid burning
allocations on dead jobs, but I do not think one can guarantee that it
will always work: users should use it at their own risk.
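For illustration only, the kind of check discussed here can be sketched as follows. This is a minimal sketch and not the code from the pull request: it assumes that the main thread bumps a counter from a scheduled routine and that a secondary thread aborts the run if the counter has not changed between two checks. The names wd_kick, wd_made_progress and wd_thread are hypothetical.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical state shared between the main thread and the watchdog. */
static pthread_mutex_t wd_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long wd_heartbeat = 0; /* bumped by the main thread */
static unsigned long wd_last_seen = 0; /* touched only by the watchdog */

/* Main thread: signal progress, e.g. once per iteration (assumption). */
void wd_kick(void)
{
  pthread_mutex_lock(&wd_mutex);
  ++wd_heartbeat;
  pthread_mutex_unlock(&wd_mutex);
}

/* Watchdog: returns 1 if the heartbeat changed since the previous check. */
int wd_made_progress(void)
{
  pthread_mutex_lock(&wd_mutex);
  unsigned long now = wd_heartbeat;
  pthread_mutex_unlock(&wd_mutex);

  int progressed = (now != wd_last_seen);
  wd_last_seen = now;
  return progressed;
}

/* Watchdog thread body; arg points to the check interval in seconds. */
void *wd_thread(void *arg)
{
  unsigned int check_every = *(const unsigned int *)arg;
  for (;;) {
    sleep(check_every);
    if (!wd_made_progress()) {
      /* Plain stdio only: the Cactus CCTK_* calls are not thread safe. */
      fprintf(stderr, "WatchDog: no progress in %u s, aborting\n",
              check_every);
      abort();
    }
    fprintf(stderr, "WatchDog: everything is fine\n");
  }
  return NULL;
}
```

In such a sketch the thread would be started once with pthread_create, passing a pointer to the interval. Whether abort() then actually frees the allocation still depends on the machine and on why the run hangs, so the sketch does not remove the caveat above.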
> * the thorn uses fprintf(stderr, ...) for both error and informational
>   messages. For informational messages (the "Everything is fine" message) it
>   could use CCTK_VInfo() in the main thread's ANALYSIS routine. The warnings
>   cannot be changed since they are emitted by the secondary thread and
>   Cactus is not thread safe.
"Everything is fine" is a message from the watchdog thread. It cannot be
relayed using the CCTK_* functions.
> * reading the man-page for asctime_r I do not think that the explicit
>   zero termination in line 32 and 49 is required since asctime null
>   terminates its output (which is also guaranteed to fit within 26
>   characters).
The explicit termination is intentional: it removes the trailing newline
character that asctime_r writes.
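As a small illustration of the point about the newline: asctime_r writes a 24-character timestamp followed by '\n' and the terminating zero, so a timestamp meant to be embedded in a log line needs the newline stripped. The helper below is a hypothetical sketch, not the code at the lines mentioned in the review; the name format_timestamp and the use of UTC are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Formats t (as UTC) into buf, which must hold at least 26 bytes,
 * and strips the trailing newline written by asctime_r.
 * On failure buf is set to the empty string. */
char *format_timestamp(time_t t, char buf[26])
{
  struct tm tm_buf;
  if (gmtime_r(&t, &tm_buf) == NULL || asctime_r(&tm_buf, buf) == NULL) {
    buf[0] = '\0';
    return buf;
  }
  /* Overwrite the first '\n' (if any) with the string terminator. */
  buf[strcspn(buf, "\n")] = '\0';
  return buf;
}
```

Using strcspn avoids hard-coding the position of the newline; writing a zero at a fixed index would be an equivalent choice given the fixed asctime format.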
> * if possible, the thorn should check for the presence of PTHREADS,
>   currently since PTHREADS is a Cactus extras, the only way to do so seems
>   to be at compile time via:
> {{{
> #ifndef CCTK_PTHREADS
> #error "WATCHDOG requires PTHREADS. Please enable PTHREADS=yes in your option list."
> #endif
> }}}
This is fixed in the new version of the pull request.
> * it may be interesting to make check_every STEERABLE=ALWAYS by
>   resetting it inside the ANALYSIS routine (and protecting access to it by
>   the mutex).
This is certainly possible, but it seems an unneeded complication to me.
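For reference, the suggestion amounts to something like the sketch below. It is not part of the pull request: the variable shared_check_every, its default of 60 seconds, and the helpers interval_set and interval_get are hypothetical names used only to show the mutex-protected hand-off between the two threads.

```c
#include <pthread.h>

/* Hypothetical shared copy of the check interval, guarded by a mutex. */
static pthread_mutex_t interval_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned int shared_check_every = 60; /* seconds; assumed default */

/* Main thread (e.g. from the ANALYSIS routine): publish the current
 * value of the steered parameter. */
void interval_set(unsigned int seconds)
{
  pthread_mutex_lock(&interval_mutex);
  shared_check_every = seconds;
  pthread_mutex_unlock(&interval_mutex);
}

/* Watchdog thread: read the interval before each sleep. */
unsigned int interval_get(void)
{
  pthread_mutex_lock(&interval_mutex);
  unsigned int seconds = shared_check_every;
  pthread_mutex_unlock(&interval_mutex);
  return seconds;
}
```

A new value would only take effect after the sleep that is already in progress has finished, which is part of the extra complication mentioned above.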
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1751#comment:3>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit