[ET Trac] [Einstein Toolkit] #1327: Delay subsequent restarts in the case of certain problems

Fri Apr 19 05:17:04 CDT 2013

#1327: Delay subsequent restarts in the case of certain problems
-------------------------+--------------------------------------------------
 Reporter:  hinder       |       Owner:  eschnett
     Type:  enhancement  |      Status:  new     
 Priority:  major        |   Milestone:          
Component:  SimFactory   |     Version:          
 Keywords:               |  
-------------------------+--------------------------------------------------
 If one restart of a simulation exits abnormally, e.g. because of a
 transient problem on the cluster, all subsequent restarts might run into
 the same problem.  If we can distinguish terminations caused by internal
 problems (i.e. numerical or code-related) from those caused by external
 problems (e.g. MPI errors or filesystem issues), we can handle each case
 differently.  Possible actions are:

 1. Continue as normal with the next restart;
 2. Delay the next restart for a few hours, in the hope that the transient
 cluster problems are resolved;
 3. Hold the next restart and notify the user by email that an
 unrecoverable error has occurred.
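
 As a rough illustration of how the three actions above might translate
 into scheduling decisions, here is a minimal Python sketch.  The function
 name, the action strings and the three-hour default are assumptions made
 for this example; none of them are existing SimFactory interfaces.

```python
from datetime import datetime, timedelta


def plan_next_restart(action, now, delay=timedelta(hours=3)):
    """Return (start_not_before, hold, notify_user) for the next restart.

    `action` is one of the hypothetical strings "continue", "delay" or
    "hold", corresponding to options 1, 2 and 3 in the ticket.
    """
    if action == "continue":
        # Option 1: the next restart may start as soon as it is scheduled.
        return now, False, False
    if action == "delay":
        # Option 2: wait a few hours (assumed default: 3) before starting.
        return now + delay, False, False
    if action == "hold":
        # Option 3: do not start at all, and flag that the user should be
        # notified (sending the email itself is outside this sketch).
        return None, True, True
    raise ValueError(f"unknown action: {action!r}")
```

 For example, plan_next_restart("delay", datetime(2013, 4, 19, 5, 17))
 would give a start time of 08:17 on the same day, with no hold and no
 notification.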

 The outcome could be communicated by an exit code (either the regular
 process exit status or an exit code file).  Distinguishing between
 actions 2 and 3 could be achieved by regular expression matching on the
 standard output or standard error file, which would keep the mechanism
 independent of Cactus: Cactus would only have to say "good" or "bad", and
 SimFactory would then decide whether "bad" means delay or hold, based on
 some logic in its machine database.
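
 The decision logic described above could look roughly like the following
 Python sketch.  It assumes that an exit code of 0 means "good" and that
 the machine database supplies a list of regular expressions marking
 transient, external failures.  The names Action, TRANSIENT_PATTERNS and
 classify_restart, as well as the example patterns, are hypothetical and
 not part of SimFactory.

```python
import re
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"  # option 1: run the next restart as normal
    DELAY = "delay"        # option 2: postpone the next restart
    HOLD = "hold"          # option 3: hold the restart and notify the user


# Hypothetical patterns for transient external problems.  In the scheme
# proposed in the ticket these would come from the machine database.
TRANSIENT_PATTERNS = [
    r"Stale file handle",
    r"Input/output error",
    r"Connection timed out",
]


def classify_restart(exit_code, output_text, transient_patterns=TRANSIENT_PATTERNS):
    """Map Cactus's "good"/"bad" result plus its output to an Action."""
    if exit_code == 0:
        # Cactus said "good".
        return Action.CONTINUE
    # Cactus said "bad": look for signs of a transient cluster problem in
    # the standard output or standard error text.
    for pattern in transient_patterns:
        if re.search(pattern, output_text):
            return Action.DELAY
    # No known transient pattern matched, so treat the error as
    # unrecoverable.
    return Action.HOLD
```

 Treating unmatched failures as "hold" is a conservative choice made for
 this sketch; a machine entry could equally list patterns for known
 unrecoverable errors and delay by default.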

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1327>
Einstein Toolkit <http://einsteintoolkit.org>