[ET Trac] [Einstein Toolkit] #1327: Delay subsequent restarts in the case of certain problems
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Fri Apr 19 05:17:04 CDT 2013
#1327: Delay subsequent restarts in the case of certain problems
-------------------------+--------------------------------------------------
 Reporter:  hinder       |       Owner:  eschnett
     Type:  enhancement  |      Status:  new
 Priority:  major        |   Milestone:
Component:  SimFactory   |     Version:
 Keywords:               |
-------------------------+--------------------------------------------------
If one restart of a simulation exits abnormally, e.g. due to some
transient problem on a cluster, all subsequent restarts might run into
the same problem. If we can distinguish between terminations due to
internal (i.e. numerical or code-related) problems and those due to
external (MPI errors, filesystem issues) problems, we can respond
differently to each.
Possible actions could be:
1. Continue as normal with the next restart;
2. Delay the next restart for a few hours, in the hope that the transient
cluster problems are resolved;
3. Hold the next restart and notify the user by email that an
unrecoverable error has occurred.
The outcome could be communicated through exit codes, either the process
exit status itself or an exit code file. Distinguishing between actions 2
and 3 could be done by matching regular expressions against the standard
output or standard error file, which would keep the mechanism independent
of Cactus: Cactus would only have to say "good" or "bad", and SimFactory
would then decide whether "bad" means delay or hold, based on some logic
in its machine database.
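
A minimal sketch of the decision logic described above, in Python. It is
not existing SimFactory code: the names Action, TRANSIENT_PATTERNS and
decide_action are hypothetical, the patterns are illustrative examples
rather than entries from the machine database, and it assumes that an
exit code of 0 means "good" and anything else means "bad".

```python
import re
from enum import Enum


class Action(Enum):
    """The three possible reactions listed in the ticket."""
    CONTINUE = "continue"  # 1. carry on with the next restart
    DELAY = "delay"        # 2. wait a few hours before the next restart
    HOLD = "hold"          # 3. hold the next restart and notify the user


# Hypothetical patterns marking a failure as external and transient.
# In the proposal these would come from the machine database.
TRANSIENT_PATTERNS = [
    r"Stale file handle",
    r"Input/output error",
    r"Connection timed out",
]


def decide_action(exit_code, output_text, patterns=TRANSIENT_PATTERNS):
    """Map a finished restart onto one of the three actions.

    exit_code   -- assumed convention: 0 is "good", non-zero is "bad"
    output_text -- contents of the standard output and/or error file
    patterns    -- regular expressions identifying transient problems
    """
    if exit_code == 0:
        return Action.CONTINUE
    # "bad": delay if the output matches a known transient problem
    for pattern in patterns:
        if re.search(pattern, output_text):
            return Action.DELAY
    # "bad" with no known transient cause: treat as unrecoverable
    return Action.HOLD
```

With this split, Cactus only supplies the exit code, and everything
machine-specific lives in the pattern list. Defaulting to hold when no
pattern matches is one possible choice; it avoids repeatedly resubmitting
a job that fails for a code-related reason.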
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1327>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit