[ET Trac] [Einstein Toolkit] #1286: SimFactory should not run queued chained jobs if a previous job fails
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Mon Mar 11 06:23:06 CDT 2013
#1286: SimFactory should not run queued chained jobs if a previous job fails
-------------------------+--------------------------------------------------
Reporter: hinder | Owner: eschnett
Type: enhancement | Status: new
Priority: major | Milestone:
Component: SimFactory | Version:
Keywords: |
-------------------------+--------------------------------------------------
When a simulation consists of multiple chained jobs, the failure of one
job is likely to lead to the failure of subsequent jobs. Possible reasons
for failure of a job include:
1. Running out of disk quota;
2. An error in the code;
3. A numerical problem;
4. A problem with the cluster.
Of these, only the last might be recovered from simply by running the
next job in the chain. Even then, a job that starts immediately is likely
to fail as well, because the cluster problem may not yet have been
resolved.
As a result, to avoid wasting CPU hours on the remaining jobs in the
chain, I think SimFactory should hold or remove the subsequent chained
jobs. Removing the jobs would probably be simpler, and users can always
run "submit" again to restart them.
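
To make the proposed behaviour concrete, here is a minimal sketch in
Python. It is not SimFactory code: the function names, the list of job
ids and the "cancel" callback are all hypothetical, and it assumes that
a nonzero exit code marks a failed job. A real implementation would have
to call the queueing system's own job-removal command.

```python
from typing import Callable, List, Sequence


def cancel_remaining_jobs(
    chain: Sequence[str],
    failed_job: str,
    cancel: Callable[[str], None],
) -> List[str]:
    """Cancel every job queued after failed_job; return the cancelled ids.

    chain is the ordered list of job ids in the chain (hypothetical
    representation). cancel is a caller-supplied function that removes
    one job from the queue.
    """
    index = chain.index(failed_job)  # raises ValueError if not in the chain
    remaining = list(chain[index + 1:])
    for job_id in remaining:
        cancel(job_id)
    return remaining


def on_job_finished(
    chain: Sequence[str],
    job_id: str,
    exit_code: int,
    cancel: Callable[[str], None],
) -> List[str]:
    """Leave the chain alone on success; cancel the rest on failure."""
    if exit_code == 0:
        return []
    return cancel_remaining_jobs(chain, job_id, cancel)


if __name__ == "__main__":
    removed: List[str] = []
    # Second job of four fails, so the third and fourth are removed.
    on_job_finished(["101", "102", "103", "104"], "102", 1, removed.append)
    print(removed)
```

Passing the removal step in as a callback keeps the decision logic
separate from any particular queueing system, so the same check could
hold the jobs instead of removing them.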
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1286>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit