[ET Trac] [Einstein Toolkit] #1286: SimFactory should not run queued chained jobs if a previous job fails

Mon Mar 11 06:23:06 CDT 2013

#1286: SimFactory should not run queued chained jobs if a previous job fails
-------------------------+--------------------------------------------------
 Reporter:  hinder       |       Owner:  eschnett
     Type:  enhancement  |      Status:  new     
 Priority:  major        |   Milestone:          
Component:  SimFactory   |     Version:          
 Keywords:               |  
-------------------------+--------------------------------------------------
 When a simulation consists of multiple chained jobs, the failure of one
 job is likely to lead to the failure of subsequent jobs.  Possible reasons
 for failure of a job include:

 1. Running out of disk quota;
 2. An error in the code;
 3. A numerical problem;
 4. A problem with the cluster.

 Of these, only the last might be recovered from by simply running the
 next job in the chain.  Even then, a job that starts immediately is
 likely to fail as well, because the cluster problem may not yet have
 been resolved.

 To avoid wasting CPU hours on the remaining jobs in the chain, I think
 SimFactory should hold or remove the subsequent chained jobs when a job
 fails.  Removing them would probably be simpler, and users can always
 run "submit" again to resubmit them.
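
 As a rough sketch of the "remove" option, the logic could look like the
 following.  This is not SimFactory code: the function name, its
 arguments, and the "cancel" hook are all hypothetical, and the hook
 stands in for whatever the queueing system provides to delete a queued
 job.

```python
from typing import Callable, Iterable, List


def cancel_remaining_jobs(
    exit_ok: bool,
    queued_job_ids: Iterable[str],
    cancel: Callable[[str], None],
) -> List[str]:
    """Cancel every still-queued job in a chain if the previous job failed.

    All names here are hypothetical; `cancel` is an assumed hook that
    removes one job from the queue given its ID.
    Returns the IDs that were cancelled (empty if the job succeeded).
    """
    if exit_ok:
        # The previous job finished cleanly, so leave the chain alone.
        return []

    cancelled = []
    for job_id in queued_job_ids:
        cancel(job_id)
        cancelled.append(job_id)
    return cancelled


# Example: the previous job failed, so both queued successors are removed.
removed: List[str] = []
cancel_remaining_jobs(False, ["1002", "1003"], removed.append)
```

 Passing the cancel action in as an argument keeps the decision (has the
 previous job failed?) separate from the queue-specific command, which
 differs between clusters.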

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1286>
Einstein Toolkit <http://einsteintoolkit.org>