[ET Trac] [Einstein Toolkit] #1286: SimFactory should not run queued chained jobs if a previous job fails

Mon Mar 11 10:05:45 CDT 2013

#1286: SimFactory should not run queued chained jobs if a previous job fails
--------------------------+-------------------------------------------------
  Reporter:  hinder       |       Owner:  eschnett
      Type:  enhancement  |      Status:  new     
  Priority:  major        |   Milestone:          
 Component:  SimFactory   |     Version:          
Resolution:               |    Keywords:          
--------------------------+-------------------------------------------------

Comment (by knarf):

 Replying to [comment:3 eschnett]:
 > That may be too simplistic. If a simulation runs out of queue time,
 would that count as "success"? Would you expect Simfactory to continue
 chaining jobs in this case?

 I guess most people would answer "yes" here, although I can see that this
 could also be a problem (for example, if no checkpoint was written because
 the corresponding parameter was set incorrectly).

 > What if different MPI processes return different exit codes?

 I would assume that the overall mpirun-like command returns a non-zero
 exit code in that case. It may turn out that Cactus should explicitly
 write its exit status to a file (when it does exit properly).
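 The idea of combining the launcher's exit code with a file written on a
 proper exit could be sketched as follows. This is only an illustration
 under assumptions: the marker file name CLEAN_EXIT and the function
 run_and_check are hypothetical and are not existing Cactus or SimFactory
 features, and Cactus would first have to be taught to write such a file.

```python
import subprocess
from pathlib import Path

# Hypothetical marker name; not an existing Cactus or SimFactory convention.
MARKER_NAME = "CLEAN_EXIT"


def run_and_check(command, run_dir):
    """Run the launcher command and report whether the chain may continue.

    Returns True only if the command exited with status 0 AND the
    (assumed) clean-exit marker file exists in run_dir afterwards.
    """
    marker = Path(run_dir) / MARKER_NAME
    # Remove a stale marker from an earlier attempt so it cannot mask a failure.
    marker.unlink(missing_ok=True)
    result = subprocess.run(command, cwd=run_dir)
    return result.returncode == 0 and marker.is_file()
```

 Requiring both conditions means that an unreliable launcher exit code
 alone is not enough to let the next chained job start.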

 > What is, in general the exit code of mpirun anyway?

 It very likely depends on more things than we would like.

 > What would you do if a simulation runs out of time? out of memory? out of
 disk space? What if there is a file permission error and the simulation
 can't write? What if the Cactus executable never actually starts because
 something is wrong?

 I agree that catching all of these correctly will be quite a lot of work.
 That doesn't mean we couldn't start working on some and leave others for
 later.
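 Handling some cases now and leaving others for later could look roughly
 like the sketch below. Everything in it is an assumption made for
 illustration: the Outcome names, the table, and should_continue are
 hypothetical and not part of SimFactory. Outcomes that are not in the
 table yet fall back to a default, chosen here as "continue" on the
 assumption that this matches the behaviour the ticket describes.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical job outcomes; the list is deliberately incomplete."""
    CLEAN_EXIT = "clean_exit"
    OUT_OF_QUEUE_TIME = "out_of_queue_time"
    NONZERO_EXIT = "nonzero_exit"
    UNKNOWN = "unknown"


# Only the outcomes decided so far; others can be added later.
CONTINUE_CHAIN = {
    Outcome.CLEAN_EXIT: True,
    Outcome.OUT_OF_QUEUE_TIME: True,   # the "yes" discussed above
    Outcome.NONZERO_EXIT: False,       # what this ticket asks for
}


def should_continue(outcome, default=True):
    """Return the decision for a known outcome, else the given default."""
    return CONTINUE_CHAIN.get(outcome, default)
```

 New failure modes (out of memory, out of disk space, and so on) would
 then only need a detection step and one more table entry.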

 > What if the user used qdel to stop a simulation?

 If a user does this while chained jobs are still in the queue, I would
 indeed expect these to start.

 > What if the user used the web interface or a termination trigger to stop
 the simulation?

 That's a good question. I can see arguments both ways. We don't specify
 whether this should mean 'stop _this_ simulation' or 'stop the entire run'.

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1286#comment:4>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit