[ET Trac] [Einstein Toolkit] #1286: SimFactory should not run queued chained jobs if a previous job fails
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Mon Mar 11 10:05:45 CDT 2013
#1286: SimFactory should not run queued chained jobs if a previous job fails
--------------------------+-------------------------------------------------
Reporter: hinder | Owner: eschnett
Type: enhancement | Status: new
Priority: major | Milestone:
Component: SimFactory | Version:
Resolution: | Keywords:
--------------------------+-------------------------------------------------
Comment (by knarf):
Replying to [comment:3 eschnett]:
> That may be too simplistic. If a simulation runs out of queue time,
> would that count as "success"? Would you expect Simfactory to continue
> chaining jobs in this case?
I guess everybody would probably answer "yes" here, although I can see
that this might also be a problem (for example, if no checkpoint was
written because the corresponding parameter was set incorrectly).
> What if different MPI processes return different exit codes?
I would assume that the overall mpirun-like command returns a non-zero
exit code in that case. It may turn out that Cactus should write its exit
status explicitly to a file (when it exits properly).
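
A minimal sketch of that idea, in Python since SimFactory is written in
it. Everything here is an assumption for illustration: the function name
should_continue_chain, the marker file, and its format (a single line
holding "0" on a clean exit) are hypothetical and do not describe an
existing Cactus or SimFactory feature. The next chained job would only be
allowed to start if the launcher returned zero and the marker file
confirms a proper exit.

```python
import os


def should_continue_chain(exit_code: int, marker_path: str) -> bool:
    """Decide whether the next chained job may start.

    exit_code   -- exit code of the mpirun-like launcher command
    marker_path -- hypothetical file that the simulation writes only
                   when it exits properly, containing "0" on success
    """
    # A non-zero launcher exit code always stops the chain.
    if exit_code != 0:
        return False
    try:
        with open(marker_path) as marker:
            # Only an explicit "0" counts as a clean exit.
            return marker.read().strip() == "0"
    except OSError:
        # Missing or unreadable marker: the run did not exit properly.
        return False
```

Requiring both conditions means that a launcher which reports zero even
though the simulation died early would still stop the chain.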
> What is, in general, the exit code of mpirun anyway?
It very likely depends on more things than we would like.
> What would you do if a simulation runs out of time? out of memory? out
> of disk space? What if there is a file permission error and the
> simulation can't write? What if the Cactus executable never actually
> starts because something is wrong?
I agree that catching all of these correctly will take quite some work.
That doesn't mean we couldn't start working on some and leave others for
later.
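
To illustrate the incremental approach, here is a rough Python sketch in
which each failure mode is a separate check and new ones can be appended
later. All names are hypothetical, and the assumption that a started job
leaves a non-empty "stdout.txt" in its run directory is made up for the
example; it is not how SimFactory is known to behave.

```python
import os
from typing import Callable, List, Optional

# A check inspects a run directory and returns a reason string if it
# detects a failure, or None if it finds nothing wrong.
Check = Callable[[str], Optional[str]]


def check_output_exists(run_dir: str) -> Optional[str]:
    # Assumption: a job that actually started leaves a non-empty
    # "stdout.txt" (hypothetical file name) in its run directory.
    out = os.path.join(run_dir, "stdout.txt")
    if not os.path.isfile(out) or os.path.getsize(out) == 0:
        return "executable apparently never started (no output)"
    return None


def check_writable(run_dir: str) -> Optional[str]:
    # Catches the simplest kind of file permission problem.
    if not os.access(run_dir, os.W_OK):
        return "run directory is not writable"
    return None


# Start with a few checks; others (out of memory, out of disk space,
# and so on) can be added to this list later.
CHECKS: List[Check] = [check_output_exists, check_writable]


def failure_reasons(run_dir: str) -> List[str]:
    """Return the reasons reported by all checks that detected a failure."""
    results = (check(run_dir) for check in CHECKS)
    return [reason for reason in results if reason is not None]
```

The chain would continue only if failure_reasons() returns an empty list.
Failure modes without a check are simply not detected yet, which matches
the current behaviour for those cases.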
> What if the user used qdel to stop a simulation?
If a user does this with chained jobs still in the queue, I would indeed
expect these to start.
> What if the user used the web interface or a termination trigger to stop
> the simulation?
That's a good question. I can see arguments both ways. We don't specify
whether this means 'stop _this_ simulation' or 'stop the entire run'.
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1286#comment:4>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit