[ET Trac] [Einstein Toolkit] #316: Checkpoint recovery nonfunctional
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Fri Mar 18 14:01:15 CDT 2011
#316: Checkpoint recovery nonfunctional
-------------------------+--------------------------------------------------
Reporter: hinder | Owner: mthomas
Type: defect | Status: new
Priority: blocker | Milestone:
Component: SimFactory | Version:
Resolution: | Keywords: regression
-------------------------+--------------------------------------------------
Comment (by eschnett):
This patch has several problems:
We just agreed on the mailing list to store checkpoint files in a
directory "checkpoints" that is at the same level as the "output-NNNN"
directories. These checkpoint files therefore do not fit the pattern you
are expecting, and they lead to warnings; they should be ignored
instead. Rather than looking for checkpoint files in the whole
simulation directory, it may be better to look only in the individual
restart directories.
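As a rough illustration of restricting the search to restart directories,
something along these lines might work. The function name
"restart_directories" is hypothetical and not part of the patch; the only
assumption about the layout is that restart directories are named
"output-NNNN" with four digits, as described above.

```python
import os
import re

# A restart directory is named exactly "output-NNNN" (four digits assumed).
RESTART_DIR_RE = re.compile(r"output-\d{4}")


def restart_directories(simulation_dir):
    """Return the restart directories directly below simulation_dir, sorted.

    Everything else (for example the shared "checkpoints" directory) is
    skipped silently, so it never reaches the pattern-matching code.
    """
    found = []
    for name in os.listdir(simulation_dir):
        path = os.path.join(simulation_dir, name)
        if RESTART_DIR_RE.fullmatch(name) and os.path.isdir(path):
            found.append(path)
    return sorted(found)
```

Checkpoint files would then be collected per restart directory, and the
"checkpoints" directory produces no warnings because it is never visited.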
The code to replace the "output-NNNN" patterns is too complex. Instead of
pattern matching and then manual string operations, a direct regexp
replacement would be simpler and safer.
The pattern matching code does not check where in the path name the
pattern "output-NNNN" exists. If a simulation is called "output", this
leads to problems.
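The two points above could be addressed together by a single regexp
substitution that only matches "output-NNNN" when it is a complete path
component. This is only a sketch: the helper name "replace_restart_dir" is
made up, and a four-digit restart number is assumed.

```python
import re

# "output-NNNN" must be a whole path component: preceded by the start of the
# string or a slash, and followed by a slash or the end of the string.
RESTART_COMPONENT_RE = re.compile(r"(^|/)output-\d{4}(?=/|$)")


def replace_restart_dir(path, restart_id):
    """Rewrite the output-NNNN component of path to refer to restart_id."""
    new_path, count = RESTART_COMPONENT_RE.subn(
        lambda m: "%soutput-%04d" % (m.group(1), restart_id), path)
    if count != 1:
        # Zero or several matches mean the path is not what we expect;
        # failing loudly is safer than silently rewriting the wrong part.
        raise ValueError("expected exactly one output-NNNN component in %r"
                         % path)
    return new_path
```

With this, a simulation called "output" is left alone, because the component
"output" does not match the pattern, and an ambiguous path raises an error
instead of being rewritten incorrectly.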
The message "linking file XXX" is printed even if the linking step does
not actually happen.
The code to check whether a checkpoint file is found twice only looks at
file names. This is fragile and can hide problems, for example when two
restarts use different numbers of processes. Instead, the incoming list of
checkpoint files should be pruned. This would also allow an explicit
choice of whether newer or older checkpoint files should be chosen if they
exist in several restarts: the current code presumably chooses the older
ones, whereas we may want to use the newer ones.
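Pruning the incoming list might look roughly like this. The sketch assumes
that checkpoint file names contain ".it_<iteration>." and that each path
contains an "output-NNNN" component; both the function name
"prune_checkpoints" and the "prefer_newer" flag are hypothetical.

```python
import os
import re
from collections import defaultdict

RESTART_RE = re.compile(r"(?:^|/)output-(\d{4})(?=/)")
ITERATION_RE = re.compile(r"\.it_(\d+)\.")  # assumed naming convention


def prune_checkpoints(paths, prefer_newer=True):
    """For each iteration, keep the files of exactly one restart.

    Files are grouped by (iteration, restart) rather than by file name, so
    two restarts that wrote the same iteration with different numbers of
    processes are never mixed.
    """
    by_iteration = defaultdict(lambda: defaultdict(list))
    for path in paths:
        restart = RESTART_RE.search(path)
        iteration = ITERATION_RE.search(os.path.basename(path))
        if restart is None or iteration is None:
            continue  # not a checkpoint file inside a restart directory
        by_iteration[int(iteration.group(1))][int(restart.group(1))].append(path)

    pick = max if prefer_newer else min  # the explicit newer/older choice
    kept = []
    for iteration in sorted(by_iteration):
        restarts = by_iteration[iteration]
        kept.extend(sorted(restarts[pick(restarts)]))
    return kept
```

The choice between newer and older restarts is then a visible parameter
instead of a side effect of the order in which files are encountered.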
The fallback call to "shutil.copyfile" overwrites existing file contents.
This is a serious problem if the existing destination is a hard link,
because the copy then writes through the link and overwrites the content
of the file it is linked to. The destination needs to be unlinked first.
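A possible shape for the link-with-fallback step is sketched below. The
helper name "link_or_copy" is hypothetical. It unlinks the destination
before doing anything else, and it prints its message only after the
operation has actually succeeded, which also covers the "linking file XXX"
point above.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Make dst a hard link to src, or a copy if linking is not possible.

    Returns "link" or "copy" depending on what was actually done.
    """
    # Remove an existing destination first. Otherwise shutil.copyfile would
    # truncate and rewrite the existing inode, which may be shared with a
    # checkpoint file in another restart via an old hard link.
    if os.path.lexists(dst):
        os.unlink(dst)
    try:
        os.link(src, dst)
        action = "link"
    except OSError:
        # For example: src and dst are on different file systems.
        shutil.copyfile(src, dst)
        action = "copy"
    # Report only what really happened.
    print("%s: %s -> %s" % (action, src, dst))
    return action
```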
This code links all checkpoint files. Only checkpoint files from the last
iteration should be linked.
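Selecting only the last iteration could be done along these lines, again
assuming that file names contain ".it_<iteration>."; the function name
"last_iteration_files" is made up for this sketch.

```python
import re

ITERATION_RE = re.compile(r"\.it_(\d+)\.")  # assumed naming convention


def last_iteration_files(filenames):
    """Return the checkpoint files that belong to the highest iteration."""
    by_iteration = {}
    for name in filenames:
        match = ITERATION_RE.search(name)
        if match:
            by_iteration.setdefault(int(match.group(1)), []).append(name)
    if not by_iteration:
        return []
    # Compare iterations as integers, so that it_1000 sorts after it_200.
    return sorted(by_iteration[max(by_iteration)])
```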
Overall, the code says "checkpoint" when it should say "recover" in
several places.
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/316#comment:9>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit