[ET Trac] [Einstein Toolkit] #316: Checkpoint recovery nonfunctional

Einstein Toolkit trac-noreply at einsteintoolkit.org
Fri Mar 18 14:01:15 CDT 2011


#316: Checkpoint recovery nonfunctional
-------------------------+--------------------------------------------------
  Reporter:  hinder      |       Owner:  mthomas   
      Type:  defect      |      Status:  new       
  Priority:  blocker     |   Milestone:            
 Component:  SimFactory  |     Version:            
Resolution:              |    Keywords:  regression
-------------------------+--------------------------------------------------

Comment (by eschnett):

 This patch has several problems:

 We just agreed on the mailing list to store checkpoint files in a
 directory "checkpoints" that is at the same level as the "output-NNNN"
 directories. These checkpoint files therefore do not fit the pattern you
 are expecting, and lead to warnings; they should be ignored instead.
 Rather than looking for checkpoint files in the whole simulation
 directory, it may be better to look only in the individual restart
 directories.
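
 A minimal sketch of searching only the restart directories, so that the
 top-level "checkpoints" directory is never visited. The function name,
 the four-digit "output-NNNN" form, and the checkpoint file name pattern
 are assumptions for illustration; they are not taken from the patch.

```python
import os
import re

# Assumed naming conventions, for illustration only.
RESTART_DIR = re.compile(r"^output-\d{4}$")
CHECKPOINT_FILE = re.compile(r"^checkpoint\.chkpt\.it_\d+(\.file_\d+)?\.h5$")


def find_restart_checkpoints(simulation_dir):
    """Map each output-NNNN directory name to the checkpoint files below it."""
    found = {}
    for entry in sorted(os.listdir(simulation_dir)):
        restart_path = os.path.join(simulation_dir, entry)
        # Skip "checkpoints" and anything else that is not a restart directory.
        if not (RESTART_DIR.match(entry) and os.path.isdir(restart_path)):
            continue
        files = []
        for root, _dirs, names in os.walk(restart_path):
            for name in names:
                if CHECKPOINT_FILE.match(name):
                    files.append(os.path.join(root, name))
        found[entry] = sorted(files)
    return found
```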

 The code to replace the "output-NNNN" patterns is too complex. Instead of
 pattern matching and then manual string operations, a direct regexp
 replacement would be simpler and safer.

 The pattern matching code does not check where in the path name the
 pattern "output-NNNN" occurs. If a simulation is itself called "output",
 the wrong part of the path may be matched.
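
 The last two points could be handled together by a single regexp
 substitution that matches "output-NNNN" only as a complete path
 component. This is a sketch, not the patch's code: the helper name and
 the four-digit form are assumptions, and it does not cover a simulation
 that is itself named like "output-0001".

```python
import re

# Match "output-NNNN" only as a whole path component: it must be preceded
# by "/" or the start of the string and followed by "/" or the end.
RESTART_COMPONENT = re.compile(r"(?<![^/])output-\d{4}(?![^/])")


def retarget_restart(path, restart_id):
    """Rewrite the restart directory component of path to output-<restart_id>."""
    return RESTART_COMPONENT.sub("output-%04d" % restart_id, path)
```

 With this, a path such as "/sims/output/output-0003/output/chk.h5" has
 only its "output-0003" component rewritten; the two plain "output"
 components are left alone.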

 The message "linking file XXX" is printed even if the linking step does
 not actually happen.

 The code to check whether a checkpoint file is found twice only looks at
 file names. This is fragile and can hide problems, for example when two
 restarts use different numbers of processes. Instead, the incoming list of
 checkpoint files should be pruned. This would also allow an explicit
 choice of whether newer or older checkpoint files should be chosen if they
 exist in several restarts: the current code presumably chooses the older
 ones, whereas we may want to use the newer ones.
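
 One way to prune the incoming list is sketched below. For each
 iteration, all files are taken from a single restart, and a flag decides
 whether the newest or the oldest restart wins. The path layout, the file
 name pattern, and the names used here are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumed layout: .../output-NNNN/.../checkpoint.chkpt.it_<iter>[.file_<n>].h5
CHECKPOINT_PATH = re.compile(
    r"(?:^|/)output-(\d{4})/(?:.*/)?"
    r"checkpoint\.chkpt\.it_(\d+)(?:\.file_\d+)?\.h5$"
)


def prune_checkpoints(paths, prefer_newer=True):
    """Keep, per iteration, the checkpoint files of exactly one restart."""
    by_iteration = defaultdict(lambda: defaultdict(list))
    for path in paths:
        match = CHECKPOINT_PATH.search(path)
        if match is None:
            continue  # not a checkpoint file inside a restart directory
        restart, iteration = int(match.group(1)), int(match.group(2))
        by_iteration[iteration][restart].append(path)

    pruned = []
    for iteration in sorted(by_iteration):
        restarts = by_iteration[iteration]
        chosen = max(restarts) if prefer_newer else min(restarts)
        # Never mix files from different restarts for the same iteration,
        # since the restarts may have used different numbers of processes.
        pruned.extend(sorted(restarts[chosen]))
    return pruned
```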

 The fallback call to "shutil.copyfile" overwrites existing file contents.
 This is a serious problem if the existing file is a hard link, because
 the overwrite then also changes the content of the file it is linked to.
 The destination needs to be unlinked first.
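
 A sketch of a link-with-fallback helper that unlinks the destination
 before doing anything else. The function name and its return values are
 hypothetical; only "os.link", "os.unlink" and "shutil.copyfile" are real
 library calls.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Hard-link src to dst, falling back to a copy; report what happened."""
    # Nothing to do if dst already refers to the same file as src.
    if os.path.exists(dst) and os.path.samefile(src, dst):
        return "unchanged"
    # Remove the old directory entry first, so that a later copy creates a
    # fresh file instead of writing through an existing hard link.
    try:
        os.unlink(dst)
    except FileNotFoundError:
        pass
    try:
        os.link(src, dst)
        return "linked"
    except OSError:
        # Hard links may be unsupported, e.g. across file systems.
        shutil.copyfile(src, dst)
        return "copied"
```

 Returning the outcome also lets the caller print "linking file XXX" only
 when a link was actually created.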

 This code links all checkpoint files. Only checkpoint files from the last
 iteration should be linked.
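
 Restricting the list to the last iteration could look like the sketch
 below. It assumes that the iteration number appears in the file name as
 ".it_<number>."; that convention and the function name are assumptions.

```python
import os
import re

# Assumed convention: the iteration appears as ".it_<number>." in the name.
ITERATION = re.compile(r"\.it_(\d+)\.")


def last_iteration_files(paths):
    """Return only the checkpoint files that belong to the highest iteration."""
    by_iteration = {}
    for path in paths:
        match = ITERATION.search(os.path.basename(path))
        if match is None:
            continue  # no iteration number in the file name
        by_iteration.setdefault(int(match.group(1)), []).append(path)
    if not by_iteration:
        return []
    return sorted(by_iteration[max(by_iteration)])
```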

 Overall, the code says "checkpoint" when it should say "recover" in
 several places.

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/316#comment:9>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit