[ET Trac] [Einstein Toolkit] #626: Recovery fails in AHFinderDirect RecoverML with out-of-bounds assertion in CarpetLib

Einstein Toolkit trac-noreply at einsteintoolkit.org
Tue Oct 18 12:32:53 CDT 2011


#626: Recovery fails in AHFinderDirect RecoverML with out-of-bounds assertion in
CarpetLib
---------------------+------------------------------------------------------
  Reporter:  rhaas   |       Owner:  eschnett  
      Type:  defect  |      Status:  new       
  Priority:  major   |   Milestone:  ET_2011_10
 Component:  Carpet  |     Version:            
Resolution:          |    Keywords:            
---------------------+------------------------------------------------------

Comment (by eschnett):

 Thank you.

 This is an unfortunate coincidence of events.

 Cactus itself does not have a notion of a maximum number of timelevels;
 this number can be different for each grid group. However, for proper time
 interpolation etc., Carpet needs to assume such a global maximum, and it
 uses "prolongation_order_time+1" for this. (This parameter should probably
 be renamed.) That is, this is the maximum number of timelevels that Carpet
 can handle for time interpolation; individual grid groups may have more
 timelevels, but Carpet will not store/provide any meta-data for them, and
 the application can only access these as they are (i.e. in the same way as
 with PUGH).
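
 As a hypothetical sketch of the relation just described (not Carpet's
 actual code), the global maximum is derived from the prolongation order:

```cpp
// Hypothetical sketch: the maximum number of timelevels for which Carpet
// keeps meta-data, derived as "prolongation_order_time + 1" as described
// above.  Not the actual Carpet implementation.
int max_tracked_timelevels(int prolongation_order_time) {
  return prolongation_order_time + 1;
}
```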

 In this particular parameter file, prolongation_order_time is 2 (the
 default), but the metric seems to have 3 timelevels. That means that the
 metric has a pre-previous timelevel that is most likely never used, but
 which nevertheless has storage, and which is checkpointed and recovered. I
 assume that nothing in the code actually uses the 3rd timelevel, except
 that timelevel rotation will move old data into it.
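
 Timelevel rotation can be pictured roughly like this (a simplified,
 hypothetical sketch, not Carpet's actual code): the storage of the oldest
 level is recycled for the new current level, so an otherwise unused extra
 level silently receives old data on every rotation:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of cycling timelevels: each level's buffer is shifted
// back by one, and the oldest buffer is reused for the new current level.
// No data is copied.
void rotate_timelevels(std::vector<double*>& levels) {
  assert(!levels.empty());
  double* oldest = levels.back();           // oldest timelevel's storage
  for (std::size_t tl = levels.size() - 1; tl > 0; --tl)
    levels[tl] = levels[tl - 1];            // shift each level back by one
  levels[0] = oldest;                       // reuse oldest buffer for "now"
}
```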

 During recovery, all timelevels need to be synchronised. This is special,
 since during evolution only the current timelevel would be synchronised.
 All timelevels need to be synchronised because the number of processes may
 change, and the ghost zones need to be filled. It would instead be
 possible to read in the ghost zone data from file, but synchronising is
 probably much more efficient.

 The major difference between Carpet-git and Carpet-hg is that Carpet now
 explicitly stores which time is associated with each timelevel, instead of
 assuming a constant delta time. This can save time interpolation during
 regridding (leading to slightly different, but still valid results), and
 also allows e.g. changing the time step size during evolution. However,
 this also means that the times for the past timelevels need to be
 initialised correctly, so Carpet needs to know how many old timelevels
 there can be.
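
 A hypothetical sketch of what initialising those past-timelevel times
 might look like (not Carpet's actual code; it merely illustrates why the
 number of old timelevels must be known):

```cpp
#include <vector>

// Hypothetical sketch: set up the time associated with each timelevel from
// the current time and the current step size.  Past levels initially lie
// dt apart; during evolution each level would keep its own recorded time,
// so dt may later change without invalidating the meta-data.
std::vector<double> init_timelevel_times(double t, double dt,
                                         int ntimelevels) {
  std::vector<double> times(ntimelevels);
  for (int tl = 0; tl < ntimelevels; ++tl)
    times[tl] = t - tl * dt;   // timelevel tl lies tl steps in the past
  return times;
}
```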

 And this is where things go wrong: Carpet has no meta-data for the 3rd
 (unused) timelevel. During synchronisation, Carpet accesses the time
 associated with this timelevel (although this is not actually needed), and
 this operation fails with the out-of-bounds assertion.
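
 Schematically (a hypothetical illustration, not the actual CarpetLib
 code), the failing check amounts to asking for the time of a timelevel
 that lies beyond the tracked meta-data:

```cpp
// Hypothetical illustration of the mismatch: meta-data exists only for the
// globally tracked timelevels, while a grid group may allocate more.
// Asking for the time of an untracked level is what trips the
// out-of-bounds assertion during synchronisation.
bool timelevel_has_metadata(int tl, int tracked_timelevels) {
  return tl >= 0 && tl < tracked_timelevels;
}
```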

 There are several ways to correct this:
 * Allocate only 2 timelevels for the metric (since only 2 are used anyway)
 * Set prolongation_order_time to 3 instead of 2 (since you want to use 3
 timelevels, you may as well tell Carpet about this)
 * Introduce a special case for synchronising after recovering that somehow
 doesn't access Carpet's timelevel metadata (which are not required for
 synchronising)
 * Wrap accessing timelevels during communication in an if statement, so
 that the error is not triggered
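
 The first two options amount to one-line parameter-file changes; as a
 sketch (the thorn and parameter names here are my assumption about which
 thorns the parameter file uses, so adjust them to match):

```
# Option 1: allocate only the two timelevels that are actually used
ADMBase::metric_timelevels = 2

# Option 2: tell Carpet that three timelevels are in use
Carpet::prolongation_order_time = 3
```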

 I would obviously prefer one of the first two options. In addition, the
 error message should be improved.

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/626#comment:1>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

