[Users] OpenMP problems in ET

Wed Jul 20 07:50:04 CDT 2011

On 19 Jul 2011, at 18:00, Frank Loeffler wrote:

> On Sat, Jul 16, 2011 at 01:41:06AM +0900, Hee Il Kim wrote:
>> I recently found that OpenMP runs of ET can give different results depending on
>> the number of threads (NT=1 vs. NT neq 1). In some experiments, the
>> difference becomes noticeable only after a long time, but you can see the
>> difference even for the TOV test run with static_tov.par (I compared the
>> time variation of rho_max). With the same parameter setup except for the
>> extended cctk_final_time, the difference becomes noticeable around t = 1300.
> 
> Differences in results are expected when running on different numbers of
> MPI processes or OpenMP threads. How large these differences get depends
> on what exactly is done, but the longer a simulation runs the larger the
> difference can, in theory, get. This is true even when there is no bug
> and everything goes as it should. The challenge is to be sure that
> this is indeed the case, and differences are not creeping in because of
> some bug.
> 
> One of the possibilities to create differences is when the results of
> reductions are used within the simulation. Reductions will necessarily
> produce (small) differences depending on the number of MPI processes or
> OpenMP threads - because the order in which the reduction is done
> differs and creates a different numerical error. This error shouldn't be
> all that large. However, if results from this are fed back into the
> simulation, these differences might be amplified, especially if iterative
> schemes come into play and the number of iterations taken suddenly
> changes because of a tiny change in the residual shifting it above or
> below a given tolerance.
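
A minimal Python sketch of the tolerance effect described above. It assumes a toy "solver" that simply halves a residual on every iteration; the function name `iterations_to_converge`, the starting residuals and the tolerance are all made up for illustration and have nothing to do with any actual ET code.

```python
def iterations_to_converge(residual, tolerance):
    """Count halving steps until residual < tolerance (toy stand-in for a solver)."""
    count = 0
    while residual >= tolerance:
        residual *= 0.5  # multiplying by 0.5 is exact in binary floating point
        count += 1
    return count

tolerance = 2.0 ** -10   # hypothetical convergence tolerance
r_a = 1.0                # residual as computed in one run
r_b = 1.0 - 2.0 ** -53   # the same residual, one rounding step smaller

n_a = iterations_to_converge(r_a, tolerance)
n_b = iterations_to_converge(r_b, tolerance)
print(n_a, n_b)
```

The two starting residuals differ only in the last binary digit, yet the iteration counts differ by one, because one run lands exactly on the tolerance and the other lands just below it.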

Just to clarify: the "numerical error" that Frank is talking about is due to the lack of associativity of floating point operations, where you can get differences in the last binary digit depending on the order in which operations occur.  These are not "errors" as such, as there is no well-defined "correct" answer.  Each result is as correct as the other.

Implementations of finite differencing schemes on finite-precision hardware generically lead to an uncertainty in the result of O[C(t) eps/dt], where C(t) is a function of the time coordinate only (independent of dx and dt), eps characterises the size of the round-off error (e.g. 1e-15), and dx and dt are the space and time steps used in the finite differencing.  That is, as the time step is decreased, the uncertainty increases.  This result is in Gustafsson, Kreiss and Oliger (I don't have it in front of me at the moment, but I can look up the reference if anyone is interested).  So if you change the order of operations, e.g. by using different compiler optimisations, you can expect to see differences of this order.  In my reading of it, this result seems to apply to systems where all the evolved variables are of order unity, so it might be even worse in other cases, and for nonlinear systems.  The effect of parallelisation on the order of operations in reductions can be considered analogous to this.
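
To make the non-associativity concrete, here is a small Python sketch. It assumes a simplified model of a parallel reduction in which each "thread" sums a contiguous chunk and the partial sums are then added together; `serial_sum` and `chunked_sum` are hypothetical helper names, not part of any ET or OpenMP interface. Explicit loops are used instead of the built-in `sum`, which in recent Python versions applies compensated summation to floats and would hide part of the effect.

```python
def serial_sum(values):
    """Left-to-right sum, as a single thread would do it."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, num_chunks):
    """Sum contiguous chunks separately, then add the partial sums
    (a stand-in for per-thread partial sums in a reduction)."""
    n = len(values)
    bounds = [n * i // num_chunks for i in range(num_chunks + 1)]
    partials = [serial_sum(values[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]
    return serial_sum(partials)

# The same three numbers, grouped differently.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, abs(left - right))

# The same ten numbers, reduced with one "thread" and with two.
values = [0.1] * 10
one_thread = chunked_sum(values, 1)
two_threads = chunked_sum(values, 2)
print(one_thread, two_threads, abs(one_thread - two_threads))
```

In both cases the two answers differ only at the level of the last binary digit, and neither is more "correct" than the other.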

It would be very nice to have more understanding of how our calculations react to small changes in the initial data and equations such as these.

> One example where tiny differences can have a large impact is when grids
> are moved according to the location of, e.g., a neutron star. Assuming
> that the stars are tracked by looking for the maximum of some density, a
> tiny change at that location might suddenly make a neighbor the maximum,
> resulting in a different region being refined, amplifying differences.
> 
> All of these differences should vanish when increasing resolution, and
> this seems to be what you also observe. I am sorry that I cannot give a
> general answer, but this should suggest that differences are not
> necessarily bad - it all depends on how large these differences are,
> whether their origin is understood and whether they are reduced when
> increasing resolution.

Any algorithm which selects a grid structure or a grid point based on the value of a floating point number could translate an O(eps) difference into an O(dx) or worse difference, and such differences should decrease with increasing resolution.  But if the differences are caused only by non-associativity of floating point operations, they should (a) probably remain fairly small, and (b) not get smaller with increased resolution; in fact they should get larger, as the number of time steps increases - see above.
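
A rough Python sketch of how an O(eps) difference in the data can become an O(dx) difference in a selected grid point. It assumes a one-dimensional grid with a hypothetical spacing `dx` and a made-up density profile in which two neighbouring points are almost tied for the maximum; the `argmax` helper is written out here and is not any tracker actually used in ET.

```python
def argmax(values):
    """Index of the first largest entry."""
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:
            best = i
    return best

eps = 2.0 ** -52   # spacing of double-precision numbers just above 1.0
dx = 0.5           # hypothetical grid spacing

# Two runs whose density differs only in the last binary digit
# at two neighbouring grid points.
rho_a = [0.5, 1.0 + eps, 1.0, 0.5]
rho_b = [0.5, 1.0, 1.0 + eps, 0.5]

x_a = argmax(rho_a) * dx   # tracked location in run A
x_b = argmax(rho_b) * dx   # tracked location in run B
data_diff = max(abs(a - b) for a, b in zip(rho_a, rho_b))
print(data_diff, abs(x_b - x_a))
```

The data differ by eps, but the tracked location differs by a full grid spacing dx, which is why this kind of difference shrinks as the resolution is increased.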

-- 
Ian Hinder
ian.hinder at aei.mpg.de