[Users] Problem with CarpetRegrid2/AMR

Hal Finkel hfinkel at anl.gov
Tue Sep 13 19:03:55 CDT 2011


Erik,

The problem occurs with just one process.

Thanks again,
Hal

On Tue, 2011-09-13 at 18:45 -0400, Erik Schnetter wrote:
> Hal
> 
> Were you running with multiple processes?
> 
> -erik
> 
> On Tue, Sep 13, 2011 at 12:00 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > Ian,
> >
> > I had not found the problem yet. Hopefully this will be the hint we need
> > to get this fixed.
> >
> > Thanks again,
> > Hal
> >
> > On Tue, 2011-09-13 at 15:31 +0100, Ian Hawke wrote:
> >> Erik, Hal,
> >>
> >> Did you have any luck tracking down this error? I've just come back to
> >> this and am seeing the same error message; it appears to arise when two
> >> grids on a refined level merge, as in:
> >>
> >>     [3][0][0]   exterior: [-0.010000,-0.020000,-0.020000] : [0.000000,0.020000,0.020000] : [0.001250,0.001250,0.001250]
> >>     [3][0][1]   exterior: [0.010000,-0.020000,-0.020000] : [0.040000,0.020000,0.020000] : [0.001250,0.001250,0.001250]
> >>
> >> becomes
> >>
> >>     [3][0][0]   exterior: [-0.010000,-0.020000,-0.020000] : [0.040000,0.020000,0.020000] : [0.001250,0.001250,0.001250]
> >>
> >> It seems that the old grids are destroyed before the data is
> >> copied/populated in Recompose (either that or the old grid structure is
> >> not referred to in the data transfer).
> >>
> >> Ian
> >>
> >> On 07/09/11 01:03, Erik Schnetter wrote:
> >> > Hal
> >> >
> >> > This is where numbers are assigned to components. The communication
> >> > schedule decides which component needs to send data to which other
> >> > component (which may be located on another process or not); this
> >> > schedule is created for each refinement level independently, and may
> >> > (if there is an error) refer to component numbers that don't exist.
> >> > This schedule is set up in dh.cc.
> >> >
> >> > Can you send me the example you are currently running (your source
> >> > code and parameter file)? I will give it a try.
> >> >
> >> > -erik
> >> >
> >> > On Tue, Sep 6, 2011 at 7:44 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> >> On Thu, 2011-09-01 at 15:16 -0400, Erik Schnetter wrote:
> >> >>> On Thu, Sep 1, 2011 at 3:05 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> >>>> On Thu, 2011-09-01 at 14:25 -0400, Erik Schnetter wrote:
> >> >>>>> On Thu, Sep 1, 2011 at 11:51 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> >>>>>> On Thu, 2011-09-01 at 11:37 -0400, Erik Schnetter wrote:
> >> >>>>>>> On Thu, Sep 1, 2011 at 10:53 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> >>>>>>>> On Tue, 2011-08-30 at 21:06 -0400, Erik Schnetter wrote:
> >> >>>>>>>>> On Tue, Aug 30, 2011 at 5:28 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> >>>>>>>>>> Could I also decrease the block size? I currently have
> >> >>>>>>>>>> CarpetRegrid2::adaptive_block_size = 4, could it be smaller than that?
> >> >>>>>>>>>> Is there a restriction based on the number of ghost points?
> >> >>>>>>>>> Yes, you can reduce the block size. I assume that both the regridding
> >> >>>>>>>>> operation and the time evolution will become slower if you do that,
> >> >>>>>>>>> because more blocks will have to be handled.
> >> >>>>>>>> Regardless of what I do, once we get past the first coarse time step,
> >> >>>>>>>> the program seems to "hang" at "INFO (Carpet): [ml=0][rl=0][m=0][tl=0]
> >> >>>>>>>> Regridding map 0...".
> >> >>>>>>>>
> >> >>>>>>>> Overall, it is in dh::regrid(do_init=true). It spends most of its time
> >> >>>>>>>> in bboxset<int, 3>::normalize() and, specifically, mostly in the loop:
> >> >>>>>>>>   for (typename bset::iterator nsi = nbs.begin(); nsi != nbs.end(); ++nsi)
> >> >>>>>>>> The normalize() function does return, however, so the program is not
> >> >>>>>>>> hanging in that function.
> >> >>>>>>>>
> >> >>>>>>>> The core problem seems to be that it takes a long time to execute:
> >> >>>>>>>>   boxes = boxes.shift(-dir) - boxes;
> >> >>>>>>>> in dh::regrid(do_init=true). Probably because boxes has 129064 elements.
> >> >>>>>>>> The coarse grid is now only 30^3 and I've left the regrid box size at 4.
> >> >>>>>>>> I'd think, then, that the coarse grid should have a maximum of 30^3/4^3
> >> >>>>>>>> ~ 420 refinement regions.
> >> >>>>>>>>
> >> >>>>>>>> What is the best way to figure out what is going on?
> >> >>>>>>> Hal
> >> >>>>>>>
> >> >>>>>>> Yes, this function is very slow. I did not expect it to be
> >> >>>>>>> prohibitively slow. Are you compiling with optimisation enabled?
> >> >>>>>> I've tried with optimizations enabled (and without for debugging).
> >> >>>>>>
> >> >>>>>>> The bboxset represents the set of refined regions, and it is
> >> >>>>>>> internally represented as a list of bboxes (regions). Carpet performs
> >> >>>>>>> set operations on these (intersection, union, complement, etc.) to
> >> >>>>>>> determine the communication schedule, i.e. which ghost zones of which
> >> >>>>>>> bbox need to be filled from which other bbox. Unfortunately, the
> >> >>>>>>> algorithm used for this is O(n^2) in the number of refined regions,
> >> >>>>>>> and set operations when implemented via lists themselves are O(n^2) in
> >> >>>>>>> the set size, leading to a rather unfortunate overall complexity. The
> >> >>>>>>> only cure is to reduce the number of bboxes (make them larger) and to
> >> >>>>>>> regrid fewer times.
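
A minimal sketch of why list-based set operations get expensive, under a deliberately simplified one-dimensional model: half-open integer intervals stand in for bboxes. The names Interval, subtract, and pair_tests are hypothetical and are not CarpetLib's API. Subtracting one list from another tests every surviving piece against every interval being removed, so the work grows with the product of the two list sizes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D stand-in for a bbox: the half-open interval [lo, hi).
struct Interval {
  int lo, hi;
};

// Counts how many (piece, cut) pairs subtract() has examined so far.
static std::size_t pair_tests = 0;

// Set difference a \ b on plain lists of intervals. Each interval in b is
// tested against every piece that has survived so far; this nested loop is
// the quadratic part.
std::vector<Interval> subtract(const std::vector<Interval>& a,
                               const std::vector<Interval>& b) {
  std::vector<Interval> result = a;
  for (const Interval& cut : b) {
    assert(cut.lo <= cut.hi);
    std::vector<Interval> next;
    for (const Interval& piece : result) {
      ++pair_tests;
      // Keep the part of piece that lies to the left of cut, if any.
      const int left_end = std::min(piece.hi, cut.lo);
      if (piece.lo < left_end) next.push_back({piece.lo, left_end});
      // Keep the part of piece that lies to the right of cut, if any.
      const int right_begin = std::max(piece.lo, cut.hi);
      if (right_begin < piece.hi) next.push_back({right_begin, piece.hi});
    }
    result = next;
  }
  return result;
}
```

As an illustration, take 100 disjoint intervals [8i, 8i+4) and subtract them from copies shifted by -1. This mimics the shape of boxes.shift(-dir) - boxes and leaves 100 one-point faces, at a cost of 100 * 100 = 10000 pair tests. The same pattern with a list of 129064 elements is what makes the real operation so slow.
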
> >> >>>>>> This is what I suspected, but nevertheless, is there something wrong?
> >> >>>>>> How many boxes do you expect that I should have? The reason that it does
> >> >>>>>> not finish, even with optimizations, is that there are 129K boxes in the
> >> >>>>>> loop (that's at least 16 billion box normalizations?).
> >> >>>>>>
> >> >>>>>> The coarse grid is only 30^3, and the regrid box size is 4, so at
> >> >>>>>> maximum, there should be ~400 level one boxes. Even if some of those
> >> >>>>>> have level 2 boxes, I don't understand how there could be 129K boxes.
> >> >>>>> The refinement structure itself should have one bbox per refined 4^3
> >> >>>>> box, and both CarpetRegrid2 and CarpetLib would try to combine these
> >> >>>>> into fewer boxes where possible, i.e. where one can form rectangles or
> >> >>>>> larger cubes. I would thus expect no more than about (30/4)^2, i.e.
> >> >>>>> roughly 8^2 = 64, bboxes on level one.
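
The box counts discussed in this exchange can be checked with a few lines of integer arithmetic. This is only a sketch: the function names are hypothetical and are not Carpet parameters. It assumes a 30^3 coarse grid, a block size of 4, and that a partial block at the grid boundary counts as a whole block.

```cpp
#include <cassert>
#include <cstdint>

// Blocks needed to cover `points` grid points in one dimension, rounding up
// so that a partial block at the boundary counts as one block.
inline std::int64_t blocks_per_side(std::int64_t points, std::int64_t block) {
  assert(points >= 0 && block > 0);
  return (points + block - 1) / block;
}

// Upper bound on the number of blocks if no blocks are merged.
inline std::int64_t unmerged_blocks(std::int64_t points, std::int64_t block) {
  const std::int64_t n = blocks_per_side(points, block);
  return n * n * n;
}

// Number of boxes if blocks are merged into full-length columns along one
// direction (an assumed merging pattern, chosen only for illustration).
inline std::int64_t merged_columns(std::int64_t points, std::int64_t block) {
  const std::int64_t n = blocks_per_side(points, block);
  return n * n;
}

// Number of ordered pairs visited by a pass that compares every element of
// an n-element list with every element.
inline std::int64_t ordered_pairs(std::int64_t n) { return n * n; }
```

For 30 points and a block size of 4 this gives 8 blocks per side, at most 512 unmerged blocks (somewhat above the ~420 obtained without rounding up), and 64 columns, which is one way to arrive at the estimate of 64. For a list of 129064 boxes, ordered_pairs returns 16657516096, in line with the "at least 16 billion" figure.
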
> >> >>>> That makes sense. I think that there is a bug somewhere which is causing
> >> >>>> the box set to be much too big. Furthermore, it does not happen on every
> >> >>>> run, only sometimes. When it does not happen, I hit another bug after a
> >> >>>> few coarse timesteps:
> >> >>>>
> >> >>>> I get a range-check exception from std::vector in a call to:
> >> >>>> gh::get_local_component (rl=1, c=8)
> >> >>>> the problem is that this returns:
> >> >>>> local_components_.AT(rl).AT(c);
> >> >>>> and local_components_[1].size() is 8
> >> >>>> The call to get_local_component is coming from ggf::transfer_from_all
> >> >>>> at:
> >> >>>> int const lc2 = h.get_local_component(rl2,c2);
> >> >>>> where c2 is from psend.component.
> >> >>>> So it looks like there is an off-by-one error somewhere.
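
The failure described above can be reproduced in isolation. This sketch assumes only what the message states: a container of size 8 is indexed with c = 8 through a range-checked accessor. The helper lookup_component is hypothetical, and std::vector::at stands in for the AT macro, whose definition is not shown in this thread.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a per-level table of local component numbers.
// Returns true and stores the entry in `out` if index c is valid; returns
// false if the range check fires.
bool lookup_component(const std::vector<int>& table, std::size_t c, int& out) {
  try {
    out = table.at(c);  // at() throws std::out_of_range when c >= size()
    return true;
  } catch (const std::out_of_range&) {
    return false;
  }
}
```

With eight entries the valid indices are 0 through 7, so the lookup succeeds for c = 7 and fails for c = 8. A request for component 8 therefore refers to one more component than the level holds.
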
> >> >>> Very strange. This code should be quite solid by now. psend is set in
> >> >>> the file dh.cc in thorn Carpet/CarpetLib; there is one (large) routine
> >> >>> that calculates the communication schedule. Some of the indexing
> >> >>> errors there in the past included confusing the number of components
> >> >>> on different refinement levels, which led to indexing errors such as
> >> >>> the one you describe.
> >> >> The bad component numbers are not coming from:
> >> >> preg.component = tmpncomps.AT(m)++;
> >> >> in Carpet/src/Recompose.cc
> >> >>
> >> >> Where else are the component numbers assigned?
> >>
> >> _______________________________________________
> >> Users mailing list
> >> Users at einsteintoolkit.org
> >> http://lists.einsteintoolkit.org/mailman/listinfo/users
> >
> > --
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> > 1-630-252-0023
> > hfinkel at anl.gov
> >

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
1-630-252-0023
hfinkel at anl.gov


