[Users] Problem with CarpetRegrid2/AMR

Erik Schnetter schnetter at cct.lsu.edu
Tue Sep 13 17:58:13 CDT 2011


After how many iterations does the code abort?

-erik

On Tue, Sep 13, 2011 at 6:45 PM, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
> Hal
>
> Were you running with multiple processes?
>
> -erik
>
> On Tue, Sep 13, 2011 at 12:00 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> Ian,
>>
>> I have not found the problem yet. Hopefully this will be the hint we need
>> to get this fixed.
>>
>> Thanks again,
>> Hal
>>
>> On Tue, 2011-09-13 at 15:31 +0100, Ian Hawke wrote:
>>> Erik, Hal,
>>>
>>> Did you have any luck tracking down this error? I've just come back to
>>> this and am seeing the same error message; it appears to arise when two
>>> grids on a refined level merge, as in:
>>>
>>>     [3][0][0]   exterior: [-0.010000,-0.020000,-0.020000] : [0.000000,0.020000,0.020000] : [0.001250,0.001250,0.001250]
>>>     [3][0][1]   exterior: [0.010000,-0.020000,-0.020000] : [0.040000,0.020000,0.020000] : [0.001250,0.001250,0.001250]
>>>
>>> becomes
>>>
>>>     [3][0][0]   exterior: [-0.010000,-0.020000,-0.020000] : [0.040000,0.020000,0.020000] : [0.001250,0.001250,0.001250]
>>>
>>> It seems that the old grids are destroyed before the data is
>>> copied/populated in Recompose (either that or the old grid structure is
>>> not referred to in the data transfer).
>>>
>>> Ian
>>>
>>> On 07/09/11 01:03, Erik Schnetter wrote:
>>> > Hal
>>> >
>>> > This is where numbers are assigned to components. The communication
>>> > schedule decides which component needs to send data to which other
>>> > component (which may be located on another process or not); this
>>> > schedule is created for each refinement level independently, and may
>>> > (if there is an error) refer to component numbers that don't exist.
>>> > This schedule is set up in dh.cc.
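
One way to catch a schedule that refers to component numbers that don't exist is to validate it against the number of components on each refinement level before using it. The following is only a sketch of that idea: SendEntry, find_bad_entries and ncomps_per_level are hypothetical names for illustration and are not part of Carpet or CarpetLib.

```cpp
#include <cassert>  // only needed if you assert on the result
#include <cstddef>
#include <vector>

// Hypothetical schedule entry: "send to component `component` on level `level`".
struct SendEntry {
  int level;
  int component;
};

// Return the positions of all entries whose level or component number does
// not exist, given the number of components on each refinement level.
std::vector<std::size_t> find_bad_entries(
    const std::vector<SendEntry>& schedule,
    const std::vector<int>& ncomps_per_level) {
  std::vector<std::size_t> bad;
  for (std::size_t i = 0; i < schedule.size(); ++i) {
    const SendEntry& e = schedule[i];
    const bool level_ok =
        e.level >= 0 &&
        static_cast<std::size_t>(e.level) < ncomps_per_level.size();
    const bool comp_ok =
        level_ok && e.component >= 0 &&
        e.component < ncomps_per_level[static_cast<std::size_t>(e.level)];
    if (!comp_ok) bad.push_back(i);
  }
  return bad;
}
```

With 8 components on level 1, an entry that names component 8 on that level would be reported, since the valid numbers are 0 to 7.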
>>> >
>>> > Can you send me the example you are currently running (your source
>>> > code and parameter file)? I will give it a try.
>>> >
>>> > -erik
>>> >
>>> > On Tue, Sep 6, 2011 at 7:44 PM, Hal Finkel<hfinkel at anl.gov>  wrote:
>>> >> On Thu, 2011-09-01 at 15:16 -0400, Erik Schnetter wrote:
>>> >>> On Thu, Sep 1, 2011 at 3:05 PM, Hal Finkel<hfinkel at anl.gov>  wrote:
>>> >>>> On Thu, 2011-09-01 at 14:25 -0400, Erik Schnetter wrote:
>>> >>>>> On Thu, Sep 1, 2011 at 11:51 AM, Hal Finkel<hfinkel at anl.gov>  wrote:
>>> >>>>>> On Thu, 2011-09-01 at 11:37 -0400, Erik Schnetter wrote:
>>> >>>>>>> On Thu, Sep 1, 2011 at 10:53 AM, Hal Finkel<hfinkel at anl.gov>  wrote:
>>> >>>>>>>> On Tue, 2011-08-30 at 21:06 -0400, Erik Schnetter wrote:
>>> >>>>>>>>> On Tue, Aug 30, 2011 at 5:28 PM, Hal Finkel<hfinkel at anl.gov>  wrote:
>>> >>>>>>>>>> Could I also decrease the block size? I currently have
>>> >>>>>>>>>> CarpetRegrid2::adaptive_block_size = 4, could it be smaller than that?
>>> >>>>>>>>>> Is there a restriction based on the number of ghost points?
>>> >>>>>>>>> Yes, you can reduce the block size. I assume that both the regridding
>>> >>>>>>>>> operation and the time evolution will become slower if you do that,
>>> >>>>>>>>> because more blocks will have to be handled.
>>> >>>>>>>> Regardless of what I do, once we get past the first coarse time step,
>>> >>>>>>>> the program seems to "hang" at "INFO (Carpet): [ml=0][rl=0][m=0][tl=0]
>>> >>>>>>>> Regridding map 0...".
>>> >>>>>>>>
>>> >>>>>>>> Overall, it is in dh::regrid(do_init=true). It spends most of its time
>>> >>>>>>>> in bboxset<int, 3>::normalize() and, specifically, mostly in the loop:
>>> >>>>>>>> for (typename bset::iterator nsi = nbs.begin(); nsi != nbs.end(); ++nsi)
>>> >>>>>>>> The normalize() function does exit, however, so it is not hanging
>>> >>>>>>>> in that function.
>>> >>>>>>>>
>>> >>>>>>>> The core problem seems to be that it takes a long time to execute:
>>> >>>>>>>> boxes  = boxes .shift(-dir) - boxes;
>>> >>>>>>>> in dh::regrid(do_init=true), probably because boxes has 129064 elements.
>>> >>>>>>>> The coarse grid is now only 30^3 and I've left the regrid box size at 4.
>>> >>>>>>>> I'd think, then, that the coarse grid should have a maximum of 30^3/4^3
>>> >>>>>>>> ~ 420 refinement regions.
>>> >>>>>>>>
>>> >>>>>>>> What is the best way to figure out what is going on?
>>> >>>>>>> Hal
>>> >>>>>>>
>>> >>>>>>> Yes, this function is very slow. I did not expect it to be
>>> >>>>>>> prohibitively slow. Are you compiling with optimisation enabled?
>>> >>>>>> I've tried with optimizations enabled (and without for debugging).
>>> >>>>>>
>>> >>>>>>> The bboxset represents the set of refined regions, and it is
>>> >>>>>>> internally represented as a list of bboxes (regions). Carpet performs
>>> >>>>>>> set operations on these (intersection, union, complement, etc.) to
>>> >>>>>>> determine the communication schedule, i.e. which ghost zones of which
>>> >>>>>>> bbox need to be filled from which other bbox. Unfortunately, the
>>> >>>>>>> algorithm used for this is O(n^2) in the number of refined regions,
>>> >>>>>>> and set operations when implemented via lists themselves are O(n^2) in
>>> >>>>>>> the set size, leading to a rather unfortunate overall complexity. The
>>> >>>>>>> only cure is to reduce the number of bboxes (make them larger) and to
>>> >>>>>>> regrid fewer times.
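
To illustrate why a list-based set of boxes scales quadratically, here is a minimal sketch that compares every box against every other box. It is not CarpetLib's bboxset; Box, overlaps and count_overlaps are hypothetical names, and the only point is that n boxes need n*(n-1)/2 comparisons.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a bounding box: inclusive integer corners.
struct Box {
  std::array<int, 3> lo;
  std::array<int, 3> hi;
};

// Two boxes overlap if their index ranges intersect in every dimension.
bool overlaps(const Box& a, const Box& b) {
  for (int d = 0; d < 3; ++d) {
    assert(a.lo[d] <= a.hi[d] && b.lo[d] <= b.hi[d]);
    if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
  }
  return true;
}

struct PairStats {
  std::size_t comparisons;  // box-box tests performed
  std::size_t overlapping;  // pairs that actually overlap
};

// With the set stored as a plain list, finding all overlapping pairs means
// testing each box against every later box: n*(n-1)/2 comparisons.
PairStats count_overlaps(const std::vector<Box>& boxes) {
  PairStats s{0, 0};
  for (std::size_t i = 0; i < boxes.size(); ++i) {
    for (std::size_t j = i + 1; j < boxes.size(); ++j) {
      ++s.comparisons;
      if (overlaps(boxes[i], boxes[j])) ++s.overlapping;
    }
  }
  return s;
}
```

Doubling the number of boxes roughly quadruples the comparison count, which is why larger (and therefore fewer) boxes help.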
>>> >>>>>> This is what I suspected, but nevertheless, is there something wrong?
>>> >>>>>> How many boxes do you expect that I should have? The reason that it does
>>> >>>>>> not finish, even with optimizations, is that there are 129K boxes in the
>>> >>>>>> loop (that's at least 16 billion box normalizations?).
>>> >>>>>>
>>> >>>>>> The coarse grid is only 30^3, and the regrid box size is 4, so at
>>> >>>>>> maximum, there should be ~400 level one boxes. Even if some of those
>>> >>>>>> have level 2 boxes, I don't understand how there could be 129K boxes.
>>> >>>>> The refinement structure itself should have one bbox per refined 4^3
>>> >>>>> box, and both CarpetRegrid2 and CarpetLib would try to combine these
>>> >>>>> into fewer boxes where possible, i.e. where one can form rectangles or
>>> >>>>> larger cubes. I would thus expect no more than about 8^2 = 64 bboxes
>>> >>>>> (30/4 rounded up to 8 per direction) on level one.
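
The estimates in this exchange can be checked with a few lines of arithmetic. The sketch below assumes that the figure of 64 comes from rounding 30/4 up to 8 blocks per direction; blocks_per_axis and pairwise_ops are made-up helper names for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Blocks of edge length `block` needed to cover `cells` grid points
// along one axis, rounded up.
std::int64_t blocks_per_axis(std::int64_t cells, std::int64_t block) {
  assert(cells >= 0 && block > 0);
  return (cells + block - 1) / block;
}

// Work for a pass that touches every pair of n boxes, i.e. n^2 operations.
std::int64_t pairwise_ops(std::int64_t n) { return n * n; }
```

For a 30^3 grid and a block size of 4 this gives 8 blocks per axis, so 8^2 = 64 per plane and 8^3 = 512 in total (the volume ratio 30^3/4^3 is about 422). For 129064 boxes, pairwise_ops returns 16657516096, in line with the "at least 16 billion" estimate above.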
>>> >>>> That makes sense. I think that there is a bug somewhere which is causing
>>> >>>> the box set to be much too big. Furthermore, it does not happen on every
>>> >>>> run, only sometimes. When it does not happen, I hit another bug after a
>>> >>>> few coarse timesteps:
>>> >>>>
>>> >>>> I get a range-check exception from std::vector in a call to:
>>> >>>> gh::get_local_component (rl=1, c=8)
>>> >>>> the problem is that this returns:
>>> >>>> local_components_.AT(rl).AT(c);
>>> >>>> and local_components_[1].size() is 8
>>> >>>> The call to get_local_component is coming from ggf::transfer_from_all
>>> >>>> at:
>>> >>>> int const lc2 = h.get_local_component(rl2,c2);
>>> >>>> where c2 is from psend.component.
>>> >>>> So it looks like there is an off-by-one error somewhere.
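
The failure described here can be reproduced in isolation: a vector of size 8 has valid indices 0 to 7, so a checked lookup with c=8 throws. This is a minimal sketch, assuming the AT() calls behave like std::vector::at(); lookup_ok and the nested vector are hypothetical stand-ins, not the gh class.

```cpp
#include <cassert>  // only needed if you assert on the result
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a per-level table of local component numbers.
// Returns true if looking up component c on refinement level rl passes
// std::vector's range check, false if at() throws std::out_of_range.
bool lookup_ok(const std::vector<std::vector<int>>& local_components,
               std::size_t rl, std::size_t c) {
  try {
    (void)local_components.at(rl).at(c);
    return true;
  } catch (const std::out_of_range&) {
    return false;
  }
}
```

With 8 entries on level 1, lookup_ok(table, 1, 7) succeeds and lookup_ok(table, 1, 8) fails, which matches a component number that is one past the last valid index.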
>>> >>> Very strange. This code should be quite solid by now. psend is set in
>>> >>> the file dh.cc in thorn Carpet/CarpetLib; there is one (large) routine
>>> >>> that calculates the communication schedule. Some of the indexing
>>> >>> errors there in the past included confusing the number of components
>>> >>> on different refinement levels, which led to indexing errors such as
>>> >>> the one you describe.
>>> >> The bad component numbers are not coming from:
>>> >> preg.component = tmpncomps.AT(m)++;
>>> >> in Carpet/src/Recompose.cc
>>> >>
>>> >> Where else are the component numbers assigned?
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users at einsteintoolkit.org
>>> http://lists.einsteintoolkit.org/mailman/listinfo/users
>>
>> --
>> Hal Finkel
>> Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>> 1-630-252-0023
>> hfinkel at anl.gov
>>
>
>
>
> --
> Erik Schnetter <schnetter at cct.lsu.edu>   http://www.cct.lsu.edu/~eschnett/
>



-- 
Erik Schnetter <schnetter at cct.lsu.edu>   http://www.cct.lsu.edu/~eschnett/

