[Users] memory leak in Carpet?

Miguel Zilhão miguel.zilhao.nogueira at tecnico.ulisboa.pt
Sun Jul 29 15:12:18 CDT 2018


Hi Ian,

Thanks again for your thorough reply! I've checked the low-resolution run that I
have (and which ran successfully), and the pattern I observe is that maxrss
typically grows after each regridding. The increase in maxrss is not monotonic,
though, as it does go down at times. Then, after the BHs merge and the
regridding stops, maxrss settles down (to a value still a bit higher than the
maxrss at t=0). This is all for the first segment of the simulation, as I
haven't done any checkpointing for this run.

So I guess I don't have enough data to conclude whether what I'm observing is
indeed a memory leak, or whether it's just Carpet's regridding algorithm doing
what it's supposed to do. In any case, what surprises me is that, at times, the
memory consumption can be larger than it was at the beginning of the run by a
factor of 1.6, which can easily lead to an out-of-memory situation...

Unfortunately, these days I don't have the time to investigate this any
further...

Many thanks,
Miguel

On 23/07/2018 16:06, ian.hinder at aei.mpg.de wrote:
> 
> 
>> On 23 Jul 2018, at 15:00, Miguel Zilhão <miguel.zilhao.nogueira at tecnico.ulisboa.pt> wrote:
>>
>> hi Ian and all,
>>
>>>>> This could be caused by memory fragmentation due to all the freeing and mallocing that happens 
>>>>> during regridding when the sizes of the grids change.  Can you try using tcmalloc or jemalloc 
>>>>> instead of glibc malloc and reporting back?  One workaround could be to run shorter simulations 
>>>>> (i.e. set a walltime of 12 h instead of 24 h).
>>>>
>>>> Thanks for your reply. In one of my cases, for the resolution used and the available memory, I 
>>>> ran out of memory quite quickly -- within 6 hours or so... so unfortunately it becomes a bit 
>>>> impractical for large simulations...
>>>>
>>>> What would I need to do in order to use tcmalloc or jemalloc?
>>> I have used tcmalloc.  I think you will need the following:
>>> - Install tcmalloc (https://github.com/gperftools/gperftools), and libunwind, which it depends on.
>>> - In your optionlist, link with tcmalloc.  I have
>>> LDFLAGS = -rdynamic -L/home/ianhin/software/gperftools-2.1/lib 
>>> -Wl,-rpath,/home/ianhin/software/gperftools-2.1/lib -ltcmalloc
>>> This should be sufficient I think for tcmalloc to be used instead of glibc malloc.  Try this out, 
>>> and see if things are better.  I also have a thorn which hooks into the tcmalloc API.  You can 
>>> get it from
>>
>> Thanks a lot for these pointers. I've tried it out, though I used tcmalloc from Ubuntu's 
>> repositories and therefore compiled the ET with -ltcmalloc_minimal. I don't know whether this makes 
>> a difference, but in the trial run that I'm doing I so far see the same memory increase I had seen 
>> before...
> 
> Hi,
> 
> Did you use the tcmalloc thorn and the parameters to make it release memory back to the OS after 
> each regridding?
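> 
> For reference, the mechanism behind "releasing memory back to the OS" is tcmalloc's MallocExtension 
> API.  As a rough sketch of what such a thorn needs to do after each regridding (the routine name and 
> schedule bin here are invented; only the MallocExtension call is the actual gperftools API):
> 
> #include "cctk.h"
> #include "cctk_Arguments.h"
> #include <gperftools/malloc_extension.h>
> 
> // Hypothetical routine, scheduled e.g. in CCTK_POSTREGRID, that asks tcmalloc
> // to return as much of its free memory as possible to the operating system.
> extern "C" void TCMallocHooks_ReleaseMemory(CCTK_ARGUMENTS)
> {
>   // no grid variables needed; just hand the free lists back to the OS
>   MallocExtension::instance()->ReleaseFreeMemory();
> }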
> 
>> Is there anything else that could be done to pinpoint this issue? It seems to be serious... 
>> I looked for an open ticket but didn't find anything. Shall I submit one?
> 
> You can look at the Carpet::memory_procs variable (in Carpet/Carpet/interface.ccl).  This will tell 
> you how much memory Carpet has allocated for different things.  If one of these is growing during 
> the run, but resets to a lower value after checkpoint recovery, then that suggests a memory leak.
> 
> Hmm.  Now this is coming back to me.  I just searched my email, and I found a draft email that I 
> never sent from 2015 with subject "Memory leak in Carpet?".  Here it is:
> 
>> Does anyone have any reason to suspect that something in Carpet might be leaking memory?  I have a 
>> fairly straightforward QC0 simulation, based on qc0-mclachlan, and it looks like something is 
>> leaking memory.
>>
>> I have looked at several diagnostics.
>>
>> – Carpet::gridfunctions: this contains the amount of memory Carpet thinks is allocated for grid 
>> functions.  Since I have regridding, this is not in general going to be constant, but it turns out 
>> that it remains approximately constant at about 5 GB per process (average and maximum across 
>> processes are about the same).  Each process has 12 GB available on Datura.  So I should be well 
>> within the memory limits of the machine, by more than a factor of 2.
>>
>> – The run crashes with signal 9 during regridding at some point; this is probably the OOM killer.
>> – SystemStatistics::swap_used_mb starts to grow after the first regridding, and seems to grow 
>> linearly throughout the run.  The crash time corresponds to it hitting about 18 GB, which is the 
>> maximum swap configured on the node.
>>
>> – SystemStatistics::arena: This is the 'arena' field from mallinfo 
>> (http://man7.org/linux/man-pages/man3/mallinfo.3.html) which is supposed to give 'Non-mmapped 
>> space allocated (bytes)'.  This suffers from being stored in a 32 bit integer in the mallinfo 
>> structure (https://sourceware.org/ml/libc-alpha/2014-11/msg00431.html), but I have adjusted it 
>> manually by adding 4 GB when it looks like it is dropping unphysically.  This shows that the 
>> amount of non-mmapped memory allocated is increasing on the order of 1 GB on each regridding, and 
>> not between regriddings.  Firstly, the amount of memory allocated shouldn't be increasing so much, 
>> since Carpet::gridfunctions remains approximately constant.  Secondly, I thought that we were 
>> supposed to be using mmap for grid function data, so why do we have such a large amount of 
>> non-mmapped memory?
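>>
>> (The adjustment is just unwrapping a 32-bit counter into a 64-bit one; I do the correction by hand 
>> on the output, but in code the logic would look roughly like this sketch:)
>>
>> #include <cstdint>
>>
>> // Whenever the raw 32-bit 'arena' value drops unphysically, assume it
>> // wrapped around and add another 4 GB to the running offset.
>> uint64_t unwrap_arena(uint32_t raw)
>> {
>>   static uint64_t offset = 0;
>>   static uint32_t last = 0;
>>   if (raw < last) offset += UINT64_C(1) << 32;   // one 4 GB wrap
>>   last = raw;
>>   return offset + raw;
>> }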
>>
>> – SystemStatistics::hblkhd: This is the hblkhd field from mallinfo, which is supposed to be "Space 
>> allocated in mmapped regions (bytes)".  This increases to about 2 GB after a couple of 
>> regriddings, but then stays roughly constant at 2 GB, which seems fine, apart from the fact that I 
>> had hoped that mmap was being used for all the gridfunction data.
> 
> I suspect I got distracted doing my own investigations while I was writing it, and then lost track 
> of it.  Typically, I think I run with excess memory for each run, so that by 24 h, it hasn't grown 
> too badly.
> 
> Further searching finds an email from someone else saying that they had memory leak problems and 
> asking me about it.  My reply was:
> 
> On 5 May 2015, at 17:14, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
> 
>> Hi,
>>
>> I have added Erik and Roland to CC, as we have been discussing this; I hope this is OK.
>>
>> It sounds very similar.  I am running on several hundred cores (<600) and the simulations often 
>> fail with OOM errors after less than a day.  First the RSS grows, then the swap, then the 
>> OOM-killer kills it.  I have observed this both on Datura and Hydra.  Stopping the simulations and 
>> recovering from a checkpoint usually fixes the problem, and it runs on for another half day or so. 
>>  I have done a fair amount of work on this, so I will summarise here.
>>
>> *Monitor process RSS*
>>
>> The process resident set size is the amount of address space which is currently mapped into 
>> physical memory by the OS.  Thorn SystemStatistics can be used to measure this and put it into a 
>> Cactus variable, which can then be reduced across processes.  I use:
>>
>> IOBasic::outInfo_every      = 1
>> IOBasic::outInfo_reductions = "maximum"
>> IOBasic::outInfo_vars       = "
>>   SystemStatistics::maxrss_mb
>>   SystemStatistics::swap_used_mb
>>   Carpet::gridfunctions
>> "
>>
>> and
>>
>> IOScalar::outScalar_every = 128
>> IOScalar::outScalar_vars  = "
>>   SystemStatistics::process_memory_mb
>>   Carpet::memory_procs
>> "
>>
>> SystemStatistics calls its variable "maxrss" but it should actually be called "rss", as that is 
>> what is output.  maxrss is also available from the OS, and would give the maximum the RSS had ever 
>> been during the process lifetime.
>>
>> Carpet::gridfunctions (in Carpet::memory_procs) measures the amount of memory Carpet has allocated 
>> in gridfunctions.  For me, this remains essentially flat, whereas maxrss grows after each 
>> regridding until it reaches the maximum available, then the swap starts to grow.  This indicates 
>> that the problem is not due to Carpet allocating more and more grid points due to grids changing 
>> size.  It could be due to failing to free allocated memory (a leak) or freed data taking up space 
>> which cannot be used for further allocations or returned to the OS (fragmentation).
>>
>> *Terminate and checkpoint on OOM*
>>
>> I have a local change to SystemStatistics which adds parameters for maximum values of RSS and swap 
>> usage, above which it calls CCTK_TerminateNext, so if you have checkpoint_on_terminate, you get a 
>> clean termination and can continue the run without losing too much CPU time.  I have been running 
>> with this for a couple of weeks now, and it works as advertised.  I have a branch with this on, 
>> but I just realised it conflicts with a change Erik made.  If you want this, let me know and I 
>> will sort it out.
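>>
>> The core of it is only a few lines (a sketch, not the actual branch; the parameter name 
>> maxrss_limit_mb and the routine name are invented):
>>
>> #include <sys/resource.h>
>> #include "cctk.h"
>> #include "cctk_Arguments.h"
>> #include "cctk_Parameters.h"
>>
>> // Hypothetical routine, scheduled periodically: if the process maxrss exceeds a
>> // user-set limit, request a clean termination so that checkpoint_on_terminate
>> // can write a checkpoint before the OOM killer strikes.
>> extern "C" void SystemStatistics_CheckMemoryLimit(CCTK_ARGUMENTS)
>> {
>>   DECLARE_CCTK_ARGUMENTS;
>>   DECLARE_CCTK_PARAMETERS;
>>
>>   struct rusage ru;
>>   getrusage(RUSAGE_SELF, &ru);
>>   const double maxrss_mb = ru.ru_maxrss / 1024.0;   // ru_maxrss is in kB on Linux
>>
>>   if (maxrss_limit_mb > 0 && maxrss_mb > maxrss_limit_mb) {
>>     CCTK_VInfo(CCTK_THORNSTRING, "maxrss %.0f MB exceeds limit; terminating cleanly", maxrss_mb);
>>     CCTK_TerminateNext(cctkGH);
>>   }
>> }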
>>
>> *Memory profiling*
>>
>> The malloc implementation in glibc provides no usable statistics.  mallinfo is limited to 32 bit 
>> integers, which overflow for 64 bit systems.  malloc_info, at least in the version on datura, 
>> doesn't include memory allocated via mmap.  Useless. Instead, you need to use an external memory 
>> profiler to see what is going on.  I have used "igprof" successfully, and this shows me that there 
>> is no "leak" of allocated memory corresponding to the increase in RSS.  i.e. the problem is not 
>> caused by forgetting to free something.  This suggests that the problem is fragmentation, where 
>> malloc has unallocated blocks of memory which it does not or cannot return to the OS.  Malloc 
>> allocates memory in two ways: either in its main heap, or by allocating anonymous mmap regions.  I 
>> had thought that only the latter could be fully returned to the OS, but this is not true.  Any 
>> region of address space can be marked as unused (internally via the madvise(MADV_DONTNEED) system 
>> call) and a malloc implementation can do this on regions of its address space which have been 
>> freed.  If such regions are too small (smaller than a page), then they could accumulate and not be 
>> returned to the OS.
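>>
>> As an aside, if you want to test the fragmentation hypothesis without switching allocators, glibc 
>> does provide malloc_trim(), which returns what free memory it can to the OS (on newer glibc this 
>> includes interior free pages, released via madvise).  Calling something like the following after 
>> each regridding is a cheap experiment: if the RSS drops noticeably, the growth is freed-but-retained 
>> memory rather than a true leak (a sketch):
>>
>> #include <malloc.h>   // glibc extension; not portable
>>
>> // Ask glibc malloc to hand back to the OS whatever free memory it can.
>> void release_free_heap_to_os(void)
>> {
>>   malloc_trim(0);
>> }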
>>
>> *Alternative malloc implementations*
>>
>> At the suggestion of Roland, I tried using the tcmalloc 
>> (http://gperftools.googlecode.com/git/doc/tcmalloc.html) library, a drop-in replacement for glibc 
>> malloc that is part of gperftools.  This works fairly easily.  You can compile the 
>> "minimal" version with no dependencies and then modify your optionlist:
>>
>> CPPFLAGS = -I/home/rhaas/software/gperftools-2.1/include/gperftools
>> LDFLAGS = -L/home/rhaas/software/gperftools-2.1/lib 
>> -Wl,-rpath,/home/rhaas/software/gperftools-2.1/lib -ltcmalloc_minimal
>>
>> I found in one example case that this reduced the process RSS growth, so I am now using it for all 
>> my simulations.  However, I still run into the same problem eventually, so it might be that it 
>> makes it better but doesn't solve it completely.
>>
>> tcmalloc has an introspection interface which is presumably not as useless as glibc's malloc: 
>> http://gperftools.googlecode.com/git/doc/tcmalloc.html. I haven't tried this yet.
>>
>> *Checkpoint recovery*
>>
>> I noticed from the igprof profile that there are 11000 allocations (and frees) during checkpoint 
>> recovery on one process, all from the HDF5 decompression routine.  This is the "deflate" filter. 
>>  When it decompresses a dataset, it allocates a buffer, initially sized the same as the compressed 
>> dataset (really dumb, as it will always need to be bigger).  It then uncompresses into the buffer, 
>> "realloc"ing the buffer to twice the size each time it runs out of space.  You can imagine that 
>> this might cause a lot of fragmentation.  There is no tunable parameter, but we could modify the 
>> code (it's in https://svn.hdfgroup.uiuc.edu/hdf5/tags/hdf5-1_8_12/src/H5Zdeflate.c) to use a much 
>> larger starting buffer size, in the hope that this reduces the number of reallocs, and hence the 
>> amount of fragmentation.  This wouldn't help the accumulated RSS, but it would probably produce a 
>> one-off decrease in the amount of fragmentation.  I am currently not using periodic checkpointing, 
>> so I don't know if the compression routine has the same problem.  Probably not, since it knows the 
>> output buffer size has to be smaller than the input buffer size.  Apparently Frank Löffler also 
>> modified this routine, which solved some of his problems of running out of memory during recovery. 
>>  Another alternative would be to disable checkpoint compression.
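>>
>> To make the pattern concrete, this is roughly what the filter does (a sketch, not the actual HDF5 
>> code; inflate_into stands in for the zlib calls).  Each doubling realloc may move the buffer and 
>> leave the old block behind as a free hole, which is where the fragmentation comes from:
>>
>> #include <cstddef>
>> #include <cstdlib>
>>
>> // Stand-in for the zlib inflate loop: returns bytes written, or -1 if
>> // 'cap' was too small (hypothetical helper, for illustration only).
>> long inflate_into(const void *in, size_t in_size, void *out, size_t cap);
>>
>> void *decompress_chunk(const void *in, size_t in_size, size_t *out_size)
>> {
>>   size_t cap = in_size;                 // initial buffer: the *compressed* size
>>   void *buf = std::malloc(cap);
>>   if (!buf) return nullptr;
>>
>>   long n;
>>   while ((n = inflate_into(in, in_size, buf, cap)) < 0) {
>>     cap *= 2;                           // double the buffer each time it overflows
>>     void *tmp = std::realloc(buf, cap); // may move, leaving a freed block behind
>>     if (!tmp) { std::free(buf); return nullptr; }
>>     buf = tmp;
>>   }
>>   *out_size = (size_t)n;
>>   return buf;   // a larger starting 'cap' would mean far fewer of these cycles
>> }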
>>
>> To see if you are suffering from the same problem, I think the quickest way would be to link 
>> against tcmalloc and use MallocExtension::instance()->GetNumericProperty(property_name, value) 
>> from tcmalloc to read off the generic.current_allocated_bytes property (Number of bytes used by 
>> the application. This will not typically match the memory use reported by the OS, because it does 
>> not include TCMalloc overhead or memory fragmentation).  You could also look at the other 
>> properties they provide.  Then compare this with the process RSS from SystemStatistics, and 
>> Carpet's gridfunctions variable, and check to see if you actually have a memory leak, or if you 
>> are suffering from fragmentation. 
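>>
>> Roughly (a sketch -- where you call this from and how you output it is up to you; the property 
>> names and the MallocExtension calls are the actual tcmalloc API):
>>
>> #include <cstddef>
>> #include <cstdio>
>> #include <gperftools/malloc_extension.h>
>>
>> // Query tcmalloc for the bytes the application has genuinely allocated and
>> // for the free bytes tcmalloc is holding on to (overhead/fragmentation).
>> void print_tcmalloc_stats()
>> {
>>   size_t allocated = 0, pageheap_free = 0;
>>   MallocExtension::instance()->GetNumericProperty("generic.current_allocated_bytes", &allocated);
>>   MallocExtension::instance()->GetNumericProperty("tcmalloc.pageheap_free_bytes", &pageheap_free);
>>   std::printf("tcmalloc: %.1f MB allocated, %.1f MB free in page heap\n",
>>               allocated / (1024.0 * 1024.0), pageheap_free / (1024.0 * 1024.0));
>> }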
> 
> This brings to mind another question: are you using HDF5 compression?  If so, do you see the same 
> problem if you switch it off?
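> 
> (If you are using CarpetIOHDF5, I believe the relevant parameter is compression_level, so switching 
> compression off would be
> 
> IOHDF5::compression_level = 0
> 
> -- but double-check the parameter name against your version.)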
> 
> And finally: do you get this growth on the very first segment of a simulation, or only on subsequent 
> segments?  I am thinking that checkpoint *recovery* severely fragments the heap, especially with 
> compression, and this somehow causes growth in the RSS with each regridding.
> 
> So, while I had forgotten about all this, it turns out that I had actually thought quite a lot about 
> it :)
> 
> -- 
> Ian Hinder
> https://ianhinder.net
> 

