[Users] memory leak in Carpet?

Wed Aug 8 06:04:58 CDT 2018

> On 8 Aug 2018, at 11:41, Miguel Zilhão <miguel.zilhao.nogueira at tecnico.ulisboa.pt> wrote:
> 
> hi Ian,
> 
>> The memory seems to reach a steady state by iteration ~3000.  Can you run an example where it dies with an OOM?
> 
> the OOM cases that i had were done in our local cluster (where i haven't compiled with tcmalloc); those were just higher resolution versions of this same simulation, where the OOM would be triggered around the time of one of these memory increases (ie, after iteration 2000 in this case, i'd guess).

Hi Miguel,

The memory problems are very likely tied to the machine you run on, so I don't think we can learn much from a smaller test run on a different machine. We already see from this run that Carpet is not "leaking" memory continuously; the curves for allocated memory show what has been malloced and not freed, and that remains more or less constant after the initial phase.

So this means that you have never seen the OOM happen when using tcmalloc.  I think it's worth trying to get tcmalloc running on the cluster: it's possible that the improved memory allocation in tcmalloc over glibc would entirely solve the problem.

>> Can you check this by plotting tcmalloc::generic_current_allocated + tcmalloc::pageheap_free against systemstatistics-process_memory::maxrss?  If that is the case, then there is no issue with fragmentation, because even though the address space is fragmented, the "holes" have mostly been returned to the OS for other processes to use ("unmapped").
> 
> sure, i've attached a plot with this.

Sorry, I made a mistake: it should have been pageheap_unmapped, not pageheap_free.  pageheap_free is essentially zero, and cannot account for the difference.
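
As a rough illustration of the corrected check, here is a minimal Python sketch. It assumes the three quantities (generic_current_allocated, pageheap_unmapped and maxrss) have already been read into equal-length lists in the same units; the function name rss_gap and the numbers are made up for illustration and are not taken from the run. If the gap stays small and roughly constant, allocated + unmapped accounts for maxrss.

```python
def rss_gap(allocated, unmapped, maxrss):
    """For each sample, return maxrss - (allocated + unmapped).

    All three inputs are equal-length sequences in the same units.
    """
    if not (len(allocated) == len(unmapped) == len(maxrss)):
        raise ValueError("all series must have the same length")
    return [r - (a + u) for a, u, r in zip(allocated, unmapped, maxrss)]


# Hypothetical values in MB, purely illustrative.
allocated = [1000, 1200, 1200]
unmapped = [0, 300, 500]
maxrss = [1050, 1550, 1750]

for i, gap in enumerate(rss_gap(allocated, unmapped, maxrss)):
    print(f"sample {i}: maxrss - (allocated + unmapped) = {gap} MB")
```

How the series are loaded from the scalar output files is deliberately left out, since that depends on the output format in use.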

>> The point that Roland made also applies here: we are looking at the max across all processes and assuming that every process is the same.  It's possible that one process has a high unmapped curve, but another has a high rss curve, and we don't see this on the plot.  We would have to do 1D output of the grid arrays and plot each process separately to see the full detail.  One way to see if this is necessary would be to plot both the max and min instead of just the max.  That way, we can see if this is likely to be an issue.
> 
> ok, i'm attaching another plot with both the min (dashed lines) and the max (full lines) plotted. i hope it helps.

Thanks.  This shows that the gridfunction usage is more or less similar across all processes, which is good.  However, there is significant variation in most of the other quantities across processes.  To understand this better, we would have to look at 1D ASCII output of the grid arrays, which is a bit painful to plot in gnuplot.  Before this, I would definitely try to get tcmalloc running and outputting this information on the cluster in a run that actually shows the OOM.  My guess is that you won't get an OOM with tcmalloc, and all will be fine :)
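
Before going to per-process output, the min and max curves alone already give a rough measure of how much the processes differ. A minimal Python sketch, assuming the min and max reductions of one quantity are available as equal-length lists; the name relative_spread and the sample values are hypothetical:

```python
def relative_spread(min_vals, max_vals):
    """Return (max - min) / max for each sample.

    0.0 means all processes agree; values near 1.0 mean at least one
    process uses far less than the largest one.
    """
    if len(min_vals) != len(max_vals):
        raise ValueError("min and max series must have the same length")
    spread = []
    for lo, hi in zip(min_vals, max_vals):
        # Avoid dividing by zero when the quantity is zero everywhere.
        spread.append(0.0 if hi == 0 else (hi - lo) / hi)
    return spread


# Hypothetical min/max values of one quantity over three samples.
print(relative_spread([90, 50, 0], [100, 100, 0]))
```

A small spread suggests the max curve is representative; a large one suggests the per-process output is needed to see what is going on.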

-- 
Ian Hinder
https://ianhinder.net
