[Users] memory leak in Carpet?

ian.hinder at aei.mpg.de
Fri Aug 10 06:37:13 CDT 2018



> On 8 Aug 2018, at 12:38, Miguel Zilhão <miguel.zilhao.nogueira at tecnico.ulisboa.pt> wrote:
> 
> hi Ian,
>> The memory problems are very likely strongly related to the machine you run on.  I don't know that we can take much information from a smaller test run on a different machine. We already see from this run that Carpet is not "leaking" memory continuously; the curves for allocated memory show what has been malloced and not freed, and it remains more or less constant after the initial phase.
>> I think it's worth trying to get tcmalloc running on the cluster.  So this means that you have never seen the OOM happen when using tcmalloc.  It's possible that the improved memory allocation in tcmalloc over glibc would entirely solve the problem.
> 
> well, i did have cases where i'd run out of memory on my workstation with tcmalloc as well (where i've been doing these tests), with this same configuration and higher resolution. i don't have an OOM-killer on the workstation, though, so at some point the system would just start to swap (at which point i'd kill the job).

OK.

>> Sorry, I made a mistake.  It should have been pageheap_unmapped, not pageheap_free.  Sorry!   pageheap_free is essentially zero, and cannot account for the difference.
> 
> ah, no problem. i'm attaching the updated plot.

Good, that looks better.  So we see that the rss mostly follows the sum of allocated and unmapped memory.  One thing I have seen in the past is that a high rss is not necessarily an indication of a problem.  Even though the OS hasn't yet removed the pages from the process's address space, that memory is available to be reclaimed if another process (or the current process) needs it.  I suspect that the saturation around iteration ~3000 is the point at which all the processes have accumulated a lot of unmapped memory, and the OS has to start actually reclaiming it, which stops the rss from growing any further.
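
For reference, the curves in the plot correspond to tcmalloc's MallocExtension statistics, and they can be queried from inside the process if you ever want to cross-check the numbers against the rss.  A minimal sketch of what I mean (assuming the executable is linked against gperftools' tcmalloc; the function name is just for illustration):

  #include <cstdio>
  #include <gperftools/malloc_extension.h>

  // Print the tcmalloc statistics discussed above: "allocated" is what the
  // application has malloced and not freed, "pageheap_free" is free memory
  // still mapped into the process, and "unmapped" is memory tcmalloc has
  // returned to the OS but which may still show up in the rss.
  void print_tcmalloc_stats()
  {
    size_t allocated = 0, pageheap_free = 0, unmapped = 0;
    MallocExtension::instance()->GetNumericProperty(
        "generic.current_allocated_bytes", &allocated);
    MallocExtension::instance()->GetNumericProperty(
        "tcmalloc.pageheap_free_bytes", &pageheap_free);
    MallocExtension::instance()->GetNumericProperty(
        "tcmalloc.pageheap_unmapped_bytes", &unmapped);
    std::printf("allocated=%zu pageheap_free=%zu unmapped=%zu\n",
                allocated, pageheap_free, unmapped);
  }

The sum of those three should roughly track the heap part of the rss while the OS hasn't reclaimed the unmapped pages.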

>>>> The point that Roland made also applies here: we are looking at the max across all processes and assuming that every process is the same.  It's possible that one process has a high unmapped curve, but another has a high rss curve, and we don't see this on the plot.  We would have to do 1D output of the grid arrays and plot each process separately to see the full detail.  One way to see if this is necessary would be to plot both the max and min instead of just the max.  That way, we can see if this is likely to be an issue.
>>> 
>>> ok, i'm attaching another plot with both the min (dashed lines) and the max (full lines) plotted. i hope it helps.
>> Thanks.  This shows that the gridfunction usage is more or less similar across all processes, which is good.  However, there is significant variation in most of the other quantities across processes.   To understand this better, we would have to look at 1D ASCII output of the grid arrays, which is a bit painful to plot in gnuplot.  Before this, I would definitely try to get tcmalloc running and outputting this information on the cluster in a run that actually shows the OOM.  My guess is that you won't get an OOM with tcmalloc, and all will be fine :)
> 
> ok, i could also try to do this on cluster once it's back online (currently it's down for maintenance).

OK. I'll be interested to see the results when you have them.  The thing to look out for is generic_current_allocated growing.
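
If it is easy to hook in, a per-rank log of that quantity against iteration number would make any growth obvious.  A rough sketch of what I mean (the file name and call site are hypothetical; one file per MPI rank, appended to once per iteration):

  #include <cstdio>
  #include <mpi.h>
  #include <gperftools/malloc_extension.h>

  // Append "<iteration> <allocated bytes>" to a per-rank file so that
  // generic.current_allocated_bytes can be plotted separately for each process.
  void log_allocated_bytes(int iteration)
  {
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t allocated = 0;
    MallocExtension::instance()->GetNumericProperty(
        "generic.current_allocated_bytes", &allocated);

    char fname[64];
    std::snprintf(fname, sizeof fname, "tcmalloc_allocated_rank%04d.asc", rank);
    if (std::FILE *f = std::fopen(fname, "a")) {
      std::fprintf(f, "%d %zu\n", iteration, allocated);
      std::fclose(f);
    }
  }

If that number keeps climbing from iteration to iteration on any rank, that would point to a genuine leak rather than allocator behaviour.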

-- 
Ian Hinder
https://ianhinder.net
