<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On 23 Jul 2018, at 15:00, Miguel Zilhão &lt;<a href="mailto:miguel.zilhao.nogueira@tecnico.ulisboa.pt" class="">miguel.zilhao.nogueira@tecnico.ulisboa.pt</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">hi Ian and all,<br class=""><br class=""><blockquote type="cite" class=""><blockquote type="cite" class=""><blockquote type="cite" class="">This could be caused by memory fragmentation due to all the freeing and mallocing that happens during regridding when the sizes of the grids change. &nbsp;Can you try using tcmalloc or jemalloc instead of glibc malloc and reporting back? &nbsp;One workaround could be to run shorter simulations (i.e. set a walltime of 12 h instead of 24 h).<br class=""></blockquote><br class="">thanks for your reply. in one of my cases, for the resolution used and the available memory, i was out of memory quite quickly -- within 6 hours or so... so unfortunately it becomes a bit impractical for large simulations...<br class=""><br class="">what would i need to do in order to use tcmalloc or jemalloc?<br class=""></blockquote>I have used tcmalloc. &nbsp;I think you will need the following:<br class="">- Install tcmalloc (<a href="https://github.com/gperftools/gperftools" class="">https://github.com/gperftools/gperftools</a>), and libunwind, which it depends on.<br class="">- In your optionlist, link with tcmalloc. &nbsp;I have<br class="">LDFLAGS = -rdynamic&nbsp;-L/home/ianhin/software/gperftools-2.1/lib -Wl,-rpath,/home/ianhin/software/gperftools-2.1/lib -ltcmalloc<br class="">This should be sufficient I think for tcmalloc to be used instead of glibc malloc. &nbsp;Try this out, and see if things are better. 
&nbsp;I also have a thorn which hooks into the tcmalloc API. &nbsp;You can get it from<br class=""></blockquote><br class="">thanks a lot for these pointers. i've tried it out, though i've used tcmalloc from ubuntu's repositories and therefore compiled ET with -ltcmalloc_minimal. i don't know whether this makes a difference, but from the trial run that i'm doing i so far seem to see the same memory increase i had seen before...<br class=""></div></div></blockquote><div><br class=""></div><div>Hi,</div><div><br class=""></div><div>Did you use the tcmalloc thorn and the parameters to make it release memory back to the OS after each regridding?</div><br class=""><blockquote type="cite" class=""><div class=""><div class="">is there anything else that can be tried to try to pinpoint this issue? it seems to be serious... i looked for an open ticket but didn't find anything. shall i submit one?<br class=""></div></div></blockquote><div><br class=""></div></div><div class="">You can look at the Carpet::memory_procs variable (in Carpet/Carpet/interface.ccl). &nbsp;This will tell you how much memory Carpet has allocated for different things. &nbsp;If one of these is growing during the run, but resets to a lower value after checkpoint recovery, then that suggests a memory leak.</div><div class=""><br class=""></div><div class="">Hmm. &nbsp;Now this is coming back to me. &nbsp;I just searched my email, and I found a draft email that I never sent from 2015 with subject "Memory leak in Carpet?". &nbsp;Here it is:</div><div class=""><br class=""></div><div class=""><blockquote type="cite" class=""><div style="font-family: Menlo-Regular; font-size: 11px;" class="">Does anyone have any reason to suspect that something in Carpet might be leaking memory? 
&nbsp;I have a fairly straightforward QC0 simulation, based on qc0-mclachlan, and it looks like something is leaking memory.</div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><br class=""></div><div style="font-family: Menlo-Regular; font-size: 11px;" class="">I have looked at several diagnostics. &nbsp;</div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><br class=""></div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><span class="Apple-tab-span" style="white-space: pre;">        </span>–&nbsp;carpet::grid_functions: this contains the amount of memory Carpet thinks is allocated for grid functions. &nbsp;Since I have regridding, this is not in general going to be constant, but it turns out that it remains approximately constant at about 5 GB per process (average and maximum across processes are about the same). &nbsp;Each process has 12 GB available on Datura. &nbsp;So I should be well within the memory limits of the machine, by more than a factor of 2.</div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><br class=""></div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><span class="Apple-tab-span" style="white-space: pre;">        </span>– The run crashes with signal 9 during regridding at some point; this is probably the OOM killer.</div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><span class="Apple-tab-span" style="white-space: pre;">        </span></div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><span class="Apple-tab-span" style="white-space: pre;">        </span>– SystemStatistics::swap_used_mb starts to grow after the first regridding, and seems to grow linearly throughout the run. 
&nbsp;The crash time corresponds to it hitting about 18 GB, which is the maximum swap configured on the node.</div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><br class=""></div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><span class="Apple-tab-span" style="white-space: pre;">        </span>–&nbsp;SystemStatistics::arena: This is the 'arena' field from mallinfo (<a href="http://man7.org/linux/man-pages/man3/mallinfo.3.html" class="">http://man7.org/linux/man-pages/man3/mallinfo.3.html</a>) which is supposed to give 'Non-mmapped space allocated (bytes)'. &nbsp;This suffers from being stored in a 32 bit integer in the mallinfo structure (<a href="https://sourceware.org/ml/libc-alpha/2014-11/msg00431.html" class="">https://sourceware.org/ml/libc-alpha/2014-11/msg00431.html</a>), but I have adjusted it manually by adding 4 GB when it looks like it is dropping unphysically. &nbsp;This shows that the amount of non-mmapped memory allocated is increasing on the order of 1 GB on each regridding, and not between regriddings. &nbsp;Firstly, the amount of memory allocated shouldn't be increasing so much, since carpet::grid_functions remains approximately constant. &nbsp;Secondly, I thought that we were supposed to be using mmap for grid function data, so why do we have such a large amount of non-mmapped memory?</div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><br class=""></div><div style="font-family: Menlo-Regular; font-size: 11px;" class=""><span class="Apple-tab-span" style="white-space: pre;">        </span>– SystemStatistics::hblkhd: This is the hblkhd field from mallinfo, which is supposed to be "Space allocated in mmapped regions (bytes)". 
&nbsp;This increases to about 2 GB after a couple of regriddings, but then stays roughly constant at 2 GB, which seems fine, apart from the fact that I had hoped that mmap was being used for all the gridfunction data</div></blockquote></div><div class=""><br class=""></div><div class="">I suspect I got distracted doing my own investigations while I was writing it, and then lost track of it. &nbsp;Typically, I think I run with excess memory for each run, so that by 24 h, it hasn't grown too badly.</div><div class=""><br class=""></div><div class="">Further searching finds an email from someone else saying that they had memory leak problems and asking me about it. &nbsp;My reply was:</div><div class=""><br class=""></div><div class="">On 5 May 2015, at 17:14, Ian Hinder &lt;<a href="mailto:ian.hinder@aei.mpg.de" class="">ian.hinder@aei.mpg.de</a>&gt; wrote:</div><div class=""><br class=""></div><div class=""><blockquote type="cite" class=""><div class="">Hi,</div><div class=""><br class=""></div><div class="">I have added Erik and Roland to CC, as we have been discussing this; I hope this is OK.</div><div class=""><br class=""></div><div class="">It sounds very similar. &nbsp;I am running on several hundred cores (&lt;600) and the simulations often fail with OOM errors after less than a day. &nbsp;First the RSS grows, then the swap, then the OOM-killer kills it. &nbsp;I have observed this both on Datura and Hydra. &nbsp;Stopping the simulations and recovering from a checkpoint usually fixes the problem, and it runs on for another half day or so. &nbsp;I have done a fair amount of work on this, so I will summarise here.</div><div class=""><br class=""></div><div class=""><b class="">Monitor process RSS</b></div><div class=""><br class=""></div><div class="">The process resident set size is the amount of address space which is currently mapped into physical memory by the OS. 
&nbsp;Thorn SystemStatistics can be used to measure this and put it into a Cactus variable, which can then be reduced across processes. &nbsp;I use:</div><div class=""><br class=""></div><div class="">IOBasic::outInfo_every &nbsp; &nbsp; &nbsp;= 1<br class="">IOBasic::outInfo_reductions = "maximum"<br class="">IOBasic::outInfo_vars &nbsp; &nbsp; &nbsp; = "<br class="">&nbsp; SystemStatistics::maxrss_mb<br class="">&nbsp; SystemStatistics::swap_used_mb<br class=""><div class="">&nbsp; Carpet::gridfunctions</div>"</div><div class=""><br class=""></div><div class="">and</div><div class=""><br class=""></div><div class=""><div class="">IOScalar::outScalar_every = 128</div><div class="">IOScalar::outScalar_vars &nbsp;= "</div><div class="">&nbsp; SystemStatistics::process_memory_mb</div><div class="">&nbsp; Carpet::memory_procs</div><div class="">"</div></div><div class=""><br class=""></div><div class="">SystemStatistics calls its variable "maxrss" but it should actually be called "rss", as that is what is output. &nbsp;maxrss is also available from the OS, and would give the maximum the RSS had ever been during the process lifetime.</div><div class=""><br class=""></div><div class="">Carpet::gridfunctions (in Carpet::memory_procs) measures the amount of memory Carpet has allocated in gridfunctions. &nbsp;For me, this remains essentially flat, whereas maxrss grows after each regridding until it reaches the maximum available, then the swap starts to grow. &nbsp;This indicates that the problem is not due to Carpet allocating more and more grid points due to grids changing size. 
&nbsp;It could be due to failing to free allocated memory (a leak) or freed data taking up space which cannot be used for further allocations or returned to the OS (fragmentation).</div><div class=""><br class=""></div><div class=""><b class="">Terminate and checkpoint on OOM</b></div><div class=""><br class=""></div><div class="">I have a local change to SystemStatistics which adds parameters for maximum values of RSS and swap usage, above which it calls CCTK_TerminateNext, so if you have checkpoint_on_terminate, you get a clean termination and can continue the run without losing too much CPU time. &nbsp;I have been running with this for a couple of weeks now, and it works as advertised. &nbsp;I have a branch with this on, but I just realised it conflicts with a change Erik made. &nbsp;If you want this, let me know and I will sort it out.</div><div class=""><br class=""></div><div class=""><b class="">Memory profiling</b></div><div class=""><br class=""></div><div class="">The malloc implementation in glibc provides no usable statistics. &nbsp;mallinfo is limited to 32 bit integers, which overflow for 64 bit systems. &nbsp;malloc_info, at least in the version on datura, doesn't include memory allocated via mmap. &nbsp;Useless. Instead, you need to use an external memory profiler to see what is going on. &nbsp;I have used "igprof" successfully, and this shows me that there is no "leak" of allocated memory corresponding to the increase in RSS. &nbsp;i.e. the problem is not caused by forgetting to free something. &nbsp;This suggests that the problem is fragmentation, where malloc has unallocated blocks of memory which it does not or cannot return to the OS. &nbsp;Malloc allocates memory in two ways: either in its main heap, or by allocating anonymous mmap regions. &nbsp;I had thought that only the latter could be fully returned to the OS, but this is not true. 
&nbsp;Any region of address space can be marked as unused (internally via the&nbsp;madvise(MADV_DONTNEED) system call) and a malloc implementation can do this on regions of its address space which have been freed. &nbsp;If such regions are too small (smaller than a page), then they could accumulate and not be returned to the OS.</div><div class=""><br class=""></div><div class=""><b class="">Alternative malloc implementations</b></div><div class=""><br class=""></div><div class="">At the suggestion of Roland, I tried the tcmalloc library (<a href="http://gperftools.googlecode.com/git/doc/tcmalloc.html" class="">http://gperftools.googlecode.com/git/doc/tcmalloc.html</a>), a drop-in replacement for glibc malloc that is part of gperftools. &nbsp;This works fairly easily. &nbsp;You can compile the "minimal" version with no dependencies and then modify your optionlist:</div><div class=""><br class=""></div><div class=""><div class="">CPPFLAGS = -I/home/rhaas/software/gperftools-2.1/include/gperftools</div><div class="">LDFLAGS =&nbsp;-L/home/rhaas/software/gperftools-2.1/lib -Wl,-rpath,/home/rhaas/software/gperftools-2.1/lib -ltcmalloc_minimal</div><br class=""></div><div class="">I found in one example case that this reduced the growth of the process RSS, so I am now using it for all my simulations. &nbsp;However, I still run into the same problem eventually, so it may improve things without solving the problem completely.</div><div class=""><br class=""></div><div class="">tcmalloc has an introspection interface which is presumably not as useless as glibc's:&nbsp;<a href="http://gperftools.googlecode.com/git/doc/tcmalloc.html" class="">http://gperftools.googlecode.com/git/doc/tcmalloc.html</a>. 
I haven't tried this yet.</div><div class=""><br class=""></div><div class=""><b class="">Checkpoint recovery</b></div><div class=""><br class=""></div><div class="">I noticed from the igprof profile that there are 11000 allocations (and frees) during checkpoint recovery on one process, all from the HDF5 uncompression routine. &nbsp;This is the "deflate" filter. &nbsp;When it decompresses a dataset, it allocates a buffer, initially sized the same as the compressed dataset (really dumb, as it will always need to be bigger). &nbsp;It then uncompresses into the buffer, "realloc"ing the buffer to twice the size each time it runs out of space. &nbsp;You can imagine that this might cause a lot of fragmentation. &nbsp;There is no tunable parameter, but we could modify the code (it's in&nbsp;<a href="https://svn.hdfgroup.uiuc.edu/hdf5/tags/hdf5-1_8_12/src/H5Zdeflate.c" class="">https://svn.hdfgroup.uiuc.edu/hdf5/tags/hdf5-1_8_12/src/H5Zdeflate.c</a>) to use a much larger starting buffer size, in the hope that this reduces the number of reallocs, and hence the amount of fragmentation. &nbsp;This wouldn't help the accumulated RSS, but it would probably produce a one-off decrease in the amount of fragmentation. &nbsp;I am currently not using periodic checkpointing, so I don't know if the compression routine has the same problem. &nbsp;Probably not, since it knows the output buffer size has to be smaller than the input buffer size. &nbsp;Apparently Frank Löffler also modified this routine, which solved some of his problems of running out of memory during recovery. 
&nbsp;Another alternative would be to disable checkpoint compression.</div><div class=""><br class=""></div><div class="">To see if you are suffering from the same problem, I think the quickest way would be to link against tcmalloc and use&nbsp;<span style="background-color: rgb(255, 255, 255);" class="">MallocExtension::instance()-&gt;GetNumericProperty(property_name, value) from tcmalloc to read off the&nbsp;</span>generic.current_allocated_bytes property (Number of bytes used by the application. This will not typically match the memory use reported by the OS, because it does not include TCMalloc overhead or memory fragmentation). &nbsp;You could also look at the other properties they provide. &nbsp;Then compare this with the process RSS from SystemStatistics and with Carpet's gridfunctions variable, to check whether you actually have a memory leak or are suffering from fragmentation. &nbsp;</div></blockquote></div><div class=""><br class=""></div><div class="">This brings to mind another question: are you using HDF5 compression? &nbsp;If so, do you see the same problem if you switch it off?</div><div class=""><br class=""></div><div class="">And finally: do you get this growth on the very first segment of a simulation, or only on subsequent segments? &nbsp;I am thinking that checkpoint *recovery* severely fragments the heap, especially with compression, and this somehow causes growth in the RSS with each regridding.</div><div class=""><br class=""></div><div class="">So, while I had forgotten about all this, it turns out that I had actually thought quite a lot about it :)</div><div class=""><br class=""></div><div class="">
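<div class="">The comparison described in the quoted 2015 emails (allocator-reported bytes against process RSS, plus the manual 4 GB correction for the wrapping mallinfo 'arena' field) could be done offline on the scalar output roughly as follows. &nbsp;This is only a sketch under stated assumptions: the function names unwrap_32bit and classify_growth and the 256 MB tolerance are made up for illustration and are not part of SystemStatistics, Carpet or tcmalloc, and the input lists are assumed to have already been read from the output files and converted to the stated units.</div><div class=""><br class=""></div>
<pre class="">
```python
def unwrap_32bit(samples):
    """Undo 32 bit wrap-around in a series of counter samples (bytes).

    Assumes the true change between consecutive samples is smaller than
    2 GB in magnitude, so any larger apparent jump is treated as a wrap.
    """
    wrap = 2 ** 32
    half = wrap // 2
    if not samples:
        return []
    unwrapped = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        # Map the raw difference into the range [-half, half).
        delta = (cur - prev + half) % wrap - half
        unwrapped.append(unwrapped[-1] + delta)
    return unwrapped


def classify_growth(allocated_mb, rss_mb, tolerance_mb=256.0):
    """Compare growth of allocator-reported memory with growth of RSS.

    allocated_mb: memory the application holds according to the allocator
                  (for example tcmalloc's generic.current_allocated_bytes),
                  in MB, one sample per output step.
    rss_mb:       process RSS in MB at the same output steps.
    tolerance_mb: hypothetical threshold below which growth counts as noise.
    """
    if len(allocated_mb) != len(rss_mb) or len(rss_mb) == 0:
        raise ValueError("need two non-empty series of equal length")
    allocated_growth = allocated_mb[-1] - allocated_mb[0]
    rss_growth = rss_mb[-1] - rss_mb[0]
    if tolerance_mb >= rss_growth:
        return "no significant growth"
    if allocated_growth >= rss_growth - tolerance_mb:
        # Allocated memory grows as fast as RSS: something is not freed.
        return "leak"
    if tolerance_mb >= allocated_growth:
        # Allocated memory is flat while RSS grows: freed memory is not
        # being reused or returned to the OS.
        return "fragmentation"
    return "mixed"


if __name__ == "__main__":
    print(classify_growth([5000.0, 5050.0, 5020.0], [5500.0, 7000.0, 9000.0]))
```
</pre>
<div class="">With a flat allocated series and a growing RSS series, as in the example call, this reports "fragmentation"; if both series grow together it reports "leak".</div><div class=""><br class=""></div>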

<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">--&nbsp;<br class="">Ian Hinder<br class=""><a href="https://ianhinder.net" class="">https://ianhinder.net</a><br class=""></div></div>


</div>

<br class=""></body></html>