<div dir="ltr">Regarding OpenMP:<div><br></div><div>Cactus usually tries to optimize which threads run on which cores. If there are multiple independent processes running on a node, then Cactus must not do that, since this would slow down both applications a lot. In particular, you must not set "CACTUS_SET_THREAD_BINDINGS=1" (not setting it, or setting it to 0, is fine).</div><div><br></div><div>As Ian mentioned, you can either build without OpenMP support or choose to use a single thread at run time; either will work.</div><div><br></div><div>At this point, it might be best to post your complete setup, i.e. all the options and scripts you are using to configure, build, submit, and run, so that others can have a look and cross-check.</div><div><br></div><div>Of course, all of this is independent of any NaNs you encounter. Those are still due to a bug. I don't think it makes much sense to debug this at the moment -- instead, you will want to run larger simulations.</div><div><br></div><div>-erik</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Feb 23, 2017 at 4:34 AM, Einstein Toolkit <span dir="ltr"><<a href="mailto:trac-noreply@einsteintoolkit.org" target="_blank">trac-noreply@einsteintoolkit.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">#2008: NaNs when running static tov on >40 cores<br>
</span><span class="">------------------------------------+---------------------------------------<br>
Reporter: allgwy001@… | Owner: knarf<br>
Type: defect | Status: assigned<br>
Priority: major | Milestone:<br>
Component: Carpet | Version: ET_2016_05<br>
Resolution: | Keywords:<br>
------------------------------------+---------------------------------------<br>
<br>
</span>Comment (by hinder):<br>
<br>
Replying to [comment:6 allgwy001@…]:<br>
<span class=""> > Thank you very much for looking into it!<br>
><br>
> We can't seem to disable OpenMP in the compile. However, it's<br>
effectively disabled by means of OMP_NUM_THREADS=1. Since "OpenMP does not<br>
play nicely with other software, especially in the hybridized domain of<br>
combining OpenMPI and OpenMP, where multiple users share nodes", they are<br>
not willing to enable it for me.<br>
<br>
</span> I have never run on a system where multiple users share nodes; that model<br>
doesn't fit well with the sort of application that Cactus is. You<br>
don't want to be worrying about whether other processes are competing with<br>
you for memory, memory bandwidth, or cores. When you have exclusive<br>
access to each node, OpenMP is usually a good idea. By the way: what sort<br>
of interconnect do you have? Gigabit ethernet, or infiniband, or<br>
something else? If users are sharing nodes, then I suspect that this<br>
cluster is gigabit ethernet only, and you may be limited to small jobs,<br>
since the performance of gigabit ethernet will quickly become your<br>
bottleneck. What cluster are you using? From your email address, I'm<br>
guessing that it is one of the ones at <a href="http://hpc.uct.ac.za" rel="noreferrer" target="_blank">http://hpc.uct.ac.za</a>? If so, or<br>
if you are using a similar scheduler, then you should be able to reserve<br>
whole nodes, as described in their documentation:<br>
<br>
NB3: If your software prefers to use all cores on a computer then make<br>
sure that you reserve these cores. For example running on an 800 series<br>
server which has 8 cores per server change the directive line in your<br>
script as follows:<br>
<br>
#PBS -l nodes=1:ppn=8:series800<br>
<br>
Once you are the exclusive user of a node, I don't see a problem with<br>
enabling OpenMP. Also note: OpenMP is not something that needs to be<br>
enabled by the system administrator; it is determined by your compilation<br>
flags (on by default in the ET) and activated with OMP_NUM_THREADS. Is it<br>
possible that there was some confusion, and the admins were talking about<br>
hyperthreading instead? That is a very different thing, and I agree that<br>
you probably don't want it enabled (it would indeed have to be enabled by<br>
the admins).<br>
<span class=""><br>
><br>
> Excluding boundary points and symmetry points, I find 31 evolved points<br>
in each spatial direction for the most refined region. That gives 29 791<br>
points in three dimensions. For each of the other four regions I find<br>
25 695 points; that's 132 571 in total. Does Carpet provide any output I<br>
can use to verify this?<br>
<br>
</span> Carpet provides a lot of output :) You may get something useful by<br>
setting<br>
<br>
CarpetLib::output_bboxes = "yes"<br>
<br>
On the most refined region, making the approximation that the domain is<br>
divided into N identical cubical regions, then for N = 40, you would have<br>
29791/40 = 745 ~ 9^3^, so about 9 evolved points in each direction. The<br>
volume of ghost plus evolved points would be (9+3+3)^3^ = 15^3^, so the<br>
number of ghost points is 15^3^ - 9^3^, and the ratio of ghost to evolved<br>
points is (15^3^ - 9^3^)/9^3^ = (15/9)^3^ - 1 = 3.6. So you have 3.6<br>
times as many points being communicated as you have being evolved.<br>
Especially if the interconnect is only gigabit ethernet, I'm not surprised<br>
that the scaling flattens off by this point. Note that if you use OpenMP,<br>
this ratio will be much smaller, because OpenMP threads communicate using<br>
shared memory, not ghost zones. Essentially, you will have fewer<br>
processes, each with a larger cube, and multiple threads working on that<br>
cube in shared memory.<br>
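The ghost-to-evolved arithmetic above can be double-checked with a short script. This is only a sketch of the estimate made in this comment: it assumes the same cubical per-process approximation and 3 ghost zones per face, and the helper name `ghost_ratio` is made up for illustration:<br>

```python
# Sketch: ratio of ghost points to evolved points for a cubical
# per-process domain with n evolved points per direction and
# g ghost zones per face (g = 3, as assumed in the comment above).
def ghost_ratio(n, g=3):
    total = (n + 2 * g) ** 3   # evolved plus ghost points
    return (total - n ** 3) / n ** 3

# 29791 evolved points split over 40 processes -> about 9^3 each
n = round((29791 / 40) ** (1 / 3))
print(n)                         # -> 9
print(round(ghost_ratio(n), 1))  # -> 3.6
```

So at 40 processes, each process communicates roughly 3.6 points for every point it evolves, which is why the scaling flattens.<br>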
<span class=""><br>
> My contact writes the following: "I fully understand the [comments about<br>
the scalability]. However we see a similar decrement with<br>
cctk_final_time=1000 [he initially tested with smaller times] and hence I<br>
would assume a larger solution space. Unless your problem is what is<br>
called embarrassingly parallel you will always be faced with a<br>
communication issue."<br>
><br>
> This is incorrect, right? My understanding is that the size of the<br>
solution space should remain the same regardless of cctk_final_time.<br>
<br>
</span> Yes - it looks like your contact doesn't realize that the code is iterative,<br>
and cctk_final_time simply determines the number of iterations. In order to<br>
test with a larger problem size, you would need to reduce CoordBase::dx,<br>
dy and dz, so that there is higher overall resolution, and hence more<br>
points. I would expect the scalability to improve with larger problem<br>
sizes.<br>
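The distinction can be illustrated with a quick calculation. This is a sketch with made-up numbers: a domain extent of 30 is chosen only so that the coarse spacing reproduces the 31 points per direction (29791 points) quoted earlier in this ticket:<br>

```python
# Sketch: the problem size depends on the grid spacing (CoordBase::dx,
# dy, dz), not on cctk_final_time, which only sets how long you iterate.
def points_per_direction(extent, dx):
    return int(extent / dx) + 1

coarse = points_per_direction(30.0, 1.0) ** 3  # 31^3 = 29791 points
fine = points_per_direction(30.0, 0.5) ** 3    # 61^3 points
print(coarse)         # -> 29791
print(fine / coarse)  # roughly 8x the work per iteration
```

Halving the spacing multiplies the number of points by roughly 8, whereas doubling cctk_final_time leaves the problem size unchanged and simply runs twice as many iterations.<br>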
<span class="HOEnZb"><font color="#888888"><br>
--<br>
Ticket URL: <<a href="https://trac.einsteintoolkit.org/ticket/2008#comment:7" rel="noreferrer" target="_blank">https://trac.einsteintoolkit.org/ticket/2008#comment:7</a>><br>
</font></span><div class="HOEnZb"><div class="h5">Einstein Toolkit <<a href="http://einsteintoolkit.org" rel="noreferrer" target="_blank">http://einsteintoolkit.org</a>><br>
The Einstein Toolkit<br>
_______________________________________________<br>
Trac mailing list<br>
<a href="mailto:Trac@einsteintoolkit.org">Trac@einsteintoolkit.org</a><br>
<a href="http://lists.einsteintoolkit.org/mailman/listinfo/trac" rel="noreferrer" target="_blank">http://lists.einsteintoolkit.org/mailman/listinfo/trac</a><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">Erik Schnetter <<a href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>><br><a href="http://www.perimeterinstitute.ca/personal/eschnetter/" target="_blank">http://www.perimeterinstitute.ca/personal/eschnetter/</a></div>
</div>