[Users] [ET Trac] [Einstein Toolkit] #2008: NaNs when running static tov on >40 cores
schnetter at cct.lsu.edu
Thu Feb 23 08:47:34 CST 2017
Cactus usually tries to optimize which threads run on which cores. If there
are multiple independent processes running on a node, then Cactus must not
do that, since this will slow down both applications a lot. In particular,
you must not set "CACTUS_SET_THREAD_BINDINGS=1" (not setting it or setting
it to 0 is fine).
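As a minimal sketch of the settings above, assuming the run is started from a Python wrapper: the helper name shared_node_env is hypothetical and not part of Cactus; only the two environment variable names come from this thread.

```python
import os


def shared_node_env(base_env=None):
    """Return a copy of the environment for a node shared with other jobs.

    Hypothetical helper for illustration; not part of Cactus or simfactory.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Leaving the variable unset (or setting it to 0) is fine, so drop it
    # and do not let Cactus pin threads to cores on a shared node.
    env.pop("CACTUS_SET_THREAD_BINDINGS", None)
    # Run a single OpenMP thread per MPI process.
    env["OMP_NUM_THREADS"] = "1"
    return env


if __name__ == "__main__":
    env = shared_node_env()
    print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])
    print("bindings set:", "CACTUS_SET_THREAD_BINDINGS" in env)
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the MPI command.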
As Ian mentioned, you can either build without OpenMP support or choose to
use a single thread at run time; both will work.
At this time, it might be best to post your complete setup, i.e. all the
options and scripts you are using to configure and build and submit and
run, so that others can have a look and cross-check.
Of course, all of this is independent of any NaNs you encounter. Those are
still due to a bug. I don't think it makes much sense at this point to
debug this -- instead, you will want to run a larger simulation.
On Thu, Feb 23, 2017 at 4:34 AM, Einstein Toolkit <
trac-noreply at einsteintoolkit.org> wrote:
> #2008: NaNs when running static tov on >40 cores
> Reporter: allgwy001@… | Owner: knarf
> Type: defect | Status: assigned
> Priority: major | Milestone:
> Component: Carpet | Version: ET_2016_05
> Resolution: | Keywords:
> Comment (by hinder):
> Replying to [comment:6 allgwy001@…]:
> > Thank you very much for looking into it!
> > We can't seem to disable OpenMP in the compile. However, it's
> effectively disabled by means of OMP_NUM_THREADS=1. Since "OpenMP does not
> play nicely with other software, especially in the hybridized domain of
> combining OpenMPI and OpenMP, where multiple users share nodes", they are
> not willing to enable it for me.
> I have never run on a system where multiple users share nodes; it doesn't
> really fit very well with the sort of application that Cactus is. You
> don't want to be worrying about whether other processes are competing with
> you for memory, memory bandwidth, or cores. When you have exclusive
> access to each node, OpenMP is usually a good idea. By the way: what sort
> of interconnect do you have? Gigabit ethernet, or infiniband, or
> something else? If users are sharing nodes, then I suspect that this
> cluster is gigabit ethernet only, and you may be limited to small jobs,
> since the performance of gigabit ethernet will quickly become your
> bottleneck. What cluster are you using? From your email address, I'm
> guessing that it is one of the ones at http://hpc.uct.ac.za? If so, or
> you are using a similar scheduler, then you should be able to do this, as
> in their documentation:
> NB3: If your software prefers to use all cores on a computer then make
> sure that you reserve these cores. For example running on an 800 series
> server which has 8 cores per server change the directive line in your
> script as follows:
> #PBS -l nodes=1:ppn=8:series800
> Once you are the exclusive user of a node, I don't see a problem with
> enabling OpenMP. Also note: OpenMP is not something that needs to be
> enabled by the system administrator; it is determined by your compilation
> flags (on by default in the ET) and activated with OMP_NUM_THREADS. Is it
> possible that there was a confusion, and the admins were talking about
> hyperthreading instead, which is a very different thing, and which I agree
> you probably don't want to have enabled (it would have to be enabled by
> the admins)?
> > Excluding boundary points and symmetry points, I find 31 evolved points
> in each spatial direction for the most refined region. That gives 29 791
> points in three dimensions. For each of the other four regions I find
> 25 695 points; that's 132 571 in total. Does Carpet provide any output I
> can use to verify this?
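The arithmetic in the question can be checked quickly. This is only a sketch that re-adds the numbers quoted above (31 points per direction on the finest region, 25 695 points on each of the other four); it does not read anything from Carpet, and the function name is made up for illustration.

```python
def total_evolved_points(finest_per_dim=31, coarser_points=25695, n_coarser=4):
    """Return (points on the finest region, total over all five regions)."""
    finest = finest_per_dim ** 3
    return finest, finest + n_coarser * coarser_points


finest, total = total_evolved_points()
print(finest, total)  # 29791 132571
```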
> Carpet provides a lot of output :) You may get something useful by
> CarpetLib::output_bboxes = "yes"
> On the most refined region, making the approximation that the domain is
> divided into N identical cubical regions, then for N = 40, you would have
> 29791/40 = 745 ~ 9^3, so about 9 evolved points in each direction. The
> volume of ghost plus evolved points would be (9+3+3)^3 = 15^3, so the
> number of ghost points is 15^3 - 9^3, and the ratio of ghost to evolved
> points is (15^3 - 9^3)/9^3 = (15/9)^3 - 1 = 3.6. So you have 3.6
> times as many points being communicated as you have being evolved.
> Especially if the interconnect is only gigabit ethernet, I'm not surprised
> that the scaling flattens off by this point. Note that if you use OpenMP,
> this ratio will be much smaller, because OpenMP threads communicate using
> shared memory, not ghost zones. Essentially you will have fewer
> processes, each with a larger cube, and multiple threads working on that
> cube in shared memory.
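The estimate above can be reproduced with a few lines of Python. This is a rough sketch under the same approximation (N identical cubes, 3 ghost points on every face, including faces on the outer boundary); the function name is hypothetical, and the split into 10 processes with 4 threads each is only an illustrative choice, not a recommendation from the ticket.

```python
def ghost_to_evolved_ratio(points, n_procs, ghost_width=3):
    """Estimate (edge length, ghost/evolved ratio) for one cubical block."""
    per_proc = points / n_procs
    # Edge length of a cube holding roughly per_proc evolved points.
    edge = round(per_proc ** (1.0 / 3.0))
    # Ghost zones are added on both sides in each direction.
    full = edge + 2 * ghost_width
    return edge, (full ** 3 - edge ** 3) / edge ** 3


if __name__ == "__main__":
    finest_points = 31 ** 3  # 29791 evolved points on the finest region
    for n_procs in (40, 10):  # 40 MPI processes vs. 10 processes x 4 threads
        edge, ratio = ghost_to_evolved_ratio(finest_points, n_procs)
        print(f"N={n_procs}: edge ~{edge}, ghost/evolved ~{ratio:.1f}")
```

With fewer, larger blocks per process the ratio drops, which is the effect described for OpenMP above.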
> > My contact writes the following: "I fully understand the [comments about
> the scalability]. However we see a similar decrement with
> cctk_final_time=1000 [he initially tested with smaller times] and hence I
> would assume a larger solution space. Unless your problem is what is
> called embarrassingly parallel you will always be faced with a
> communication issue."
> > This is incorrect, right? My understanding is that the size of the
> solution space should remain the same regardless of cctk_final_time.
> Yes - it looks like your contact doesn't know that the code is iterative,
> and cctk_final_time simply sets the final simulation time, and hence the
> number of iterations, not the size of the grid. In order to
> test with a larger problem size, you would need to reduce CoordBase::dx,
> dy and dz, so that there is higher overall resolution, and hence more
> points. I would expect the scalability to improve with larger problem
> sizes.
> Ticket URL: <https://trac.einsteintoolkit.org/ticket/2008#comment:7>
Erik Schnetter <schnetter at cct.lsu.edu>