[ET Trac] [Einstein Toolkit] #2008: NaNs when running static tov on >40 cores

Thu Feb 23 03:34:08 CST 2017

#2008: NaNs when running static tov on >40 cores
------------------------------------+---------------------------------------
  Reporter:  allgwy001@…            |       Owner:  knarf     
      Type:  defect                 |      Status:  assigned  
  Priority:  major                  |   Milestone:            
 Component:  Carpet                 |     Version:  ET_2016_05
Resolution:                         |    Keywords:            
------------------------------------+---------------------------------------

Comment (by hinder):

 Replying to [comment:6 allgwy001@…]:
 > Thank you very much for looking into it!
 >
 > We can't seem to disable OpenMP in the compile. However, it's
 > effectively disabled by means of OMP_NUM_THREADS=1. Since "OpenMP does not
 > play nicely with other software, especially in the hybridized domain of
 > combining OpenMPI and OpenMP, where multiple users share nodes", they are
 > not willing to enable it for me.

 I have never run on a system where multiple users share nodes; it doesn't
 really fit very well with the sort of application that Cactus is.  You
 don't want to be worrying about whether other processes are competing with
 you for memory, memory bandwidth, or cores.  When you have exclusive
 access to each node, OpenMP is usually a good idea.  By the way: what sort
 of interconnect do you have?  Gigabit Ethernet, InfiniBand, or something
 else?  If users are sharing nodes, then I suspect that this cluster is
 Gigabit Ethernet only, and you may be limited to small jobs, since the
 interconnect will quickly become your bottleneck.  What cluster are you
 using?  From your email address, I'm guessing that it is one of the ones
 at http://hpc.uct.ac.za?  If so, or if you are using a similar scheduler,
 then you should be able to reserve whole nodes, as described in their
 documentation:

     NB3: If your software prefers to use all cores on a computer then make
     sure that you reserve these cores. For example running on an 800 series
     server which has 8 cores per server change the directive line in your
     script as follows:

       #PBS -l nodes=1:ppn=8:series800

 Once you are the exclusive user of a node, I don't see a problem with
 enabling OpenMP.  Also note: OpenMP is not something that needs to be
 enabled by the system administrator; it is enabled by your compilation
 flags (on by default in the ET), and the number of threads is set with
 OMP_NUM_THREADS.  Is it possible that there was some confusion, and the
 admins were talking about hyperthreading instead?  That is a very
 different thing, and I agree you probably don't want it enabled (it would
 have to be enabled by the admins).

 >
 > Excluding boundary points and symmetry points, I find 31 evolved points
 > in each spatial direction for the most refined region. That gives 29 791
 > points in three dimensions. For each of the other four regions I find
 > 25 695 points; that's 132 571 in total. Does Carpet provide any output I
 > can use to verify this?

 Carpet provides a lot of output :)  You may get something useful by
 setting

   CarpetLib::output_bboxes = "yes"

 On the most refined region, if we approximate the domain as being divided
 into N identical cubical regions, then for N = 40 each process would have
 29791/40 = 745 ~ 9^3 points, so about 9 evolved points in each direction.
 With 3 ghost points on each face, the volume of ghost plus evolved points
 would be (9+3+3)^3 = 15^3, so the number of ghost points is 15^3 - 9^3,
 and the ratio of ghost to evolved points is (15^3 - 9^3)/9^3 =
 (15/9)^3 - 1 = 3.6.  So you have 3.6 times as many points being
 communicated as you have being evolved.  Especially if the interconnect
 is only Gigabit Ethernet, I'm not surprised that the scaling flattens off
 by this point.  Note that if you use OpenMP, this ratio will be much
 smaller, because OpenMP threads communicate using shared memory, not
 ghost zones.  Essentially you will have fewer processes, each with a
 larger cube, and multiple threads working on that cube in shared memory.
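
 The estimate above can be reproduced with a few lines of Python.  This is
 only a rough sketch under the same approximation (identical cubical blocks,
 3 ghost points on each face); the function name and the choice of 8 threads
 per process are illustrative assumptions, not anything Carpet provides or
 reports.

```python
def ghost_to_evolved_ratio(total_points, n_procs, ghost_width=3):
    """Ghost/evolved point ratio for one of n_procs identical cubical blocks."""
    # Evolved points per direction in each block (not rounded to an integer).
    side = (total_points / n_procs) ** (1.0 / 3.0)
    # Each block is padded by ghost_width points on both faces in each direction.
    padded = side + 2 * ghost_width
    return (padded / side) ** 3 - 1.0


TOTAL_POINTS = 31 ** 3       # evolved points on the most refined region
CORES = 40
THREADS_PER_PROCESS = 8      # hypothetical: one process per 8-core node

# One MPI process per core: 40 small blocks.
pure_mpi = ghost_to_evolved_ratio(TOTAL_POINTS, CORES)
# Same 40 cores, but 5 processes with 8 OpenMP threads each: 5 larger blocks.
hybrid = ghost_to_evolved_ratio(TOTAL_POINTS, CORES // THREADS_PER_PROCESS)

print(f"pure MPI, {CORES} processes: ghost/evolved = {pure_mpi:.2f}")
print(f"hybrid, {CORES // THREADS_PER_PROCESS} processes: ghost/evolved = {hybrid:.2f}")
```

 The real decomposition will not consist of perfect cubes, so treat the
 numbers as a guide to the trend rather than a prediction.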

 > My contact writes the following: "I fully understand the [comments about
 > the scalability]. However we see a similar decrement with
 > cctk_final_time=1000 [he initially tested with smaller times] and hence I
 > would assume a larger solution space. Unless your problem is what is
 > called embarrassingly parallel you will always be faced with a
 > communication issue."
 >
 > This is incorrect, right? My understanding is that the size of the
 > solution space should remain the same regardless of cctk_final_time.

 Yes - it looks like your contact doesn't know that the code is iterative:
 cctk_final_time only sets the time at which the evolution stops, and hence
 the number of iterations, not the number of grid points.  In order to test
 with a larger problem size, you would need to reduce CoordBase::dx, dy and
 dz, so that there is higher overall resolution, and hence more points.  I
 would expect the scalability to improve with larger problem sizes.
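
 To illustrate why a larger problem should scale better, here is a rough
 Python sketch of the ghost-to-evolved ratio on 40 processes as the
 resolution is increased.  It assumes identical cubical blocks with 3 ghost
 points on each face, and that dividing dx, dy and dz by a factor simply
 multiplies the 31 evolved points per direction by that factor; the function
 name and the refinement factors are illustrative only.

```python
def ratio_after_refinement(points_per_dir, refinement, n_procs, ghost_width=3):
    """Ghost/evolved ratio per process after dividing the grid spacing by `refinement`."""
    # Assumption: the number of points per direction scales with 1/dx.
    total_points = (points_per_dir * refinement) ** 3
    # Evolved points per direction in each process's cubical block.
    side = (total_points / n_procs) ** (1.0 / 3.0)
    padded = side + 2 * ghost_width
    return (padded / side) ** 3 - 1.0


# Refinement factor 1 is the current setup; 2 and 4 mean dx/2 and dx/4.
ratios = {f: ratio_after_refinement(31, f, 40) for f in (1, 2, 4)}

for factor, ratio in ratios.items():
    print(f"dx divided by {factor}: ghost/evolved = {ratio:.2f}")
```

 The ratio falls as the blocks grow, because the ghost zones form a shell of
 fixed thickness around each block while the evolved volume grows with the
 cube of the block size.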

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/2008#comment:7>
Einstein Toolkit <http://einsteintoolkit.org>