[ET Trac] [Einstein Toolkit] #2008: NaNs when running static tov on >40 cores

Einstein Toolkit trac-noreply at einsteintoolkit.org
Thu Feb 23 13:49:26 CST 2017


#2008: NaNs when running static tov on >40 cores
------------------------------------+---------------------------------------
  Reporter:  allgwy001@…            |       Owner:  knarf     
      Type:  defect                 |      Status:  assigned  
  Priority:  major                  |   Milestone:            
 Component:  Carpet                 |     Version:  ET_2016_05
Resolution:                         |    Keywords:            
------------------------------------+---------------------------------------

Comment (by allgwy001@…):

 Thanks Ian!

 Yes, we're running on the HEX cluster at the UCT HPC facility. The admin
 agrees that node sharing isn't desirable, but UCT is in the middle of a
 financial crisis and we simply can't afford dedicated access to a single
 node right now.

 We use a 56 Gb/s InfiniBand interconnect, but since the admin ran his
 tests on a single node, it shouldn't matter for those results.

 He tried running on two nodes at 19:30 this evening and found that the IB
 traffic (bottom-right panel of the graph below) was minimal.

 [[Image(traffic.png)]]

 He's also well aware of the difference between OpenMP and hyperthreading
 (which is disabled on all HPC nodes), as well as of the OMP_NUM_THREADS
 environment variable.

 Replying to [comment:7 hinder]:
 > Replying to [comment:6 allgwy001@…]:
 > > Thank you very much for looking into it!
 > >
 > > We can't seem to disable OpenMP in the compile. However, it's
 effectively disabled by means of OMP_NUM_THREADS=1. Since "OpenMP does not
 play nicely with other software, especially in the hybridized domain of
 combining OpenMPI and OpenMP, where multiple users share nodes", they are
 not willing to enable it for me.
 >
 > I have never run on a system where multiple users share nodes; it
 doesn't really fit very well with the sort of application that Cactus is.
 You don't want to be worrying about whether other processes are competing
 with you for memory, memory bandwidth, or cores.  When you have exclusive
 access to each node, OpenMP is usually a good idea.  By the way: what sort
 of interconnect do you have?  Gigabit ethernet, or infiniband, or
 something else?  If users are sharing nodes, then I suspect that this
 cluster is gigabit ethernet only, and you may be limited to small jobs,
 since the performance of gigabit ethernet will quickly become your
 bottleneck.  What cluster are you using?  From your email address, I'm
 guessing that it is one of the ones at http://hpc.uct.ac.za?  If so, or if
 you are using a similar scheduler, then you should be able to reserve a
 full node, as described in their documentation:
 >
 >     NB3: If your software prefers to use all cores on a computer then
 make sure that you reserve these cores. For example running on an 800
 series server which has 8 cores per server change the directive line in
 your script as follows:
 >
 >       #PBS -l nodes=1:ppn=8:series800
 >
 > Once you are the exclusive user of a node, I don't see a problem with
 enabling OpenMP.  Also note: OpenMP is not something that needs to be
 enabled by the system administrator; it is determined by your compilation
 flags (on by default in the ET) and activated with OMP_NUM_THREADS.  Is it
 possible that there was some confusion, and the admins were talking about
 hyperthreading instead, which is a very different thing, and which I agree
 you probably don't want to have enabled (it would have to be enabled by
 the admins)?
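
 Just to check that I understand correctly: once we can reserve a full
 node, a job script along these lines should do what you describe (the
 walltime and the executable and parameter file names are placeholders for
 our setup):

      #PBS -l nodes=1:ppn=8:series800
      #PBS -l walltime=12:00:00

      cd $PBS_O_WORKDIR
      # one MPI process on the node, 8 OpenMP threads in shared memory
      export OMP_NUM_THREADS=8
      mpirun -np 1 ./cactus_sim static_tov.par   # placeholder names
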
 >
 > >
 > > Excluding boundary points and symmetry points, I find 31 evolved
 points in each spatial direction for the most refined region. That gives
 29 791 points in three dimensions. For each of the other four regions I
 find 25 695 points; that's 132 571 in total. Does Carpet provide any
 output I can use to verify this?
 >
 > Carpet provides a lot of output :)  You may get something useful by
 setting
 >
 >   CarpetLib::output_bboxes = "yes"
 >
 > On the most refined region, making the approximation that the domain is
 divided into N identical cubical regions, then for N = 40, you would have
 29791/40 = 745 ~ 9^3^, so about 9 evolved points in each direction.  The
 volume of ghost plus evolved points would be (9+3+3)^3^ = 15^3^, so the
 number of ghost points is 15^3^ - 9^3^, and the ratio of ghost to evolved
 points is (15^3^ - 9^3^)/9^3^ = (15/9)^3^ - 1 = 3.6.  So you have 3.6
 times as many points being communicated as you have being evolved.
 Especially if the interconnect is only gigabit ethernet, I'm not surprised
 that the scaling flattens off by this point.  Note that if you use OpenMP,
 this ratio will be much smaller, because OpenMP threads communicate using
 shared memory, not ghost zones.  Essentially you will have fewer
 processes, each with a larger cube, and multiple threads working on that
 cube in shared memory.
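
 That is a very clear explanation, thanks.  To check the same estimate for
 other process counts, a quick one-liner using your assumptions (cubical
 per-process domains, 3 ghost zones per face, 31^3^ evolved points on the
 finest level) would be something like:

      awk -v n=40 'BEGIN { e = int((31^3/n)^(1/3) + 0.5); g = 3;
                           t = e + 2*g;            # evolved + ghost zones
                           printf "%.1f\n", (t^3 - e^3)/e^3 }'

 which prints 3.6 for n=40, matching your figure.
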
 >
 > > My contact writes the following: "I fully understand the [comments
 about the scalability]. However we see a similar decrement with
 cctk_final_time=1000 [he initially tested with smaller times] and hence I
 would assume a larger solution space. Unless your problem is what is
 called embarrassingly parallel you will always be faced with a
 communication issue."
 > >
 > > This is incorrect, right? My understanding is that the size of the
 solution space should remain the same regardless of cctk_final_time.
 >
 > Yes - it looks like your contact doesn't realize that the code is
 iterative, and cctk_final_time only determines how many iterations are
 run, not the size of the grid.  In
 order to test with a larger problem size, you would need to reduce
 CoordBase::dx, dy and dz, so that there is higher overall resolution, and
 hence more points.  I would expect the scalability to improve with larger
 problem sizes.
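
 Understood.  So to test with a larger problem I would, for example, halve
 the grid spacings in the parameter file, something like this (the numbers
 are purely illustrative, not our current values):

      # halve an illustrative spacing of 0.5 in each direction
      CoordBase::dx = 0.25
      CoordBase::dy = 0.25
      CoordBase::dz = 0.25

 which should give roughly eight times as many grid points overall.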

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/2008#comment:8>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

