[Users] strong scaling tests on Stampede (TACC)

Fri Feb 22 07:05:10 CST 2013

On 22 Feb 2013, at 02:31, Bruno Giacomazzo <bruno.giacomazzo at jila.colorado.edu> wrote:

> Hi,
> 	I did some strong scaling tests on Stampede (https://www.xsede.org/tacc-stampede) with both Whisky and GRHydro (using the development version of ET in both cases). I used a Carpet par file that Roberto DePietri provided me and that he used for similar tests on an Italian cluster (I have attached the GRHydro version, the Whisky one is similar except for using Whisky instead of GRHydro). I used both Intel MPI and Mvapich and I did both pure MPI and MPI/OpenMP runs.
> 
> 	I have attached a text file with my results. The first column is the name of the run (if it starts with mvapich it used mvapich otherwise it used Intel MPI), the second one is the number of cores (option --procs in simfactory), the third one the number of threads (--num-threads), the fourth one the time in seconds spent in CCTK_EVOL, and the fifth one the walltime in seconds (i.e., the total time used by the run as measured on the cluster). I have also attached a couple of figures that show CCTK_EVOL vs #cores and walltime vs #cores (only for Intel MPI runs).
> 
> 	First of all, in pure MPI runs (--num-threads=1) I was unable to run on more than 1024 cores using Intel MPI (the run was just crashing before iteration zero or hanging up).

Could it have been running out of memory? Alternatively, some clusters require you to set special job options before running jobs with very large numbers of processes, or to use mpirun in a special way.  The Stampede user guide should have details on this. Having said that, no one should be using pure MPI these days anyway.

> No problem instead when using --num-threads=8 or --num-threads=16. I also noticed that scaling was particularly bad in pure MPI runs and that a lot of time was spent outside CCTK_EVOL (both with Intel MPI and MVAPICH). After speaking with Roberto, I found out that the problem is due to 1D ASCII output (which is active in that parfile) and that makes the runs particularly slow above ~100 cores on this machine.

Yes, this makes sense.  Each process sends all the data to be output to a single process, which then writes the file.  So the number of processes involved in the communication will be much larger (by a factor of 6 or 12) if you are using pure MPI.  Since it is 1D, you are probably dominated by the number of communicating processes, and associated latency, rather than the actual volume of data being transferred.

> In plot_scaling_walltime_all.pdf I plot also two pure MPI runs, but without 1D ASCII output and the scaling is much better in this case (the time spent in CCTK_EVOL is identical to the case with 1D output and hence I didn't plot them in the other figure). I didn't try using 1D hdf5 output instead, does anyone use it?

Yes, I use 1D, 2D and 3D HDF5 output, and it works very well.  I only enable ASCII output for testsuites and debugging.  HDF5 output is vastly superior: the size of the data is much smaller due to being binary and avoiding extra columns, HDF5 compression can be enabled which reduces it further, and the main reason: HDF5 output is seekable, so I can access iteration 1024 without having to first read and parse iterations 0 to 1023.  I don't need a separate "post-processing" phase; I read from the simulation output files directly.

Regarding the scaling issue you mentioned above, 1D HDF5 and 1D ASCII output use the same method for communicating the data to a single process, so I don't expect HDF5 to improve the scaling.  The amount of data transferred and the number of processes involved will be the same.  The only thing that will improve is the size of the data on the disk, and hence the time taken to write it.

> 	According to my tests, --num-threads=16 performs better than --num-threads=8 (which is the current default value in simfactory) and Intel MPI seems to be better than MVAPICH. Is there a particular reason for using 8 instead of 16 threads as the default simfactory value on Stampede?

Stampede has two processors per node, each with 8 cores.  They say "The memory subsystem has 4 channels from each processor's memory controller to 4 DDR3 ECC DIMMS,", which suggests to me that the memory is not associated with each processor (uniform memory architecture; UMA).  The usual reason for using one process per processor is that each processor has fast access to only half the memory (nonuniform memory architecture; NUMA).  See http://cactuscode.org/pipermail/users/2013-February/thread.html#3295 for a discussion of the situation on SuperMUC.  Now, I read that Stampede and SuperMUC are using exactly the same processor, and on SuperMUC, apparently it is NUMA.  However, the processors might be deployed in a different way on Stampede.  The best approach to investigate this is to enable the thorn hwloc (see that thread) and see what it reports on startup; it should say if the architecture is UMA or NUMA.

I would be careful with interpreting benchmark results.  To answer this question definitively, you should run on a single node and look at the timers (all processes) for a single routine which doesn't do any communication; e.g. ML_BSSN_RHS2; for a small number (e.g. 10) of iterations.  The processor decomposition and load balancing changes when changing the number of threads might have more of an impact than the memory access speed, and might be different from one grid setup to the next.  Similarly, the relation of the local grid size and shape to the cache characteristics of the core might also have a big impact.

Another thing to watch out for: it is essential for good performance to ensure that each thread is bound to a single CPU core ("CPU affinity").  Thorn hwloc will do this automatically; if you don't use it, you rely on the system, or the run scripts, to do this for you, and YMMV.

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder