[Users] strong scaling tests on Stampede (TACC)
Ian Hinder
ian.hinder at aei.mpg.de
Thu Sep 19 13:00:16 CDT 2013
On 22 Feb 2013, at 02:31, Bruno Giacomazzo <bruno.giacomazzo at jila.colorado.edu> wrote:
> Hi,
> I did some strong scaling tests on Stampede (https://www.xsede.org/tacc-stampede) with both Whisky and GRHydro (using the development version of the ET in both cases). I used a Carpet par file that Roberto DePietri provided me and that he had used for similar tests on an Italian cluster (I have attached the GRHydro version; the Whisky one is identical apart from using Whisky instead of GRHydro). I used both Intel MPI and MVAPICH, and I ran both pure MPI and hybrid MPI/OpenMP configurations.
>
> I have attached a text file with my results. The first column is the name of the run (if it starts with "mvapich" the run used MVAPICH, otherwise Intel MPI), the second is the number of cores (the --procs option in simfactory), the third the number of threads per process (--num-threads), the fourth the time in seconds spent in CCTK_EVOL, and the fifth the walltime in seconds (i.e., the total time used by the run as measured on the cluster). I have also attached a couple of figures showing CCTK_EVOL time vs. number of cores and walltime vs. number of cores (for the Intel MPI runs only).
>
> First of all, in pure MPI runs (--num-threads=1) I was unable to run on more than 1024 cores with Intel MPI (the run either crashed before iteration zero or hung). There was no such problem when using --num-threads=8 or --num-threads=16. I also noticed that scaling was particularly bad in pure MPI runs and that a lot of time was spent outside CCTK_EVOL (with both Intel MPI and MVAPICH). After speaking with Roberto, I found out that the problem is due to the 1D ASCII output that is active in that parfile, which makes the runs particularly slow above ~100 cores on this machine. In plot_scaling_walltime_all.pdf I also plot two pure MPI runs without 1D ASCII output, and the scaling is much better in that case (the time spent in CCTK_EVOL is identical to the runs with 1D output, so I didn't plot them in the other figure). I didn't try 1D HDF5 output instead; does anyone use it?
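A minimal sketch of what switching the 1D output from ASCII to HDF5 could look like in the parfile; the parameter names assume the standard CarpetIOASCII/CarpetIOHDF5 thorns, and the output frequency and variable list below are only illustrative:

  # Add CarpetIOHDF5 to the parfile's thorn list (CarpetIOASCII is already active)
  ActiveThorns = "CarpetIOHDF5"

  # Switch off the expensive 1D ASCII output ...
  IOASCII::out1D_every = 0

  # ... and write the same 1D data as HDF5 instead
  IOHDF5::out1D_every  = 128
  IOHDF5::out1D_vars   = "HydroBase::rho ADMBase::lapse"
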
>
> According to my tests, --num-threads=16 performs better than --num-threads=8 (which is the current default value in simfactory) and Intel MPI seems to be better than MVAPICH. Is there a particular reason for using 8 instead of 16 threads as the default simfactory value on Stampede?
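For reference, a sketch of the kind of simfactory command such a run corresponds to; the simulation and parfile names are placeholders, and only --procs and --num-threads are the options discussed above:

  # 1024 cores in total, 16 OpenMP threads per MPI process -> 64 MPI ranks
  ./simfactory/bin/sim create-submit bns_scaling_1024 \
      --parfile=GRHydro_scaling.par \
      --machine=stampede \
      --procs=1024 --num-threads=16 \
      --walltime=2:00:00
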
(replying to an old thread)
I see the same: running with 16 threads seems to give roughly twice the performance of 8. Is it possible that there is some issue with CPU/memory socket affinity? There is a tacc_affinity script which the examples say to use, but we don't use it in simfactory; I tried it, and it didn't seem to help. What did help was activating the hwloc thorn: this allowed me to use 8 threads per process, and actually gave slightly better performance than 16, though this might depend on the precise grid structure.
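Activating it amounts to adding the thorn to the configuration's thornlist and enabling it in the parfile; a minimal sketch, assuming the thorn's default parameters and that its thread-binding behaviour is what makes the difference:

  # Activate the hwloc thorn so that threads get bound to cores/sockets;
  # with 8 threads per process this should place one MPI process per socket
  # on Stampede's two-socket, 8-cores-per-socket nodes.
  ActiveThorns = "hwloc"
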
--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder