[Users] Possible performance issue

Vaishak P vaishak at iucaa.in
Tue Oct 8 12:29:54 CDT 2019


Dear Sir,

I have made the changes as suggested. In fact, I compiled using an Intel MPI
stack (which I had compiled locally myself) with the option list of the
Stampede2 cluster as you suggested, without OpenMP and with the appropriate
library paths. I am glad that the speed has improved: I am now getting around
25 physical units per hour instead of 1.5 for a simulation running on 128 MPI
procs. The optimization parameters I am using are the same as in the Stampede2
option list (-Ofast -march=native) rather than "-O3 -march=native". Would that
make any difference?
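
For reference, this is my understanding of the two flag sets being compared
(a sketch, not copied verbatim from the option lists):

# Stampede2 option-list setting (roughly -O3 plus fast-math style relaxations):
CXX_OPTIMISE_FLAGS=-Ofast -march=native

# Suggested setting (keeps strict IEEE floating-point behaviour):
CXX_OPTIMISE_FLAGS=-O3 -march=native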

I have not tried these changes with mpich-3.3.1 or OpenMPI yet.

Also, where can I find more information about these optimization parameters?

Thank you very much for your time. This was really helpful!


Yours Sincerely

Vaishak

On Mon, Oct 7, 2019 at 10:25 PM Haas, Roland <rhaas at illinois.edu> wrote:

> Hello Vaishak,
>
> your options do not include optimization parameters, i.e.:
>
> CXX_OPTIMISE_FLAGS=
>
> does not set any optimization options, which means that g++ will compile
> as if -O0 had been used.
>
> Please change the option list that you are using to make sure that:
>
> CXX_OPTIMISE_FLAGS=-O3
> C_OPTIMISE_FLAGS=-O3
> F90_OPTIMISE_FLAGS=-O3
> F77_OPTIMISE_FLAGS=-O3
>
> are set. If you are sure that the login node and compute nodes are the same
> architecture you should also add "-march=native" to these options.
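>
> For example, the resulting option-list lines would look roughly like this
> (a sketch, assuming the GNU compilers; exact flags may differ on your
> cluster):
>
> CXX_OPTIMISE_FLAGS=-O3 -march=native
> C_OPTIMISE_FLAGS=-O3 -march=native
> F90_OPTIMISE_FLAGS=-O3 -march=native
> F77_OPTIMISE_FLAGS=-O3 -march=native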
>
> You may also consider using the Intel compiler (icc, icpc,
> ifort) instead of gcc (gcc, g++, gfortran) which may (or may not) give
> faster execution.
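>
> If you try the Intel compilers, the corresponding option-list entries would
> look roughly like this (a sketch; the compilers are often provided as a
> module on the cluster):
>
> CC=icc
> CXX=icpc
> F77=ifort
> F90=ifort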
>
> For example the file
>
>
> https://bitbucket.org/simfactory/simfactory2/src/master/mdb/optionlists/stampede2-skx.cfg
>
> shows the settings used to compile for the SkyLake nodes of the
> Stampede2 cluster at TACC using the Intel compiler.
>
> I noticed that you are using a self-compiled MPI stack.
>
> Usually on clusters the admins will provide an MPI stack optimized for
> the cluster hardware (e.g. to make sure InfiniBand interconnects are used
> rather than Ethernet). I would expect better performance using those
> than using a self-compiled MPI stack (the Intel MPI stack will do in
> that respect, since it is always provided by the admins and never
> self-compiled). Details on how to use it depend on the cluster; you
> would have to consult the cluster websites and/or the output of the
> cluster's "mpicc -showme:compile" and "mpicc -showme:link" commands (for
> Open MPI; MVAPICH and Intel MPI have similar options) to find out which
> libraries the official MPI compiler wrapper would use.
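>
> For example (the exact option name depends on the MPI stack in use):
>
> mpicc -showme:compile   # Open MPI: flags the wrapper passes to the compiler
> mpicc -showme:link      # Open MPI: libraries and linker flags it adds
> mpicc -show             # MPICH / MVAPICH2 / Intel MPI equivalent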
>
> Yours,
> Roland
>
> > Dear Sir,
> >
> > Please find the config-info file attached herewith...
> >
> >
> > Yours,
> >
> > Vaishak
> >
> > On Mon, Oct 7, 2019 at 7:56 PM Haas, Roland <rhaas at illinois.edu> wrote:
> >
> > > Hello Vaishak,
> > >
> > > hmm, still very slow.
> > >
> > > One question that I forgot to ask before: did you make sure to build
> > > an optimized Cactus executable (setting OPTIMISE=yes, DEBUG=no to
> > > ensure that you have -O2 or -O3 optimisation settings enabled)?
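> > >
> > > That is, the option list should contain lines roughly like the following
> > > (a sketch; the exact optimisation flags depend on the compiler):
> > >
> > > DEBUG=no
> > > OPTIMISE=yes
> > > CXX_OPTIMISE_FLAGS=-O2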
> > >
> > > Ideally if you could send the file configs/sim/config-info that would
> > > tell me.
> > >
> > > Yours,
> > > Roland
> > >
> > > > Dear Sir,
> > > >
> > > > I am a little worried about the performance because this is a new
> > > > cluster we have and it is supposed to be performing well. I am
> > > > inclined to think that some libraries / compiler options / settings
> > > > might be the bottleneck.
> > > >
> > > >
> > > > I am presently running two simulations, both using the same parameter
> > > > file GW150914.rpar.
> > > >
> > > > The first one is using mpich-3.3.1, the same as in the simulation
> > > > mentioned in the previous thread. I am using one node consisting of
> > > > 2*16 cores, and 32 mpiprocs.
> > > >
> > > > The second one is using openmpi-3.1.2 with OpenMP. It uses 128 procs,
> > > > distributed among 16 mpiprocs and 8 OpenMP threads per mpiproc. Since
> > > > I have 32 PPN, it is launching 4 mpiprocs per node.
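> > > >
> > > > For reference, one way such a layout could be launched by hand with
> > > > Open MPI (a sketch only; simfactory normally generates the actual
> > > > command, and the executable and parameter-file names here are
> > > > placeholders):
> > > >
> > > > export OMP_NUM_THREADS=8
> > > > mpirun -np 16 -npernode 4 ./cactus_sim GW150914.par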
> > > >
> > > > I am herewith attaching the carpet-timing..asc file from both these
> > > > runs.
> > > >
> > > > Thanking you
> > > >
> > > > Regards,
> > > > Vaishak
> > > >
> > > >
> > > > On Fri, Oct 4, 2019 at 8:05 PM Haas, Roland <rhaas at illinois.edu> wrote:
> > > >
> > > > > Hello Vaishak,
> > > > >
> > > > > I do not see anything obviously wrong with the setup.
> > > > >
> > > > > It uses 128 MPI ranks for the 4 nodes, which fits with there being
> > > > > 2x16 cores per node.
> > > > >
> > > > > Looking at the timer tree output at iteration 1024 (search for
> > > > > "gettimeof " and you will find the spot): out of 5977s spent during
> > > > > Evolve, about 2143s were spent in "syncs", which is communication,
> > > > > and about the same amount of time in "thorns", that is, doing
> > > > > computation. While this ratio is not great (spending more time
> > > > > sending data than doing computation) it is also not unheard of.
> > > > >
> > > > > Getting the original output files for the gallery data from Zenodo
> > > > > (link is on the gallery page):
> > > > >
> > > > > wget https://zenodo.org/record/155394/files/GW150914_28.tar.xz?download=1
> > > > >
> > > > > you can see (in GW150914_28/output-0000/GW150914_28.out) that that one
> > > > > took about 137s for syncs and 198s for thorns, so the same ratio but
> > > > > about a factor of 10 faster.
> > > > >
> > > > > I am grasping at straws here, but sometimes having too many MPI ranks
> > > > > can be detrimental if there is not enough work to split up (OpenMP can
> > > > > be a bit more forgiving in that respect; the original gallery run
> > > > > used 120 cores on 10 nodes using 6 OpenMP threads per MPI rank).
> > > > >
> > > > > Since each node has lots of RAM (more than the 96GB required to run the
> > > > > simulation), can you try and see what would happen if you were to run
> > > > > on only a single node?
> > > > >
> > > > > Also if you could add the parameter:
> > > > >
> > > > > Carpet::output_timers_every = 1024
> > > > >
> > > > > then provide the files carpet-timing-statistics*.asc, that would let
> > > > > us know in even more detail where the time is spent.
> > > > >
> > > > > Running for a short time (2048 iterations) is enough to get data to
> > > > > compare.
> > > > >
> > > > > Yours,
> > > > > Roland
> > > > >
> > > > > > Dear All,
> > > > > >
> > > > > > I am running the simulation GW150914 using the parameter file
> > > > > > available at the ETK gallery (GW150914-ETK gallery
> > > > > > <https://einsteintoolkit.org/gallery/bbh/index.html>) using 128 cores.
> > > > > >
> > > > > > Each compute node consists of 2 x 16 cores of Intel Skylake (Intel(R)
> > > > > > Xeon(R) Gold 6142 CPU @ 2.60GHz) and 384 GB RAM. I have compiled and
> > > > > > am running the Einstein Toolkit without OpenMP and using mpich-3.3.1.
> > > > > >
> > > > > >
> > > > > > The issue is that the simulation seems to be running at a very slow
> > > > > > pace. The amount of physical time per hour that it is completing is
> > > > > > only about 1.3 units. At this rate, to complete 1700 units it would
> > > > > > take about 54 days, in contrast to 2.8 days on an Intel(R) Xeon(R)
> > > > > > CPU E5-2630 v3 @ 2.40GHz, as per the details of the example run of
> > > > > > GW150914 available at the gallery (GW150914-ETK gallery
> > > > > > <https://einsteintoolkit.org/gallery/bbh/index.html>).
> > > > > >
> > > > > > I have also tried using Intel MPI (impi) but with similar results.
> > > > > >
> > > > > > I am also attaching the out file from the simulation.
> > > > > >
> > > > > > Looking forward to your inputs.
> > > > > >
> > > > > >
> > > > > > Thanks and regards,
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Vaishak P
> > > > > >
> > > > > > PhD Scholar,
> > > > > > Shyama Prasad Mukherjee Fellow
> > > > > > Inter-University Center for Astronomy and Astrophysics (IUCAA)
> > > > > > Pune, India
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > My email is as private as my paper mail. I therefore support encrypting
> > > > > and signing email messages. Get my PGP key from http://pgp.mit.edu .
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > My email is as private as my paper mail. I therefore support encrypting
> > > and signing email messages. Get my PGP key from http://pgp.mit.edu .
> > >
> >
> >
>
>
>
> --
> My email is as private as my paper mail. I therefore support encrypting
> and signing email messages. Get my PGP key from http://pgp.mit.edu .
>


-- 
Regards,
Vaishak P

PhD Scholar,
Shyama Prasad Mukherjee Fellow
Inter-University Center for Astronomy and Astrophysics (IUCAA)
Pune, India