[Users] OpenMP is making it slower?
Erik Schnetter
schnetter at cct.lsu.edu
Tue May 17 12:46:31 CDT 2011
Scott,
Thanks for showing the code.
I can think of several things that could go wrong:
1. It takes some time to start up and shut down parallel threads.
Therefore, people usually parallelise the outermost loop, i.e. the k
loop in your case. Parallelising the j loop requires starting and
stopping threads nz times, which adds overhead.
2. I notice that you don't declare private variables. Variables are
shared by default, and any scalar temporaries that you assign inside the
parallel region (as opposed to arrays where each iteration only touches
its own elements) need to be declared private. In your case these are
probably x, y, z, gxx, gxy, gxz, etc. Did you compare results between
serial and parallel runs? I would expect them to differ, i.e. the
current parallel code seems to have a serious error (a race condition on
these shared temporaries). The sketch after point 4 shows how the
directive could look with this and point 1 taken into account.
3. How much computational work is done inside this loop? If most of
the time is spent on memory accesses writing to the gij and Kij arrays,
then OpenMP won't be able to help, because all threads share the same
memory bandwidth. Only if there is enough computation per grid point
will you see a benefit.
4. Since you say that you have 24 cores, I assume you have an AMD
system. In this case, your machine consists of 4 subsystems that have
6 cores each, and communication between these 4 subsystems will be
much slower than communication within a single subsystem. People
therefore usually recommend using no more than 6 OpenMP threads, and
making sure that these run within one subsystem. You can try setting the
environment variable GOMP_CPU_AFFINITY='0-5' to force your threads to
run on cores 0 to 5.
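
To make points 1 and 2 concrete, here is a minimal sketch of how the
directive could look with the k loop parallelised and the pointwise
temporaries declared private. The PRIVATE list below is only a guess
based on the variable names in your code; any temporaries that are set
inside gd.inc and kd.inc would have to be added to it as well:

c     NOTE: the PRIVATE list is a guess; extend it with whatever
c     scalars gd.inc and kd.inc define.
!$OMP PARALLEL DO
!$OMP&  SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,
!$OMP&         aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az)
!$OMP&  PRIVATE(i,j,k,x,y,z,
!$OMP&          gxx,gxy,gxz,gyy,gyz,gzz,
!$OMP&          Kxx,Kxy,Kxz,Kyy,Kyz,Kzz)
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
c              ... same loop body as in your code below ...
            enddo
         enddo
      enddo
!$OMP END PARALLEL DO

With this structure the threads are started only once for the whole
grid instead of once per k. Note that the '% done' message then has to
move out of the parallel loop (or be printed by only one thread), since
the k iterations no longer run in order.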
-erik
On Tue, May 17, 2011 at 1:00 PM, Scott Hawley <scott.hawley at belmont.edu> wrote:
> No doubt someone will ask for the code itself. The relevant part is given
> below (in old-school Fortran 77).
> Specifically, the 'problem' I'm noticing is that the "% done" messages
> appear less frequently, and advance by smaller increments per unit of
> wall-clock time, with OMP_NUM_THREADS > 1 than with OMP_NUM_THREADS = 1.
> The CPUs get used a lot more ('top' shows up to 2000% CPU usage for 24
> threads), but the wall-clock time doesn't decrease at all.
> Also note that whether I use the long OMP directive shown (with the 'shared'
> declarations, schedule, etc.) and the 'END PARALLEL DO' at the end, or
> just a bare '!$OMP PARALLEL DO' and *nothing else*, the execution
> time is *identical*.
> Thanks again!
> -Scott
>
>       chunk = 8
>       do k = 1, nz
>          write(msg,*)'[1F setbkgrnd: ' //
>      &      'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '
>          call writemessage(msg)
>
> !$OMP PARALLEL DO SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,
> !$OMP&   aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az),
> !$OMP&   SCHEDULE(STATIC,chunk) PRIVATE(j)
>          do j = 1, ny
>             do i = 1, nx
> c              if (ltrace) then
> c                 write(msg,*) '---------------'
> c                 call writemessage(msg)
> c              endif
>
>                if (mask(i,j,k) .ne. m_ex .and.
>      &             (ibonly .eq. 0 .or.
>      &              mask(i,j,k) .eq. m_ib)) then
>                   x = ax(i)
>                   y = ay(j)
>                   z = az(k)
> c                 the following two include files just perform many pointwise calc's
>                   include 'gd.inc'
>                   include 'kd.inc'
>                   agxx(i,j,k) = gxx
>                   agxy(i,j,k) = gxy
>                   agxz(i,j,k) = gxz
>                   agyy(i,j,k) = gyy
>                   agyz(i,j,k) = gyz
>                   agzz(i,j,k) = gzz
>
>                   aKxx(i,j,k) = Kxx
>                   aKxy(i,j,k) = Kxy
>                   aKxz(i,j,k) = Kxz
>                   aKyy(i,j,k) = Kyy
>                   aKyz(i,j,k) = Kyz
>                   aKzz(i,j,k) = Kzz
>
>                else if (mask(i,j,k) .eq. m_ex) then
> c                 Excised points
>                   agxx(i,j,k) = exval
>                   agxy(i,j,k) = exval
>                   agxz(i,j,k) = exval
>                   agyy(i,j,k) = exval
>                   agyz(i,j,k) = exval
>                   agzz(i,j,k) = exval
>
>                   aKxx(i,j,k) = exval
>                   aKxy(i,j,k) = exval
>                   aKxz(i,j,k) = exval
>                   aKyy(i,j,k) = exval
>                   aKyz(i,j,k) = exval
>                   aKzz(i,j,k) = exval
>                endif
>             enddo
>          enddo
> !$OMP END PARALLEL DO
>       enddo
>
--
Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/