[Users] OpenMP is making it slower?
Alexander Beck-Ratzka
alexander.beck-ratzka at aei.mpg.de
Thu May 19 02:14:59 CDT 2011
Hey Scott,
OpenMP has a huge overhead: the compiler has to create new functions for the
parallel regions, which are then started as threads, and this takes time.
During an OpenMP tutorial I made some tests with a matrix-vector
multiplication. It turned out that in such a case you only see a scalability
gain from OpenMP once the vector size exceeds about 20000 elements.
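For illustration, here is a minimal sketch of such a test (not my original
tutorial code, just the same pattern; compile with and without -fopenmp,
e.g. gfortran -fopenmp matvec.f, and vary n):

      program matvec
      implicit none
      integer n, i, j
      parameter (n = 2000)
      double precision a(n,n), x(n), y(n)
c     fill a and x with some test data
      do j = 1, n
         x(j) = 1.0d0
         do i = 1, n
            a(i,j) = 1.0d0 / dble(i + j)
         enddo
      enddo
c     starting the team of threads costs a fixed overhead; for small
c     n this overhead outweighs the parallel gain
!$OMP PARALLEL DO PRIVATE(j)
      do i = 1, n
         y(i) = 0.0d0
         do j = 1, n
            y(i) = y(i) + a(i,j) * x(j)
         enddo
      enddo
      write(*,*) 'y(1) = ', y(1)
      end

With n = 2000 as above you will typically not see a speedup; in my tests the
crossover was around a vector size of 20000.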
If you are using OpenMP in addition to MPI, and if you use simfactory to
start your runs, I have another answer.
Recently I made some comparisons between runs with OpenMP activated and runs
with MPI only, and I used simfactory to start my simulations. Simfactory did
not do what I expected. Let me explain what I mean.
If I activate OpenMP by setting OMP_NUM_THREADS to 4, and then use simfactory
with --procs=16, then simfactory finally makes a run with
-pe openmpi 16
but(!!)
numprocs=4
numthreads=4
Using only MPI you will have in such a case
numprocs=16
numthreads=1
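If you want a pure-MPI run explicitly, it is probably safer to tell
simfactory the thread count directly instead of relying on OMP_NUM_THREADS.
On my installation something like the following works (options abbreviated,
you still need your parameter file and machine settings, and please check
the exact option names against your simfactory version):

  # pure MPI: 16 processes with 1 thread each
  sim create-submit mysim_mpi --procs=16 --num-threads=1
  # hybrid: 4 MPI processes with 4 OpenMP threads each
  sim create-submit mysim_hybrid --procs=16 --num-threads=4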
So you end up comparing your program running on 16 MPI processes to it
running on 4 processes with 4 OpenMP threads each. In such a case MPI alone
is always faster, provided your code scales.
Hope that helps.
Cheers
Alexander
On Thursday, May 19, 2011 00:33:49 Scott Hawley wrote:
> Ok. Still having problems.
> I defaulted everything to private and explicitly declared my shared
> variables. Now what happens is that the outer "k" loop never gets
> incremented. Even if I run with only one thread, "k" always equals 1.
>
> So the snippet of code follows, where m_ex and m_ib are Fortran parameters
> and are hard-coded as numbers by the compiler. If I compile without
> -fopenmp it works fine, but add -fopenmp and "setenv OMP_NUM_THREADS 1"
> and it won't increment.
>
> Any new ideas? Thanks in advance.
> -Scott
>
>
> !$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(mask,ibonly,ax,ay,az,
> !$OMP& agxx,agxy,agxz,agyy,agyz,agzz,
> !$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,nx,ny,nz)
> !$OMP& SCHEDULE(STATIC,chunk)
> do k = 1, nz
> write(msg,*)' setbkgrnd: ' //
> & 'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '
> call writemessage(msg)
>
> do j = 1, ny
> do i = 1, nx
>
> if (mask(i,j,k) .ne. m_ex .and.
> & (ibonly .eq. 0 .or.
> & mask(i,j,k) .eq. m_ib)) then
>
>
> x = ax(i)
> y = ay(j)
> z = az(k)
> include 'gd.inc'
> include 'kd.inc'
>
> agxx(i,j,k) = gxx
> agxy(i,j,k) = gxy
> agxz(i,j,k) = gxz
> agyy(i,j,k) = gyy
> agyz(i,j,k) = gyz
> agzz(i,j,k) = gzz
>
> aKxx(i,j,k) = Kxx
> aKxy(i,j,k) = Kxy
> aKxz(i,j,k) = Kxz
> aKyy(i,j,k) = Kyy
> aKyz(i,j,k) = Kyz
> aKzz(i,j,k) = Kzz
> else if (mask(i,j,k) .eq. m_ex) then
> c Excised points
> agxx(i,j,k) = exval
> agxy(i,j,k) = exval
> agxz(i,j,k) = exval
> agyy(i,j,k) = exval
> agyz(i,j,k) = exval
> agzz(i,j,k) = exval
>
> aKxx(i,j,k) = exval
> aKxy(i,j,k) = exval
> aKxz(i,j,k) = exval
> aKyy(i,j,k) = exval
> aKyz(i,j,k) = exval
> aKzz(i,j,k) = exval
> endif
>
> enddo
> enddo
> enddo
>
>
> There is no explicit OMP END PARALLEL DO statement because it's optional.
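One way to narrow this down: check whether a self-contained toy loop with
the same directive structure shows the same behaviour. A minimal sketch
(hypothetical names, not your actual code):

      program looptest
      implicit none
      integer nx, ny, nz
      parameter (nx = 8, ny = 8, nz = 8)
      integer i, j, k
      double precision a(nx,ny,nz)
c     same pattern as the loop above: default(private) plus an
c     explicit shared list; k, j, i are private, the array is shared
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(a)
      do k = 1, nz
         write(*,*) 'k = ', k
         do j = 1, ny
            do i = 1, nx
               a(i,j,k) = dble(i + j + k)
            enddo
         enddo
      enddo
      write(*,*) 'a(1,1,nz) = ', a(1,1,nz)
      end

If k runs from 1 to nz here but not in your code, the difference must come
from something the include files or writemessage bring in, for example a
SAVEd temporary or a variable that shadows one of the loop bounds.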
>
> --
> Scott H. Hawley, Ph.D. Asst. Prof. of Physics
> Chemistry & Physics Dept Office: Hitch 100D
> Belmont University Tel: +1-615-460-6206
> Nashville, TN 37212 USA Fax: +1-615-460-5458
> PGP Key at http://sks-keyservers.net
>
> On May 18, 2011, at 4:37 PM, Scott Hawley wrote:
> > Erik, Frank, Peter: Thanks guys. I will pursue your suggestions.
> >
> >
> > On May 18, 2011, at 12:14 AM, Peter Diener wrote:
> >> Hi Scott,
> >>
> >> On Tue, 17 May 2011, Frank Loeffler wrote:
> >>> Hi,
> >>>
> >>> On Tue, May 17, 2011 at 03:02:00PM -0700, Scott Hawley wrote:
> >>>> Do these all need to be declared as private?
> >>>
> >>> If the temporary variables are declared only inside the loop they are
> >>> automatically thread-local. Oh wait, that is Fortran. Well - in that
> >>> case you should either declare them all private, or (maybe easier) put
> >>> the include files into separate subroutines, declare the temporary
> >>> variables only there, and call the subroutines from within the loop, in
> >>> which case they also don't have to be specified for OpenMP (as long as
> >>> they are not static).
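To illustrate Frank's suggestion with a sketch (subroutine name and body
invented for the example), what used to be include 'gd.inc' becomes a call,
and the temporaries live inside the subroutine:

      subroutine compute_gd(x, y, z, gxx, gxy, gxz, gyy, gyz, gzz)
      implicit none
      double precision x, y, z
      double precision gxx, gxy, gxz, gyy, gyz, gzz
c     temporaries declared here are local to the call and therefore
c     thread-private, as long as they are not SAVEd (and the code is
c     not compiled with -fno-automatic or similar)
      double precision r2
      r2 = x*x + y*y + z*z
c     toy values standing in for the real gd.inc computation
      gxx = 1.0d0 + r2
      gxy = 0.0d0
      gxz = 0.0d0
      gyy = 1.0d0 + r2
      gyz = 0.0d0
      gzz = 1.0d0 + r2
      end

In the loop, include 'gd.inc' is then replaced by
call compute_gd(x, y, z, gxx, gxy, gxz, gyy, gyz, gzz), and none of the
subroutine's temporaries need to appear in the OpenMP clauses.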
> >>>
> >>>> I certainly don't want the various processors overwriting each
> >>>> others' work, which might be what they're doing -- maybe they're
> >>>> even generating NaNs, which would slow things down a bit!
> >>
> >> Alternatively you may use the DEFAULT(PRIVATE) clause, so that you only
> >> have to specify the shared variables. However, in that case you have to
> >> make sure to really declare all the shared variables as shared, since
> >> otherwise every thread will allocate its own private copy, and if these
> >> are 3d variables this will slow down the code and increase memory
> >> consumption. Also, private variables have undefined values on entry to
> >> the parallel region, so not declaring all shared variables properly can
> >> also adversely affect the results. So be careful.
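To add to Peter's warning, a safeguard worth mentioning (my suggestion, not
his): DEFAULT(NONE) forces every variable used in the construct to be listed
explicitly, so a forgotten declaration becomes a compile-time error rather
than a silent bug. A minimal sketch:

      program defnone
      implicit none
      integer n, i
      parameter (n = 100)
      double precision a(n), tmp
c     with default(none) every variable referenced in the loop must
c     appear in a shared or private clause; omitting one is an error
!$OMP PARALLEL DO DEFAULT(NONE) SHARED(a) PRIVATE(i, tmp)
      do i = 1, n
         tmp = 2.0d0 * dble(i)
         a(i) = tmp
      enddo
      write(*,*) 'a(n) = ', a(n)
      end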
> >>
> >>> You should see that in the results though. It might make sense to first
> >>> make sure that the results with different numbers of threads are the
> >>> same (depending on the problem you might actually get bit-by-bit
> >>> identical results), and work on optimization later. I agree that your
> >>> slow-down actually points towards some bug.
> >>>
> >>> Frank
> >>
> >> Cheers,
> >>
> >> Peter
> >
--
+++++++++++++++++++++++++++++++++++++++++++++++++
Dr. Alexander Beck-Ratzka - team leader eScience group
MPI for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam
Tel.: 0049 -(0)331 - 567-7192
Email: alexander.beck-ratzka at aei.mpg.de
+++++++++++++++++++++++++++++++++++++++++++++++++