[Users] OpenMP is making it slower?
Alexander Beck-Ratzka
alexander.beck-ratzka at aei.mpg.de
Thu May 19 02:14:59 CDT 2011
Hey Scott,
OpenMP has a huge overhead: the compiler has to create new functions for the
parallel regions, which are then started as threads, and this takes time.
During an OpenMP tutorial I made some tests with a matrix-vector
multiplication. It turned out that in such a case you only see a scalability
gain from OpenMP once the vector size exceeds about 20000 elements.
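For illustration, here is a minimal sketch of such a test (not my original
tutorial code, just the same pattern; compile with and without -fopenmp,
e.g. gfortran -fopenmp matvec.f, and vary n):

      program matvec
      implicit none
      integer n, i, j
      parameter (n = 2000)
      double precision a(n,n), x(n), y(n)
c     fill a and x with some test data
      do j = 1, n
         x(j) = 1.0d0
         do i = 1, n
            a(i,j) = 1.0d0 / dble(i + j)
         enddo
      enddo
c     starting the team of threads costs a fixed overhead; for small
c     n this overhead outweighs the parallel gain
!$OMP PARALLEL DO PRIVATE(j)
      do i = 1, n
         y(i) = 0.0d0
         do j = 1, n
            y(i) = y(i) + a(i,j) * x(j)
         enddo
      enddo
      write(*,*) 'y(1) = ', y(1)
      end

With n = 2000 as above you will typically not see a speedup; in my tests the
crossover was around a vector size of 20000.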
If you are using OpenMP in addition to MPI, and if you use simfactory to
start your runs, I have another answer.
Recently I made some comparisons between runs with OpenMP activated and runs
with MPI only, and I used simfactory to start my simulations. Simfactory did
not do what I expected. Let me explain what I mean.
If I activate OpenMP by setting OMP_NUM_THREADS to 4, and then use simfactory
with --procs=16, then simfactory finally makes a run with
-pe openmpi 16
but(!!)
numprocs=4
numthreads=4
Using only MPI you will have in such a case
numprocs=16
numthreads=1
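If you want a pure-MPI run explicitly, it is probably safer to tell
simfactory the thread count directly instead of relying on OMP_NUM_THREADS.
On my installation something like the following works (options abbreviated,
you still need your parameter file and machine settings, and please check
the exact option names against your simfactory version):

  # pure MPI: 16 processes with 1 thread each
  sim create-submit mysim_mpi --procs=16 --num-threads=1
  # hybrid: 4 MPI processes with 4 OpenMP threads each
  sim create-submit mysim_hybrid --procs=16 --num-threads=4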
So you end up comparing your program running on 16 MPI processes to it
running on 4 processes with 4 OpenMP threads each. In such a case MPI alone
is always faster, provided your code scales.
Hope that helps.
Cheers
Alexander
On Thursday, May 19, 2011 00:33:49 Scott Hawley wrote:
> Ok. Still having problems.
> I defaulted everything to private and explicitly declared my shared
> variables. Now what happens is that the outer "k" loop never gets
> incremented. Even if I run with only one thread, "k" always equals 1.
>
> So the snippet of code follows, where m_ex and m_ib are Fortran parameters
> and are hard-coded as numbers by the compiler. If I compile without
> -fopenmp it works fine, but add -fopenmp and "setenv OMP_NUM_THREADS 1"
> and it won't increment.
>
> Any new ideas? Thanks in advance.
> -Scott
>
>
> !$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(mask,ibonly,ax,ay,az,
> !$OMP& agxx,agxy,agxz,agyy,agyz,agzz,
> !$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,nx,ny,nz)
> !$OMP& SCHEDULE(STATIC,chunk)
> do k = 1, nz
> write(msg,*)' setbkgrnd: ' //
> & 'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '
> call writemessage(msg)
>
> do j = 1, ny
> do i = 1, nx
>
> if (mask(i,j,k) .ne. m_ex .and.
> & (ibonly .eq. 0 .or.
> & mask(i,j,k) .eq. m_ib)) then
>
>
> x = ax(i)
> y = ay(j)
> z = az(k)
> include 'gd.inc'
> include 'kd.inc'
>
> agxx(i,j,k) = gxx
> agxy(i,j,k) = gxy
> agxz(i,j,k) = gxz
> agyy(i,j,k) = gyy
> agyz(i,j,k) = gyz
> agzz(i,j,k) = gzz
>
> aKxx(i,j,k) = Kxx
> aKxy(i,j,k) = Kxy
> aKxz(i,j,k) = Kxz
> aKyy(i,j,k) = Kyy
> aKyz(i,j,k) = Kyz
> aKzz(i,j,k) = Kzz
> else if (mask(i,j,k) .eq. m_ex) then
> c Excised points
> agxx(i,j,k) = exval
> agxy(i,j,k) = exval
> agxz(i,j,k) = exval
> agyy(i,j,k) = exval
> agyz(i,j,k) = exval
> agzz(i,j,k) = exval
>
> aKxx(i,j,k) = exval
> aKxy(i,j,k) = exval
> aKxz(i,j,k) = exval
> aKyy(i,j,k) = exval
> aKyz(i,j,k) = exval
> aKzz(i,j,k) = exval
> endif
>
> enddo
> enddo
> enddo
>
>
> There is no explicit OMP END PARALLEL DO statement because it's optional.
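One way to narrow this down: check whether a self-contained toy loop with
the same directive structure shows the same behaviour. A minimal sketch
(hypothetical names, not your actual code):

      program looptest
      implicit none
      integer nx, ny, nz
      parameter (nx = 8, ny = 8, nz = 8)
      integer i, j, k
      double precision a(nx,ny,nz)
c     same pattern as the loop above: default(private) plus an
c     explicit shared list; k, j, i are private, the array is shared
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(a)
      do k = 1, nz
         write(*,*) 'k = ', k
         do j = 1, ny
            do i = 1, nx
               a(i,j,k) = dble(i + j + k)
            enddo
         enddo
      enddo
      write(*,*) 'a(1,1,nz) = ', a(1,1,nz)
      end

If k runs from 1 to nz here but not in your code, the difference must come
from something the include files or writemessage bring in, for example a
SAVEd temporary or a variable that shadows one of the loop bounds.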
>
> --
> Scott H. Hawley, Ph.D. Asst. Prof. of Physics
> Chemistry & Physics Dept Office: Hitch 100D
> Belmont University Tel: +1-615-460-6206
> Nashville, TN 37212 USA Fax: +1-615-460-5458
> PGP Key at http://sks-keyservers.net
>
> On May 18, 2011, at 4:37 PM, Scott Hawley wrote:
> > Erik, Frank, Peter: Thanks guys. I will pursue your suggestions.
> >
> >
> > On May 18, 2011, at 12:14 AM, Peter Diener wrote:
> >> Hi Scott,
> >>
> >> On Tue, 17 May 2011, Frank Loeffler wrote:
> >>> Hi,
> >>>
> >>> On Tue, May 17, 2011 at 03:02:00PM -0700, Scott Hawley wrote:
> >>>> Do these all need to be declared as private?
> >>>
> >>> If the temporary variables are declared only inside the loop they are
> >>> automatically thread-local. Oh wait, that is Fortran. Well - in that
> >>> case you should either declare them all private, or (maybe easier) put
> >>> the include files into separate subroutines, declare the temporary
> >>> variables only there, and call the subroutines from within the loop, in
> >>> which case they also don't have to be specified for OpenMP (as long as
> >>> they are not static).
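To illustrate Frank's suggestion with a sketch (subroutine name and body
invented for the example), what used to be include 'gd.inc' becomes a call,
and the temporaries live inside the subroutine:

      subroutine compute_gd(x, y, z, gxx, gxy, gxz, gyy, gyz, gzz)
      implicit none
      double precision x, y, z
      double precision gxx, gxy, gxz, gyy, gyz, gzz
c     temporaries declared here are local to the call and therefore
c     thread-private, as long as they are not SAVEd (and the code is
c     not compiled with -fno-automatic or similar)
      double precision r2
      r2 = x*x + y*y + z*z
c     toy values standing in for the real gd.inc computation
      gxx = 1.0d0 + r2
      gxy = 0.0d0
      gxz = 0.0d0
      gyy = 1.0d0 + r2
      gyz = 0.0d0
      gzz = 1.0d0 + r2
      end

In the loop, include 'gd.inc' is then replaced by
call compute_gd(x, y, z, gxx, gxy, gxz, gyy, gyz, gzz), and none of the
subroutine's temporaries need to appear in the OpenMP clauses.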
> >>>
> >>>> I certainly don't want the various processors overwriting each
> >>>> others' work, which might be what they're doing -- maybe they're
> >>>> even generating NaNs, which would slow things down a bit!
> >>
> >> Alternatively you may use the DEFAULT(PRIVATE) clause, so that you only
> >> have to specify the shared variables. However, in that case you have to
> >> make sure to really declare all the shared variables as shared, since
> >> otherwise every thread will allocate its own private copy, and if these
> >> are 3d variables this will slow down the code and increase memory
> >> consumption. Also, private variables have undefined values on entry to
> >> the parallel region, so not declaring all shared variables properly can
> >> also adversely affect the results. So be careful.
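To add to Peter's warning, a safeguard worth mentioning (my suggestion, not
his): DEFAULT(NONE) forces every variable used in the construct to be listed
explicitly, so a forgotten declaration becomes a compile-time error rather
than a silent bug. A minimal sketch:

      program defnone
      implicit none
      integer n, i
      parameter (n = 100)
      double precision a(n), tmp
c     with default(none) every variable referenced in the loop must
c     appear in a shared or private clause; omitting one is an error
!$OMP PARALLEL DO DEFAULT(NONE) SHARED(a) PRIVATE(i, tmp)
      do i = 1, n
         tmp = 2.0d0 * dble(i)
         a(i) = tmp
      enddo
      write(*,*) 'a(n) = ', a(n)
      end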
> >>
> >>> You should see that in the results though. It might make sense to first
> >>> make sure that the results with different numbers of threads are the
> >>> same (depending on the problem you might actually get bit-by-bit
> >>> identical results), and work on optimization later. I agree that your
> >>> slow-down actually points towards some bug.
> >>>
> >>> Frank
> >>
> >> Cheers,
> >>
> >> Peter
> >
--
+++++++++++++++++++++++++++++++++++++++++++++++++
Dr. Alexander Beck-Ratzka - team leader eScience group
MPI for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam
Tel.: 0049 -(0)331 - 567-7192
Email: alexander.beck-ratzka at aei.mpg.de
+++++++++++++++++++++++++++++++++++++++++++++++++