[Users] OpenMP is making it slower?
Scott Hawley
scott.hawley at belmont.edu
Thu May 19 11:13:22 CDT 2011
Thanks Alexander. The overhead associated with re-allocating all those temporary variables is the reason why I used include files instead of defining separate subroutines.
My code works with MPI, but I was looking for a way to eke out a little more parallelism. The MPI domain-decomposition routine needs to be changed before I can use more than 8 nodes, and I thought OpenMP would be the way to do it.
And I believe I still can, *elsewhere* in the code; the particular routine I chose was perhaps just not a good one. I'm removing the OpenMP directives from this routine and letting it be parallelized via MPI only.
Thanks!
-Scott
--
Scott H. Hawley, Ph.D. Asst. Prof. of Physics
Chemistry & Physics Dept Office: Hitch 100D
Belmont University Tel: +1-615-460-6206
Nashville, TN 37212 USA Fax: +1-615-460-5458
PGP Key at http://sks-keyservers.net
On May 19, 2011, at 2:14 AM, Alexander Beck-Ratzka wrote:
> Hey Scott,
>
> OpenMP has a huge overhead. It has to create new functions, which are then
> started as threads, and this takes some time. During an OpenMP tutorial I
> ran some tests with a matrix-vector multiplication. It turned out that in
> such a case you only see a scalability gain from OpenMP once the vector
> size exceeds 20000 elements.
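>
> As a rough illustration (not the actual tutorial code; the names here are
> made up), such an OpenMP matrix-vector multiplication looks like this:
>
>       subroutine matvec(a, x, y, n)
>       implicit none
>       integer n, i, j
>       double precision a(n,n), x(n), y(n), s
> c     i, j and s are per-thread; a, x and n are shared read-only, and
> c     each thread writes a disjoint range of the shared y.
> !$OMP PARALLEL DO PRIVATE(j,s) SHARED(a,x,y,n)
>       do i = 1, n
>          s = 0.0d0
>          do j = 1, n
>             s = s + a(i,j)*x(j)
>          enddo
>          y(i) = s
>       enddo
> !$OMP END PARALLEL DO
>       return
>       end
>
> The cost of creating and scheduling the threads is roughly fixed per
> parallel region, so for small n it can easily outweigh the work done
> inside the loop.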
>
> If you are using OpenMP in addition to MPI, and if you also use simfactory
> to start your runs, I have another answer.
>
> I recently compared runs with OpenMP enabled against MPI-only runs, using
> simfactory to start my simulations. Simfactory did not do what I expected.
> Let me explain what I mean.
>
> If I activate OpenMP by setting OMP_NUM_THREADS to 4 and then use simfactory
> with --procs=16, simfactory ends up submitting a run with
>
> -pe openmpi 16
>
> but(!!)
>
> numprocs=4
> numthreads=4
>
> Using MPI only, in such a case you will instead have
>
> numprocs=16
> numthreads=1
>
> So compare your program running on 16 MPI processes (MPI only) with 4
> processes plus 4 OpenMP threads per process. In such a case MPI-only is
> always faster, provided your code scales.
>
> Hope that helps.
>
> Cheers
>
> Alexander
>
> On Thursday, May 19, 2011 00:33:49 Scott Hawley wrote:
>> Ok. Still having problems.
>> I defaulted everything to private and explicitly declared my shared
>> variables. Now what happens is that the outer "k" loop never gets
>> incremented. Even if I run with only one thread, "k" always equals 1.
>>
>> The snippet of code follows, where m_ex and m_ib are Fortran parameters,
>> hard-coded as numbers by the compiler. If I compile without -fopenmp it
>> works fine, but add -fopenmp and "setenv OMP_NUM_THREADS 1" and it won't
>> increment.
>>
>> Any new ideas? Thanks in advance.
>> -Scott
>>
>>
>> !$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(mask,ibonly,ax,ay,az,
>> !$OMP& agxx,agxy,agxz,agyy,agyz,agzz,
>> !$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,nx,ny,nz)
>> !$OMP& SCHEDULE(STATIC,chunk)
>> do k = 1, nz
>> write(msg,*)' setbkgrnd: ' //
>> & 'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '
>> call writemessage(msg)
>>
>> do j = 1, ny
>> do i = 1, nx
>>
>> if (mask(i,j,k) .ne. m_ex .and.
>> & (ibonly .eq. 0 .or.
>> & mask(i,j,k) .eq. m_ib)) then
>>
>>
>> x = ax(i)
>> y = ay(j)
>> z = az(k)
>> include 'gd.inc'
>> include 'kd.inc'
>>
>> agxx(i,j,k) = gxx
>> agxy(i,j,k) = gxy
>> agxz(i,j,k) = gxz
>> agyy(i,j,k) = gyy
>> agyz(i,j,k) = gyz
>> agzz(i,j,k) = gzz
>>
>> aKxx(i,j,k) = Kxx
>> aKxy(i,j,k) = Kxy
>> aKxz(i,j,k) = Kxz
>> aKyy(i,j,k) = Kyy
>> aKyz(i,j,k) = Kyz
>> aKzz(i,j,k) = Kzz
>> else if (mask(i,j,k) .eq. m_ex) then
>> c Excised points
>> agxx(i,j,k) = exval
>> agxy(i,j,k) = exval
>> agxz(i,j,k) = exval
>> agyy(i,j,k) = exval
>> agyz(i,j,k) = exval
>> agzz(i,j,k) = exval
>>
>> aKxx(i,j,k) = exval
>> aKxy(i,j,k) = exval
>> aKxz(i,j,k) = exval
>> aKyy(i,j,k) = exval
>> aKyz(i,j,k) = exval
>> aKzz(i,j,k) = exval
>> endif
>>
>> enddo
>> enddo
>> enddo
>>
>>
>> There is no explicit !$OMP END PARALLEL DO directive because it's optional.
>>
>>
>>
>>
>>
>> --
>> Scott H. Hawley, Ph.D. Asst. Prof. of Physics
>> Chemistry & Physics Dept Office: Hitch 100D
>> Belmont University Tel: +1-615-460-6206
>> Nashville, TN 37212 USA Fax: +1-615-460-5458
>> PGP Key at http://sks-keyservers.net
>>
>> On May 18, 2011, at 4:37 PM, Scott Hawley wrote:
>>> Erik, Frank, Peter: Thanks guys. I will pursue your suggestions.
>>>
>>>
>>> --
>>> Scott H. Hawley, Ph.D. Asst. Prof. of Physics
>>> Chemistry & Physics Dept Office: Hitch 100D
>>> Belmont University Tel: +1-615-460-6206
>>> Nashville, TN 37212 USA Fax: +1-615-460-5458
>>> PGP Key at http://sks-keyservers.net
>>>
>>> On May 18, 2011, at 12:14 AM, Peter Diener wrote:
>>>> Hi Scott,
>>>>
>>>> On Tue, 17 May 2011, Frank Loeffler wrote:
>>>>> Hi,
>>>>>
>>>>> On Tue, May 17, 2011 at 03:02:00PM -0700, Scott Hawley wrote:
>>>>>> Do these all need to be declared as private?
>>>>>
>>>>> If the temporary variables were declared only inside the loop they
>>>>> would automatically be thread-local. Oh wait, that is Fortran. Well -
>>>>> in that case you should either declare them all private, or (maybe
>>>>> easier) put the include files into separate subroutines, declare the
>>>>> temporary variables only there, and call those subroutines from within
>>>>> the loop; then they also don't have to be specified for OpenMP (as long
>>>>> as they are not static).
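>>>>>
>>>>> A minimal sketch of that pattern (hypothetical names; the contents of
>>>>> gd.inc would become the body of the subroutine, and flat-space values
>>>>> are used here only as placeholders):
>>>>>
>>>>>       subroutine compute_gd(x, y, z, gxx, gxy, gxz, gyy, gyz, gzz)
>>>>>       implicit none
>>>>>       double precision x, y, z
>>>>>       double precision gxx, gxy, gxz, gyy, gyz, gzz
>>>>> c     r2 stands in for the temporaries from gd.inc; locals declared
>>>>> c     here live on each thread's stack and are therefore thread-local
>>>>> c     without any OpenMP clauses (as long as they are not SAVEd or
>>>>> c     in COMMON).
>>>>>       double precision r2
>>>>>       r2  = x*x + y*y + z*z
>>>>>       gxx = 1.0d0
>>>>>       gxy = 0.0d0
>>>>>       gxz = 0.0d0
>>>>>       gyy = 1.0d0
>>>>>       gyz = 0.0d0
>>>>>       gzz = 1.0d0
>>>>>       return
>>>>>       end
>>>>>
>>>>> Inside the k/j/i loop the "include 'gd.inc'" line would then become
>>>>> "call compute_gd(x, y, z, gxx, gxy, gxz, gyy, gyz, gzz)".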
>>>>>
>>>>>> I certainly don't want the various processors overwriting each others'
>>>>>> work, which might be what they're doing -- maybe they're even
>>>>>> generating NaNs, which would slow things down a bit!
>>>>
>>>> Alternatively you may use the DEFAULT(PRIVATE) clause, so that you only
>>>> have to specify the shared variables. However, in that case you have to
>>>> make sure you really declare all the shared variables as shared, since
>>>> otherwise every thread will allocate its own copy of them, and if they
>>>> are 3D variables this will slow down the code and increase memory
>>>> consumption. Also, private variables have undefined values on entry to
>>>> the parallel region, so not declaring all shared variables properly can
>>>> also adversely affect the result. So be careful.
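>>>>
>>>> A tiny made-up example of the pitfall (hypothetical names):
>>>>
>>>>       program privdemo
>>>>       implicit none
>>>>       integer nx, i
>>>>       double precision a(10), b(10)
>>>>       nx = 10
>>>>       do i = 1, nx
>>>>          b(i) = 1.0d0
>>>>       enddo
>>>> c     Deliberate bug: a and b are missing from the SHARED list, so with
>>>> c     DEFAULT(PRIVATE) every thread works on its own undefined private
>>>> c     copies and the "real" a is never written. The fix is
>>>> c     SHARED(nx,a,b).
>>>> !$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(nx)
>>>>       do i = 1, nx
>>>>          a(i) = 2.0d0*b(i)
>>>>       enddo
>>>> !$OMP END PARALLEL DO
>>>>       print *, 'a(1) = ', a(1)
>>>>       end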
>>>>
>>>>> You should see that in the results though. It might make sense to first
>>>>> make sure that the results with different numbers of threads are the
>>>>> same (depending on the problem you might actually get bit-by-bit
>>>>> identical results), and work on optimization later. I agree that your
>>>>> slow-down actually points towards some bug.
>>>>>
>>>>> Frank
>>>>
>>>> Cheers,
>>>>
>>>> Peter
>>>
>
> --
> +++++++++++++++++++++++++++++++++++++++++++++++++
> Dr. Alexander Beck-Ratzka - team leader eScience group
>
> MPI for Gravitational Physics (Albert Einstein Institute)
> Am Mühlenberg 1
> D-14476 Potsdam
>
> Tel.: 0049 -(0)331 - 567-7192
> Email: alexander.beck-ratzka at aei.mpg.de
> +++++++++++++++++++++++++++++++++++++++++++++++++
> _______________________________________________
> Users mailing list
> Users at einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/users
>