[Users] OpenMP is making it slower?
Scott Hawley
scott.hawley at belmont.edu
Thu May 19 11:13:22 CDT 2011
Thanks Alexander. The overhead associated with re-allocating all those temporary variables is the reason why I used include files instead of defining separate subroutines.
My code works with MPI, but I was looking for a way to eke out a little more parallelism. The MPI domain-decomposition routine needs to be changed before I can use more than 8 nodes, and I thought OpenMP would be the way to do it.
And I believe I still can, *elsewhere* in the code; the particular routine I chose was perhaps just not a good one. I'm removing the OpenMP directives from this routine and letting it be parallelized via MPI only.
Thanks!
-Scott
--
Scott H. Hawley, Ph.D. Asst. Prof. of Physics
Chemistry & Physics Dept Office: Hitch 100D
Belmont University Tel: +1-615-460-6206
Nashville, TN 37212 USA Fax: +1-615-460-5458
PGP Key at http://sks-keyservers.net
On May 19, 2011, at 2:14 AM, Alexander Beck-Ratzka wrote:
> Hey Scott,
>
> OpenMP has a huge overhead. It has to create new functions, which are then
> started as threads, and this takes some time. During an OpenMP tutorial I
> ran some tests with a matrix-vector multiplication. It turned out that in
> such a case you only see a scalability gain from OpenMP once the vector
> size exceeds 20000 elements.
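>
> As a rough illustration (not the actual tutorial code; the names here are
> made up), such an OpenMP matrix-vector multiplication looks like this:
>
>       subroutine matvec(a, x, y, n)
>       implicit none
>       integer n, i, j
>       double precision a(n,n), x(n), y(n), s
> c     i, j and s are per-thread; a, x and n are shared read-only, and
> c     each thread writes a disjoint range of the shared y.
> !$OMP PARALLEL DO PRIVATE(j,s) SHARED(a,x,y,n)
>       do i = 1, n
>          s = 0.0d0
>          do j = 1, n
>             s = s + a(i,j)*x(j)
>          enddo
>          y(i) = s
>       enddo
> !$OMP END PARALLEL DO
>       return
>       end
>
> The cost of creating and scheduling the threads is roughly fixed per
> parallel region, so for small n it can easily outweigh the work done
> inside the loop.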
>
> If you are using OpenMP in addition to MPI, and if you also use simfactory
> to start your runs, I have another answer.
>
> I recently compared runs with OpenMP enabled against MPI-only runs, using
> simfactory to start my simulations. Simfactory did not do what I expected.
> Let me explain what I mean.
>
> If I activate OpenMP by setting OMP_NUM_THREADS to 4 and then use simfactory
> with --procs=16, simfactory ends up submitting a run with
>
> -pe openmpi 16
>
> but(!!)
>
> numprocs=4
> numthreads=4
>
> Using MPI only, in such a case you will instead have
>
> numprocs=16
> numthreads=1
>
> So compare your program running on 16 MPI processes (MPI only) with 4
> processes plus 4 OpenMP threads per process. In such a case MPI-only is
> always faster, provided your code scales.
>
> Hope that helps.
>
> Cheers
>
> Alexander
>
> On Thursday, May 19, 2011 00:33:49 Scott Hawley wrote:
>> Ok. Still having problems.
>> I defaulted everything to private and explicitly declared my shared
>> variables. Now what happens is that the outer "k" loop never gets
>> incremented. Even if I run with only one thread, "k" always equals 1.
>>
>> The snippet of code follows, where m_ex and m_ib are Fortran parameters,
>> hard-coded as numbers by the compiler. If I compile without -fopenmp it
>> works fine, but add -fopenmp and "setenv OMP_NUM_THREADS 1" and it won't
>> increment.
>>
>> Any new ideas? Thanks in advance.
>> -Scott
>>
>>
>> !$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(mask,ibonly,ax,ay,az,
>> !$OMP& agxx,agxy,agxz,agyy,agyz,agzz,
>> !$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,nx,ny,nz)
>> !$OMP& SCHEDULE(STATIC,chunk)
>> do k = 1, nz
>> write(msg,*)' setbkgrnd: ' //
>> & 'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '
>> call writemessage(msg)
>>
>> do j = 1, ny
>> do i = 1, nx
>>
>> if (mask(i,j,k) .ne. m_ex .and.
>> & (ibonly .eq. 0 .or.
>> & mask(i,j,k) .eq. m_ib)) then
>>
>>
>> x = ax(i)
>> y = ay(j)
>> z = az(k)
>> include 'gd.inc'
>> include 'kd.inc'
>>
>> agxx(i,j,k) = gxx
>> agxy(i,j,k) = gxy
>> agxz(i,j,k) = gxz
>> agyy(i,j,k) = gyy
>> agyz(i,j,k) = gyz
>> agzz(i,j,k) = gzz
>>
>> aKxx(i,j,k) = Kxx
>> aKxy(i,j,k) = Kxy
>> aKxz(i,j,k) = Kxz
>> aKyy(i,j,k) = Kyy
>> aKyz(i,j,k) = Kyz
>> aKzz(i,j,k) = Kzz
>> else if (mask(i,j,k) .eq. m_ex) then
>> c Excised points
>> agxx(i,j,k) = exval
>> agxy(i,j,k) = exval
>> agxz(i,j,k) = exval
>> agyy(i,j,k) = exval
>> agyz(i,j,k) = exval
>> agzz(i,j,k) = exval
>>
>> aKxx(i,j,k) = exval
>> aKxy(i,j,k) = exval
>> aKxz(i,j,k) = exval
>> aKyy(i,j,k) = exval
>> aKyz(i,j,k) = exval
>> aKzz(i,j,k) = exval
>> endif
>>
>> enddo
>> enddo
>> enddo
>>
>>
>> There is no explicit !$OMP END PARALLEL DO directive because it's optional.
>>
>>
>>
>>
>>
>> --
>> Scott H. Hawley, Ph.D. Asst. Prof. of Physics
>> Chemistry & Physics Dept Office: Hitch 100D
>> Belmont University Tel: +1-615-460-6206
>> Nashville, TN 37212 USA Fax: +1-615-460-5458
>> PGP Key at http://sks-keyservers.net
>>
>> On May 18, 2011, at 4:37 PM, Scott Hawley wrote:
>>> Erik, Frank, Peter: Thanks guys. I will pursue your suggestions.
>>>
>>>
>>> --
>>> Scott H. Hawley, Ph.D. Asst. Prof. of Physics
>>> Chemistry & Physics Dept Office: Hitch 100D
>>> Belmont University Tel: +1-615-460-6206
>>> Nashville, TN 37212 USA Fax: +1-615-460-5458
>>> PGP Key at http://sks-keyservers.net
>>>
>>> On May 18, 2011, at 12:14 AM, Peter Diener wrote:
>>>> Hi Scott,
>>>>
>>>> On Tue, 17 May 2011, Frank Loeffler wrote:
>>>>> Hi,
>>>>>
>>>>> On Tue, May 17, 2011 at 03:02:00PM -0700, Scott Hawley wrote:
>>>>>> Do these all need to be declared as private?
>>>>>
>>>>> If the temporary variables were declared only inside the loop they
>>>>> would automatically be thread-local. Oh wait, that is Fortran. Well -
>>>>> in that case you should either declare them all private, or (maybe
>>>>> easier) put the include files into separate subroutines, declare the
>>>>> temporary variables only there, and call those subroutines from within
>>>>> the loop; then they also don't have to be specified for OpenMP (as long
>>>>> as they are not static).
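>>>>>
>>>>> A minimal sketch of that pattern (hypothetical names; the contents of
>>>>> gd.inc would become the body of the subroutine, and flat-space values
>>>>> are used here only as placeholders):
>>>>>
>>>>>       subroutine compute_gd(x, y, z, gxx, gxy, gxz, gyy, gyz, gzz)
>>>>>       implicit none
>>>>>       double precision x, y, z
>>>>>       double precision gxx, gxy, gxz, gyy, gyz, gzz
>>>>> c     r2 stands in for the temporaries from gd.inc; locals declared
>>>>> c     here live on each thread's stack and are therefore thread-local
>>>>> c     without any OpenMP clauses (as long as they are not SAVEd or
>>>>> c     in COMMON).
>>>>>       double precision r2
>>>>>       r2  = x*x + y*y + z*z
>>>>>       gxx = 1.0d0
>>>>>       gxy = 0.0d0
>>>>>       gxz = 0.0d0
>>>>>       gyy = 1.0d0
>>>>>       gyz = 0.0d0
>>>>>       gzz = 1.0d0
>>>>>       return
>>>>>       end
>>>>>
>>>>> Inside the k/j/i loop the "include 'gd.inc'" line would then become
>>>>> "call compute_gd(x, y, z, gxx, gxy, gxz, gyy, gyz, gzz)".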
>>>>>
>>>>>> I certainly don't want the various processors overwriting each others'
>>>>>> work, which might be what they're doing -- maybe they're even
>>>>>> generating NaNs, which would slow things down a bit!
>>>>
>>>> Alternatively you may use the DEFAULT(PRIVATE) clause, so that you only
>>>> have to specify the shared variables. However, in that case you have to
>>>> make sure you really declare all the shared variables as shared, since
>>>> otherwise every thread will allocate its own copy of them, and if they
>>>> are 3D variables this will slow down the code and increase memory
>>>> consumption. Also, private variables have undefined values on entry to
>>>> the parallel region, so not declaring all shared variables properly can
>>>> also adversely affect the result. So be careful.
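>>>>
>>>> A tiny made-up example of the pitfall (hypothetical names):
>>>>
>>>>       program privdemo
>>>>       implicit none
>>>>       integer nx, i
>>>>       double precision a(10), b(10)
>>>>       nx = 10
>>>>       do i = 1, nx
>>>>          b(i) = 1.0d0
>>>>       enddo
>>>> c     Deliberate bug: a and b are missing from the SHARED list, so with
>>>> c     DEFAULT(PRIVATE) every thread works on its own undefined private
>>>> c     copies and the "real" a is never written. The fix is
>>>> c     SHARED(nx,a,b).
>>>> !$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(nx)
>>>>       do i = 1, nx
>>>>          a(i) = 2.0d0*b(i)
>>>>       enddo
>>>> !$OMP END PARALLEL DO
>>>>       print *, 'a(1) = ', a(1)
>>>>       end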
>>>>
>>>>> You should see that in the results though. It might make sense to first
>>>>> make sure that the results with different numbers of threads are the
>>>>> same (depending on the problem you might actually get bit-by-bit
>>>>> identical results), and work on optimization later. I agree that your
>>>>> slow-down actually points towards some bug.
>>>>>
>>>>> Frank
>>>>
>>>> Cheers,
>>>>
>>>> Peter
>>>
>
> --
> +++++++++++++++++++++++++++++++++++++++++++++++++
> Dr. Alexander Beck-Ratzka - team leader eScience group
>
> MPI for Gravitational Physics (Albert Einstein Institute)
> Am Mühlenberg 1
> D-14476 Potsdam
>
> Tel.: 0049 -(0)331 - 567-7192
> Email: alexander.beck-ratzka at aei.mpg.de
> +++++++++++++++++++++++++++++++++++++++++++++++++
> _______________________________________________
> Users mailing list
> Users at einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/users
>