[Users] OpenMP is making it slower?
scott.hawley at belmont.edu
Tue May 17 17:02:00 CDT 2011
Thanks for your ideas. #3 may be the most significant.
1. I switched the OMP directives to the outer loop, with the main result being, of course, that the "% done" line skips around, but NO change in execution speed.
2. I also increased the number of private variables as shown below. Again no change in speed. And by this I mean:
1 thread - the routine takes 11.3 seconds
2 threads - the routine takes 47.7 seconds
4 threads - the routine takes 40.1 seconds
These results use the code at the beginning of the loops, which now reads...
!$OMP PARALLEL DO SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,
!$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az),
!$OMP& SCHEDULE(STATIC,chunk) PRIVATE(k,j,i,gxx,gxy,gxz,gyy,gyz,gzz,
!$OMP& x, y, z)
do k = 1, nz
write(msg,*)'[1F setbkgrnd: ' //
& 'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '
do j = 1, ny
do i = 1, nx
3. There is a LOT of computational work done 'per point', at each value of i,j,k; there are long formulas in the include files ('gd.inc' and 'kd.inc').
Wait...these include files contain *tons* of temporary variables that maybe should be private -- they were generated by Maple, variables like 't1' through 't250'.
Do these all need to be declared as private? (See the sketch just below this list.)
I certainly don't want the various processors overwriting each other's work, which might be what they're doing -- maybe they're even generating NaNs, which would slow things down a bit!
4. Yeah, even 2 threads vs. one thread is a significant slowdown, as noted above.
5. Earlier I misspoke: "It works fine on my mac" means OpenMP works fine on my Mac, for a *different* program I wrote. This program *also* works fine on my Linux box.
So it's *very likely* that my current issue is "user error", and not a misconfigured OpenMP. lol.
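One way to avoid listing all 250 Maple temporaries by hand is OpenMP's DEFAULT(PRIVATE) clause, which makes everything private unless it is explicitly listed as shared. A rough, untested sketch of the outer-loop directive written that way (this assumes nx, ny, nz, chunk, ibonly, m_ex, m_ib and exval are ordinary variables -- PARAMETER constants may not appear in a SHARED list and would simply be omitted -- and that any global inputs read inside gd.inc/kd.inc are added to SHARED as well):

!$OMP PARALLEL DO DEFAULT(PRIVATE)
!$OMP& SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,
!$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az,
!$OMP& nx,ny,nz,chunk,ibonly,m_ex,m_ib,exval)
!$OMP& SCHEDULE(STATIC,chunk)
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
c              x, y, z, gxx..gzz, Kxx..Kzz and the t1..t250
c              temporaries from gd.inc/kd.inc are all private here
c              (same if/else body as in the code quoted below)
            enddo
         enddo
      enddo
!$OMP END PARALLEL DO

DEFAULT(NONE) would be stricter -- the compiler then complains about every unlisted variable -- but would mean typing out all 250 temporaries. Another option that sidesteps the scoping question entirely is to move the pointwise calculation (the two include files plus the stores into the arrays) into a subroutine called from the loop body: local variables of a routine invoked inside a parallel region are automatically private to each thread, as long as they are not SAVEd.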
Scott H. Hawley, Ph.D. Asst. Prof. of Physics
Chemistry & Physics Dept Office: Hitch 100D
Belmont University Tel: +1-615-460-6206
Nashville, TN 37212 USA Fax: +1-615-460-5458
PGP Key at http://sks-keyservers.net
On May 17, 2011, at 12:46 PM, Erik Schnetter wrote:
> Thanks for showing the code.
> I can think of several things that could go wrong:
> 1. It takes some time to start up and shut down parallel threads.
> Therefore, people usually parallelise the outermost loop, i.e. the k
> loop in your case. Parallelising the j loop requires starting and
> stopping threads nz times, which adds overhead.
> 2. I notice that you don't declare private variables. Variables are
> shared by default, and any local variables that you use inside the
> parallel region (and which are not arrays where you only access one
> element) need to be declared as private. In your case, these are
> probably x, y, z, gxx, gxy, gxz, etc. Did you compare results between
> serial and parallel runs? I would expect the results to differ, i.e.
> the current parallel code seems to have a serious error.
> 3. How much computational work is done inside this loop? If most of
> the time is spent in memory access writing to the gij and Kij arrays,
> then OpenMP won't be able to help. Only if there is sufficient
> computation going on will you see a benefit.
> 4. Since you say that you have 24 cores, I assume you have an AMD
> system. In this case, your machine consists of 4 subsystems that have
> 6 cores each, and communication between these 4 subsystems will be
> much slower than within each of these subsystems. People usually
> recommend using no more than 6 OpenMP threads, and ensuring that
> these run within one of these subsystems. You can try setting the
> environment variable GOMP_CPU_AFFINITY='0-5' to force your threads to
> run on cores 0 to 5.
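A minimal, untested sketch of capping the thread count from inside the code, using the standard OpenMP runtime routines omp_set_num_threads and omp_get_max_threads (the pinning itself still has to come from the environment, e.g. GOMP_CPU_AFFINITY='0-5' exported before the run):

      subroutine capthreads
c     Cap OpenMP at 6 threads so they all fit within one 6-core
c     subsystem; call this once, before the first parallel region.
      integer omp_get_max_threads
      external omp_get_max_threads
      call omp_set_num_threads(6)
      write(*,*) 'OpenMP will use at most ',
     &           omp_get_max_threads(), ' threads'
      return
      end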
> On Tue, May 17, 2011 at 1:00 PM, Scott Hawley <scott.hawley at belmont.edu> wrote:
>> No doubt someone will ask for the code itself. The relevant part is given
>> below. ( In old-school Fortran77)
>> Specifically, the 'problem' I'm noticing is that the "% done" messages
>> appear with less frequency and with lesser increment per wall clock time
>> with OMP_NUM_THREADS > 1 than for OMP_NUM_THREADS = 1. The cpus get
>> used a lot more --- 'top' shows up to 2000% CPU usage for 24 threads --- but
>> the wallclock time doesn't decrease at all.
>> Also note that whether I use the long OMP directive shown (with the 'shared'
>> declarations and schedule, etc) and the 'END PARALLEL DO' at the end, or if
>> I just use a simple '!$OMP PARALLEL DO' and *nothing else*, the execution
>> time is *identical*.
>> Thanks again!
>> chunk = 8
>> do k = 1, nz
>> write(msg,*)'[1F setbkgrnd: ' //
>> & 'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '
>> call writemessage(msg)
>> !$OMP PARALLEL DO SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,
>> !$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az),
>> !$OMP& SCHEDULE(STATIC,chunk) PRIVATE(j)
>> do j = 1, ny
>> do i = 1, nx
>> c if (ltrace) then
>> c write(msg,*) '---------------'
>> c call writemessage(msg)
>> c endif
>> if (mask(i,j,k) .ne. m_ex .and.
>> & (ibonly .eq. 0 .or.
>> & mask(i,j,k) .eq. m_ib)) then
>> x = ax(i)
>> y = ay(j)
>> z = az(k)
>> c the following two include files just perform many pointwise calc's
>> include 'gd.inc'
>> include 'kd.inc'
>> agxx(i,j,k) = gxx
>> agxy(i,j,k) = gxy
>> agxz(i,j,k) = gxz
>> agyy(i,j,k) = gyy
>> agyz(i,j,k) = gyz
>> agzz(i,j,k) = gzz
>> aKxx(i,j,k) = Kxx
>> aKxy(i,j,k) = Kxy
>> aKxz(i,j,k) = Kxz
>> aKyy(i,j,k) = Kyy
>> aKyz(i,j,k) = Kyz
>> aKzz(i,j,k) = Kzz
>> else if (mask(i,j,k) .eq. m_ex) then
>> c Excised points
>> agxx(i,j,k) = exval
>> agxy(i,j,k) = exval
>> agxz(i,j,k) = exval
>> agyy(i,j,k) = exval
>> agyz(i,j,k) = exval
>> agzz(i,j,k) = exval
>> aKxx(i,j,k) = exval
>> aKxy(i,j,k) = exval
>> aKxz(i,j,k) = exval
>> aKyy(i,j,k) = exval
>> aKyz(i,j,k) = exval
>> aKzz(i,j,k) = exval
>> endif
>> enddo
>> enddo
>> !$OMP END PARALLEL DO
>> enddo
> Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/