[Users] OpenMP is making it slower?
scott.hawley at belmont.edu
Tue May 17 17:02:00 CDT 2011
Thanks for your ideas. #3 may be the most significant.
1. I switched the OMP directives to the outer loop, with the main result being, of course, that the "% done" line skips around, but NO change in execution speed.
2. I also increased the number of private variables as shown below. Again no change in speed. And by this I mean:
1 thread - the routine takes 11.3 seconds
2 threads - the routine takes 47.7 seconds
4 threads - the routine takes 40.1 seconds
These results use the code at the beginning of the loops, which now reads...
!$OMP PARALLEL DO SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,
!$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az),
!$OMP& SCHEDULE(STATIC,chunk) PRIVATE(k,j,i,gxx,gxy,gxz,gyy,gyz,gzz,
!$OMP& x, y, z)
do k = 1, nz
write(msg,*)'[1F setbkgrnd: ' //
& 'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '
do j = 1, ny
do i = 1, nx
3. There is a LOT of computational work done 'per point', at each value of i,j,k; there are long formulas in the include files ('gd.inc' and 'kd.inc').
Wait...these include files contain *tons* of temporary variables that maybe should be private -- they were generated by Maple, variables like 't1' through 't250'.
Do these all need to be declared as private? (See the sketch just below this list.)
I certainly don't want the various processors overwriting each other's work, which might be what they're doing -- maybe they're even generating NaNs, which would slow things down a bit!
4. Yeah, even 2 threads vs. one thread is a significant slowdown, as noted above.
5. Earlier I misspoke: "It works fine on my mac" means OpenMP works fine on my Mac, for a *different* program I wrote. This program *also* works fine on my Linux box.
So it's *very likely* that my current issue is "user error", and not a misconfigured OpenMP. lol.
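One way to avoid listing all 250 Maple temporaries by hand is OpenMP's DEFAULT(PRIVATE) clause, which makes everything private unless it is explicitly listed as shared. A rough, untested sketch of the outer-loop directive written that way (this assumes nx, ny, nz, chunk, ibonly, m_ex, m_ib and exval are ordinary variables -- PARAMETER constants may not appear in a SHARED list and would simply be omitted -- and that any global inputs read inside gd.inc/kd.inc are added to SHARED as well):

!$OMP PARALLEL DO DEFAULT(PRIVATE)
!$OMP& SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,
!$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az,
!$OMP& nx,ny,nz,chunk,ibonly,m_ex,m_ib,exval)
!$OMP& SCHEDULE(STATIC,chunk)
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
c              x, y, z, gxx..gzz, Kxx..Kzz and the t1..t250
c              temporaries from gd.inc/kd.inc are all private here
c              (same if/else body as in the code quoted below)
            enddo
         enddo
      enddo
!$OMP END PARALLEL DO

DEFAULT(NONE) would be stricter -- the compiler then complains about every unlisted variable -- but would mean typing out all 250 temporaries. Another option that sidesteps the scoping question entirely is to move the pointwise calculation (the two include files plus the stores into the arrays) into a subroutine called from the loop body: local variables of a routine invoked inside a parallel region are automatically private to each thread, as long as they are not SAVEd.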
Scott H. Hawley, Ph.D. Asst. Prof. of Physics
Chemistry & Physics Dept Office: Hitch 100D
Belmont University Tel: +1-615-460-6206
Nashville, TN 37212 USA Fax: +1-615-460-5458
PGP Key at http://sks-keyservers.net
On May 17, 2011, at 12:46 PM, Erik Schnetter wrote:
> Thanks for showing the code.
> I can think of several things that could go wrong:
> 1. It takes some time to start up and shut down parallel threads.
> Therefore, people usually parallelise the outermost loop, i.e. the k
> loop in your case. Parallelising the j loop requires starting and
> stopping threads nz times, which adds overhead.
> 2. I notice that you don't declare private variables. Variables are
> shared by default, and any local variables that you use inside the
> parallel region (and which are not arrays where you only access one
> element) need to be declared as private. In your case, these are
> probably x, y, z, gxx, gxy, gxz, etc. Did you compare results between
> serial and parallel runs? I would expect the results to differ, i.e.
> the current parallel code seems to have a serious error.
> 3. How much computational work is done inside this loop? If most of
> the time is spent in memory access writing to the gij and Kij arrays,
> then OpenMP won't be able to help. Only if there is sufficient
> computation going on will you see a benefit.
> 4. Since you say that you have 24 cores, I assume you have an AMD
> system. In this case, your machine consists of 4 subsystems that have
> 6 cores each, and communication between these 4 subsystems will be
> much slower than within each of these subsystems. People usually
> recommend using no more than 6 OpenMP threads, and ensuring that
> these run within one of these subsystems. You can try setting the
> environment variable GOMP_CPU_AFFINITY='0-5' to force your threads to
> run on cores 0 to 5.
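A minimal, untested sketch of capping the thread count from inside the code, using the standard OpenMP runtime routines omp_set_num_threads and omp_get_max_threads (the pinning itself still has to come from the environment, e.g. GOMP_CPU_AFFINITY='0-5' exported before the run):

      subroutine capthreads
c     Cap OpenMP at 6 threads so they all fit within one 6-core
c     subsystem; call this once, before the first parallel region.
      integer omp_get_max_threads
      external omp_get_max_threads
      call omp_set_num_threads(6)
      write(*,*) 'OpenMP will use at most ',
     &           omp_get_max_threads(), ' threads'
      return
      end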
> On Tue, May 17, 2011 at 1:00 PM, Scott Hawley <scott.hawley at belmont.edu> wrote:
>> No doubt someone will ask for the code itself. The relevant part is given
>> below. ( In old-school Fortran77)
>> Specifically, the 'problem' I'm noticing is that the "% done" messages
>> appear with less frequency and with lesser increment per wall clock time
>> with OMP_NUM_THREADS > 1 than for OMP_NUM_THREADS = 1. The cpus get
>> used a lot more --- 'top' shows up to 2000% CPU usage for 24 threads --- but
>> the wallclock time doesn't decrease at all.
>> Also note that whether I use the long OMP directive shown (with the 'shared'
>> declarations and schedule, etc) and the 'END PARALLEL DO' at the end, or if
>> I just use a simple '!$OMP PARALLEL DO' and *nothing else*, the execution
>> time is *identical*.
>> Thanks again!
>> chunk = 8
>> do k = 1, nz
>> write(msg,*)'[1F setbkgrnd: ' //
>> & 'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '
>> call writemessage(msg)
>> !$OMP PARALLEL DO SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,
>> !$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az),
>> !$OMP& SCHEDULE(STATIC,chunk) PRIVATE(j)
>> do j = 1, ny
>> do i = 1, nx
>> c if (ltrace) then
>> c write(msg,*) '---------------'
>> c call writemessage(msg)
>> c endif
>> if (mask(i,j,k) .ne. m_ex .and.
>> & (ibonly .eq. 0 .or.
>> & mask(i,j,k) .eq. m_ib)) then
>> x = ax(i)
>> y = ay(j)
>> z = az(k)
>> c the following two include files just perform many pointwise calc's
>> include 'gd.inc'
>> include 'kd.inc'
>> agxx(i,j,k) = gxx
>> agxy(i,j,k) = gxy
>> agxz(i,j,k) = gxz
>> agyy(i,j,k) = gyy
>> agyz(i,j,k) = gyz
>> agzz(i,j,k) = gzz
>> aKxx(i,j,k) = Kxx
>> aKxy(i,j,k) = Kxy
>> aKxz(i,j,k) = Kxz
>> aKyy(i,j,k) = Kyy
>> aKyz(i,j,k) = Kyz
>> aKzz(i,j,k) = Kzz
>> else if (mask(i,j,k) .eq. m_ex) then
>> c Excised points
>> agxx(i,j,k) = exval
>> agxy(i,j,k) = exval
>> agxz(i,j,k) = exval
>> agyy(i,j,k) = exval
>> agyz(i,j,k) = exval
>> agzz(i,j,k) = exval
>> aKxx(i,j,k) = exval
>> aKxy(i,j,k) = exval
>> aKxz(i,j,k) = exval
>> aKyy(i,j,k) = exval
>> aKyz(i,j,k) = exval
>> aKzz(i,j,k) = exval
>> endif
>> enddo
>> enddo
>> !$OMP END PARALLEL DO
>> enddo
> Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/