<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Erik,<div> Thanks for your ideas. #3 may be the most significant.</div><div><br></div><div>1. I moved the OMP directives to the outer (k) loop. The main result, of course, is that the "% done" line now skips around, but there is NO change in execution speed.</div><div><br></div><div>2. I also added more variables to the PRIVATE list, as shown below. Again, no change in speed. Specifically:</div><div>1 thread: the routine takes 11.3 seconds</div><div>2 threads: the routine takes 47.7 seconds</div><div>4 threads: the routine takes 40.1 seconds</div><div><br></div><div>These results use the following code at the beginning of the loops:</div><div><div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; color: rgb(0, 132, 38); ">!$OMP PARALLEL DO SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; color: rgb(0, 132, 38); ">!$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az), </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; color: rgb(0, 132, 38); ">!$OMP& SCHEDULE(STATIC,chunk) PRIVATE(k,j,i,gxx,gxy,gxz,gyy,gyz,gzz,</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; color: rgb(0, 132, 38); ">!$OMP& x, y, z) </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; "> <span style="color: rgb(194, 42, 156); ">do</span> k = <span style="color: rgb(47, 47, 207); ">1</span>, nz </div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 
11px/normal Menlo; color: rgb(217, 40, 35); "><span style="color: rgb(0, 0, 0); "> </span><span style="color: rgb(194, 42, 156); ">write</span><span style="color: rgb(0, 0, 0); ">(msg,*)</span>'[1F setbkgrnd: '<span style="color: rgb(0, 0, 0); "> //</span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; "> & <span style="color: rgb(217, 40, 35); ">'nz = '</span>,nz,<span style="color: rgb(217, 40, 35); ">', % done = '</span>,int(k*<span style="color: rgb(47, 47, 207); ">1.0</span>d<span style="color: rgb(47, 47, 207); ">2</span>/nz),<span style="color: rgb(217, 40, 35); ">' '</span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; "> <span style="color: rgb(194, 42, 156); ">call</span> writemessage(msg)</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; min-height: 13px; "> <br class="webkit-block-placeholder"></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; "> <span style="color: rgb(194, 42, 156); ">do</span> j = <span style="color: rgb(47, 47, 207); ">1</span>, ny</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; "> <span style="color: rgb(194, 42, 156); ">do</span> i = <span style="color: rgb(47, 47, 207); ">1</span>, nx</div></div><div>...</div><div><br></div><div>3. 
There is a lot of computational work done per point, at each value of i,j,k; there are long formulas in the include files:</div><div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; "> <span style="color: rgb(194, 42, 156); ">include</span> <span style="color: rgb(217, 40, 35); ">'gd.inc'</span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Menlo; "> <span style="color: rgb(194, 42, 156); ">include</span> <span style="color: rgb(217, 40, 35); ">'kd.inc'</span></div></div><div><font class="Apple-style-span" color="#D92823"><span class="Apple-style-span" style="color: rgb(0, 0, 0); ">Wait... these include files contain *tons* of temporary variables that perhaps should be private; they were generated by Maple, variables like 't1' through 't250'. </span></font></div><div><font class="Apple-style-span" color="#D92823"><span class="Apple-style-span" style="color: rgb(0, 0, 0); ">Do these all need to be declared as private? </span></font></div><div><font class="Apple-style-span" color="#D92823"><span class="Apple-style-span" style="color: rgb(0, 0, 0); ">I certainly don't want the various processors overwriting each other's work, which might be what they're doing. Maybe they're even generating NaNs, which would slow things down a bit!</span></font></div><div><br class="webkit-block-placeholder"></div><div><br></div><div>4. Yes, even 2 threads vs. 1 thread is a significant slowdown, as noted above.</div><div><br></div><div>5. Earlier I misspoke: "It works fine on my Mac" meant that OpenMP works fine on my Mac for a *different* program I wrote. That program *also* works fine on my Linux box.</div><div>So it's *very likely* that my current issue is "user error", and not a misconfigured OpenMP. lol.</div><div><br></div><div><br></div><div>Thanks,</div><div>Scott</div><div><br></div></div><div>
<span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; ">--<br>Scott H. Hawley, Ph.D. <span class="Apple-tab-span" style="white-space: pre; ">        </span> <span class="Apple-tab-span" style="white-space: pre; ">                </span>Asst. Prof. of Physics <br>Chemistry & Physics Dept <span class="Apple-tab-span" style="white-space: pre; ">                </span>Office: Hitch 100D <br>Belmont University <span class="Apple-tab-span" style="white-space: pre; ">        </span>Tel: +1-615-460-6206<br>Nashville, TN 37212 USA <span class="Apple-tab-span" style="white-space: pre; ">        </span>Fax: +1-615-460-5458<br>PGP Key at <a href="http://sks-keyservers.net">http://sks-keyservers.net</a><br><br><br></span>
</div>
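<div><br></div>
<div>On question 3: yes, as far as OpenMP is concerned, every scalar that the Maple-generated include files assign inside the loop (t1 through t250) needs to be private. The same applies to Kxx, Kxy, Kxz, Kyy, Kyz and Kzz, which kd.inc presumably sets and which are missing from the PRIVATE list above, and to msg, which every thread writes for the "% done" line. Scratch scalars shared by all threads can silently give wrong results, and the threads also contend for the same cache lines, which may be part of why 2 threads ran roughly four times slower than 1. Fortran OpenMP does not require naming all 250 temporaries: DEFAULT(PRIVATE) makes every variable that is not listed in a SHARED clause private. Any other variable the loop only reads (the loop bounds nx, ny, nz, and for example exval and ibonly if they are variables rather than PARAMETER constants) must then appear in SHARED or FIRSTPRIVATE, otherwise each thread sees an uninitialized private copy. The minimal sketch below assumes a hypothetical routine fillgrid whose three made-up temporaries t1, t2, t3 stand in for setbkgrnd and gd.inc; it illustrates the clause and is not the actual setbkgrnd code.</div>
<pre>
```fortran
! Hypothetical sketch: fillgrid stands in for setbkgrnd, and the
! three made-up temporaries t1..t3 stand in for the Maple-generated
! t1..t250 in gd.inc. Only agxx and agxy are computed here.
      subroutine fillgrid(nx, ny, nz, ax, ay, az, agxx, agxy)
      implicit none
      integer nx, ny, nz, i, j, k
      double precision ax(nx), ay(ny), az(nz)
      double precision agxx(nx, ny, nz), agxy(nx, ny, nz)
      double precision x, y, z, t1, t2, t3, gxx, gxy
! DEFAULT(PRIVATE): every variable NOT named in SHARED (x, y, z,
! t1..t3, gxx, gxy) gets a separate, uninitialized copy per thread,
! so the temporaries never need to be listed one by one. The loop
! indices i, j, k are private in any case.
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(nx,ny,nz,ax,ay,az,agxx,agxy)
      do k = 1, nz
        do j = 1, ny
          do i = 1, nx
            x = ax(i)
            y = ay(j)
            z = az(k)
! Stand-in for the pointwise formulas in gd.inc
            t1 = x*x
            t2 = y*z
            t3 = t1 + t2
            gxx = 1.0d0 + t3
            gxy = t1*t2
! Each (i,j,k) element is written by exactly one thread
            agxx(i, j, k) = gxx
            agxy(i, j, k) = gxy
          enddo
        enddo
      enddo
!$OMP END PARALLEL DO
      end
```
</pre>
<div>DEFAULT(NONE) is the stricter alternative: the compiler then rejects any variable whose data-sharing attribute is not stated, which catches omissions but means naming all of t1 through t250. Another option is to move the contents of gd.inc and kd.inc into a separate subroutine called from the loop; its local variables are private to each thread as long as they are not SAVEd (explicitly, through initialization in a declaration or DATA statement, or by a compiler option such as gfortran's -fno-automatic). Comparing the arrays from runs with OMP_NUM_THREADS=1 and OMP_NUM_THREADS=4, as Erik suggested in point 2, is a quick check that nothing was left shared by mistake.</div>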
<br><div><div>On May 17, 2011, at 12:46 PM, Erik Schnetter wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Scott<br><br>Thanks for showing the code.<br><br>I can think of several things that could go wrong:<br><br>1. It takes some time to start up and shut down parallel threads.<br>Therefore, people usually parallelise the outermost loop, i.e. the k<br>loop in your case. Parallelising the j loop requires starting and<br>stopping threads nz times, which adds overhead.<br><br>2. I notice that you don't declare private variables. Variables are<br>shared by default, and any local variables that you use inside the<br>parallel region (and which are not arrays where you only access one<br>element) need to be declared as private. In your case, these are<br>probably x, y, z, gxx, gxy, gxz, etc. Did you compare results between<br>serial and parallel runs? I would expect the results to differ, i.e.<br>the current parallel code seems to have a serious error.<br><br>3. How much computational work is done inside this loop? If most of<br>the time is spent in memory access writing to the gij and Kij arrays,<br>then OpenMP won't be able to help. Only if there is sufficient<br>computation going on will you see a benefit.<br><br>4. Since you say that you have 24 cores, I assume you have an AMD<br>system. In this case, your machine consists of 4 subsystems that have<br>6 cores each, and communication between these 4 subsystems will be<br>much slower than within each of these subsystems. People usually<br>recommend using no more than 6 OpenMP threads, and ensuring that<br>these run within one of these subsystems. You can try setting the<br>environment variable GOMP_CPU_AFFINITY='0-5' to force your threads to<br>run on cores 0 to 5.<br><br>-erik<br><br>On Tue, May 17, 2011 at 1:00 PM, Scott Hawley <<a href="mailto:scott.hawley@belmont.edu">scott.hawley@belmont.edu</a>> wrote:<br><blockquote type="cite">No doubt someone will ask for the code itself. The relevant part is given
The relevant part is given<br></blockquote><blockquote type="cite">below. ( In old-school Fortran77)<br></blockquote><blockquote type="cite">Specifically, the 'problem' I'm noticing is that the "% done" messages<br></blockquote><blockquote type="cite">appear with less frequency and with lesser increment per wall clock time<br></blockquote><blockquote type="cite">with OMP_NUM_THREADS > 1 than for OMP_NUM_THREADS = 1. The cpus get<br></blockquote><blockquote type="cite">used alot more --- 'top' shows up to 2000% cpu usage for 24 threads --- but<br></blockquote><blockquote type="cite">the wallclock time doesn't decrease at all.<br></blockquote><blockquote type="cite">Also note that whether I use the long OMP directive shown (with the 'shared'<br></blockquote><blockquote type="cite">declarations and schedule, etc) and the 'END PARALLEL DO' at the end, or if<br></blockquote><blockquote type="cite">I just use a simple '!$OMP PARALLEL DO' and *nothing else*, the execution<br></blockquote><blockquote type="cite">time is *identical*.<br></blockquote><blockquote type="cite">Thanks again!<br></blockquote><blockquote type="cite">-Scott<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"> chunk = 8<br></blockquote><blockquote type="cite"> do k = 1, nz<br></blockquote><blockquote type="cite"> write(msg,*)'[1F setbkgrnd: ' //<br></blockquote><blockquote type="cite"> & 'nz = ',nz,', % done = ',int(k*1.0d2/nz),' '<br></blockquote><blockquote type="cite"> call writemessage(msg)<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">!$OMP PARALLEL DO SHARED(mask,agxx,agxy,agxz,agyy,agyz,agzz,<br></blockquote><blockquote type="cite">!$OMP& aKxx,aKxy,aKxz,aKyy,aKyz,aKzz,ax,ay,az),<br></blockquote><blockquote type="cite">!$OMP& SCHEDULE(STATIC,chunk) PRIVATE(j)<br></blockquote><blockquote type="cite"> do j = 1, ny<br></blockquote><blockquote 
type="cite"> do i = 1, nx<br></blockquote><blockquote type="cite">c if (ltrace) then<br></blockquote><blockquote type="cite">c write(msg,*) '---------------'<br></blockquote><blockquote type="cite">c call writemessage(msg)<br></blockquote><blockquote type="cite">c endif<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"> if (mask(i,j,k) .ne. m_ex .and.<br></blockquote><blockquote type="cite"> & (ibonly .eq. 0 .or.<br></blockquote><blockquote type="cite"> & mask(i,j,k) .eq. m_ib)) then<br></blockquote><blockquote type="cite"> x = ax(i)<br></blockquote><blockquote type="cite"> y = ay(j)<br></blockquote><blockquote type="cite"> z = az(k)<br></blockquote><blockquote type="cite">c the following two include files just perform many pointwise calc's<br></blockquote><blockquote type="cite"> include 'gd.inc'<br></blockquote><blockquote type="cite"> include 'kd.inc'<br></blockquote><blockquote type="cite"> agxx(i,j,k) = gxx<br></blockquote><blockquote type="cite"> agxy(i,j,k) = gxy<br></blockquote><blockquote type="cite"> agxz(i,j,k) = gxz<br></blockquote><blockquote type="cite"> agyy(i,j,k) = gyy<br></blockquote><blockquote type="cite"> agyz(i,j,k) = gyz<br></blockquote><blockquote type="cite"> agzz(i,j,k) = gzz<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"> aKxx(i,j,k) = Kxx<br></blockquote><blockquote type="cite"> aKxy(i,j,k) = Kxy<br></blockquote><blockquote type="cite"> aKxz(i,j,k) = Kxz<br></blockquote><blockquote type="cite"> aKyy(i,j,k) = Kyy<br></blockquote><blockquote type="cite"> aKyz(i,j,k) = Kyz<br></blockquote><blockquote type="cite"> aKzz(i,j,k) = Kzz<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"> else if (mask(i,j,k) .eq. 
m_ex) then<br></blockquote><blockquote type="cite">c Excised points<br></blockquote><blockquote type="cite"> agxx(i,j,k) = exval<br></blockquote><blockquote type="cite"> agxy(i,j,k) = exval<br></blockquote><blockquote type="cite"> agxz(i,j,k) = exval<br></blockquote><blockquote type="cite"> agyy(i,j,k) = exval<br></blockquote><blockquote type="cite"> agyz(i,j,k) = exval<br></blockquote><blockquote type="cite"> agzz(i,j,k) = exval<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"> aKxx(i,j,k) = exval<br></blockquote><blockquote type="cite"> aKxy(i,j,k) = exval<br></blockquote><blockquote type="cite"> aKxz(i,j,k) = exval<br></blockquote><blockquote type="cite"> aKyy(i,j,k) = exval<br></blockquote><blockquote type="cite"> aKyz(i,j,k) = exval<br></blockquote><blockquote type="cite"> aKzz(i,j,k) = exval<br></blockquote><blockquote type="cite"> endif<br></blockquote><blockquote type="cite"> enddo<br></blockquote><blockquote type="cite"> enddo<br></blockquote><blockquote type="cite">!$OMP END PARALLEL DO<br></blockquote><blockquote type="cite"> enddo<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">_______________________________________________<br></blockquote><blockquote type="cite">Users mailing list<br></blockquote><blockquote type="cite"><a href="mailto:Users@einsteintoolkit.org">Users@einsteintoolkit.org</a><br></blockquote><blockquote type="cite"><a href="http://lists.einsteintoolkit.org/mailman/listinfo/users">http://lists.einsteintoolkit.org/mailman/listinfo/users</a><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><br><br><br>-- <br>Erik Schnetter <<a href="mailto:schnetter@cct.lsu.edu">schnetter@cct.lsu.edu</a>> <a href="http://www.cct.lsu.edu/~eschnett/">http://www.cct.lsu.edu/~eschnett/</a><br><br></div></blockquote></div><br></body></html>