[Users] ET test failures on Stampede

Erik Schnetter schnetter at cct.lsu.edu
Wed Nov 5 08:20:25 CST 2014


I don't think there's a trade-off involved. I attended one or two
presentations by mvapich developers, and the additional complexity
from handling MICs comes from correctly (and efficiently) routing data
between CPUs, MICs, and network interfaces within a node, where
multiple paths may exist, and where these paths have different
performance for different message sizes.

The slow-down may be caused by different default parameter settings in
1.9 and 2.0; maybe certain environment variables could change
performance.
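
For example, something along the following lines in the Stampede run script might be worth experimenting with. The MV2_* names are standard MVAPICH2 tuning parameters, but the values below are only guesses and would need to be checked against the MVAPICH2 2.0 user guide and benchmarked:

# Sketch only; the values are guesses, not recommendations.
export MV2_ENABLE_AFFINITY=0          # often set to 0 for hybrid MPI+OpenMP runs,
                                      # so threads are not pinned to a single core
export MV2_SMP_EAGERSIZE=65536        # intra-node eager/rendezvous threshold (bytes)
export MV2_IBA_EAGER_THRESHOLD=65536  # inter-node eager/rendezvous threshold (bytes)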

In any case, my earlier tests of 2.0 on Stampede were probably tainted
by problems with PETSc and HDF5, and we should repeat them.
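
A standalone check outside of Cactus might also be useful, so that an MPI problem can be separated from the PETSc/HDF5 issues. A quick sketch of the sort of thing I have in mind (ibrun is TACC's MPI launcher; the iteration count is arbitrary), exercising the reductions that Ian saw hanging:

cat > allreduce_test.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, size, i;
  double in, out = 0.0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  in = rank;
  /* Repeat the reduction many times; a broken transport tends to hang here. */
  for (i = 0; i < 10000; ++i)
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  if (rank == 0)
    printf("Allreduce OK on %d ranks, sum = %g\n", size, out);
  MPI_Finalize();
  return 0;
}
EOF

mpicc -O2 allreduce_test.c -o allreduce_test
ibrun ./allreduce_test   # run inside a batch job, once per MPI module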

-erik


On Wed, Nov 5, 2014 at 5:04 AM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>
> On 31 Oct 2014, at 09:52, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>
>>
>> On 30 Oct 2014, at 21:55, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>>
>>> Ian
>>>
>>> This new MPI version leads to problems running the benchmarks, and
>>> runs at half the speed. (This test was on a single node.)
>>
>> Ouch. I didn't notice that with my runs; either I wasn't paying attention or it didn't happen there.  I will check the next time I run on Stampede.
>
> In my production simulations, I see a speed drop of 20% going from the current simfactory and Stampede default of mvapich2/1.9 to mvapich2-x/2.0b as suggested by the TACC admins.  However, since the original simulations were hanging, I'm not sure which is better!
>
> Looking at the timers, prolongate is taking 868.4s with 1.9 and 1313.6s with 2.0b.  Sync is about the same speed on both, as are the computational functions.  The -x suffix on the mvapich version seems to indicate that it supports the MICs; maybe there is some trade-off being made there.
>
>>
>>>
>>> -erik
>>>
>>> On Thu, Oct 30, 2014 at 11:34 AM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>>>>
>>>> On 30 Oct 2014, at 15:03, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>>>>
>>>>> I've begun to run the automated tests for the ET on our production
>>>>> machines. Things look very good almost everywhere, except on Stampede,
>>>>> one of the machines that is most important to us. It seems that there
>>>>> are many test failures for GRHydro, and these seem to be caused by
>>>>> segfaults. Does anybody volunteer to investigate?
>>>>
>>>> I don't know anything about the problems with GRHydro.
>>>>
>>>> I was having problems a while back with the current simfactory default version of mvapich2, and TACC support suggested I try the mvapich2-x version.  The problem I saw was that the MPI reductions would hang.  I have been using that version for a few months with no problems.  The required change to the optionlist is:
>>>>
>>>>>
>>>>> < MPI_DIR  = /opt/apps/intel13/mvapich2/1.9
>>>>> ---
>>>>> > MPI_DIR  = /home1/apps/intel13/mvapich2-x/2.0b
>>>>> > MPI_LIB_DIRS = /home1/apps/intel13/mvapich2-x/2.0b/lib64
>>>>
>>>>
>>>> At least, this was what was needed before the changes to the MPI thorn; it's possible that this is no longer enough.
>>>>
>>>> Should we change to this version in simfactory for the release?
>>>>
>>>> --
>>>> Ian Hinder
>>>> http://numrel.aei.mpg.de/people/hinder
>>>>
>>>
>>>
>>>
>>> --
>>> Erik Schnetter <schnetter at cct.lsu.edu>
>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>
>> --
>> Ian Hinder
>> http://numrel.aei.mpg.de/people/hinder
>>
>> _______________________________________________
>> Users mailing list
>> Users at einsteintoolkit.org
>> http://lists.einsteintoolkit.org/mailman/listinfo/users
>
> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder
>



-- 
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/

