[Users] ET_2013_11 run performance

Ian Hinder ian.hinder at aei.mpg.de
Thu Jan 23 04:58:14 CST 2014


On 23 Jan 2014, at 02:51, Erik Schnetter <schnetter at cct.lsu.edu> wrote:

> Luca
> 
> The emails depend on the settings in the submit script, i.e. the file simfactory/mdb/submitscripts/stampede.sub. The file "stampede.ini" should not matter.

I just checked, and none of the differences between the stampede.ini files of the two versions should affect whether the job emails are sent.  The submit scripts are also identical in the two versions.

Assuming that you have two simulations, one where the emails are sent and one where they are not, could you diff the submission scripts recorded in the SIMFACTORY directories of the two simulations and post the output to the list?
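
As a rough sketch of what I mean (this is not part of simfactory): the Python below returns a unified diff of the two recorded scripts.  The function name diff_submit_scripts and the relative path SIMFACTORY/SubmitScript inside each restart directory are assumptions; adjust the path to wherever the script is actually recorded in your simulations.

```python
import difflib
from pathlib import Path


def diff_submit_scripts(sim_a, sim_b, relpath="SIMFACTORY/SubmitScript"):
    """Return a unified diff of the submit scripts recorded in two directories.

    sim_a and sim_b are the two simulation (restart) directories.
    relpath is an ASSUMED location of the recorded script inside each one.
    """
    file_a = Path(sim_a) / relpath
    file_b = Path(sim_b) / relpath
    lines_a = file_a.read_text().splitlines(keepends=True)
    lines_b = file_b.read_text().splitlines(keepends=True)
    # unified_diff yields nothing at all when the two files are identical
    return "".join(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=str(file_a), tofile=str(file_b)
        )
    )
```

Calling print(diff_submit_scripts(dir_a, dir_b), end="") with your own two directories gives output you can paste into a reply; an empty result means the recorded scripts are identical.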

If you are using the versions from the repository, the diff between the two stampede.ini files should be:

> --- a/mdb/machines/stampede.ini
> +++ b/mdb/machines/stampede.ini
> @@ -1,6 +1,6 @@
>  [stampede]
>  
> -# last-tested-on: 2013-04-29
> +# last-tested-on: 2013-11-04
>  # last-tested-by: Erik Schnetter <schnetter at gmail.com>
>  
>  # NOTE: This machine configuration uses only the regular CPUs of
> @@ -17,7 +17,7 @@ status          = experimental
>  # Access to this machine
>  hostname        = stampede.tacc.utexas.edu
>  rsynccmd        = /home1/00507/eschnett/rsync-3.0.9/bin/rsync
> -envsetup        = module unload mvapich2 && module load impi
> +envsetup        = module load intel/13.1.1.163 && module unload mvapich2 && module load impi/4.1.1.036 && module load papi
>  aliaspattern    = ^login[1234](\.stampede\.tacc\.utexas\.edu)?$
>  
>  # Source tree management
> @@ -40,6 +40,16 @@ disabled-thorns = <<EOT
>          LSUDevelopment/WaveToyNoGhostsPETSc
>          TAT/TATPETSc
>  EOT
> +enabled-thorns = <<EOT
> +#    CactusTest/TestAllTypes
> +#    ExternalLibraries/OpenCL
> +#        CactusExamples/WaveToyOpenCL
> +#        CactusUtils/Accelerator
> +#        CactusUtils/OpenCLRunTime
> +#        McLachlan/ML_BSSN_CL
> +#        McLachlan/ML_BSSN_CL_Helper
> +#        McLachlan/ML_WaveToy_CL
> +EOT
>  optionlist      = stampede.cfg
>  submitscript    = stampede.sub
>  runscript       = stampede.run
> @@ -76,15 +86,15 @@ nodes           = 6400
>  min-ppn         = 16
>  allocation      = NO_ALLOCATION
>  queue           = normal        # [normal, large, development]
> -maxwalltime     = 24:00:00      # development has 4:0:0
> +maxwalltime     = 48:00:00      # development has 4:0:0
>  maxqueueslots   = 49
> -submit          = sbatch @SCRIPTFILE@
> +submit          = sbatch @SCRIPTFILE@; sleep 60
>  getstatus       = squeue -j @JOB_ID@
>  stop            = scancel @JOB_ID@
>  submitpattern   = Submitted batch job ([0-9]+)
> -statuspattern   = ' @JOB_ID@ '
> +statuspattern   = '@JOB_ID@ '
>  queuedpattern   = ' PD '
> -runningpattern  = ' R '
> +runningpattern  = ' (CF|CG|R|TO) '
>  holdingpattern  = ' S '
>  #exechost        = head -n 1 SIMFACTORY/NODES
>  #exechostpattern = ^(\S+)
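
As far as I can see, the two pattern changes in this diff only affect how the squeue output is interpreted, not whether emails are sent.  Here is a small Python sketch of the difference.  It assumes that the patterns are applied as regular-expression searches against the getstatus output after @JOB_ID@ has been substituted; the job id and the squeue lines are made up, and the helper "matches" is purely illustrative.

```python
import re

JOB_ID = "2481234"  # hypothetical job id

# Patterns before and after the change, copied from the diff above
OLD = {"status": " @JOB_ID@ ", "running": " R "}
NEW = {"status": "@JOB_ID@ ", "running": " (CF|CG|R|TO) "}


def matches(pattern, line, job_id=JOB_ID):
    """Substitute the job id into the pattern and regex-search the line."""
    return re.search(pattern.replace("@JOB_ID@", job_id), line) is not None


# Made-up squeue lines.  The first has the job id at the very start of the
# line; the second is in the Slurm state CG (COMPLETING).
at_line_start = "2481234 normal bbh user R 1:02:03 4 c401-101"
completing = " 2481234 normal bbh user CG 24:00:00 4 c401-101"

print(matches(OLD["status"], at_line_start))  # False: no leading space
print(matches(NEW["status"], at_line_start))  # True
print(matches(OLD["running"], completing))    # False: only " R " counted
print(matches(NEW["running"], completing))    # True: CF, CG and TO also count
```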


Since you say that changing the stampede.ini file causes the emails to appear, please can you also post the diff between the two stampede.ini files that you are using?

Maybe there is some weird interaction between the queuing system and some environment settings in stampede.ini, e.g. an environment variable set by the modules.
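
One way to test this would be to save the output of "env" before and after running the envsetup commands from each stampede.ini, and then compare the two.  A minimal Python sketch, with hypothetical helper names parse_env and env_changes:

```python
def parse_env(text):
    """Parse the output of `env` into a dict.

    Only simple NAME=value lines are handled; lines without "=" (for
    example continuation lines of multi-line values) are skipped.
    """
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            env[name] = value
    return env


def env_changes(before, after):
    """Return {name: (old, new)} for every variable that differs.

    A value of None means the variable is not set on that side.
    """
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }
```

Running env_changes(parse_env(text_before), parse_env(text_after)) on the two saved dumps lists every variable that the modules added, removed or modified.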

I never receive email notifications from stampede, even though I am using the default submission script.  I assumed it was just broken.  

> 
> -erik
> 
> On Jan 22, 2014, at 20:41 , Luca Baiotti <baiotti at ile.osaka-u.ac.jp> wrote:
> 
>> On 1/20/14 10:26 PM, Ian Hinder wrote:
>>> 
>>> On 20 Jan 2014, at 14:23, Yosef Zlochower <yosef at astro.rit.edu
>>> <mailto:yosef at astro.rit.edu>> wrote:
>>> 
>>>> On 01/20/2014 08:06 AM, Ian Hinder wrote:
>>>>> On 20 Jan 2014, at 06:14, James Healy <jchsma at rit.edu
>>>>> <mailto:jchsma at rit.edu>> wrote:
>>>>> 
>>>>>> Hello all,
>>>>>> 
>>>>>> On Thursday morning, I pulled a fresh checkout of the newest version of
>>>>>> the Einstein Toolkit (ET_2013_11) to use with RIT's LazEv code. I
>>>>>> compiled it on stampede using the current stampede.cfg located in
>>>>>> simfactory/mdb/optionlists which uses Intel MPI version 4.1.0.030 and
>>>>>> the intel compilers version 13.1.1.163 (enabled through a module load).
>>>>>> I submitted a short job which I ran previously with ET_2013_05.  The
>>>>>> results come out the same.  However, the run speed as reported in
>>>>>> Carpet::physical_time_per_hour is poor. It starts off good,
>>>>>> approximately the same as with the previous build, but over time drops
>>>>>> to as low as half the speed over 24 hours of evolution. On recovery from
>>>>>> checkpoint, the speed is even worse, dropping to below 1/4 of the
>>>>>> original run speed.
>>>>>> 
>>>>>> So, I tried using the previous stampede.cfg included in the ET_2013_05
>>>>>> branch of simfactory, the same one I used to compile my ET_2013_05
>>>>>> build.  This cfgfile uses the same version of IMPI but different Intel
>>>>>> compilers (version 13.0.2.146). The run speed shows the same trends as
>>>>>> when using the newer config file.
>>>>> Hi Jim,
>>>>> 
>>>>> I'm quite confused by this problem report.  I guess that you
>>>>> mean the following:
>>>>> 
>>>>> - You get the slowdown with the current ET_2013_11 release
>>>>> - You don't get the slowdown with the ET_2013_05 release
>>>>> - You do get the slowdown if you use the current ET_2013_11 release
>>>>> with the ET_2013_05 stampede.cfg
>>>>> 
>>>>> Is that correct?
>>>>> 
>>>>> I consider Intel MPI to be unusable on Stampede, and I think it always
>>>>> has been.  I used to get random crashes, hangs and slowdowns.  I also
>>>>> experienced similar problems with Intel MPI on SuperMUC.  For any
>>>>> serious work, I have always used MVAPICH2 on Stampede.  In the
>>>>> current ET trunk Intel MPI has been replaced with MVAPICH2.  I would
>>>>> try the current trunk and see if this fixes your problems.  You can
>>>>> also use just the stampede files from the current trunk with the
>>>>> ET_2013_11 release (make sure you use the ones listed in stampede.ini).
>>>> Interesting. I haven't been able to get a run to work with MVAPICH2
>>>> because of an issue with the runs dying during checkpointing. Which
>>>> config file are you using (modules loaded, etc.)? How much RAM per
>>>> node do your production runs typically use?
>>> 
>>> I'm using exactly the default simfactory config from the current trunk,
>>> so you can see the modules etc there.  Checkpointing (and recovery) works
>>> fine.  I usually aim for something like 75% memory usage for production
>>> runs.
>> 
>> Hello, I would like to report a different problem with the simfactory 
>> settings for stampede: with ET Noether or trunk the job start/end emails 
>> are not sent (or at least they do not reach the Osaka University server; 
>> I had the systems administrators check).
>> I receive the emails if I use the simfactory of Gauss. In particular, if 
>> I copy just the stampede.ini from Gauss to Noether (and no other files) 
>> and recompile, I do receive the emails.
>> 
>> Luca
>> 
>> 
>> 
>> _______________________________________________
>> Users mailing list
>> Users at einsteintoolkit.org
>> http://lists.einsteintoolkit.org/mailman/listinfo/users
> 
> -- 
> Erik Schnetter <schnetter at cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/
> 
> My email is as private as my paper mail. I therefore support encrypting
> and signing email messages. Get my PGP key from http://pgp.mit.edu/.
> 

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
