[ET Trac] [Einstein Toolkit] #2008: NaNs when running static tov on >40 cores

Einstein Toolkit trac-noreply at einsteintoolkit.org
Wed Feb 22 08:22:26 CST 2017


#2008: NaNs when running static tov on >40 cores
------------------------------------+---------------------------------------
  Reporter:  allgwy001@…            |       Owner:  knarf     
      Type:  defect                 |      Status:  assigned  
  Priority:  unset                  |   Milestone:            
 Component:  Other                  |     Version:  ET_2016_05
Resolution:                         |    Keywords:            
------------------------------------+---------------------------------------

Comment (by hinder):

 I can confirm that this is a real problem.  I have run your parameter file
 using the current master branch of the ET on 44 processes, and I get NaNs
 at iteration 736.  This is earlier than you saw them, suggesting some sort
 of non-deterministic effect.  The command I used to run this was

     sim create-submit static_tov_mod_2 --parfile par/static_tov_mod_2.par --procs 44 --num-threads 1

 You can see the number of processes being used in the output file:

 {{{
 $ grep "Carpet is running on" simulations/static_tov_mod_2/output-0000/static_tov_mod_2.out
 INFO (Carpet): Carpet is running on 44 processes
 }}}

 When I ran on 40 processes, this didn't happen.

 This seems to be a bug in Carpet.

 If I run with Carpet::processor_topology = "recursive", which uses a
 different algorithm for splitting the domain among processes, the code
 instead aborts with an error:

 {{{
 terminate called after throwing an instance of 'std::out_of_range'
   what():  vector::_M_range_check: __n (which is 44) >= this->size() (which is 44)
 }}}
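
 For reference, this is the message libstdc++ produces when
 std::vector::at() is called with an index equal to the vector's size, i.e.
 one past the last valid element.  The backtrace below shows the check
 being reached from Carpet::SplitRegionsMaps_Recursively on a
 std::vector<int>, which would be consistent with a 44-element vector being
 indexed with 44.  The following is only a minimal standalone sketch of
 that failure mode; checked_lookup is a hypothetical helper and not Carpet
 code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not Carpet code): look up an entry with bounds
// checking and return the exception text if the index is out of range.
std::string checked_lookup(const std::vector<int>& v, std::size_t index) {
  try {
    (void)v.at(index);  // at() performs the range check; operator[] does not
    return "ok";
  } catch (const std::out_of_range& e) {
    return e.what();    // with libstdc++ this names both the index and size
  }
}
```

 With a vector of 44 elements, checked_lookup(v, 43) succeeds, while
 checked_lookup(v, 44) returns the exception text instead of "ok".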

 Backtrace is:

 {{{
 Backtrace from rank 0 pid 18049:
 1. /usr/lib64/libc.so.6(+0x35670) [0x7f6cf5284670]
 2. /usr/lib64/libc.so.6(gsignal+0x37) [0x7f6cf52845f7]
 3. /usr/lib64/libc.so.6(abort+0x148) [0x7f6cf5285ce8]
 4. __gnu_cxx::__verbose_terminate_handler()
 [/cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)
 [0x7f6cf5887d2d]]
 5. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5dd86) [0x7f6cf5885d86]
 6. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5ddd1) [0x7f6cf5885dd1]
 7. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5dfe9) [0x7f6cf5885fe9]
 8. std::__throw_out_of_range_fmt(char const*, ...)
 [/cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(_ZSt24__throw_out_of_range_fmtPKcz+0x11f)
 [0x7f6cf58dbfef]]
 9. std::vector<int, std::allocator<int> >::_M_range_check(unsigned long)
 const
 [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZNKSt6vectorIiSaIiEE14_M_range_checkEm+0x20)
 [0x1342e60]]
 a. Carpet::SplitRegionsMaps_Recursively(_cGH const*,
 std::vector<std::vector<region_t, std::allocator<region_t> >,
 std::allocator<std::vector<region_t, std::allocator<region_t> > > >&,
 std::vector<std::vector<region_t, std::allocator<region_t> >,
 std::allocator<std::vector<region_t, std::allocator<region_t> > > >&)
 [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet28SplitRegionsMaps_RecursivelyEPK4_cGHRSt6vectorIS3_I8region_tSaIS4_EESaIS6_EES9_+0x11d7)
 [0x2f48f87]]
 b. Carpet::SplitRegionsMaps(_cGH const*, std::vector<std::vector<region_t,
 std::allocator<region_t> >, std::allocator<std::vector<region_t,
 std::allocator<region_t> > > >&, std::vector<std::vector<region_t,
 std::allocator<region_t> >, std::allocator<std::vector<region_t,
 std::allocator<region_t> > > >&)
 [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet16SplitRegionsMapsEPK4_cGHRSt6vectorIS3_I8region_tSaIS4_EESaIS6_EES9_+0x11c)
 [0x2ef6e5c]]
 c.
 /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim()
 [0x2f05767]
 d.
 /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim()
 [0x2f03b1c]
 e. Carpet::SetupGH(tFleshConfig*, int, _cGH*)
 [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet7SetupGHEP12tFleshConfigiP4_cGH+0x18e6)
 [0x2eff606]]
 f.
 /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(CCTKi_SetupGHExtensions+0xc4)
 [0xd20924]
 10.
 /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(CactusDefaultSetupGH+0x30e)
 [0xd3ad2e]
 11. Carpet::Initialise(tFleshConfig*)
 [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet10InitialiseEP12tFleshConfig+0x54)
 [0x2ee2414]]
 12.
 /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(main+0x96)
 [0xcbbef6]
 13. /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6cf5270b15]
 14.
 /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim()
 [0xcbbbe9]
 }}}

 "recursive" works on 40 processes.  It's possible that there is no
 sensible way to split such a small domain among 44 processes, and that
 aborting with an error is the correct behaviour.  If that is the case,
 then this should be caught at a higher level in the code and a more
 sensible error message should be generated.
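
 As a rough illustration of what such a higher-level check could look
 like, here is a sketch.  Everything in it is an assumption: the function
 name, its arguments, and the criterion (a minimum number of grid points
 per process) are hypothetical and are not taken from Carpet, whose real
 constraints (ghost zones, buffer zones, refinement levels) are more
 involved.

```cpp
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical pre-check (illustrative names, not Carpet's API): refuse to
// split a domain when the average number of grid points per process would
// fall below min_points_per_proc.
void check_domain_splittable(const std::vector<long>& npoints, int nprocs,
                             long min_points_per_proc) {
  long total = 1;
  for (long n : npoints) total *= n;  // total number of grid points

  if (nprocs < 1 || total / nprocs < min_points_per_proc) {
    std::ostringstream msg;
    msg << "Cannot split a domain of " << total << " grid points among "
        << nprocs << " processes: each process would receive fewer than "
        << min_points_per_proc << " points. "
        << "Reduce the number of processes or enlarge the grid.";
    throw std::runtime_error(msg.str());
  }
}
```

 Called before the domain is split, a check of this kind would turn the
 std::out_of_range abort into a message that tells the user what to change.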

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/2008#comment:4>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

