[ET Trac] [Einstein Toolkit] #2008: NaNs when running static tov on >40 cores
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Wed Feb 22 08:22:26 CST 2017
#2008: NaNs when running static tov on >40 cores
------------------------------------+---------------------------------------
Reporter: allgwy001@… | Owner: knarf
Type: defect | Status: assigned
Priority: unset | Milestone:
Component: Other | Version: ET_2016_05
Resolution: | Keywords:
------------------------------------+---------------------------------------
Comment (by hinder):
I can confirm that this is a real problem. I have run your parameter file
using the current master branch of the ET on 44 processes, and I get NaNs
at iteration 736. This is earlier than you saw them, suggesting some sort
of non-deterministic effect. The command I used to run this was
sim create-submit static_tov_mod_2 --parfile par/static_tov_mod_2.par --procs 44 --num-threads 1
You can see the number of processes being used in the output file:
{{{
$ grep "Carpet is running on" simulations/static_tov_mod_2/output-0000/static_tov_mod_2.out
INFO (Carpet): Carpet is running on 44 processes
}}}
When I ran on 40 processes, the NaNs did not appear.
This seems to be a bug in Carpet.
If I instead run with Carpet::processor_topology = "recursive", which uses
a different algorithm for splitting the domain among processes, the code
instead aborts with an error:
{{{
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 44) >= this->size() (which is 44)
}}}
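The message above is what libstdc++ produces when a bounds-checked access (std::vector::at, which calls _M_range_check) is given an index equal to the vector's size. With 44 elements the valid indices are 0 to 43, so index 44 is one past the end. The following is only a minimal sketch of that mechanism; the helper name checked_access_error is made up for illustration and is not part of Carpet, and the backtrace does not show which vector inside SplitRegionsMaps_Recursively is involved.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not Carpet code): returns the what() text of the
// exception thrown by v.at(index), or an empty string if the access is
// in range.
std::string checked_access_error(const std::vector<int>& v,
                                 std::size_t index) {
  try {
    (void)v.at(index);  // at() checks bounds; operator[] does not
  } catch (const std::out_of_range& e) {
    return e.what();
  }
  return std::string();
}
```

Calling this with a vector of 44 elements returns an empty string for index 43 and a non-empty error text for index 44, which matches the "44 >= 44" pattern in the abort message.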
Backtrace is:
{{{
Backtrace from rank 0 pid 18049:
1. /usr/lib64/libc.so.6(+0x35670) [0x7f6cf5284670]
2. /usr/lib64/libc.so.6(gsignal+0x37) [0x7f6cf52845f7]
3. /usr/lib64/libc.so.6(abort+0x148) [0x7f6cf5285ce8]
4. __gnu_cxx::__verbose_terminate_handler()
[/cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)
[0x7f6cf5887d2d]]
5. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5dd86) [0x7f6cf5885d86]
6. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5ddd1) [0x7f6cf5885dd1]
7. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5dfe9) [0x7f6cf5885fe9]
8. std::__throw_out_of_range_fmt(char const*, ...)
[/cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(_ZSt24__throw_out_of_range_fmtPKcz+0x11f)
[0x7f6cf58dbfef]]
9. std::vector<int, std::allocator<int> >::_M_range_check(unsigned long)
const
[/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZNKSt6vectorIiSaIiEE14_M_range_checkEm+0x20)
[0x1342e60]]
a. Carpet::SplitRegionsMaps_Recursively(_cGH const*,
std::vector<std::vector<region_t, std::allocator<region_t> >,
std::allocator<std::vector<region_t, std::allocator<region_t> > > >&,
std::vector<std::vector<region_t, std::allocator<region_t> >,
std::allocator<std::vector<region_t, std::allocator<region_t> > > >&)
[/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet28SplitRegionsMaps_RecursivelyEPK4_cGHRSt6vectorIS3_I8region_tSaIS4_EESaIS6_EES9_+0x11d7)
[0x2f48f87]]
b. Carpet::SplitRegionsMaps(_cGH const*, std::vector<std::vector<region_t,
std::allocator<region_t> >, std::allocator<std::vector<region_t,
std::allocator<region_t> > > >&, std::vector<std::vector<region_t,
std::allocator<region_t> >, std::allocator<std::vector<region_t,
std::allocator<region_t> > > >&)
[/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet16SplitRegionsMapsEPK4_cGHRSt6vectorIS3_I8region_tSaIS4_EESaIS6_EES9_+0x11c)
[0x2ef6e5c]]
c.
/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim()
[0x2f05767]
d.
/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim()
[0x2f03b1c]
e. Carpet::SetupGH(tFleshConfig*, int, _cGH*)
[/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet7SetupGHEP12tFleshConfigiP4_cGH+0x18e6)
[0x2eff606]]
f.
/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(CCTKi_SetupGHExtensions+0xc4)
[0xd20924]
10.
/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(CactusDefaultSetupGH+0x30e)
[0xd3ad2e]
11. Carpet::Initialise(tFleshConfig*)
[/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet10InitialiseEP12tFleshConfig+0x54)
[0x2ee2414]]
12.
/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(main+0x96)
[0xcbbef6]
13. /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6cf5270b15]
14.
/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim()
[0xcbbbe9]
}}}
The "recursive" topology does work on 40 processes. It's possible that there is no
sensible way to split such a small domain among 44 processes, and that
aborting with an error is the correct behaviour. If that is the case,
then this should be caught at a higher level in the code and a more
sensible error message should be generated.
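As a rough illustration of what catching this at a higher level could look like, here is a sketch of a pre-check that rejects an impossible split with a readable message instead of letting a vector index run out of range. Everything in it is an assumption: the function name check_splittable, its parameters, and the criterion (a minimum number of grid points per process) are hypothetical and are not taken from Carpet, whose real constraints on splitting may be different.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical pre-check, not Carpet's actual code: refuse to split a
// domain into more pieces than it can hold, and say so in plain terms.
//   npoints             grid points along the split direction (assumed)
//   min_points_per_proc smallest piece the caller accepts (assumed)
//   nprocs              number of processes requested
void check_splittable(long npoints, long min_points_per_proc, int nprocs) {
  if (npoints <= 0 || min_points_per_proc <= 0 || nprocs <= 0)
    throw std::invalid_argument(
        "check_splittable: arguments must be positive");
  // Integer division: the largest number of pieces of the minimum size.
  const long max_pieces = npoints / min_points_per_proc;
  if (nprocs > max_pieces) {
    std::ostringstream msg;
    msg << "Cannot split " << npoints << " grid points among " << nprocs
        << " processes with at least " << min_points_per_proc
        << " points each; at most " << max_pieces
        << " processes can be used";
    throw std::runtime_error(msg.str());
  }
}
```

With illustrative numbers such as 400 points and a minimum of 10 points per process, a request for 40 processes passes and a request for 44 throws a std::runtime_error that names both the domain size and the process count. These numbers are made up and are not the grid sizes from the parameter file in this ticket.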
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/2008#comment:4>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit