[Users] Setting up ETK on Discoverer

Tue Sep 6 13:48:00 CDT 2022

Jay

The `-mtune` options in your option list are strange; I would leave
them out. Their meaning is very similar to that of the `-march` option,
and using `native` as the value here while you specify a particular
architecture in `-march` doesn't really make sense. However, if these
options were wrong, then the code would crash at the same time, but
with a different signal (SIGILL, illegal instruction) rather than
SIGBUS.
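
If you want to strip these options mechanically, here is a minimal
sketch. The flag line is hypothetical (I am making up `-O2` and
`-march=znver2` for illustration); substitute the real lines from your
option list.

```python
def drop_mtune(flags):
    """Remove every -mtune... token from a whitespace-separated flag string."""
    return " ".join(tok for tok in flags.split() if not tok.startswith("-mtune"))

# Hypothetical flag line; the real one is in your option list.
print(drop_mtune("-O2 -march=znver2 -mtune=native"))  # -O2 -march=znver2
```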

SIGBUS means that some memory access is wrong. This could be a bug in
the ET, or a bug in one of the system libraries you are using. It
could also be a hardware problem; if this is a new system, then one of
the nodes might be bad. You could keep the list of nodes assigned to
you, and keep track of which node reports the error. There is a Carpet
option (maybe `Carpet::verbose`?) that lists the names of all the
nodes you're using. The error message you attached shows that MPI rank
95 reported the problem. You now need to translate this to a node
name. (The files you sent do not contain enough information to do this.)
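
As a rough sketch of that translation: if the MPI launcher places ranks
on nodes in consecutive blocks (a common default, but an assumption
here, so check how your launcher maps ranks), then the node index is
just the rank divided by the number of ranks per node. The node names
below are made up; take the real ones from the node list of your job.

```python
def rank_to_node(rank, nodes, ranks_per_node):
    """Return the node hosting `rank`, assuming block placement:
    ranks 0..ranks_per_node-1 run on nodes[0], the next block on nodes[1], etc."""
    if rank < 0 or rank >= len(nodes) * ranks_per_node:
        raise ValueError(f"rank {rank} is outside the job")
    return nodes[rank // ranks_per_node]

# Hypothetical node names for a 256-process job with 128 ranks per node.
nodes = ["node-a", "node-b"]
print(rank_to_node(95, nodes, 128))   # node-a
print(rank_to_node(200, nodes, 128))  # node-b
```

If the launcher distributes ranks round-robin instead, the index would
be `rank % len(nodes)`, so it is worth confirming the mapping before
blaming a particular node.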

To debug, you can choose a configuration that crashes quickly, and
then simplify your setup: Use fewer thorns (e.g. disable I/O, wave
extraction, etc.). Use a bisection method until a much simpler setup
crashes consistently. You can also build a debug configuration (create
a new configuration `newspritzgnu-debug`, and pass `--debug` to
simfactory when building).
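
The bisection can be sketched as follows. This is only an illustration
of the idea, not an existing tool: `crashes` is a hypothetical callback
that you would have to implement yourself (e.g. write a parameter file
with only the given optional thorns active, run it, and report whether
it crashed), and the thorn names are placeholders.

```python
def bisect_thorns(optional, crashes):
    """Shrink the list of optional thorns while the run still crashes.

    `crashes(active)` must return True if a run with only the thorns in
    `active` enabled (plus everything that cannot be disabled) crashes.
    """
    active = list(optional)
    chunk = max(1, len(active) // 2)
    while chunk >= 1:
        i = 0
        while i < len(active):
            # Try the run without the chunk starting at position i.
            trial = active[:i] + active[i + chunk:]
            if crashes(trial):
                active = trial   # chunk is not needed to trigger the crash
            else:
                i += chunk       # chunk matters; keep it and move on
        chunk //= 2
    return active

# Stand-in for real runs: pretend the crash needs "thorn_c" to be active.
def fake_crashes(active):
    return "thorn_c" in active

thorns = ["thorn_a", "thorn_b", "thorn_c", "thorn_d", "thorn_e", "thorn_f"]
print(bisect_thorns(thorns, fake_crashes))  # ['thorn_c']
```

Each call to `crashes` is a full run, so this only pays off with a
configuration that crashes quickly.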

Good luck!

-erik

On Tue, Sep 6, 2022 at 1:40 PM Jay Vijay Kalinani
<jayvijay.kalinani at phd.unipd.it> wrote:
>
> Hi all,
>
> I am trying to perform simulations on the Discoverer cluster (https://docs.discoverer.bg/index.html) using the latest Einstein Toolkit (ETK) release (ET_2022_05) and the Spritz GRMHD code.
> To compile ETK on Discoverer, I am attaching the simfactory configuration files which I had newly prepared. I am also attaching the list of modules which were loaded.
>
> To submit the simulation, for instance, I use the following simfactory command:
>
> sim submit BNS_IF_fluxCT_dx018_q10_RPA_RotGas_E8e49 --parfile=./par/BNS_IF_fluxCT_dx018_q10_RPA_RotGas_E8e49.par --config=newspritzgnu --machine=discoverer --procs=256 --num-threads=1 --ppn-used=128 --walltime=24:00:00
>
>
>
> Unfortunately, my simulations crash after running for some time. My guess is that I might not be correctly setting the configuration options or flags during compilation, which might affect my simulation during runtime, but I am not completely certain.
> I am attaching the output file, the error file, the parfile as well as the generated backtrace for the simulation which used 256 procs. I also looked at the hexadecimal addresses in the backtrace with addr2line, but unfortunately all of them return "??:0"
>
> I also noticed that when changing the number of processors, the simulation crashes at different times. But if I keep the number of processors the same, the simulation always crashes at the same point.
> For instance, a simulation with 256 processors ran for about 2 hours on the cluster and crashed after completing about 4600 iterations. One submitted with 1280 processors ran for about 12 hours and crashed after completing about 32000 iterations. A simulation with 1792 processors instead crashed soon after the start (within a few minutes), even before reaching iteration 0. In all cases, I set the number of threads to 1.
>
> If you have any suggestions or insights on why the simulations crash and in case I have any incorrect settings in the configuration files, kindly let me know. I would greatly appreciate your help. If you need any further information from my side, please let me know too.
>
> Thank you very much.
> Kind regards,
> Jay Kalinani

-- 
Erik Schnetter <schnetter at gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/