[ET Trac] #2774: OpenMP Heisenbug with Default Thornfile

José Ferreira trac-noreply at einsteintoolkit.org
Wed Feb 7 03:12:08 CST 2024


 Reporter: José Ferreira
   Status: submitted
Milestone: 
  Version: ET_2023_11
     Type: bug
 Priority: critical
Component: 

Hello, I am facing an issue where a simulation built with the toolkit's default thornlist either runs as expected or crashes right at the beginning. This happens with `tov_ET.par`, which ships with the toolkit, and with a parameter file that evolves a constant scalar field, which ships with the Scalar thorn also included in the toolkit.

I believe that the culprit is OpenMP, because the simulations run as expected if I disable threading at runtime by setting `OMP_NUM_THREADS=1`, or if I disable OpenMP at compile time.

This bug occurs on both my laptop and my desktop, which share similar operating systems and software stacks.

In the following sections, I describe, step by step, what I did to reproduce the bug and my attempts to track it down.

# Installing and Compiling the Toolkit

To avoid building the CarpetX thorns, which for some reason fail to compile on my system, I start by downloading the previous version of the toolkit with

```shell
$ curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2023_05/GetComponents
$ chmod a+x GetComponents
$ ./GetComponents --parallel https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2023_05/einsteintoolkit.th
```

and then change to the Cactus root directory

```shell
$ cd Cactus
```

I create the options file `arch.cfg` in the parent directory (the configuration step below reads it from `../arch.cfg`), with the following contents

```
# Cactus configuration for Arch and Arch-based distros

## Decide which flags will be used at compile-time
OPTIMISE = yes
WARN     = yes
DEBUG    = no
PROFILE  = no
OPENMP   = yes

## Compilers
CPP = cpp
FPP = cpp
CC  = gcc
CXX = g++
F77 = gfortran
F90 = gfortran

## Default flags
CPPFLAGS = -DMPICH_IGNORE_CXX_SEEK
FPPFLAGS = -traditional
CFLAGS   = -g3 -march=native -std=gnu99
CXXFLAGS = -g3 -march=native -std=gnu++0x
F77FLAGS = -g3 -march=native -fcray-pointer -m128bit-long-double -ffixed-line-length-none -fno-range-check
F90FLAGS = -g3 -march=native -fcray-pointer -m128bit-long-double -ffixed-line-length-none -fno-range-check
LDFLAGS  = -rdynamic

## Optimization flags
CPP_OPTIMISE_FLAGS = -DKRANC_VECTORS # -DCARPET_OPTIMISE -DNDEBUG
FPP_OPTIMISE_FLAGS =                 # -DCARPET_OPTIMISE -DNDEBUG
C_OPTIMISE_FLAGS   = -Ofast
CXX_OPTIMISE_FLAGS = -Ofast
F77_OPTIMISE_FLAGS = -Ofast
F90_OPTIMISE_FLAGS = -Ofast

## Warning flags
CPP_WARN_FLAGS = -Wall
FPP_WARN_FLAGS = -Wall
C_WARN_FLAGS   = -Wall
CXX_WARN_FLAGS = -Wall
F77_WARN_FLAGS = -Wall
F90_WARN_FLAGS = -Wall

## Debug flags
CPP_DEBUG_FLAGS = -DCARPET_DEBUG -fsanitize=undefined -fsanitize=thread
FPP_DEBUG_FLAGS = -DCARPET_DEBUG -fsanitize=undefined -fsanitize=thread
C_DEBUG_FLAGS   = -O0            -fsanitize=undefined -fsanitize=thread
CXX_DEBUG_FLAGS = -O0            -fsanitize=undefined -fsanitize=thread
F77_DEBUG_FLAGS = -O0            -fsanitize=undefined -fsanitize=thread
F90_DEBUG_FLAGS = -O0            -fsanitize=undefined -fsanitize=thread

## Code profiling flags
CPP_PROFILE_FLAGS =
FPP_PROFILE_FLAGS =
C_PROFILE_FLAGS   = -pg
CXX_PROFILE_FLAGS = -pg
F77_PROFILE_FLAGS = -pg
F90_PROFILE_FLAGS = -pg

## OpenMP
CPP_OPENMP_FLAGS = -fopenmp
FPP_OPENMP_FLAGS = -fopenmp
C_OPENMP_FLAGS   = -fopenmp
CXX_OPENMP_FLAGS = -fopenmp
F77_OPENMP_FLAGS = -fopenmp
F90_OPENMP_FLAGS = -fopenmp

## Libraries location
LIBDIRS      =
MPI_DIR      = /usr
HDF5_DIR     = /usr
PTHREADS_DIR = NO_BUILD
LIBS              = gfortran open-pal z 
C_LINE_DIRECTIVES = yes                 
F_LINE_DIRECTIVES = yes                 
```

I then create the configuration folder with

```shell
$ make base-config options=../arch.cfg THORNLIST=thornlists/einsteintoolkit.th
```

which reports no errors (the terminal output is attached as `make-config.out`), and then build the binary with

```shell
$ make -j $(nproc) base
```

which also reports no errors; the output is attached as `make-binary.out`.

To ensure full reproducibility, instead of doing this manually I created a very simple script, attached as `run.sh`, that reproduces the steps laid out in this section.
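
For readers without the attachment, the steps of this section can be sketched as a script. This is only an illustration and not the attached `run.sh`; the helper names `run` and `build_toolkit` and the `DRY_RUN` switch are made up for this sketch, and it assumes `arch.cfg` sits in the directory the script is started from.

```shell
#!/bin/sh
# Sketch of a reproduction script; this is NOT the attached run.sh.
# Assumption: arch.cfg sits in the directory the script is started from,
# so it is reachable as ../arch.cfg from inside Cactus/.

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The body runs in a subshell, so "set -e" and "cd" do not leak out.
build_toolkit() (
    set -e  # stop at the first failing step
    run curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2023_05/GetComponents
    run chmod a+x GetComponents
    run ./GetComponents --parallel https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2023_05/einsteintoolkit.th
    run cd Cactus
    run make base-config options=../arch.cfg THORNLIST=thornlists/einsteintoolkit.th
    run make -j "$(nproc)" base
)

# Usage:
#   DRY_RUN=1 build_toolkit   # only print the steps
#   build_toolkit             # download, configure and build
```

Running it first with `DRY_RUN=1` shows the exact commands without touching the network or the disk.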

# Running the Toolkit and Finding the Bug

With no errors so far and the binary `exe/cactus_base` in place, I run one of the parfiles that ship with the toolkit, in this case `par/tov_ET.par`

```shell
$ exe/cactus_base par/tov_ET.par
```

which crashes with output such as

```
Rank 0 with PID 2329840 received signal 11
Writing backtrace to tov_ET/backtrace.0.txt
[1]    2329840 segmentation fault (core dumped)
```

The full output of this run is attached as `run.out`, along with `backtrace.0.txt`.

Interestingly, if I insist on running this simulation a few times, I eventually get a run where the bug does not occur and the simulation seems to proceed as expected.

Therefore, this is not just a classical bug, it’s a Heisenbug!

# Tracking the Bug

If I run the binary with OpenMP threading disabled at runtime, i.e.

```shell
$ OMP_NUM_THREADS=1 exe/cactus_base par/tov_ET.par
```

then I consistently get no errors, even after running the code a few dozen times.

This is why I am led to believe that OpenMP is the culprit behind the Heisenbug.
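
Because the crash is intermittent, a single run per setting says little. The following is a minimal sketch of how the comparison could be automated: it repeats the same command several times for each thread count and counts the failing runs. The helper name `scan_threads` and the `LOG_DIR` variable are hypothetical and not part of Cactus.

```shell
# Hypothetical helper: for each thread count, run the command RUNS times
# and report how many runs exited with a non-zero status.
# Usage: scan_threads RUNS "COUNTS" COMMAND [ARGS...]
scan_threads() (
    runs=$1; counts=$2; shift 2
    for n in $counts; do
        failed=0; i=1
        while [ "$i" -le "$runs" ]; do
            # A crash such as a segfault shows up as a non-zero exit status.
            OMP_NUM_THREADS=$n "$@" > "${LOG_DIR:-.}/omp$n-run$i.log" 2>&1 ||
                failed=$((failed + 1))
            i=$((i + 1))
        done
        echo "OMP_NUM_THREADS=$n: $failed/$runs runs failed"
    done
)

# Example, from the Cactus root directory:
#   scan_threads 20 "1 2 4" exe/cactus_base par/tov_ET.par
```

Each run keeps its own log file, so a failing run can be inspected afterwards.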

I created a new configuration of the toolkit with `DEBUG=yes` and `OPTIMISE=no` in the options file above, which produced the binary `exe/cactus_base-debug`. For completeness, the output of the configuration step is attached as `make-config-debug.out` and that of the build as `make-binary-debug.out`, although neither reported errors.

I don’t have much experience debugging low-level code, so I decided to disable optimizations and add `-fsanitize=undefined` and `-fsanitize=thread` to the GCC flags, which looked reasonable to me.
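
A sanitizer build can print a very large number of reports. The following is a minimal sketch, not something tested against this build, of a wrapper that uses the sanitizers' runtime options to stop at the first ThreadSanitizer report and write it to a file. The helper name `run_tsan_once` and the `TSAN_LOG_PREFIX` variable are hypothetical, and the wrapper overrides any `TSAN_OPTIONS` or `UBSAN_OPTIONS` already set in the environment.

```shell
# Hypothetical helper: run a sanitizer-instrumented binary so that it
# stops at the first ThreadSanitizer report.
run_tsan_once() (
    # halt_on_error=1    : exit after the first reported error.
    # log_path=<prefix>  : write reports to <prefix>.<pid> instead of stderr.
    # print_stacktrace=1 : make UBSan print a stack trace with each report.
    TSAN_OPTIONS="halt_on_error=1 log_path=${TSAN_LOG_PREFIX:-tsan-report}" \
    UBSAN_OPTIONS="print_stacktrace=1" \
        "$@"
)

# Example, from the Cactus root directory:
#   run_tsan_once exe/cactus_base-debug par/tov_ET.par
```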

Running the debug version of the binary with the same parfile

```shell
$ exe/cactus_base-debug par/tov_ET.par
```

prints the same error over and over, and the program never terminates (by never I mean within about one minute). The errors look something like

```
==================
WARNING: ThreadSanitizer: data race (pid=2419680)
  Read of size 8 at 0x7ffe815ca630 by thread T5:
    #0 grhydro_atmospherereset_._omp_fn.0 /home/undercover/misc/tmp/Cactus/arrangements/EinsteinEvolve/GRHydro/src/GRHydro_UpdateMask.F90:323 (cactus_base-debug+0xd14fbb9) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #1 gomp_thread_start /usr/src/debug/gcc/gcc/libgomp/team.c:129 (libgomp.so.1+0x20c95) (BuildId: 919d8c8c3093e63652b89795375dcf12dd9cb1d4)

  Previous write of size 8 at 0x7ffe815ca630 by main thread:
    #0 grhydro_atmospherereset_ /home/undercover/misc/tmp/Cactus/arrangements/EinsteinEvolve/GRHydro/src/GRHydro_UpdateMask.F90:321 (cactus_base-debug+0xd11394a) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #1 CCTKi_BindingsFortranWrapperGRHydro /home/undercover/misc/tmp/Cactus/configs/base-debug/bindings/Variables/GRHydro.c:37 (cactus_base-debug+0x155440a4) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #2 CCTK_CallFunction /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:323 (cactus_base-debug+0x152bc047) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #3 CallScheduledFunction /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/CallFunction.cc:440 (cactus_base-debug+0xa36120b) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #4 Carpet::CallFunction(void*, cFunctionData*, void*) /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/CallFunction.cc:373 (cactus_base-debug+0xa35f44b) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #5 CCTKi_ScheduleCallFunction /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:3096 (cactus_base-debug+0x152c6b74) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #6 ScheduleTraverseFunction /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:595 (cactus_base-debug+0x152d6b94) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #7 ScheduleTraverseGroup /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:369 (cactus_base-debug+0x152d5be6) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #8 ScheduleTraverseGroup /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:385 (cactus_base-debug+0x152d673d) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #9 ScheduleTraverseGroup /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:385 (cactus_base-debug+0x152d673d) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #10 CCTKi_DoScheduleTraverse /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:159 (cactus_base-debug+0x152d4d7e) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #11 ScheduleTraverse /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:1400 (cactus_base-debug+0x152bedf0) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #12 CCTK_ScheduleTraverse /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:919 (cactus_base-debug+0x152bcedf) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #13 ScheduleTraverse /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/Initialise.cc:1393 (cactus_base-debug+0xa3f8929) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #14 CallRestrict /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/Initialise.cc:529 (cactus_base-debug+0xa3e8be8) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #15 Carpet::Initialise(tFleshConfig*) /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/Initialise.cc:121 (cactus_base-debug+0xa3ddc37) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #16 main /home/undercover/misc/tmp/Cactus/src/main/flesh.cc:80 (cactus_base-debug+0x1528f2e7) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)

  Location is stack of main thread.

  Location is global '<null>' at 0x000000000000 ([stack]+0xf7630)

  Thread T5 (tid=2419694, running) created by main thread at:
    #0 pthread_create /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:1036 (libtsan.so.2+0x44219) (BuildId: 7e8fcb9ed0a63b98f2293e37c92ac955413efd9e)
    #1 gomp_team_start /usr/src/debug/gcc/gcc/libgomp/team.c:858 (libgomp.so.1+0x212df) (BuildId: 919d8c8c3093e63652b89795375dcf12dd9cb1d4)
    #2 CarpetLib::dist::pseudoinit(ompi_communicator_t*) /home/undercover/misc/tmp/Cactus/arrangements/Carpet/CarpetLib/src/dist.cc:200 (cactus_base-debug+0xac0eac8) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #3 CarpetMultiModelStartup /home/undercover/misc/tmp/Cactus/arrangements/Carpet/Carpet/src/CarpetStartup.cc:29 (cactus_base-debug+0xa3733e3) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #4 CCTK_CallFunction /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:309 (cactus_base-debug+0x152bbf51) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #5 CCTKi_ScheduleCallFunction /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:3096 (cactus_base-debug+0x152c6b74) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #6 ScheduleTraverseFunction /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:595 (cactus_base-debug+0x152d6b94) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #7 ScheduleTraverseGroup /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:369 (cactus_base-debug+0x152d5be6) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #8 CCTKi_DoScheduleTraverse /home/undercover/misc/tmp/Cactus/src/schedule/ScheduleTraverse.c:159 (cactus_base-debug+0x152d4d7e) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #9 ScheduleTraverse /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:1400 (cactus_base-debug+0x152bedf0) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #10 CCTK_ScheduleTraverse /home/undercover/misc/tmp/Cactus/src/main/ScheduleInterface.c:919 (cactus_base-debug+0x152bcedf) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #11 CCTKi_CallStartupFunctions /home/undercover/misc/tmp/Cactus/src/main/CallStartupFunctions.c:50 (cactus_base-debug+0x1527f958) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #12 CCTKi_InitialiseCactus /home/undercover/misc/tmp/Cactus/src/main/InitialiseCactus.c:117 (cactus_base-debug+0x152a179e) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)
    #13 main /home/undercover/misc/tmp/Cactus/src/main/flesh.cc:64 (cactus_base-debug+0x1528f271) (BuildId: 9f545ec6a5fce94a14678be3027bcefa0e2d6645)

SUMMARY: ThreadSanitizer: data race /home/undercover/misc/tmp/Cactus/arrangements/EinsteinEvolve/GRHydro/src/GRHydro_UpdateMask.F90:323 in grhydro_atmospherereset_._omp_fn.0
==================
```

The full output of this run, up to the point where I stopped it, is attached as `run-debug.out` (it’s rather large for a text file, sorry).
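
Since the log is large, it may help to condense it. A minimal sketch, assuming every report ends with a `SUMMARY: ThreadSanitizer` line like the one shown above, counts how often each distinct location is reported; the helper name `summarize_tsan` is hypothetical.

```shell
# Hypothetical helper: count how often each distinct location appears in
# the "SUMMARY: ThreadSanitizer" lines of a log, most frequent first.
summarize_tsan() {
    grep '^SUMMARY: ThreadSanitizer' "$1" | sort | uniq -c | sort -rn
}

# Example:
#   summarize_tsan run-debug.out
```

Each output line is a count followed by the summary text, so a handful of lines can stand in for the whole file.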

Once again, with OpenMP disabled at runtime everything seems fine, apart from the error

```
Cactus/arrangements/CactusNumerical/MoL/src/Operators.c:332:31: runtime error: variable length array bound evaluates to non-positive value 0
```

which I don’t think is problematic, but I’ve decided to share it anyway.

If you have any general or specific tips, tricks or hacks for tracking down this bug more accurately, or for interpreting the output above, they would be much appreciated.

# Machine Information

I’ve witnessed this behavior on two different machines with similar OSes:

* Legion

    * Type: Laptop
    * Processor: Quad-core Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz (no hyper-threading)
    * GPU: Integrated + Nvidia 1050 Ti Mobile
    * OS: Manjaro (x86_64)
    * Kernel: Linux LTS 5.10.206
    * GCC: 13.2.1
    * OpenMP: 16.0.6
    * OpenBLAS: 0.3.26
    * hwloc: 2.10.0

* Gravitino

    * Type: Desktop
    * Processor: Octa-core Intel(R) Core(TM) i7-9700 @ 3.00GHz (no hyper-threading)
    * GPU: Nvidia 1050 Ti
    * OS: Arch Linux (x86_64)
    * Kernel: Linux LTS 6.6.15
    * GCC: 13.2.1
    * OpenMP: 16.0.6
    * OpenBLAS: 0.3.26
    * hwloc: 2.10.0

There are no virtual environments; everything is managed by the global package manager using the latest releases in the corresponding repositories, and all binaries are linked against system libraries.
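
Which OpenMP runtime a binary is actually linked against can be checked with `ldd`. A minimal sketch follows; the helper name `list_openmp_libs` is hypothetical, and the pattern only covers the GNU, LLVM and Intel runtime names.

```shell
# Hypothetical helper: print the OpenMP runtime libraries a binary is
# dynamically linked against. GCC's runtime is libgomp, LLVM's is libomp
# and Intel's is libiomp5. Returns non-zero if none is found.
list_openmp_libs() {
    ldd "$1" | grep -E 'lib(gomp|omp|iomp5)\.so'
}

# Example, from the Cactus root directory:
#   list_openmp_libs exe/cactus_base
```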

If you need any more information about either machine, or about any of the steps above, do not hesitate to reply to this thread.

Thank you!
attachment: backtrace.0.txt (https://api.bitbucket.org/2.0/repositories/einsteintoolkit/tickets/issues/2774/attachments/backtrace.0.txt)
attachment: make-binary.out (https://api.bitbucket.org/2.0/repositories/einsteintoolkit/tickets/issues/2774/attachments/make-binary.out)
attachment: make-binary-debug.out (https://api.bitbucket.org/2.0/repositories/einsteintoolkit/tickets/issues/2774/attachments/make-binary-debug.out)
attachment: make-config.out (https://api.bitbucket.org/2.0/repositories/einsteintoolkit/tickets/issues/2774/attachments/make-config.out)
attachment: make-config-debug.out (https://api.bitbucket.org/2.0/repositories/einsteintoolkit/tickets/issues/2774/attachments/make-config-debug.out)
attachment: run.out (https://api.bitbucket.org/2.0/repositories/einsteintoolkit/tickets/issues/2774/attachments/run.out)
attachment: run.sh (https://api.bitbucket.org/2.0/repositories/einsteintoolkit/tickets/issues/2774/attachments/run.sh)
attachment: run-debug.out (https://api.bitbucket.org/2.0/repositories/einsteintoolkit/tickets/issues/2774/attachments/run-debug.out)


--
Ticket URL: https://bitbucket.org/einsteintoolkit/tickets/issues/2774/openmp-heisenbug-with-default-thornfile