Merge remote-tracking branch 'origin/doc-ci-segv-debug' into integration_2024_w10

fe9d871a · Robert Schmidt · 127ae812 · e615321e · fe9d871a · fe9d871a
Commit fe9d871a authored Mar 11, 2024 by Robert Schmidt

Showing with 114 additions and 33 deletions

doc/TESTBenches.md +58 -33

docker/debug_core_image.sh +56 -0

--- a/doc/TESTBenches.md
+++ b/doc/TESTBenches.md
@@ -255,7 +255,44 @@ Some tests are run from source (e.g.
 `ci-scripts/xml_files/gnb_phytest_usrp_run.xml`), which directly give the
 options they are run with.
-## How to retrieve core dumps (for CI team members)
+## How to debug CI failures
+CI failures can be debugged using the generated core dump and the container
+image used for the run. A script (see developer instructions below) takes the
+core dump file, the container image, and the source tree, and runs `gdb`
+inside the container, so that a developer can use the core dump to
+investigate the cause of the failure.
+### Developer instructions
+The CI team will send you a docker image, a core dump file, and the commit
+at which the pipeline failed. Let's assume the core dump is stored at
+`/tmp/coredump.tar.xz`, and the image is in `/tmp/oai-nr-ue.tar.gz`. First,
+check out the corresponding branch (or the commit directly), let's say
+in `~/oai-branch-fail`. Now, unpack the core dump, load the image into docker,
+and use the script [`docker/debug_core_image.sh`](../docker/debug_core_image.sh)
+to open gdb, as follows:
+```
+cd /tmp
+tar -xJf /tmp/coredump.tar.xz
+docker load < /tmp/oai-nr-ue.tar.gz
+~/oai-branch-fail/docker/debug_core_image.sh <image> /tmp/coredump ~/oai-branch-fail
+```
+where you replace `<image>` with the image name printed by `docker load`. The
+script will start the container and open gdb; you should see information about
+where the failure (e.g., segmentation fault) happened. If you just see `??`,
+the core dump and container image don't match. Also watch for the
+corresponding message from gdb:
+```
+warning: core file may not match specified executable file.
+```
+Once you quit `gdb`, the container will be removed automatically.
+### CI team instructions
 The entrypoint scripts of all containers print the core pattern that is used on
 the running machine. Search for `core_pattern` at the start of the container
@@ -267,19 +304,14 @@ logs to retrieve the possible location. Possible locations might be:
 - abrt: see [documentation](https://abrt.readthedocs.io/en/latest/usage.html)
 - apport: see [documentation](https://wiki.ubuntu.com/Apport)
-You furthermore have to extract the executable that caused the core dump.
+See below for instructions on how to retrieve the core dump. Further, download
-Download the container image, and extract, e.g.:
+the image and store it to a file using `docker save`. Make sure to pick the
+right image (Ubuntu or RHEL)!
-```
+#### Core dump in a file
-docker create --name c1 porcepix.sboai.cs.eurecom.fr/oai-gnb:develop-c99db698
-docker cp c1:/opt/oai-gnb/bin/nr-softmodem /tmp
-docker rm c1
-```
-### Core dump in a file
 **This is not recommended, as files could pile up and fill the system disk
-completely!** Prefer systemd or abrt instead.
+completely!** Prefer another method further down.
 If the core pattern is a path: it should at least include the time in the
 pattern name (suggested pattern: `/tmp/core.%e.%p.%t`) to correlate the time
@@ -287,38 +319,31 @@ the segfault occurred with the CI logs. If you identified the core dump,
 copy the core dump from that machine; if identification is difficult, consider
 rerunning the pipeline.
-### Core dump via systemd
+#### Core dump via systemd
-Run this command to list all core dumps:
+Use the first command to list all core dumps. Scroll down to the core dump of
+interest (it lists the executables in the last column; use the time to
+correlate the segfault and the CI run). Take the PID of the executable (first
+column after the time). With the second command, dump the core to a file.
 ```
 sudo coredumpctl list
-```
-Scroll to the end and find the core dump of interest (it lists the executables
-in the last column; use the time to correlate the segfault and the CI run).
-Take the PID of the executable (first column after the time). Dump the core
-dump to a location of your choice:
-```
 sudo coredumpctl dump <PID> > /tmp/coredump
 ```
-### Core dump via abrt (automatic bug reporting tool)
+#### Core dump via abrt (automatic bug reporting tool)
 TBD: use the documentation page for the moment.
-### Core dump via apport
+#### Core dump via apport
-On Ubuntu machines, apport first needs to be enabled to collect core dumps:
-```
-sudo systemctl enable apport.service
-```
-and [needs to be enabled](https://wiki.ubuntu.com/Apport#How_to_enable_apport).
-Then, show a list of core dumps using
+I did not find an easy way to use apport. Anyway, the systemd approach works
+fine. So disable apport, install systemd-coredump, and verify it is the new
+coredump handler:
 ```
-sudo apport-cli
+sudo systemctl stop apport
+sudo systemctl mask --now apport
+sudo apt install systemd-coredump
+# Verify this changed the core pattern to a pipe to systemd-coredump
+sysctl kernel.core_pattern
 ```
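The verification step above can also be scripted. The following is a minimal sketch, not part of this commit: `classify_core_pattern` is a hypothetical helper name, and the substring matching assumes that the handler's name appears in the pipe target of the core pattern.

```shell
#!/bin/bash
# Hypothetical helper (not in the repository): classify a core pattern string
# such as the one printed by `sysctl -n kernel.core_pattern`.
classify_core_pattern() {
  local pattern=$1
  case "$pattern" in
    "|"*systemd-coredump*) echo "systemd" ;;    # piped to systemd-coredump
    "|"*apport*)           echo "apport" ;;     # piped to apport
    "|"*abrt*)             echo "abrt" ;;       # piped to an abrt hook
    "|"*)                  echo "other-pipe" ;; # piped to an unknown handler
    *)                     echo "file" ;;       # plain path: core dump in a file
  esac
}

# Example with a literal pattern; on a real machine, pass
# "$(sysctl -n kernel.core_pattern)" instead.
classify_core_pattern '|/lib/systemd/systemd-coredump %P %u %g %s %t %c %h'
```

If the result is `file`, the caveats from "Core dump in a file" apply.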
--- a/docker/debug_core_image.sh
+++ b/docker/debug_core_image.sh
+#!/bin/bash
+if [ $# -ne 3 ]; then
+  echo "usage: $0 <image> <coredump> <path-to-sources>"
+  exit 1
+fi
+die() {
+  echo "$1" >&2
+  exit 1
+}
+IMAGE=$1
+COREDUMP=$2
+SOURCES=$3
+set -x
+# the image/build_oai builds in cmake_targets/ran_build/build, so source
+# information is relative to this path. In case the user did not compile on
+# their computer, this directory will not exist; create it so that the
+# relative source paths can still be resolved
+BUILD_DIR=$SOURCES/cmake_targets/ran_build/build
+mkdir -p "$BUILD_DIR" || die "cannot create $BUILD_DIR: is $SOURCES valid?"
+# check if coredump is a regular file
+[ -f "$COREDUMP" ] || die "no such file: $COREDUMP"
+# check if image exists, and determine type (gnb, nr-ue, enb, lte-ue) for
+# correct invocation of gdb
+docker image inspect "$IMAGE" > /dev/null || exit 1
+if grep -q -e "oai-gnb:" -e "oai-gnb-aerial:" <<< "$IMAGE"; then
+  EXEC=bin/nr-softmodem
+  TYPEPATH=oai-gnb
+elif grep -q "oai-nr-ue:" <<< "$IMAGE"; then
+  EXEC=bin/nr-uesoftmodem
+  TYPEPATH=oai-nr-ue
+elif grep -q "oai-enb:" <<< "$IMAGE"; then
+  EXEC=bin/lte-softmodem
+  TYPEPATH=oai-enb
+elif grep -q "oai-lte-ue:" <<< "$IMAGE"; then
+  EXEC=bin/lte-uesoftmodem
+  TYPEPATH=oai-lte-ue
+else
+  die "cannot determine image type: must match \"oai-gnb:\", \"oai-gnb-aerial:\", \"oai-nr-ue:\", \"oai-enb:\", or \"oai-lte-ue:\""
+fi
+# run gdb inside a container. We mount the coredump and the sources inside the
+# container, and run gdb with the core dump, the correct executable, and using
+# the source directory to show correct line numbers
+docker run --rm -it \
+  -v "$COREDUMP":/tmp/coredump \
+  -v "$SOURCES":/opt/$TYPEPATH/src \
+  --entrypoint bash \
+  "$IMAGE" \
+  -c "gdb --dir=src/cmake_targets/ran_build/build $EXEC /tmp/coredump"
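The image-type detection in the script can also be written as a single `case` statement. The following is a sketch of that alternative, not the code of this commit: `image_to_exec` is a hypothetical function name, and the mapping only mirrors the image names and executables listed in the script above.

```shell
#!/bin/bash
# Sketch only: map an image reference to the executable inside the image,
# mirroring the if/elif chain of debug_core_image.sh.
image_to_exec() {
  case "$1" in
    *oai-gnb:*|*oai-gnb-aerial:*) echo "bin/nr-softmodem" ;;
    *oai-nr-ue:*)                 echo "bin/nr-uesoftmodem" ;;
    *oai-enb:*)                   echo "bin/lte-softmodem" ;;
    *oai-lte-ue:*)                echo "bin/lte-uesoftmodem" ;;
    *) echo "unknown image type: $1" >&2; return 1 ;;
  esac
}

# Example using an image name that appears in the documentation
image_to_exec "porcepix.sboai.cs.eurecom.fr/oai-gnb:develop-c99db698"
```

Glob matching in `case` avoids spawning `grep` and is not affected by word splitting of the image name.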