Commit fe9d871a authored by Robert Schmidt

Merge remote-tracking branch 'origin/doc-ci-segv-debug' into integration_2024_w10

parents 127ae812 e615321e
@@ -255,7 +255,44 @@ Some tests are run from source (e.g.
`ci-scripts/xml_files/gnb_phytest_usrp_run.xml`), which directly give the
options they are run with.
## How to debug CI failures
It is possible to debug CI failures using the generated core dump and the image
used for the run. A script is provided (see developer instructions below) that,
given the core dump file, the container image, and the source tree, runs
`gdb` inside the container; using the core dump information, a developer can
then investigate the cause of the failure.
### Developer instructions
The CI team will send you a docker image, a core dump file, and the commit at
which the pipeline failed. Let's assume the core dump is stored at
`/tmp/coredump.tar.xz`, and the image is in `/tmp/oai-nr-ue.tar.gz`. First,
check out the corresponding branch (or the commit directly), let's say
in `~/oai-branch-fail`. Then unpack the core dump, load the image into docker,
and use the script [`docker/debug_core_image.sh`](../docker/debug_core_image.sh)
to open gdb, as follows:
```
cd /tmp
tar -xJf /tmp/coredump.tar.xz
docker load < /tmp/oai-nr-ue.tar.gz
~/oai-branch-fail/docker/debug_core_image.sh <image> /tmp/coredump ~/oai-branch-fail
```
where you replace `<image>` with the name of the image that `docker load`
reports as loaded. The script will start the container and open gdb; you should
see information about where the failure (e.g., segmentation fault) happened. If
you just see `??`, the core dump and container image don't match. Also watch
for the corresponding warning from gdb:
```
warning: core file may not match specified executable file.
```
Once you quit `gdb`, the container is removed automatically (the script uses
`docker run --rm`); the loaded image stays on your machine.
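
To avoid copying the image name by hand, it can be read from the line that
`docker load` prints (`Loaded image: <name>`). The following is a minimal
sketch, not part of the repository: the helper name `loaded_image_name` and the
example tag are hypothetical.

```shell
# Hypothetical helper: read `docker load` output on stdin and print the name
# from the first "Loaded image: <name>" line.
loaded_image_name() {
  sed -n 's/^Loaded image: //p' | head -n 1
}

# Illustration with a made-up image name (no docker needed):
printf 'Loaded image: oai-nr-ue:develop-c99db698\n' | loaded_image_name
```

With docker available, this could be used as
`image=$(docker load < /tmp/oai-nr-ue.tar.gz | loaded_image_name)` and `$image`
then passed as the first argument of `debug_core_image.sh`.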
### CI team instructions
The entrypoint scripts of all containers print the core pattern that is used on
the running machine. Search for `core_pattern` at the start of the container
@@ -267,19 +304,14 @@ logs to retrieve the possible location. Possible locations might be:
- abrt: see [documentation](https://abrt.readthedocs.io/en/latest/usage.html)
- apport: see [documentation](https://wiki.ubuntu.com/Apport)
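
The core pattern found in the container logs can be mapped to one of these
handlers with a small helper. This is a minimal sketch: the function name
`core_handler` is hypothetical, and the substrings it matches
(`systemd-coredump`, `abrt`, `apport`) are assumptions about typical handler
paths, not something taken from the entrypoint scripts.

```shell
# Hypothetical helper: classify a kernel.core_pattern value.
# Assumption: a pipe handler ("|...") contains the tool name in its path.
core_handler() {
  case $1 in
    "|"*systemd-coredump*) echo systemd ;;
    "|"*abrt*)             echo abrt ;;
    "|"*apport*)           echo apport ;;
    "|"*)                  echo unknown-pipe ;;
    *)                     echo file ;;
  esac
}

# Example with a plain file pattern:
core_handler '/tmp/core.%e.%p.%t'
# On a live machine: core_handler "$(sysctl -n kernel.core_pattern)"
```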
See below for instructions on how to retrieve the core dump. Further, download
the image and store it to a file using `docker save`. Make sure to pick the
right image (Ubuntu or RHEL)!
#### Core dump in a file
**This is not recommended, as files could pile up and fill the system disk
completely!** Prefer another method further down.
If the core pattern is a path: it should at least include the time in the
pattern name (suggested pattern: `/tmp/core.%e.%p.%t`) to correlate the time
@@ -287,38 +319,31 @@ the segfault occurred with the CI logs. If you identified the core dump,
copy the core dump from that machine; if identification is difficult, consider
rerunning the pipeline.
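
With the suggested pattern, the PID (`%p`) and the dump time in seconds since
the epoch (`%t`) are the last two dot-separated fields of the file name, so
they can be extracted for comparison with the CI logs. A minimal sketch; the
helper names and the example file name are hypothetical.

```shell
# Assumes the suggested pattern /tmp/core.%e.%p.%t, where %p is the PID and
# %t the dump time in seconds since the epoch.
core_file_epoch() {
  local name=${1##*/}   # strip the directory
  echo "${name##*.}"    # last dot-separated field is %t
}

core_file_pid() {
  local name=${1##*/}
  name=${name%.*}       # drop the trailing %t field
  echo "${name##*.}"    # the field before it is %p
}

# Hypothetical file name, for illustration only
f=/tmp/core.nr-softmodem.4242.1709812800
echo "pid=$(core_file_pid "$f") epoch=$(core_file_epoch "$f")"
```

With GNU `date`, `date -u -d "@$(core_file_epoch "$f")"` converts the epoch
value into a readable UTC timestamp.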
#### Core dump via systemd
Use the first command to list all core dumps. Scroll down to the core dump of
interest (it lists the executables in the last column; use the time to
correlate the segfault and the CI run). Take the PID of the executable (first
column after the time). Dump the core dump to a location of your choice.
```
sudo coredumpctl list
sudo coredumpctl dump <PID> > /tmp/coredump
```
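
Instead of scrolling through the full list, `coredumpctl` can filter by
executable name and by time window (`--since`, `--until`). A minimal sketch;
the wrapper name `list_cores`, the `COREDUMPCTL` override, and the example
times are hypothetical.

```shell
# Hypothetical wrapper around `coredumpctl list`.
# COREDUMPCTL can be overridden (e.g. with `echo`) for a dry run.
list_cores() {
  local exe=$1 start=$2 end=$3
  ${COREDUMPCTL:-sudo coredumpctl} --no-pager \
    --since "$start" --until "$end" list "$exe"
}

# Dry run: print the arguments instead of calling coredumpctl
COREDUMPCTL=echo list_cores nr-softmodem "2024-03-07 11:00" "2024-03-07 13:00"
```

Once the PID is known, `sudo coredumpctl dump <PID> --output=/tmp/coredump`
writes the core dump to a file without shell redirection.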
#### Core dump via abrt (automatic bug reporting tool)
TBD: use the documentation page for the moment.
#### Core dump via apport
I did not find an easy way to use apport. Anyway, the systemd approach works
fine. So disable apport, install systemd-coredump, and verify it is the new
coredump handler:
```
sudo systemctl stop apport
sudo systemctl mask --now apport
sudo apt install systemd-coredump
# Verify this changed the core pattern to a pipe to systemd-coredump
sysctl kernel.core_pattern
```
docker/debug_core_image.sh
#!/bin/bash

if [ $# -ne 3 ]; then
  echo "usage: $0 <image> <coredump> <path-to-sources>"
  exit 1
fi

# print an error message to stderr and abort
die() {
  echo "$1" >&2
  exit 1
}

IMAGE=$1
COREDUMP=$2
SOURCES=$3

set -x

# the image/build_oai builds in cmake_targets/ran_build/build, so source
# information is relative to this path. In case the user did not compile on
# their computer, this directory will not exist. Still allow gdb to find it by
# creating it
BUILD_DIR=$SOURCES/cmake_targets/ran_build/build
mkdir -p "$BUILD_DIR" || die "cannot create $BUILD_DIR: is $SOURCES valid?"

# check if coredump is a valid file
[ -f "$COREDUMP" ] || die "no such file: $COREDUMP"

# check if image exists, and determine type (gnb, nr-ue, enb, lte-ue) for
# correct invocation of gdb
docker image inspect "$IMAGE" > /dev/null || exit 1
if grep -q -e "oai-gnb:" -e "oai-gnb-aerial:" <<< "$IMAGE"; then
  EXEC=bin/nr-softmodem
  TYPEPATH=oai-gnb
elif grep -q "oai-nr-ue:" <<< "$IMAGE"; then
  EXEC=bin/nr-uesoftmodem
  TYPEPATH=oai-nr-ue
elif grep -q "oai-enb:" <<< "$IMAGE"; then
  EXEC=bin/lte-softmodem
  TYPEPATH=oai-enb
elif grep -q "oai-lte-ue:" <<< "$IMAGE"; then
  EXEC=bin/lte-uesoftmodem
  TYPEPATH=oai-lte-ue
else
  die "cannot determine image type: must match \"oai-gnb:\", \"oai-gnb-aerial:\", \"oai-nr-ue:\", \"oai-enb:\", or \"oai-lte-ue:\""
fi

# run gdb inside a container. We mount the coredump and the sources inside the
# container, and run gdb with the core dump, the correct executable, and using
# the source directory to show correct line numbers
docker run --rm -it \
  -v "$COREDUMP":/tmp/coredump \
  -v "$SOURCES":/opt/$TYPEPATH/src \
  --entrypoint bash \
  "$IMAGE" \
  -c "gdb --dir=src/cmake_targets/ran_build/build $EXEC /tmp/coredump"
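
The script hands the core dump and source paths to `docker run -v` as given.
Docker expects an absolute host path for a bind mount and otherwise interprets
the value as a volume name, so relative arguments would need to be resolved
first. A minimal sketch, assuming `realpath` from GNU coreutils is available;
the helper name `to_abs` is hypothetical and not part of the script.

```shell
# Hypothetical helper: turn a possibly relative path into an absolute one.
# Assumption: GNU coreutils `realpath` is installed.
to_abs() {
  realpath -- "$1"
}

# Example: resolve the current directory
to_abs .
```

In the script, this could be applied as `COREDUMP=$(to_abs "$2")` and
`SOURCES=$(to_abs "$3")` before the paths are used.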