
[BUG] smp ostest fails randomly with message signest_test: ERROR x ODD signals were nested #15059

Closed
chirping78 opened this issue Dec 5, 2024 · 13 comments
Labels: Arch: xtensa · Area: Kernel · OS: Linux · Type: Bug

Comments

@chirping78
Contributor

Description / Steps to reproduce the issue

Currently esp32s3-devkit:smp ostest fails randomly with the message signest_test: ERROR x ODD signals were nested.
x is not a fixed number.

The failure rate is about 1 in 10, i.e. out of 10 runs of ostest, one fails with the above error message.
Is anyone looking into this issue?

Today's version snapshot: nuttx 060fda0, apps 6600a5fd0.

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Ubuntu 20.04

NuttX Version

master

Issue Architecture

[Arch: xtensa]

Issue Area

[Area: Kernel]

Verification

  • I have verified before submitting the report.
@chirping78
Contributor Author

I searched the issue tracker and found two related issues, but:
#14749: focuses on the error status reporting.
#14807: points to an SMP signal race condition; it may or may not be related to this.

I'm not sure whether there is an existing issue focused on this error, so I created this new one.

cc @xiaoxiang781216

@xiaoxiang781216
Contributor

@tmedicci could you look at this problem? We are evaluating Xtensa ESP SMP capability and stability.

@tmedicci
Contributor

tmedicci commented Dec 5, 2024

@tmedicci could you look at this problem? We are evaluating Xtensa ESP SMP capability and stability.

I will test it locally. Our CI hasn't shown any problem related to that recently.

@chirping78
Contributor Author

The failure rate is about 1 in 10, i.e. out of 10 runs of ostest, one fails with the above error message.

@tmedicci As stated above, the failure rate is about 1 in 10.
If your CI runs ostest only once after building, you may not hit the error.
Only running ostest many times will reveal it.

@tmedicci
Contributor

tmedicci commented Dec 6, 2024

The failure rate is about 1 in 10, i.e. out of 10 runs of ostest, one fails with the above error message.

@tmedicci As stated above, the failure rate is about 1 in 10. If your CI runs ostest only once after building, you may not hit the error. Only running ostest many times will reveal it.

Hi @chirping78 ,

I left it running overnight: it ran 426 times without any error until I stopped it this morning. I simply ran it multiple times, without rebooting the device, checking both the return code and the output (looking for an ERROR message).

Would you mind adding the nxdiag app and sharing the results?
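
(For reference, a rough sketch of enabling nxdiag on top of the existing smp configuration; the Kconfig symbol name CONFIG_SYSTEM_NXDIAG and the use of kconfig-tweak are assumptions, adjust to your setup:)

cd nuttx
./tools/configure.sh esp32s3-devkit:smp
# CONFIG_SYSTEM_NXDIAG is assumed to be the Kconfig symbol for the nxdiag app
kconfig-tweak --enable CONFIG_SYSTEM_NXDIAG
make olddefconfig && make -j$(nproc)
# then run `nxdiag` from nsh on the target and capture its output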

@tmedicci
Contributor

tmedicci commented Dec 6, 2024

Please check my firmware, if possible:

nuttx.zip

(If you don't mind, share yours too, please)

@chirping78
Contributor Author

nuttx.zip: this is my firmware, built from nuttx 060fda0, apps 6600a5fd0.

Which git hashes is your firmware built from?

@tmedicci
Contributor

tmedicci commented Dec 9, 2024

Today's version snapshot: nuttx 060fda0, apps 6600a5fd0.

The same!

That's why I asked you about the nxdiag information: maybe it's related to the toolchain/system configuration. Can you add it and share the results, please?

@tmedicci
Contributor

tmedicci commented Dec 9, 2024

nuttx.zip: this is my firmware, built from nuttx 060fda0, apps 6600a5fd0.

Which git hashes is your firmware built from?

Just confirming: I was able to reproduce the failure with the firmware you built on our internal tests. Considering that we are both using the same hashes for nuttx and nuttx-apps, it's important to verify your build environment with nxdiag.

@chirping78
Contributor Author

This is the nxdiag report: nxdiag-1211.log; please check whether any of the listed software looks suspect.

It is still built from nuttx 060fda0, apps 6600a5fd0; the tree is only dirty because, after enabling nxdiag, I ran savedefconfig and copied the resulting defconfig.

@tmedicci
Contributor

Thanks @chirping78 !

From your logs, I saw that you have been using a newer toolchain version: xtensa-esp32s3-elf-gcc: xtensa-esp-elf-gcc (crosstool-NG esp-13.2.0_20240530) 13.2.0. We recommend using the toolchain based on GCC 12.2.0, the one that is also tested by NuttX's CI:

curl -s -L "https://github.com/espressif/crosstool-NG/releases/download/esp-12.2.0_20230208/xtensa-esp32s3-elf-12.2.0_20230208-x86_64-linux-gnu.tar.xz"

Maybe this is related to the issue.
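
(A minimal sketch of installing that toolchain and putting it on the PATH; the install directory and the name of the extracted folder are assumptions:)

mkdir -p ~/toolchains && cd ~/toolchains
curl -s -L "https://github.com/espressif/crosstool-NG/releases/download/esp-12.2.0_20230208/xtensa-esp32s3-elf-12.2.0_20230208-x86_64-linux-gnu.tar.xz" | tar -xJf -
# the archive is assumed to unpack into xtensa-esp32s3-elf/
export PATH="$HOME/toolchains/xtensa-esp32s3-elf/bin:$PATH"
xtensa-esp32s3-elf-gcc --version   # should now report 12.2.0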

Can you try building your firmware within the docker container used by NuttX's CI:

sudo docker run -it ghcr.io/apache/nuttx/apache-nuttx-ci-linux:latest \
  /bin/bash -c "
  cd /tmp ; ls -la
  pwd ;
  git clone https://github.com/apache/nuttx nuttx;
  pushd nuttx; git checkout 060fda032bc446e8deb1322ff78ec1fedee19a45; popd;
  git clone https://github.com/apache/nuttx-apps apps;
  pushd apps; git checkout 6600a5fd0; popd;
  cd nuttx
  make -j distclean && ./tools/configure.sh esp32s3-devkit:smp && make -j$(nproc);
"

(the firmware will be at your PC's /tmp/nuttx folder)
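
(Assuming the build output has been copied out of the container to /tmp/nuttx, a rough sketch of flashing it and reaching the console; the serial port, baud rate, and the `make flash` helper are assumptions based on the usual Espressif workflow in NuttX:)

cd /tmp/nuttx
make flash ESPTOOL_PORT=/dev/ttyUSB0   # port name is an example
picocom -b 115200 /dev/ttyUSB0         # then run `ostest` from nsh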

@chirping78
Contributor Author

@tmedicci per your suggestion, I downgraded the toolchain to the CI's GCC version.
Then I ran ostest 10 times and they all passed. So is this really related to the toolchain version?

Given that the new toolchain triggers the error, it means:

  • Either the new toolchain has a bug;
  • Or the code has a bug (the new toolchain optimizes more aggressively, which means the code is missing something like a volatile, a barrier, or a lock); see the sketch below.
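
(One rough way to narrow this down: disassemble the binaries produced by the two toolchains and compare the signal-dispatch and critical-section code; the file names and build locations below are examples only:)

xtensa-esp32s3-elf-objdump -d nuttx-gcc12/nuttx > gcc12.dis
xtensa-esp32s3-elf-objdump -d nuttx-gcc13/nuttx > gcc13.dis
# a raw diff is noisy because addresses shift; focus on the signal-dispatch
# and spinlock/critical-section functions
diff -u gcc12.dis gcc13.dis | less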

@tmedicci
Contributor

Or the code has a bug (the new toolchain optimizes more aggressively, which means the code is missing something like a volatile, a barrier, or a lock)

Yes, I'd bet on this option. The recommended toolchain version is tied to Espressif's HAL version. Once we update the HAL, we will update the toolchain too and make the necessary improvements.

For now, I think you should use the recommended toolchain, which does not exhibit the bug.
