[core] Another data race for GCS health check manager #49473

Closed
dentiny opened this issue Dec 28, 2024 · 1 comment
Labels: bug, core

dentiny commented Dec 28, 2024

What happened + What you expected to happen

/workspaces/ray (master) $ bazel-bin/gcs_health_check_manager_test
Running main() from gmock_main.cc
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from GcsHealthCheckManagerTest
[ RUN      ] GcsHealthCheckManagerTest.TestBasic
[2024-12-28 09:58:55,830 I 100315 100315] gcs_health_check_manager_test.cc:85: Get port 58765
[2024-12-28 09:58:55,866 W 100315 100315] grpc_server.cc:111: No service is found when start grpc server bb3c82b9989704f3a0dbaac4d4ab8fbb2e9b8b592156b5c38bea1b63
[2024-12-28 09:58:55,871 I 100315 100315] grpc_server.cc:134: bb3c82b9989704f3a0dbaac4d4ab8fbb2e9b8b592156b5c38bea1b63 server started, listening on port 58765.
[2024-12-28 09:58:56,000 W 100315 100315] gcs_health_check_manager.cc:154: Health check failed for node bb3c82b9989704f3a0dbaac4d4ab8fbb2e9b8b592156b5c38bea1b63, remaining checks 4, status 4, response status 0, status message Deadline Exceeded, status details 
==================
WARNING: ThreadSanitizer: data race (pid=100315)
  Write of size 8 at 0x7b4000008020 by main thread:
    #0 free ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:707 (libtsan.so.0+0x35f25)
    #1 gpr_free external/com_github_grpc_grpc/src/core/lib/gpr/alloc.cc:49 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgpr.so+0x10a29)
    #2 gpr_free_aligned external/com_github_grpc_grpc/src/core/lib/gpr/alloc.cc:70 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgpr.so+0x10ba7)
    #3 grpc_core::Arena::~Arena() external/com_github_grpc_grpc/src/core/lib/resource_quota/arena.cc:54 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Ssrc_Score_Slibarena.so+0x37cf)
    #4 grpc_core::Arena::Destroy() external/com_github_grpc_grpc/src/core/lib/resource_quota/arena.cc:93 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Ssrc_Score_Slibarena.so+0x3a94)
    #5 grpc_core::Call::DeleteThis() external/com_github_grpc_grpc/src/core/lib/surface/call.cc:413 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Ubase.so+0x5ca175)
    #6 grpc_core::FilterStackCall::ReleaseCall(void*, absl::lts_20230802::Status) external/com_github_grpc_grpc/src/core/lib/surface/call.cc:935 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Ubase.so+0x5cb808)
    #7 exec_ctx_run external/com_github_grpc_grpc/src/core/lib/iomgr/exec_ctx.cc:45 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibexec_Uctx.so+0xaca5)
    #8 grpc_core::ExecCtx::Flush() external/com_github_grpc_grpc/src/core/lib/iomgr/exec_ctx.cc:72 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibexec_Uctx.so+0xae6e)
    #9 grpc_core::ExecCtx::~ExecCtx() external/com_github_grpc_grpc/src/core/lib/iomgr/exec_ctx.h:117 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc++_Ubinder.so+0x2525bc)
    #10 grpc_core::FilterStackCall::ExternalUnref() external/com_github_grpc_grpc/src/core/lib/surface/call.cc:967 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Ubase.so+0x5cbc7d)
    #11 grpc_call_unref external/com_github_grpc_grpc/src/core/lib/surface/call.cc:3521 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Ubase.so+0x5d8fec)
    #12 grpc::ClientContext::~ClientContext() external/com_github_grpc_grpc/src/cpp/client/client_context.cc:80 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc++_Ubase.so+0x1de745)
    #13 ray::gcs::GcsHealthCheckManager::HealthCheckContext::StartHealthCheck() src/ray/gcs/gcs_server/gcs_health_check_manager.cc:128 (liblibgcs_Userver_Ulib.so+0x11ad02c)
    #14 operator()<boost::system::error_code> src/ray/gcs/gcs_server/gcs_health_check_manager.cc:172 (liblibgcs_Userver_Ulib.so+0x11aed31)
    #15 operator() external/boost/boost/asio/detail/bind_handler.hpp:171 (liblibgcs_Userver_Ulib.so+0x11b18b0)
    #16 asio_handler_invoke<boost::asio::detail::binder1<ray::gcs::GcsHealthCheckManager::HealthCheckContext::StartHealthCheck()::<lambda(grpc::Status)>::<lambda()>::<lambda(auto:24)>, boost::system::error_code> > external/boost/boost/asio/handler_invoke_hook.hpp:88 (liblibgcs_Userver_Ulib.so+0x11b15ad)
    #17 invoke<boost::asio::detail::binder1<ray::gcs::GcsHealthCheckManager::HealthCheckContext::StartHealthCheck()::<lambda(grpc::Status)>::<lambda()>::<lambda(auto:24)>, boost::system::error_code>, ray::gcs::GcsHealthCheckManager::HealthCheckContext::StartHealthCheck()::<lambda(grpc::Status)>::<lambda()>::<lambda(auto:24)> > external/boost/boost/asio/detail/handler_invoke_helpers.hpp:54 (liblibgcs_Userver_Ulib.so+0x11b13b1)
    #18 complete<boost::asio::detail::binder1<ray::gcs::GcsHealthCheckManager::HealthCheckContext::StartHealthCheck()::<lambda(grpc::Status)>::<lambda()>::<lambda(auto:24)>, boost::system::error_code> > external/boost/boost/asio/detail/handler_work.hpp:520 (liblibgcs_Userver_Ulib.so+0x11b1190)
    #19 do_complete external/boost/boost/asio/detail/wait_handler.hpp:76 (liblibgcs_Userver_Ulib.so+0x11b0c52)
    #20 boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long) external/boost/boost/asio/detail/scheduler_operation.hpp:40 (libexternal_Sboost_Slibasio.so+0x888ba)
    #21 boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) external/boost/boost/asio/detail/impl/scheduler.ipp:492 (libexternal_Sboost_Slibasio.so+0x71241)
    #22 boost::asio::detail::scheduler::run_one(boost::system::error_code&) external/boost/boost/asio/detail/impl/scheduler.ipp:231 (libexternal_Sboost_Slibasio.so+0x6ff5a)
    #23 boost::asio::io_context::run_one() external/boost/boost/asio/impl/io_context.ipp:78 (libexternal_Sboost_Slibasio.so+0x63675)
    #24 GcsHealthCheckManagerTest::Run(unsigned long) src/ray/gcs/gcs_server/test/gcs_health_check_manager_test.cc:128 (gcs_health_check_manager_test+0x6eac8)
    #25 GcsHealthCheckManagerTest_TestBasic_Test::TestBody() src/ray/gcs/gcs_server/test/gcs_health_check_manager_test.cc:153 (gcs_health_check_manager_test+0x5ef74)
    #26 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) external/com_google_googletest/googletest/src/gtest.cc:2612 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x14b085)
    #27 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) external/com_google_googletest/googletest/src/gtest.cc:2648 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x14202d)
    #28 testing::Test::Run() external/com_google_googletest/googletest/src/gtest.cc:2687 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x11a7a1)
    #29 testing::TestInfo::Run() external/com_google_googletest/googletest/src/gtest.cc:2836 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x11b545)
    #30 testing::TestSuite::Run() external/com_google_googletest/googletest/src/gtest.cc:3015 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x11c301)
    #31 testing::internal::UnitTestImpl::RunAllTests() external/com_google_googletest/googletest/src/gtest.cc:5920 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x12e5c4)
    #32 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) external/com_google_googletest/googletest/src/gtest.cc:2612 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x14caf1)
    #33 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) external/com_google_googletest/googletest/src/gtest.cc:2648 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x143c61)
    #34 testing::UnitTest::Run() external/com_google_googletest/googletest/src/gtest.cc:5484 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x12c539)
    #35 RUN_ALL_TESTS() external/com_google_googletest/googletest/include/gtest/gtest.h:2317 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest_Umain.so+0xe0b)
    #36 main external/com_google_googletest/googlemock/src/gmock_main.cc:71 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest_Umain.so+0xd06)

  Previous write of size 8 at 0x7b4000008020 by thread T1:
    #0 grpc_core::RefCounted<grpc_core::RetryFilter::LegacyCallData::CallAttempt::BatchData, grpc_core::PolymorphicRefCount, grpc_core::UnrefCallDtor>::~RefCounted() external/com_github_grpc_grpc/src/core/lib/gprpp/ref_counted.h:278 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Uclient_Uchannel.so+0x4cb3cc)
    #1 grpc_core::RetryFilter::LegacyCallData::CallAttempt::BatchData::~BatchData() external/com_github_grpc_grpc/src/core/ext/filters/client_channel/retry_filter_legacy_call_data.cc:764 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Uclient_Uchannel.so+0x4c0624)
    #2 void grpc_core::UnrefCallDtor::operator()<grpc_core::RetryFilter::LegacyCallData::CallAttempt::BatchData>(grpc_core::RetryFilter::LegacyCallData::CallAttempt::BatchData*) external/com_github_grpc_grpc/src/core/lib/gprpp/ref_counted.h:241 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Uclient_Uchannel.so+0x4d1271)
    #3 grpc_core::RefCounted<grpc_core::RetryFilter::LegacyCallData::CallAttempt::BatchData, grpc_core::PolymorphicRefCount, grpc_core::UnrefCallDtor>::Unref() external/com_github_grpc_grpc/src/core/lib/gprpp/ref_counted.h:297 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Uclient_Uchannel.so+0x4cf44e)
    #4 grpc_core::RefCountedPtr<grpc_core::RetryFilter::LegacyCallData::CallAttempt::BatchData>::~RefCountedPtr() external/com_github_grpc_grpc/src/core/lib/gprpp/ref_counted_ptr.h:103 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Uclient_Uchannel.so+0x4cd15e)
    #5 grpc_core::RetryFilter::LegacyCallData::CallAttempt::BatchData::OnComplete(void*, absl::lts_20230802::Status) external/com_github_grpc_grpc/src/core/ext/filters/client_channel/retry_filter_legacy_call_data.cc:1348 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Uclient_Uchannel.so+0x4c3cfe)
    #6 exec_ctx_run external/com_github_grpc_grpc/src/core/lib/iomgr/exec_ctx.cc:45 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibexec_Uctx.so+0xaca5)
    #7 grpc_core::ExecCtx::Flush() external/com_github_grpc_grpc/src/core/lib/iomgr/exec_ctx.cc:72 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibexec_Uctx.so+0xae6e)
    #8 grpc_core::Executor::RunClosures(char const*, grpc_closure_list) external/com_github_grpc_grpc/src/core/lib/iomgr/executor.cc:130 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibexec_Uctx.so+0xc9ab)
    #9 grpc_core::Executor::ThreadMain(void*) external/com_github_grpc_grpc/src/core/lib/iomgr/executor.cc:246 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibexec_Uctx.so+0xd7bf)
    #10 operator() external/com_github_grpc_grpc/src/core/lib/gprpp/posix/thd.cc:145 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgpr.so+0x1c21a)
    #11 _FUN external/com_github_grpc_grpc/src/core/lib/gprpp/posix/thd.cc:150 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgpr.so+0x1c298)

  Thread T1 'default-executo' (tid=100340, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:962 (libtsan.so.0+0x5ea79)
    #1 ThreadInternalsPosix external/com_github_grpc_grpc/src/core/lib/gprpp/posix/thd.cc:113 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgpr.so+0x1c614)
    #2 grpc_core::Thread::Thread(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&) external/com_github_grpc_grpc/src/core/lib/gprpp/posix/thd.cc:199 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgpr.so+0x1ca8b)
    #3 grpc_core::Executor::SetThreading(bool) external/com_github_grpc_grpc/src/core/lib/iomgr/executor.cc:164 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibexec_Uctx.so+0xce56)
    #4 grpc_core::Executor::Init() external/com_github_grpc_grpc/src/core/lib/iomgr/executor.cc:97 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibexec_Uctx.so+0xc7b5)
    #5 grpc_core::Executor::InitAll() external/com_github_grpc_grpc/src/core/lib/iomgr/executor.cc:384 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibexec_Uctx.so+0xe096)
    #6 grpc_iomgr_init() external/com_github_grpc_grpc/src/core/lib/iomgr/iomgr.cc:58 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc_Ubase.so+0x580359)
    #7 grpc_init external/com_github_grpc_grpc/src/core/lib/surface/init.cc:144 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc.so+0x87f8)
    #8 grpc::internal::GrpcLibrary::GrpcLibrary(bool) external/com_github_grpc_grpc/include/grpcpp/impl/grpc_library.h:36 (liblibgcs_Userver_Ulib.so+0x1325efb)
    #9 grpc::ChannelCredentials::ChannelCredentials() external/com_github_grpc_grpc/include/grpcpp/security/credentials.h:69 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc++_Ubase.so+0x269555)
    #10 InsecureChannelCredentialsImpl external/com_github_grpc_grpc/src/cpp/client/insecure_credentials.cc:35 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc++_Ubase.so+0x268ef4)
    #11 grpc::InsecureChannelCredentials() external/com_github_grpc_grpc/src/cpp/client/insecure_credentials.cc:69 (libexternal_Scom_Ugithub_Ugrpc_Ugrpc_Slibgrpc++_Ubase.so+0x268fa3)
    #12 GcsHealthCheckManagerTest::AddServer(bool) src/ray/gcs/gcs_server/test/gcs_health_check_manager_test.cc:89 (gcs_health_check_manager_test+0x6e2f1)
    #13 GcsHealthCheckManagerTest_TestBasic_Test::TestBody() src/ray/gcs/gcs_server/test/gcs_health_check_manager_test.cc:145 (gcs_health_check_manager_test+0x5ed28)
    #14 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) external/com_google_googletest/googletest/src/gtest.cc:2612 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x14b085)
    #15 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) external/com_google_googletest/googletest/src/gtest.cc:2648 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x14202d)
    #16 testing::Test::Run() external/com_google_googletest/googletest/src/gtest.cc:2687 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x11a7a1)
    #17 testing::TestInfo::Run() external/com_google_googletest/googletest/src/gtest.cc:2836 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x11b545)
    #18 testing::TestSuite::Run() external/com_google_googletest/googletest/src/gtest.cc:3015 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x11c301)
    #19 testing::internal::UnitTestImpl::RunAllTests() external/com_google_googletest/googletest/src/gtest.cc:5920 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x12e5c4)
    #20 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) external/com_google_googletest/googletest/src/gtest.cc:2612 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x14caf1)
    #21 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) external/com_google_googletest/googletest/src/gtest.cc:2648 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x143c61)
    #22 testing::UnitTest::Run() external/com_google_googletest/googletest/src/gtest.cc:5484 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so+0x12c539)
    #23 RUN_ALL_TESTS() external/com_google_googletest/googletest/include/gtest/gtest.h:2317 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest_Umain.so+0xe0b)
    #24 main external/com_google_googletest/googlemock/src/gmock_main.cc:71 (libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest_Umain.so+0xd06)

SUMMARY: ThreadSanitizer: data race external/com_github_grpc_grpc/src/core/lib/gpr/alloc.cc:49 in gpr_free
==================

Versions / Dependencies

N/A, commit d9f69fd

Reproduction script

N/A

Issue Severity

Low: It annoys or frustrates me.

dentiny added the bug, triage, and core labels and removed the triage label Dec 28, 2024
dentiny self-assigned this Dec 28, 2024

dentiny commented Dec 28, 2024

Note: this is a different bug from #49469.

rynewang pushed a commit that referenced this issue Dec 31, 2024
Resolves issue #49473

From the TSAN stack trace, it's clear that we have a data race on
`grpc::ClientContext`:
- One thread is reconstructing the context with placement new while
another thread is still accessing it from gRPC calls.
- The fix proposed here is to dedicate a gRPC context to each gRPC
call, because with async operations we currently have no guarantee
that two health checks won't interleave.

The same thing happens for the health check response as well.
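
To illustrate the ownership pattern described above, here is a minimal sketch that does not use gRPC at all. `FakeClientContext`, `FakeAsyncStub`, and `HealthChecker` are hypothetical stand-ins invented for this example; they are not Ray or gRPC types, and the actual change in the referenced commit may look different. The point is only that each call allocates its own context and the completion callback keeps it alive, so no call ever reconstructs state that an earlier in-flight call may still be touching.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for grpc::ClientContext: per-call state that must
// not be shared between two calls.
struct FakeClientContext {
  bool used = false;
};

// Hypothetical stand-in for an async stub. Completions are queued and run
// later, so several calls can be in flight at the same time.
class FakeAsyncStub {
 public:
  void Check(FakeClientContext *ctx, std::function<void(bool)> done) {
    assert(!ctx->used && "a context must not be reused across calls");
    ctx->used = true;
    pending_.push_back(std::move(done));
  }

  // Runs every queued completion and returns how many were run.
  int Drain() {
    int completed = 0;
    while (!pending_.empty()) {
      std::function<void(bool)> done = std::move(pending_.front());
      pending_.pop_front();
      done(true);
      ++completed;
    }
    return completed;
  }

 private:
  std::deque<std::function<void(bool)>> pending_;
};

// Hypothetical checker: every health check gets a freshly allocated context.
class HealthChecker {
 public:
  explicit HealthChecker(FakeAsyncStub &stub) : stub_(stub) {}

  void StartHealthCheck() {
    // Dedicated context for this call only. The callback holds a copy of the
    // shared_ptr, so the context lives until this call's completion has run.
    auto ctx = std::make_shared<FakeClientContext>();
    stub_.Check(ctx.get(), [this, ctx](bool ok) {
      if (ok && ctx->used) {
        ++healthy_;
      }
    });
  }

  int healthy() const { return healthy_; }

 private:
  FakeAsyncStub &stub_;
  int healthy_ = 0;
};
```

With this shape, starting a second health check before the first one completes is harmless, because the two calls never share a context. The same idea would apply to the per-call response object mentioned above.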

---------

Signed-off-by: dentiny <[email protected]>
@dentiny dentiny closed this as completed Dec 31, 2024