Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(kafka): Refactor integration tests #108

Merged
merged 3 commits into from
Dec 24, 2024
Merged

Conversation

john-z-yang
Copy link
Member

@john-z-yang john-z-yang commented Dec 23, 2024

Overview

  1. Separate the unit tests and integration test into separate CI checks
  2. Unify make file and CI so that the CI uses the makefile as much as possible. This way, it's much easier to reproduce CI failure locally.
  3. Destroys the kafka container and volume before starting an integration test, so all integration tests can start with a clean kafka container.
  4. Update the integration test output directory structure from:
.
└── integration_tests/
    └── .test_output/
        └── # Files all go in here

      to:

.
└── integration_tests/
    └── .test_output/
        ├── rebalance/
        │   └── # rebalance output goes here
        └── some_other_test/
            └── # output for the other test goes here

       so that finding integration tests would be easier as we add more
5. Update the makefile recipe for the integration test so that they delete the test output later. This ensures that the output only gets cleaned up after the test returns with 0 exit code.
6. Increase minimum time before a consumer startup and it recieving a shutdown signal from 1 seconds to 4 seconds. For some reason, with a fresh kafka topic created using the python integration test. If a new consumer starts up and shuts down before a partition assignment, it will print a kafka broker error when shutting down. This trips the error output check in the integration test causing it to fail but no data is lost / double processed. I would like to dive deeper to why this is happening and why it only happens when a topic is created in the python integration test, but I would like to unblock us first. Also, in a real prod environment, we never shut down our consumer that fast. Okay I figured it out, kafka has a setting called group.initial.rebalance.delay.ms (doc here), it controls how long to wait to respond to a consumer JoinGroupRequest when a consumer group is created. Before, the unit tests already creates the task-worker consumer group, so when the integration test runs, there are no delays from the broker when the consumers are starting. But now, because we are recreating the kafka container from scratch, the broker waits when the consumers are sending the JoinGroupRequest, and since our integration test can send shutdown signal in 1 second, we get the timeout error because the consumer is sending a LeaveGroupRequest while the broker is still not responding to the JoinGroupRequest from when it first starts, causing the timeout:

librdkafka: REQTMOUT [thrd:GroupCoordinator]: GroupCoordinator/1001: Timed out LeaveGroupRequest in flight (after 5004ms, timeout #0): possibly held back by preceeding blocking JoinGroupRequest with timeout in 297034ms

@john-z-yang john-z-yang force-pushed the john/replication-fac branch 2 times, most recently from 697c857 to acf0387 Compare December 23, 2024 20:58
@john-z-yang john-z-yang changed the title remove replication factor test(kafka): Always recreate consumer topic instead of checking Dec 23, 2024
@john-z-yang john-z-yang force-pushed the john/replication-fac branch 10 times, most recently from 1b172fb to a693814 Compare December 24, 2024 00:33
@john-z-yang john-z-yang changed the title test(kafka): Always recreate consumer topic instead of checking test(kafka): Separate integration tests from unit tests Dec 24, 2024
@john-z-yang john-z-yang marked this pull request as ready for review December 24, 2024 06:42
@john-z-yang john-z-yang requested a review from a team as a code owner December 24, 2024 06:42
@john-z-yang john-z-yang changed the title test(kafka): Separate integration tests from unit tests test(kafka): Refactor integration tests Dec 24, 2024
Copy link
Member

@enochtangg enochtangg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find and thanks for cleaning all this stuff up. Just some small questions.

name: Rebalance integration test
runs-on: ubuntu-latest

steps:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do devservices up again in this separate job?

Copy link
Member Author

@john-z-yang john-z-yang Dec 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't because the makefile recipe for rebalance has build and reset-kafka as prerequisites.

test-rebalance: build reset-kafka

So when we do make test-rebalance it will automatically do make build and make reset-kafka for us (which does a devservices up & down)

)
update_topic_partitions(topic_name, num_partitions)
# Ensure topic exists and has correct number of partitions
create_topic(topic_name, num_partitions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the topic already exists?

Copy link
Member Author

@john-z-yang john-z-yang Dec 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will print an error message and continue, but usually we run this with make test_rebalance, it always clears the kafka container, because reset-kafka is a prerequisite

@john-z-yang john-z-yang merged commit 8c591c8 into main Dec 24, 2024
7 checks passed
@john-z-yang john-z-yang deleted the john/replication-fac branch December 24, 2024 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants