
refactor integration tests and add metrics coverage #2432

Merged
merged 2 commits into open-telemetry:main from chore/integration-test-2 on Dec 17, 2024

Conversation

@scottgerring (Contributor) commented Dec 15, 2024

Fixes #2401

This is an alternative to #2424, with a more opinionated take on how the integration tests could look. Essentially I am trying to make it into a more normal-looking test suite, so it is easier to extend and its results are easier to reason about.

  1. Decompose the single "mega unit test" in integration_test.rs into discrete unit tests
  2. Pull the otel-collector container out into test_utils.rs and re-use it everywhere
  3. Introduce anyhow to make error handling cleaner and panic output easier to follow when it happens
  4. Removed the #[ignore] from all the integration tests - if we don't want them to run as part of cargo test, we can pass --lib in our CI scripts
  5. Modified the collector image so it outputs a separate log file for each signal
  6. Upgraded testcontainers and added startup waits so that we don't have to sleep waiting for the container to start and potentially run over our startup time budget (see the sketch below)

This saves a fair bit of duplication too, which is nice!
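
To make the shape concrete, here's a minimal sketch of the shared fixture, assuming testcontainers' 0.2x async API - the image tag, ports, and readiness log line are illustrative rather than this PR's exact code:

    use testcontainers::{
        core::{IntoContainerPort, WaitFor},
        runners::AsyncRunner,
        ContainerAsync, GenericImage,
    };

    // Shared fixture: start an otel-collector container and block until it
    // reports readiness, instead of sleeping for a fixed duration.
    pub async fn start_collector() -> anyhow::Result<ContainerAsync<GenericImage>> {
        let container = GenericImage::new("otel/opentelemetry-collector", "latest")
            .with_exposed_port(4317.tcp()) // OTLP gRPC
            .with_exposed_port(4318.tcp()) // OTLP HTTP
            .with_wait_for(WaitFor::message_on_stderr(
                "Everything is ready. Begin running and processing data.",
            ))
            .start()
            .await?;
        Ok(container)
    }

Each test then starts (or shares) this container via test_utils.rs instead of carrying its own copy of the setup.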

In metrics I've added a test per meter and some supporting code to pluck the data out for each meter easily. IMHO this will make it easy to extend and easy to follow, rather than just having an enormous "total world" diff to pick through.
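
As a sketch of what "plucking the data out" can look like - a hypothetical helper, assuming the collector's file exporter writes one JSON object per line and using the standard OTLP JSON field names:

    use anyhow::Result;
    use serde_json::Value;

    // Collect every scopeMetrics entry whose scope (meter) name matches.
    fn metrics_for_meter(collector_output: &str, meter_name: &str) -> Result<Vec<Value>> {
        let mut found = Vec::new();
        for line in collector_output.lines().filter(|l| !l.trim().is_empty()) {
            let root: Value = serde_json::from_str(line)?;
            for rm in root["resourceMetrics"].as_array().into_iter().flatten() {
                for sm in rm["scopeMetrics"].as_array().into_iter().flatten() {
                    if sm["scope"]["name"] == meter_name {
                        found.push(sm.clone());
                    }
                }
            }
        }
        Ok(found)
    }

With that, each per-meter test just names its own meter and asserts against the slice of output it owns.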

I've adapted traces/logs so each has a single normal unit test in its own file, but have not decomposed them or introduced more tests there, yet.


Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@scottgerring force-pushed the chore/integration-test-2 branch from f85c127 to 35034f4 on December 15, 2024 16:42
@scottgerring (Contributor Author) commented Dec 15, 2024

@cijothomas @lalitb let me know what you think about this; I sank a bit of time into it this weekend to try to organise the integration tests so they are a bit more "ordinary".

If we're happy with this, I'll fill in the gaps in the coverage from here (as well as address other comments I've not got to from the other branch)!

@cijothomas (Member) left a comment

I like this refactor! Thanks for working on this.

(I also tested locally and confirmed that https://github.com/open-telemetry/opentelemetry-rust/pull/2431/files is solving the issue it is intending to fix)

Not marking explicit approval as the PR is marked draft. Feel free to make it review ready.

@lalitb (Member) commented Dec 16, 2024

> Removed the #[ignore] from all the integration tests - if we don't want them to run as part of cargo test, we can pass --lib in our CI scripts

The reason for ignoring integration tests by default is that they can take a long time to run, so cargo test should not execute them unless explicitly specified.

@lalitb (Member) commented Dec 16, 2024

On opentelemetry-otlp/tests/integration_test/lcov.info:

nit - Do you need this file? It brings in ~38K lines, which will increase the clone time for the repo.

    -    results: Vec<ResourceMetrics>,
    -    expected: Vec<ResourceMetrics>,
    +    results: Value,
    +    expected: Value,
@lalitb (Member) commented Dec 16, 2024

The earlier code was also testing the conversion of JSON to proto structs, while it seems the current code no longer does this?

Member commented:

The grpc_build.rs adds the annotations for serialization/deserialization between JSON and proto structs, and this gets tested in the current setup.

Member commented:

do we need to test that?

@scottgerring (Contributor Author) commented Dec 16, 2024

See my comment in the big thread at the bottom - there's an issue with the roundtrip serialization of metrics, so I wrote an additional test that shows this, marked it as #[ignore], and raised an issue. I've switched this to serde types so that the integration test for metrics will actually fail if there's a diff; as it stood, differences in the metric value would not be detected.
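
For concreteness, the serde-side comparison can be as simple as comparing serde_json::Value trees - a minimal sketch with a hypothetical timestamp-scrubbing pass so volatile fields don't break equality (illustrative, not the exact MetricsAsserter code):

    use serde_json::Value;

    // Replace volatile timestamp fields so expected/actual JSON can be
    // compared with plain equality.
    fn scrub_timestamps(v: &mut Value) {
        match v {
            Value::Object(map) => {
                for (key, val) in map.iter_mut() {
                    if matches!(key.as_str(), "startTimeUnixNano" | "timeUnixNano") {
                        *val = Value::String("<scrubbed>".into());
                    } else {
                        scrub_timestamps(val);
                    }
                }
            }
            Value::Array(items) => items.iter_mut().for_each(scrub_timestamps),
            _ => {}
        }
    }

    fn assert_metrics_match(mut expected: Value, mut actual: Value) {
        scrub_timestamps(&mut expected);
        scrub_timestamps(&mut actual);
        assert_eq!(expected, actual);
    }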

Member commented:

I am totally fine to just get an ACK from the Collector that it accepted our metrics/logs/traces. That will give instant value (validating a lot of stuff we are manually validating now, like whether shutdown is going to panic or do its job, etc.!)

Validating the actual content - it's important, but I consider it non-blocking for now; it can be added later too.

@scottgerring (Contributor Author) commented:

Just to be clear: we're getting an ACK from the collector, parsing all the results back out from the copy the collector writes to its file outputs, and comparing them to our expectation - so we are now validating that the data that comes out of the collector is what we sent it ✅

The only "thing" is that we're using Serde models to deserialize for that validation, rather than our own proto-derived ones, because of the aforementioned fields-going-missing issue that would make a proto-based deserialization test succeed when data is actually lost.

So - I think this is in a pretty good state 🙏!

@lalitb (Member) commented Dec 16, 2024

> do we need to test that?

Yes, it's important to test from the opentelemetry-proto perspective that we can successfully deserialize the metrics data written by the collector. It is a standalone crate, and it is also used by consumers of the collector's output. The only way to test this reliably is with integration tests. This is not a blocker, but we should bring it back once the serde model is fixed for metrics.

@lalitb (Member) commented Dec 17, 2024

nit - Please add a TODO (with reference to the metrics serde issue), to be fixed eventually.

@lalitb added the integration tests label Dec 16, 2024
@cijothomas (Member) commented:

> The reason for ignoring integration tests by default is that they can take a long time to run, so cargo test should not execute them unless explicitly specified.

If it is executing in parallel to the main CI, and taking less time than the longest CI job (which I think is the Windows one), then I think we can let the integration tests always run.

@lalitb (Member) commented Dec 16, 2024

> If it is executing in parallel to the main CI, and taking less time than the longest CI job (which I think is the Windows one), then I think we can let the integration tests always run.

Yes, it should be fine to keep it enabled in CI. The concern was whether it is fast enough to keep enabled for cargo test from the command line.

@scottgerring force-pushed the chore/integration-test-2 branch from 609a158 to d5dc9ef on December 16, 2024 10:38
@scottgerring force-pushed the chore/integration-test-2 branch 2 times, most recently from 684a70c to 5707bfe on December 16, 2024 11:37
@scottgerring (Contributor Author) commented Dec 16, 2024

Hey both, thanks for the quick turnaround! Pre-emptive sorry-for-the-braindump - I'm rushing this comment out between meetings 😱

cargo test and integration tests

I changed this so that it's possible to skip tests in the crate; the round-trip example I've added that shows an issue with the JSON serialization caught me out, as I marked it skipped and had to work out why it was still running in the CI build. I expect this suite will grow, and the need for compiled-but-skipped tests will grow with it.

It's normal behaviour for cargo test to run the integration suite, and per the docs you can skip it from the CLI by adding --lib. I've modified the CI jobs here so that they maintain the same behaviour - i.e. the integration-test job runs the integration suite, and the main CI job does not. I think it is nice to keep them separate: integration suites often end up a little flakier, and separating them out in GitHub makes it easier for devs to reason about failures.

For reference, the integration suite currently adds 10s on my MBP if I run cargo test from the root without --lib.

Serialization/Deserialization Tests

> The earlier code was also testing the conversion of JSON to proto structs, while it seems the current code no longer does this?

I changed to using Serde types because the roundtrip for metrics is broken (raised #2434) and it's a can of worms - the test I added as #[ignore] demonstrates it, validates the serialization/deserialization, and can be unskipped when it's fixed.

The metric value (at least) disappears on the deserialization side, not the serialization side, which I think is less of a worry for us! I had a bit of a look, but because it is all tied up in the protobuf and serde serialization magic, I didn't want to pull that into this PR. The test I've added catches this issue with the serialization.

Remaining Work

I will add a test for shutdown() flushing metrics, and I think that is it. I should be able to do this tomorrow!
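
Roughly what I have in mind for that test - a sketch only, where init_metrics_provider and the output path are placeholders, and metrics_for_meter is the helper sketched in the PR description:

    use opentelemetry::metrics::MeterProvider as _;

    #[tokio::test]
    async fn metrics_are_flushed_on_shutdown() -> anyhow::Result<()> {
        let provider = init_metrics_provider()?; // placeholder setup helper
        let meter = provider.meter("shutdown_meter");
        meter.u64_counter("my_counter").build().add(1, &[]);

        // shutdown() should flush the pending data point to the collector.
        provider.shutdown()?;

        // Read the collector's per-signal metrics output and assert that the
        // point recorded on "shutdown_meter" made it across.
        let output = std::fs::read_to_string("./actual/metrics.json")?;
        assert!(!metrics_for_meter(&output, "shutdown_meter")?.is_empty());
        Ok(())
    }

(A real version would likely need a short retry loop while the collector flushes its file output.)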

@scottgerring marked this pull request as ready for review December 16, 2024 18:04
@scottgerring requested a review from a team as a code owner December 16, 2024 18:04
@scottgerring force-pushed the chore/integration-test-2 branch from c78170f to 4a59154 on December 16, 2024 18:36
@cijothomas (Member) commented:

> Yes, it should be fine to keep it enabled in CI. The concern was whether it is fast enough to keep enabled for cargo test from the command line.

Got it. Yes, I think it is best to keep this ignored and only triggered from the integration-test CI.
Even if it is just a few seconds, given it needs the port free, I am inclined to keep the existing way.

codecov bot commented Dec 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.7%. Comparing base (eb8d7c6) to head (dba5ff1).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2432     +/-   ##
=======================================
- Coverage   79.4%   76.7%   -2.8%     
=======================================
  Files        122     122             
  Lines      21700   21700             
=======================================
- Hits       17247   16657    -590     
- Misses      4453    5043    +590     


@scottgerring (Contributor Author) commented:
@cijothomas, I think this is in a pretty good state, except that I need to work out why the integration suite fails on CI and not locally 😱 I'll look again tomorrow.

@cijothomas (Member) commented:

> @cijothomas, I think this is in a pretty good state, except that I need to work out why the integration suite fails on CI and not locally 😱 I'll look again tomorrow.

I am curious if we can enable tracing::fmt in the integration tests, and view internal logs? Not required in this PR, just sharing something that'd make our lives easier if things go wrong.

@scottgerring (Contributor Author) commented:

> I am curious if we can enable tracing::fmt in the integration tests, and view internal logs? Not required in this PR, just sharing something that'd make our lives easier if things go wrong.

Good idea! I'll give it a try tomorrow. It certainly doesn't hurt to leave it on for tests.

@scottgerring changed the title from "chore: alternate refactor integration tests" to "chore: refactor integration tests and add metrics coverage" Dec 17, 2024
@scottgerring force-pushed the chore/integration-test-2 branch 2 times, most recently from f8044a2 to 90c2449 on December 17, 2024 10:33
@scottgerring (Contributor Author) commented Dec 17, 2024

@cijothomas, this should be good to merge. I've rebased and squashed everything together to make a nicer merge history too. The outstanding issue yesterday with the integration tests was the "magic sleep" waiting for the collector container to start; I've moved to a newer version of testcontainers so we can wait for the HTTP collector port to start answering instead, which should make things more robust.

There are two outstanding issues:

  • Deserialization of metrics from OTLP output broken #2434
  • HTTP-client exporters have issues that cause test failures. I will raise an issue to resolve this after this PR is merged, and work on it; as it seems likely to be happening in the exporter itself, and I've fairly significantly refactored the integration test code, I'm keen to get this PR merged and then look at this extra issue. For the moment the new test simply skips itself for hyper/reqwest (see the sketch below) until whatever issue lies in there is fixed.
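
The self-skip looks something like this - a hedged sketch, with placeholder feature-flag names for however the crate gates its HTTP clients:

    #[tokio::test]
    async fn smoke_test_http_exporter() -> anyhow::Result<()> {
        // Known failure with the HTTP client exporters - skip at runtime
        // until the follow-up issue is fixed.
        if cfg!(any(feature = "hyper-client", feature = "reqwest-client")) {
            println!("skipping: HTTP-client exporter issue, tracked in a follow-up");
            return Ok(());
        }
        // ... exercise the HTTP exporter against the collector here ...
        Ok(())
    }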

@scottgerring force-pushed the chore/integration-test-2 branch from 90c2449 to 21174e8 on December 17, 2024 10:42
@lalitb (Member) left a comment

Thanks for the refactor. Nicely done.

    -    results: Vec<ResourceMetrics>,
    -    expected: Vec<ResourceMetrics>,
    +    results: Value,
    +    expected: Value,
@lalitb (Member) commented Dec 17, 2024

nit - Please add a TODO (with reference to the metrics serde issue), to be fixed eventually.

@scottgerring (Contributor Author) commented:
hey @lalitb, I've got a test over here, and a link to the issue, for roundtripping the models:

    ///
    /// Validate JSON/Protobuf models roundtrip correctly.
    ///
    /// TODO - this test fails currently. Fields disappear, such as the actual value of a given metric.
    /// This appears to be on the _deserialization_ side.
    /// Issue: https://github.com/open-telemetry/opentelemetry-rust/issues/2434
    ///
    #[tokio::test]
    #[ignore]
    async fn test_roundtrip_example_data() -> Result<()> {
        let metrics_in = include_str!("../expected/metrics/test_u64_counter_meter.json");
        let metrics: MetricsData = serde_json::from_str(metrics_in)?;
        let metrics_out = serde_json::to_string(&metrics)?;
        println!("{:}", metrics_out);

        let metrics_in_json: Value = serde_json::from_str(metrics_in)?;
        let metrics_out_json: Value = serde_json::from_str(&metrics_out)?;

        assert_eq!(metrics_in_json, metrics_out_json);
        Ok(())
    }

I reckon if we keep the MetricsAsserter stuff on Serde, and have separate "roundtrip the models" tests, then we won't accidentally miss cases where an integration test regresses and we don't notice it because of model mapping. What do you think?

@cijothomas (Member) left a comment

LGTM. We can address some of the remaining issues in followups as needed.

    fn init_tracing() {
        INIT_TRACING.call_once(|| {
            let subscriber = FmtSubscriber::builder()
                .with_max_level(tracing::Level::DEBUG)
                .finish();
            tracing::subscriber::set_global_default(subscriber).expect("failed to set subscriber");
        });
    }
Member commented:

DEBUG is giving a ton of noise from the networking libraries in the CI logs... probably okay for now; we can revisit if we find it too noisy.

@cijothomas merged commit 9173ddf into open-telemetry:main Dec 17, 2024
20 of 21 checks passed
@lalitb (Member) commented Dec 17, 2024

> hey @lalitb, I've got a test over here, and a link to the issue, for roundtripping the models: [...]
>
> I reckon if we keep the MetricsAsserter stuff on Serde, and have separate "roundtrip the models" tests, then we won't accidentally miss cases where an integration test regresses and we don't notice it because of model mapping. What do you think?

Yes, agreed - this ensures serialization and deserialization are validated independently.

@scottgerring deleted the chore/integration-test-2 branch December 17, 2024 15:21
@scottgerring (Contributor Author) commented:
Hurrah! Thanks @lalitb @cijothomas .
Opening another issue for the rest of the HTTP clients now.

@cijothomas changed the title from "chore: refactor integration tests and add metrics coverage" to "refactor integration tests and add metrics coverage" Dec 18, 2024
Labels
integration tests

Successfully merging this pull request may close these issues:

Use integration test to cover key OTLP scenarios