Telemetry design #11175

Draft
wants to merge 2 commits into
base: main

JanProvaznik
Member

Fixes #10947

Context

Writeup of proposed telemetry implementation based on experimentation in #11084

Member

JanKrivanek left a comment

Looks good! Thanks!!


### Security

- Providing a method for creating a hook in Framework MSBuild

Member

Suggested change
- Providing a method for creating a hook in Framework MSBuild
- Providing and/or documenting a method for creating a hook in Framework MSBuild

### Security

- Providing a method for creating a hook in Framework MSBuild
- document the security implications of hooking custom telemetry Exporters/Collectors in Framework

Member

Suggested change
- document the security implications of hooking custom telemetry Exporters/Collectors in Framework
- If a custom hooking solution is used, document the security implications of hooking custom telemetry Exporters/Collectors in Framework

Since we plan to use AppDomainManager, we are using an existing solution that is outside of our trust boundaries.

### Data handling

- Implement head [Sampling](https://opentelemetry.io/docs/concepts/sampling/) with the granularity of an MSBuild.exe invocation/VS instance.
- VS Data handles tail sampling in their infrastructure so that storage is not overwhelmed by a large number of build events.

Member

As discussed, we should not prevent ourselves from being able to add the following in future versions:

  • different sampling rates for different namespaces/activities
  • the ability to configure the overall and per-namespace sampling rates from the server side (e.g. storing the server-set values in the .msbuild folder in the user profile when they differ from the defaults; this would obviously take effect only after a delay of about the default sample rate's number of executions)
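
To make the two points above concrete, here is a minimal Python sketch (not the proposed MSBuild implementation) of a head-sampling decision made once per invocation, with per-namespace rates and an optional override file. The namespace names, the `"*"` default key, the JSON file format, and all function and class names are assumptions for illustration only.

```python
import json
import random
from pathlib import Path

# Hypothetical default: one fallback rate for every namespace.
DEFAULT_RATES = {"*": 1 / 26_000}


def load_rates(override_path, defaults=DEFAULT_RATES):
    """Merge per-namespace overrides from a JSON file over the defaults.

    A missing or malformed file leaves the defaults untouched; entries that
    are not numbers in [0, 1] are ignored.
    """
    rates = dict(defaults)
    try:
        overrides = json.loads(Path(override_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return rates
    if not isinstance(overrides, dict):
        return rates
    for namespace, rate in overrides.items():
        is_number = isinstance(rate, (int, float)) and not isinstance(rate, bool)
        if is_number and 0.0 <= rate <= 1.0:
            rates[namespace] = float(rate)
    return rates


def rate_for(namespace, rates):
    """Return the rate of the longest configured dotted prefix, else the "*" rate."""
    parts = namespace.split(".")
    while parts:
        key = ".".join(parts)
        if key in rates:
            return rates[key]
        parts.pop()
    return rates.get("*", 0.0)


class InvocationSampler:
    """Draws one random number per invocation and reuses it for every namespace.

    A rate of 1.0 means "always", 0.0 means "never". Because the draw is shared,
    an invocation sampled at a low rate is also sampled at every higher rate.
    """

    def __init__(self, rates, draw=None):
        self._rates = rates
        self._draw = random.random() if draw is None else draw

    def is_sampled(self, namespace):
        return self._draw < rate_for(namespace, self._rates)
```

Sharing a single draw per invocation keeps the decision at the granularity of one process, so a sampled build reports consistently across namespaces instead of producing partial traces.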


## Looking ahead

- Create a way of using a "HighPrioActivitySource" which would override sampling and initialize Collector in MSBuild.exe scenario/tracerprovider in VS.

Member

More generally: a sample rate per Activity/namespace (higher, even always; or lower, even never)

## Uncertainties

- Configuring tail sampling in VS telemetry server side infrastructure to not overflow them with data.
- How much head sampling.

Member

We can just ballpark some rates, or we can use a little of the statistics behind sample size determination: https://en.wikipedia.org/wiki/Sample_size_determination

E.g. for proportion estimation (of a fairly common occurrence in builds), with a not very strict confidence level (say 95%, which is plenty for us now), a margin of error of 5% (very acceptable for us), and a fairly large population (let's estimate the total number of daily build events at between 10M and 100M, in fact much closer to the upper bound), we would be fine with a sampling rate of 1 in 26,000.

Sample table of sample sizes for a proportion hypothesis: https://www.research-advisors.com/images/subpage/SSTable.jpg

For rarer events (runaway builds, custom tasks, etc.) we'd need to adjust appropriately to capture at least a couple hundred datapoints daily. That should still allow for considerably small sampling rates and hence low impact on the observed builds.
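
The arithmetic behind the estimate above can be reproduced with the standard sample-size formula for a proportion plus the finite population correction. This is a minimal Python sketch; the worst-case proportion of 0.5, the function names, and the rare-event fraction used in the note below are assumptions, not measured values.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Sample size for estimating a proportion, with finite population correction.

    proportion=0.5 is the worst case (largest required sample).
    """
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n0 = z * z * proportion * (1 - proportion) / (margin * margin)
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def one_in(population, needed):
    """Express a required sample count as "1 in N" over the whole population."""
    return population // needed


def rate_for_rare_event(population, event_fraction, target_datapoints):
    """Sampling rate needed to expect `target_datapoints` occurrences of an
    event that affects `event_fraction` of the population, capped at 1.0."""
    expected_events = population * event_fraction
    return min(1.0, target_datapoints / expected_events)
```

For a population of 10M daily build events, `sample_size(10_000_000)` yields 385 samples, which `one_in` turns into roughly 1 in 26,000. For a hypothetical event seen in 1% of 10M builds, `rate_for_rare_event(10_000_000, 0.01, 200)` gives a rate of about 1 in 500 to collect 200 datapoints per day.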

Member

By the way, this might also be a partial answer to some of the open questions about perf below: if we cannot get the overhead low enough for regular executions, but it stays around the human-noticeable threshold (~100 ms according to various UX research), we might just choose to pay that cost in a very small number of cases.

Development

Successfully merging this pull request may close these issues.

Propose/Design the way of referencing/using VS OTel
2 participants