
perf: Avoid unnecessary URL processing while parsing links #13132

Open · wants to merge 1 commit into main
Conversation

@ichard26 (Member) commented on Dec 27, 2024


There are three optimizations in this commit, in descending order of
impact (a rough sketch of all three follows the footnotes):

- If the file URL in the "project detail" response is already absolute,
  then avoid calling urljoin() as it's expensive (mostly because it
  calls urlparse() on both of its URL arguments) and does nothing. While
  it'd be more correct to check whether the file URL has a scheme, we'd
  need to parse the URL, which is what we're trying to avoid in the
  first place. Anyway, by simply checking whether the URL starts with
  http[s]://, we can avoid slow urljoin() calls for PyPI responses.

- Replace urllib.parse.urlparse() with urllib.parse.urlsplit() in
  _ensure_quoted_url(). The two URL parsing functions are equivalent for
  our needs[^1]. However, urlsplit() is faster, and we get better
  utilization of its internal cache by calling it directly[^2].

- Calculate the Link.path property in advance, as it's very hot.

[^1]: we don't care about URL parameters AFAIK (which are different from
  the query component!)

[^2]: urlparse() calls urlsplit() internally, but it passes the authority
  parameter (unlike any of our calls), so it bypasses the cache.
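For illustration only, here is a minimal sketch of all three changes. The helper name `_absolutize_file_url` and the exact quoting details are hypothetical simplifications, not pip's actual implementation; `_ensure_quoted_url` and `Link.path` are the names from the PR:

```python
from urllib.parse import quote, urljoin, urlsplit, urlunsplit


def _absolutize_file_url(file_url: str, page_url: str) -> str:
    """Hypothetical helper showing the urljoin() fast path."""
    # PyPI's "project detail" responses already contain absolute URLs,
    # so urljoin() would parse both arguments only to return file_url
    # unchanged. A cheap prefix check skips all of that work.
    if file_url.startswith(("https://", "http://")):
        return file_url
    return urljoin(page_url, file_url)


def _ensure_quoted_url(url: str) -> str:
    # urlsplit() skips the URL-parameter handling that urlparse()
    # layers on top, and calling it directly keeps its internal
    # cache warm.
    parts = urlsplit(url)
    # Quoting details simplified; pip quotes the path more carefully.
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/%")))


class Link:
    def __init__(self, url: str) -> None:
        self.url = _ensure_quoted_url(url)
        # Precompute the hot .path value once at construction instead
        # of re-parsing the URL on every property access.
        self._path = urlsplit(self.url).path

    @property
    def path(self) -> str:
        return self._path
```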
@ichard26 added the `type: performance` (Commands take too long to run) label on Dec 27, 2024
@ichard26 (Member, Author) commented on Dec 27, 2024

As an example of where this matters: with #13128 already applied, this saves about 1600 ms while collecting and resolving a list of homeassistant dependencies.

Command: `python -m cProfile -o profile2.pstats -m pip install -r temp/homeassistant/requirements.txt --dry-run`
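(For anyone wanting to inspect the dump themselves, a quick hypothetical snippet using the stdlib `pstats` module, not part of the PR:)

```python
import pstats

# Load the cProfile dump written by the command above.
stats = pstats.Stats("profile2.pstats")
# Sort by cumulative time and show only the Link-related hot spots.
stats.sort_stats("cumulative").print_stats("Link")
```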

Profile (before): [profiler screenshot]

Profile (after): [profiler screenshot]

| Method | Before | After |
| --- | --- | --- |
| `Link.from_json` | 3310 ms | 1730 ms |
| `LinkEvaluator.evaluate_link` | 1040 ms | 990 ms |

And while this depends on network performance (so please look at the elapsed time, not the percentages, which may be off), the entire command takes ~16-18 seconds, so the savings are significant.
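(A hypothetical micro-benchmark, not from the PR, for sanity-checking the fast path locally; the example URLs are made up:)

```python
import timeit

setup = (
    "from urllib.parse import urljoin\n"
    "base = 'https://pypi.org/simple/pip/'\n"
    "url = 'https://files.pythonhosted.org/packages/py3/p/pip/"
    "pip-24.3.1-py3-none-any.whl'\n"
)
# Unconditional urljoin(): parses both URLs only to return `url` as-is.
print(timeit.timeit("urljoin(base, url)", setup=setup))
# Fast path: a startswith() check short-circuits the absolute-URL case.
print(timeit.timeit(
    "url if url.startswith(('https://', 'http://')) else urljoin(base, url)",
    setup=setup,
))
```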
