
perf: Avoid unnecessary URL processing while parsing links #13132

Open · wants to merge 1 commit into main
Conversation

@ichard26 (Member) commented on Dec 27, 2024


There are three optimizations in this commit, in descending order of
impact (a rough sketch of all three follows the footnotes):

- If the file URL in the "project detail" response is already absolute,
  then avoid calling urljoin() as it's expensive (mostly because it
  calls urlparse() on both of its URL arguments) and does nothing. While
  it'd be more correct to check whether the file URL has a scheme, we'd
  need to parse the URL, which is what we're trying to avoid in the
  first place. Anyway, by simply checking whether the URL starts with
  http[s]://, we can avoid slow urljoin() calls for PyPI responses.

- Replace urllib.parse.urlparse() with urllib.parse.urlsplit() in
  _ensure_quoted_url(). The two URL parsing functions are equivalent for
  our needs[^1]. However, urlsplit() is faster, and we get better
  utilization of its internal cache by calling it directly[^2].

- Calculate the Link.path property in advance, as it's very hot.

[^1]: we don't care about URL parameters AFAIK (which are different from
  the query component!)

[^2]: urlparse() calls urlsplit() internally, but it passes the authority
  parameter (unlike any of our calls), so it bypasses the cache.
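For illustration only, here is a minimal sketch of all three changes. The helper name `_absolutize_file_url` and the exact quoting details are hypothetical simplifications, not pip's actual implementation; `_ensure_quoted_url` and `Link.path` are the names from the PR:

```python
from urllib.parse import quote, urljoin, urlsplit, urlunsplit


def _absolutize_file_url(file_url: str, page_url: str) -> str:
    """Hypothetical helper showing the urljoin() fast path."""
    # PyPI's "project detail" responses already contain absolute URLs,
    # so urljoin() would parse both arguments only to return file_url
    # unchanged. A cheap prefix check skips all of that work.
    if file_url.startswith(("https://", "http://")):
        return file_url
    return urljoin(page_url, file_url)


def _ensure_quoted_url(url: str) -> str:
    # urlsplit() skips the URL-parameter handling that urlparse()
    # layers on top, and calling it directly keeps its internal
    # cache warm.
    parts = urlsplit(url)
    # Quoting details simplified; pip quotes the path more carefully.
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/%")))


class Link:
    def __init__(self, url: str) -> None:
        self.url = _ensure_quoted_url(url)
        # Precompute the hot .path value once at construction instead
        # of re-parsing the URL on every property access.
        self._path = urlsplit(self.url).path

    @property
    def path(self) -> str:
        return self._path
```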
@ichard26 added the `type: performance` (Commands take too long to run) label on Dec 27, 2024
@ichard26 (Member, Author) commented on Dec 27, 2024

As an example of where this matters: with #13128 already applied, this saves about 1600 ms while collecting and resolving a list of homeassistant dependencies.

Command: `python -m cProfile -o profile2.pstats -m pip install -r temp/homeassistant/requirements.txt --dry-run`
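(For anyone wanting to inspect the dump themselves, a quick hypothetical snippet using the stdlib `pstats` module, not part of the PR:)

```python
import pstats

# Load the cProfile dump written by the command above.
stats = pstats.Stats("profile2.pstats")
# Sort by cumulative time and show only the Link-related hot spots.
stats.sort_stats("cumulative").print_stats("Link")
```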

Profile (before): [profiler screenshot]

Profile (after): [profiler screenshot]

| Method | Before | After |
| --- | --- | --- |
| `Link.from_json` | 3310 ms | 1730 ms |
| `LinkEvaluator.evaluate_link` | 1040 ms | 990 ms |

And while this depends on network performance (so please look at the elapsed time, not the percentages, which may be off), the entire command takes ~16-18 seconds, so the savings are significant.
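(A hypothetical micro-benchmark, not from the PR, for sanity-checking the fast path locally; the example URLs are made up:)

```python
import timeit

setup = (
    "from urllib.parse import urljoin\n"
    "base = 'https://pypi.org/simple/pip/'\n"
    "url = 'https://files.pythonhosted.org/packages/py3/p/pip/"
    "pip-24.3.1-py3-none-any.whl'\n"
)
# Unconditional urljoin(): parses both URLs only to return `url` as-is.
print(timeit.timeit("urljoin(base, url)", setup=setup))
# Fast path: a startswith() check short-circuits the absolute-URL case.
print(timeit.timeit(
    "url if url.startswith(('https://', 'http://')) else urljoin(base, url)",
    setup=setup,
))
```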
