Reorder requirements file decoding #12795

matthewhughes934 · 2024-06-25T17:14:46Z

This changes the decoding process to be more in line with what was
previously documented. The new process is outlined in the updated docs.

The auto_decode function was removed and all decoding logic moved to
the pip._internal.req.req_file module because:

* This function was only ever used to decode requirements files
* It was never really a generic 'util' function; it was always tied to
  the idiosyncrasies of decoding requirements files.
* The module lived under _internal, so I felt comfortable removing it

A warning was added when we do fall back to using the locale-defined
encoding, to encourage users to move to an explicit encoding definition
via a coding style comment.

This fixes two existing bugs. Firstly, when:

* a requirements file is encoded as UTF-8, and
* some bytes in the file are incompatible with the system locale

Previously, assuming no BOM or PEP-263 style comment, we would default
to using the encoding from the system locale, which would then fail (see
issue #12771)

Secondly, when decoding a file starting with a UTF-32 little endian Byte
Order Mark. Previously this would always fail since
codecs.BOM_UTF32_LE is codecs.BOM_UTF16_LE followed by two null
bytes, and because of the ordering of the list of BOMs the UTF-16
case would be run first and match the file prefix, so we would
incorrectly deduce that the file was UTF-16 little endian encoded. I
can't imagine this is a popular encoding for a requirements file.
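
The prefix clash can be reproduced with the standard `codecs` constants alone. The sketch below is not pip's code: `detect_bom` and the two BOM lists are hypothetical names, used only to illustrate why the longer UTF-32 mark has to be checked before the UTF-16 one.

```python
import codecs

# Hypothetical orderings, for illustration only (not pip's actual list).
UTF16_FIRST = [
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
]
UTF32_FIRST = list(reversed(UTF16_FIRST))


def detect_bom(data, boms):
    """Return the encoding of the first BOM in `boms` that prefixes `data`."""
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None


# The UTF-32 LE mark is the UTF-16 LE mark plus two null bytes.
assert codecs.BOM_UTF32_LE == codecs.BOM_UTF16_LE + b"\x00\x00"

raw = codecs.BOM_UTF32_LE + "requests\n".encode("utf-32-le")
wrong = detect_bom(raw, UTF16_FIRST)  # matches the shorter UTF-16 prefix
right = detect_bom(raw, UTF32_FIRST)  # the longer mark is tried first
```

With the UTF-16 entry first, the UTF-32 file is misidentified; trying the longer mark first avoids that.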

Fixes: #12771

news/9a3e9584-3fd4-4840-916b-414c164f9c28.trivial.rst

src/pip/_internal/req/req_file.py

uranusjr · 2024-08-22T06:37:43Z

src/pip/_internal/req/req_file.py

+            exc.encoding,
+            fallback_encoding,
+        )
+        content = raw_content.decode(fallback_encoding)


It may be a good idea to use errors="backslashreplace" here. Most of the time, the offending bytes would just be a part of a comment anyway and would not make a difference.

> It may be a good idea to use errors="backslashreplace" here. Most of the time, the offending bytes would just be a part of a comment anyway and would not make a difference.

I've been hesitating with this a bit. Specifically, I'm wondering if this could be abused for nefarious purposes, where the contents of the file you 'see' (not well defined, since this is the case where the data won't fully decode) aren't the same contents that pip will process. Though I'm having a hard time finding a vulnerable use-case (something like injecting an extra element or adding a . to a domain name in a requirement URL).

Generally, we're trying to enforce compliance with the documentation, so I don't think we should introduce another hack to support technically mal-encoded requirement files.
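
For reference, a minimal sketch of what the suggested error handler does in standard Python (the keyword argument is `errors`). The byte string is a made-up example, not taken from a real requirements file.

```python
# A Latin-1 encoded comment line; 0xE9 is not valid UTF-8 on its own.
raw = b"# caf\xe9\nrequests\n"

try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With backslashreplace the bad byte is kept as a visible escape sequence
# instead of raising, so the rest of the file still decodes.
lenient = raw.decode("utf-8", errors="backslashreplace")
```

The decoded text then differs from what the author of the file intended, which is the concern raised above.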

sbidoul · 2024-10-13T11:14:53Z

Hmm, since the documentation says it's utf-8 unless there is a PEP-263 style comment, shouldn't we rather decode as utf8 if there is no such comment, and if that fails, fall back to the current locale.getpreferredencoding(False) or sys.getdefaultencoding() with a deprecation warning recommending to add the encoding comment?

That way we have a (more or less) non-breaking path to compliance with the docs?

Also, I'd put all that in auto_decode, with a docstring comment that the function is meant for requirements.txt decoding as per the docs.
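
A rough sketch of that order, assuming the PEP 263 comment check has already happened and found nothing. `decode_with_fallback` and its `source` parameter are hypothetical names, and the warning text is invented for illustration; this is not the code from the PR.

```python
import locale
import logging
import sys

logger = logging.getLogger(__name__)


def decode_with_fallback(raw, source="<requirements file>"):
    """Decode as UTF-8, falling back to the locale encoding with a warning."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        fallback = locale.getpreferredencoding(False) or sys.getdefaultencoding()
        logger.warning(
            "%s is not valid UTF-8; falling back to %s. "
            "Consider adding a '# -*- coding: <encoding name> -*-' comment.",
            source,
            fallback,
        )
        # May still raise UnicodeDecodeError if the locale encoding
        # cannot decode the bytes either.
        return raw.decode(fallback)
```

Files that are valid UTF-8 never reach the fallback, so only files that already depended on the locale encoding would see the warning.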

matthewhughes934 · 2024-10-14T18:44:33Z

> Hmm, since the documentation says it's utf-8 unless there is a PEP-263 style comment, shouldn't we rather decode as utf8 if there is no such comment, and if that fails, fall back to the current locale.getpreferredencoding(False) or sys.getdefaultencoding() with a deprecation warning recommending to add the encoding comment?
>
> That way we have a (more or less) non-breaking path to compliance with the docs?
>
> Also, I'd put all that in auto_decode, with a docstring comment that the function is meant for requirements.txt decoding as per the docs.

This sounds reasonable, though I think this change would need to be made in auto_decode rather than where I made it. But now I am wondering if auto_decode needs to live where it does: it's only ever used for decoding requirements files, so maybe it can just live in req_file.

sbidoul · 2024-10-20T16:59:41Z

Sounds good. I've removed from the 24.3 milestone. Feel free to ping me when you get back to this.

matthewhughes934 · 2024-10-22T20:11:30Z

> Sounds good. I've removed from the 24.3 milestone. Feel free to ping me when you get back to this.

👍 I've updated the change and title+description. It was basically a re-do so I just stomped my previous commits.

Per the description, I found and fixed another bug while testing this: requirements files starting with a UTF-32 LE BOM would always be decoded as UTF-16 LE.

ichard26

Looks great so far! I love your tests. They're extensive and so easy to read! Also good catch with the UTF-32 LE vs UTF-16 LE BOM bug.

ichard26 · 2024-12-26T00:21:59Z

docs/html/reference/requirements-file-format.md

+It is simplest to encode your requirements files with UTF-8.
+The process for decoding requirements files is:
+
+- Check for any Byte Order Mark at the start of the file and if found use
+  the corresponding encoding to decode the file.
+- Check for any {pep}`263` style comment (e.g. `# -*- coding: <encoding name> -*-`)
+  and if found decode with the given encoding.
+- Try and decode with UTF-8, and if that fails,
+- fallback to trying to decode using the locale defined encoding.


This is the specification for requirement files, so we should be tighter.

The default encoding for requirement files is `UTF-8` unless a different encoding is specified using a {pep}`263` style comment (e.g. `# -*- coding: <encoding name> -*-`).

```{warning}
pip will fallback to the locale defined encoding if `UTF-8` decoding fails. This is a quirk of pip's parser. This behaviour is *deprecated* and should not be relied upon.
```

While we're here, should we expand the specification to formally allow BOMs?

> While we're here, should we expand the specification to formally allow BOMs?

How much do we want to encourage this usage? Do you have an idea of how much they are relied on? Specifically, I'm wondering if we can just say: UTF-8 by default, PEP-263 style comments supported, other bits deprecated, i.e. if someone is relying on locale-specific decoding or a BOM, we suggest they switch to a PEP-263 comment.

I would prefer to not allow BOMs, as @matthewhughes934 says. Ideally, as it's a backward incompatible change, we would initially issue a deprecation warning if a BOM is found, but given that the use of a BOM is undocumented (in the user docs), I'd be fine with simply removing that support if that's easier.

However, PEP 263 states that the UTF8 signature '\xef\xbb\xbf' should signal UTF8, so that one (which is treated as a BOM in auto_decode) should stay.

It's worth noting that UTF16 and UTF32 encodings can't be handled with a PEP 263 style comment, as those encodings are not ASCII supersets. But I think it's fine to stop supporting them.
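
A small illustration of why a coding comment cannot be located in UTF-16 data by scanning the raw bytes. The regular expression is a simplified form of the PEP 263 pattern, written here for illustration, and is not the one pip uses.

```python
import re

# Simplified form of the PEP 263 pattern, applied to raw bytes.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

line = "# -*- coding: utf-16 -*-\n"

as_utf8 = line.encode("utf-8")
as_utf16 = line.encode("utf-16-le")

match_utf8 = CODING_RE.search(as_utf8)
# In UTF-16 every ASCII character is followed by a null byte, so the
# byte sequence b"coding" never appears contiguously.
match_utf16 = CODING_RE.search(as_utf16)
```

The comment is found in the UTF-8 bytes but not in the UTF-16 bytes, so for those encodings a BOM is the only in-file signal available.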

> It's worth noting that UTF16 and UTF32 encodings can't be handled with a PEP 263 style comment, as those encodings are not ASCII supersets. But I think it's fine to stop supporting them.

I'm not following this issue closely, but in case it's helpful, I'll note that uv initially only supported UTF8 but then found they had user issues not supporting UTF16: astral-sh/uv#2283

I think because redirecting from stdout to a file on Windows would end up with a UTF16 file, e.g. pip freeze > requirements.txt.
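
To make the scenario concrete, here is a sketch that simulates such a file in memory rather than running a real shell redirect. That the shell writes UTF-16 with a BOM is an assumption taken from this thread, and the requirement line is a made-up example.

```python
import codecs

# Simulate a file written as UTF-16 with a BOM (assumed shell behaviour).
text = "requests==2.32.3\n"
raw = text.encode("utf-16")  # the "utf-16" codec prepends a native-order BOM

has_bom = raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
decoded = raw.decode("utf-16")  # the BOM is consumed and selects byte order

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

A UTF-8-only reader rejects these bytes, while BOM detection recovers the original text.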

I'm not really in the mood to break our Windows users over requirements file decoding, honestly. I'd prefer documenting BOMs as supported as I don't like to have implementation-defined behaviour here, but I also recognize that would promote their proliferation so leaving this as implementation-defined behaviour is a fair compromise. Using the old docs text or my suggestion (without any changes) is fine.

Thanks @notatallshaw for letting us know that UTF-16 can be pretty common on Windows.

> I'm not really in the mood to break our Windows users over requirements file decoding, honestly. I'd prefer documenting BOMs as supported as I don't like to have implementation-defined behaviour here, but I also recognize that would promote their proliferation so leaving this as implementation-defined behaviour is a fair compromise. Using the old docs text or my suggestion (without any changes) is fine.

👍 I like your more concise docs, and the warning, so I went with that: 5c49381

> I think because redirecting from stdout to a file on Windows would end up with a UTF16 file

In Windows 11, cmd seems to use UTF8¹, as does Powershell Core. Windows Powershell does use UTF16, though, so I guess there's a proportion of users who will encounter this (I don't have a feel for the relative usage of the 3 shells).

Given this, +1 on retaining the existing support but not documenting it.

Footnotes

1. This may depend on the console codepage; I'm not sure when UTF8 became the default.

👍 I added a code comment explaining why we still support BOMs: f1dbf49

src/pip/_internal/req/req_file.py

ichard26 · 2024-12-26T00:26:26Z

src/pip/_internal/req/req_file.py

-            content = auto_decode(f.read())
+            raw_content = f.read()
    except OSError as exc:
        raise InstallationError(f"Could not open requirements file: {exc}")
+
+    content = _decode_req_file(raw_content, url)


Hmm, if the decoding with the locale encoding also fails, the exception won't be caught. Any objections to calling _decode_req_file within the try-except?

> Hmm, if the decoding with the locale encoding also fails, the exception won't be caught. Any objections to calling _decode_req_file within the try-except?

No objection to adding it there. My original assumption, and the reason I moved it out, was the catching of OSError was only really relevant for reading the content, and not the actual decoding.
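
For context on which exceptions are in play, a small sketch in plain Python. `read_and_decode` is a hypothetical helper, and `RuntimeError` stands in for pip's `InstallationError` so the snippet stays self-contained; it is not the change made in this PR.

```python
# UnicodeDecodeError derives from ValueError, not OSError, so an
# `except OSError` clause alone does not intercept decode failures.
decode_is_value_error = issubclass(UnicodeDecodeError, ValueError)
decode_is_os_error = issubclass(UnicodeDecodeError, OSError)


def read_and_decode(path, encoding="utf-8"):
    """Hypothetical helper: report read and decode failures uniformly."""
    try:
        with open(path, "rb") as f:
            return f.read().decode(encoding)
    except (OSError, UnicodeDecodeError) as exc:
        raise RuntimeError(f"Could not read requirements file: {exc}") from exc
```

Wherever the decode call ends up, the handler would need to name the decode error explicitly for it to be reported the same way as a read error.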

tests/unit/test_req_file.py

ichard26 · 2024-12-26T00:41:09Z

Oh and sorry for taking so long to re-review your PR. It fell off our radar, and then we haven't had the time to review it honestly.

ichard26

We could make the fallback to the locale/system encoding raise a deprecation warning instead of a plain warning, but I'm honestly not very interested in going through a deprecation cycle over this. If other maintainers feel differently, I have no issues with adding one.

Thank you again for your patience! Good work!

matthewhughes934 · 2024-12-28T18:26:58Z

Latest change was just the result of me rebasing on main and squashing all the fixup commits together

matthewhughes934 · 2024-12-28T19:51:50Z

There's a test failure https://github.com/pypa/pip/actions/runs/12528829883/job/34943820232?pr=12795, error:

FAILED tests/functional/test_install.py::test_vcs_url_urlquote_normalization - pip._internal.exceptions.InstallationSubprocessError: bzr checkout --lightweight --quiet http://bazaar.launchpad.net/%7Edjango-wikiapp/django-wikiapp/release-0.1 /tmp/pytest-of-runner/pytest-1/popen-gw1/test_vcs_url_urlquote_normaliz0/cache/release-0.1 exited with 3

It looks like bzr checkout failed with an error? Maybe some intermittent/network issue: it ran ok for all other Python versions+OS (I can't see the button to re-run jobs, I assume due to lack of permissions)

pfmoore · 2024-12-28T20:07:54Z

I hit "rerun failed jobs". Let's see if that fixes it.

matthewhughes934 force-pushed the handle-request-file-decode-failures branch from a3f1cac to aa0f744 Compare June 25, 2024 17:39

psf-chronographer bot added the bot:chronographer:provided label Jun 25, 2024

matthewhughes934 force-pushed the handle-request-file-decode-failures branch from aa0f744 to 7df3500 Compare June 25, 2024 17:48

matthewhughes934 marked this pull request as ready for review June 25, 2024 18:04

matthewhughes934 commented Jun 26, 2024

ichard26 added this to the 24.3 milestone Jul 16, 2024

ichard26 requested changes Aug 14, 2024

matthewhughes934 force-pushed the handle-request-file-decode-failures branch from 7df3500 to b4c3255 Compare August 15, 2024 17:49

uranusjr reviewed Aug 22, 2024

sbidoul removed this from the 24.3 milestone Oct 20, 2024

matthewhughes934 force-pushed the handle-request-file-decode-failures branch from b4c3255 to d0bf895 Compare October 22, 2024 20:10

matthewhughes934 changed the title ~~Handle req file decode failures on locale encoding~~ Reorder requirements file decoding Oct 22, 2024

matthewhughes934 requested review from uranusjr and ichard26 October 22, 2024 20:10

matthewhughes934 mentioned this pull request Oct 22, 2024

UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence #12771

Open

1 task

ichard26 requested changes Dec 26, 2024

matthewhughes934 force-pushed the handle-request-file-decode-failures branch from fd2801e to 60c6bb9 Compare December 26, 2024 13:59

ichard26 approved these changes Dec 27, 2024

ichard26 added this to the 25.0 milestone Dec 28, 2024

matthewhughes934 force-pushed the handle-request-file-decode-failures branch from f1dbf49 to ef986d2 Compare December 28, 2024 18:26
