UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence #12771
Comments
Are you able to share the contents of the requirements file?
It’s probably best to always use UTF-8 here. A PR would be much welcomed.
I guess the underlying issue is: the file looks to be UTF-8 encoded, but you're working in an environment that uses a simplified Chinese locale and so uses GBK for decoding. An alternative solution would be to run Python in UTF-8 mode (https://docs.python.org/3/using/windows.html#utf-8-mode).
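To make the mismatch concrete, here is a minimal, hypothetical illustration (the package name and comment text are made up): bytes written as UTF-8 are not generally valid GBK, so decoding them with the locale codec either fails or produces mojibake.

```python
# Hypothetical illustration: a UTF-8 requirements line read back with the
# GBK codec that a zh_CN Windows locale would pick by default.
data = "requests==2.31.0  # 用于 HTTP 请求\n".encode("utf-8")

print(data.decode("utf-8"))  # decodes cleanly
try:
    # Wrong codec: depending on the exact bytes this either raises
    # UnicodeDecodeError (as in this issue) or silently produces mojibake.
    print(data.decode("gbk"))
except UnicodeDecodeError as exc:
    print(exc)
```

Running pip with UTF-8 mode enabled (`PYTHONUTF8=1`, or `python -X utf8`) sidesteps the locale encoding entirely, as the linked docs describe.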
👍 happy to get a PR up. I'm wondering two things:
Unfortunately, requirements aren't the only things in a requirements file. However, the documentation states that requirements files should be UTF-8 by default, so this seems like a simple bug in `auto_decode`. Of course, even though this is technically a bug fix, it is still potentially a breaking change, so we need to consider how we handle that. (We could fall back to the system encoding if UTF-8 fails, with a deprecation warning; this won't avoid mojibake, but it will catch outright encoding failures.)
Ah, right, I forgot about paths. Falling back with a deprecation warning sounds like the way to go.
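A minimal sketch of the fallback being discussed, assuming a hypothetical helper name (this is not pip's actual implementation): try UTF-8 first, and only fall back to the locale encoding, with a warning, if that fails.

```python
import locale
import logging

logger = logging.getLogger(__name__)


def decode_requirements_bytes(data: bytes) -> str:
    """Hypothetical helper: prefer UTF-8, fall back to the locale encoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        fallback = locale.getpreferredencoding(False)
        logger.warning(
            "Requirements file is not valid UTF-8; falling back to the locale "
            "encoding %r. This fallback is deprecated; please re-encode the "
            "file as UTF-8 or add an explicit coding comment.",
            fallback,
        )
        return data.decode(fallback)
```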
For the case where:

* a requirements file is encoded as UTF-8, and
* some bytes in the file are incompatible with the system locale

in this case, fall back to decoding as UTF-8 as a last resort (rather than crashing on the `UnicodeDecodeError`). This behaviour was added when parsing the requirements file, rather than in `auto_decode`, as it didn't seem to belong in a generic util (though that util looks to only ever be called when parsing requirements files anyway).

Perhaps we should just go straight to UTF-8 without querying the system locale (unless there is a PEP-263 style comment), per the docs[1]:

> Requirements files are utf-8 encoding by default

But to avoid a breaking change, just warn if decoding with the locale fails, then fall back to UTF-8.

[1] https://pip.pypa.io/en/stable/reference/requirements-file-format/#encoding

Fixes: pypa#12771
If this issue has been fixed, can you tell me how? Unfortunately, I have also run into this problem.
No-one has submitted a PR to fix this yet. For now, the simplest solution is likely to remove any non-ASCII content from your requirements file, or to ensure your requirements file is encoded in the locale's preferred encoding, as returned by `locale.getpreferredencoding()`.
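As a rough sketch of that workaround (the file name `requirements.txt` is assumed), you can check the locale's preferred encoding and re-encode the file to match it; note this only works if every character in the file is representable in that encoding.

```python
import locale

# The encoding pip falls back to when there is no BOM or coding comment.
print(locale.getpreferredencoding(False))  # e.g. 'cp936' on zh_CN Windows

# Re-encode requirements.txt from UTF-8 to the locale's preferred encoding.
with open("requirements.txt", encoding="utf-8") as src:
    text = src.read()
with open("requirements.txt", "w", encoding=locale.getpreferredencoding(False)) as dst:
    dst.write(text)
```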
are you able to share the requirements file you were using when you saw the error? |
This changes the decoding process to be more in line with what was previously documented. The new process is outlined in the updated docs.

The `auto_decode` function was removed and all decoding logic moved to the `pip._internal.req.req_file` module because:

* This function was only ever used to decode requirements files
* It was never really a generic 'util' function; it was always tied to the idiosyncrasies of decoding requirements files
* The module lived under `_internal`, so I felt comfortable removing it

A warning was added when we _do_ fall back to using the locale-defined encoding, to encourage users to move to an explicit encoding definition via a coding style comment.

This fixes two existing bugs. Firstly, when:

* a requirements file is encoded as UTF-8, and
* some bytes in the file are incompatible with the system locale

Previously, assuming no BOM or PEP-263 style comment, we would default to using the encoding from the system locale, which would then fail (see issue pypa#12771).

Secondly, when decoding a file starting with a UTF-32 little-endian byte order marker. Previously this would always fail, since `codecs.BOM_UTF32_LE` is `codecs.BOM_UTF16_LE` followed by two null bytes, and because of the ordering of the list of BOMs the UTF-16 case would be run first and match the file prefix, so we would incorrectly deduce that the file was UTF-16 little-endian encoded. I can't imagine this is a popular encoding for a requirements file.

Fixes: pypa#12771
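The UTF-32 point can be checked directly against the standard library; this small snippet just demonstrates why a BOM list that tests UTF-16 before UTF-32 misidentifies UTF-32-LE files.

```python
import codecs

print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF32_LE)  # b'\xff\xfe\x00\x00'

# A UTF-32-LE file's prefix also matches the UTF-16-LE BOM, so whichever
# BOM is tested first wins; UTF-32 must be checked before UTF-16.
print(codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE))  # True
```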
I have #12795 to address this. I just updated it to change the approach from adding a work-around for this case to re-ordering the encodings attempted when decoding a requirements file |
Description
When attempting to install dependencies with the
pip install -r requirements.txt
command, the error
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence
occurred. The error log is as follows:
Expected behavior
Properly install dependencies.
pip version
24.0
Python version
3.10.14
OS
Windows 10
How to Reproduce
When attempting to install dependencies with the
pip install -r requirements.txt
command, the error
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence
occurs. The error log is the same as shown above.
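A hypothetical way to set up the failing input (the package name and comment text are made up): write a UTF-8 requirements file containing non-ASCII text, then run pip on a machine whose locale encoding is GBK (for example a zh_CN Windows install).

```python
# Creates a UTF-8 encoded requirements.txt with non-ASCII content.
# Running `pip install -r requirements.txt` on a system whose locale
# encoding is GBK may then fail with the UnicodeDecodeError above,
# depending on the exact bytes involved.
with open("requirements.txt", "w", encoding="utf-8") as f:
    f.write("requests==2.31.0  # 用于 HTTP 请求\n")
```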
Output
No response
Code of Conduct