UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence #12771

Open
1 task done
danerlt opened this issue Jun 18, 2024 · 12 comments · May be fixed by #12795
Labels: state: awaiting PR (Feature discussed, PR is needed); type: bug (A confirmed bug or unintended behavior)

Comments

@danerlt

danerlt commented Jun 18, 2024

Description

When I tried to install dependencies with the pip install -r requirements.txt command, the following error occurred: UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence. The error log is as follows:

error trace:

ERROR: Exception:
Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\cli\base_command.py", line 180, in exc_logging_wrapper
    status = run_func(*args)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\cli\req_command.py", line 245, in wrapper
    return func(self, options, args)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\commands\install.py", line 342, in run
    reqs = self.get_requirements(args, options, finder, session)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\cli\req_command.py", line 433, in get_requirements
    for parsed_req in parse_requirements(
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 156, in parse_requirements
    for parsed_line in parser.parse(filename, constraint):
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 337, in parse
    yield from self._parse_and_recurse(filename, constraint)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 342, in _parse_and_recurse
    for line in self._parse_file(filename, constraint):
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 373, in _parse_file
    _, content = get_file_content(filename, self._session)
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\req\req_file.py", line 551, in get_file_content
    content = auto_decode(f.read())
  File "D:\ProgramData\Anaconda3\envs\ka\lib\site-packages\pip\_internal\utils\encoding.py", line 34, in auto_decode
    return data.decode(
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence

Expected behavior

Properly install dependencies.

pip version

24.0

Python version

3.10.14

OS

Windows 10

How to Reproduce

Run pip install -r requirements.txt. The command fails with the UnicodeDecodeError and traceback shown under Description.

Output

No response

danerlt added the S: needs triage and type: bug labels Jun 18, 2024
@matthewhughes934
Contributor

Are you able to share the contents of the requirements.txt file you were using?

@danerlt
Author

danerlt commented Jun 18, 2024

@matthewhughes934
The contents of my requirements.txt are as follows:

# server
supervisor==4.2.5
gunicorn==21.2.0
gevent==23.9.1

# web
Werkzeug==2.3.7
celery==5.2.7
click==8.1.7
dataclasses_json==0.6.4
Flask==2.3.3
Flask_Cors==3.0.10
Flask_Login==0.6.2
Flask_Migrate==4.0.5
Flask_RESTful==0.3.9
flask_sqlalchemy==3.0.5
SQLAlchemy==2.0.0
minio==7.2.4
psycopg2-binary==2.9.9
python-dotenv==1.0.1
redis==5.0.2
requests==2.31.0

# rag
langchain==0.1.16
llama-index==0.10.30
llama-index-core==0.10.30  # 这个必须手指定,不然构建的时候会去获取最新的版本,可能会有bug。
llama-index-retrievers-bm25==0.1.3
llama-index-storage-index-store-redis==0.1.2
llama-index-storage-kvstore-redis==0.1.3
llama-index-storage-docstore-mongodb==0.1.3
llama-index-vector-stores-milvus==0.1.10
llama-index-vector-stores-qdrant==0.2.5
llama-parse==0.4.1
rank-bm25==0.2.2
ragas==0.1.1
qdrant-client==1.9.0
pymongo==4.6.3
motor==3.4.0
asyncpg==0.29.0
spacy==3.7.4
jieba==0.42.1
./zh_core_web_sm-3.7.0-py3-none-any.whl
scikit-learn==1.4.2


# data loader 相关依赖
pypdf==4.2.0
pdfminer-six==20231228
PyMuPDF==1.24.2
docx2txt==0.8
python-docx==1.1.0
openpyxl==3.1.2

# 评估相关
dashscope==1.19.2
zhipuai==2.1.0

@danerlt
Author

danerlt commented Jun 18, 2024

@matthewhughes934
I modified pip\_internal\utils\encoding.py and added the ignore parameter to its data.decode call, which resolved the issue.

@uranusjr
Member

It’s probably best to always use ascii with replace. We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.

A PR would be much welcomed.

uranusjr added the state: awaiting PR label and removed the S: needs triage label Jun 18, 2024
@matthewhughes934
Contributor

matthewhughes934 commented Jun 18, 2024

I modified pip\_internal\utils\encoding.py and added the ignore parameter to its data.decode call, which resolved the issue.

I guess the underlying issue was: the file looks to be UTF-8 encoded but you're working in an environment that uses a simplified Chinese locale, and so uses GBK for decoding. I guess an alternative solution would be to run Python in UTF-8 mode (https://docs.python.org/3/using/windows.html#utf-8-mode)
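To make the diagnosis above concrete, here is a minimal sketch of the mismatch. The sample line and the helper name try_decode are made up for illustration (this is not the reporter's file); only bytes.decode and the standard utf-8 and gbk codecs are relied on.

```python
# Bytes as they would sit on disk for a UTF-8 requirements file that
# contains a short Chinese comment (illustrative content only).
data = "requests==2.31.0  # 中\n".encode("utf-8")


def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"


# Decoding with the encoding the file was written in works.
utf8_result = try_decode(data, "utf-8")
# Decoding the same bytes as GBK fails: the last byte of the UTF-8
# sequence is treated as a GBK lead byte, and the newline that follows
# is not a valid trail byte.
gbk_result = try_decode(data, "gbk")

print(repr(utf8_result))
print(gbk_result)
```

If the UTF-8 mode route is taken instead, running pip as python -X utf8 -m pip install -r requirements.txt (or setting PYTHONUTF8=1) should make locale.getpreferredencoding(False) report UTF-8, so the file would be decoded with the encoding it was written in.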

@matthewhughes934
Contributor

It’s probably best to always use ascii with replace. We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.

A PR would be much welcomed.

👍 happy to get a PR up. I'm wondering two things:

  • If I change auto_decode: are there places where we want decoding to fail (per errors="strict") or would it be ok to always replace? Or is there code elsewhere that should be changed?
  • 🤔 Is there any potential for issues with multi-byte/non-ascii-extended encodings: I have no idea how common these might be, but I guess a consequence could be instead of getting a 'failed to decode' error you could get an error about pip failing to install a package named "����"
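A small sketch of the concern raised in the second bullet, assuming a made-up requirements line that points at a local wheel under a directory with a non-ASCII name. It only shows what bytes.decode does with errors="replace" and errors="ignore"; it is not pip's code.

```python
# Hypothetical requirements line: a local path with a non-ASCII directory
# name, followed by a non-ASCII comment, stored as UTF-8.
line = "./包/pkg-1.0-py3-none-any.whl  # 本地包\n".encode("utf-8")

# errors="replace" turns every non-ASCII byte into U+FFFD, so the path
# no longer names the real directory.
replaced = line.decode("ascii", errors="replace")

# errors="ignore" silently drops those bytes, which also changes the path.
ignored = line.decode("ascii", errors="ignore")

print(repr(replaced))
print(repr(ignored))
```

In both cases the comment is harmless, but the path is silently altered, so the failure would surface later as a missing file rather than as a decoding error.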

@pfmoore
Member

pfmoore commented Jun 18, 2024

We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.

Unfortunately, requirements aren't the only things in a requirement file. --requirement <path to file to include> could include arbitrary Unicode characters, and for that matter a simple local pathname is valid (and could be Unicode).

However, the documentation states that requirement files should be UTF-8 by default, so this seems like a simple bug in auto_decode - https://github.com/pypa/pip/blob/main/src/pip/_internal/utils/encoding.py#L35 should be using UTF-8. (And arguably the BOM detection in there is in violation of the spec, but IMO it's not worth changing).

Of course, even though this is technically a bug fix, it is still a breaking change, potentially, so we need to consider how we handle that. (We could fall back to the system encoding if UTF8 fails, with a deprecation warning - this won't avoid mojibake, but it will catch outright encoding failures).
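The fallback idea described above could look roughly like the following. This is a sketch under stated assumptions, not pip's implementation: the function name decode_requirements, the fallback_encoding parameter, and the warning text are all hypothetical.

```python
import locale
import sys
import warnings
from typing import Optional


def decode_requirements(data: bytes, fallback_encoding: Optional[str] = None) -> str:
    """Try UTF-8 first; on failure fall back to the locale encoding and warn.

    `fallback_encoding` is a hypothetical hook so the fallback can be chosen
    explicitly instead of being read from the current machine.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        encoding = (
            fallback_encoding
            or locale.getpreferredencoding(False)
            or sys.getdefaultencoding()
        )
        warnings.warn(
            f"requirements file is not valid UTF-8; falling back to {encoding!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return data.decode(encoding)


# A UTF-8 file decodes directly; a GBK file triggers the warned fallback.
print(decode_requirements("# 中\n".encode("utf-8")))
print(decode_requirements("# 中\n".encode("gbk"), fallback_encoding="gbk"))
```

As noted in the comment, this only catches outright decoding failures. Bytes in a legacy encoding that happen to form valid UTF-8 would still decode without error and come out as mojibake.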

@uranusjr
Member

Ah, right, I forgot about paths. Falling back with a deprecation warning sounds like the way to go.

matthewhughes934 added a commit to matthewhughes934/pip that referenced this issue Jun 25, 2024
For the case where:

* a requirements file is encoded as UTF-8, and
* some bytes in the file are incompatible with the system locale

In this case, fall back to decoding as UTF-8 as a last resort (rather
than crashing on the `UnicodeDecodeError`). This behaviour was added
when parsing the requirements file, rather than in `auto_decode`, as it
didn't seem to belong in a generic util (though that util looks to only
ever be called when parsing requirements files anyway).

Perhaps we should just go straight to UTF-8 without querying the system
locale (unless there is a PEP-263 style comment), per the docs[1]:

> Requirements files are utf-8 encoding by default

But to avoid a breaking change, just warn if decoding with the locale
encoding fails, then fall back to UTF-8

[1] https://pip.pypa.io/en/stable/reference/requirements-file-format/#encoding

Fixes: pypa#12771
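For readers unfamiliar with the PEP-263 style comment mentioned in the commit message, a rough sketch of how such a declaration can be detected follows. The regular expression and the function name declared_encoding are illustrative assumptions, and checking only the first two lines mirrors the rule PEP 263 defines for Python source files.

```python
import re
from typing import Optional

# Simplified pattern for a "coding: <name>" or "coding=<name>" declaration.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in a comment on the first two lines, if any."""
    for line in data.split(b"\n")[:2]:
        if line.startswith(b"#"):
            match = CODING_RE.search(line)
            if match:
                return match.group(1).decode("ascii")
    return None


print(declared_encoding(b"# -*- coding: gbk -*-\nrequests==2.31.0\n"))
print(declared_encoding(b"requests==2.31.0\n"))
```

With a declaration like this at the top of the file, the author states the encoding explicitly, so no guess based on the system locale is needed.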
matthewhughes934 linked a pull request Jun 25, 2024 that will close this issue
matthewhughes934 added a commit to matthewhughes934/pip that referenced this issue Aug 15, 2024
@Pied-Piper1

If this issue has been fixed, could you tell me how to apply the fix? Unfortunately, I am running into the same problem.

@pfmoore
Member

pfmoore commented Oct 21, 2024

No-one has submitted a PR to fix this yet. For now, the simplest solution is likely to be to remove any non-ASCII content from your requirements file. Or ensure your requirements file is encoded in the locale preferred encoding, as returned by locale.getpreferredencoding(False) or sys.getdefaultencoding().
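A sketch of the second workaround, assuming the requirements file is currently UTF-8. The helper names locale_encoding and reencode and the file names are hypothetical; the demonstration passes an explicit target encoding so that it does not depend on the machine it runs on.

```python
import locale
import sys
import tempfile
from pathlib import Path
from typing import Optional


def locale_encoding() -> str:
    """The encoding named in the comment above: locale first, then the default."""
    return locale.getpreferredencoding(False) or sys.getdefaultencoding()


def reencode(src: Path, dst: Path, target: Optional[str] = None) -> str:
    """Rewrite a UTF-8 file in `target` (default: the locale encoding)."""
    target = target or locale_encoding()
    text = src.read_bytes().decode("utf-8")
    # Raises UnicodeEncodeError if `target` cannot represent some character.
    dst.write_bytes(text.encode(target))
    return target


# Demonstration on a temporary copy, converting UTF-8 to GBK.
original = "jieba==0.42.1  # 中文分词\n"
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "requirements.txt"
    dst = Path(tmp) / "requirements.gbk.txt"
    src.write_bytes(original.encode("utf-8"))
    used = reencode(src, dst, target="gbk")
    roundtrip = dst.read_bytes().decode(used)

print(locale_encoding())
```

Printing locale_encoding() on the affected machine shows which encoding the file would need to be in. A file converted this way is tied to that locale, so removing the non-ASCII content is the more portable of the two workarounds.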

@matthewhughes934
Contributor

If this issue has been fixed, could you tell me how to apply the fix? Unfortunately, I am running into the same problem.

Are you able to share the requirements file you were using when you saw the error?

matthewhughes934 added a commit to matthewhughes934/pip that referenced this issue Oct 22, 2024
This changes the decoding process to be more in line with what was
previously documented. The new process is outlined in the updated docs.

The `auto_decode` function was removed and all decoding logic moved to
the `pip._internal.req.req_file` module because:

* This function was only ever used to decode requirements files
* It was never really a generic 'util' function, it was always tied to
  the idiosyncrasies of decoding requirements files.
* The module lived under `_internal` so I felt comfortable removing it

A warning was added when we _do_ fall back to using the locale-defined
encoding to encourage users to move to an explicit encoding definition
via a coding style comment.

This fixes two existing bugs. Firstly, when:

* a requirements file is encoded as UTF-8, and
* some bytes in the file are incompatible with the system locale

Previously, assuming no BOM or PEP-263 style comment, we would default
to using the encoding from the system locale, which would then fail (see
issue pypa#12771)

Secondly, when decoding a file starting with a UTF-32 little endian Byte
Order Marker. Previously this would always fail since
`codecs.BOM_UTF32_LE` is `codecs.BOM_UTF16_LE` followed by two null
bytes, and because of the ordering of the list of BOMs the UTF-16 case
would be run first and match the file prefix, so we would incorrectly
deduce that the file was UTF-16 little endian encoded. I can't imagine
this is a popular encoding for a requirements file.

Fixes: pypa#12771
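The BOM ordering problem described in the commit message can be reproduced with the standard codecs constants. The detect_bom helper and the two candidate lists below are illustrative only and are not taken from pip's source.

```python
import codecs
from typing import List, Optional, Tuple


def detect_bom(data: bytes, boms: List[Tuple[bytes, str]]) -> Optional[str]:
    """Return the encoding of the first BOM in `boms` that prefixes `data`."""
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None


# The UTF-16 LE BOM is a prefix of the UTF-32 LE BOM, so order matters.
is_prefix = codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE)

utf16_first = [(codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF32_LE, "utf-32-le")]
utf32_first = [(codecs.BOM_UTF32_LE, "utf-32-le"), (codecs.BOM_UTF16_LE, "utf-16-le")]

data = codecs.BOM_UTF32_LE + "requests\n".encode("utf-32-le")

wrong = detect_bom(data, utf16_first)  # shorter BOM checked first
right = detect_bom(data, utf32_first)  # longer BOM checked first

print(is_prefix, wrong, right)
```

Checking the longer BOM before the shorter one it contains is enough to tell the two encodings apart.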
@matthewhughes934
Contributor

No-one has submitted a PR to fix this yet. For now, the simplest solution is likely to be to remove any non-ASCII content from your requirements file. Or ensure your requirements file is encoded in the locale preferred encoding, as returned by locale.getpreferredencoding(False) or sys.getdefaultencoding().

I have #12795 to address this. I just updated it to change the approach from adding a work-around for this case to re-ordering the encodings attempted when decoding a requirements file

matthewhughes934 added a commit to matthewhughes934/pip that referenced this issue Dec 26, 2024
matthewhughes934 added a commit to matthewhughes934/pip that referenced this issue Dec 28, 2024