The labels »recommended« and »mandatory« say how important it is to read a section.
The Python modules in the repo complement this guide. Feel free to copy them into your projects and send improvements.
- Write unicode (as literal text or in plain type) if you mean the Python type.
- Write Unicode if you mean Unicode in general.
- ❃ str vs. unicode vs. bytes and Python 2 vs. Python 3
- When dealing with strings and Unicode in Python, there are two types you have to know. str is a plain list of bytes that just happens to be rendered as a string. unicode is a list of Unicode characters. Python 2 → Python 3: str → bytes, unicode → str.
- ❃ default string type
- The default string type in both Pythons is str, but note that str means different things in Python 2 and Python 3. In Python 3 all string variables inside a program are lists of Unicode characters, and we want the same in Python 2, because we are forward-looking.
- ❃ the ideal: every string is unicode
- Therefore, we assume all string variables inside our programs to be of type unicode.
- ❃ (nearly) everything outside is str
- When communicating with the outside world and some libraries, we have to convert to or from str.
- ❃ Unicode and UTF-8
- Unicode is different from UTF-8. Read the first paragraph in the blue box at the top of https://pythonhosted.org/kitchen/unicode-frustrations.html.
- ❃ encoding and decoding
- To turn a UTF-8-encoded str (list of bytes) into unicode, use .decode('utf-8'). To turn a unicode into a UTF-8-encoded str, use .encode('utf-8').
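A minimal sketch of the round trip described above. The sample word is only an illustration; the code is written so that it should run unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

text = "Erdkröte"                    # text: unicode on Python 2, str on Python 3
utf8_text = text.encode('utf-8')     # UTF-8 bytes: str on Python 2, bytes on Python 3

# 'ö' is one Unicode character but two bytes in UTF-8,
# so the two lengths differ.
char_count = len(text)
byte_count = len(utf8_text)

# Decoding with the same codec gives back the original text.
round_tripped = utf8_text.decode('utf-8')
```

The difference between the two lengths is a quick way to see that Unicode characters and UTF-8 bytes are not the same thing.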
- ❃ unicode_literals
In every Python file, import unicode_literals:

    from __future__ import unicode_literals

If you don't do this, all string literals in your source code will be str, which is against the »every string is unicode« ideal of the Need to Know.
- ❃ str literals
- Use b"bla" to write a str "bla".
- ❃ string conversion
- Use unicode() instead of str() when you want to convert numbers etc. to strings.
- ❃ naming convention
If there is a string variable that needs to be of type str inside your program, prefix it with b_ if you don't know the encoding, or with utf8_ if you know it is UTF-8:

    b_company_name = read_company_name_str()
    utf8_company_name = read_company_name_utf8()
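The literal, conversion, and naming rules above can be sketched together as follows. The variable names and values are made up for illustration, and the try/except alias is an assumption added only so that the sketch also runs on Python 3, where the unicode built-in no longer exists.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

try:
    unicode                 # exists on Python 2
except NameError:
    unicode = str           # assumption: alias so the sketch also runs on Python 3

title = "bla"                           # text literal, thanks to unicode_literals
b_payload = b"bla"                      # byte string, encoding unknown: b_ prefix
utf8_author = "Möhn".encode('utf-8')    # byte string known to be UTF-8: utf8_ prefix
count_text = unicode(42)                # number to text via unicode(), not str()
```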
- ❃ reading and writing files
When you want to read from or write to a file, use codecs.open() instead of the built-in open():

    >>> from __future__ import unicode_literals
    >>> import codecs
    >>> with codecs.open("bla.txt", 'w', 'utf-8') as f:
    ...     f.write("üüü")
    ...
    >>> with codecs.open("bla.txt", 'r', 'utf-8') as f:
    ...     f.read(3)
    ...
    u'\xfc\xfc\xfc'
    >>> 'ü' * 3
    u'\xfc\xfc\xfc'
- ❃ print
Everything that is written to the outside world should be str. This normally includes parameters to print. In order to avoid having to convert your unicodes all the time, write at the top of every file, but after all imports:

    import sys
    import codecs
    # and other imports

    if not isinstance(sys.stdout, codecs.StreamWriter):
        sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

    # main code follows

(Don't forget to add imports for sys and codecs if they aren't there already.) This way you can do print(unicode). Note, however, that now it's dangerous to do print(str). Never pass a str to print unless you're sure it contains only ASCII. In such cases, write a clarifying comment.
- ❃ exceptions and warnings
- When raising exceptions or warnings, only pass str. Think twice whether the thing you're passing really is str!
- ❃ print to sys.stderr
- We don't put a UTF-8 writer in front of sys.stderr, since that would cause even more confusion. So make sure that everything you send there is str.
- ❃ external libraries
- Check whether the library procedures you're calling accept and return str or unicode. If they accept and return str, take care to make the right conversions. Below are notes on which libraries do what.
- ❃ environment variables
- Use unicode_environ.getenv and unicode_environ.environ instead of os.getenv and os.environ. If you need to do anything else with the environment, extend unicode_environ instead of resorting to environment utilities from os.
- ❃ command line arguments
- Command line arguments come as str and you need to convert them. Unfortunately, passing type=unicode to ArgumentParser.add_argument is not enough. Use unicode_argparse.ArgumentParser instead of argparse.ArgumentParser.
- ❃ testing
- In your tests, try to break the system by including non-ASCII characters in strings. If you can't succeed, chances are good that you have done the Unicode thing correctly.
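One way such a test could look, as a sketch only: make_greeting is a hypothetical function standing in for your own code, and the test data is arbitrary non-ASCII text.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unittest


def make_greeting(name):
    """Hypothetical function under test; expects and returns text."""
    return "Hello, " + name + "!"


class GreetingTest(unittest.TestCase):
    def test_non_ascii_name(self):
        self.assertEqual(make_greeting("Richard Möhn"), "Hello, Richard Möhn!")

    def test_result_survives_utf8_round_trip(self):
        # On Python 2 this raises UnicodeDecodeError if the function
        # returned a str holding non-ASCII bytes instead of unicode.
        greeting = make_greeting("Erdkröte")
        self.assertEqual(greeting.encode('utf-8').decode('utf-8'), greeting)


# Run the tests programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GreetingTest)
result = unittest.TestResult()
suite.run(result)
```

On Python 2, mixing a non-ASCII str into such a function typically surfaces as a UnicodeDecodeError, which is exactly the kind of breakage these tests are meant to provoke.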
- ❃ CONSTANT VIGILANCE!
- When you read data from or write data to somewhere outside your program, make sure it gets converted to the right types.
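As one illustration of converting at the boundary, the sketch below wraps a byte-oriented sink in a UTF-8 writer, which is the same mechanism the print rule applies to sys.stdout. The in-memory io.BytesIO object is an assumption standing in for a real file, socket, or pipe.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
import io

raw = io.BytesIO()                        # stand-in for a byte-oriented sink
writer = codecs.getwriter('utf-8')(raw)   # accepts text, passes on UTF-8 bytes

writer.write("üüü\n")                     # text goes in ...
written = raw.getvalue()                  # ... UTF-8-encoded bytes come out
```

Inside the program everything stays text; the encoding happens in exactly one place, right where the data leaves.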
You may make project-specific exceptions to these rules if they get annoying. Be sure to document them.
Example for a project that uses Pygit2 often:
- ❃ Git SHA1s
- Git SHA1s as returned by Oid.hex are of type str. Since they never contain non-ASCII characters and it would be annoying to convert them all the time, we leave them as str. Since we know that they are str and it is annoying to write prefixes, it is okay to leave off the b_. (Not so sure if this is good, though.)
- ❃ UTF-8-encoded source
In the first or second line of every Python file, put the following:
# -*- coding: utf-8 -*-
Doing this will allow you to use non-ASCII characters in your Python source.
- ❃ unicodification (stringification)
Implement __unicode__ and __str__ like this (credits):

    def __unicode__(self):
        return …  # create unicode representation of your object

    def __str__(self):
        return unicode(self).encode('utf-8')
- ❃ writing Unicode utilities
- If you want to write utilities like unicode_environ and unicode_argparse, you might find the functions from unicode_tools helpful.
When I write something like »works with unicode arguments«, I mean that it works with arguments of type unicode which can contain arbitrary characters, i.e. ASCII as well as non-ASCII.
Feel free to extend, or correct if things have changed.
codecs.open works with unicode as well as str filenames.
datetime.datetime.strftime(unicode): str
httplib2.Http.request works with unicode arguments. However, the results will all contain or be of type str. Example:

    >>> r, c = httplib2.Http(".cache").request("http://de.wikipedia.org/wiki/Erdkröte")
    >>> r['content-type']
    'text/html; charset=UTF-8'
    >>> type(r['content-type'])
    <type 'str'>
    >>> type(c)
    <type 'str'>
Things in os are generally safe to use with unicode. However, note this:
- path.join(unicode, unicode): unicode
- path.relpath(unicode, unicode): str or unicode (!!!) If the result contains non-ASCII characters, it will be unicode, otherwise str. Isn't it sweet?
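One way to cope with a return type that varies like this is to normalize the result right after the call. The helper below is a hypothetical sketch and not part of the repo's modules. Its default codec is ASCII on the assumption, taken from the note above, that a str result never contains non-ASCII characters.

```python
from __future__ import unicode_literals
import os


def ensure_unicode(value, encoding='ascii'):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper, not part of the repo's modules.
    """
    if isinstance(value, bytes):      # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


# Whatever type relpath returns, rel is text afterwards.
rel = ensure_unicode(os.path.relpath(os.path.join("a", "b", "c"), "a"))
```

On Python 3 the helper simply passes text through, so it does no harm there.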
PyCurl works solely on strs.
- Config values can be unicode.
- Commit.hex: str
- Commit.message: unicode
- Paths are str. However, this is extrapolated from the fact that Patch.delta.{old,new}_file.path is str. The API might be inconsistent, so check the thing you're using and add the data here.
- Reference.name, Reference.shorthand: str
  - However, Repository.lookup_reference(unicode) works.
- Refspecs should be str. Remote.add_fetch doesn't complain when you pass unicode, but Remote.fetch_refspecs throws an exception if you added a refspec with non-ASCII characters. Funnily enough, though, Remote.fetch_refspecs is a list of unicode.
- Repository(path) doesn't work with unicodes containing non-ASCII characters. In order to be sure, I'd say that all paths passed to Pygit2 methods or the like should be converted to UTF-8 strs first.
- Signature.name, Signature.email: unicode. If you need str, you can use Signature.raw_name and Signature.raw_email.
Trivia:

    >>> no_r = pygit2.Repository("/tmp/tüüls")  # throws error
    >>> r = pygit2.clone_repository("/tmp/tüüls", "./tüüls")  # works
    >>> r.remotes[0].url  # throws error
re is completely okay with unicode everywhere.
textile.textile returns unicode if you give it unicode.
urllib2 didn't like unicode for URLs and also returned str only. Since urllib is older, I guess it's the same there.
- https://docs.python.org/2.7/howto/unicode.html
- https://pythonhosted.org/kitchen/unicode-frustrations.html
- http://python-future.org/unicode_literals.html
- the documentation of the mentioned modules or libraries
- Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you are in an industrious mood, you can help improve this document and the modules.
- I marked up many things as literal text. It would be nice if you could change this to interpreted text, such as :meth:`pygit2.Diff.merge`. But you'd also have to find the right way to convert this to HTML, since rst2html doesn't like meth (as well as the other Python-specific roles, I guess).
- As stated above, the notes on which libraries do what are always happy to be updated and extended.
Copyright (c) 2015 Richard Möhn
This work is licensed under the Creative Commons Attribution 4.0 International License.