Unicode and bytes
- str.encode(encoding, errors='strict')
- bytes.decode(encoding, errors='strict')
- open(filename, mode, encoding=None)
|encoding||The encoding to use, e.g. 'ascii' or 'utf-8'.|
|errors||The errors mode, e.g. 'replace' to replace bad characters with question marks, 'ignore' to ignore bad characters, etc.|
In Python 3, str is the type for unicode-enabled strings, while bytes is the type for sequences of raw bytes.
In Python 2, a plain string was a sequence of raw bytes by default, and a unicode string was any string literal with the "u" prefix.
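The distinction between the two Python 3 types can be sketched as follows. The variable names and the sample text are illustrative only, and the snippet assumes Python 3.

```python
text = "£13.55"          # str: a sequence of Unicode code points
data = b"\xc2\xa313.55"  # bytes: a sequence of raw byte values

print(type(text))  # <class 'str'>
print(type(data))  # <class 'bytes'>

# A str and a bytes object never compare equal, even if they look alike.
print("abc" == b"abc")  # False

# Indexing a str yields a one-character str; indexing bytes yields an int.
first_char = text[0]  # '£'
first_byte = data[0]  # 194 (0xc2)
```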
Unicode to bytes
Unicode strings can be converted to bytes with .encode(encoding).
In Python 2 the default encoding is 'ascii' (sys.getdefaultencoding() == 'ascii') and not 'utf-8' as in Python 3, so printing non-ASCII text directly is not always possible there.
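In Python 3, encoding a string might look like the sketch below. The sample text and the choice of 'utf-8' and 'latin-1' are assumptions made for illustration.

```python
text = "£13.55"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# The same text produces different byte sequences, of different lengths,
# under different encodings.
print(utf8_bytes, len(utf8_bytes))      # b'\xc2\xa313.55' 7
print(latin1_bytes, len(latin1_bytes))  # b'\xa313.55' 6
```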
If the encoding can't handle the string, a `UnicodeEncodeError` is raised:
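A minimal sketch of that failure, assuming Python 3 and using ASCII as an example of an encoding that cannot represent "£":

```python
text = "£13.55"

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # The exception records the codec and the offending slice of the input.
    failed_codec = exc.encoding         # 'ascii'
    bad_part = text[exc.start:exc.end]  # '£'
    print(ascii(bad_part), "cannot be encoded as", failed_codec)
```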
Bytes to unicode
Bytes can be converted to unicode strings with .decode(encoding).
A sequence of bytes can only be converted into a unicode string via the appropriate encoding!
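The sketch below illustrates why the encoding must match. It assumes Python 3 and a byte sequence that was produced by encoding "£13.55" as UTF-8. Decoding with a different single-byte encoding raises no error here but silently yields the wrong text.

```python
data = b"\xc2\xa313.55"  # "£13.55" encoded as UTF-8

right = data.decode("utf-8")    # the original text
wrong = data.decode("latin-1")  # no exception, but not the original text

print(ascii(right))  # '\xa313.55'
print(ascii(wrong))  # '\xc2\xa313.55'
```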
If the encoding can't handle the bytes, a `UnicodeDecodeError` is raised:
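A minimal sketch of that failure, assuming Python 3 and UTF-8 encoded input decoded with the ASCII codec:

```python
data = b"\xc2\xa313.55"  # "£13.55" encoded as UTF-8

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    # The exception records the codec and the first byte it could not decode.
    failed_codec = exc.encoding          # 'ascii'
    bad_bytes = data[exc.start:exc.end]  # b'\xc2'
    print(bad_bytes, "cannot be decoded as", failed_codec)
```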
Encoding/decoding error handling
.encode and .decode both have error modes.
The default is 'strict', which raises exceptions on error. Other modes are more forgiving.
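The forgiving modes can be sketched as follows, assuming Python 3 and the ASCII codec as the limiting encoding. Note that when decoding, 'replace' substitutes the Unicode replacement character U+FFFD rather than a question mark.

```python
text = "£13.55"
data = b"\xc2\xa313.55"  # the same text encoded as UTF-8

# Encoding: characters the codec cannot represent are replaced or dropped.
enc_replace = text.encode("ascii", errors="replace")  # b'?13.55'
enc_ignore = text.encode("ascii", errors="ignore")    # b'13.55'

# Decoding: each undecodable byte is replaced with U+FFFD or dropped.
dec_replace = data.decode("ascii", errors="replace")  # '\ufffd\ufffd13.55'
dec_ignore = data.decode("ascii", errors="ignore")    # '13.55'
```

Both modes lose information, so they suit display or logging better than data that must round-trip.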
It is clear from the above that it is vital to keep your encodings straight when dealing with unicode and bytes.
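Files opened in a non-binary mode (e.g. 'w') deal with strings. If no encoding argument is given, the default depends on the platform and Python version (see locale.getpreferredencoding()), so it is safest to pass the encoding explicitly.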
Files opened in a binary mode (e.g. 'wb') deal with bytes. No encoding argument can be specified as there is no encoding.
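The difference between text and binary mode can be sketched like this, assuming Python 3. The file name price.txt and the temporary directory are hypothetical and used only for illustration.

```python
import os
import tempfile

text = "£13.55"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "price.txt")  # hypothetical file name

    # Text mode: write a str and name the encoding explicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: read back the raw bytes that were stored on disk.
    with open(path, "rb") as f:
        raw = f.read()  # b'\xc2\xa313.55'

    # Text mode again: the bytes are decoded back into a str.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # Passing an encoding together with a binary mode raises ValueError.
    try:
        open(path, "wb", encoding="utf-8")
    except ValueError:
        binary_rejects_encoding = True
    else:
        binary_rejects_encoding = False
```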
This modified text is an extract of the original Stack Overflow Documentation created by following contributors
and released under CC BY-SA 3.0