UnicodeEncodeError: 'charmap' codec can't encode - character maps to , print function


Question

I am writing a Python (Python 3.3) program to send some data to a webpage using POST method. Mostly for debugging process I am getting the page result and displaying it on the screen using print() function.

The code is like this:

conn.request("POST", resource, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
print(data.decode('utf-8'));

the HTTPResponse .read() method returns a bytes element encoding the page (which is a well formated UTF-8 document) It seemed okay until I stopped using IDLE GUI for Windows and used the Windows console instead. The returned page has a U+2014 character (em-dash) which the print function translates well in the Windows GUI (I presume Code Page 1252) but does not in the Windows Console (Code Page 850). Given the strict default behavior I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 10248: character maps to <undefined>

I could fix it using this quite ugly code:

print(data.decode('utf-8').encode('cp850','replace').decode('cp850'))

Now it replace the offending character "—" with a ?. Not the ideal case (a hyphen should be a better replacement) but good enough for my purpose.

There are several things I do not like from my solution.

  1. The code is ugly with all that decoding, encoding, and decoding.
  2. It solves the problem for just this case. If I port the program for a system using some other encoding (latin-1, cp437, back to cp1252, etc.) it should recognize the target encoding. It does not. (for instance, when using again the IDLE GUI, the emdash is also lost, which didn't happen before)
  3. It would be nicer if the emdash translated to a hyphen instead of a interrogation bang.

The problem is not the emdash (I can think of several ways to solve that particularly problem) but I need to write robust code. I am feeding the page with data from a database and that data can come back. I can anticipate many other conflicting cases: an 'Á' U+00c1 (which is possible in my database) could translate into CP-850 (DOS/Windows Console encodign for Western European Languages) but not into CP-437 (encoding for US English, which is default in many Windows instalations).

So, the question:

Is there a nicer solution that makes my code agnostic from the output interface encoding?

1
146
1/31/2013 4:18:30 PM

Accepted Answer

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!


Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):

PrintFails - Python Wiki

Here's a code excerpt from that page:

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line'
  UTF-8
  <type 'unicode'> 2
  Б
  Б

  $ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line' | cat
  None
  <type 'unicode'> 2
  Б
  Б

There's some more information on that page, well worth a read.

35
1/4/2016 5:18:53 PM

Update: Python 3.6 implements PEP 528: Change Windows console encoding to UTF-8: the default console on Windows will now accept all Unicode characters. Internally, it uses the same Unicode API as the win-unicode-console package mentioned below. print(unicode_string) should just work now.


I get a UnicodeEncodeError: 'charmap' codec can't encode character... error.

The error means that Unicode characters that you are trying to print can't be represented using the current (chcp) console character encoding. The codepage is often 8-bit encoding such as cp437 that can represent only ~0x100 characters from ~1M Unicode characters:

>>> u"\N{EURO SIGN}".encode('cp437')
Traceback (most recent call last):
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position 0:
character maps to 

I assume this is because the Windows console does not accept Unicode-only characters. What's the best way around this?

Windows console does accept Unicode characters and it can even display them (BMP only) if the corresponding font is configured. WriteConsoleW() API should be used as suggested in @Daira Hopwood's answer. It can be called transparently i.e., you don't need to and should not modify your scripts if you use win-unicode-console package:

T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py

See What's the deal with Python 3.4, Unicode, different languages and Windows?

Is there any way I can make Python automatically print a ? instead of failing in this situation?

If it is enough to replace all unencodable characters with ? in your case then you could set PYTHONIOENCODING envvar:

T:\> set PYTHONIOENCODING=:replace
T:\> python3 -c "print(u'[\N{EURO SIGN}]')"
[?]

In Python 3.6+, the encoding specified by PYTHONIOENCODING envvar is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSIOENCODING envvar is set to a non-empty string.


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon