Python Unicode Encode Error


Question

I'm reading and parsing an Amazon XML file and while the XML file shows a ' , when I try to print it I get the following error:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128) 

From what I've read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?

1
100
4/27/2012 4:17:42 AM

Accepted Answer

Likely, your problem is that you parsed it okay, and now you're trying to print the contents of the XML and you can't because theres some foreign Unicode characters. Try to encode your unicode string as ascii first:

unicodeData.encode('ascii', 'ignore')

the 'ignore' part will tell it to just skip those characters. From the python docs:

>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what's going on. After the read, you'll stop feeling like you're just guessing what commands to use (or at least that happened to me).

183
4/4/2011 1:56:36 PM

A better solution:

if type(value) == str:
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)

If you would like to read more about why:

http://docs.plone.org/manage/troubleshooting/unicode.html#id1


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon