My code just scrapes a web page, then converts it to Unicode.
    html = urllib.urlopen(link).read()
    html.encode("utf8","ignore")
    self.response.out.write(html)
But I get this error:
    Traceback (most recent call last):
      File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
        handler.get(*groups)
      File "/Users/greg/clounce/main.py", line 55, in get
        html.encode("utf8","ignore")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?
As of February 2018, compression such as gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode like in the original answer on a gzipped response, you'll get an error similar to this:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte
In order to decode a gzipped response you need to import the following modules (in Python 3):
    import gzip
    import io
Then you can parse the content out like this:
    from urllib.request import urlopen  # Python 3 location of urlopen

    response = urlopen("https://example.com/gzipped-ressource")
    # Use StringIO.StringIO(response.read()) in Python 2
    buffer = io.BytesIO(response.read())
    gzipped_file = gzip.GzipFile(fileobj=buffer)
    decoded = gzipped_file.read()
    # Replace utf-8 with the source encoding of your requested resource
    content = decoded.decode("utf-8")
This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through the GzipFile class. After that, the decompressed content can be read as bytes again and finally decoded to normally readable text.
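The byte 0x8b in the error above is the second byte of the gzip magic number (0x1f 0x8b), so one way to handle both compressed and uncompressed bodies is to check for that prefix before decoding. The following is only a minimal Python 3 sketch: the helper name `decode_http_body` and the sample text are made up for illustration, and it uses `gzip.decompress` instead of `GzipFile` for brevity.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_http_body(raw, encoding="utf-8"):
    """Hypothetical helper: gunzip the body if needed, then decode it."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Simulate a gzipped response body without touching the network.
original = "caf\u00e9 page"
compressed_body = gzip.compress(original.encode("utf-8"))

print(decode_http_body(compressed_body) == original)
print(decode_http_body(original.encode("utf-8")) == original)
```

In a real request the `Content-Encoding` response header is the more reliable signal; the magic-number check is just a convenient fallback when the header is unavailable.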
Can we get the actual value used for
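In addition, we usually encounter this problem when we call .encode() on an already encoded byte string. In Python 2, .encode() on a byte string first decodes it implicitly with the ascii codec, which is why the error is a UnicodeDecodeError. So you might try to decode it first, as in: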
    html = urllib.urlopen(link).read()
    unicode_str = html.decode(<source encoding>)
    encoded_str = unicode_str.encode("utf8")
As an example:
    html = '\xa0'
    encoded_str = html.encode("utf8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
    html = '\xa0'
    decoded_str = html.decode("windows-1252")
    encoded_str = decoded_str.encode("utf8")
This succeeds without error. Do note that "windows-1252" is just something I used as an example. I got it from chardet, which had 0.5 confidence that it is right! (Well, given a 1-character-length string, what do you expect?) You should change it to whatever encoding actually applies to the byte string returned from .urlopen().read().
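For comparison, here is roughly how the same steps look in Python 3, where the bytes/text split is explicit. This is a sketch only: the sample bytes stand in for a real response body, and "windows-1252" is again just an assumed source encoding. It also shows the `errors` argument of `decode()`, which is what actually drops or marks bad bytes as the question asked.

```python
# Stand-in for the bytes returned by urlopen(...).read()
raw = b"caf\xa0e"

# Decoding with the (assumed) source encoding keeps every character.
text = raw.decode("windows-1252")
utf8_bytes = text.encode("utf-8")

# If the bytes are invalid for the chosen codec, an error handler on
# decode() can drop them or substitute U+FFFD instead of raising.
dropped = raw.decode("utf-8", errors="ignore")
marked = raw.decode("utf-8", errors="replace")

print(repr(text), repr(utf8_bytes), repr(dropped), repr(marked))
```

Note that "ignore" silently loses data, so it is best kept for cases where the occasional missing character does not matter.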
Another problem I see is that the .encode() string method returns the encoded string and does not modify the source in place. So it's kind of useless to call self.response.out.write(html), as html is not the encoded string from html.encode (if that is what you were originally aiming for).
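A tiny Python 3 illustration of that point, with a made-up one-character string: the return value has to be captured, because strings are immutable and the original object never changes.

```python
text = "\u00e4"                  # the character 'ä'

text.encode("utf-8")             # result is thrown away; text is unchanged
encoded = text.encode("utf-8")   # keep the return value instead

print(repr(text), repr(encoded))
```

In the question's handler, that would mean writing the captured result rather than `html`.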
As Ignacio suggested, check the source webpage for the actual encoding of the string returned from read(). It's either in one of the meta tags or in the Content-Type header in the response. Use that then as the parameter for
Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).
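As a rough sketch of that advice in Python 3: the charset can be pulled out of a Content-Type header value with the standard library, with a lenient fallback for pages whose declaration does not match the content. The function names `charset_from_content_type` and `decode_body`, the default of "utf-8", and the sample header values are all assumptions made for this example.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_body(raw, header_value):
    """Decode with the declared charset; fall back if the declaration is wrong."""
    charset = charset_from_content_type(header_value)
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Declared charset is unknown or does not match the bytes.
        return raw.decode("utf-8", errors="replace")


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(repr(decode_body(b"\xe9", "text/html; charset=ISO-8859-1")))
print(repr(decode_body(b"\xa0", "text/html; charset=utf-8")))
```

With `urllib.request`, the response's `headers` object offers the same `get_content_charset()` method directly, so the manual `Message` step is only needed when all you have is the raw header string.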
    >>> u'aあä'.encode('ascii', 'ignore')
    'a'
Decode the string you get back, using the charset given either in the appropriate meta tag in the response or in the Content-Type header, then encode.
encode() accepts error handlers other than "ignore", for example 'replace', 'xmlcharrefreplace' and 'backslashreplace'. See https://docs.python.org/3/library/stdtypes.html#str.encode
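To make the difference concrete, here is a short Python 3 sketch that runs the same 'aあä' string through each handler when encoding to ASCII. The `results` dictionary is just a convenience for this example.

```python
text = "a\u3042\u00e4"  # the string 'aあä'

handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {name: text.encode("ascii", errors=name) for name in handlers}

for name, encoded in results.items():
    print(name, encoded)

# ignore            b'a'
# replace           b'a??'
# xmlcharrefreplace b'a&#12354;&#228;'
# backslashreplace  b'a\\u3042\\xe4'
```

For HTML output, 'xmlcharrefreplace' is often the most useful of these, since browsers render the character references as the original characters.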