I am trying to read a CSV file containing accented characters (French and/or Spanish only) with Python. Based on the Python 2.5 documentation for the csv module (http://docs.python.org/library/csv.html), I came up with the following code to read the file, since csv.reader supports only ASCII.
    def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
        # csv.py doesn't do Unicode; encode temporarily as UTF-8:
        csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
        for row in csv_reader:
            # decode UTF-8 back to Unicode, cell by cell:
            yield [unicode(cell, 'utf-8') for cell in row]

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    filename = 'output.csv'
    reader = unicode_csv_reader(open(filename))
    try:
        products = []
        for field1, field2, field3 in reader:
            ...
Below is an extract of the CSV file I am trying to read:
    0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Bleu
    0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Gris
    0665000FS10120687,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Vert
    ...
Even though I try to encode/decode to UTF-8, I am still getting the following exception:
    Traceback (most recent call last):
      File ".\Test.py", line 53, in <module>
        for field1, field2, field3 in reader:
      File ".\Test.py", line 40, in unicode_csv_reader
        for row in csv_reader:
      File ".\Test.py", line 46, in utf_8_encoder
        yield line.encode('utf-8', 'ignore')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 68: ordinal not in range(128)
How do I fix this?
The .encode method is applied to a Unicode string to produce a byte string, but you're calling it on a byte string instead... the wrong way 'round! Python 2 first tries to decode the byte string implicitly as ASCII, which is what raises the UnicodeDecodeError. Look at the codecs module in the standard library, and codecs.open in particular, for better general solutions for reading UTF-8 encoded text files. However, for the csv module in particular, you need to pass in UTF-8 data, and that's what you're already getting, so your code can be much simpler:
    import csv

    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    filename = 'da.csv'
    reader = unicode_csv_reader(open(filename))
    for field1, field2, field3 in reader:
        print field1, field2, field3
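To make the encode/decode direction concrete, here is a minimal sketch written for Python 3 (not the Python 2 used in the question), where text and bytes are separate types. The sample word is just an illustration taken from the CSV extract. The explicit raw.decode("ascii") call stands in for the implicit ASCII decode that Python 2 performs when .encode is called on a byte string.

```python
# Python 3 sketch: text (str) is encoded to bytes; bytes are decoded to text.
raw = "numérique".encode("utf-8")   # str -> bytes
text = raw.decode("utf-8")          # bytes -> str

# Python 2 implicitly ran the equivalent of raw.decode("ascii") before
# encoding a byte string; doing that step explicitly reproduces the error.
try:
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(repr(raw))
print(error_message)
```

The error message names byte 0xc3, the first byte of the UTF-8 encoding of "é", just as in the traceback from the question.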
PS: if it turns out that your input data is NOT in UTF-8, but e.g. in ISO-8859-1, then you do need a "transcoding" (if you're keen on using UTF-8 at the csv module level), of the form line.decode('whateverweirdcodec').encode('utf-8'). But you can probably just use the name of your existing encoding in the yield line in my code above, instead of 'utf-8', as csv is actually going to be just fine with ISO-8859-* encoded bytestrings.
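As a rough illustration of that transcoding step, here is a sketch in Python 3 (an assumption on my part; the answer above targets Python 2). The sample row is hypothetical and is built in memory as ISO-8859-1 bytes. It shows both the decode-then-encode form and the simpler route of decoding with the real source encoding and handing the text straight to csv.

```python
import csv
import io

# Hypothetical sample: one CSV row stored as ISO-8859-1 bytes.
latin1_bytes = "0665000FS10120684,SD1200IS,trépied\r\n".encode("iso-8859-1")

# Transcode: decode with the real source encoding, then re-encode as UTF-8.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")

# In Python 3, csv works on text, so decoding with the right codec is enough.
text_stream = io.StringIO(latin1_bytes.decode("iso-8859-1"), newline="")
rows = list(csv.reader(text_stream))

print(repr(latin1_bytes))
print(repr(utf8_bytes))
```

Note that "é" is a single byte (0xe9) in ISO-8859-1 and two bytes (0xc3 0xa9) in UTF-8, which is why guessing the wrong codec fails or garbles the text.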
There is a unicode-csv library which should solve your problems, with the added benefit of not having to write any new csv-related code.
Here is an example from their readme:
    >>> import unicodecsv
    >>> from cStringIO import StringIO
    >>> f = StringIO()
    >>> w = unicodecsv.writer(f, encoding='utf-8')
    >>> w.writerow((u'é', u'ñ'))
    >>> f.seek(0)
    >>> r = unicodecsv.reader(f, encoding='utf-8')
    >>> row = r.next()
    >>> print row[0], row[1]
    é ñ
In Python 3 this is supported out of the box by the built-in csv module. See this example:
    import csv

    with open('some.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        for row in reader:
            print(row)
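For a self-contained check of the Python 3 approach, the following sketch writes a small UTF-8 CSV to a temporary file and reads it back, unpacking three fields per row as in the question. The rows are shortened, hypothetical versions of the question's data, and the temporary file only stands in for a real CSV.

```python
import csv
import os
import tempfile

# Shortened, hypothetical rows modelled on the extract in the question.
rows_out = [
    ["0665000FS10120684", "SD1200IS", "Appareil photo numérique - Bleu"],
    ["0665000FS10120689", "SD1200IS", "Appareil photo numérique - Gris"],
]

# Temporary file standing in for the real CSV.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
try:
    # newline='' lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows_out)

    rows_in = []
    with open(path, newline="", encoding="utf-8") as f:
        for field1, field2, field3 in csv.reader(f):
            rows_in.append([field1, field2, field3])
finally:
    os.remove(path)

print(len(rows_in))
```

Passing encoding explicitly on both open calls avoids depending on the platform default, which is often not UTF-8 on Windows.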