Using Python 3.3. I want to do the following: replace accented characters such as é and ô with their base equivalents, lowercase the result, and strip everything except alphanumeric characters and spaces.

This is what I have so far:
```python
import re

mystring_modified = mystring.replace('\u00E9', 'e').replace('\u00F4', 'o').lower()
alphnumspace = re.compile(r"[^a-zA-Z\d\s]")
mystring_modified = alphnumspace.sub('', mystring_modified)
```
How can I improve this? Efficiency is a big concern, especially since I am currently performing the operations inside a loop:
```python
# Pseudocode
for mystring in myfile:
    mystring_modified = # operations described above
    mylist.append(mystring_modified)
```
The files in question are about 200,000 characters each.
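The pseudocode above can be made concrete. A minimal sketch (assuming `myfile` yields lines of text, and that the two explicit replacements plus the regex are the full cleanup; the sample `lines` data is hypothetical). Precompiling the regex once, outside the loop, is the main efficiency gain over recompiling per line:

```python
import re

# Compile once, not once per line.
alphnumspace = re.compile(r"[^a-zA-Z\d\s]")

def clean(line):
    """Replace the two accented characters, lowercase, and strip
    everything except letters, digits, and whitespace."""
    line = line.replace('\u00E9', 'e').replace('\u00F4', 'o').lower()
    return alphnumspace.sub('', line)

# Hypothetical stand-in for the file's lines:
lines = ['Caf\u00e9 #1', 'H\u00f4tel!']
mylist = [clean(s) for s in lines]
# mylist == ['cafe 1', 'hotel']
```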
Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.
```python
import unidecode

accented_string = u'Málaga'
# accented_string is of type 'unicode'
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'
```
How about this:
```python
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
```
This works on greek letters, too:
```
>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
```
The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
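The combining-based variant mentioned above could look like this (a sketch; it relies on `unicodedata.combining` returning 0 for non-combining characters, so any nonzero value marks a combining mark to drop):

```python
import unicodedata

def strip_accents_combining(s):
    # Decompose, then drop every combining character (combining() != 0).
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))

print(strip_accents_combining('A \u00c0 \u0394 \u038e'))
# Greek base letters survive; only the accent marks are stripped.
```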
And keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts, etc. are not "decoration".