Python: efficient method to replace accents (é to e), remove [^a-zA-Z\d\s], and lower()


Using Python 3.3. I want to do the following:

  • replace special alphabetical characters such as e acute (é) and o circumflex (ô) with the base character (ô to o, for example)
  • remove all characters except alphanumeric and spaces in between alphanumeric characters
  • convert to lowercase

This is what I have so far:

import re

mystring_modified = mystring.replace('\u00E9', 'e').replace('\u00F4', 'o').lower()
alphnumspace = re.compile(r"[^a-zA-Z\d\s]")
mystring_modified = alphnumspace.sub('', mystring_modified)
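Since the chained `replace` calls each make a full pass over the string, one alternative worth sketching is a single `str.translate` pass over a precompiled table (the two-entry mapping below is only an illustrative subset; a real table would cover every accented character in the data):

```python
import re

# Illustrative translation table: extend with whatever accented
# characters actually appear in the input files.
ACCENT_TABLE = str.maketrans({'\u00E9': 'e', '\u00F4': 'o'})
ALPHNUMSPACE = re.compile(r"[^a-zA-Z\d\s]")

def clean(s):
    # One translate pass, one regex pass, one lower() pass.
    s = s.translate(ACCENT_TABLE)
    return ALPHNUMSPACE.sub('', s).lower()
```

Building the table and compiling the regex once, outside the loop, avoids repeating that setup cost for every line of the file.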

How can I improve this? Efficiency is a big concern, especially since I am currently performing the operations inside a loop:

# Pseudocode
for mystring in myfile:
    mystring_modified = # operations described above

The files in question are about 200,000 characters each.

3/7/2013 2:08:35 AM

Accepted Answer

Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.


import unidecode

accented_string = 'Málaga'
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is plain ASCII
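If pulling in a third-party package is not an option, a rough standard-library-only approximation (which drops, rather than transliterates, any character with no ASCII decomposition) is:

```python
import unicodedata

def asciify(s):
    # NFKD splits accented characters into base character + combining
    # mark; encoding to ASCII with errors='ignore' then discards the
    # marks, along with any character that has no ASCII equivalent.
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
```

Note the trade-off: unidecode maps, say, 'ß' to 'ss', whereas this version silently deletes such characters.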
7/21/2017 5:23:40 PM

How about this:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents('A \u00c0 \u0394 \u038E')
'A A Δ Υ'

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
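For reference, the `unicodedata.combining` variant alluded to above would look roughly like this (a sketch of the idea, not MiniQuark's exact code):

```python
import unicodedata

def strip_accents_combining(s):
    # combining() returns a nonzero combining class for combining
    # marks and 0 for ordinary characters, so keep only the latter.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))
```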

And keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts, etc. are not "decoration".

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow