Stripping everything but alphanumeric chars from a string in Python


Question

What is the best way to strip all non alphanumeric characters from a string, using Python?

The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don't seem very 'pythonic' to me.

For the record, I don't just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.

1
285
5/23/2017 12:26:07 PM

Accepted Answer

I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the built-in string module). The use of compiled '[\W_]+' and pattern.sub('', str) was found to be fastest.

$ python -m timeit -s \
     "import string" \
     "''.join(ch for ch in string.printable if ch.isalnum())" 
10000 loops, best of 3: 57.6 usec per loop

$ python -m timeit -s \
    "import string" \
    "filter(str.isalnum, string.printable)"                 
10000 loops, best of 3: 37.9 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]+', '', string.printable)"                
100000 loops, best of 3: 15 usec per loop

$ python -m timeit -s \
    "import re, string; pattern = re.compile('[\W_]+')" \
    "pattern.sub('', string.printable)" 
100000 loops, best of 3: 11.2 usec per loop
294
7/4/2019 2:26:50 PM

Regular expressions to the rescue:

import re
re.sub(r'\W+', '', your_string)

By Python definition '\W == [^a-zA-Z0-9_], which excludes all numbers, letters and _


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon