How can I remove all characters except numbers from string?
In Python 2.*, by far the fastest approach is the
>>> x='aaa12333bb445bb54b5b52' >>> import string >>> all=string.maketrans('','') >>> nodigs=all.translate(all, string.digits) >>> x.translate(all, nodigs) '1233344554552' >>>
string.maketrans makes a translation table (a string of length 256) which in this case is the same as
''.join(chr(x) for x in range(256)) (just faster to make;-).
.translate applies the translation table (which here is irrelevant since
all essentially means identity) AND deletes characters present in the second argument -- the key part.
.translate works very differently on Unicode strings (and strings in Python 3 -- I do wish questions specified which major-release of Python is of interest!) -- not quite this simple, not quite this fast, though still quite usable.
Back to 2.*, the performance difference is impressive...:
$ python -mtimeit -s'import string; all=string.maketrans("", ""); nodig=all.translate(all, string.digits); x="aaa12333bb445bb54b5b52"' 'x.translate(all, nodig)' 1000000 loops, best of 3: 1.04 usec per loop $ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)' 100000 loops, best of 3: 7.9 usec per loop
Speeding things up by 7-8 times is hardly peanuts, so the
translate method is well worth knowing and using. The other popular non-RE approach...:
$ python -mtimeit -s'x="aaa12333bb445bb54b5b52"' '"".join(i for i in x if i.isdigit())' 100000 loops, best of 3: 11.5 usec per loop
is 50% slower than RE, so the
.translate approach beats it by over an order of magnitude.
In Python 3, or for Unicode, you need to pass
.translate a mapping (with ordinals, not characters directly, as keys) that returns
None for what you want to delete. Here's a convenient way to express this for deletion of "everything but" a few characters:
import string class Del: def __init__(self, keep=string.digits): self.comp = dict((ord(c),c) for c in keep) def __getitem__(self, k): return self.comp.get(k) DD = Del() x='aaa12333bb445bb54b5b52' x.translate(DD)
'1233344554552'. However, putting this in xx.py we have...:
$ python3.1 -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)' 100000 loops, best of 3: 8.43 usec per loop $ python3.1 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)' 10000 loops, best of 3: 24.3 usec per loop
...which shows the performance advantage disappears, for this kind of "deletion" tasks, and becomes a performance decrease.
re.sub, like so:
>>> import re >>> re.sub("\D", "", "aas30dsa20") '3020'
\D matches any non-digit character so, the code above, is essentially replacing every non-digit character for the empty string.
Or you can use
filter, like so (in Python 2k):
>>> filter(lambda x: x.isdigit(), "aas30dsa20") '3020'
Since in Python 3k,
filter returns an iterator instead of a
list, you can use the following instead:
>>> ''.join(filter(lambda x: x.isdigit(), "aas30dsa20")) '3020'