Is it worth using Python's re.compile?


Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')


re.match('hello', 'hello world')
4/3/2009 9:44:08 PM

Accepted Answer

I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I've found the difference to be negligible.

EDIT: After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you're really only changing WHEN the regex gets compiled, and shouldn't be saving much time at all - only the time it takes to check the cache (a key lookup on an internal dict type).

From module (comments are mine):

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def _compile(*key):

    # Does cache check at top of function
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None: return p

    # ...
    # Does actual compilation on cache miss
    # ...

    # Caches compiled regex
    if len(_cache) >= _MAXCACHE:
    _cache[cachekey] = p
    return p

I still often pre-compile regular expressions, but only to bind them to a nice, reusable name, not for any expected performance gain.

4/24/2015 2:51:02 PM

For me, the biggest benefit to re.compile is being able to separate definition of the regex from its use.

Even a simple expression such as 0|[1-9][0-9]* (integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check if you made any typos, and later have to recheck if there are typos when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.

It's certainly possible to store strings and pass them to re.match; however, that's less readable:

num = "..."
# then, much later:
m = re.match(num, input)

Versus compiling:

num = re.compile("...")
# then, much later:
m = num.match(input)

Though it is fairly close, the last line of the second feels more natural and simpler when used repeatedly.

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow