In my quest for optimization, I discovered that the built-in split() method is about 40% faster than the re.split() equivalent.
A dummy benchmark (easily copy-pasteable):
```python
import random
import re
import time

def random_string(length):
    letters = "ABC"
    return "".join(random.choice(letters) for _ in range(length))

s = random_string(2000000)
pattern = re.compile(r"A")

start = time.time()
pattern.split(s)
print("with re.split:", time.time() - start)

start = time.time()
s.split("A")
print("with built-in split:", time.time() - start)
```
Why this difference?
re.split is expected to be slower, since using regular expressions incurs some overhead.
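To measure this overhead more reliably than with hand-rolled time.time() deltas, the timeit module repeats each statement and averages out noise. A minimal Python 3 sketch (the string size and repeat count here are arbitrary choices, not from the original benchmark):

```python
import random
import re
import timeit

# Build a test string of random 'A'/'B'/'C' characters; seeded for repeatability.
random.seed(0)
s = "".join(random.choice("ABC") for _ in range(200_000))
pattern = re.compile("A")

# timeit runs each callable `number` times and returns the total seconds.
t_re = timeit.timeit(lambda: pattern.split(s), number=20)
t_str = timeit.timeit(lambda: s.split("A"), number=20)

print(f"re.split:  {t_re:.3f}s")
print(f"str.split: {t_str:.3f}s")
```

For a literal single-character pattern like "A", both calls produce identical lists, so the comparison isolates the regex-engine overhead.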
Of course, if you are splitting on a constant string, there is no point in using a regular expression.
When in doubt, check the source code: you can see that Python's s.split() is optimized for whitespace and inlined, but it handles fixed delimiters only.
In exchange for the speed, a regular-expression-based re.split is far more flexible.
```python
>>> re.split(r':+', "One:two::t h r e e:::fourth field")
['One', 'two', 't h r e e', 'fourth field']
>>> "One:two::t h r e e:::fourth field".split(':')
['One', 'two', '', 't h r e e', '', '', 'fourth field']
# would require an additional step to drop the empty fields...
>>> re.split(r'[:\d]+', "One:two:2:t h r e e:3::fourth field")
['One', 'two', 't h r e e', 'fourth field']
# try that without a regex split in an understandable way...
```
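For the repeated-delimiter case, the cleanup step mentioned above can be done without a regex at the cost of an extra pass over the result. A sketch (variable names here are illustrative):

```python
text = "One:two::t h r e e:::fourth field"

# str.split keeps the empty fields produced by consecutive delimiters...
raw = text.split(":")

# ...so filtering them out requires an additional pass over the list.
fields = [f for f in raw if f]

print(fields)  # → ['One', 'two', 't h r e e', 'fourth field']
```

This matches the re.split(r':+', ...) output for this input, but note it also discards leading and trailing empty fields, which may or may not be what you want.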
That re.split() is only 29% slower (or that s.split() is only 40% faster) is what should be amazing.