I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:
Your number is <b>123</b>
Now how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?
    import re

    # Raw string so that \d reaches the regex engine unchanged.
    m = re.search(r"Your number is <b>(\d+)</b>", "xxx Your number is <b>123</b> fdjsk")
    if m:
        print(m.groups())  # prints ('123',)
Given

    s = "Your number is <b>123</b>"

then

    import re
    m = re.search(r"\d+", s)

will work, and m.group() will give you '123'.
The regular expression looks for 1 or more consecutive digits in your string.
Note that in this specific case we knew that there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure that m contained a valid match object, because re.search() returns None when nothing matches, and m.group() would then result in an AttributeError.
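For input where a number may be missing, the check on the return value of re.search() could look roughly like the sketch below. The function name extract_number is only an illustration and does not come from the answer above.

```python
import re


def extract_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r"\d+", text)
    if m is None:
        # No match: calling m.group() here would raise AttributeError.
        return None
    return m.group()


print(extract_number("Your number is <b>123</b>"))  # prints 123
print(extract_number("no digits here"))             # prints None
```

Returning None instead of raising keeps the decision about what to do with a missing number in the caller's hands.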
Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
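To give a feel for the parser-based approach without installing anything, here is a minimal sketch that uses html.parser from the standard library as a stand-in. It is not BeautifulSoup, and the class FirstBoldAfter and the helper first_bold_after are hypothetical names made up for this example. It assumes simple, non-nested <b> tags.

```python
from html.parser import HTMLParser


class FirstBoldAfter(HTMLParser):
    """Collect the text of the first <b> element that follows a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.seen_marker = False  # True once the marker text has gone by
        self.in_bold = False      # True while inside a <b> we care about
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == "b" and self.seen_marker and self.result is None:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold and self.result is None:
            self.result = data
        elif self.marker in data:
            self.seen_marker = True


def first_bold_after(html, marker):
    """Return the text of the first <b> after marker, or None if not found."""
    parser = FirstBoldAfter(marker)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.result


print(first_bold_after("xxx Your number is <b>123</b> fdjsk", "Your number is"))  # prints 123
```

Unlike the regular expression, this does not depend on the exact spacing between the marker text and the tag, which is the kind of brittleness a real HTML parser is meant to remove.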