Strip HTML from strings in Python


Question

from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

When printing a line in an HTML file, I'm trying to find a way to only show the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>', it will only print 'some text', '<b>hello</b>' prints 'hello', etc. How would one go about doing this?

1
249
7/20/2013 9:38:47 PM

Accepted Answer

I always used this function to strip HTML tags, as it requires only the Python stdlib:

On Python 2

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

For Python 3

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Note: this works only for 3.1. For 3.2 or above, you need to call the parent class's init function. See Using HTMLParser in Python 3.2

388
5/23/2017 11:55:02 AM

I haven't thought much about the cases it will miss, but you can do a simple regex:

re.sub('<[^<]+?>', '', text)

For those that don't understand regex, this searches for a string <...>, where the inner content is made of one or more (+) characters that isn't a <. The ? means that it will match the smallest string it can find. For example given <p>Hello</p>, it will match <'p> and </p> separately with the ?. Without it, it will match the entire string <..Hello..>.

If non-tag < appears in html (eg. 2 < 3), it should be written as an escape sequence &... anyway so the ^< may be unnecessary.


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon