Parse HTML via XPath


In .NET, I found a great library, HtmlAgilityPack, that lets you easily parse non-well-formed HTML using XPath. I've used it for a couple of years in my .NET sites, but I've had to settle for more painful libraries in my Python, Ruby, and other projects. Is anyone aware of similar libraries for other languages?

11/13/2008 8:18:01 AM

Accepted Answer

In Python, ElementTidy parses tag soup and produces an element tree, which you can query with the XPath subset that ElementTree's find methods support:

>>> from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilder as TB
>>> tb = TB()
>>> tb.feed("<p>Hello world")
>>> e = tb.close()
>>> e.find(".//{http://www.w3.org/1999/xhtml}p")
<Element {http://www.w3.org/1999/xhtml}p at 264eb8>
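
If ElementTidy isn't available for your Python version, the same idea (tag soup in, element tree out, then query with a path expression) can be sketched with only the standard library. This is a rough illustration, not a replacement for a real HTML parser: `SoupTreeBuilder`, `parse_soup`, and the synthetic `root` wrapper element are names made up for this example, and it does not apply HTML's implicit-closing rules (a second `<p>` ends up nested inside the first).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML (partial list, for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SoupTreeBuilder(HTMLParser):
    """Hypothetical helper: feed it lenient HTML, get an ElementTree element."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self._open = []                      # stack of currently open tags
        self._builder.start("root", {})      # synthetic wrapper element

    def handle_starttag(self, tag, attrs):
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: open and close immediately.
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        self._builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return                           # stray closing tag: ignore it
        # Close everything up to and including the matching open tag.
        while self._open:
            current = self._open.pop()
            self._builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        super().close()                      # flush any buffered text
        while self._open:                    # close tags the input left open
            self._builder.end(self._open.pop())
        self._builder.end("root")
        return self._builder.close()


def parse_soup(html):
    parser = SoupTreeBuilder()
    parser.feed(html)
    return parser.close()


root = parse_soup("<p>Hello world")
print(root.find(".//p").text)
```

Note that `find` and `findall` here accept only ElementTree's limited XPath subset (child and descendant steps, simple attribute predicates such as `.//p[@class='intro']`), not full XPath.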
11/14/2008 3:37:03 AM

I'm surprised there isn't a single mention of lxml. It's built on the C libraries libxml2 and libxslt, so it's blazingly fast, and it will work in any environment that allows CPython extension modules.

Here's how you can parse HTML and query it with XPath using lxml:

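>>> from lxml import etree
>>> doc = '<foo><bar></bar></foo>'
>>> # etree.HTML wraps the fragment in <html><body>, so match with // rather than /
>>> tree = etree.HTML(doc)

>>> r = tree.xpath('//foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'

>>> r = tree.xpath('//bar')
>>> r[0].tag
'bar'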
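
Since the question is about markup that isn't well formed, here is a small sketch of lxml on input with no `<html>` or `<body>` and several unclosed tags. The sample string and variable names are made up for illustration; it assumes lxml is installed.

```python
from lxml import etree

# Tag soup: no <html>/<body>, and the <li>, <p>, and <b> elements are never closed.
soup = "<ul><li>One<li>Two</ul><p class='note'>Hello <b>world"

# etree.HTML uses a recovering HTML parser and returns the root <html> element.
root = etree.HTML(soup)

# Full XPath 1.0 is available, including text() and attribute predicates.
items = root.xpath("//li/text()")
bold = root.xpath("//p[@class='note']/b/text()")

print(root.tag)
print(items)
print(bold)
```

Because the parser adds the missing `<html>` and `<body>` wrappers, expressions that start with `//` are usually the safer choice over absolute paths from the document root.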

