Decode HTML entities in Python string?


I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m".

11/28/2015 8:18:00 PM

Accepted Answer

Python 3.4+

Use html.unescape():

import html

FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It will be removed from the language soon.

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
7/2/2019 5:13:19 PM

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.

Beautiful Soup 3

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow