How do I perform HTML decoding/encoding using Python/Django?


Question

I have a string that is HTML encoded:

'''<img class="size-medium wp-image-113"\
 style="margin-left: 15px;" title="su1"\
 src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg"\
 alt="" width="300" height="194" />'''

I want to change that to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" /> 

I want this to register as HTML so that it is rendered as an image by the browser instead of being displayed as text.

The string is stored like that because I am using a web-scraping tool called BeautifulSoup, it "scans" a web-page and gets certain content from it, then returns the string in that format.

I've found how to do this in C# but not in Python. Can someone help me out?

1
118
4/21/2019 12:04:46 AM

Accepted Answer

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&l
t;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

To reverse this, the Cheetah function described in Jake's answer should work, but is missing the single-quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems:

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)

This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.x:
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# >= Python 3.5:
from html import unescape
unescaped = unescape(my_string)

As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:

{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}
107
1/22/2019 3:34:48 PM

With the standard library:

  • HTML Escape

    try:
        from html import escape  # python 3.x
    except ImportError:
        from cgi import escape  # python 2.x
    
    print(escape("<"))
    
  • HTML Unescape

    try:
        from html import unescape  # python 3.4+
    except ImportError:
        try:
            from html.parser import HTMLParser  # python 3.x (<3.4)
        except ImportError:
            from HTMLParser import HTMLParser  # python 2.x
        unescape = HTMLParser().unescape
    
    print(unescape("&gt;"))
    

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon