This will surely be an easy one but it is really bugging me.
I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents.
All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.
Every time I go to print out a variable that holds 'String' I get
[u'String'] printed to the screen. Is there a simple way of getting this back into just ascii or should I write a regex to strip it?
[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.
I don't know exaxtly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your test is really only ASCII you would use this:
However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it's latin-1 or utf-8.
Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:
You probably have a list containing one unicode string. The
repr of this is
You can convert this to a list of byte strings using any variation of the following:
# Functional style. print map(lambda x: x.encode('ascii'), my_list) # List comprehension. print [x.encode('ascii') for x in my_list] # Interesting if my_list may be a tuple or a string. print type(my_list)(x.encode('ascii') for x in my_list) # What do I care about the brackets anyway? print ', '.join(repr(x.encode('ascii')) for x in my_list) # That's actually not a good way of doing it. print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)