How can I retrieve the links of a webpage and copy the url address of the links using Python?
Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2 from BeautifulSoup import BeautifulSoup, SoupStrainer http = httplib2.Http() status, response = http.request('http://www.nytimes.com') for link in BeautifulSoup(response, parse_only=SoupStrainer('a')): if link.has_attr('href'): print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory and speed wise), if you know what you're parsing in advance.
For completeness sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup import urllib2 resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks") soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset')) for link in soup.find_all('a', href=True): print link['href']
or the Python 3 version:
from bs4 import BeautifulSoup import urllib.request resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks") soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset')) for link in soup.find_all('a', href=True): print(link['href'])
and a version using the
requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup from bs4.dammit import EncodingDetector import requests resp = requests.get("http://www.gpsbasecamp.com/national-parks") http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True) encoding = html_encoding or http_encoding soup = BeautifulSoup(resp.content, from_encoding=encoding) for link in soup.find_all('a', href=True): print(link['href'])
soup.find_all('a', href=True) call finds all
<a> elements that have an
href attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the characterset found in the HTTP response headers to assist in decoding, but this can be wrong and conflicting with a
<meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method
EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.
response.encoding attribute defaults to Latin-1 if the response has a
text/* mimetype, even if no characterset was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no
charset is set in the Content-Type header.