Web scraping is an automated, programmatic process through which data is continually 'scraped' off webpages. Also known as screen scraping or web harvesting, web scraping can provide instant data from any publicly accessible webpage. On some websites, web scraping may be illegal or may violate the site's terms of service.
Useful Python packages for web scraping (alphabetical order)
lxml: Processes HTML and XML. Can be used to query and select content from HTML documents via CSS selectors and XPath.
requests-cache: Caching for requests; caching data is very useful. In development, it means you can avoid hitting a site unnecessarily. While running a real collection, it means that if your scraper crashes for some reason (maybe you didn't handle some unusual content on the site, or maybe the site went down) you can repeat the collection very quickly from where you left off.
Basic example of using requests and lxml to scrape some data
# For Python 2 compatibility.
from __future__ import print_function

import requests
import lxml.html

def main():
    r = requests.get("https://httpbin.org")
    html_source = r.text
    root_element = lxml.html.fromstring(html_source)
    # Note root_element.xpath() gives a *list* of results.
    # XPath specifies a path to the element we want.
    page_title = root_element.xpath('/html/head/title/text()')[0]
    print(page_title)

if __name__ == '__main__':
    main()
Maintaining web-scraping session with requests
It is a good idea to maintain a web-scraping session to persist cookies and other parameters. Additionally, it can result in a performance improvement, because requests.Session reuses the underlying TCP connection to a host:
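A minimal sketch of such a session (the User-Agent string and the httpbin.org endpoints are only illustrative):

```python
import requests

# Cookies set by the server on one request are automatically sent
# back on later requests made through the same session, and the
# TCP connection to the host is reused between requests.
with requests.Session() as session:
    # The User-Agent value here is just an example placeholder.
    session.headers.update({"User-Agent": "my-scraper/0.1"})
    session.get("https://httpbin.org/cookies/set?name=value")
    r = session.get("https://httpbin.org/cookies")
    print(r.text)
```

Headers set on the session, like the User-Agent above, are sent with every request made through it.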
Modify Scrapy user agent
Sometimes the default Scrapy user agent ("Scrapy/VERSION (+http://scrapy.org)") is blocked by the host. To change the default user agent, open settings.py, then uncomment and edit the USER_AGENT line to whatever you want.
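For example, in projectName/settings.py (the browser-like string below is only an illustration; use whatever value suits your collection):

```python
# projectName/settings.py
# Uncomment the USER_AGENT line and set it to the string you want.
# This browser-like value is just an example.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36"
```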
Scraping using BeautifulSoup4
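BeautifulSoup parses an HTML document into a tree you can search with simple method calls. A minimal sketch, assuming bs4 is installed (the URL and the tag looked up are illustrative):

```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://httpbin.org/html")
soup = BeautifulSoup(r.text, "html.parser")
# Find the first <h1> element on the page and print its text.
print(soup.find("h1").get_text())
```

Methods like find_all() return every matching element rather than just the first.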
Scraping using Selenium WebDriver
Some websites don’t like to be scraped. In these cases you may need to simulate a real user working with a browser. Selenium launches and controls a web browser.
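A minimal sketch, assuming Selenium 4 with Chrome and a matching chromedriver installed; the URL and element lookup are illustrative:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # launches a real Chrome window
try:
    driver.get("https://httpbin.org/html")
    # Selenium sees the page as a browser renders it, after any
    # JavaScript has run.
    heading = driver.find_element(By.TAG_NAME, "h1")
    print(heading.text)
finally:
    driver.quit()  # always close the browser when done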
Scraping using the Scrapy framework
First you have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject projectName
To scrape we need a spider. Spiders define how a certain site will be scraped. Here’s the code for a spider that follows the links to the top-voted questions on StackOverflow and scrapes some data from each page:
Save your spider classes in the projectName\spiders directory. In this case - projectName\spiders\stackoverflow_spider.py.
Now you can use your spider. For example, try running (in the project's directory):

scrapy crawl stackoverflow -o top-stackoverflow-questions.json

Here 'stackoverflow' must match the spider's name attribute, and -o writes the scraped items to a JSON file.
Scraping with curl
-s: silent mode (suppresses the progress meter and error messages)
-A: sets the User-Agent header sent with the request
Simple web content download with urllib.request
The standard library module urllib.request can be used to download web content:
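A minimal sketch; the URL is illustrative, and the response is decoded assuming UTF-8 (real code should check the Content-Type header):

```python
import urllib.request

# urlopen returns a file-like response object; read() gives bytes.
with urllib.request.urlopen("https://httpbin.org/html") as response:
    data = response.read()

# Decode the raw bytes into text, assuming UTF-8.
text = data.decode("utf-8")
print(text[:200])
```

No third-party packages are needed, which makes this suitable for quick one-off downloads.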