Crawling web pages using Python and Scrapy - Tutorial
In this post, let’s walk through how to crawl web pages using Scrapy.
For this tutorial, we will download all the excerpts and ebooks available at https://www.goodreads.com/ebooks?sort=popular_books. This page is paginated; let’s download books from the first page only. By the end of this post, you will also know how to follow and crawl the other pages.
First, let’s create a Python virtual environment called goodreads.
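Assuming virtualenvwrapper is already set up, that looks like this:
mkvirtualenv goodreads   # create and activate the environment
workon goodreads         # re-activate it later when needed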
To learn more about how the mkvirtualenv and workon commands work, visit the virtualenvwrapper documentation and install it.
Now, let’s install Scrapy.
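Inside the new environment:
pip install scrapy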
After installing, let’s create a new project.
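Scrapy generates the project skeleton for us:
scrapy startproject goodreads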
This will create the following directory structure.
├── goodreads
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
For this tutorial, we will only touch the spiders directory, settings.py, and items.py. The spiders directory will contain all the spiders, also known as crawlers. settings.py holds project-related settings. items.py defines your models. A model, or Item, is the definition of an object that you are going to crawl. For example, if we crawl stock details from a page, we can define an item like the one below.
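Something along these lines, with purely illustrative field names:

import scrapy

class StockItem(scrapy.Item):
    # Illustrative fields only; use whatever details you actually crawl.
    name = scrapy.Field()
    price = scrapy.Field()
    volume = scrapy.Field()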
For our project, we will create an item with the following field. Open goodreads/items.py and add the following lines.
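Assuming the default item class generated by startproject, items.py ends up looking like this:

import scrapy

class GoodreadsItem(scrapy.Item):
    file_name = scrapy.Field()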
We will save the path of the downloaded document in the file_name field.
Now let’s create a spider to crawl the books. Run the following command in a terminal.
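Since the spider will rely on crawl rules, the crawl spider template fits best; from inside the project directory, the command would be something like:
scrapy genspider -t crawl goodread_spider goodreads.com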
This will generate a spider file called goodread_spider.py. A spider is something that crawls web pages and follows the links on those pages to crawl further pages. The spider will not follow links on a page by itself; we have to define rules that tell it which links to follow.
Open goodreads/spiders/goodread_spider.py
Let’s adjust some variables according to our site.
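A sketch of the top of the spider after those changes (the exact class name is whatever genspider generated for you, and we import the GoodreadsItem defined earlier):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from goodreads.items import GoodreadsItem


class GoodreadSpider(CrawlSpider):
    name = 'goodread_spider'
    # goodreads.com serves the listing pages, s3.amazonaws.com serves the ebook files.
    allowed_domains = ['goodreads.com', 's3.amazonaws.com']
    start_urls = ['https://www.goodreads.com/ebooks?sort=popular_books']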
s3.amazonaws.com is where the Goodreads ebook files are hosted, so we have to add this domain to the allowed_domains list.
Now let’s add a rule to follow and download the ebooks. Edit the generated rule to match the following line.
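The allow pattern below is a guess based on the ebook/download/<item_id> links on the listing page; adjust it if the links look different:

    # Inside the spider class:
    rules = (
        Rule(LinkExtractor(allow=r'ebook/download/'),
             callback='parse_item', follow=True),
    )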
follow=True means the spider will follow any other links present in the crawled page. Even with follow set to True, a link is only followed if it matches one of the defined rules. ebook/download/<item_id> will actually return the ebook document itself. It may be any type of file, including pdf, epub, mobi, and zip.
Every time an item is fetched, our callback function parse_item will be called with the response object.
Let’s make some changes to the parse_item function in the same file to save our downloaded books. Edit the parse_item function to match the following lines.
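A sketch of such a method, assuming the GoodreadsItem from items.py and naming each file after the last segment of its URL:

    # Inside the spider class:
    def parse_item(self, response):
        # Skip HTML pages; we only want the actual ebook files.
        content_type = response.headers.get('Content-Type', b'').decode()
        if 'text/html' in content_type:
            return

        # Name the file after the last segment of the download URL.
        file_name = response.url.split('/')[-1]
        with open(file_name, 'wb') as f:
            f.write(response.body)

        item = GoodreadsItem()
        item['file_name'] = file_name
        return item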
We have added an if condition to make sure we save only books, not HTML pages. If the content type is text/html, we discard the response; otherwise we save the response body to a file.
Now let’s run the crawler.
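From the project directory:
scrapy crawl goodread_spider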
As mentioned earlier, this spider will crawl books from only the first page. To crawl all the books across the other pages, we have to add one more rule. Add the line below to the rules tuple in goodread_spider.py.
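The pattern below is an assumption about how the pagination links look (ebooks?page=2 and so on):

        Rule(LinkExtractor(allow=r'ebooks\?.*page=\d+'), follow=True),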
That’s it for now. Use settings.py to change project-related settings. For example, the user agent can be changed there: open goodreads/settings.py and edit USER_AGENT. You can also set a delay between page requests; DOWNLOAD_DELAY = 0.25 makes the crawler wait 250 ms between consecutive requests.
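For example (the user agent string below is only a placeholder; put your own details there):

# goodreads/settings.py
USER_AGENT = 'goodreads-tutorial (+https://your-website.example)'
DOWNLOAD_DELAY = 0.25  # wait 250 ms between requests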
As a final note, crawling a website puts extra load on its server. The example given in this tutorial is for educational purposes only. Crawl responsibly by identifying yourself (and your website) in the user agent.