Crawling web pages using Python and Scrapy - Tutorial
In this post, let’s walk through how to crawl web pages using Scrapy.
For this tutorial, we will download all the excerpts and ebooks available at https://www.goodreads.com/ebooks?sort=popular_books. This page is paginated; let’s download books from the first page only. By the end of this post, you will also know how to follow and crawl the other pages.
First, let’s create a Python virtual environment called goodreads.
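Assuming virtualenvwrapper is already set up, that looks like this:
mkvirtualenv goodreads   # create and activate the environment
workon goodreads         # re-activate it later when needed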
To learn more about how the mkvirtualenv and workon commands work, visit the virtualenvwrapper documentation and install it.
Now, let’s install Scrapy.
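Inside the new environment:
pip install scrapy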
After installing, let’s create a new project.
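Scrapy generates the project skeleton for us:
scrapy startproject goodreads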
This will create the following directory structure.
├── goodreads
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
For this tutorial, we will only touch the spiders directory, settings.py, and items.py. The spiders directory will contain all the spiders, also known as crawlers. settings.py holds project-related settings. items.py defines your models. A model, or Item, is the definition of an object that you are going to crawl. For example, if we crawl stock details from a page, we can define an item like the one below.
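Something along these lines, with purely illustrative field names:

import scrapy

class StockItem(scrapy.Item):
    # Illustrative fields only; use whatever details you actually crawl.
    name = scrapy.Field()
    price = scrapy.Field()
    volume = scrapy.Field()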
For our project, we will create an item with the following field. Open goodreads/items.py and add the following lines.
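Assuming the default item class generated by startproject, items.py ends up looking like this:

import scrapy

class GoodreadsItem(scrapy.Item):
    file_name = scrapy.Field()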
We will save the path of the downloaded document in the file_name field.
Now let’s create a spider to crawl the books. Run the following command in a terminal.
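Since the spider will rely on crawl rules, the crawl spider template fits best; from inside the project directory, the command would be something like:
scrapy genspider -t crawl goodread_spider goodreads.com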
This will generate a spider file called goodread_spider.py. A spider is something that crawls web pages and follows the links on those pages to crawl further pages. The spider will not follow links on a page by itself; we have to define rules that tell it which links to follow.
Open goodreads/spiders/goodread_spider.py
Let’s adjust some variables according to our site.
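A sketch of the top of the spider after those changes (the exact class name is whatever genspider generated for you, and we import the GoodreadsItem defined earlier):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from goodreads.items import GoodreadsItem


class GoodreadSpider(CrawlSpider):
    name = 'goodread_spider'
    # goodreads.com serves the listing pages, s3.amazonaws.com serves the ebook files.
    allowed_domains = ['goodreads.com', 's3.amazonaws.com']
    start_urls = ['https://www.goodreads.com/ebooks?sort=popular_books']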
s3.amazonaws.com is where the Goodreads ebook files are hosted, so we have to add this domain to the allowed_domains list.
Now let’s add a rule to follow and download the ebooks. Edit the generated rule to match the following line.
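The allow pattern below is a guess based on the ebook/download/<item_id> links on the listing page; adjust it if the links look different:

    # Inside the spider class:
    rules = (
        Rule(LinkExtractor(allow=r'ebook/download/'),
             callback='parse_item', follow=True),
    )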
follow=True means the spider will follow any other links present in the crawled page. Even with follow set to True, a link is only followed if it matches one of the defined rules. ebook/download/<item_id> will actually return the ebook document itself. It may be any type of file, including pdf, epub, mobi, and zip.
Every time an item is fetched, our callback function parse_item will be called with the response object.
Let’s make some changes to the parse_item function in the same file to save our downloaded books. Edit the parse_item function to match the following lines.
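A sketch of such a method, assuming the GoodreadsItem from items.py and naming each file after the last segment of its URL:

    # Inside the spider class:
    def parse_item(self, response):
        # Skip HTML pages; we only want the actual ebook files.
        content_type = response.headers.get('Content-Type', b'').decode()
        if 'text/html' in content_type:
            return

        # Name the file after the last segment of the download URL.
        file_name = response.url.split('/')[-1]
        with open(file_name, 'wb') as f:
            f.write(response.body)

        item = GoodreadsItem()
        item['file_name'] = file_name
        return item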
We have added an if condition to make sure we save only books, not HTML pages. If the content type is text/html, we discard the response; otherwise we save the response body to a file.
Now let’s run the crawler.
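From the project directory:
scrapy crawl goodread_spider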
As mentioned earlier, this spider will crawl books from only the first page. To crawl all the books across the other pages, we have to add one more rule. Add the line below to the rules tuple in goodread_spider.py.
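The pattern below is an assumption about how the pagination links look (ebooks?page=2 and so on):

        Rule(LinkExtractor(allow=r'ebooks\?.*page=\d+'), follow=True),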
That’s it for now. Use settings.py to change project-related settings. For example, the user agent can be changed there: open goodreads/settings.py and edit USER_AGENT. You can also set a delay between page requests; DOWNLOAD_DELAY = 0.25 makes the crawler wait 250 ms between consecutive requests.
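For example (the user agent string below is only a placeholder; put your own details there):

# goodreads/settings.py
USER_AGENT = 'goodreads-tutorial (+https://your-website.example)'
DOWNLOAD_DELAY = 0.25  # wait 250 ms between requests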
As a final note, crawling a website puts extra load on its server. The example given in this tutorial is for educational purposes only. Crawl responsibly by identifying yourself (and your website) in the user agent.