Web Scraping With Scrapy


Scrapy is one of the most popular web-scraping frameworks in the world, and it earns this reputation: it's highly performant, easily accessible and extensible.

In this Python + Scrapy tutorial we'll start by taking a look at what Scrapy actually is, what a Scrapy project consists of and some common tips and tricks.
Finally, we'll solidify this knowledge through a scrapy example project by scraping product data from producthunt.com.

Introduction To Scrapy

Scrapy for Python is a web scraping framework built around the Twisted asynchronous networking engine, which means it isn't built on Python's standard asyncio (async/await) infrastructure.
While it's important to be aware of the base architecture, we rarely need to touch Twisted, as Scrapy abstracts it away with its own interface. From the user's perspective, we'll mostly be working with callbacks and generators.


Simplified relation between scrapy's Crawler and project's Spiders

As you can see in this illustration, Scrapy comes with an engine called Crawler (light blue) which handles low-level logic like HTTP connections, scheduling and the entire program flow.
What it's missing is the high-level logic (dark blue) of what to scrape and how to do it. This is called a Spider. In other words, we must provide the crawler with a Scrapy spider object that generates requests to retrieve and results to store.

Before we create our first Spider let's start off with a short glossary:

Callback
Since Scrapy is an asynchronous framework, a lot of actions happen in the background, which allows us to produce highly concurrent and efficient code. A callback is a function that we attach to a background task and that is called when this task finishes successfully.
Errorback
Same as a callback, but called for a failed task rather than a successful one.
Generator
In Python, generators are functions that, instead of returning all results at once (like a list), return them one by one.
Settings
Scrapy is configured through a central configuration object called settings. Project settings are located in the settings.py file.

It's important to visualize this architecture, as this is the core working principle of all Scrapy-based scrapers: we'll write generators that generate either requests with callbacks or results that will be saved to storage.
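To make the callback-and-generator flow concrete, here is a toy model in plain Python. It is only a sketch of the idea and not how Scrapy is implemented: there is no networking, scheduling or concurrency, and `FakeRequest`, `fake_download` and `run` are names invented for this illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeRequest:
    """Stand-in for a request: a URL plus the callback that handles its response."""
    url: str
    callback: Callable


def fake_download(url: str) -> str:
    """Pretend to fetch a page; a real engine would do this in the background."""
    return f"<html>content of {url}</html>"


def parse_listing(response: str):
    # a callback is a generator: it can yield follow-up requests...
    for slug in ("slack", "notion"):
        yield FakeRequest(f"/posts/{slug}", callback=parse_product)


def parse_product(response: str):
    # ...or results that should be stored
    yield {"page_length": len(response)}


def run(start: FakeRequest) -> list:
    """Tiny engine loop: download, call back, then sort whatever the callback yields."""
    queue, results = deque([start]), []
    while queue:
        request = queue.popleft()
        for output in request.callback(fake_download(request.url)):
            if isinstance(output, FakeRequest):
                queue.append(output)    # schedule another request
            else:
                results.append(output)  # save a result
    return results


items = run(FakeRequest("/topics/developer-tools", callback=parse_listing))
print(items)
```

The spider side (`parse_listing`, `parse_product`) only describes what to fetch and what to keep, while the loop in `run` decides when things happen. That is the same division of labour as between a Scrapy spider and the crawler.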


In this section, we'll introduce ourselves to scrapy through an example project. We'll be scraping product data from https://www.producthunt.com/. We'll write a scraper that will:

  1. Go to product directory listing (e.g. https://www.producthunt.com/topics/developer-tools)
  2. Find product urls (e.g. https://www.producthunt.com/posts/slack)
  3. Go to every product url
  4. Extract product's title, subtitle, score and tags

Setup

Scrapy can be installed through pip install scrapy command, and it comes with a convenient terminal command scrapy.

Installation of scrapy might be a bit more complex on some systems, see official scrapy installation guide for more information

This scrapy command has 2 possible contexts: global context and project context. In this article we'll focus on using the project context, so we first need to create a Scrapy project:

$ scrapy startproject producthunt producthunt-scraper
#                     ^ name      ^ project directory
$ cd producthunt-scraper
$ tree
.
├── producthunt
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py 
│   └── spiders
│       ├── __init__.py 
└── scrapy.cfg

As you can see, the startproject command created this project structure for us, which is mostly empty. However, if we run the scrapy --help command in this new directory we'll notice a bunch of new commands - now we're working in the project context:

$ scrapy --help
Scrapy 1.8.1 - project: producthunt

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Creating Spiders

Currently, we have no Scrapy spiders in our project. If we run scrapy list it'll show us nothing, so let's create our first spider:

$ scrapy genspider products producthunt.com
#                  ^ name   ^ host we'll be scraping
Created spider 'products' using template 'basic' in module:
  producthunt.spiders.products
$ tree
.
├── producthunt
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── products.py  <--- New spider
└── scrapy.cfg
$ scrapy list
products 
# 1 spider has been found!

The generated spider doesn't do much other than give us a starting framework:

# /spiders/products.py
import scrapy


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['producthunt.com']
    start_urls = ['http://producthunt.com/']

    def parse(self, response):
        pass

Let's take a look at these fields:

  • name is used as a reference to this spider for scrapy commands such as scrapy crawl <name> which would run this scraper.
  • allowed_domains is a safety feature that restricts this spider to crawling only particular domains. It's not very useful in this example but it's a good practice to have it configured to reduce accidental errors where our spider could wander off and scrape some other website by accident.
  • start_urls indicates the starting point and parse() is the first callback. Scrapy spiders start work by connecting to each of the start URLs, calling back the parse() method and following whatever instructions this method produces.
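To illustrate what the allowed_domains restriction amounts to, here is a simplified sketch in plain Python. It is not Scrapy's actual offsite filtering code, and `is_allowed` is a hypothetical helper name; it only shows the idea that a URL passes when its host is an allowed domain or a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains: list) -> bool:
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["producthunt.com"]
checks = [
    is_allowed("https://www.producthunt.com/posts/slack", allowed),  # True
    is_allowed("https://example.com/posts/slack", allowed),          # False
    is_allowed("https://notproducthunt.com/", allowed),              # False
]
print(checks)
```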

Adding Crawling Logic

As per our example logic, we want our start_urls to be some topic directories (like https://www.producthunt.com/topics/developer-tools) and in our parse() callback method we want to find all product links and schedule them to be scraped:

# /spiders/products.py
import scrapy
from scrapy.http import Response, Request


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['producthunt.com']
    start_urls = [
        'https://www.producthunt.com/topics/developer-tools',
        'https://www.producthunt.com/topics/tech',
    ]

    def parse(self, response: Response):
        product_urls = response.xpath(
            "//div[contains(@class, 'item')]//a[contains(@class, 'title')]/@href"
        ).getall()
        for url in product_urls:
            # convert relative url (e.g. /posts/slack) 
            # to absolute (e.g. https://producthunt.com/posts/slack)
            url = response.urljoin(url)
            yield Request(url, callback=self.parse_product)
        # or shortcut in scrapy >=2.0
        # yield from response.follow_all(product_urls, callback=self.parse_product)
    
    def parse_product(self, response: Response):
        print(response)

We've updated our start_urls with a couple of directory links. Further, we've updated our parse() callback with some crawling logic: we find product URLs using an XPath selector and for each one of them we generate another request that calls back to the parse_product() method.
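The response.urljoin() call in the spider resolves a link against the URL of the page it was found on. Scrapy documents it as a wrapper over the standard library's urllib.parse.urljoin, so the behaviour can be tried out without Scrapy. The links below are illustrative values, not scraped data.

```python
from urllib.parse import urljoin

page_url = "https://www.producthunt.com/topics/developer-tools"
hrefs = [
    "/posts/slack",                              # root-relative link
    "https://www.producthunt.com/posts/notion",  # already absolute
    "tech",                                      # relative to the current path
]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
# https://www.producthunt.com/posts/slack
# https://www.producthunt.com/posts/notion
# https://www.producthunt.com/topics/tech
```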

Parsing HTML with Xpath

For more on xpath and how to use it in web-scraping check out this extensive introduction article


Adding Parsing Logic

With our basic crawling logic complete, let's add our parsing logic. For the Producthunt products we want to extract fields: title, subtitle, top tags and score:

Parsing markup of the Slack product page on producthunt.com.

Let's populate our parse_product() callback with this parsing logic:

# /spiders/products.py
...

def parse_product(self, response: Response):
    yield {
        'title': response.xpath('//h1/text()').get(),
        'subtitle': response.xpath('//h2//text()').get(),
        'votes': response.xpath("//span[contains(@class, 'bigButtonCount')]/text()").get(),
        'tags': response.xpath(
            "//div[contains(@class,'topicPriceWrap')]"
            "//a[contains(@href, '/topics/')]/text()"
        ).getall(),
    }

We got the title and the subtitle from the header elements, the vote scores from the button and the tags from the topic links. We can now test our scraper, but before we run the scrapy crawl products command let's take a look at the default settings, as they might get in the way of our scraping.

Basic Settings

By default, Scrapy doesn't include many settings and relies on the built-in defaults which aren't always optimal. Let's take a look at the basic recommended settings:

# settings.py
# will ignore /robots.txt rules that might prevent scraping
ROBOTSTXT_OBEY = False
# will cache all requests in the .scrapy/httpcache directory, which makes running spiders in development much quicker
# tip: to refresh the cache just delete the .scrapy/httpcache directory
HTTPCACHE_ENABLED = True
# while developing we want to see debug logs
LOG_LEVEL = "DEBUG" # or "INFO" in production

# to avoid basic bot detection we want to set some basic headers
DEFAULT_REQUEST_HEADERS = {
    # we should use headers that resemble those of a real web browser
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en',
}

With these settings we are ready to run our scraper!
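If deleting the cache directory by hand becomes tedious during development, the HTTP cache can also be tuned through a few related settings. The fragment below is a sketch of optional additions; the setting names are Scrapy's, but the values are assumptions chosen for illustration.

```python
# settings.py (optional additions)
HTTPCACHE_ENABLED = True
# directory name for the cache, relative to the project's .scrapy data directory
HTTPCACHE_DIR = "httpcache"
# treat cached responses older than one hour as expired (0 means never expire)
HTTPCACHE_EXPIRATION_SECS = 60 * 60
# don't cache server error responses, so failed pages are fetched again next run
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]
```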

Running Spiders

There are 2 ways to run Scrapy spiders: through the scrapy command and by calling Scrapy explicitly from a Python script. It's often recommended to use the Scrapy CLI tool since Scrapy is a rather complex system, and it's safer to give it a dedicated Python process.

We can run our products spider through scrapy crawl products command:

$ scrapy crawl products
...
2022-01-19 14:47:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.producthunt.com/topics/developer-tools> (referer: None) ['cached']
2022-01-19 14:47:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.producthunt.com/posts/slack> (referer: https://www.producthunt.com/topics/developer-tools) ['cached']
2022-01-19 14:47:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.producthunt.com/posts/slack>
{'title': 'Slack', 'subtitle': 'Be less busy. Real-time messaging, archiving & search.', 'votes': '17,380', 'tags': ['Android', 'iPhone', 'Mac']}
...
2022-01-19 14:47:18 [scrapy.core.engine] INFO: Closing spider (finished)
2022-01-19 14:47:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{
 ...
 'finish_time': datetime.datetime(2022, 1, 19, 7, 47, 18, 689962),
 'httpcache/hit': 2,
 'item_scraped_count': 1,
 'start_time': datetime.datetime(2022, 1, 19, 7, 47, 18, 459982)
 }
2022-01-19 14:47:18 [scrapy.core.engine] INFO: Spider closed (finished)

Scrapy provides brilliant logs that log everything the scrapy engine is doing as well as logging any returned results. At the end of the process, scrapy also attaches some useful scrape statistics - like how many items were scraped, how long it took for our scraper to finish and so on.
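The stats dump is a plain dictionary, so simple run metrics can be derived from it. As a small worked example, this takes the values from the log output above and computes how long the run took; the `stats` variable is just a hand-copied subset of that dump.

```python
from datetime import datetime

# values copied from the stats dump in the log output above
stats = {
    "start_time": datetime(2022, 1, 19, 7, 47, 18, 459982),
    "finish_time": datetime(2022, 1, 19, 7, 47, 18, 689962),
    "httpcache/hit": 2,
    "item_scraped_count": 1,
}

elapsed = (stats["finish_time"] - stats["start_time"]).total_seconds()
items_per_second = stats["item_scraped_count"] / elapsed
print(f"run took {elapsed:.2f}s, about {items_per_second:.1f} items/s")
```

The run finishes in a fraction of a second here because both responses were served from the HTTP cache, as the 'httpcache/hit': 2 entry shows.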

Running Scrapy via a Python script is a bit more complicated, and we recommend taking a look at the official recipe.
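For orientation, a minimal sketch of the script approach could look like the following. It assumes the script is started from inside the project directory (next to scrapy.cfg) so that the project settings and the products spider can be found; the file name run.py and the function name `run_spider` are hypothetical.

```python
# run.py (hypothetical file name), placed next to scrapy.cfg


def run_spider(name: str = "products") -> None:
    """Run one project spider in the current process and block until it finishes."""
    # imported lazily so that merely importing this module has no Scrapy side effects
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
    process.crawl(name)  # inside a project, spiders can be referenced by name
    process.start()      # starts the Twisted reactor and blocks until crawling ends
```

Calling run_spider() from the script's entry point starts the crawl. Since the Twisted reactor cannot be restarted once it has stopped, this should be called only once per process.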

Saving Results

We have a spider which successfully scrapes product data and prints results to logs. If we want to save the results to a file, we can update our scrapy crawl command with an output flag:

$ scrapy crawl products --output results.json
...
$ head results.json
[
{"title": "Slack", "subtitle": "Be less busy. Real-time messaging, archiving & search.", "votes": "17,380", "tags": ["Android", "iPhone", "Mac"]}
...

Alternatively, we can configure the FEEDS setting, which will automatically store all data in a file:

# settings.py
FEEDS = {
    # location where to save results
    'producthunt.json': {
        # file format like json, jsonlines, xml and csv
        'format': 'json',
        # use unicode text encoding:
        'encoding': 'utf8',
        # whether to export empty fields
        'store_empty': False,
        # we can also restrict to export only specific fields like: title and votes:
        'fields': ["title", "votes"],
        # overwrite the file on every run; if set to False, every run appends results to the existing file
        'overwrite': True,
    },
}

This setting allows us to configure multiple outputs for our scraped data in great detail. Scrapy supports several feed storage backends by default, such as Amazon's S3 and Google Cloud Storage, and there are many community extensions that provide support for other data storage services and types.

For more on scrapy exporters see official feed exporter documentation
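As an illustration of configuring multiple outputs, the sketch below writes the same scraped items to two files at once. The file paths are made up for this example; the option names are the same ones used in the FEEDS example above.

```python
# settings.py
FEEDS = {
    # full records, one JSON object per line, appended to on every run
    "output/products.jsonl": {
        "format": "jsonlines",
        "encoding": "utf8",
        "overwrite": False,
    },
    # a compact, spreadsheet-friendly summary with only two columns
    "output/votes.csv": {
        "format": "csv",
        "fields": ["title", "votes"],
        "overwrite": True,
    },
}
```

The jsonlines format suits the appending feed because each line is a complete record, whereas appending to a single JSON array would leave the file invalid.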

Extending Scrapy

Scrapy is a very configurable framework, as it provides a lot of space for various extensions through middlewares, pipelines and general extension slots. Let's take a quick look at these and how we can improve our example project with some custom extensions.

Middlewares

Scrapy provides convenient interception points for many actions the web scraping engine performs. For example, downloader middlewares let us pre-process outgoing requests and post-process incoming responses. We can use this to design custom connection logic like retrying some requests, dropping others or implementing connection caching.

For example, let's update our Producthunt spider with a middleware that drops some requests and modifies some responses. If we open up the generated middlewares.py file, we can see that scrapy startproject has already generated a template for us:

# middlewares.py
...
class ProducthuntDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

So, to process all requests the spider makes we use the process_request() method, and likewise for responses we use process_response(). Let's drop scraping of all products that start with the letter s:

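# middlewares.py (the import goes at the top of the file)
from scrapy.exceptions import IgnoreRequest

def process_request(self, request, spider):
    # drop requests for products whose slug starts with "s", e.g. /posts/slack
    if 'posts/s' in request.url.lower():
        raise IgnoreRequest(f'skipping product starting with letter "s" {request.url}')
    return None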

Then, let's presume that Producthunt redirects all expired products to /product/expired - we should drop these responses:

def process_response(self, request, response, spider):
    if 'product/expired' in response.url.lower():
        raise IgnoreRequest(f'skipping expired product: {request.url}')
    return response
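The substring check in process_request() is intentionally simple, but it is also loose: 'posts/s' matches anywhere in the URL. A stricter variant could parse the URL path first. The sketch below uses only the standard library; `product_slug` and `should_skip` are hypothetical helper names, and the last URL is made up to show the difference.

```python
from typing import Optional
from urllib.parse import urlparse


def product_slug(url: str) -> Optional[str]:
    """Return the slug from a /posts/<slug> URL, or None for any other URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "posts":
        return parts[1].lower()
    return None


def should_skip(url: str) -> bool:
    """True only for product pages whose slug starts with the letter 's'."""
    slug = product_slug(url)
    return slug is not None and slug.startswith("s")


print(should_skip("https://www.producthunt.com/posts/slack"))     # True
print(should_skip("https://www.producthunt.com/posts/notion"))    # False
# the plain substring check would match this URL, the path-based check does not
print(should_skip("https://www.producthunt.com/topics/posts/s"))  # False
```

Inside the middleware, process_request() could then raise IgnoreRequest whenever should_skip(request.url) is true.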

With our middleware ready, the last step is to activate it in our settings:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'producthunt.middlewares.ProducthuntDownloaderMiddleware': 543,
}

This setting contains a dictionary of middleware paths and their priority levels - which are usually specified as integers from 0 to 1000. Priority is necessary to handle interaction between multiple middlewares as Scrapy by default already comes with over 10 middlewares enabled!

Typically, we want to include our middleware somewhere in the middle - before the RetryMiddleware (priority 550), which handles common connection retries. That being said, it's recommended to familiarize yourself with the default middlewares to find the sweet spot where your middleware can produce stable results. You can find the list of default middlewares on the official settings documentation page.
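To see what the priority numbers mean in practice: per the Scrapy documentation, process_request() methods are called in increasing priority order and process_response() methods in decreasing order. The sketch below only models that ordering with a sorted dictionary; the two example.* entries are hypothetical middlewares added for contrast.

```python
middlewares = {
    "example.EarlyMiddleware": 100,  # hypothetical
    "producthunt.middlewares.ProducthuntDownloaderMiddleware": 543,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "example.LateMiddleware": 900,   # hypothetical
}

# requests travel from the engine to the downloader: lowest number first
request_order = sorted(middlewares, key=middlewares.get)
# responses travel back from the downloader to the engine: highest number first
response_order = list(reversed(request_order))

print("process_request order: ", request_order)
print("process_response order:", response_order)
```

With priority 543 our middleware sees each outgoing request before RetryMiddleware does, and sees each incoming response only after RetryMiddleware has had its chance to schedule a retry.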

Middlewares provide us with a lot of power when it comes to controlling the flow of our connections. Likewise, pipelines can provide us with a lot of power when controlling our data output - let's take a look at them!

Pipelines

Pipelines are essentially data post-processors. Whenever our spider generates some results they are being piped through registered pipelines and the final output is sent to our feed (be it a log or a feed export).

Let's add an example pipeline to our Producthunt spider which will drop low score products:

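# pipelines.py
from scrapy.exceptions import DropItem


class ProducthuntPipeline:
    def process_item(self, item, spider):
        # votes arrive as text like "17,380"; treat a missing value as 0
        votes = int((item.get('votes') or '0').replace(',', ''))
        if votes < 100:
            raise DropItem(f"dropped item of score: {item.get('votes')}")
        return item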

As with middlewares, we also need to activate our pipelines in the settings file:

# settings.py
ITEM_PIPELINES = {
    'producthunt.pipelines.ProducthuntPipeline': 300,
}

Since Scrapy doesn't include any default pipelines, in this case we can set the priority to any value, but it's good practice to keep it in the same 0 to 1000 range. With this pipeline, every time we run scrapy crawl products all generated results will be filtered through our votes filtering logic before they are sent to the final output.
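Pipeline logic is plain Python, so it can be tried out on sample items without running a crawl. The sketch below is a hypothetical variant of the pipeline with a configurable threshold. To keep it runnable without Scrapy installed, `DropItem` is a local stand-in for scrapy.exceptions.DropItem, and the sample items are invented.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so this runs without Scrapy."""


def parse_votes(raw) -> int:
    """Convert a scraped vote string such as '17,380' to an int; missing counts as 0."""
    return int((raw or "0").replace(",", "").strip())


class MinVotesPipeline:
    """Hypothetical variant of the votes filter with a configurable threshold."""

    def __init__(self, min_votes: int = 100):
        self.min_votes = min_votes

    def process_item(self, item, spider):
        votes = parse_votes(item.get("votes"))
        if votes < self.min_votes:
            raise DropItem(f"dropped item with {votes} votes")
        return item


samples = [
    {"title": "Slack", "votes": "17,380"},
    {"title": "Tiny", "votes": "12"},
    {"title": "NoVotes"},
]
pipeline = MinVotesPipeline()
kept = []
for sample in samples:
    try:
        kept.append(pipeline.process_item(sample, spider=None))
    except DropItem as exc:
        print(exc)

print([item["title"] for item in kept])  # ['Slack']
```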


We've taken a look at the two most common ways of extending Scrapy: downloader middlewares, which allow us to control requests and responses, and pipelines, which allow us to control the output. These are very powerful tools that provide an elegant way of solving common web scraping challenges, so let's take a look at some of these challenges and the existing solutions that are out there.

Common Challenges

While Scrapy is a big framework, it focuses on performance and a robust set of core features, which often means we need to solve common web scraping challenges through either community or custom extensions.

The most common challenge when web scraping is scraper blocking. For this, Scrapy community provides various plugins for proxy management like scrapy-rotating-proxies and scrapy-fake-useragent for randomizing user agent headers. Additionally, there are extensions which provide browser emulation like scrapy-playwright and scrapy-selenium.

Scraping Dynamic Websites Using Browser

For more on browser automation see our extensive article which examines and compares major browser automation libraries such as Selenium, Playwright and Puppeteer


For scaling, there are various task distribution extensions such as scrapy-redis and scrapy-cluster, which allow scaling huge scraping projects through redis and kafka services. There is also scrapy-deltafetch, which provides easy persistent connection caching for optimizing repeated scrapes.

Finally, for monitoring Scrapy has integrations with major monitoring services such as sentry via scrapy-sentry or general monitoring util scrapy-spidermon.

Scrapy + ScrapFly

While scrapy is a very powerful and accessible web scraping framework, it doesn't help much with solving the biggest web scraping problem of all - access blocking.

ScrapFly provides an easy Scrapy integration through ScrapFly's python-sdk, which lets us take advantage of all ScrapFly features such as:

Javascript Rendering
Since Scrapy is a pure Python framework, it doesn't provide any javascript rendering like web browsers do. This means some dynamic web content is impossible to reach without reverse engineering embedded javascript functionality. ScrapFly middleware uses automated browsers to render javascript and then passes the result back to Scrapy, which gets us all the benefits of browser rendering and Scrapy's speed!

Anti Scraping Protection Bypass
ScrapFly offers an Anti Scraping Protection (ASP) solution that solves captchas and bypasses various anti-bot measures automatically!

Smart Proxies
While Scrapy has built-in proxy support, it doesn't offer a smart way of distributing and managing proxies. ScrapFly automatically applies proxies that fit the request and distributes load across many proxies for the fastest scraping experience.


To migrate to ScrapFly's scrapy integration all we have to do is replace base Spider object with ScrapflySpider and yield ScrapflyScrapyRequest objects instead of scrapy's Requests.

Let's see what our Producthunt scraper would look like with ScrapFly's SDK:

# /spiders/products.py

from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse


class ProductsSpider(ScrapflySpider):
    name = 'products'
    allowed_domains = ['producthunt.com']
    start_urls = [
        ScrapeConfig(url='https://www.producthunt.com/topics/developer-tools')
    ]

    def parse(self, response: ScrapflyScrapyResponse):
        product_urls = response.xpath(
            "//div[contains(@class,'item')]//a[contains(@class,'title')]/@href"
        ).getall()
        for url in product_urls:
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(
                    url=response.urljoin(url),
                    # we can render javascript via browser automation
                    render_js=True,
                    # we can get around anti bot protection
                    asp=True,
                    # we can select a specific proxy country
                    country='us',
                ),
                callback=self.parse_product
            )
    
    def parse_product(self, response: ScrapflyScrapyResponse):
        yield {
            'title': response.xpath('//h1/text()').get(),
            'subtitle': response.xpath('//h2//text()').get(),
            'votes': response.xpath("//span[contains(@class, 'bigButtonCount')]/text()").get(),
            'tags': response.xpath(
                "//div[contains(@class,'topicPriceWrap')]"
                "//a[contains(@href, '/topics/')]/text()"
            ).getall(),
        }

# settings.py
SCRAPFLY_API_KEY = 'YOUR API KEY'
CONCURRENT_REQUESTS = 2

We've got all the benefits of ScrapFly service just by replacing these few scrapy classes with the ones of ScrapFly SDK! We can even toggle which features we want to apply to each individual request by configuring keyword arguments in ScrapflyScrapyRequest object.

For more see our ScrapFly + Scrapy docs

FAQ

Before we wrap up let's take a look at some frequently asked questions about using Scrapy for web scraping.

Can Selenium be used with Scrapy?

Selenium is a popular web browser automation framework in Python, however because of differing architectures making scrapy and selenium work together is tough.
Check out these open source attempts scrapy-selenium and scrapy-headless.
Alternatively, we recommend taking a look at scrapy + splash extension scrapy-splash.

Summary

In this introduction article we explored the Scrapy web-scraping framework by taking a look at the core structure of a Scrapy project. We used an example web scraper of https://www.producthunt.com/ product listings to explore request creation, response handling and HTML parsing.
We've also introduced ourselves to two main ways of extending web-scraping logic: downloader middlewares, which process requests and responses, and pipelines, which process the results.

Finally, we wrapped everything up with some highlights of great Scrapy extensions and ScrapFly's own integration, which solves major access issues a performant web scraper might encounter. For more, we recommend referring to Scrapy's official documentation, and for community help we recommend the very helpful #scrapy tag on stackoverflow.
