What are scrapy middlewares and how to use them?

Scrapy middlewares are extensions that hook into Scrapy's request/response processing, letting you modify outgoing requests and incoming responses. They are a convenient way to add connection logic to Scrapy spiders.

For example, scrapy middlewares are often used to:

  • Retry and filter requests and responses based on their content.
  • Modify outgoing requests with different headers or proxies.
  • Collect and track connection performance metrics.
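As an example of the proxy use case, here's a minimal sketch of a proxy-rotating middleware. The proxy URLs are placeholders, and a stand-in request object is used so the sketch runs without Scrapy installed; in a real project the `request` argument is Scrapy's own `Request`:

```python
import itertools

# Placeholder proxy endpoints, not real servers
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]

class RotatingProxyMiddleware:
    def __init__(self):
        self._proxies = itertools.cycle(PROXIES)

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"]
        request.meta["proxy"] = next(self._proxies)
        return None  # continue normal processing of the request

# Stand-in for scrapy.Request so the sketch is self-contained
class FakeRequest:
    def __init__(self):
        self.meta = {}

mw = RotatingProxyMiddleware()
r1, r2 = FakeRequest(), FakeRequest()
mw.process_request(r1, None)
mw.process_request(r2, None)
print(r1.meta["proxy"])  # → http://proxy1.example.com:8000
print(r2.meta["proxy"])  # → http://proxy2.example.com:8000
```

Each call to `process_request` assigns the next proxy in the cycle, spreading requests across the pool.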

Scrapy comes with several default middlewares that perform common tasks such as:

  • retrying common exceptions
  • handling redirects
  • tracking cookies
  • decompressing compressed responses
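These built-ins are enabled through Scrapy's `DOWNLOADER_MIDDLEWARES_BASE` setting, and any of them can be switched off by mapping its class path to `None` in your project settings. For example, to disable the built-in retry middleware:

```python
# settings.py — disable a default middleware by mapping it to None
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
}
```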

Being able to define custom middlewares is the real power of scrapy middlewares. For example, here's a middleware that adds a header to each request:

# middlewares.py
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Called for every outgoing request before it is sent
        request.headers['x-token'] = "123456"

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.middlewares.CustomHeaderMiddleware': 500,
}

In this example, we're adding an x-token header to each outgoing request. The process_request method is called for every outgoing request and can modify the request object in place. The number 500 in DOWNLOADER_MIDDLEWARES sets the middleware's order relative to Scrapy's built-in middlewares: lower numbers run earlier in process_request.
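Middlewares can also inspect incoming responses through process_response, which can either return the response (passing it on to the spider) or return a request (re-scheduling it). Here's a minimal sketch of a retry-on-block middleware; the `"captcha"` marker is an illustrative block signal, and stand-in request/response classes are used so the sketch runs without Scrapy:

```python
# Stand-ins for scrapy.Request / scrapy.http.Response, so this runs standalone
class FakeRequest:
    def __init__(self):
        self.dont_filter = False
    def copy(self):
        return FakeRequest()

class FakeResponse:
    def __init__(self, body):
        self.body = body

class BlockDetectionMiddleware:
    def process_response(self, request, response, spider):
        # If the page looks blocked, return a request to re-schedule it;
        # dont_filter=True lets it past Scrapy's duplicate filter
        if b"captcha" in response.body.lower():
            retry = request.copy()
            retry.dont_filter = True
            return retry
        # Returning the response passes it on to the spider
        return response

mw = BlockDetectionMiddleware()
ok = mw.process_response(FakeRequest(), FakeResponse(b"<html>data</html>"), None)
blocked = mw.process_response(FakeRequest(), FakeResponse(b"Solve this CAPTCHA"), None)
print(type(ok).__name__)       # → FakeResponse (passed through)
print(type(blocked).__name__)  # → FakeRequest (scheduled for retry)
```

Returning a request instead of a response is how Scrapy's own RetryMiddleware re-queues failed pages.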

Provided by Scrapfly

This knowledgebase is provided by Scrapfly — a web scraping API that allows you to scrape any website without getting blocked and implements dozens of other web scraping conveniences. Check us out 👇