Scrapy - Web Scraping Framework


Have a question?

Dev support: ask on Stack Overflow and make sure to include the following tags: scrapfly, web-scraping, python, scrapy


Scrapy is a well-known web scraping framework written in Python and massively adopted by the community. The integration replaces the entire networking layer so that all requests go through our API. Scrapy documentation is available here

The Scrapy integration is part of our Python SDK. The source code is available on GitHub, and the scrapfly-sdk package is available through PyPI.

pip install 'scrapfly-sdk[scrapy]'

What's Changed?

A Python API is available to get the details of request and response objects.
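For example, each request is described by a ScrapeConfig object instead of a plain URL string. A minimal sketch, assuming the ScrapeConfig options from the Python SDK (the target URL is hypothetical; enable only the features your plan supports):

from scrapfly import ScrapeConfig

# ScrapeConfig carries all Scrapfly API options for a single request
config = ScrapeConfig(
    url='https://httpbin.dev/html',  # hypothetical target URL
    country='us',      # proxy geolocation
    render_js=True,    # render the page with a headless browser
    asp=True,          # enable anti scraping protection bypass
)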



The following middlewares are disabled because they are not relevant when using Scrapfly:

  • scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware
  • scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware
  • scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware
  • scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
  • scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware
  • scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware
  • scrapy.downloadermiddlewares.redirect.RedirectMiddleware
  • scrapy.downloadermiddlewares.cookies.CookiesMiddleware
  • scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware

The internal HTTP/HTTPS download handler is replaced:

  • scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler -> scrapfly.scrapy.downloader.ScrapflyHTTPDownloader
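For illustration, this is roughly what the handler replacement looks like in Scrapy settings terms. A minimal sketch only; the integration applies it automatically when you run a ScrapflySpider, so you do not need to configure it yourself:

# Illustrative only - ScrapflySpider sets this up for you
DOWNLOAD_HANDLERS = {
    'http': 'scrapfly.scrapy.downloader.ScrapflyHTTPDownloader',
    'https': 'scrapfly.scrapy.downloader.ScrapflyHTTPDownloader',
}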

Stats Collector

All Scrapfly metrics are prefixed with Scrapfly. The following Scrapfly metrics are available:

  • Scrapfly/api_call_cost - (int) Sum of billed API calls against your quota

Complete documentation about the stats collector is available here.
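You can read these metrics through Scrapy's stock stats collector inside your spider. A minimal sketch using Scrapy's standard crawler.stats API:

    # Add to your spider class - Scrapy calls closed() when the crawl finishes
    def closed(self, reason):
        billed = self.crawler.stats.get_value('Scrapfly/api_call_cost', 0)
        self.logger.info('Billed API calls: %s', billed)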

Settings Configuration

CONCURRENT_REQUESTS = 2  # Adjust according to your plan's rate limit and your needs
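A minimal settings.py sketch. CONCURRENT_REQUESTS is a stock Scrapy setting; SCRAPFLY_API_KEY is assumed here as the setting the integration reads for your API key (verify the exact name against the SDK documentation):

# settings.py
SCRAPFLY_API_KEY = 'YOUR_API_KEY'  # assumed setting name - check the SDK docs
CONCURRENT_REQUESTS = 2            # keep within your plan's concurrency limit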

Example: Covid Data

from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflySpider, ScrapflyScrapyResponse

class CovidSpider(ScrapflySpider):
    name = 'covid'
    allowed_domains = ['']  # domain omitted in the original snippet
    start_urls = [ScrapeConfig(url='')]  # target URL omitted in the original snippet

    def parse(self, response: ScrapflyScrapyResponse):
        # Select all visible country rows, skipping the header row
        rows = response.xpath('//*[@id="main_table_countries_today"]//tr[position()>1 and not(contains(@style,"display: none"))]')

        for row in rows:
            country = row.xpath(".//td[2]/a/text()").get()
            totalCase = row.xpath(".//td[3]/text()").get()
            totalDeath = row.xpath(".//td[5]/text()").get()
            totalRecovered = row.xpath(".//td[7]/text()").get()
            activeCase = row.xpath(".//td[8]/text()").get()
            seriousCritical = row.xpath(".//td[9]/text()").get()
            population = row.xpath(".//td[14]/text()").get()

            yield {
                "CountryName": country,
                "Total Case": totalCase,
                "Total Deaths": totalDeath,
                "Total Recovered": totalRecovered,
                "Active Cases": activeCase,
                "Critical Cases": seriousCritical,
                "Population": population
            }
scrapy crawl covid -o covid.csv

Example: Download BEA Report

The BEA (the French civil aviation safety investigation and analysis bureau) collects, investigates, and reports on aviation incidents. In this spider, we iterate over a small portion of the reports and download the PDF report when it exists.

from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse

class BEA(ScrapflySpider):
    name = "bea"

    allowed_domains = ["", ""]  # domains omitted in the original snippet
    start_urls = [ScrapeConfig("")]  # start URL omitted in the original snippet

    def parse(self, response: ScrapflyScrapyResponse):
        # Follow each report link; the left side of this CSS selector was
        # truncated in the original snippet. The request arguments below follow
        # the SDK's scrape_config/callback pattern.
        for href in response.css(' > a::attr(href)').extract():
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(url=response.urljoin(href)),
                callback=self.parse_report
            )

    def parse_report(self, response: ScrapflyScrapyResponse):
        # Request the first PDF link of each report page
        for el in response.css('li > a[href$=".pdf"]:first-child'):
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(url=response.urljoin(el.attrib['href'])),
                callback=self.save_pdf
            )

    def save_pdf(self, response: ScrapflyScrapyResponse):
        # Write the PDF body to the local "pdf" directory
        response.sink(path='pdf', name=response.url.split('/')[-1])

Don't forget to create the pdf directory before running the spider.
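For example, on a Unix-like shell:

mkdir pdf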

scrapy crawl bea