Scrapy - Web Scraping Framework


Introduction
Scrapy is a well-known web scraping framework written in Python and massively adopted by the community. This integration replaces all of the networking layer so that requests go through our API. The Scrapy documentation is available here.
The Scrapy integration is part of our Python SDK. The source code is available on GitHub, and the scrapfly-sdk package is available through PyPI.
pip install 'scrapfly-sdk[scrapy]'
What's Changed?
The Python API reference is available if you need details about the objects below; a minimal usage sketch follows the mapping.
Objects
scrapy.http.Request
->scrapfly.scrapy.request.ScrapflyScrapyRequest
scrapy.http.Response
->scrapfly.scrapy.response.ScrapflyScrapyResponse
scrapy.spiders.Spider
->scrapfly.scrapy.spider.ScrapflySpider
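To illustrate how these replacements fit together, here is a minimal spider sketch (the URL and selectors are placeholders, not part of the integration):
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyScrapyRequest, ScrapflyScrapyResponse, ScrapflySpider

class MySpider(ScrapflySpider):
    name = 'my_spider'
    # start_urls holds ScrapeConfig objects instead of plain URL strings
    start_urls = [ScrapeConfig(url='https://example.com')]

    def parse(self, response: ScrapflyScrapyResponse):
        # follow links with ScrapflyScrapyRequest instead of scrapy.Request
        yield ScrapflyScrapyRequest(
            scrape_config=ScrapeConfig(url=response.urljoin('/next')),
            callback=self.parse_next
        )

    def parse_next(self, response: ScrapflyScrapyResponse):
        yield {'title': response.css('title::text').get()}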
Middlewares
The following middlewares are disabled because they are not relevant when using Scrapfly (an illustrative snippet of what this means follows the list):
scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware
scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware
scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware
scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware
scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware
scrapy.downloadermiddlewares.redirect.RedirectMiddleware
scrapy.downloadermiddlewares.cookies.CookiesMiddleware
scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
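For reference, "disabled" here is what Scrapy does when a middleware is mapped to None in DOWNLOADER_MIDDLEWARES. The integration handles this for you, so the snippet below is only illustrative and does not need to be added to your settings:
# settings.py - illustrative only; the Scrapfly integration disables these automatically
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    # ... and so on for every middleware listed above
}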
The internal HTTP/HTTPS download handler is replaced (see the illustrative snippet below):
scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler
->scrapfly.scrapy.downloader.ScrapflyHTTPDownloader
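For reference, Scrapy selects download handlers through the DOWNLOAD_HANDLERS setting; the integration performs this substitution for you, so the snippet below only illustrates what the replacement means:
# settings.py - illustrative only; applied automatically by the integration
DOWNLOAD_HANDLERS = {
    'http': 'scrapfly.scrapy.downloader.ScrapflyHTTPDownloader',
    'https': 'scrapfly.scrapy.downloader.ScrapflyHTTPDownloader',
}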
Stats Collector
All Scrapfly metrics are prefixed with Scrapfly/. The following Scrapfly metrics are available:
- Scrapfly/api_call_cost - (int) Sum of billed API Credits against your quota
Complete documentation about stats collector is available here: https://docs.scrapy.org/en/latest/topics/stats.html
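For example, you can read this metric with Scrapy's standard stats API, e.g. when the spider closes (a minimal sketch; the spider name is a placeholder):
from scrapfly.scrapy import ScrapflySpider

class StatsExampleSpider(ScrapflySpider):
    name = 'stats_example'  # placeholder name

    def closed(self, reason):
        # standard Scrapy stats collector API; Scrapfly metrics are prefixed with "Scrapfly/"
        cost = self.crawler.stats.get_value('Scrapfly/api_call_cost')
        self.logger.info('Billed API credits for this crawl: %s', cost)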
Settings Configuration
SCRAPFLY_API_KEY = ''
CONCURRENT_REQUESTS = 2 # Adjust according to your plan's rate limit and your needs
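If you prefer not to hard-code the key, one common pattern (an assumption about your project, not a requirement of the integration) is to read it from an environment variable in settings.py:
# settings.py
import os

# assumes the key is exported as the SCRAPFLY_API_KEY environment variable
SCRAPFLY_API_KEY = os.environ.get('SCRAPFLY_API_KEY', '')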
How to use the equivalent of API parameters?
You can check out this section of the Python SDK to see how to configure your calls.
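As a sketch, API features are expressed through ScrapeConfig when you build start_urls or ScrapflyScrapyRequest objects; the parameters shown below (asp, render_js, country) follow the Python SDK, so check that section for the exact names and options available in your version:
from scrapfly import ScrapeConfig

config = ScrapeConfig(
    url='https://example.com',  # placeholder URL
    asp=True,                   # anti scraping protection
    render_js=True,             # render the page with a headless browser
    country='us'                # proxy country
)

# use it like any other ScrapeConfig, e.g. in start_urls or
# ScrapflyScrapyRequest(scrape_config=config, callback=self.parse)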
Troubleshooting
Scrapy Checkup
scrapy check
Check API Key setting
scrapy settings --get SCRAPFLY_API_KEY
tls_process_server_certificate - certificate verify failed
pip install --upgrade certifi
Example: Covid Data
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse

class CovidSpider(ScrapflySpider):
    name = 'covid'
    allowed_domains = ['www.worldometers.info']
    start_urls = [ScrapeConfig(url='https://www.worldometers.info/coronavirus')]

    def parse(self, response: ScrapflyScrapyResponse):
        # skip the header row and any hidden rows
        rows = response.xpath('//*[@id="main_table_countries_today"]//tr[position()>1 and not(contains(@style,"display: none"))]')
        for row in rows:
            country = row.xpath(".//td[2]/a/text()").get()
            totalCase = row.xpath(".//td[3]/text()").get()
            totalDeath = row.xpath(".//td[5]/text()").get()
            totalRecovered = row.xpath(".//td[7]/text()").get()
            activeCase = row.xpath(".//td[8]/text()").get()
            seriousCritical = row.xpath(".//td[9]/text()").get()
            population = row.xpath(".//td[14]/text()").get()
            yield {
                "CountryName": country,
                "Total Case": totalCase,
                "Total Deaths": totalDeath,
                "Total Recovered": totalRecovered,
                "Active Cases": activeCase,
                "Critical Cases": seriousCritical,
                "Population": population
            }
scrapy crawl covid -o covid.csv
Example: Download BEA Report
The BEA (the French civil aviation safety investigation and analysis bureau) collects, investigates, and reports on civil aviation incidents. In this spider, we iterate over a small portion of the reports and download the PDF report if one exists.
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse

class BEA(ScrapflySpider):
    name = "bea"
    allowed_domains = ["www.bea.aero", "bea.aero"]
    start_urls = [ScrapeConfig("https://bea.aero/en/investigation-reports/notified-events/?tx_news_pi1%5Baction%5D=searchResult&tx_news_pi1%5Bcontroller%5D=News&tx_news_pi1%5BfacetAction%5D=add&tx_news_pi1%5BfacetTitle%5D=year_intS&tx_news_pi1%5BfacetValue%5D=2016&cHash=408c483eae88344bf001f9cdbf653010")]

    def parse(self, response: ScrapflyScrapyResponse):
        # follow each search result to its report page
        for href in response.css('h1.search-entry__title > a::attr(href)').extract():
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(url=response.urljoin(href)),
                callback=self.parse_report
            )

    def parse_report(self, response: ScrapflyScrapyResponse):
        # request the first PDF link of each list item, if any
        for el in response.css('li > a[href$=".pdf"]:first-child'):
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(url=response.urljoin(el.attrib['href'])),
                callback=self.save_pdf
            )

    def save_pdf(self, response: ScrapflyScrapyResponse):
        # write the PDF body to the local "pdf" directory
        response.sink(path='pdf', name=response.url.split('/')[-1])
Don't forget to create the pdf directory (e.g. mkdir pdf) before running the spider:
scrapy crawl bea