Scrapy - Web Scraping Framework
Introduction
Scrapy is a well-known web scraping framework written in Python and widely adopted by the community. The integration replaces the whole networking layer so requests are routed through our API with minimal effort. The Scrapy documentation is available here.
The Scrapy integration is part of our Python SDK.
The source code is available on GitHub.
The scrapfly-sdk package is available through PyPI.
pip install 'scrapfly-sdk[scrapy]'
What's Changed?
A Python API reference is available with details on each object.
Objects
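For orientation, these are the integration objects used throughout this page; the imports below are taken from the example spider further down:

from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse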
Middlewares
The following middlewares are disabled because they are not relevant when using Scrapfly:
scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware
scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware
scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware
scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware
scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware
scrapy.downloadermiddlewares.redirect.RedirectMiddleware
scrapy.downloadermiddlewares.cookies.CookiesMiddleware
scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
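You do not need to disable these yourself; the integration takes care of it. For reference only, a sketch of what the equivalent manual configuration in settings.py would look like, using the standard Scrapy convention of mapping a middleware path to None:

DOWNLOADER_MIDDLEWARES = {
    # reference sketch - the Scrapfly integration already disables these for you
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': None,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    # ... and so on for the other middlewares listed above
}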
The internal HTTP/HTTPS download handler is replaced:
scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler
-> scrapfly.scrapy.downloader.ScrapflyHTTPDownloader
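This replacement is also applied automatically. As a sketch, if you had to register the handler manually it would correspond to a DOWNLOAD_HANDLERS override in settings.py:

DOWNLOAD_HANDLERS = {
    # reference sketch - installed automatically by the integration
    'http': 'scrapfly.scrapy.downloader.ScrapflyHTTPDownloader',
    'https': 'scrapfly.scrapy.downloader.ScrapflyHTTPDownloader',
}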
Stats Collector
All Scrapfly metrics are prefixed with Scrapfly. The following Scrapfly metrics are available:
- Scrapfly/api_call_cost - (int) sum of billed API credits counted against your quota
Complete documentation about the stats collector is available here: https://docs.scrapy.org/en/latest/topics/stats.html
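The metric can be read with the standard Scrapy stats API. A minimal sketch, assuming a hypothetical spider named stats-demo, logging the credit cost when the spider closes:

from scrapfly.scrapy import ScrapflySpider

class StatsDemoSpider(ScrapflySpider):
    name = "stats-demo"  # hypothetical spider name

    def closed(self, reason):
        # 'Scrapfly/api_call_cost' is the metric documented above
        cost = self.crawler.stats.get_value('Scrapfly/api_call_cost', 0)
        self.logger.info("Scrape run consumed %s API credits", cost)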
Settings Configuration
SCRAPFLY_API_KEY = ''
CONCURRENT_REQUESTS = 2  # Adjust according to your plan's rate limit and your needs
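If you prefer not to hard-code the key in settings.py, a common pattern is to read it from an environment variable (sketch; the variable name is an assumption):

import os

SCRAPFLY_API_KEY = os.environ.get('SCRAPFLY_API_KEY', '')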
How to use the equivalent of API parameters?
You can check out this section of the Python SDK documentation to see how to configure your calls.
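In short, each API parameter maps to a keyword argument of ScrapeConfig. A minimal sketch: render_js and asp are used in the example spider below, while country is an assumed illustration, so check the SDK reference for the full list:

from scrapfly import ScrapeConfig

config = ScrapeConfig(
    'https://web-scraping.dev/product/1',
    render_js=True,   # render the page with a headless browser
    asp=True,         # enable anti-scraping protection bypass
    country='us',     # proxy country (assumed parameter, see the SDK reference)
)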
Troubleshooting
Scrapy Checkup
Check the API key setting:
scrapy settings --get SCRAPFLY_API_KEY
tls_process_server_certificate - certificate verify failed
pip install --upgrade certifi
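If the error persists after upgrading, you can confirm which CA bundle your Python environment actually uses (standard certifi API):

import certifi

print(certifi.where())  # path to the CA bundle used for TLS verification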
Example: Scrapy Spider Demo
The full example is available in our GitHub repository.
from scrapy import Item, Field
from scrapy.exceptions import CloseSpider
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.python.failure import Failure

from scrapfly import ScrapeConfig
from scrapfly.errors import ScraperAPIError, ApiHttpServerError
from scrapfly.scrapy import ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse


class Product(Item):
    name = Field()
    price = Field()
    description = Field()

    # used by scrapy.pipelines.images.ImagesPipeline
    image_urls = Field()
    images = Field()


class Demo(ScrapflySpider):
    name = "demo"
    allowed_domains = ["web-scraping.dev", "httpbin.dev"]

    start_urls = [
        ScrapeConfig("https://web-scraping.dev/product/1", render_js=True),
        ScrapeConfig("https://web-scraping.dev/product/2"),
        ScrapeConfig("https://web-scraping.dev/product/3"),
        ScrapeConfig("https://web-scraping.dev/product/4"),
        ScrapeConfig("https://web-scraping.dev/product/5", render_js=True),
        ScrapeConfig("https://httpbin.dev/status/403", asp=True, retry=False),  # fails on purpose
        ScrapeConfig("https://httpbin.dev/status/400"),  # fails on purpose - routed to scrapy.spidermiddlewares.httperror.HttpError
        ScrapeConfig("https://httpbin.dev/status/404"),  # fails on purpose - routed to scrapy.spidermiddlewares.httperror.HttpError
    ]

    def start_requests(self):
        for scrape_config in self.start_urls:
            yield ScrapflyScrapyRequest(scrape_config, callback=self.parse, errback=self.error_handler, dont_filter=True)

    def error_handler(self, failure: Failure):
        if failure.check(ScraperAPIError):  # the scrape itself errored
            error_code = failure.value.code  # https://scrapfly.io/docs/scrape-api/errors#web_scraping_api_error
            if error_code == "ERR::ASP::SHIELD_PROTECTION_FAILED":
                self.logger.warning("The url %s must be retried" % failure.request.url)
        elif failure.check(HttpError):  # the scrape succeeded but the target server returned a non-success http code >= 400
            response: ScrapflyScrapyResponse = failure.value.response
            if response.status == 404:
                self.logger.warning("The url %s returned a 404 http code - page not found" % response.url)
            elif response.status == 500:
                raise CloseSpider(reason="The target server returned a 500 http code - website down")
        elif failure.check(ApiHttpServerError):  # generic API error: config error, quota reached, etc.
            self.logger.error(failure)
        else:
            self.logger.error(failure)

    def parse(self, response: ScrapflyScrapyResponse, **kwargs):
        item = Product()

        if response.status == 200:
            # make sure the image url is absolute
            item['image_urls'] = [response.urljoin(response.css('img.product-img::attr(src)').get())]
            item['name'] = response.css('h3.product-title').get()
            item['price'] = response.css('span.product-price::text').get()
            item['description'] = response.css('p.product-description').get()

            yield item
scrapy crawl demo -o product.csv
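The Product item exposes image_urls and images so the scraped images can also be downloaded by Scrapy's built-in images pipeline. A sketch of the settings.py additions needed to enable it (the IMAGES_STORE path is an assumption):

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './images'  # local directory where downloaded images are stored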