Scrapy - Web Scraping Framework


Introduction
Scrapy is a well-known web scraping framework written in Python and massively adopted by the community. This integration replaces all of the networking layer so that requests go through our API. The Scrapy documentation is available here.
The Scrapy integration is part of our Python SDK. The source code is available on GitHub, and the scrapfly-sdk package is available through PyPI.
pip install 'scrapfly-sdk[scrapy]'
What's Changed?
The Python API reference is available if you need details about the objects below; a minimal usage sketch follows the mapping.
Objects
scrapy.http.Request
->scrapfly.scrapy.request.ScrapflyScrapyRequest
scrapy.http.Response
->scrapfly.scrapy.response.ScrapflyScrapyResponse
scrapy.spiders.Spider
->scrapfly.scrapy.spider.ScrapflySpider
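To illustrate how these replacements fit together, here is a minimal spider sketch (the URL and selectors are placeholders, not part of the integration):
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyScrapyRequest, ScrapflyScrapyResponse, ScrapflySpider

class MySpider(ScrapflySpider):
    name = 'my_spider'
    # start_urls holds ScrapeConfig objects instead of plain URL strings
    start_urls = [ScrapeConfig(url='https://example.com')]

    def parse(self, response: ScrapflyScrapyResponse):
        # follow links with ScrapflyScrapyRequest instead of scrapy.Request
        yield ScrapflyScrapyRequest(
            scrape_config=ScrapeConfig(url=response.urljoin('/next')),
            callback=self.parse_next
        )

    def parse_next(self, response: ScrapflyScrapyResponse):
        yield {'title': response.css('title::text').get()}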
Middlewares
The following middlewares are disabled because they are not relevant when using Scrapfly (an illustrative snippet of what this means follows the list):
scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware
scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware
scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware
scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware
scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware
scrapy.downloadermiddlewares.redirect.RedirectMiddleware
scrapy.downloadermiddlewares.cookies.CookiesMiddleware
scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
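For reference, "disabled" here is what Scrapy does when a middleware is mapped to None in DOWNLOADER_MIDDLEWARES. The integration handles this for you, so the snippet below is only illustrative and does not need to be added to your settings:
# settings.py - illustrative only; the Scrapfly integration disables these automatically
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    # ... and so on for every middleware listed above
}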
The internal HTTP/HTTPS download handler is replaced (see the illustrative snippet below):
scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler
->scrapfly.scrapy.downloader.ScrapflyHTTPDownloader
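For reference, Scrapy selects download handlers through the DOWNLOAD_HANDLERS setting; the integration performs this substitution for you, so the snippet below only illustrates what the replacement means:
# settings.py - illustrative only; applied automatically by the integration
DOWNLOAD_HANDLERS = {
    'http': 'scrapfly.scrapy.downloader.ScrapflyHTTPDownloader',
    'https': 'scrapfly.scrapy.downloader.ScrapflyHTTPDownloader',
}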
Stats Collector
All Scrapfly metrics are prefixed with Scrapfly/. The following Scrapfly metrics are available:
- Scrapfly/api_call_cost - (int) Sum of billed API Credits against your quota
Complete documentation about stats collector is available here: https://docs.scrapy.org/en/latest/topics/stats.html
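For example, you can read this metric with Scrapy's standard stats API, e.g. when the spider closes (a minimal sketch; the spider name is a placeholder):
from scrapfly.scrapy import ScrapflySpider

class StatsExampleSpider(ScrapflySpider):
    name = 'stats_example'  # placeholder name

    def closed(self, reason):
        # standard Scrapy stats collector API; Scrapfly metrics are prefixed with "Scrapfly/"
        cost = self.crawler.stats.get_value('Scrapfly/api_call_cost')
        self.logger.info('Billed API credits for this crawl: %s', cost)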
Settings Configuration
SCRAPFLY_API_KEY = ''
CONCURRENT_REQUESTS = 2 # Adjust according to your plan's rate limit and your needs
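If you prefer not to hard-code the key, one common pattern (an assumption about your project, not a requirement of the integration) is to read it from an environment variable in settings.py:
# settings.py
import os

# assumes the key is exported as the SCRAPFLY_API_KEY environment variable
SCRAPFLY_API_KEY = os.environ.get('SCRAPFLY_API_KEY', '')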
How to use the equivalent of API parameters?
You can check out this section of the Python SDK to see how to configure your calls.
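As a sketch, API features are expressed through ScrapeConfig when you build start_urls or ScrapflyScrapyRequest objects; the parameters shown below (asp, render_js, country) follow the Python SDK, so check that section for the exact names and options available in your version:
from scrapfly import ScrapeConfig

config = ScrapeConfig(
    url='https://example.com',  # placeholder URL
    asp=True,                   # anti scraping protection
    render_js=True,             # render the page with a headless browser
    country='us'                # proxy country
)

# use it like any other ScrapeConfig, e.g. in start_urls or
# ScrapflyScrapyRequest(scrape_config=config, callback=self.parse)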
Troubleshooting
Scrapy Checkup
scrapy check
Check API Key setting
scrapy settings --get SCRAPFLY_API_KEY
tls_process_server_certificate - certificate verify failed
pip install --upgrade certifi
Example: Covid Data
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse

class CovidSpider(ScrapflySpider):
    name = 'covid'
    allowed_domains = ['www.worldometers.info']
    start_urls = [ScrapeConfig(url='https://www.worldometers.info/coronavirus')]

    def parse(self, response: ScrapflyScrapyResponse):
        # skip the header row and any hidden rows
        rows = response.xpath('//*[@id="main_table_countries_today"]//tr[position()>1 and not(contains(@style,"display: none"))]')
        for row in rows:
            country = row.xpath(".//td[2]/a/text()").get()
            totalCase = row.xpath(".//td[3]/text()").get()
            totalDeath = row.xpath(".//td[5]/text()").get()
            totalRecovered = row.xpath(".//td[7]/text()").get()
            activeCase = row.xpath(".//td[8]/text()").get()
            seriousCritical = row.xpath(".//td[9]/text()").get()
            population = row.xpath(".//td[14]/text()").get()
            yield {
                "CountryName": country,
                "Total Case": totalCase,
                "Total Deaths": totalDeath,
                "Total Recovered": totalRecovered,
                "Active Cases": activeCase,
                "Critical Cases": seriousCritical,
                "Population": population
            }
scrapy crawl covid -o covid.csv
Example: Download BEA Report
The BEA (the French civil aviation safety investigation and analysis bureau) collects, investigates, and reports on civil aviation incidents. In this spider, we iterate over a small portion of the reports and download the PDF report if one exists.
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse

class BEA(ScrapflySpider):
    name = "bea"
    allowed_domains = ["www.bea.aero", "bea.aero"]
    start_urls = [ScrapeConfig("https://bea.aero/en/investigation-reports/notified-events/?tx_news_pi1%5Baction%5D=searchResult&tx_news_pi1%5Bcontroller%5D=News&tx_news_pi1%5BfacetAction%5D=add&tx_news_pi1%5BfacetTitle%5D=year_intS&tx_news_pi1%5BfacetValue%5D=2016&cHash=408c483eae88344bf001f9cdbf653010")]

    def parse(self, response: ScrapflyScrapyResponse):
        # follow each search result to its report page
        for href in response.css('h1.search-entry__title > a::attr(href)').extract():
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(url=response.urljoin(href)),
                callback=self.parse_report
            )

    def parse_report(self, response: ScrapflyScrapyResponse):
        # request the first PDF link of each list item, if any
        for el in response.css('li > a[href$=".pdf"]:first-child'):
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(url=response.urljoin(el.attrib['href'])),
                callback=self.save_pdf
            )

    def save_pdf(self, response: ScrapflyScrapyResponse):
        # write the PDF body to the local "pdf" directory
        response.sink(path='pdf', name=response.url.split('/')[-1])
Don't forget to create the pdf directory (e.g. mkdir pdf) before running the spider:
scrapy crawl bea