Scrapy is the most popular web-scraping framework in the world, and it earns this title by being a highly performant, easily accessible and extendible framework.
In this web scraping in Python tutorial, we'll take a look at how to scrape with the Scrapy framework. We'll start by quickly introducing Scrapy and its related libraries, what makes up a scrapy project and some common tips and tricks.
Finally, we'll solidify this knowledge through an example scrapy project by scraping product data from producthunt.com.
Introduction To Scrapy
Scrapy for Python is a web scraping framework built around the Twisted asynchronous networking engine, which means it doesn't use Python's standard async/await infrastructure.
While it's important to be aware of the underlying architecture, we rarely need to touch Twisted, as scrapy abstracts it away with its own interface. From the user's perspective, we'll mostly be working with callbacks and generators.
Simplified relation between scrapy's `Crawler` and project's `Spiders`
As you can see in this illustration, scrapy comes with an engine called Crawler (light blue) which handles low-level logic like HTTP connections, scheduling and the overall program flow.
What it's missing is the high-level logic (dark blue) of what to scrape and how to do it - this is called a Spider. In other words, we must provide the crawler with a scrapy spider object that generates requests to retrieve and results to store.
Before we create our first Spider let's start off with a short glossary:
Callback
Since scrapy is an asynchronous framework, a lot of actions happen in the background, which allows us to produce highly concurrent and efficient code. A callback is a function that we attach to a background task and that is called upon the successful completion of that task.
Errorback
Same as a callback, but called when a task fails rather than succeeds.
Generator
In Python, generators are functions that, instead of returning all results at once (like a list), return them one by one (see the short sketch after this glossary).
Settings
Scrapy is configured through a central configuration object called settings. Project settings are located in the settings.py file.
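For example, here is a tiny Python sketch of a generator producing values one at a time; scrapy spiders rely on the same yield mechanism to produce requests and results:
def count_to(n):
    # instead of building and returning a full list,
    # yield each number as it's requested
    for i in range(1, n + 1):
        yield i

for number in count_to(3):
    print(number)  # prints 1, then 2, then 3 - one value at a time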
It's important to visualize this architecture, as this is the core working principle of all scrapy-based scrapers: we'll write generators that generate either requests with callbacks or results that will be saved to storage.
In this section, we'll introduce ourselves to scrapy through an example project. We'll be scraping product data from https://www.producthunt.com/: our scraper will crawl topic directory pages for product links and collect each product's details, such as the title, subtitle and vote score.
Scrapy can be installed through the pip install scrapy command, and it comes with a convenient terminal command: scrapy.
Installation of scrapy might be a bit more complex on some systems - see the official scrapy installation guide for more information.
The scrapy command has 2 possible contexts: global context and project context. In this article we'll focus on the project context, and for that we first must create a scrapy project:
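For example, creating a project named producthunt (the name used throughout this article) looks roughly like this; the exact files may vary slightly between Scrapy versions:
$ scrapy startproject producthunt
$ cd producthunt
$ tree
.
├── producthunt
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg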
As you can see, the startproject command created this project structure for us, which is mostly empty. However, if we run the scrapy --help command in this new directory, we'll notice a bunch of new commands - now we're working in the project context:
$ scrapy --help
Scrapy 1.8.1 - project: producthunt
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Creating Spiders
Currently, we have no scrapy spiders in our project; if we run scrapy list it'll show us nothing - so let's create our first spider:
$ scrapy genspider products producthunt.com
# ^ name ^ host we'll be scraping
Created spider 'products' using template 'basic' in module:
producthunt.spiders.products
$ tree
.
├── producthunt
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ ├── products.py <--- New spider
└── scrapy.cfg
$ scrapy list
products
# 1 spider has been found!
The generated spider doesn't do much other than give us a starting framework:
# /spiders/products.py
import scrapy
class ProductsSpider(scrapy.Spider):
name = 'products'
allowed_domains = ['producthunt.com']
start_urls = ['http://producthunt.com/']
def parse(self, response):
pass
Let's take a look at these fields:
name is used as a reference to this spider for scrapy commands, such as scrapy crawl <name>, which would run this scraper.
allowed_domains is a safety feature that restricts this spider to crawling only the listed domains. It's not very useful in this example, but it's good practice to have it configured to prevent the spider from accidentally wandering off and scraping some other website.
start_urls indicates the starting point and parse() is the first callback. Scrapy spiders start their work by connecting to each of the start urls, calling back the parse() method and following whatever instructions this method produces.
Adding Crawling Logic
As per our example logic, we want our start_urls to be some topic directories (like https://www.producthunt.com/topics/developer-tools) and in our parse() callback method we want to find all product links and schedule them to be scraped:
# /spiders/products.py
import scrapy
from scrapy.http import Response, Request
class ProductsSpider(scrapy.Spider):
name = 'products'
allowed_domains = ['producthunt.com']
start_urls = [
'https://www.producthunt.com/topics/developer-tools',
'https://www.producthunt.com/topics/tech',
]
def parse(self, response: Response):
product_urls = response.xpath(
"//main[contains(@class,'layoutMain')]//a[contains(@class,'_title_')]/@href"
).getall()
for url in product_urls:
# convert relative url (e.g. /products/slack)
# to absolute (e.g. https://producthunt.com/products/slack)
url = response.urljoin(url)
yield Request(url, callback=self.parse_product)
# or use this shortcut in scrapy >= 2.0
# yield from response.follow_all(product_urls, callback=self.parse_product)
def parse_product(self, response: Response):
print(response)
We've updated our start_urls with a couple of directory links. Further, we've updated our parse() callback with some crawling logic: we find product URLs using an XPath selector and, for each one of them, generate another request that calls back to the parse_product() method.
With our basic crawling logic complete, let's add our parsing logic. For Producthunt products we want to extract these fields: title, subtitle, vote score and review count:
Let's parse the fields highlighted in blue
Let's populate our parse_product() callback with this parsing logic:
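Here's a minimal sketch of this callback; the XPath selectors mirror the ones used in the ScrapFly version of this spider later in the article, and may need adjusting if Producthunt's markup changes:
# /spiders/products.py
def parse_product(self, response: Response):
    yield {
        # product name, e.g. "Slack"
        'title': response.xpath('//h1/text()').get(),
        # short product description displayed under the title
        'subtitle': response.xpath('//h1/following-sibling::div//text()').get(),
        # upvote score, e.g. "1,023"
        'votes': response.xpath("//*[contains(.//text(),'upvotes')]/preceding-sibling::*//text()").get(),
        # review count
        'reviews': response.xpath("//*[contains(text(),'reviews')]/preceding-sibling::*/text()").get(),
    }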
Here, we used a few clever XPaths to select our marked up fields.
Finally, we can test our scraper. Before we run the scrapy crawl products command, though, let's take a look at the default settings, as they might get in the way of our scraping.
Basic Settings
By default, a new Scrapy project sets very few options and relies on the built-in defaults, which aren't always optimal. Let's take a look at the basic recommended settings:
# settings.py
# will ignore /robots.txt rules that might prevent scraping
ROBOTSTXT_OBEY = False
# will cache all requests to the /httpcache directory which makes running spiders in development much quicker
# tip: to refresh cache just delete /httpcache directory
HTTPCACHE_ENABLED = True
# while developing we want to see debug logs
LOG_LEVEL = "DEBUG" # or "INFO" in production
# to avoid basic bot detection we want to set some basic headers
DEFAULT_REQUEST_HEADERS = {
# use a User-Agent string of a common web browser
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'en',
}
With these settings we are ready to run our scraper!
Running Spiders
There are 2 ways to run Scrapy spiders: through the scrapy command or by calling Scrapy via a Python script explicitly. It's often recommended to use the Scrapy CLI tool, since scrapy is a rather complex system and it's safer to give it a dedicated Python process.
We can run our products spider through scrapy crawl products command:
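Run it from the project's root directory (the one containing scrapy.cfg):
$ scrapy crawl products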
Scrapy provides detailed logs that record everything the scrapy engine is doing as well as any returned results. At the end of the process, scrapy also attaches some useful scrape statistics - like how many items were scraped, how long it took for our scraper to finish and so on.
🤖 Running Scrapy via a Python script is a bit more complicated and we recommend taking a look at the official recipe
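As a rough sketch (based on Scrapy's CrawlerProcess API; see the official recipe for the full details), it could look something like this:
# run.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from producthunt.spiders.products import ProductsSpider

if __name__ == "__main__":
    # load the project's settings.py
    process = CrawlerProcess(get_project_settings())
    # schedule the spider and start the (blocking) crawl
    process.crawl(ProductsSpider)
    process.start()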
Saving Results
We have a spider which successfully scrapes product data and prints results to logs. If we want to save the results to a file we can either update our scrapy crawl command with an output flag:
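For example, the -o flag appends results to the given file, while the capital -O flag (available in newer Scrapy versions) overwrites it:
$ scrapy crawl products -o producthunt.json
# or overwrite the file on every run:
$ scrapy crawl products -O producthunt.json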
Alternatively, we can configure the FEEDS setting, which will automatically store all data in a file:
# settings.py
FEEDS = {
# location where to save results
'producthunt.json': {
# file format like json, jsonlines, xml and csv
'format': 'json',
# use unicode text encoding:
'encoding': 'utf8',
# whether to export empty fields
'store_empty': False,
# we can also restrict to export only specific fields like: title and votes:
'fields': ["title", "votes"],
# if True, every run will create a new file; if False, every run will append results to the existing one
'overwrite': True,
},
}
This setting allows us to configure multiple outputs for our scraped data in great detail. Scrapy supports many feed exporters by default, such as Amazon S3 and Google Cloud Storage, and there are many community extensions that add support for other data storage services and types.
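For example, pointing a feed at Amazon S3 could look like this sketch (the bucket name is hypothetical, and the botocore package plus valid AWS credentials are required):
# settings.py
AWS_ACCESS_KEY_ID = "YOUR AWS KEY"
AWS_SECRET_ACCESS_KEY = "YOUR AWS SECRET"
FEEDS = {
    # %(time)s is replaced with a timestamp on every run
    's3://my-bucket/producthunt/%(time)s.json': {
        'format': 'jsonlines',
    },
}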
Scrapy is a very configurable framework, as it provides a lot of space for various extensions through middlewares, pipelines and general extension slots. Let's take a quick look at these and how we can improve our example project with some custom extensions.
Middlewares
Scrapy provides convenient interception points for many actions the web scraping engine performs. For example, downloader middlewares let us pre-process outgoing requests and post-process incoming responses. We can use this to design custom connection logic like retrying some requests, dropping others or implementing connection caching.
For example, let's update our Producthunt spider with a middleware that drops some requests and modifies some responses. If we open up the generated middlewares.py file, we can see that scrapy startproject already generated a template for us:
# middlewares.py
...
class ProducthuntDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
So, to process every request the spider makes we use the process_request() method, and likewise for responses we use process_response(). Let's drop the scraping of all products that start with the letter "s":
# add at the top of middlewares.py:
from scrapy.exceptions import IgnoreRequest

def process_request(self, request, spider):
    if 'posts/s' in request.url.lower():
        raise IgnoreRequest(f'skipping product starting with letter "s": {request.url}')
    return None
Then, let's presume that Producthunt redirects all expired products to /posts/expired - we should drop these responses:
def process_response(self, request, response, spider):
    if 'posts/expired' in response.url.lower():
        raise IgnoreRequest(f'skipping expired product: {request.url}')
    return response
With our middleware ready, the last step is to activate it in our settings:
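In settings.py this is done through the DOWNLOADER_MIDDLEWARES setting; the priority of 500 here is our own choice, placing the middleware just before the default RetryMiddleware (550):
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # <middleware path>: <priority between 0 and 1000>
    'producthunt.middlewares.ProducthuntDownloaderMiddleware': 500,
}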
This setting contains a dictionary of middleware paths and their priority levels - which are usually specified as integers from 0 to 1000. Priority is necessary to handle interaction between multiple middlewares as Scrapy by default already comes with over 10 middlewares enabled!
Typically, we want to include our middleware somewhere in the middle - before the 550 RetryMiddleware, which handles common connection retries. That being said, it's recommended to familiarize yourself with the default middlewares to find the sweet spot where your middleware can produce stable results. You can find the list of default middlewares in the official settings documentation page.
Middlewares provide us with a lot of power when it comes to controlling the flow of our connections, likewise pipelines can provide us with a lot of power when controlling our data output - let's take a look at them!
Pipelines
Pipelines are essentially data post-processors. Whenever our spider generates some results they are being piped through registered pipelines and the final output is sent to our feed (be it a log or a feed export).
Let's add an example pipeline to our Producthunt spider which will drop low score products:
# pipelines.py
from scrapy.exceptions import DropItem

class ProducthuntPipeline(object):
    def process_item(self, item, spider):
        # votes are scraped as text, e.g. "1,023" - normalize before comparing
        votes = int((item.get('votes') or '0').replace(',', ''))
        if votes < 100:
            raise DropItem(f"dropped item of score: {item.get('votes')}")
        return item
As with middlewares, we also need to activate our pipelines in the settings file:
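A minimal activation in settings.py could look like this (300 is an arbitrary priority within the usual 0 to 1000 range):
# settings.py
ITEM_PIPELINES = {
    'producthunt.pipelines.ProducthuntPipeline': 300,
}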
Since Scrapy doesn't enable any pipelines by default, in this case we can set the priority to almost any value, though it's good practice to keep it in the same 0 to 1000 range. With this pipeline, every time we run scrapy crawl products, all generated results will be filtered through our votes filtering logic before they are passed to the final output.
We've taken a look at the two most common ways of extending scrapy: downloader middlewares, which allow us to control requests and responses, and pipelines, which let us control the output. These are very powerful tools that provide an elegant way of solving common web scraping challenges, so let's take a look at some of these challenges and the existing solutions that are out there.
Common Challenges
While scrapy is a big framework, it focuses on performance and a robust set of core features, which often means we need to solve common web scraping challenges either through community or custom extensions.
The most common challenge when web scraping is scraper blocking. For this, the Scrapy community provides various plugins for proxy management, like scrapy-rotating-proxies, and scrapy-fake-useragent for randomizing User-Agent headers. Additionally, there are extensions that provide browser emulation, like scrapy-playwright and scrapy-selenium.
For scaling, there are various task distribution extensions such as scrapy-redis and scrapy-cluster, which allow scaling huge scraping projects through Redis and Kafka services, as well as scrapy-deltafetch, which persistently skips requests for pages that already produced items, optimizing repeated scrapes.
Finally, for monitoring, Scrapy has integrations with major monitoring services such as Sentry via scrapy-sentry, or the general monitoring utility scrapy-spidermon.
Scrapy + ScrapFly
While scrapy is a very powerful and accessible web scraping framework, it doesn't help much with solving the biggest web scraping problem of all - access blocking.
ScrapFly provides an easy scrapy integration through ScrapFly's python-sdk, which lets us take advantage of ScrapFly features such as:
Javascript Rendering
Since scrapy is a pure Python framework, it doesn't provide any javascript rendering like web browsers do, meaning some dynamic web content is impossible to reach without reverse engineering the embedded javascript functionality. ScrapFly's middleware uses automated browsers to render javascript and then passes the result back to scrapy, which gets us all the benefits of browser rendering along with scrapy's speed!
Anti Scraping Protection Bypass
ScrapFly offers an Anti Scraping Protection (ASP) solution that solves captchas and bypasses various anti-bot measures automatically!
Smart Proxies
While scrapy has built-in proxy support, it doesn't offer a smart way of distributing and managing them. ScrapFly automatically applies proxies that fit the request and distributes the load across many proxies for the fastest scraping experience.
To migrate to ScrapFly's scrapy integration, all we have to do is replace the base Spider object with ScrapflySpider and yield ScrapflyScrapyRequest objects instead of scrapy's Requests.
Let's see what our Producthunt scraper would look like with ScrapFly's SDK:
# /spiders/products.py
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse
class ProductsSpider(ScrapflySpider):
name = 'products'
allowed_domains = ['producthunt.com']
start_urls = [
ScrapeConfig(url='https://www.producthunt.com/topics/developer-tools')
]
def parse(self, response: ScrapflyScrapyResponse):
product_urls = response.xpath(
"//main[contains(@class,'layoutMain')]//a[contains(@class,'_title_')]/@href"
).getall()
for url in product_urls:
yield ScrapflyScrapyRequest(
scrape_config=ScrapeConfig(
url=response.urljoin(url),
# we can render javascript via browser automation
render_js=True,
# we can get around anti bot protection
asp=True,
# we can use a specific proxy country
country='us',
),
callback=self.parse_product
)
def parse_product(self, response: ScrapflyScrapyResponse):
yield {
'title': response.xpath('//h1/text()').get(),
'subtitle': response.xpath('//h1/following-sibling::div//text()').get(),
'votes': response.xpath("//*[contains(.//text(),'upvotes')]/preceding-sibling::*//text()").get(),
'reviews': response.xpath("//*[contains(text(),'reviews')]/preceding-sibling::*/text() ").get(),
}
# settings.py
SCRAPFLY_API_KEY = 'YOUR API KEY'
CONCURRENT_REQUESTS = 2
We've got all the benefits of the ScrapFly service just by replacing these few scrapy classes with the ones from the ScrapFly SDK! We can even toggle which features we want to apply to each individual request by configuring keyword arguments in the ScrapflyScrapyRequest object.
Before we wrap up let's take a look at some frequently asked questions about using Scrapy for web scraping.
Can Selenium be used with Scrapy?
Selenium is a popular web browser automation framework in Python; however, because of their differing architectures, making scrapy and selenium work together is tough.
Check out these open source attempts scrapy-selenium and scrapy-headless.
Alternatively, we recommend taking a look at scrapy + splash extension scrapy-splash.
How to scrape dynamic web pages with Scrapy?
We can use browser automation tools like Selenium though it's hard to make them work well with Scrapy. ScrapFly's scrapy extension also offers a javascript rendering feature.
Alternatively, a lot of dynamic web page data is actually hidden in the HTML body; for more, see How to Scrape Hidden Web Data.
Summary
In this Scrapy tutorial, we started with a quick architecture overview: what callbacks, errorbacks and the whole asynchronous ecosystem are.
To get the hang of scrapy spiders we started an example scrapy project for https://www.producthunt.com/ product listings. We covered scrapy project basics - how to start a project, create spiders and how to parse HTML content using XPath selectors.
We've also introduced ourselves to the two main ways of extending scrapy: downloader middlewares, which process outgoing requests and incoming responses, and pipelines, which process the scraped results.
Finally, we wrapped everything up with some highlights of great scrapy extensions and ScrapFly's own integration, which solves the major access issues a performant web scraper might encounter. For more, we recommend referring to the official scrapy documentation, and for community help, the very helpful #scrapy tag on StackOverflow.