Scrapy is the most popular web scraping framework out there. It earns this reputation by being highly performant, easily accessible and extensible.
In this Python web scraping tutorial, we'll explain how to scrape with Scrapy. We'll start by introducing ourselves to Scrapy, its related components, and some common tips and tricks. Finally, we will apply all the details we mention through an example web scraping project with Scrapy. Let's get started!
What is Scrapy?
Scrapy is a Python web scraping framework built around Twisted, an asynchronous networking engine. This means that it doesn't use the standard asyncio approach of modern Python. Instead, it uses its own event-driven networking infrastructure, allowing for more efficiency and scalability.
That being said, we don't have to interact with the underlying architecture. Scrapy abstracts it away with its own interface. From the development perspective, we'll mostly deal with the equivalent logic in callbacks and generators.
The above illustration explains the Scrapy architecture in simple terms. Scrapy comes with an engine called Crawler (light blue). It handles the low-level logic, such as the HTTP connection, scheduling and the entire execution flow.
On the other hand, the high-level logic (dark blue) is missing. It's called Spider, which handles the scraping logic and how to perform it. In simple terms, we must provide the Crawler with a Spider object to generate the requests, parse and retrieve the data to store.
Now, before we create our first Spider to web scrape with Scrapy, let's start off by defining some common Scrapy terms:
Callback
Scrapy is an asynchronous framework. Therefore, most of the actions are executed in the background, which allows for highly concurrent and efficient logic. In this context, a callback is a function that's attached to a background task and is called upon the successful completion of that task.
Errorback
A function similar to the callback, but it is triggered when a task fails instead of when it succeeds.
Generator
In Python, generators are functions that return results one at a time instead of all at once like a list.
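For instance, here is a minimal generator (the function name is just illustrative):

```python
# a generator yields its results one at a time instead of
# building and returning a full list at once
def iter_squares(n):
    for i in range(n):
        yield i * i

squares = list(iter_squares(4))
print(squares)  # [0, 1, 4, 9]
```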
Settings
Scrapy's central configuration object, called settings, located in the settings.py file of the project.
It's essential to visualize this architecture, as it's the core working principle of all Scrapy web scrapers. We'll write generators that generate either requests with callbacks or results that will be saved to storage.
How to Install Scrapy?
Scrapy can be installed using pip, which will also add a convenient scrapy terminal command for managing the project:
pip install scrapy
Installing Scrapy for other systems, such as Anaconda and Ubuntu, can be complex. For detailed instructions, refer to the official Scrapy installation guide.
Web Scraping With Scrapy
In this section, we'll explain using Scrapy for web scraping through an example project. We'll be scraping product data from web-scraping.dev. We'll write a scraper that will:
Extract each product's title, price, description and image.
Start Scrapy Project
The scrapy command has two possible contexts: the global context and the project context. In this Scrapy tutorial, we'll focus on using the project context. Therefore, we must create a Scrapy project first:
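Assuming the project name used throughout this article (webscrapingdev), the project is created with the startproject command:

```shell
$ scrapy startproject webscrapingdev
$ cd webscrapingdev
```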
We can see that the startproject command created a project in the illustrated structure. However, running the scrapy --help command in the newly created directory will result in a few new commands as we are in the project context:
$ scrapy --help
Scrapy 1.8.1 - project: webscrapingdev
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Creating Spiders
At the moment, our Scrapy project doesn't include any Spiders. Running the scrapy list command will return nothing. So, let's create our first Spider:
$ scrapy genspider products web-scraping.dev
# ^ name ^ host we'll be scraping
Created spider 'products' using template 'basic' in module:
webscrapingdev.spiders.products
$ tree
.
├── webscrapingdev
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ ├── products.py <--- New spider
└── scrapy.cfg
$ scrapy list
products
# 1 spider has been found!
The generated spider doesn't provide much except for a starting boilerplate:
# /spiders/products.py
import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["web-scraping.dev"]
    start_urls = ["https://web-scraping.dev"]

    def parse(self, response):
        pass
Let's break down the above idioms:
name is used as a reference to this spider for scrapy commands such as scrapy crawl <name>, which would run this scraper.
allowed_domains is a safety feature that restricts this spider to crawling only particular domains. It's not very useful in this example, but it's a good practice to have it configured. It reduces accidental errors where the spider could wander off and scrape some other domains accidentally.
start_urls indicates the spider's starting point, and parse() is the first callback, executed for the responses of those starting URLs.
Adding Crawling Logic
As per our example logic, we want the start_urls to be the starting point of our scraping logic. Next, the scraper should call the parse() function to find all the product links to schedule them for scraping as well:
# /spiders/products.py
import scrapy
from scrapy.http import Response, Request


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['web-scraping.dev']
    start_urls = [
        'https://web-scraping.dev/products',
    ]

    def parse(self, response: Response):
        product_urls = response.xpath(
            "//div[@class='row product']/div/h3/a/@href"
        ).getall()
        for url in product_urls:
            yield Request(url, callback=self.parse_product)
        # or shortcut in scrapy >2.0:
        # yield from response.follow_all(product_urls, callback=self.parse_product)

    def parse_product(self, response: Response):
        print(response)
We've updated our start_urls with the main page containing the product URLs. Further, we've updated our parse() callback with some crawling logic: we find product URLs using the XPath selector, and for each one of them, we generate another request that calls back to the parse_product() method.
Adding Parsing Logic
With our basic crawling logic complete, let's proceed with the parsing logic. For the products, we want to extract specific fields: title, price, image and description:
Let's populate our parse_product() callback with the equivalent parsing logic:
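The method below replaces the placeholder parse_product() we generated earlier; the XPath class names are taken from web-scraping.dev's product page markup:

```python
# /spiders/products.py
# extract each field with an XPath selector and yield the result as an item
def parse_product(self, response):
    yield {
        "title": response.xpath("//h3[contains(@class, 'product-title')]/text()").get(),
        "price": response.xpath("//span[contains(@class, 'product-price')]/text()").get(),
        "image": response.xpath("//div[contains(@class, 'product-image')]/img/@src").get(),
        "description": response.xpath("//p[contains(@class, 'description')]/text()").get(),
    }
```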
Here, we used a few clever XPaths to select the desired fields on the HTML. Our Scrapy scraper can scrape the above fields using the scrapy crawl products command. However, let's have a look at the default settings, as it might get in our way.
Basic Settings
By default, Scrapy doesn't include many settings and relies on the built-in defaults, which aren't always optimal. Let's take a look at the basic recommended settings:
# settings.py
# will ignore /robots.txt rules that might prevent scraping
ROBOTSTXT_OBEY = False
# will cache all requests to a /httpcache directory, which makes running spiders in development much quicker
# tip: to refresh the cache, just delete the /httpcache directory
HTTPCACHE_ENABLED = True
# while developing we want to see debug logs
LOG_LEVEL = "DEBUG"  # or "INFO" in production
# to avoid basic bot detection we want to set some basic headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en',
}
With the above settings adjusted, we are ready to execute our scraper!
Running Spiders
There are two ways to run Scrapy spiders: either through the scrapy command or by explicitly calling Scrapy via a Python script. It is generally recommended to use the Scrapy CLI tool, as Scrapy is a complex system, and it is safer to provide it with a dedicated Python process.
Let's run our products spider through the scrapy crawl products command:
$ scrapy crawl products
...
2024-02-16 19:05:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://web-scraping.dev/product/4> (referer: https://web-scraping.dev/products)
2024-02-16 19:05:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://web-scraping.dev/product/3>
{'title': 'Teal Energy Potion', 'price': '$4.99', 'image': 'https://web-scraping.dev/assets/products/teal-potion.webp', 'description': "Experience a surge of vitality with our 'Teal Potion', an exceptional energy drink designed for the gaming community. With its intriguing teal color and a flavor that keeps you asking for more, this potion is your best companion during those long gaming nights. Every sip is an adventure - let the quest begin!"}
Scrapy provides detailed logs of everything the engine performs, as well as the returned results. Moreover, Scrapy attaches some useful scrape statistics, such as how many items were scraped and how long it took for our scraper to finish.
🤖 Running Scrapy using Python script is a bit more complicated and we recommend following the official recipe.
Saving Results
We have a spider that scrapes product data successfully and logs the results. To save to a file, we can update our 'scrapy crawl' command with an output flag:
$ scrapy crawl products --output results.json
Here is a sample output to the results we got:
[
  {
    "title": "Blue Energy Potion",
    "price": "$4.99",
    "image": "https://web-scraping.dev/assets/products/blue-potion.webp",
    "description": "Ignite your gaming sessions with our 'Blue Energy Potion', a premium energy drink crafted for dedicated gamers. Inspired by the classic video game potions, this energy drink provides a much-needed boost to keep you focused and energized. It's more than just an energy drink - it's an ode to the gaming culture, packaged in an aesthetically pleasing potion-like bottle that'll make you feel like you're in your favorite game world. Drink up and game on!"
  },
  ...
]
Alternatively, we can configure the FEEDS setting, which will automatically store all data in a file:
# settings.py
FEEDS = {
    # location where to save results
    'results.json': {
        # file format like json, jsonlines, xml and csv
        'format': 'json',
        # use unicode text encoding:
        'encoding': 'utf8',
        # whether to export empty fields
        'store_empty': False,
        # we can also restrict the export to specific fields like title and price:
        'fields': ["title", "price"],
        # every run will create a new file; if set to False, every run will append results to the existing ones
        'overwrite': True,
    },
}
This setting allows configuring multiple output storages for the scraped data in great detail. Scrapy supports several feed exporters by default, such as Amazon S3 and Google Cloud Storage, and community extensions add support for many other data storage services and formats.
Scrapy is a very configurable framework, as it provides a wide space for various extensions through middlewares, pipelines and general extension slots. Let's have a quick look at how we can improve our Scrapy web scraping project with some custom extensions.
Adding Scrapy Middlewares
Scrapy offers various convenient interception points for different actions performed by the web scraping engine. For example, downloader middleware allows for the pre-processing of outgoing requests and the post-processing of incoming responses. We can use this to design custom connection logic, such as retrying requests, dropping others or implementing connection caching.
For example, let's update our Product spider with a middleware that drops and modifies the responses. If we open up the generated middlewares.py file, we can already see that the scrapy startproject has generated a template.
# middlewares.py
...


class WebscrapingdevDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
So, to process all requests the spider makes, we use the process_request() method, and likewise for responses, we use process_response(). Let's drop the scraping of all products with an ID starting with 1:
from scrapy.exceptions import IgnoreRequest

def process_request(self, request, spider):
    if 'product/1' in request.url.lower():
        raise IgnoreRequest(f'skipping product with the ID "1": {request.url}')
    return None
Similarly, we can drop the responses for products whose URL contains "expires":
def process_response(self, request, response, spider):
    if 'product/expires' in response.url.lower():
        raise IgnoreRequest(f'skipping expired product: {request.url}')
    return response
With our middleware ready, the last step is to activate it in our settings:
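A sketch of the activation in settings.py; the 543 priority is the value Scrapy's own project template suggests, but any value that slots our middleware before the retry middleware works:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # path to our middleware class and its priority (integers from 0 to 1000)
    'webscrapingdev.middlewares.WebscrapingdevDownloaderMiddleware': 543,
}
```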
The above setting contains a dictionary of middleware paths and their priority levels - which are usually specified as integers from 0 to 1000. The priority level is necessary to handle interaction between multiple middlewares as Scrapy, by default, already comes with over 10 middlewares enabled!
Typically, we want to include our middleware somewhere in the middle - before the RetryMiddleware at priority 550, which handles common connection retries. That being said, it's recommended to familiarize yourself with the default middlewares to find the sweet spot where your middleware can produce stable results. You can find the list of default middlewares on the official settings documentation page.
Middlewares provide us with a lot of power when it comes to controlling the flow of our connections, likewise pipelines can provide us with a lot of power when controlling our data output - let's take a look at them!
Pipelines
Pipelines are essentially data post-processors. Whenever our spider generates some results, they are piped through registered pipelines, and the final output is sent to our feed (be it a log or a feed export).
Let's add an example pipeline to our Product spider, which will drop low price products:
# pipelines.py
from scrapy.exceptions import DropItem


class WebscrapingdevPipeline:
    def process_item(self, item, spider):
        price = item.get('price') or '$0'
        if float(price.replace('$', '')) < 5:
            raise DropItem(f"dropped item of price: {item.get('price')}")
        return item
As with middlewares, we also need to activate our pipelines in the settings file:
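A sketch of the activation in settings.py; the 300 priority is again just the value Scrapy's project template suggests:

```python
# settings.py
ITEM_PIPELINES = {
    # path to our pipeline class and its priority (0 to 1000)
    'webscrapingdev.pipelines.WebscrapingdevPipeline': 300,
}
```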
Since Scrapy doesn't include any default pipelines, we can set the priority to any value in this case, but it's good practice to stay within the same 0 to 1000 range. With this pipeline, every time we run scrapy crawl products, all generated results will be filtered through our price filtering logic before being transported to the final output.
We've taken a look at the two most common ways of extending Scrapy: downloader middlewares, which allows us to control requests and responses, and pipelines, which allow us to control the output.
These are very powerful tools that provide an elegant way of solving common web scraping challenges, so let's take a look at some of these challenges and the existing solutions that are out there.
Scrapy Limitations
While Scrapy is a big framework, it focuses on performance and a robust set of core features, which often means we need to solve common web scraping challenges through either community or custom extensions.
The most common challenge when web scraping is scraper blocking. For this, the Scrapy community provides various plugins, such as scrapy-rotating-proxies for proxy management and scrapy-fake-useragent for randomizing User-Agent headers. Additionally, there are extensions that provide browser emulation, like scrapy-playwright and scrapy-selenium.
For scaling, there are various task distribution extensions, such as scrapy-redis and scrapy-cluster, which allow scaling huge scraping projects through Redis and Kafka services, as well as scrapy-deltafetch, which provides easy persistent caching of seen requests for optimizing repeated scrapes.
Finally, for monitoring, Scrapy has integrations with major monitoring services, such as Sentry via scrapy-sentry, as well as the general monitoring utility scrapy-spidermon.
Scrapy + ScrapFly
While scrapy is a very powerful and accessible web scraping framework, it doesn't help much with solving the biggest web scraping problem of all - access blocking.
To migrate to ScrapFly's scrapy integration, all we have to do is replace the base Spider object with ScrapflySpider and yield ScrapflyScrapyRequest objects instead of scrapy's Request objects.
Let's see how our products scraper would look in ScrapFly's SDK:
# /spiders/products.py
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse


class ProductsSpider(ScrapflySpider):
    name = 'products'
    allowed_domains = ['web-scraping.dev']
    start_urls = [
        ScrapeConfig(url='https://web-scraping.dev/products')
    ]

    def parse(self, response: ScrapflyScrapyResponse):
        product_urls = response.xpath(
            "//div[@class='row product']/div/h3/a/@href"
        ).getall()
        for url in product_urls:
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(
                    url=response.urljoin(url),
                    # we can render javascript via browser automation
                    render_js=True,
                    # we can get around anti bot protection
                    asp=True,
                    # we can use a specific proxy country
                    country='us',
                ),
                callback=self.parse_product,
            )

    def parse_product(self, response: ScrapflyScrapyResponse):
        yield {
            "title": response.xpath("//h3[contains(@class, 'product-title')]/text()").get(),
            "price": response.xpath("//span[contains(@class, 'product-price')]/text()").get(),
            "image": response.xpath("//div[contains(@class, 'product-image')]/img/@src").get(),
            "description": response.xpath("//p[contains(@class, 'description')]/text()").get(),
        }

# settings.py
SCRAPFLY_API_KEY = 'YOUR API KEY'
CONCURRENT_REQUESTS = 2
We get all the benefits of the ScrapFly service just by replacing a few scrapy classes with their ScrapFly SDK equivalents! We can even toggle which features to apply to each individual request by configuring keyword arguments on the ScrapflyScrapyRequest object.
Before we wrap up this guide, let's take a look at some frequently asked questions about using Scrapy for web scraping.
Can I use Selenium with Scrapy?
Selenium is a popular web browser automation framework in Python; however, because of their differing architectures, making Scrapy and Selenium work together is difficult.
Check out these open source attempts scrapy-selenium and scrapy-headless.
Alternatively, we recommend taking a look at scrapy + splash extension scrapy-splash.
How to scrape dynamic web pages with Scrapy?
We can use browser automation tools like Selenium though it's hard to make them work well with Scrapy. ScrapFly's scrapy extension also offers a javascript rendering feature.
Alternatively, a lot of dynamic web page data is actually hidden in the web body, for more see How to Scrape Hidden Web Data.
Web Scraping With Scrapy Summary
In this Scrapy tutorial, we started with a quick architecture overview: what are callbacks, errorbacks and the whole asynchronous ecosystem.
To get the hang of Scrapy spiders we started an example scrapy project for web-scraping.dev/products/ product listings. We covered scrapy project basics - how to start a project, create spiders and how to parse HTML content using XPath selectors.
We have also explained the two main ways of extending Scrapy. The first one is downloader middleware, which processes outgoing requests and incoming responses. The second one is - pipelines, which process the scraped results.
Finally, we wrapped everything up with some highlights of great Scrapy extensions and ScrapFly's own integration, which solves the major access issues a performant web scraper might encounter. For more, we recommend referring to the official Scrapy documentation, and for community help, the very helpful #scrapy tag on StackOverflow.