What are scrapy pipelines and how to use them?

Scrapy pipelines are data processing extensions that handle every item a spider yields before it is exported or stored. Scrapy pipelines are often used to:

  • Enhance scraped data with metadata fields, such as the date the item was scraped.
  • Validate scraped data for errors, such as missing or malformed item fields.
  • Store scraped data in a database or cloud storage (not recommended; use feed exporters instead).

Most commonly, pipelines are used to modify scraped data. For example, to add the scrape datetime to every scraped item, we could use this pipeline:

# define our pipeline code:
# pipelines.py
import datetime

class AddScrapedDatePipeline:
    def process_item(self, item, spider):
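        # timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        current_utc_datetime = datetime.datetime.now(datetime.timezone.utc)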
        item['scraped_date'] = current_utc_datetime.isoformat()
        return item

# settings.py
# activate pipeline in settings:
# the number (0-1000) sets the run order; lower values run first
ITEM_PIPELINES = {
    'your_project_name.pipelines.AddScrapedDatePipeline': 300,
}
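
Since a pipeline is a plain Python class, one quick way to check what it does is to call process_item directly, without running a crawl. The snippet below is a minimal sketch under a few assumptions: the item is a plain dict, the spider argument is unused (so None is passed), and the sample title is made up for illustration. The class is repeated here only so the snippet runs on its own.

```python
import datetime


class AddScrapedDatePipeline:
    def process_item(self, item, spider):
        # attach a timezone-aware UTC timestamp in ISO 8601 format
        now = datetime.datetime.now(datetime.timezone.utc)
        item["scraped_date"] = now.isoformat()
        return item


# call the pipeline by hand with a sample item (hypothetical data)
pipeline = AddScrapedDatePipeline()
item = pipeline.process_item({"title": "example product"}, spider=None)

print(item["title"])
print("scraped_date" in item)
```

In a real project Scrapy creates the pipeline and calls process_item for every yielded item, so this direct call is only useful for quick checks or unit tests.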

Pipelines are an easy and flexible way to control Scrapy item output with very little extra code. Finally, here are some popular use cases for Scrapy pipelines that can help you understand their potential:

  • Use cerberus to validate scraped item fields.
  • Use pymongo to store scraped items in MongoDB.
  • Use the Google Sheets API to store scraped items in Google Sheets.
  • Raise the DropItem exception to discard invalid scraped items.
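
The last use case, discarding invalid items, could look roughly like the sketch below. The class name RequiredFieldsPipeline and the field names title and price are hypothetical, chosen only for illustration. DropItem is imported from scrapy.exceptions; the fallback class exists only so the snippet can also run where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # stand-in so this sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline:
    # hypothetical field names; adjust to match your own items
    required_fields = ("title", "price")

    def process_item(self, item, spider):
        # collect required fields that are absent or empty
        missing = [
            field for field in self.required_fields
            if item.get(field) in (None, "")
        ]
        if missing:
            # raising DropItem tells Scrapy to discard this item
            raise DropItem(f"missing fields: {', '.join(missing)}")
        return item


# quick manual check with sample items (hypothetical data)
pipeline = RequiredFieldsPipeline()
ok = pipeline.process_item({"title": "Book", "price": 9.99}, spider=None)

try:
    pipeline.process_item({"title": "Book"}, spider=None)
    dropped = False
except DropItem:
    dropped = True

print(ok)
print(dropped)
```

When enabling a validation pipeline like this in ITEM_PIPELINES, giving it a lower number than other pipelines (for example 200) makes it run first, so invalid items are dropped before any further processing.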