Web Scraping With Scrapy Intro Through Examples
Tutorial on web scraping with Scrapy and Python through a real-world example project: best practices, extension highlights, and common challenges.
Scrapy pipelines are data processing extensions that can modify scraped items before they are saved by Scrapy spiders.
Scrapy pipelines are often used to:
- clean and normalize scraped fields,
- validate items and drop invalid ones,
- filter out duplicates,
- store items in databases or other backends.

Most commonly, pipelines are used to modify scraped data. For example, to add the scrape datetime to every scraped item we could use this pipeline:
# pipelines.py
# define our pipeline code:
import datetime

class AddScrapedDatePipeline:
    def process_item(self, item, spider):
        # record the scrape time in UTC
        # (datetime.utcnow() is deprecated since Python 3.12, so use an aware datetime)
        current_utc_datetime = datetime.datetime.now(datetime.timezone.utc)
        item['scraped_date'] = current_utc_datetime.isoformat()
        return item
# settings.py
# activate pipeline in settings:
ITEM_PIPELINES = {
    'your_project_name.pipelines.AddScrapedDatePipeline': 300,
}
Pipelines are an easy and flexible way to control Scrapy item output with very little extra code.