Tutorial on web scraping with Scrapy and Python through a real-world example project: best practices, extension highlights, and common challenges.
Scrapy downloader middlewares can be used to intercept and update outgoing requests and incoming responses. Here's how to use them.
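For illustration, here's a minimal downloader middleware sketch; the module path `myproject.middlewares`, the class name, and the `X-Example` header are placeholders, not part of any real project:

```python
# settings.py (hypothetical project): enable the middleware
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.CustomHeaderMiddleware": 543}


class CustomHeaderMiddleware:
    """Touches every outgoing request and inspects every incoming response."""

    def process_request(self, request, spider):
        # Modify the outgoing request; returning None lets processing continue.
        request.headers["X-Example"] = "1"
        return None

    def process_response(self, request, response, spider):
        # Inspect or replace the incoming response before it reaches the spider.
        if response.status >= 500:
            spider.logger.warning("Server error for %s", request.url)
        return response
```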
Scrapy item pipelines can be used to extend scraped items with new fields or to validate the whole dataset. Here's how.
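As a rough sketch, assuming items are dicts that carry a `price` field (the field and pipeline names are hypothetical), a pipeline can both validate and enrich each item:

```python
from scrapy.exceptions import DropItem


class PriceValidationPipeline:
    """Drops items that fail validation and adds a derived field."""

    def process_item(self, item, spider):
        if not item.get("price"):
            # invalid items are removed from the result dataset
            raise DropItem(f"Missing price in {item}")
        # extend the item with a new computed field
        item["price_with_tax"] = round(item["price"] * 1.21, 2)
        return item


# settings.py:
# ITEM_PIPELINES = {"myproject.pipelines.PriceValidationPipeline": 300}
```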
To rotate proxies in Scrapy spiders, a downloader middleware can be used to randomly or intelligently select the most viable proxy for each request. Here's how.
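A minimal random-rotation sketch could look like this; the proxy URLs are placeholders, and the middleware only needs to set `request.meta["proxy"]`, which Scrapy's built-in `HttpProxyMiddleware` (priority 750) then picks up:

```python
import random

# hypothetical proxy pool
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)


# settings.py: register it before the built-in HttpProxyMiddleware
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomProxyMiddleware": 350}
```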
To use a headless browser with Scrapy, a plugin like scrapy-playwright can be used. Here's how to use it and what some of the alternatives are.
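A basic setup sketch, assuming scrapy-playwright is installed and the target URL is just an example:

```python
# settings.py: route requests through scrapy-playwright's download handler
# DOWNLOAD_HANDLERS = {
#     "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
#     "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
# }
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

import scrapy


class BrowserSpider(scrapy.Spider):
    name = "browser"

    def start_requests(self):
        # meta={"playwright": True} sends this request through a headless browser
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        # the response body is the browser-rendered HTML
        yield {"title": response.css("title::text").get()}
```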
To add headers to Scrapy requests, the `DEFAULT_REQUEST_HEADERS` setting or a custom downloader middleware can be used. Here's how.
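For example, headers can be set project-wide in settings or per request in the spider; the header values and URL below are illustrative only:

```python
# settings.py: applied to every request the project makes
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}

# ...or per request, directly in a spider:
import scrapy


class HeaderSpider(scrapy.Spider):
    name = "headers"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            headers={"User-Agent": "my-custom-agent/1.0"},
        )

    def parse(self, response):
        yield {"status": response.status}
```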
To pass custom parameters to a Scrapy spider, the CLI argument -a can be used. Here's how and why it's such a useful feature.
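A small sketch of the pattern; the `category` parameter and the URL are made up for the example:

```python
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # "category" arrives from the command line via: scrapy crawl products -a category=shoes
        self.category = category

    def start_requests(self):
        yield scrapy.Request(f"https://example.com/{self.category}")

    def parse(self, response):
        yield {"url": response.url}
```

Running `scrapy crawl products -a category=shoes` then scrapes a different section without touching the spider code.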
Scrapy's Item and ItemLoader classes are a great way to structure dataset parsing logic. Here's how to use them.
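As a sketch, an Item declares the fields and an ItemLoader centralizes the cleaning logic; the field names, CSS selectors, and price format here are assumptions for illustration:

```python
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()


class ProductLoader(ItemLoader):
    default_item_class = ProductItem
    default_output_processor = TakeFirst()
    # clean the raw price string (e.g. "$19.99") before storing it on the item
    price_in = MapCompose(str.strip, lambda v: v.replace("$", ""), float)


class ProductSpider(scrapy.Spider):
    name = "product"
    start_urls = ["https://example.com/product"]  # hypothetical page

    def parse(self, response):
        loader = ProductLoader(response=response)
        loader.add_css("name", "h1::text")
        loader.add_css("price", ".price::text")
        yield loader.load_item()
```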
To pass data between Scrapy callbacks when scraping multiple pages, a partially populated item can be carried along in Request.cb_kwargs. Here's how.
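A sketch of the pattern, with made-up URLs, selectors, and field names — the first callback starts the item, the second finishes it:

```python
import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://example.com/book/1"]  # hypothetical URL

    def parse(self, response):
        # collect the first part of the item on the summary page
        item = {"title": response.css("h1::text").get()}
        # hand the partial item to the next callback via cb_kwargs
        yield response.follow(
            response.css("a.reviews::attr(href)").get(),
            callback=self.parse_reviews,
            cb_kwargs={"item": item},
        )

    def parse_reviews(self, response, item):
        # finish the item with data from the second page
        item["review_count"] = len(response.css(".review").getall())
        yield item
```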
To pass data between Scrapy callbacks like start_requests and parse, the Request.meta attribute can be used. Here's how.
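A minimal sketch (the page URLs and the `page_number` key are illustrative):

```python
import scrapy


class MetaSpider(scrapy.Spider):
    name = "meta_demo"

    def start_requests(self):
        for page in range(1, 4):
            # attach extra data to the request through the meta dict
            yield scrapy.Request(
                f"https://example.com/page/{page}",  # hypothetical URL
                meta={"page_number": page},
                callback=self.parse,
            )

    def parse(self, response):
        # read it back from response.meta in the callback
        yield {"page": response.meta["page_number"], "url": response.url}
```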
Scrapy and BeautifulSoup are two popular web scraping tools, though they are very different: Scrapy is a full scraping framework while BeautifulSoup is an HTML parser.
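The difference in scope is easiest to see side by side; this sketch assumes `requests` and `beautifulsoup4` are installed and uses example.com as a stand-in target:

```python
# BeautifulSoup only parses HTML you have already downloaded yourself:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text
print(BeautifulSoup(html, "html.parser").title.get_text())

# Scrapy handles the downloading, scheduling, retries and exporting around the parsing:
import scrapy


class TitleSpider(scrapy.Spider):
    name = "title"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```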