
How to use headless browsers with scrapy?

Python offers several libraries for headless browser control, such as Playwright and Selenium (see Web Scraping with Playwright and Python and Web Scraping with Selenium and Python), but integrating them with scrapy can be difficult.

To use Playwright with scrapy, use the scrapy-playwright community extension. It works by registering a new download handler that is powered by Playwright. To activate it, set the DOWNLOAD_HANDLERS setting:

python
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# and switch to asyncio reactor as playwright is asynchronous
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
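
If a project already defines its own download handlers, the two settings above may need to be merged into an existing configuration rather than pasted over it. The following is a minimal sketch of one way to do that in plain Python. The helper name `playwright_settings` and the two constants are hypothetical names introduced for illustration; they are not part of scrapy or scrapy-playwright.

```python
# Hypothetical constants holding the two values from the settings shown above.
PLAYWRIGHT_HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
ASYNCIO_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def playwright_settings(base=None):
    """Return a copy of `base` with the scrapy-playwright settings applied.

    Hypothetical helper: the input dict is not modified, and any download
    handlers it already defines for other schemes are preserved.
    """
    settings = dict(base or {})
    # Copy the nested dict too, so the caller's handlers are left untouched.
    handlers = dict(settings.get("DOWNLOAD_HANDLERS", {}))
    handlers.update({"http": PLAYWRIGHT_HANDLER, "https": PLAYWRIGHT_HANDLER})
    settings["DOWNLOAD_HANDLERS"] = handlers
    # Playwright is asynchronous, so the asyncio reactor is required.
    settings["TWISTED_REACTOR"] = ASYNCIO_REACTOR
    return settings
```

The result is an ordinary dict, so it could be used wherever a settings dict is accepted, for example as a spider's `custom_settings` attribute. Whether the reactor setting takes effect there may depend on the scrapy version and on how the crawl is launched, so treat that as an assumption to verify.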

Then, to enable Playwright, pass the meta={"playwright": True} parameter to each outgoing scrapy.Request object:

python
import scrapy

class PlaywrightSpider(scrapy.Spider):
    name = "playwright-spider"

    def start_requests(self):
        yield scrapy.Request("https://httpbin.dev/get", meta={"playwright": True})
        # or POST request
        yield scrapy.FormRequest(
            url="https://httpbin.dev/post",
            formdata={"foo": "bar"},
            meta={"playwright": True}
        )

    def parse(self, response):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}
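
When many requests need the flag, repeating the meta dictionary by hand can get noisy, especially if requests already carry other meta keys. Below is a small sketch of a helper that adds the flag without mutating the caller's dict. The function name `with_playwright` is hypothetical. The `playwright_include_page` key is a scrapy-playwright meta option that, as far as its documentation describes, exposes the Playwright page to the callback; confirm it against the version you have installed.

```python
def with_playwright(meta=None, *, include_page=False):
    """Return a copy of `meta` with Playwright rendering enabled.

    Hypothetical helper for building the `meta` argument of scrapy.Request.
    """
    merged = dict(meta or {})  # copy so the caller's dict is not modified
    merged["playwright"] = True
    if include_page:
        # Assumption: with this key set, the callback receives the page
        # object in response.meta and is responsible for closing it.
        merged["playwright_include_page"] = True
    return merged
```

A request could then be built as `scrapy.Request(url, meta=with_playwright({"cookiejar": 1}))`, which keeps any existing meta keys alongside the Playwright flag.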

While scrapy-playwright offers less direct control of the web browser than driving Playwright on its own, it integrates cleanly with scrapy spiders and can be an easy solution for scraping dynamic web content using scrapy.

Alternatively, check out Scrapfly's scrapy SDK with the headless browser feature, which configures scrapy requests to go through Scrapfly's managed cloud browsers.
