How to block resources in Playwright and Python?

by scrapecrow Nov 03, 2022

To speed up Playwright web scrapers, we can block media and other non-essential requests using the request interception feature:

from playwright.sync_api import sync_playwright

# block requests by resource type, e.g. image, stylesheet
BLOCK_RESOURCE_TYPES = [
  'beacon',
  'csp_report',
  'font',
  'image',
  'imageset',
  'media',
  'object',
  'texttrack',
#  we can even block stylesheets and scripts, though it's not recommended:
# 'stylesheet',
# 'script',  
# 'xhr',
]


# we can also block popular 3rd party resources like tracking and advertisements.
BLOCK_RESOURCE_NAMES = [
  'adzerk',
  'analytics',
  'cdn.api.twitter',
  'doubleclick',
  'exelator',
  'facebook',
  'fontawesome',
  'google',
  'google-analytics',
  'googletagmanager',
]

def intercept_route(route):
    """intercept all requests and abort blocked ones"""
    if route.request.resource_type in BLOCK_RESOURCE_TYPES:
        print(f'blocking background resource {route.request} of blocked type "{route.request.resource_type}"')
        return route.abort()
    if any(key in route.request.url for key in BLOCK_RESOURCE_NAMES):
        print(f"blocking background resource {route.request} blocked name {route.request.url}")
        return route.abort()
    return route.continue_()

with sync_playwright() as pw:
    browser = pw.chromium.launch(
        headless=False, 
        # tip: you can enable devtools so we can see total resource usage (bottom left corner)
        devtools=True, 
    )
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    # enable intercepting for this page, **/* stands for all requests
    page.route("**/*", intercept_route)
    page.goto("http://some-webpage.com/")

Resource blocking can significantly reduce bandwidth usage, often by 2-10 times. Note, however, that blocking functional resources such as stylesheets, scripts and XHR requests can break dynamic pages and affect the web scraping process.
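
To verify how much blocking actually saves, one rough approach is to sum the Content-Length headers of all responses and compare a run with interception enabled against one without it. Below is a minimal sketch of this idea, reusing the intercept_route function from above; the total is only an estimate, since not every response carries a Content-Length header:

from playwright.sync_api import sync_playwright

total_bytes = 0

def track_response(response):
    """add up content-length headers to estimate bandwidth used"""
    global total_bytes
    total_bytes += int(response.headers.get("content-length", 0))

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("response", track_response)
    # toggle this line on/off to compare bandwidth with and without blocking:
    page.route("**/*", intercept_route)
    page.goto("http://some-webpage.com/")
    print(f"total bandwidth used: {total_bytes / 1024:.1f} kB")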

Related Articles

Playwright Examples for Web Scraping and Automation

Learn Playwright with Python and JavaScript examples for automating browsers like Chromium, WebKit, and Firefox.

How to Scrape With Headless Firefox

Discover how to use headless Firefox with Selenium, Playwright, and Puppeteer for web scraping, including practical examples for each library.

Web Scraping Dynamic Websites With Scrapy Playwright

Learn about Scrapy Playwright, a Scrapy integration that allows scraping dynamic web pages with Scrapy. We'll explain web scraping with Scrapy Playwright through an example project and how to use it for common scraping use cases, such as clicking elements, scrolling and waiting for elements.

How to use Headless Chrome Extensions for Web Scraping

In this article, we'll explore useful Chrome extensions for web scraping and explain how to install them with various headless browser libraries, such as Selenium, Playwright and Puppeteer.

How to Scrape Google Maps

We'll take a look at how to find businesses through Google Maps' search system and how to scrape their details using Selenium, Playwright or ScrapFly's JavaScript rendering feature, all in Python.

How to Scrape Dynamic Websites Using Headless Web Browsers

An introduction to using web automation tools such as Puppeteer, Playwright, Selenium and ScrapFly to render dynamic websites for web scraping.