Playwright Examples for Web Scraping and Automation
Learn Playwright with Python and JavaScript examples for automating browsers like Chromium, WebKit, and Firefox.
To speed up Playwright web scrapers, we can block media and other non-essential requests using the request interception feature:
from playwright.sync_api import sync_playwright

# block requests by resource type, e.g. image, stylesheet
BLOCK_RESOURCE_TYPES = [
    'beacon',
    'csp_report',
    'font',
    'image',
    'imageset',
    'media',
    'object',
    'texttrack',
    # we can even block stylesheets and scripts, though it's not recommended:
    # 'stylesheet',
    # 'script',
    # 'xhr',
]

# we can also block popular 3rd party resources like tracking and advertisements.
BLOCK_RESOURCE_NAMES = [
    'adzerk',
    'analytics',
    'cdn.api.twitter',
    'doubleclick',
    'exelator',
    'facebook',
    'fontawesome',
    'google',
    'google-analytics',
    'googletagmanager',
]


def intercept_route(route):
    """intercept all requests and abort blocked ones"""
    if route.request.resource_type in BLOCK_RESOURCE_TYPES:
        print(f'blocking background resource {route.request} blocked type "{route.request.resource_type}"')
        return route.abort()
    if any(key in route.request.url for key in BLOCK_RESOURCE_NAMES):
        print(f"blocking background resource {route.request} blocked name {route.request.url}")
        return route.abort()
    return route.continue_()


with sync_playwright() as pw:
    browser = pw.chromium.launch(
        headless=False,
        # tip: enable devtools to see total resource usage (bottom left corner)
        devtools=True,
    )
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    # enable intercepting for this page, **/* stands for all requests
    page.route("**/*", intercept_route)
    page.goto("http://some-webpage.com/")
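The handler above matches block names against the whole URL, so a broad keyword such as 'google' would also block a first-party page whose path happens to contain that word. One possible refinement, sketched below, moves the decision into a plain function that checks only the hostname. The names `should_block` and `BLOCK_HOST_KEYWORDS` are hypothetical and the lists are a trimmed, illustrative subset of the ones above. The function needs no browser, so it can be unit tested on its own.

```python
from urllib.parse import urlparse

# Illustrative subset of the block lists above (assumption: trimmed for brevity).
BLOCK_RESOURCE_TYPES = {"font", "image", "media"}
BLOCK_HOST_KEYWORDS = ("doubleclick", "google-analytics", "googletagmanager")


def should_block(resource_type: str, url: str) -> bool:
    """Decide whether a request should be aborted.

    Keywords are matched against the hostname only, so a first-party URL
    whose path merely contains a keyword is not blocked by accident.
    """
    if resource_type in BLOCK_RESOURCE_TYPES:
        return True
    host = urlparse(url).hostname or ""
    return any(keyword in host for keyword in BLOCK_HOST_KEYWORDS)


def intercept_route(route):
    """Route handler: abort blocked requests, let everything else through."""
    request = route.request
    if should_block(request.resource_type, request.url):
        return route.abort()
    return route.continue_()
```

This `intercept_route` can be passed to `page.route("**/*", intercept_route)` in the same way as the original handler. Hostname matching is stricter than substring matching on the full URL, so the keyword list may need more specific entries to catch the same trackers.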
Resource blocking can significantly reduce bandwidth usage, often by 2-10 times. Note, though, that blocking functional resources such as stylesheets, scripts and XHR requests can break pages that depend on them, which in turn affects the web scraping process.
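To check the saving on a particular site, one rough approach is to total the response sizes for a page load with and without blocking and compare the two. The sketch below covers only the arithmetic. The helper names `content_length` and `reduction_factor` are hypothetical, and the byte counts in the example are made-up numbers, not measurements. It also assumes the `Content-Length` header is a fair proxy for transfer size, which will not hold for chunked responses that omit it.

```python
def content_length(headers: dict) -> int:
    """Return the Content-Length header as an int, or 0 when missing or malformed."""
    try:
        return int(headers.get("content-length", 0))
    except (TypeError, ValueError):
        return 0


def reduction_factor(baseline_bytes: int, blocked_bytes: int) -> float:
    """How many times smaller the page load became once blocking was enabled."""
    if blocked_bytes <= 0:
        raise ValueError("blocked_bytes must be positive")
    return baseline_bytes / blocked_bytes


# Made-up example: 5 MB without blocking vs 1 MB with blocking.
factor = reduction_factor(5_000_000, 1_000_000)
print(f"bandwidth reduced {factor:.1f}x")
```

With Playwright, the sizes could be collected by registering a listener such as `page.on("response", lambda r: sizes.append(content_length(r.headers)))` before calling `page.goto`, once with the route handler enabled and once without.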