What is a Headless Browser? Top 5 Headless Browser Tools
A quick overview of browser automation, an emerging technology: what exactly are these tools, and how are they used in web scraping?
To speed up Playwright web scrapers we can block media and other non-essential requests using the request interception feature:
from playwright.sync_api import sync_playwright

# block requests by resource type, e.g. image, stylesheet
BLOCK_RESOURCE_TYPES = [
    'beacon',
    'csp_report',
    'font',
    'image',
    'imageset',
    'media',
    'object',
    'texttrack',
    # we can even block stylesheets and scripts, though it's not recommended:
    # 'stylesheet',
    # 'script',
    # 'xhr',
]

# we can also block popular 3rd party resources like tracking and advertisements.
BLOCK_RESOURCE_NAMES = [
    'adzerk',
    'analytics',
    'cdn.api.twitter',
    'doubleclick',
    'exelator',
    'facebook',
    'fontawesome',
    'google',
    'google-analytics',
    'googletagmanager',
]


def intercept_route(route):
    """intercept all requests and abort blocked ones"""
    if route.request.resource_type in BLOCK_RESOURCE_TYPES:
        print(f'blocking background resource {route.request} blocked type "{route.request.resource_type}"')
        return route.abort()
    if any(key in route.request.url for key in BLOCK_RESOURCE_NAMES):
        print(f"blocking background resource {route.request} blocked name {route.request.url}")
        return route.abort()
    return route.continue_()


with sync_playwright() as pw:
    browser = pw.chromium.launch(
        headless=False,
        # tip: enable devtools to see total resource usage (bottom left corner)
        devtools=True,
    )
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    # enable intercepting for this page, **/* stands for all requests
    page.route("**/*", intercept_route)
    page.goto("http://some-webpage.com/")
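The name check above looks for each keyword anywhere in the request URL, so a broad keyword can also match a path or query string on an unrelated site. One possible refinement, shown here only as a sketch and not as part of the original snippet, is to match keywords against the hostname alone. The `BLOCK_HOST_KEYWORDS` list and the `is_blocked_host` helper are hypothetical names chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical block list for illustration; entries are matched against the hostname only.
BLOCK_HOST_KEYWORDS = ["doubleclick", "google-analytics", "googletagmanager"]


def is_blocked_host(url, keywords=BLOCK_HOST_KEYWORDS):
    """Return True if any keyword appears in the URL's hostname."""
    host = urlparse(url).hostname or ""  # hostname is None for strings without a network location
    return any(key in host for key in keywords)


def intercept_route(route):
    """Playwright route handler: abort requests to blocked hosts, continue the rest."""
    if is_blocked_host(route.request.url):
        return route.abort()
    return route.continue_()
```

The handler is registered the same way as before, with `page.route("**/*", intercept_route)`. Keeping the decision in a plain function means it can be checked without launching a browser.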
Resource blocking can significantly reduce bandwidth usage, often by 2-10 times. Note, though, that blocking functional resources such as stylesheets, scripts and XHR requests could affect the web scraping process, since pages may depend on them to load their content.
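To see what a given block list actually does on a target page, it can help to tally which resource types were aborted and which were let through. The following is a minimal sketch under stated assumptions: `make_interceptor` and the `stats` counter are hypothetical names, and the tally counts requests rather than bytes, so it is only a rough proxy for bandwidth saved.

```python
from collections import Counter


def make_interceptor(block_types, stats):
    """Build a route handler that aborts the given resource types and tallies each decision in `stats`."""
    block_types = frozenset(block_types)

    def intercept_route(route):
        resource_type = route.request.resource_type
        if resource_type in block_types:
            stats[("blocked", resource_type)] += 1
            return route.abort()
        stats[("allowed", resource_type)] += 1
        return route.continue_()

    return intercept_route


stats = Counter()
handler = make_interceptor({"image", "font", "media"}, stats)
# With Playwright: page.route("**/*", handler), then inspect `stats` after page.goto(...)
```

Because the block list is a parameter, functional types such as `stylesheet` or `script` can be added one at a time while checking that the scraped data still appears.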
This knowledgebase is provided by Scrapfly, a web scraping API that allows you to scrape any website without getting blocked and implements dozens of other web scraping conveniences.