How do you scrape JavaScript-heavy websites? Traditional scrapers see blank pages where JavaScript renders content. Playwright for Python solves this by controlling real browsers that execute JavaScript exactly like a user would. This guide takes you from extracting your first dynamic page to building production scrapers that handle thousands of URLs reliably. You'll build a real Twitch.tv scraper while learning patterns that work for any JavaScript-powered site.
Modern anti-bot systems (Cloudflare Turnstile, DataDome, Kasada) detect Playwright instantly through navigator.webdriver and TLS fingerprinting. For production scraping with anti-detection, see Scrapfly's Web Scraping API.
Key Takeaways
Master Playwright web scraping in Python for JavaScript-heavy websites: handling dynamic content, form interactions, and browser automation for complex scraping scenarios.
- Use Playwright to automate real browsers for scraping JavaScript-rendered content
- Implement explicit waits and element detection to handle dynamic page loading
- Navigate complex web applications with form filling, clicking, and scrolling actions
- Handle pop-ups, alerts, and JavaScript challenges that block traditional HTTP scrapers
- Implement exponential backoff retry logic with 403 status code detection for rate limiting
- Build robust scrapers that can handle modern single-page applications and AJAX content
What is Playwright?
Playwright is a cross-platform, cross-language web browser automation toolkit. It's primarily intended for website testing, but it's perfectly capable of general browser automation and web scraping.
Using Playwright we can automate headless web browsers like Firefox or Chrome to navigate the web just like a human would: go to URLs, click buttons, write text and execute javascript.
It's a great tool for web scraping as it allows us to scrape dynamic javascript-powered websites without having to reverse engineer their behavior. It can also help avoid blocking, since the scraper runs a full browser, which appears more human than standalone HTTP requests.
Playwright vs Selenium vs Puppeteer
Compared to other popular browser automation toolkits like Selenium or Puppeteer, Playwright has a few advantages:
- Playwright supports many programming languages whereas Puppeteer is only available in JavaScript.
- Playwright uses the Chrome Devtools Protocol (CDP) and a more modern API, whereas Selenium uses the WebDriver protocol and an older API.
- Playwright supports both asynchronous and synchronous clients, whereas Selenium only supports a synchronous client and Puppeteer an asynchronous one. In Playwright, we can write small scrapers using synchronous clients and scale up simply by switching to a more complex asynchronous architecture.
In other words, Playwright is a horizontal improvement over Selenium and Puppeteer, though every toolkit has its own strengths. If you'd like to learn more, see our other introduction articles.
Setup
Playwright for Python can be installed through pip:
# install playwright package:
$ pip install playwright
# install playwright chromium and firefox browsers
$ playwright install chromium firefox
The above commands install the playwright package and the Playwright browser binaries. For Playwright scraping, it's best to use either Chromium or Firefox as these are the most stable implementations and often the least likely to be blocked.
Tip: Playwright in REPL
The easiest way to understand Playwright is to experiment with it in real-time through a Python REPL (Read-Eval-Print Loop) such as ipython.
After starting ipython, we can launch a playwright browser and execute browser automation commands in real-time to experiment and prototype our web scraper:
$ pip install ipython nest_asyncio
$ ipython
import nest_asyncio; nest_asyncio.apply() # This is needed to use sync API in repl
from playwright.sync_api import sync_playwright
pw = sync_playwright().start()
chrome = pw.chromium.launch(headless=False)
page = chrome.new_page()
page.goto("https://twitch.tv")
Now, let's take a look at this in greater detail.
The Basics
To start, we need to launch a browser and start a new browser tab:
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
# create browser instance
browser = pw.chromium.launch(
# we can choose either a Headful (With GUI) or Headless mode:
headless=False,
)
# create context
# using context we can define page properties like viewport dimensions
context = browser.new_context(
# most common desktop viewport is 1920x1080
viewport={"width": 1920, "height": 1080}
)
# create page aka browser tab which we'll be using to do everything
page = context.new_page()
Once we have our browser page ready, we can start scraping. For that, we only need a handful of Playwright features:
- Navigation (i.e. go to URL)
- Button clicking
- Text input
- Javascript Execution
- Waiting for content to load
Let's take a look at these features through a real-life example.
For this, we'll be scraping video data from twitch.tv's Art section, where users stream their art creation process. We'll be collecting dynamic data like stream name, viewer count and author details.
Our task in Playwright for this exercise is:
- Start a browser instance, context and browser tab (page)
- Go to twitch.tv/directory/game/Art
- Wait for the page to fully load
- Parse loaded page data for all active streams
Navigation and Waiting
To navigate we can use page.goto() function which will direct the browser to any URL:
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
# go to url
page.goto("https://twitch.tv/directory/game/Art")
# get HTML
print(page.content())
However, for javascript-heavy websites like twitch.tv, our page.content() call might return the HTML before everything has loaded.
To ensure that doesn't happen we can wait for a particular element to appear on the page. In other words, if the list of videos is present on the page then we can safely assume the page has loaded:
page.goto("https://twitch.tv/directory/game/Art")
# wait for first result to appear
page.wait_for_selector("div[data-target=directory-first-item]")
# retrieve final HTML content
print(page.content())
Above, we used page.wait_for_selector() function to wait for an element defined by our CSS selector to appear on the page.
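Playwright offers a few other waiting strategies as well. Here's a quick sketch of the common alternatives (the timeout values are just illustrative):
page.goto("https://twitch.tv/directory/game/Art")

# wait until the page fires the "load" event
page.wait_for_load_state("load")

# or wait until there are no network connections for at least 500ms
page.wait_for_load_state("networkidle")

# or wait a fixed amount of time (least reliable, use as a last resort)
page.wait_for_timeout(3_000)

# or wait for our target element with an explicit timeout (in milliseconds)
page.wait_for_selector("div[data-target=directory-first-item]", timeout=10_000)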
Parsing Data
Since Playwright uses a real web browser with javascript environment we can use the browser's HTML parsing capabilities. In Playwright this is implemented through locators feature:
from playwright.sync_api import sync_playwright
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
page.goto("https://twitch.tv/directory/game/Art") # go to url
page.wait_for_selector("div[data-target=directory-first-item]") # wait for content to load
parsed = []
stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
for box in stream_boxes.element_handles():
parsed.append({
"title": box.query_selector("h3").inner_text(),
"url": box.query_selector(".tw-link").get_attribute("href"),
"username": box.query_selector(".tw-link").inner_text(),
"viewers": box.query_selector(".tw-media-card-stat").inner_text(),
# tags are not always present:
"tags": box.query_selector(".tw-tag").inner_text() if box.query_selector(".tw-tag") else None,
})
for video in parsed:
print(video)
Example Output
{"title": "art", "url": "/lunyatic/videos", "username": "Lunyatic", "viewers": "25 viewers", "tags": "en"}
{"title": "생존신고", "url": "/lilllly1/videos", "username": "생존신고\n\n릴리작가 (lilllly1)", "viewers": "51 viewers", "tags": "한국어"}
{"title": "The day 0914.", "url": "/niai_serie/videos", "username": "The day 0914", "viewers": "187 viewers", "tags": None}
...
In the code above, we selected each result box using XPath selectors and extracted details from within it using CSS selectors.
Unfortunately, Playwright's parsing capabilities are a bit clunky and can break easily when parsing optional elements, like the tags field in our example. Instead, we can parse with traditional Python tools such as the parsel or beautifulsoup packages, which are faster and provide a more robust API:
...
# using Parsel:
from parsel import Selector
page_html = page.content()
sel = Selector(text=page_html)
parsed = []
for item in sel.xpath("//div[contains(@class,'tw-tower')]/div[@data-target]"):
parsed.append({
'title': item.css('h3::text').get(),
'url': item.css('.tw-link::attr(href)').get(),
'username': item.css('.tw-link::text').get(),
'tags': item.css('.tw-tag ::text').getall(),
'viewers': ''.join(item.css('.tw-media-card-stat::text').re(r'(\d+)')),
})
# using Beautifulsoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content(), "html.parser")
parsed = []
for item in soup.select(".tw-tower div[data-target]"):
parsed.append({
'title': item.select_one('h3').text,
'url': item.select_one('.tw-link').attrs.get("href"),
'username': item.select_one('.tw-link').text,
'tags': [tag.text for tag in item.select('.tw-tag')],
'viewers': item.select_one('.tw-media-card-stat').text,
})
While playwright locators aren't great for parsing they are great for interacting with the website. Next, let's take a look at how we can click buttons and input text using locators.
Clicking Buttons and Text Input
To explore click and text input let's extend our twitch.tv scraper with search functionality:
- We'll go to twitch.tv
- Select the search box and input a search query
- Click the search button or press Enter
- Wait for the content to load
- Parse results
In playwright to interact with the web components we can use the same locator functionality we used in parsing:
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
page.goto("https://www.twitch.tv/directory/game/Art")
# find search box and enter our query:
search_box = page.locator('input[autocomplete="twitch-nav-search"]')
search_box.type("Painting", delay=100)
# then, we can either send Enter key:
search_box.press("Enter")
# or we can press the search button explicitly:
search_button = page.locator('button[aria-label="Search Button"]')
search_button.click()
# click on tagged channels link:
page.locator('.search-results .tw-link[href*="all/tags"]').click()
# Finally, we can parse results like we did before:
parsed = []
stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
for box in stream_boxes.element_handles():
...
Note: Playwright locator actions run in strict mode. If a selector matches multiple elements, actions like click() raise an error because Playwright wouldn't know which one to use. Our selectors must match exactly one element, or we must narrow the locator down first.
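If a selector does match several elements, we can narrow the locator down before interacting with it. A small illustrative sketch:
links = page.locator(".tw-link")
print(links.count(), "elements matched")  # how many elements the selector found
# pick one element explicitly before performing an action:
links.first.click()     # the first match
# links.nth(2).click()  # or the third match (zero-indexed)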
We got search functionality working and extracted the first page of the results, though how do we get the rest of the pages? For this we'll need scrolling functionality - let's take a look at it.
Scrolling and Infinite Pagination
The stream results section of twitch.tv uses infinite scrolling pagination. To retrieve the rest of the results in our Playwright scraper, we need to continuously scroll the last result into view to trigger new page loads.
We could do this by scrolling to the bottom of the entire page, but that doesn't always work in headless browsers. A better way is to find all result elements and scroll the last one into view explicitly.
In playwright, this can be done by using locators and scroll_into_view_if_needed() function. We'll keep scrolling the last result into view to trigger the next page loading until no more new results appear:
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
page.goto("https://www.twitch.tv/directory/game/Art")
# wait for content to fully load:
page.wait_for_selector("div[data-target=directory-first-item]")
# loop scrolling last element into view until no more new elements are created
stream_boxes = None
while True:
stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
stream_boxes.element_handles()[-1].scroll_into_view_if_needed()
items_on_page = len(stream_boxes.element_handles())
page.wait_for_timeout(2_000) # give some time for new items to load
items_on_page_after_scroll = len(stream_boxes.element_handles())
if items_on_page_after_scroll > items_on_page:
continue # more items loaded - keep scrolling
else:
break # no more items - break scrolling loop
# parse data:
parsed = []
for box in stream_boxes.element_handles():
...
In the example code above, we will continuously trigger new result loading until the pagination end is reached. In this case, our code should generate hundreds of parsed results.
Advanced Functions
We've covered the most common Playwright features used in web scraping: navigation, waiting, clicking, typing and scrolling. However, there are a few advanced features that come in handy when scraping more complex targets.
Evaluating Javascript
Playwright can evaluate any javascript code in the context of the current page. Using javascript we can do everything we did before, like navigating, clicking and scrolling, and even more. In fact, many of these Playwright functions are implemented through javascript evaluation.
For example, if the built-in scrolling is failing us we can define our own scrolling javascript function and submit it to Playwright:
page.evaluate("""
let items=document.querySelectorAll('.tw-tower>div');
items[items.length-1].scrollIntoView({behavior: "smooth", block: "end", inline: "end"});
""")
The above code will scroll the last result into view just like previously but it'll scroll smoothly and to the very edge of the object. This approach is more likely to trigger next page loading compared to Playwright's scroll_into_view_if_needed function.
Javascript evaluation is a powerful feature that can be used to scrape complex web apps as it gives us full control of the browser's capabilities through javascript.
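Evaluation can also return values back to Python, which is handy when data is easier to reach through the DOM directly. A short sketch reusing the Twitch selectors from earlier:
# return a simple value from the page context
print(page.evaluate("document.title"))

# return structured data built in the browser context
stream_titles = page.evaluate("""
    Array.from(document.querySelectorAll('.tw-tower h3'))
         .map(node => node.innerText)
""")
print(stream_titles)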
Request and Response Intercepting
Playwright tracks all of the background requests and responses the browser sends and receives. In web scraping, we can use this to collect data hidden in background (XHR) requests and responses. Note that these event hooks are observers only; to modify or block traffic we use the routing feature covered in the next section:
from playwright.sync_api import sync_playwright
def intercept_request(request):
    # note: request/response event hooks can only observe traffic;
    # to modify or block requests use page.route() (see Blocking Resources below)
    if "secret" in request.url:
        print(f"saw a secret request: {request.url}")
        print(request.headers)
    if request.method == "POST":
        print(f"saw a POST request with data: {request.post_data}")
    return request
def intercept_response(response):
    # we can extract details from background responses like hidden API calls
    if response.request.resource_type == "xhr":
        print(response.url, response.status)
        print(response.headers.get("set-cookie"))
    return response
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
# enable intercepting for this page
page.on("request", intercept_request)
page.on("response", intercept_response)
page.goto("https://www.twitch.tv/directory/game/Art")
page.wait_for_selector("div[data-target=directory-first-item]")
In the example above, we define our observer functions and attach them to our playwright page. This lets us inspect every background and foreground request and response the browser makes. To modify or block them, we'd use page.route() as shown in the next section.
Blocking Resources
Web scraping using headless browsers is really bandwidth intensive. The browser is downloading all of the images, fonts and other expensive resources our web scraper doesn't care about. To optimize this we can configure our Playwright instance to block these unnecessary resources:
from collections import Counter
from playwright.sync_api import sync_playwright
# block pages by resource type. e.g. image, stylesheet
BLOCK_RESOURCE_TYPES = [
'beacon',
'csp_report',
'font',
'image',
'imageset',
'media',
'object',
'texttrack',
# we can even block stylesheets and scripts though it's not recommended:
# 'stylesheet',
# 'script',
# 'xhr',
]
# we can also block popular 3rd party resources like tracking:
BLOCK_RESOURCE_NAMES = [
'adzerk',
'analytics',
'cdn.api.twitter',
'doubleclick',
'exelator',
'facebook',
'fontawesome',
'google',
'google-analytics',
'googletagmanager',
]
def intercept_route(route):
"""intercept all requests and abort blocked ones"""
if route.request.resource_type in BLOCK_RESOURCE_TYPES:
print(f'blocking background resource {route.request} blocked type "{route.request.resource_type}"')
return route.abort()
if any(key in route.request.url for key in BLOCK_RESOURCE_NAMES):
print(f"blocking background resource {route.request} blocked name {route.request.url}")
return route.abort()
return route.continue_()
with sync_playwright() as pw:
browser = pw.chromium.launch(
headless=False,
# enable devtools so we can see total resource usage:
devtools=True,
)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
# enable intercepting for this page, **/* stands for all requests
page.route("**/*", intercept_route)
page.goto("https://www.twitch.tv/directory/game/Art")
page.wait_for_selector("div[data-target=directory-first-item]")
In the example above, we are defining an interception rule which tells Playwright to drop any unwanted background resource requests that are either of ignored type or contain ignored phrases in the URL (like google analytics).
We can see the amount of data saved in the browser Devtools' Network tab.
Why Do Playwright Scrapers Fail and How to Retry Intelligently?
Failed requests can waste a large share of scraping time. Without retry logic, temporary failures break entire scraping jobs. Exponential backoff with jitter prevents rate limit bans while keeping your scraper running efficiently.
Common failure patterns include:
- 403 Forbidden - Bot detection triggered
- 429 Too Many Requests - Rate limited, respect retry-after header
- TimeoutError - Page took too long to load
- NetworkError - Connection issues, DNS failures
- 500 Server Errors - Temporary server issues (safe to retry)
Exponential backoff implements intelligent retry timing where wait_time = base_delay * (2 ** attempt). This spreads retries over time and avoids hammering servers. Adding jitter (randomness) with wait_time * random(0.5, 1.5) prevents thundering herd problems when multiple scrapers retry simultaneously.
Typical configuration uses 3-5 max retries with a 1-2 second base delay. The scraper should give up after max retries or for non-retryable errors like 404 (Not Found) or 401 (Unauthorized).
import time
import random
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError
def retry_with_backoff(func, max_retries=5, base_delay=1):
"""Retry decorator with exponential backoff and jitter"""
def wrapper(*args, **kwargs):
for attempt in range(max_retries):
try:
return func(*args, **kwargs)
except Exception as e:
error_code = getattr(e, 'status', None)
# Don't retry on non-retryable errors
if error_code in [401, 404]:
raise
# Check if we should retry
if attempt == max_retries - 1:
raise
# Calculate exponential backoff with jitter
wait_time = base_delay * (2 ** attempt)
jittered_wait = wait_time * random.uniform(0.5, 1.5)
print(f"Attempt {attempt + 1} failed with {type(e).__name__}: {e}")
print(f"Retrying in {jittered_wait:.2f} seconds...")
time.sleep(jittered_wait)
return None
return wrapper
@retry_with_backoff
def scrape_page(page, url):
    """Scrape a page with automatic retry logic"""
    response = page.goto(url, timeout=30000)
    # attach the HTTP status to the exception so the retry logic can inspect it
    if response and response.status >= 400:
        error = Exception(f"HTTP {response.status} when fetching {url}")
        error.status = response.status
        raise error
    page.wait_for_selector("div[data-target=directory-first-item]", timeout=10000)
    return page.content()
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
html = scrape_page(page, "https://www.twitch.tv/directory/game/Art")
print(f"Successfully scraped {len(html)} bytes")
How to Scrape Multiple Web Pages with Playwright
Single-page scraping is just the beginning. Real projects need to process hundreds of URLs, handle pagination, and run concurrent browsers without crashing. Here's how to scale Playwright from tutorial to production.
How to Loop Through Multiple URLs
When scraping product catalogs or directories, you need to process hundreds or thousands of URLs systematically. Reading URLs from a list or CSV file and iterating through each one ensures you collect data methodically.
For example, scraping multiple product pages from a catalog requires visiting each URL, extracting data, and storing results. Reading URLs from a CSV file provides a clean way to manage your scraping queue:
import csv
import json
from playwright.sync_api import sync_playwright
def scrape_url(page, url):
"""Scrape a single URL and extract data"""
try:
page.goto(url, timeout=30000)
page.wait_for_selector("div[data-target=directory-first-item]", timeout=10000)
# Extract data (simplified example)
title = page.locator("h1").inner_text()
return {"url": url, "title": title, "success": True}
except Exception as e:
return {"url": url, "error": str(e), "success": False}
# Read URLs from CSV file
urls = []
with open('urls.csv', 'r') as f:
reader = csv.DictReader(f)
urls = [row['url'] for row in reader]
# Scrape each URL and save to JSONL
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=True)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
with open('results.jsonl', 'w') as f:
for url in urls:
result = scrape_url(page, url)
f.write(json.dumps(result) + '\n')
print(f"Scraped: {url} - Success: {result['success']}")
browser.close()
The pattern above iterates through URLs sequentially with basic error handling. Each result gets written to a JSONL file immediately, ensuring data persists even if the scraper crashes partway through. You could also save to CSV using csv.DictWriter or to a database for larger projects.
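For instance, a CSV version using csv.DictWriter might look like this sketch (it assumes the results were collected into a results list rather than written out line-by-line):
import csv

# assumption: `results` is a list of the dicts produced by scrape_url() above
fieldnames = ["url", "title", "error", "success"]
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(results)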
How to Handle Pagination with Playwright
Websites split content across multiple pages using pagination buttons or URL parameters. Playwright handles both approaches effectively.
Button-based pagination requires detecting and clicking "Next Page" buttons using locators. The scraper should check if the button exists before clicking and detect when it's disabled or missing (indicating the last page):
from playwright.sync_api import sync_playwright
def scrape_all_pages_button(page, start_url):
"""Scrape paginated results using Next button"""
page.goto(start_url)
page.wait_for_selector("div[data-target=directory-first-item]")
all_results = []
page_num = 1
while True:
print(f"Scraping page {page_num}...")
# Extract data from current page
items = page.locator("div[data-target]").all()
all_results.extend([item.inner_text() for item in items])
# Check if Next button exists and is enabled
next_button = page.locator('button[aria-label="Next"]')
if next_button.count() == 0 or next_button.is_disabled():
print("Reached last page")
break
# Click next and wait for new content
next_button.click()
page.wait_for_selector("div[data-target=directory-first-item]")
page_num += 1
return all_results
Parameter-based pagination modifies URL parameters like ?page=2 or ?offset=20. This approach is often faster since you can directly navigate to specific pages:
def scrape_all_pages_params(page, base_url):
"""Scrape paginated results using URL parameters"""
all_results = []
page_num = 1
while True:
url = f"{base_url}?page={page_num}"
print(f"Scraping {url}...")
page.goto(url)
page.wait_for_selector("div[data-target=directory-first-item]", timeout=5000)
# Extract data
items = page.locator("div[data-target]").all()
if len(items) == 0:
print("No more results")
break
all_results.extend([item.inner_text() for item in items])
page_num += 1
return all_results
Unlike infinite scrolling (covered in the Scrolling and Infinite Pagination section), pagination requires clicking buttons or changing URL parameters rather than scrolling to load more content.
How to Scrape Pages Concurrently with Asyncio
Concurrent scraping significantly improves speed for I/O-bound tasks. Playwright has native async support designed for running multiple browser contexts simultaneously.
Why concurrent scraping matters: A single-threaded scraper spends most time waiting for pages to load. Running 5 concurrent browsers can scrape 5x faster while using only slightly more memory.
Asyncio with semaphores controls concurrency to avoid overwhelming your system or the target server. Each browser context uses approximately 50-100MB RAM, so limiting concurrent browsers prevents memory exhaustion:
import asyncio
from playwright.async_api import async_playwright
async def scrape_url_async(browser, url, semaphore):
"""Scrape a single URL with semaphore rate limiting"""
async with semaphore: # Limit concurrent browsers
context = await browser.new_context(viewport={"width": 1920, "height": 1080})
page = await context.new_page()
try:
await page.goto(url, timeout=30000)
await page.wait_for_selector("div[data-target=directory-first-item]", timeout=10000)
title = await page.locator("h1").inner_text()
return {"url": url, "title": title, "success": True}
except Exception as e:
return {"url": url, "error": str(e), "success": False}
finally:
await context.close()
async def scrape_multiple_concurrent(urls, max_concurrent=5):
"""Scrape multiple URLs concurrently with rate limiting"""
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
# Semaphore limits concurrent operations
semaphore = asyncio.Semaphore(max_concurrent)
# Create tasks for all URLs
tasks = [scrape_url_async(browser, url, semaphore) for url in urls]
# Run all tasks concurrently
results = await asyncio.gather(*tasks, return_exceptions=True)
await browser.close()
return results
# Run the concurrent scraper
urls = [
"https://www.twitch.tv/directory/game/Art",
"https://www.twitch.tv/directory/game/Music",
"https://www.twitch.tv/directory/game/Science",
# ... more URLs
]
results = asyncio.run(scrape_multiple_concurrent(urls, max_concurrent=5))
for result in results:
print(result)
Per-domain concurrency respects rate limits by controlling requests per domain. If scraping multiple domains, you can allow higher total concurrency while limiting requests to each individual domain:
from collections import defaultdict
from urllib.parse import urlparse
async def scrape_with_domain_limits(urls, per_domain_limit=2):
"""Control concurrency per domain to avoid rate limits"""
# Group URLs by domain
domain_semaphores = defaultdict(lambda: asyncio.Semaphore(per_domain_limit))
async def scrape_with_domain_semaphore(browser, url):
domain = urlparse(url).netloc
semaphore = domain_semaphores[domain]
return await scrape_url_async(browser, url, semaphore)
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
tasks = [scrape_with_domain_semaphore(browser, url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
await browser.close()
return results
Performance considerations: Start with 3-5 concurrent browsers and monitor system memory. Too many parallel browsers can trigger anti-bot detection or overwhelm your network. Batch processing URLs in chunks of 50-100 prevents memory issues when scraping thousands of pages.
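As a rough sketch of such batching, assuming the scrape_multiple_concurrent() helper and urls list from above:
import asyncio

def chunked(items, size):
    """Yield successive chunks of the given size"""
    for i in range(0, len(items), size):
        yield items[i:i + size]

all_results = []
for batch in chunked(urls, 50):
    # each batch runs with its own browser lifecycle, keeping memory bounded
    all_results.extend(asyncio.run(scrape_multiple_concurrent(batch, max_concurrent=5)))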
When to use threads: ThreadPoolExecutor works when scraping combines I/O (fetching) with CPU-heavy parsing. For pure Playwright scraping, asyncio is typically faster and more memory-efficient.
Avoiding Blocking
While Playwright uses a real browser, it's still possible for websites to determine whether it's controlled by a real user or automated by a toolkit. Modern anti-bot systems use advanced detection techniques that go beyond simple checks.
What are Honeypot Traps and How to Avoid Them?
Websites hide invisible elements (honeypots) to catch bots - clicking them instantly flags your scraper. These traps include invisible links with display: none, hidden forms with visibility: hidden, or elements positioned off-screen with negative coordinates.
Honeypots work because bots often interact with all clickable elements indiscriminately, while humans only see and click visible elements. A typical honeypot might be a link that says "Click here for admin access" but is hidden via CSS.
To avoid triggering honeypots, always check element visibility before interaction:
from playwright.sync_api import sync_playwright
def is_element_visible(element):
    """Check if element is rendered and visible to users"""
    # is_visible() covers display:none, visibility:hidden and zero-size elements
    return element.is_visible()
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
page = browser.new_page()
page.goto("https://example.com")
# Only click visible buttons
buttons = page.locator("button").all()
for button in buttons:
if is_element_visible(button):
button.click()
print(f"Clicked visible button: {button.inner_text()}")
else:
print(f"Skipped hidden honeypot button")
Additional honeypot avoidance strategies include checking element dimensions (width and height should be greater than 0), verifying opacity is not 0, and ensuring elements aren't positioned outside the viewport bounds.
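A rough sketch of such extra checks (the helper name and thresholds are just illustrative):
def passes_honeypot_checks(element):
    """Heuristic visibility checks beyond is_visible() - illustrative only"""
    box = element.bounding_box()
    # no bounding box or zero size usually means the element isn't rendered
    if not box or box["width"] <= 0 or box["height"] <= 0:
        return False
    # elements positioned off-screen are invisible to real users
    if box["x"] < 0 or box["y"] < 0:
        return False
    # fully transparent elements are another common honeypot trick
    opacity = element.evaluate("el => getComputedStyle(el).opacity")
    return float(opacity) > 0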
How to Configure Proxy Rotation
Websites track and block IP addresses - proxy rotation spreads requests across multiple IPs to avoid detection and rate limits. When a website sees thousands of requests from a single IP, it's an obvious bot signal.
Proxy setup with Playwright configures proxies per browser instance:
from playwright.sync_api import sync_playwright
# List of proxy servers
proxies = [
{"server": "http://proxy1.example.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.example.com:8080", "username": "user2", "password": "pass2"},
{"server": "http://proxy3.example.com:8080", "username": "user3", "password": "pass3"},
]
def scrape_with_proxy(url, proxy):
"""Scrape URL using specific proxy"""
with sync_playwright() as pw:
browser = pw.chromium.launch(
headless=True,
proxy=proxy
)
page = browser.new_page()
page.goto(url)
content = page.content()
browser.close()
return content
# Rotate through proxies
import itertools
proxy_pool = itertools.cycle(proxies)
urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
for url in urls:
proxy = next(proxy_pool)
print(f"Scraping {url} with proxy {proxy['server']}")
content = scrape_with_proxy(url, proxy)
Residential vs datacenter proxies:
- Residential proxies use real IP addresses from ISPs, making them appear as legitimate home users. They're harder to detect but more expensive ($5-15 per GB). Use for sites with strong anti-bot protection.
- Datacenter proxies come from data centers and are faster and cheaper ($0.50-2 per GB) but easier to detect. Use for sites with minimal blocking.
- Rotating residential proxies automatically change IP on each request, providing maximum anonymity at premium cost ($10-20 per GB).
Managing proxy pools requires ongoing maintenance including monitoring proxy health, replacing dead proxies, handling authentication, and distributing requests evenly. The infrastructure cost and complexity add up quickly for production scrapers.
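A minimal health-check sketch, assuming a simple test URL like httpbin.org/ip and the proxies list from above:
from playwright.sync_api import sync_playwright, Error as PlaywrightError

def check_proxy_health(proxy, test_url="https://httpbin.org/ip", timeout_ms=15_000):
    """Return True if the proxy can load a simple test page"""
    try:
        with sync_playwright() as pw:
            browser = pw.chromium.launch(headless=True, proxy=proxy)
            page = browser.new_page()
            page.goto(test_url, timeout=timeout_ms)
            browser.close()
            return True
    except PlaywrightError:
        return False

# keep only working proxies in the rotation pool
healthy_proxies = [p for p in proxies if check_proxy_health(p)]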
Advanced Detection Vectors
Modern websites use multiple detection methods beyond basic IP blocking:
Canvas/WebGL Fingerprinting: Websites render hidden graphics and read back pixel data to create unique browser fingerprints that identify you across sessions. Different browsers, graphics cards, and operating systems produce slightly different rendering results, creating a unique signature.
navigator.webdriver Detection: Automated browsers expose navigator.webdriver = true which anti-bot systems detect instantly. This is one of the first checks many sites perform:
from playwright.sync_api import sync_playwright
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
page = browser.new_page()
# Check if webdriver is detected
is_webdriver = page.evaluate("navigator.webdriver")
print(f"navigator.webdriver detected: {is_webdriver}") # True in Playwright
# Basic stealth: Try to mask webdriver (limited effectiveness)
page.add_init_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
page.goto("https://bot-detection-test.com")
TLS Fingerprinting: Server-side analysis of TLS handshake patterns (cipher suites, extensions order) can identify automated clients even before HTTP requests. Browsers have unique TLS signatures, and automation tools often have different signatures than regular browsers.
Timing Analysis: Bots perform actions with inhuman precision - no human clicks exactly 100ms after page load every time. Suspicious patterns include consistent delays, instant form filling, and no mouse movement.
Mouse Movement & Behavioral Patterns: Absence of natural mouse movements, scrolling patterns, and random pauses signal automation. Humans move mice erratically, pause to read, and scroll gradually. Bots typically don't generate these behaviors.
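None of this fully defeats serious behavioral analysis, but adding randomized pauses, mouse movement and gradual scrolling is a common mitigation. A rough sketch (the delays and coordinates are arbitrary):
import random

def human_pause(page, low_ms=500, high_ms=2500):
    """Pause for a randomized, human-ish amount of time"""
    page.wait_for_timeout(random.uniform(low_ms, high_ms))

# wander the mouse a little before interacting
for _ in range(3):
    page.mouse.move(random.randint(0, 1920), random.randint(0, 1080), steps=random.randint(10, 30))
    human_pause(page)

# scroll gradually rather than jumping straight to the bottom
for _ in range(5):
    page.mouse.wheel(0, random.randint(200, 600))
    human_pause(page, 300, 1200)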
Managing these detection vectors at scale requires continuous effort:
- Browser fingerprint rotation and maintenance
- Staying current with evolving detection techniques
- Infrastructure for proxy pools and rotation
- Monitoring and adapting as sites update their defenses
- Handling CAPTCHAs and challenge responses
For production environments where reliability and scale matter, tools like Scrapfly handle these fingerprinting challenges automatically:
- Rotating residential proxies across multiple geolocations
- Auto-retry for 403/429 errors with exponential backoff and jitter
- Browser fingerprint management (Canvas, WebGL, TLS)
- CAPTCHA solving integration
- Managed anti-scraping protection that adapts to site changes
For more on this see our extensive article covering javascript fingerprinting and variable leaking:
How Javascript is Used to Block Web Scrapers? In-Depth Guide
Introduction to how javascript is used to detect web scrapers. What's in javascript fingerprint and how to correctly spoof it for web scraping.
Power-Up with Scrapfly
Playwright is a powerful web scraping tool however it can be difficult to scale up and handle in some web scraping scenarios and this is where Scrapfly can be of assistance!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass - scrape web pages without blocking!
- Rotating residential proxies - prevent IP address and geographic blocks.
- JavaScript rendering - scrape dynamic web pages through cloud browsers.
- Full browser automation - control browsers to scroll, input and click on objects.
- Format conversion - scrape as HTML, JSON, Text, or Markdown.
- Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.
Using ScrapFly SDK we can replicate the same actions we did in Playwright:
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
# We can use a browser to render the page, screenshot it and return final HTML
result = client.scrape(ScrapeConfig(
"https://www.twitch.tv/directory/game/Art",
# enable browser rendering
render_js=True,
# we can wait for specific part to load just like with Playwright:
wait_for_selector="div[data-target=directory-first-item]",
# we can capture screenshots
screenshots={"everything": "fullpage"},
# for targets that block scrapers we can enable block bypass:
asp=True
))
# It's also possible to execute complex javascript scenarios like button clicking
# and text typing:
result = client.scrape(ScrapeConfig(
"https://www.twitch.tv/directory/game/Art",
# enable browser rendering
wait_for_selector="div[data-target=directory-first-item]",
render_js=True,
js_scenario=[
# wait to load
{"wait_for_selector": {"selector": 'input[autocomplete="twitch-nav-search"]'}},
# input search
{"fill": {"value": "watercolor", "selector": 'input[autocomplete="twitch-nav-search"]'}},
# click search button
{"click": {"selector": 'button[aria-label="Search Button"]'}},
# wait explicit amount of time
{"wait_for_navigation": {"timeout": 2000}}
]
))
Just like with Playwright we can control a web browser to navigate the website, click buttons, input text and return the final rendered HTML to us for parsing.
FAQs
What's the difference between Playwright and Selenium for web scraping?
Playwright uses modern Chrome DevTools Protocol (CDP) with a more intuitive API, supports multiple programming languages, and offers both sync/async modes. Selenium uses WebDriver protocol with a less modern API and only supports synchronous operations. Playwright generally performs better and is less likely to be detected.
How do I handle dynamic content loading with Playwright?
Use page.wait_for_selector() to wait for specific elements to appear, page.wait_for_timeout() for fixed delays, or page.wait_for_load_state() for page load completion. For infinite scrolling, use scroll_into_view_if_needed() on the last element to trigger new content loading.
Can I use Playwright for scraping without a headless browser?
Yes, set headless=False to run with a visible browser window. This is useful for debugging, but headless mode (headless=True) is faster and more resource-efficient for production scraping.
How do I avoid getting blocked when scraping with Playwright?
Use rotating proxies, implement realistic delays, rotate user-agents and headers, use residential IPs, avoid patterns that look automated, and consider using anti-bot bypass services. Playwright's real browser helps but isn't foolproof against advanced detection.
How to use a proxy with Playwright?
We can assign a proxy IP address per Playwright browser instance:
from playwright.sync_api import sync_playwright
with sync_playwright() as pw:
browser = pw.chromium.launch(
    headless=False,
    # direct proxy server:
    proxy={"server": "11.11.11.1:9000"},
    # or with username/password authentication:
    # proxy={"server": "11.11.11.1:9000", "username": "A", "password": "B"},
)
page = browser.new_page()
How to speed up a Playwright Scraper?
We can greatly speed up scrapers using Playwright by ensuring that the headless browser is blocking the rendering of images and media. This can greatly reduce bandwidth and speed up scraping 2-5 times! For more see the Blocking Resources section.
Which headless browser is best to use for Playwright Scraping?
Headless chrome performs the best when it comes to scraping with Playwright. Though, Firefox can often help with avoiding blocking and captchas as it's a less popular browser. For more see: How Javascript is Used to Block Web Scrapers? In-Depth Guide
Is asyncio faster than threads for Playwright scraping?
Yes, asyncio is generally faster for I/O-bound scraping. Playwright has native async support and asyncio avoids thread overhead. For pure web scraping, asyncio typically uses noticeably less memory than threads for the same number of concurrent operations.
How many pages can I scrape in parallel safely with Playwright?
Typical safe range: 3-10 concurrent browser contexts on consumer hardware. Each browser context uses approximately 50-100MB RAM. Use semaphores to limit concurrency (asyncio.Semaphore(5)). Start conservative (3-5) and increase gradually while monitoring for 429 rate limit errors and memory usage.
What's the best retry strategy for 403/429 errors in Playwright?
Use exponential backoff: wait_time = base_delay * (2 ** attempt) with jitter * random(0.5, 1.5). Typical config: 3-5 max retries with 1-2 second base delay. Give up after max retries or for non-retryable errors (404, 401).
Can AI tools like AutoGPT use Playwright for web scraping?
Yes, AI agents can use Playwright for autonomous browsing. LLMs can navigate sites, fill forms, and extract data based on natural language goals. Current use cases include AI determining which elements to click and adapting to page structure changes. However, production implementations still require significant engineering work beyond basic AI integration.
Summary
In this in-depth introduction, we learned how we can use the Playwright browser automation toolkit for web scraping. We explored core features such as navigation, button clicking, input typing and data parsing through a real-life twitch.tv scraper example.
We've also taken a look at more advanced features like resource blocking, which can significantly reduce the bandwidth used by our browser-powered web scrapers. The same interception feature can also be used to inspect browser background requests, extract details like cookies, or block and modify connections.