How do you scrape JavaScript-heavy websites? Traditional scrapers see blank pages where JavaScript renders content. Playwright for Python solves this by controlling real browsers that execute JavaScript exactly like a user would. This guide takes you from extracting your first dynamic page to building production scrapers that handle thousands of URLs reliably. You'll build a real Twitch.tv scraper while learning patterns that work for any JavaScript-powered site.
Modern anti-bot systems (Cloudflare Turnstile, DataDome, Kasada) detect Playwright instantly through navigator.webdriver and TLS fingerprinting. For production scraping with anti-detection, see Scrapfly's Web Scraping API.
Key Takeaways
Master Playwright web scraping in Python for JavaScript-heavy websites: handling dynamic content, form interactions, and browser automation for complex scraping scenarios.
- Use Playwright to automate real browsers for scraping JavaScript-rendered content
- Implement explicit waits and element detection to handle dynamic page loading
- Navigate complex web applications with form filling, clicking, and scrolling actions
- Handle pop-ups, alerts, and JavaScript challenges that block traditional HTTP scrapers
- Implement exponential backoff retry logic with 403 status code detection for rate limiting
- Build robust scrapers that can handle modern single-page applications and AJAX content
What is Playwright?
Playwright is a cross-platform, cross-language web browser automation toolkit. It's primarily intended for website testing, but it's perfectly capable of general browser automation and web scraping.
Using Playwright we can automate headless web browsers like Firefox or Chrome to navigate the web just like a human would: go to URLs, click buttons, write text and execute javascript.
It's a great tool for web scraping as it allows us to scrape dynamic javascript-powered websites without having to reverse engineer their behavior. It can also help avoid blocking, since the scraper runs a full browser, which appears more human than standalone HTTP requests.
Playwright vs Selenium vs Puppeteer
Compared to other popular browser automation toolkits like Selenium or Puppeteer, Playwright has a few advantages:
- Playwright supports many programming languages whereas Puppeteer is only available in JavaScript.
- Playwright uses the Chrome Devtools Protocol (CDP) and a more modern API, whereas Selenium uses the WebDriver protocol and an older API.
- Playwright supports both asynchronous and synchronous clients, whereas Selenium only supports a synchronous client and Puppeteer an asynchronous one. In Playwright, we can write small scrapers using synchronous clients and scale up simply by switching to a more complex asynchronous architecture.
In other words, Playwright is a horizontal improvement over Selenium and Puppeteer, though every toolkit has its own strengths. If you'd like to learn more, see our other introduction articles.
Setup
Playwright for Python can be installed through pip:
# install playwright package:
$ pip install playwright
# install playwright chromium and firefox browsers
$ playwright install chromium firefox
The above commands install the playwright package and the Playwright browser binaries. For Playwright scraping, it's best to use either Chromium or Firefox as these are the most stable implementations and often the least likely to be blocked.
Tip: Playwright in REPL
The easiest way to understand Playwright is to experiment with it in real-time through a Python REPL (Read-Eval-Print Loop) such as ipython.
After starting ipython, we can launch a playwright browser and execute browser automation commands in real-time to experiment and prototype our web scraper:
$ pip install ipython nest_asyncio
$ ipython
import nest_asyncio; nest_asyncio.apply() # This is needed to use sync API in repl
from playwright.sync_api import sync_playwright
pw = sync_playwright().start()
chrome = pw.chromium.launch(headless=False)
page = chrome.new_page()
page.goto("https://twitch.tv")
Now, let's take a look at this in greater detail.
The Basics
To start, we need to launch a browser and start a new browser tab:
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
# create browser instance
browser = pw.chromium.launch(
# we can choose either a Headful (With GUI) or Headless mode:
headless=False,
)
# create context
# using context we can define page properties like viewport dimensions
context = browser.new_context(
# most common desktop viewport is 1920x1080
viewport={"width": 1920, "height": 1080}
)
# create page aka browser tab which we'll be using to do everything
page = context.new_page()
Once we have our browser page ready, we can start scraping. For that, we only need a handful of Playwright features:
- Navigation (i.e. go to URL)
- Button clicking
- Text input
- Javascript Execution
- Waiting for content to load
Let's take a look at these features through a real-life example.
For this, we'll be scraping video data from twitch.tv's Art section, where users stream their art creation process. We'll be collecting dynamic data like stream name, viewer count and author details.
Our task in Playwright for this exercise is:
- Start a browser instance, context and browser tab (page)
- Go to twitch.tv/directory/game/Art
- Wait for the page to fully load
- Parse loaded page data for all active streams
Navigation and Waiting
To navigate we can use page.goto() function which will direct the browser to any URL:
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
# go to url
page.goto("https://twitch.tv/directory/game/Art")
# get HTML
print(page.content())
However, for javascript-heavy websites like twitch.tv, our page.content() call might return the HTML before everything has loaded.
To ensure that doesn't happen we can wait for a particular element to appear on the page. In other words, if the list of videos is present on the page then we can safely assume the page has loaded:
page.goto("https://twitch.tv/directory/game/Art")
# wait for first result to appear
page.wait_for_selector("div[data-target=directory-first-item]")
# retrieve final HTML content
print(page.content())
Above, we used page.wait_for_selector() function to wait for an element defined by our CSS selector to appear on the page.
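Playwright offers a few other waiting strategies as well. Here's a quick sketch of the common alternatives (the timeout values are just illustrative):
page.goto("https://twitch.tv/directory/game/Art")

# wait until the page fires the "load" event
page.wait_for_load_state("load")

# or wait until there are no network connections for at least 500ms
page.wait_for_load_state("networkidle")

# or wait a fixed amount of time (least reliable, use as a last resort)
page.wait_for_timeout(3_000)

# or wait for our target element with an explicit timeout (in milliseconds)
page.wait_for_selector("div[data-target=directory-first-item]", timeout=10_000)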
Parsing Data
Since Playwright uses a real web browser with javascript environment we can use the browser's HTML parsing capabilities. In Playwright this is implemented through locators feature:
from playwright.sync_api import sync_playwright
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
page.goto("https://twitch.tv/directory/game/Art") # go to url
page.wait_for_selector("div[data-target=directory-first-item]") # wait for content to load
parsed = []
stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
for box in stream_boxes.element_handles():
parsed.append({
"title": box.query_selector("h3").inner_text(),
"url": box.query_selector(".tw-link").get_attribute("href"),
"username": box.query_selector(".tw-link").inner_text(),
"viewers": box.query_selector(".tw-media-card-stat").inner_text(),
# tags are not always present:
"tags": box.query_selector(".tw-tag").inner_text() if box.query_selector(".tw-tag") else None,
})
for video in parsed:
print(video)
Example Output
{"title": "art", "url": "/lunyatic/videos", "username": "Lunyatic", "viewers": "25 viewers", "tags": "en"}
{"title": "생존신고", "url": "/lilllly1/videos", "username": "생존신고\n\n릴리작가 (lilllly1)", "viewers": "51 viewers", "tags": "한국어"}
{"title": "The day 0914.", "url": "/niai_serie/videos", "username": "The day 0914", "viewers": "187 viewers", "tags": None}
...
In the code above, we selected each result box using XPath selectors and extracted details from within it using CSS selectors.
Unfortunately, Playwright's parsing capabilities are a bit clunky and can break easily when parsing optional elements, like the tags field in our example. Instead, we can parse with traditional Python tools such as the parsel or beautifulsoup packages, which are faster and provide a more robust API:
...
# using Parsel:
from parsel import Selector
page_html = page.content()
sel = Selector(text=page_html)
parsed = []
for item in sel.xpath("//div[contains(@class,'tw-tower')]/div[@data-target]"):
parsed.append({
'title': item.css('h3::text').get(),
'url': item.css('.tw-link::attr(href)').get(),
'username': item.css('.tw-link::text').get(),
'tags': item.css('.tw-tag ::text').getall(),
'viewers': ''.join(item.css('.tw-media-card-stat::text').re(r'(\d+)')),
})
# using Beautifulsoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content(), "html.parser")
parsed = []
for item in soup.select(".tw-tower div[data-target]"):
parsed.append({
'title': item.select_one('h3').text,
'url': item.select_one('.tw-link').attrs.get("href"),
'username': item.select_one('.tw-link').text,
'tags': [tag.text for tag in item.select('.tw-tag')],
'viewers': item.select_one('.tw-media-card-stat').text,
})
While playwright locators aren't great for parsing they are great for interacting with the website. Next, let's take a look at how we can click buttons and input text using locators.
Clicking Buttons and Text Input
To explore click and text input let's extend our twitch.tv scraper with search functionality:
- We'll go to twitch.tv
- Select the search box and input a search query
- Click the search button or press Enter
- Wait for the content to load
- Parse results
In playwright to interact with the web components we can use the same locator functionality we used in parsing:
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
page.goto("https://www.twitch.tv/directory/game/Art")
# find search box and enter our query:
search_box = page.locator('input[autocomplete="twitch-nav-search"]')
search_box.type("Painting", delay=100)
# then, we can either send Enter key:
search_box.press("Enter")
# or we can press the search button explicitly:
search_button = page.locator('button[aria-label="Search Button"]')
search_button.click()
# click on tagged channels link:
page.locator('.search-results .tw-link[href*="all/tags"]').click()
# Finally, we can parse results like we did before:
parsed = []
stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
for box in stream_boxes.element_handles():
...
Note: Playwright locator actions run in strict mode. If a selector matches multiple elements, actions like click() raise an error because Playwright wouldn't know which one to use. Our selectors must match exactly one element, or we must narrow the locator down first.
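If a selector does match several elements, we can narrow the locator down before interacting with it. A small illustrative sketch:
links = page.locator(".tw-link")
print(links.count(), "elements matched")  # how many elements the selector found
# pick one element explicitly before performing an action:
links.first.click()     # the first match
# links.nth(2).click()  # or the third match (zero-indexed)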
We got search functionality working and extracted the first page of the results, though how do we get the rest of the pages? For this we'll need scrolling functionality - let's take a look at it.
Scrolling and Infinite Pagination
The stream results section of twitch.tv uses infinite scrolling pagination. To retrieve the rest of the results in our Playwright scraper, we need to continuously scroll the last result into view to trigger new page loads.
We could do this by scrolling to the bottom of the entire page, but that doesn't always work in headless browsers. A better way is to find all result elements and scroll the last one into view explicitly.
In playwright, this can be done by using locators and scroll_into_view_if_needed() function. We'll keep scrolling the last result into view to trigger the next page loading until no more new results appear:
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
page.goto("https://www.twitch.tv/directory/game/Art")
# wait for content to fully load:
page.wait_for_selector("div[data-target=directory-first-item]")
# loop scrolling last element into view until no more new elements are created
stream_boxes = None
while True:
stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
stream_boxes.element_handles()[-1].scroll_into_view_if_needed()
items_on_page = len(stream_boxes.element_handles())
page.wait_for_timeout(2_000) # give some time for new items to load
items_on_page_after_scroll = len(stream_boxes.element_handles())
if items_on_page_after_scroll > items_on_page:
continue # more items loaded - keep scrolling
else:
break # no more items - break scrolling loop
# parse data:
parsed = []
for box in stream_boxes.element_handles():
...
In the example code above, we will continuously trigger new result loading until the pagination end is reached. In this case, our code should generate hundreds of parsed results.
Advanced Functions
We've covered the most common Playwright features used in web scraping: navigation, waiting, clicking, typing and scrolling. However, there are a few advanced features that come in handy when scraping more complex targets.
Evaluating Javascript
Playwright can evaluate any javascript code in the context of the current page. Using javascript we can do everything we did before, like navigating, clicking and scrolling, and even more. In fact, many of these Playwright functions are implemented through javascript evaluation.
For example, if the built-in scrolling is failing us we can define our own scrolling javascript function and submit it to Playwright:
page.evaluate("""
let items=document.querySelectorAll('.tw-tower>div');
items[items.length-1].scrollIntoView({behavior: "smooth", block: "end", inline: "end"});
""")
The above code will scroll the last result into view just like previously but it'll scroll smoothly and to the very edge of the object. This approach is more likely to trigger next page loading compared to Playwright's scroll_into_view_if_needed function.
Javascript evaluation is a powerful feature that can be used to scrape complex web apps as it gives us full control of the browser's capabilities through javascript.
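Evaluation can also return values back to Python, which is handy when data is easier to reach through the DOM directly. A short sketch reusing the Twitch selectors from earlier:
# return a simple value from the page context
print(page.evaluate("document.title"))

# return structured data built in the browser context
stream_titles = page.evaluate("""
    Array.from(document.querySelectorAll('.tw-tower h3'))
         .map(node => node.innerText)
""")
print(stream_titles)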
Request and Response Intercepting
Playwright tracks all of the background requests and responses the browser sends and receives. In web scraping, we can use this to collect data hidden in background (XHR) requests and responses. Note that these event hooks are observers only; to modify or block traffic we use the routing feature covered in the next section:
from playwright.sync_api import sync_playwright
def intercept_request(request):
    # note: request/response event hooks can only observe traffic;
    # to modify or block requests use page.route() (see Blocking Resources below)
    if "secret" in request.url:
        print(f"saw a secret request: {request.url}")
        print(request.headers)
    if request.method == "POST":
        print(f"saw a POST request with data: {request.post_data}")
    return request
def intercept_response(response):
    # we can extract details from background responses like hidden API calls
    if response.request.resource_type == "xhr":
        print(response.url, response.status)
        print(response.headers.get("set-cookie"))
    return response
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
# enable intercepting for this page
page.on("request", intercept_request)
page.on("response", intercept_response)
page.goto("https://www.twitch.tv/directory/game/Art")
page.wait_for_selector("div[data-target=directory-first-item]")
In the example above, we define our observer functions and attach them to our playwright page. This lets us inspect every background and foreground request and response the browser makes. To modify or block them, we'd use page.route() as shown in the next section.
Blocking Resources
Web scraping using headless browsers is really bandwidth intensive. The browser is downloading all of the images, fonts and other expensive resources our web scraper doesn't care about. To optimize this we can configure our Playwright instance to block these unnecessary resources:
from collections import Counter
from playwright.sync_api import sync_playwright
# block pages by resource type. e.g. image, stylesheet
BLOCK_RESOURCE_TYPES = [
'beacon',
'csp_report',
'font',
'image',
'imageset',
'media',
'object',
'texttrack',
# we can even block stylesheets and scripts though it's not recommended:
# 'stylesheet',
# 'script',
# 'xhr',
]
# we can also block popular 3rd party resources like tracking:
BLOCK_RESOURCE_NAMES = [
'adzerk',
'analytics',
'cdn.api.twitter',
'doubleclick',
'exelator',
'facebook',
'fontawesome',
'google',
'google-analytics',
'googletagmanager',
]
def intercept_route(route):
"""intercept all requests and abort blocked ones"""
if route.request.resource_type in BLOCK_RESOURCE_TYPES:
print(f'blocking background resource {route.request} blocked type "{route.request.resource_type}"')
return route.abort()
if any(key in route.request.url for key in BLOCK_RESOURCE_NAMES):
print(f"blocking background resource {route.request} blocked name {route.request.url}")
return route.abort()
return route.continue_()
with sync_playwright() as pw:
browser = pw.chromium.launch(
headless=False,
# enable devtools so we can see total resource usage:
devtools=True,
)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
# enable intercepting for this page, **/* stands for all requests
page.route("**/*", intercept_route)
page.goto("https://www.twitch.tv/directory/game/Art")
page.wait_for_selector("div[data-target=directory-first-item]")
In the example above, we are defining an interception rule which tells Playwright to drop any unwanted background resource requests that are either of ignored type or contain ignored phrases in the URL (like google analytics).
We can see the amount of data saved in the browser Devtools' Network tab.
Why Do Playwright Scrapers Fail and How to Retry Intelligently?
Failed requests can waste a large share of scraping time. Without retry logic, temporary failures break entire scraping jobs. Exponential backoff with jitter prevents rate limit bans while keeping your scraper running efficiently.
Common failure patterns include:
- 403 Forbidden - Bot detection triggered
- 429 Too Many Requests - Rate limited, respect retry-after header
- TimeoutError - Page took too long to load
- NetworkError - Connection issues, DNS failures
- 500 Server Errors - Temporary server issues (safe to retry)
Exponential backoff implements intelligent retry timing where wait_time = base_delay * (2 ** attempt). This spreads retries over time and avoids hammering servers. Adding jitter (randomness) with wait_time * random(0.5, 1.5) prevents thundering herd problems when multiple scrapers retry simultaneously.
Typical configuration uses 3-5 max retries with a 1-2 second base delay. The scraper should give up after max retries or for non-retryable errors like 404 (Not Found) or 401 (Unauthorized).
import time
import random
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError
def retry_with_backoff(func, max_retries=5, base_delay=1):
"""Retry decorator with exponential backoff and jitter"""
def wrapper(*args, **kwargs):
for attempt in range(max_retries):
try:
return func(*args, **kwargs)
except Exception as e:
error_code = getattr(e, 'status', None)
# Don't retry on non-retryable errors
if error_code in [401, 404]:
raise
# Check if we should retry
if attempt == max_retries - 1:
raise
# Calculate exponential backoff with jitter
wait_time = base_delay * (2 ** attempt)
jittered_wait = wait_time * random.uniform(0.5, 1.5)
print(f"Attempt {attempt + 1} failed with {type(e).__name__}: {e}")
print(f"Retrying in {jittered_wait:.2f} seconds...")
time.sleep(jittered_wait)
return None
return wrapper
@retry_with_backoff
def scrape_page(page, url):
    """Scrape a page with automatic retry logic"""
    response = page.goto(url, timeout=30000)
    # attach the HTTP status to the exception so the retry logic can inspect it
    if response and response.status >= 400:
        error = Exception(f"HTTP {response.status} when fetching {url}")
        error.status = response.status
        raise error
    page.wait_for_selector("div[data-target=directory-first-item]", timeout=10000)
    return page.content()
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
html = scrape_page(page, "https://www.twitch.tv/directory/game/Art")
print(f"Successfully scraped {len(html)} bytes")
How to Scrape Multiple Web Pages with Playwright
Single-page scraping is just the beginning. Real projects need to process hundreds of URLs, handle pagination, and run concurrent browsers without crashing. Here's how to scale Playwright from tutorial to production.
How to Loop Through Multiple URLs
When scraping product catalogs or directories, you need to process hundreds or thousands of URLs systematically. Reading URLs from a list or CSV file and iterating through each one ensures you collect data methodically.
For example, scraping multiple product pages from a catalog requires visiting each URL, extracting data, and storing results. Reading URLs from a CSV file provides a clean way to manage your scraping queue:
import csv
import json
from playwright.sync_api import sync_playwright
def scrape_url(page, url):
"""Scrape a single URL and extract data"""
try:
page.goto(url, timeout=30000)
page.wait_for_selector("div[data-target=directory-first-item]", timeout=10000)
# Extract data (simplified example)
title = page.locator("h1").inner_text()
return {"url": url, "title": title, "success": True}
except Exception as e:
return {"url": url, "error": str(e), "success": False}
# Read URLs from CSV file
urls = []
with open('urls.csv', 'r') as f:
reader = csv.DictReader(f)
urls = [row['url'] for row in reader]
# Scrape each URL and save to JSONL
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=True)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
with open('results.jsonl', 'w') as f:
for url in urls:
result = scrape_url(page, url)
f.write(json.dumps(result) + '\n')
print(f"Scraped: {url} - Success: {result['success']}")
browser.close()
The pattern above iterates through URLs sequentially with basic error handling. Each result gets written to a JSONL file immediately, ensuring data persists even if the scraper crashes partway through. You could also save to CSV using csv.DictWriter or to a database for larger projects.
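For instance, a CSV version using csv.DictWriter might look like this sketch (it assumes the results were collected into a results list rather than written out line-by-line):
import csv

# assumption: `results` is a list of the dicts produced by scrape_url() above
fieldnames = ["url", "title", "error", "success"]
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(results)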
How to Handle Pagination with Playwright
Websites split content across multiple pages using pagination buttons or URL parameters. Playwright handles both approaches effectively.
Button-based pagination requires detecting and clicking "Next Page" buttons using locators. The scraper should check if the button exists before clicking and detect when it's disabled or missing (indicating the last page):
from playwright.sync_api import sync_playwright
def scrape_all_pages_button(page, start_url):
"""Scrape paginated results using Next button"""
page.goto(start_url)
page.wait_for_selector("div[data-target=directory-first-item]")
all_results = []
page_num = 1
while True:
print(f"Scraping page {page_num}...")
# Extract data from current page
items = page.locator("div[data-target]").all()
all_results.extend([item.inner_text() for item in items])
# Check if Next button exists and is enabled
next_button = page.locator('button[aria-label="Next"]')
if next_button.count() == 0 or next_button.is_disabled():
print("Reached last page")
break
# Click next and wait for new content
next_button.click()
page.wait_for_selector("div[data-target=directory-first-item]")
page_num += 1
return all_results
Parameter-based pagination modifies URL parameters like ?page=2 or ?offset=20. This approach is often faster since you can directly navigate to specific pages:
def scrape_all_pages_params(page, base_url):
"""Scrape paginated results using URL parameters"""
all_results = []
page_num = 1
while True:
url = f"{base_url}?page={page_num}"
print(f"Scraping {url}...")
page.goto(url)
page.wait_for_selector("div[data-target=directory-first-item]", timeout=5000)
# Extract data
items = page.locator("div[data-target]").all()
if len(items) == 0:
print("No more results")
break
all_results.extend([item.inner_text() for item in items])
page_num += 1
return all_results
Unlike infinite scrolling (covered in the Scrolling and Infinite Pagination section), pagination requires clicking buttons or changing URL parameters rather than scrolling to load more content.
How to Scrape Pages Concurrently with Asyncio
Concurrent scraping significantly improves speed for I/O-bound tasks. Playwright has native async support designed for running multiple browser contexts simultaneously.
Why concurrent scraping matters: A single-threaded scraper spends most time waiting for pages to load. Running 5 concurrent browsers can scrape 5x faster while using only slightly more memory.
Asyncio with semaphores controls concurrency to avoid overwhelming your system or the target server. Each browser context uses approximately 50-100MB RAM, so limiting concurrent browsers prevents memory exhaustion:
import asyncio
from playwright.async_api import async_playwright
async def scrape_url_async(browser, url, semaphore):
"""Scrape a single URL with semaphore rate limiting"""
async with semaphore: # Limit concurrent browsers
context = await browser.new_context(viewport={"width": 1920, "height": 1080})
page = await context.new_page()
try:
await page.goto(url, timeout=30000)
await page.wait_for_selector("div[data-target=directory-first-item]", timeout=10000)
title = await page.locator("h1").inner_text()
return {"url": url, "title": title, "success": True}
except Exception as e:
return {"url": url, "error": str(e), "success": False}
finally:
await context.close()
async def scrape_multiple_concurrent(urls, max_concurrent=5):
"""Scrape multiple URLs concurrently with rate limiting"""
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
# Semaphore limits concurrent operations
semaphore = asyncio.Semaphore(max_concurrent)
# Create tasks for all URLs
tasks = [scrape_url_async(browser, url, semaphore) for url in urls]
# Run all tasks concurrently
results = await asyncio.gather(*tasks, return_exceptions=True)
await browser.close()
return results
# Run the concurrent scraper
urls = [
"https://www.twitch.tv/directory/game/Art",
"https://www.twitch.tv/directory/game/Music",
"https://www.twitch.tv/directory/game/Science",
# ... more URLs
]
results = asyncio.run(scrape_multiple_concurrent(urls, max_concurrent=5))
for result in results:
print(result)
Per-domain concurrency respects rate limits by controlling requests per domain. If scraping multiple domains, you can allow higher total concurrency while limiting requests to each individual domain:
from collections import defaultdict
from urllib.parse import urlparse
async def scrape_with_domain_limits(urls, per_domain_limit=2):
"""Control concurrency per domain to avoid rate limits"""
# Group URLs by domain
domain_semaphores = defaultdict(lambda: asyncio.Semaphore(per_domain_limit))
async def scrape_with_domain_semaphore(browser, url):
domain = urlparse(url).netloc
semaphore = domain_semaphores[domain]
return await scrape_url_async(browser, url, semaphore)
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
tasks = [scrape_with_domain_semaphore(browser, url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
await browser.close()
return results
Performance considerations: Start with 3-5 concurrent browsers and monitor system memory. Too many parallel browsers can trigger anti-bot detection or overwhelm your network. Batch processing URLs in chunks of 50-100 prevents memory issues when scraping thousands of pages.
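As a rough sketch of such batching, assuming the scrape_multiple_concurrent() helper and urls list from above:
import asyncio

def chunked(items, size):
    """Yield successive chunks of the given size"""
    for i in range(0, len(items), size):
        yield items[i:i + size]

all_results = []
for batch in chunked(urls, 50):
    # each batch runs with its own browser lifecycle, keeping memory bounded
    all_results.extend(asyncio.run(scrape_multiple_concurrent(batch, max_concurrent=5)))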
When to use threads: ThreadPoolExecutor works when scraping combines I/O (fetching) with CPU-heavy parsing. For pure Playwright scraping, asyncio is typically faster and more memory-efficient.
Avoiding Blocking
While Playwright uses a real browser, it's still possible for websites to determine whether it's controlled by a real user or automated by a toolkit. Modern anti-bot systems use advanced detection techniques that go beyond simple checks.
What are Honeypot Traps and How to Avoid Them?
Websites hide invisible elements (honeypots) to catch bots - clicking them instantly flags your scraper. These traps include invisible links with display: none, hidden forms with visibility: hidden, or elements positioned off-screen with negative coordinates.
Honeypots work because bots often interact with all clickable elements indiscriminately, while humans only see and click visible elements. A typical honeypot might be a link that says "Click here for admin access" but is hidden via CSS.
To avoid triggering honeypots, always check element visibility before interaction:
from playwright.sync_api import sync_playwright
def is_element_visible(element):
    """Check if element is rendered and visible to users"""
    # is_visible() covers display:none, visibility:hidden and zero-size elements
    return element.is_visible()
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
page = browser.new_page()
page.goto("https://example.com")
# Only click visible buttons
buttons = page.locator("button").all()
for button in buttons:
if is_element_visible(button):
button.click()
print(f"Clicked visible button: {button.inner_text()}")
else:
print(f"Skipped hidden honeypot button")
Additional honeypot avoidance strategies include checking element dimensions (width and height should be greater than 0), verifying opacity is not 0, and ensuring elements aren't positioned outside the viewport bounds.
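A rough sketch of such extra checks (the helper name and thresholds are just illustrative):
def passes_honeypot_checks(element):
    """Heuristic visibility checks beyond is_visible() - illustrative only"""
    box = element.bounding_box()
    # no bounding box or zero size usually means the element isn't rendered
    if not box or box["width"] <= 0 or box["height"] <= 0:
        return False
    # elements positioned off-screen are invisible to real users
    if box["x"] < 0 or box["y"] < 0:
        return False
    # fully transparent elements are another common honeypot trick
    opacity = element.evaluate("el => getComputedStyle(el).opacity")
    return float(opacity) > 0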
How to Configure Proxy Rotation
Websites track and block IP addresses - proxy rotation spreads requests across multiple IPs to avoid detection and rate limits. When a website sees thousands of requests from a single IP, it's an obvious bot signal.
Proxy setup with Playwright configures proxies per browser instance:
from playwright.sync_api import sync_playwright
# List of proxy servers
proxies = [
{"server": "http://proxy1.example.com:8080", "username": "user1", "password": "pass1"},
{"server": "http://proxy2.example.com:8080", "username": "user2", "password": "pass2"},
{"server": "http://proxy3.example.com:8080", "username": "user3", "password": "pass3"},
]
def scrape_with_proxy(url, proxy):
"""Scrape URL using specific proxy"""
with sync_playwright() as pw:
browser = pw.chromium.launch(
headless=True,
proxy=proxy
)
page = browser.new_page()
page.goto(url)
content = page.content()
browser.close()
return content
# Rotate through proxies
import itertools
proxy_pool = itertools.cycle(proxies)
urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
for url in urls:
proxy = next(proxy_pool)
print(f"Scraping {url} with proxy {proxy['server']}")
content = scrape_with_proxy(url, proxy)
Residential vs datacenter proxies:
- Residential proxies use real IP addresses from ISPs, making them appear as legitimate home users. They're harder to detect but more expensive ($5-15 per GB). Use for sites with strong anti-bot protection.
- Datacenter proxies come from data centers and are faster and cheaper ($0.50-2 per GB) but easier to detect. Use for sites with minimal blocking.
- Rotating residential proxies automatically change IP on each request, providing maximum anonymity at premium cost ($10-20 per GB).
Managing proxy pools requires ongoing maintenance including monitoring proxy health, replacing dead proxies, handling authentication, and distributing requests evenly. The infrastructure cost and complexity add up quickly for production scrapers.
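A minimal health-check sketch, assuming a simple test URL like httpbin.org/ip and the proxies list from above:
from playwright.sync_api import sync_playwright, Error as PlaywrightError

def check_proxy_health(proxy, test_url="https://httpbin.org/ip", timeout_ms=15_000):
    """Return True if the proxy can load a simple test page"""
    try:
        with sync_playwright() as pw:
            browser = pw.chromium.launch(headless=True, proxy=proxy)
            page = browser.new_page()
            page.goto(test_url, timeout=timeout_ms)
            browser.close()
            return True
    except PlaywrightError:
        return False

# keep only working proxies in the rotation pool
healthy_proxies = [p for p in proxies if check_proxy_health(p)]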
Advanced Detection Vectors
Modern websites use multiple detection methods beyond basic IP blocking:
Canvas/WebGL Fingerprinting: Websites render hidden graphics and read back pixel data to create unique browser fingerprints that identify you across sessions. Different browsers, graphics cards, and operating systems produce slightly different rendering results, creating a unique signature.
navigator.webdriver Detection: Automated browsers expose navigator.webdriver = true which anti-bot systems detect instantly. This is one of the first checks many sites perform:
from playwright.sync_api import sync_playwright
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
page = browser.new_page()
# Check if webdriver is detected
is_webdriver = page.evaluate("navigator.webdriver")
print(f"navigator.webdriver detected: {is_webdriver}") # True in Playwright
# Basic stealth: Try to mask webdriver (limited effectiveness)
page.add_init_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
page.goto("https://bot-detection-test.com")
TLS Fingerprinting: Server-side analysis of TLS handshake patterns (cipher suites, extensions order) can identify automated clients even before HTTP requests. Browsers have unique TLS signatures, and automation tools often have different signatures than regular browsers.
Timing Analysis: Bots perform actions with inhuman precision - no human clicks exactly 100ms after page load every time. Suspicious patterns include consistent delays, instant form filling, and no mouse movement.
Mouse Movement & Behavioral Patterns: Absence of natural mouse movements, scrolling patterns, and random pauses signal automation. Humans move mice erratically, pause to read, and scroll gradually. Bots typically don't generate these behaviors.
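None of this fully defeats serious behavioral analysis, but adding randomized pauses, mouse movement and gradual scrolling is a common mitigation. A rough sketch (the delays and coordinates are arbitrary):
import random

def human_pause(page, low_ms=500, high_ms=2500):
    """Pause for a randomized, human-ish amount of time"""
    page.wait_for_timeout(random.uniform(low_ms, high_ms))

# wander the mouse a little before interacting
for _ in range(3):
    page.mouse.move(random.randint(0, 1920), random.randint(0, 1080), steps=random.randint(10, 30))
    human_pause(page)

# scroll gradually rather than jumping straight to the bottom
for _ in range(5):
    page.mouse.wheel(0, random.randint(200, 600))
    human_pause(page, 300, 1200)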
Managing these detection vectors at scale requires continuous effort:
- Browser fingerprint rotation and maintenance
- Staying current with evolving detection techniques
- Infrastructure for proxy pools and rotation
- Monitoring and adapting as sites update their defenses
- Handling CAPTCHAs and challenge responses
For production environments where reliability and scale matter, tools like Scrapfly handle these fingerprinting challenges automatically:
- Rotating residential proxies across multiple geolocations
- Auto-retry for 403/429 errors with exponential backoff and jitter
- Browser fingerprint management (Canvas, WebGL, TLS)
- CAPTCHA solving integration
- Managed anti-scraping protection that adapts to site changes
For more on this see our extensive article covering javascript fingerprinting and variable leaking:
How Javascript is Used to Block Web Scrapers? In-Depth Guide
Introduction to how javascript is used to detect web scrapers. What's in javascript fingerprint and how to correctly spoof it for web scraping.
Power-Up with Scrapfly
Playwright is a powerful web scraping tool however it can be difficult to scale up and handle in some web scraping scenarios and this is where Scrapfly can be of assistance!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass - scrape web pages without blocking!
- Rotating residential proxies - prevent IP address and geographic blocks.
- JavaScript rendering - scrape dynamic web pages through cloud browsers.
- Full browser automation - control browsers to scroll, input and click on objects.
- Format conversion - scrape as HTML, JSON, Text, or Markdown.
- Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.
Using ScrapFly SDK we can replicate the same actions we did in Playwright:
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
# We can use a browser to render the page, screenshot it and return final HTML
result = client.scrape(ScrapeConfig(
"https://www.twitch.tv/directory/game/Art",
# enable browser rendering
render_js=True,
# we can wait for specific part to load just like with Playwright:
wait_for_selector="div[data-target=directory-first-item]",
# we can capture screenshots
screenshots={"everything": "fullpage"},
# for targets that block scrapers we can enable block bypass:
asp=True
))
# It's also possible to execute complex javascript scenarios like button clicking
# and text typing:
result = client.scrape(ScrapeConfig(
"https://www.twitch.tv/directory/game/Art",
# enable browser rendering
wait_for_selector="div[data-target=directory-first-item]",
render_js=True,
js_scenario=[
# wait to load
{"wait_for_selector": {"selector": 'input[autocomplete="twitch-nav-search"]'}},
# input search
{"fill": {"value": "watercolor", "selector": 'input[autocomplete="twitch-nav-search"]'}},
# click search button
{"click": {"selector": 'button[aria-label="Search Button"]'}},
# wait explicit amount of time
{"wait_for_navigation": {"timeout": 2000}}
]
))
Just like with Playwright we can control a web browser to navigate the website, click buttons, input text and return the final rendered HTML to us for parsing.
FAQs
What's the difference between Playwright and Selenium for web scraping?
Playwright uses modern Chrome DevTools Protocol (CDP) with a more intuitive API, supports multiple programming languages, and offers both sync/async modes. Selenium uses WebDriver protocol with a less modern API and only supports synchronous operations. Playwright generally performs better and is less likely to be detected.
How do I handle dynamic content loading with Playwright?
Use page.wait_for_selector() to wait for specific elements to appear, page.wait_for_timeout() for fixed delays, or page.wait_for_load_state() for page load completion. For infinite scrolling, use scroll_into_view_if_needed() on the last element to trigger new content loading.
Can I use Playwright for scraping without a headless browser?
Yes, set headless=False to run with a visible browser window. This is useful for debugging, but headless mode (headless=True) is faster and more resource-efficient for production scraping.
How do I avoid getting blocked when scraping with Playwright?
Use rotating proxies, implement realistic delays, rotate user-agents and headers, use residential IPs, avoid patterns that look automated, and consider using anti-bot bypass services. Playwright's real browser helps but isn't foolproof against advanced detection.
How to use a proxy with Playwright?
We can assign a proxy IP address per Playwright browser instance:
from playwright.sync_api import sync_playwright
with sync_playwright() as pw:
browser = pw.chromium.launch(
    headless=False,
    # direct proxy server:
    proxy={"server": "11.11.11.1:9000"},
    # or with username/password authentication:
    # proxy={"server": "11.11.11.1:9000", "username": "A", "password": "B"},
)
page = browser.new_page()
How to speed up a Playwright Scraper?
We can greatly speed up scrapers using Playwright by ensuring that the headless browser is blocking the rendering of images and media. This can greatly reduce bandwidth and speed up scraping 2-5 times! For more see the Blocking Resources section.
Which headless browser is best to use for Playwright Scraping?
Headless chrome performs the best when it comes to scraping with Playwright. Though, Firefox can often help with avoiding blocking and captchas as it's a less popular browser. For more see: How Javascript is Used to Block Web Scrapers? In-Depth Guide
Is asyncio faster than threads for Playwright scraping?
Yes, asyncio is generally faster for I/O-bound scraping. Playwright has native async support and asyncio avoids thread overhead. For pure web scraping, asyncio typically uses noticeably less memory than threads for the same number of concurrent operations.
How many pages can I scrape in parallel safely with Playwright?
Typical safe range: 3-10 concurrent browser contexts on consumer hardware. Each browser context uses approximately 50-100MB RAM. Use semaphores to limit concurrency (asyncio.Semaphore(5)). Start conservative (3-5) and increase gradually while monitoring for 429 rate limit errors and memory usage.
What's the best retry strategy for 403/429 errors in Playwright?
Use exponential backoff: wait_time = base_delay * (2 ** attempt) with jitter * random(0.5, 1.5). Typical config: 3-5 max retries with 1-2 second base delay. Give up after max retries or for non-retryable errors (404, 401).
Can AI tools like AutoGPT use Playwright for web scraping?
Yes, AI agents can use Playwright for autonomous browsing. LLMs can navigate sites, fill forms, and extract data based on natural language goals. Current use cases include AI determining which elements to click and adapting to page structure changes. However, production implementations still require significant engineering work beyond basic AI integration.
Summary
In this in-depth introduction, we learned how we can use the Playwright browser automation toolkit for web scraping. We explored core features such as navigation, button clicking, input typing and data parsing through a real-life twitch.tv scraper example.
We've also taken a look at more advanced features like resource blocking, which can significantly reduce the bandwidth used by our browser-powered web scrapers. The same interception feature can also be used to inspect browser background requests, extract details like cookies, or block and modify connections.