Web Scraping with Playwright and Python

Playwright is a popular browser automation toolkit that can be used in web scraping to scrape dynamic web content or web apps.

Using Playwright, we don't need to reverse engineer and understand complex web technologies, as the browser does everything for us. This makes Playwright a great tool for scraping data without advanced web development knowledge.

In this in-depth practical tutorial we'll take a look at how to scrape with Playwright and Python. For that, we'll use an example scraping project by scraping twitch.tv.

We'll cover common questions like how Playwright works and how it compares to its competitors, how to execute common tasks like browser navigation, button clicking, text input and data parsing, as well as some advanced tasks like JavaScript evaluation and resource interception and blocking.

For more on scraping using browsers see our full introduction article, How to Scrape Dynamic Websites Using Headless Web Browsers, which compares popular tools like Selenium, Playwright and Puppeteer.

What is Playwright?

Playwright is a cross-platform and cross-language web browser automation toolkit. It's primarily intended to be used as a website test suite but it's perfectly capable of general browser automation and web scraping.

Using Playwright we can automate headless web browsers like Firefox or Chrome to navigate the web just like a human being would: go to URLs, click buttons, write text and execute JavaScript.

Playwright allows us to communicate with web browsers through Python code

It's a great tool for web scraping as it allows us to scrape dynamic JavaScript-powered websites without the need to reverse engineer their behavior. It can also help with avoiding blocking, as the scraper is running a full browser which appears more human than standalone HTTP requests.

Playwright vs Selenium vs Puppeteer

Compared to other popular browser automation toolkits like Selenium or Puppeteer, Playwright has a few advantages:

  • Playwright supports many programming languages whereas Puppeteer is only available in JavaScript.
  • Playwright drives Chromium through the Chrome DevTools Protocol (CDP) and offers a more modern API, whereas Selenium uses the WebDriver protocol and a less modern API.
  • Playwright supports both asynchronous and synchronous clients, whereas Selenium only supports a synchronous client and Puppeteer an asynchronous one. In Playwright, we can write small scrapers using synchronous clients and scale up simply by switching to a more complex asynchronous architecture.
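As a rough illustration of the asynchronous client mentioned above, here is a minimal sketch that loads several URLs concurrently. The function names fetch_html and main are made up for this example, and the Playwright import is done lazily inside main so the helper can be read and tested on its own:

```python
import asyncio


async def fetch_html(pw, url):
    """Open one browser tab, load the URL and return the rendered HTML."""
    browser = await pw.chromium.launch(headless=True)
    try:
        page = await browser.new_page()
        await page.goto(url)
        return await page.content()
    finally:
        # always release the browser, even if navigation fails
        await browser.close()


async def main(urls):
    # imported here so the module loads even where Playwright is not installed
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        # run all page loads concurrently instead of one after another
        return await asyncio.gather(*(fetch_html(pw, url) for url in urls))


# usage (launches real browsers, so it is left commented out):
# pages = asyncio.run(main(["https://twitch.tv/directory/game/Art"]))
```

The synchronous version of fetch_html would look the same without the async/await keywords, which is what makes moving between the two clients fairly mechanical.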

In other words, Playwright is a horizontal improvement over Selenium and Puppeteer, though every toolkit has its own strengths. If you'd like to learn more, see our other introduction articles on these tools.

Setup

Playwright for Python can be installed through pip:

# install playwright package:
$ pip install playwright
# install playwright chromium and firefox browsers
$ playwright install chromium firefox

The above commands install the playwright package and the browser binaries. For Playwright scraping, it's best to use either Chrome (Chromium) or Firefox, as these are the most stable implementations and are often the least likely to be blocked.

Tip: Playwright in REPL

The easiest way to understand Playwright is to experiment with it in real-time through a Python REPL (Read-Eval-Print Loop) like ipython.

Starting ipython we can launch a playwright browser and execute browser automation commands in real-time to experiment and prototype our web scraper:

$ pip install ipython nest_asyncio
$ ipython
import nest_asyncio; nest_asyncio.apply()  # This is needed to use sync API in repl
from playwright.sync_api import sync_playwright
pw = sync_playwright().start()  # sync_playwright must be called before .start()
chrome = pw.chromium.launch(headless=False)
page = chrome.new_page()
page.goto("https://twitch.tv")

Here's a sneak peek of what we'll be doing in this article through the eyes of REPL:

Playwright through iPython REPL

Now, let's take a look at this in greater detail.

The Basics

To start, we need to launch a browser and start a new browser tab:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    # create browser instance
    browser = pw.chromium.launch(
        # we can choose either a Headful (With GUI) or Headless mode:
        headless=False,
    )
    # create context
    # using context we can define page properties like viewport dimensions
    context = browser.new_context(
        # most common desktop viewport is 1920x1080
        viewport={"width": 1920, "height": 1080}
    )
    # create page aka browser tab which we'll be using to do everything
    page = context.new_page()

Once we have our browser page ready, we can start Playwright web scraping for which we need only a handful of Playwright features:

  • Navigation (i.e. go to URL)
  • Button clicking
  • Text input
  • Javascript Execution
  • Waiting for content to load

Let's take a look at these features through a real-life example. For this, we'll be scraping video data from the twitch.tv Art section, where users stream their art creation process. We'll be collecting dynamic data like stream name, viewer count and author's details.

Our task in Playwright for this exercise is:

  1. Start a browser instance, context and browser tab (page)
  2. Go to twitch.tv/directory/game/Art
  3. Wait for the page to fully load
  4. Parse loaded page data for all active streams

To navigate we can use page.goto() function which will direct the browser to any URL:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()

    # go to url
    page.goto("https://twitch.tv/directory/game/Art")
    # get HTML
    print(page.content())

However, for javascript-heavy websites like twitch.tv our page.content() code might return data prematurely before everything is loaded.
To ensure that doesn't happen we can wait for a particular element to appear on the page. In other words, if the list of videos is present on the page then we can safely assume the page has loaded:

page.goto("https://twitch.tv/directory/game/Art")
# wait for first result to appear
page.wait_for_selector("div[data-target=directory-first-item]")
# retrieve final HTML content
print(page.content())

Above, we used page.wait_for_selector() function to wait for an element defined by our CSS selector to appear on the page.

Parsing Data

Since Playwright uses a real web browser with a JavaScript environment, we can use the browser's HTML parsing capabilities. In Playwright this is implemented through the locators feature:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()

    page.goto("https://twitch.tv/directory/game/Art")  # go to url
    page.wait_for_selector("div[data-target=directory-first-item]")  # wait for content to load

    parsed = []
    stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
    for box in stream_boxes.element_handles():
        parsed.append({
            "title": box.query_selector("h3").inner_text(),
            "url": box.query_selector(".tw-link").get_attribute("href"),
            "username": box.query_selector(".tw-link").inner_text(),
            "viewers": box.query_selector(".tw-media-card-stat").inner_text(),
            # tags are not always present:
            "tags": box.query_selector(".tw-tag").inner_text() if box.query_selector(".tw-tag") else None,
        })
    for video in parsed:
        print(video)

Example Output:
{"title": "art", "url": "/lunyatic/videos", "username": "Lunyatic", "viewers": "25 viewers", "tags": "en"}
{"title": "생존신고", "url": "/lilllly1/videos", "username": "생존신고\n\n릴리작가 (lilllly1)", "viewers": "51 viewers", "tags": "한국어"}
{"title": "The day 0914.", "url": "/niai_serie/videos", "username": "The day 0914", "viewers": "187 viewers", "tags": None}
...

In the code above, we selected each result box using XPath selectors and extracted details from within it using CSS selectors.

Unfortunately, Playwright's parsing capabilities are a bit clunky and can break easily when parsing optional elements like the tags field in our example. Instead, we can use traditional Python parsing through either the parsel or beautifulsoup packages, which perform much faster and provide a more robust API:

...
# using Parsel:
from parsel import Selector

page_html = page.content()

sel = Selector(text=page_html)
parsed = []
for item in sel.xpath("//div[contains(@class,'tw-tower')]/div[@data-target]"):
    parsed.append({
        'title': item.css('h3::text').get(),
        'url': item.css('.tw-link::attr(href)').get(),
        'username': item.css('.tw-link::text').get(),
        'tags': item.css('.tw-tag ::text').getall(),
        'viewers': ''.join(item.css('.tw-media-card-stat::text').re(r'(\d+)')),
    })

# using Beautifulsoup:
from bs4 import BeautifulSoup

# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(page.content(), "html.parser")
parsed = []
for item in soup.select(".tw-tower > div[data-target]"):
    parsed.append({
        'title': item.select_one('h3').text,
        # ::attr() is a parsel extension; in bs4 we read the attribute from the element
        'url': item.select_one('.tw-link').attrs.get("href"),
        'username': item.select_one('.tw-link').text,
        'tags': [tag.text for tag in item.select('.tw-tag')],
        'viewers': item.select_one('.tw-media-card-stat').text,
    })
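The parsed fields above are still raw display values: the URL is relative and the viewer count is text. A small post-processing step could look like the sketch below. The helper names, the base URL and the K/M suffix handling are assumptions made for illustration, not something the page guarantees:

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.twitch.tv"  # assumption: hrefs are relative to the site root
_SUFFIXES = {"": 1, "K": 1_000, "M": 1_000_000}


def parse_viewers(text):
    """Turn '25 viewers' or '1.2K viewers' into an int; None if no number is found."""
    match = re.search(r"([\d,]*\d(?:\.\d+)?)([KkMm]?)\b", text or "")
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))  # drop thousands separators
    return int(round(number * _SUFFIXES[match.group(2).upper()]))


def clean_stream(raw):
    """Return a copy of one parsed result with an absolute URL and numeric viewers."""
    cleaned = dict(raw)
    if raw.get("url"):
        cleaned["url"] = urljoin(BASE_URL, raw["url"])
    cleaned["viewers"] = parse_viewers(raw.get("viewers"))
    return cleaned


print(clean_stream({"title": "art", "url": "/lunyatic/videos", "viewers": "25 viewers"}))
```

Keeping this cleanup separate from the selectors means the same function works regardless of whether the raw values came from Playwright locators, parsel or beautifulsoup.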

While Playwright locators aren't great for parsing, they are great for interacting with the website. Next, let's take a look at how we can click buttons and input text using locators.

Clicking Buttons and Text Input

To explore click and text input let's extend our twitch.tv scraper with search functionality:

  1. We'll go to twitch.tv
  2. Select the search box and input a search query
  3. Click the search button or press Enter
  4. Wait for the content to load
  5. Parse results

In Playwright, to interact with web components we can use the same locator functionality we used in parsing:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()

    page.goto("https://www.twitch.tv/directory/game/Art")
    # find search box and enter our query:
    search_box = page.locator('input[autocomplete="twitch-nav-search"]')
    search_box.type("Painting", delay=100)
    # then, we can either send Enter key:
    search_box.press("Enter")
    # or we can press the search button explicitly (use one or the other, not both):
    # search_button = page.locator('button[aria-label="Search Button"]')
    # search_button.click()
    # click on tagged channels link:
    page.locator('.search-results .tw-link[href*="all/tags"]').click()

    # Finally, we can parse results like we did before:
    parsed = []
    stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
    stream_boxes.first.wait_for()  # wait for the first result to render
    for box in stream_boxes.element_handles():
        ...

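Note: Playwright's locator actions such as click() are strict: if the selector matches more than one element, Playwright raises an error because it wouldn't know which one to click. Our selectors must therefore be unique to the one element we want to interact with, or be narrowed down explicitly with .first, .last or .nth().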
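When a selector cannot easily be made unique, one option is to narrow the locator down before acting on it. The sketch below assumes a Playwright page object is passed in; click_first_match is a hypothetical helper name, while locator(), count(), first and click() are regular Locator API members:

```python
def click_first_match(page, selector):
    """Click the first element matching `selector` and return the number of matches."""
    matches = page.locator(selector)
    total = matches.count()  # counting is safe even when several elements match
    if total == 0:
        raise LookupError(f"no element matches {selector!r}")
    # .first narrows the locator to a single element, so click() is unambiguous;
    # matches.nth(1) or matches.last would pick other elements the same way
    matches.first.click()
    return total
```

Returning the match count makes it easy to log when a selector has become ambiguous after a site redesign.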

We got search functionality working and extracted the first page of the results, though how do we get the rest of the pages? For this we'll need scrolling functionality - let's take a look at it.

Scrolling and Infinite Pagination

The stream results section of twitch.tv is using infinite scrolling pagination. To retrieve the rest of the results in our Playwright scraper we need to continuously scroll to the last result visible on the page to trigger new page loads.

We could do this by scrolling to the bottom of the entire page but that doesn't always work in headless browsers. A better way is to find all elements and scroll the last one into view explicitly.

In playwright, this can be done by using locators and scroll_into_view_if_needed() function. We'll keep scrolling the last result into view to trigger the next page loading until no more new results appear:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    page.goto("https://www.twitch.tv/directory/game/Art")
    # wait for content to fully load:
    page.wait_for_selector("div[data-target=directory-first-item]")

    # loop scrolling last element into view until no more new elements are created
    stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
    while True:
        items_on_page = stream_boxes.count()  # count before scrolling
        stream_boxes.last.scroll_into_view_if_needed()
        page.wait_for_timeout(2_000)  # give some time for new items to load
        items_on_page_after_scroll = stream_boxes.count()
        if items_on_page_after_scroll > items_on_page:
            continue  # more items loaded - keep scrolling
        else:
            break  # no more items - break scrolling loop
    # parse data:
    parsed = []
    for box in stream_boxes.element_handles():
        ...

In the example code above, we will continuously trigger new result loading until the pagination end is reached. In this case, our code should generate hundreds of parsed results.
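On a busy directory the loop above could run for a long time, so in practice it may be worth capping the number of scroll rounds. The following sketch separates the loop logic from Playwright by taking three callables; the function name and the max_rounds default are illustrative choices, not part of Playwright:

```python
def scroll_until_stable(count_items, scroll_to_last, wait, max_rounds=50):
    """Scroll until the item count stops growing or max_rounds is reached.

    Returns the number of items seen when scrolling stopped.
    """
    seen = count_items()
    for _ in range(max_rounds):
        scroll_to_last()
        wait()  # give the page time to load the next batch
        now = count_items()
        if now <= seen:
            break  # nothing new appeared, so pagination has ended
        seen = now
    return seen


# usage with a Playwright page (needs a live browser, so it is left commented out):
# boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
# total = scroll_until_stable(
#     count_items=boxes.count,
#     scroll_to_last=boxes.last.scroll_into_view_if_needed,
#     wait=lambda: page.wait_for_timeout(2_000),
# )
```

Passing callables rather than the page itself also makes the loop easy to exercise without a browser.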

Advanced Functions

We've covered the most common Playwright features used in web scraping: navigation, waiting, clicking, typing and scrolling. However, there are a few advanced features that come in handy when scraping more complex targets.

Evaluating Javascript

Playwright can evaluate any JavaScript code in the context of the current page. Using JavaScript we can do everything we did before, like navigating, clicking and scrolling, and even more! In fact, many of these Playwright functions are implemented through JavaScript evaluation.

For example, if the built-in scrolling is failing us we can define our own scrolling javascript function and submit it to Playwright:

page.evaluate("""
let items=document.querySelectorAll('.tw-tower>div');
items[items.length-1].scrollIntoView({behavior: "smooth", block: "end", inline: "end"});
""")

The above code will scroll the last result into view just like previously but it'll scroll smoothly and to the very edge of the object. This approach is more likely to trigger next page loading compared to Playwright's scroll_into_view_if_needed function.

Javascript evaluation is a powerful feature that can be used to scrape complex web apps as it gives us full control of the browser's capabilities through javascript.
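page.evaluate() also returns the value of the evaluated expression to Python, so it can be used to pull data out of the page directly. A minimal sketch, assuming a Playwright page is passed in; the collect_texts helper and the '.tw-tower h3' selector in the usage line are illustrative:

```python
# arrow function evaluated in the browser; the selector is passed in as an argument
JS_COLLECT_TEXTS = (
    "(selector) => Array.from(document.querySelectorAll(selector))"
    ".map(el => el.innerText.trim())"
)


def collect_texts(page, selector):
    """Return the non-empty text of every element matching `selector`."""
    texts = page.evaluate(JS_COLLECT_TEXTS, selector)
    return [text for text in texts if text]


# usage (needs a live page, so it is left commented out):
# titles = collect_texts(page, ".tw-tower h3")
```

The return value has to be serializable (strings, numbers, lists, dicts), which is why the JavaScript maps elements to their text instead of returning the DOM nodes themselves.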

Request and Response Intercepting

Playwright tracks all of the background requests and responses the browser sends and receives. In web scraping, we can use this to modify background requests or collect secret data from background responses:

from playwright.sync_api import sync_playwright

def intercept_route(route):
    # request objects are read-only, so changes are passed to route.continue_()
    request = route.request
    # we can update requests with custom headers
    if "secret" in request.url:
        print("patched headers of a secret request")
        return route.continue_(headers={**request.headers, "x-secret-token": "123"})
    # or adjust sent data
    if request.method == "POST":
        print("patched POST request")
        return route.continue_(post_data="patched")
    return route.continue_()

def intercept_response(response):
    # we can extract details from background responses
    if response.request.resource_type in ("xhr", "fetch"):
        print(response.headers.get("set-cookie"))

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    # route every request through our handler so it can be modified
    page.route("**/*", intercept_route)
    # listen to responses (read-only)
    page.on("response", intercept_response)

    page.goto("https://www.twitch.tv/directory/game/Art")
    page.wait_for_selector("div[data-target=directory-first-item]")

In the example above, we define a route handler that can modify outgoing requests and a response listener that inspects incoming responses, and attach both to our Playwright page. This allows us to inspect and modify every background and foreground request the browser makes.
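Response listeners are also a convenient way to capture the JSON payloads a page loads in the background. The sketch below builds such a listener; make_json_collector is a hypothetical helper, and the "gql" URL fragment in the usage lines is only an assumed example of what to filter on:

```python
def make_json_collector(url_part):
    """Build a response handler that stores JSON bodies of matching background responses."""
    captured = []

    def on_response(response):
        # only look at background data requests, not documents, images, etc.
        if response.request.resource_type not in ("xhr", "fetch"):
            return
        if url_part not in response.url:
            return
        try:
            captured.append(response.json())
        except Exception:
            # body was not valid JSON or is no longer available; skip it
            pass

    return on_response, captured


# usage with a Playwright page (needs a live browser, so it is left commented out):
# handler, captured = make_json_collector("gql")
# page.on("response", handler)
# page.goto("https://www.twitch.tv/directory/game/Art")
# print(len(captured), "JSON responses captured")
```

Data captured this way is often already structured, which can save the HTML parsing step entirely.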

Blocking Resources

Web scraping using headless browsers is really bandwidth intensive. The browser is downloading all of the images, fonts and other expensive resources our web scraper doesn't care about. To optimize this we can configure our Playwright instance to block these unnecessary resources:

from playwright.sync_api import sync_playwright

# block pages by resource type. e.g. image, stylesheet
BLOCK_RESOURCE_TYPES = [
  'beacon',
  'csp_report',
  'font',
  'image',
  'imageset',
  'media',
  'object',
  'texttrack',
#  we can even block stylesheets and scripts though it's not recommended:
# 'stylesheet',
# 'script',  
# 'xhr',
]


# we can also block popular 3rd party resources like tracking:
BLOCK_RESOURCE_NAMES = [
  'adzerk',
  'analytics',
  'cdn.api.twitter',
  'doubleclick',
  'exelator',
  'facebook',
  'fontawesome',
  'google',
  'google-analytics',
  'googletagmanager',
]

def intercept_route(route):
    """intercept all requests and abort blocked ones"""
    if route.request.resource_type in BLOCK_RESOURCE_TYPES:
        print(f'blocking background resource {route.request} blocked type "{route.request.resource_type}"')
        return route.abort()
    if any(key in route.request.url for key in BLOCK_RESOURCE_NAMES):
        print(f"blocking background resource {route.request} blocked name {route.request.url}")
        return route.abort()
    return route.continue_()

with sync_playwright() as pw:
    browser = pw.chromium.launch(
        headless=False, 
        # enable devtools so we can see total resource usage:
        devtools=True, 
    )
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    # enable intercepting for this page, **/* stands for all requests
    page.route("**/*", intercept_route)
    page.goto("https://www.twitch.tv/directory/game/Art")
    page.wait_for_selector("div[data-target=directory-first-item]")

In the example above, we are defining an interception rule which tells Playwright to drop any unwanted background resource requests that are either of ignored type or contain ignored phrases in the URL (like google analytics).
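To get a rough picture of what is being blocked without opening Devtools, the route handler can tally its decisions. This is a sketch with shortened block lists; the block_reason helper and the blocked counter are names invented for this example:

```python
from collections import Counter

# shortened versions of the block lists, for illustration only
BLOCKED_TYPES = {"image", "font", "media"}
BLOCKED_NAMES = ("google-analytics", "doubleclick")

blocked = Counter()  # e.g. {"type:image": 12, "name:doubleclick": 3}


def block_reason(resource_type, url):
    """Return a short label explaining why a request is blocked, or None to allow it."""
    if resource_type in BLOCKED_TYPES:
        return f"type:{resource_type}"
    for name in BLOCKED_NAMES:
        if name in url:
            return f"name:{name}"
    return None


def intercept_route(route):
    """Abort blocked requests and count them by reason."""
    reason = block_reason(route.request.resource_type, route.request.url)
    if reason:
        blocked[reason] += 1
        return route.abort()
    return route.continue_()


# usage: page.route("**/*", intercept_route), then print(blocked.most_common())
```

Keeping the decision in a plain function also makes the block rules easy to check without launching a browser.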

We can see the amount of data saved in Devtools' Network tab:

screencap of devtools network tab comparing blocked and unblocked bandwidth usage
With blocking we used almost 4 times less traffic!

Avoiding Blocking

While Playwright is using a real browser it's still possible to determine whether it's controlled by a real user or automated by an automation toolkit. For more on this see our extensive article covering javascript fingerprinting and variable leaking:

How Javascript is Used to Block Web Scrapers? In-Depth Guide

Introduction to javascript fingerprinting and how to fortify automated web browsers against it.

ScrapFly's Alternative

Playwright is a powerful web scraping tool. However, it can be difficult to scale up and handle in some web scraping scenarios, and this is where ScrapFly can be of assistance!

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

Using ScrapFly SDK we can replicate the same actions we did in Playwright:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")

# We can use a browser to render the page, screenshot it and return final HTML
result = client.scrape(ScrapeConfig(
    "https://www.twitch.tv/directory/game/Art",
    # enable browser rendering
    render_js=True,
    # we can wait for specific part to load just like with Playwright:
    wait_for_selector="div[data-target=directory-first-item]",
    # we can capture screenshots
    screenshots={"everything": "fullpage"},
    # for targets that block scrapers we can enable block bypass:
    asp=True
))

# It's also possible to execute complex javascript scenarios like button clicking
# and text typing:
result = client.scrape(ScrapeConfig(
    "https://www.twitch.tv/directory/game/Art",
    # enable browser rendering
    render_js=True,
    wait_for_selector="div[data-target=directory-first-item]",
    js_scenario=[
        # wait to load
        {"wait_for_selector": {"selector": 'input[autocomplete="twitch-nav-search"]'}},
        # input search
        {"fill": {"value": "watercolor", "selector": 'input[autocomplete="twitch-nav-search"]'}},
        # click search button
        {"click": {"selector": 'button[aria-label="Search Button"]'}},
        # wait explicit amount of time
        {"wait_for_navigation": {"timeout": 2000}}
    ]
))

Just like with Playwright we can control a web browser to navigate the website, click buttons, input text and return the final rendered HTML to us for parsing.

FAQ

To wrap this introduction up let's take a look at some frequently asked questions regarding web scraping with Playwright:

How to use a proxy with Playwright?

We can assign a proxy IP address on a per-browser basis:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(
        headless=False,
        # direct proxy server
        proxy={"server": "11.11.11.1:9000"},
        # or with username/password (use instead of the line above,
        # as a keyword argument can only be passed once):
        # proxy={"server": "11.11.11.1:9000", "username": "A", "password": "B"},
    )
    page = browser.new_page()
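Proxy providers often hand out a single URL with credentials embedded. A small helper can split it into the server, username and password keys used above. This is a sketch: proxy_settings is a hypothetical name and the example URL is made up:

```python
from urllib.parse import unquote, urlsplit


def proxy_settings(proxy_url):
    """Split 'scheme://user:pass@host:port' into a dict for the proxy= launch option."""
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    settings = {"server": server}
    # credentials are optional and may be percent-encoded in the URL
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return settings


print(proxy_settings("http://A:B@11.11.11.1:9000"))
# usage: pw.chromium.launch(proxy=proxy_settings("http://A:B@11.11.11.1:9000"))
```

Keeping credentials out of the server string avoids leaking them into logs that print the proxy address.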

How to speed up a Playwright Scraper?

We can greatly speed up scrapers using Playwright by ensuring that the headless browser is blocking the rendering of images and media. This can greatly reduce bandwidth and speed up scraping 2-5 times! For more see the Blocking Resources section.

Which headless browser is best to use for Playwright Scraping?

Headless chrome performs the best when it comes to scraping with Playwright. Though, Firefox can often help with avoiding blocking and captchas as it's a less popular browser. For more see: How Javascript is Used to Block Web Scrapers? In-Depth Guide

Summary

In this in-depth introduction, we learned how we can use the Playwright web browser automation toolkit for web scraping. We explored core features such as navigation, button clicking, input typing and data parsing through a real-life twitch.tv scraper example.

We've also taken a look at more advanced features like resource blocking which can reduce bandwidth use by our browser-powered webscrapers significantly. The same feature can also be used to intercept browser background requests to extract details like cookies or modify connections.
