Playwright is a popular browser automation toolkit that can be used in web scraping to scrape dynamic web content or web apps.
Using Playwright, we don't need to reverse engineer and understand complex web technologies, as the browser does all of that work for us. This makes Playwright a great tool for easily scraping data without advanced web development knowledge.
In this in-depth practical tutorial, we'll take a look at how to scrape with Playwright and Python. For that, we'll work through an example project: scraping twitch.tv.
We'll cover common questions like how Playwright works and how it compares to its competitors, how to execute common tasks like browser navigation, button clicking, text input and data parsing, as well as some advanced tasks like javascript evaluation and resource interception and blocking.
What is Playwright?
Playwright is a cross-platform and cross-language web browser automation toolkit. It's primarily intended to be used as a website test suite but it's perfectly capable of general browser automation and web scraping.
Using Playwright we can automate headless web browsers like Firefox or Chrome to navigate the web just like a human being would: go to URLs, click buttons, write text and execute javascript.
It's a great tool for web scraping as it allows us to scrape dynamic javascript-powered websites without the need to reverse engineer their behavior. It can also help with avoiding blocking, as the scraper runs a full browser which appears more human than standalone HTTP requests.
Playwright vs Selenium vs Puppeteer
Compared to other popular browser automation toolkits like Selenium or Puppeteer, Playwright has a few advantages:
Playwright supports many programming languages whereas Puppeteer is only available in Javascript.
Playwright uses the Chrome Devtools Protocol (CDP) and a more modern API, whereas Selenium uses the WebDriver protocol and a less modern API.
Playwright supports both asynchronous and synchronous clients, whereas Selenium only supports a synchronous client and Puppeteer an asynchronous one. In Playwright, we can write small scrapers using synchronous clients and scale up simply by switching to a more complex asynchronous architecture.
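For example, here's a minimal sketch of what a scraper looks like with Playwright's asynchronous client - the same methods we'll use throughout this article, just awaited:

import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        # navigate and retrieve the page title:
        await page.goto("https://twitch.tv")
        print(await page.title())
        await browser.close()

asyncio.run(run())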
In other words, Playwright is a horizontal improvement over Selenium and Puppeteer, though every toolkit has its own strengths. If you'd like to learn more, see our other introduction articles.
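Setup
Playwright for Python can be installed through pip. Note that the browser binaries are not bundled with the package and need to be installed separately through Playwright's own CLI:

$ pip install playwright
$ playwright install chromium firefox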
The above commands install the playwright package and the Playwright browser binaries. For Playwright scraping, it's best to use either the Chrome or Firefox browsers as these are the most stable implementations and are often the least likely to be blocked.
Tip: Playwright in REPL
The easiest way to understand Playwright is to experiment with it in real-time through a Python REPL (Read, Evaluate, Print, Loop) like ipython.
Starting ipython we can launch a playwright browser and execute browser automation commands in real-time to experiment and prototype our web scraper:
$ pip install ipython nest_asyncio
$ ipython
import nest_asyncio; nest_asyncio.apply() # This is needed to use sync API in repl
from playwright.sync_api import sync_playwright
pw = sync_playwright().start()
chrome = pw.chromium.launch(headless=False)
page = chrome.new_page()
page.goto("https://twitch.tv")
Now, let's take a look at these Playwright features in greater detail.
The Basics
To start, we need to launch a browser and start a new browser tab:
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    # create browser instance
    browser = pw.chromium.launch(
        # we can choose either a Headful (With GUI) or Headless mode:
        headless=False,
    )
    # create context
    # using context we can define page properties like viewport dimensions
    context = browser.new_context(
        # most common desktop viewport is 1920x1080
        viewport={"width": 1920, "height": 1080}
    )
    # create page aka browser tab which we'll be using to do everything
    page = context.new_page()
Once we have our browser page ready, we can start web scraping, for which we need only a handful of Playwright features:
Navigation (i.e. go to URL)
Button clicking
Text input
Javascript Execution
Waiting for content to load
Let's take a look at these features through a real-life example.
For this, we'll be scraping video data from the twitch.tv Art section where users stream their art creation process. We'll be collecting dynamic data like the stream name, viewer count and author details.
Our task in Playwright for this exercise is to:
Start a browser instance, context and browser tab (page)
Go to the twitch.tv Art directory page
Wait for the page content to load
Parse the loaded stream data
To navigate, we can use the page.goto() function which will direct the browser to any URL:
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    # go to url
    page.goto("https://twitch.tv/directory/game/Art")
    # get HTML
    print(page.content())
However, for javascript-heavy websites like twitch.tv our page.content() code might return data prematurely before everything is loaded.
To ensure that doesn't happen we can wait for a particular element to appear on the page. In other words, if the list of videos is present on the page then we can safely assume the page has loaded:
page.goto("https://twitch.tv/directory/game/Art")
# wait for first result to appear
page.wait_for_selector("div[data-target=directory-first-item]")
# retrieve final HTML content
print(page.content())
Above, we used page.wait_for_selector() function to wait for an element defined by our CSS selector to appear on the page.
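Alternatively, we can wait for a general page load state. This is less precise than waiting for a specific element but can be a handy fallback when there's no obvious selector to wait for:

page.goto("https://twitch.tv/directory/game/Art")
# wait until there have been no network connections for at least 500ms:
page.wait_for_load_state("networkidle")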
Parsing Data
Since Playwright uses a real web browser with a javascript environment, we can use the browser's own HTML parsing capabilities. In Playwright this is implemented through the locators feature:
from playwright.sync_api import sync_playwright
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
page.goto("https://twitch.tv/directory/game/Art") # go to url
page.wait_for_selector("div[data-target=directory-first-item]") # wait for content to load
parsed = []
stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
for box in stream_boxes.element_handles():
parsed.append({
"title": box.query_selector("h3").inner_text(),
"url": box.query_selector(".tw-link").get_attribute("href"),
"username": box.query_selector(".tw-link").inner_text(),
"viewers": box.query_selector(".tw-media-card-stat").inner_text(),
# tags are not always present:
"tags": box.query_selector(".tw-tag").inner_text() if box.query_selector(".tw-tag") else None,
})
for video in parsed:
print(video)
Example Output
{"title": "art", "url": "/lunyatic/videos", "username": "Lunyatic", "viewers": "25 viewers", "tags": "en"}
{"title": "생존신고", "url": "/lilllly1/videos", "username": "생존신고\n\n릴리작가 (lilllly1)", "viewers": "51 viewers", "tags": "한국어"}
{"title": "The day 0914.", "url": "/niai_serie/videos", "username": "The day 0914", "viewers": "187 viewers", "tags": None}
...
In the code above, we selected each result box using XPath selectors and extracted details from within it using CSS selectors.
Unfortunately, Playwright's parsing capabilities are a bit clunky and can break easily when parsing optional elements like the tags field in our example. Instead, we can use traditional Python parsing through either the parsel or beautifulsoup packages, which perform much faster and provide a more robust API.
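Both libraries can be installed through pip:

$ pip install parsel beautifulsoup4

Here's the same parsing logic implemented with each of them: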
...
# using Parsel:
from parsel import Selector
page_html = page.content()
sel = Selector(text=page_html)
parsed = []
for item in sel.xpath("//div[contains(@class,'tw-tower')]/div[@data-target]"):
parsed.append({
'title': item.css('h3::text').get(),
'url': item.css('.tw-link::attr(href)').get(),
'username': item.css('.tw-link::text').get(),
'tags': item.css('.tw-tag ::text').getall(),
'viewers': ''.join(item.css('.tw-media-card-stat::text').re(r'(\d+)')),
})
# using Beautifulsoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content(), "html.parser")
parsed = []
for item in soup.select(".tw-tower div[data-target]"):
    parsed.append({
        'title': item.select_one('h3').text,
        'url': item.select_one('.tw-link').attrs.get("href"),
        'username': item.select_one('.tw-link').text,
        'tags': [tag.text for tag in item.select('.tw-tag')],
        'viewers': item.select_one('.tw-media-card-stat').text,
    })
While Playwright locators aren't great for parsing, they are great for interacting with the website. Next, let's take a look at how we can click buttons and input text using locators.
Clicking Buttons and Text Input
To explore click and text input let's extend our twitch.tv scraper with search functionality:
We'll go to twitch.tv
Select the search box and input a search query
Click the search button or press Enter
Wait for the content to load
Parse results
In Playwright, to interact with web components we can use the same locator functionality we used for parsing:
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
page.goto("https://www.twitch.tv/directory/game/Art")
# find search box and enter our query:
search_box = page.locator('input[autocomplete="twitch-nav-search"]')
search_box.type("Painting", delay=100)
# then, we can either send Enter key:
search_box.press("Enter")
# or we can press the search button explicitly:
search_button = page.locator('button[aria-label="Search Button"]')
search_button.click()
# click on tagged channels link:
page.locator('.search-results .tw-link[href*="all/tags"]').click()
# Finally, we can parse results like we did before:
parsed = []
stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
for box in stream_boxes.element_handles():
...
Note: when performing an action like click(), Playwright's locator must resolve to a single element - it wouldn't know which one to click otherwise. This means our selectors must uniquely identify the one element we want to interact with.
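If a selector does match multiple elements, the locator can be narrowed down with helpers like first, last or nth(). A quick sketch using a hypothetical selector:

buttons = page.locator("button.follow-button")  # hypothetical selector matching many elements
buttons.first.click()   # interact with the first match
buttons.nth(2).click()  # or the third match (0-indexed)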
We got search functionality working and extracted the first page of the results, though how do we get the rest of the pages? For this we'll need scrolling functionality - let's take a look at it.
Scrolling and Infinite Pagination
The stream results section of twitch.tv uses infinite scrolling pagination. To retrieve the rest of the results in our Playwright scraper, we need to continuously scroll the last result visible on the page into view to trigger new page loads.
We could do this by scrolling to the bottom of the entire page but that doesn't always work in headless browsers. A better way is to find all result elements and explicitly scroll the last one into view.
In Playwright, this can be done using locators and the scroll_into_view_if_needed() function. We'll keep scrolling the last result into view to trigger the next page load until no more new results appear:
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
page.goto("https://www.twitch.tv/directory/game/Art")
# wait for content to fully load:
page.wait_for_selector("div[data-target=directory-first-item]")
# loop scrolling last element into view until no more new elements are created
stream_boxes = None
while True:
stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
stream_boxes.element_handles()[-1].scroll_into_view_if_needed()
items_on_page = len(stream_boxes.element_handles())
page.wait_for_timeout(2_000) # give some time for new items to load
items_on_page_after_scroll = len(stream_boxes.element_handles())
if items_on_page_after_scroll > items_on_page:
continue # more items loaded - keep scrolling
else:
break # no more items - break scrolling loop
# parse data:
parsed = []
for box in stream_boxes.element_handles():
...
In the example code above, we will continuously trigger new result loading until the pagination end is reached. In this case, our code should generate hundreds of parsed results.
Advanced Functions
We've covered the most common Playwright features used in web scraping: navigation, waiting, clicking, typing and scrolling. However, there are a few advanced features that come in handy when scraping more complex targets.
Evaluating Javascript
Playwright can evaluate any javascript code in the context of the current page. Using javascript we can do everything we did before - navigating, clicking and scrolling - and even more. In fact, many of these Playwright functions are implemented through javascript evaluation.
For example, if the built-in scrolling is failing us we can define our own scrolling javascript function and submit it to Playwright:
page.evaluate("""
let items=document.querySelectorAll('.tw-tower>div');
items[items.length-1].scrollIntoView({behavior: "smooth", block: "end", inline: "end"});
""")
The above code will scroll the last result into view just like previously but it'll scroll smoothly and to the very edge of the object. This approach is more likely to trigger next page loading compared to Playwright's scroll_into_view_if_needed function.
Javascript evaluation is a powerful feature that can be used to scrape complex web apps as it gives us full control of the browser's capabilities through javascript.
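page.evaluate() also returns the result of the evaluated expression, so we can use it to pull values straight out of the browser's javascript context. A small sketch:

# evaluate an expression and return its result to Python:
title = page.evaluate("document.title")
print(title)
# functions work too and can return serializable objects:
viewport = page.evaluate("() => ({width: window.innerWidth, height: window.innerHeight})")
print(viewport)  # e.g. {'width': 1920, 'height': 1080}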
Request and Response Intercepting
Playwright tracks all of the background requests and responses the browser sends and receives. In web scraping, we can listen to this traffic to collect data from background requests and responses. Note that requests and responses observed through these event listeners are read-only - to modify requests before they are sent we need route interception, covered further below:
from playwright.sync_api import sync_playwright
def intercept_request(request):
    # requests observed through event listeners are read-only,
    # but we can inspect and log them:
    if "secret" in request.url:
        print("saw a secret request:", request.url)
    if request.method == "POST":
        print("saw a POST request with data:", request.post_data)
    return request

def intercept_response(response):
    # we can extract details from background requests, e.g. XHR calls:
    if response.request.resource_type == "xhr":
        print(response.headers.get("set-cookie"))
    return response
with sync_playwright() as pw:
browser = pw.chromium.launch(headless=False)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
# enable intercepting for this page
page.on("request", intercept_request)
page.on("response", intercept_response)
page.goto("https://www.twitch.tv/directory/game/Art")
page.wait_for_selector("div[data-target=directory-first-item]")
In the example above, we define our interceptor functions and attach them to our Playwright page. This allows us to inspect every background and foreground request and response the browser makes.
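To actually modify requests before they are sent we can use route interception instead, which allows overriding headers, method, post data or the URL. A minimal sketch that attaches a custom header to requests matching a hypothetical "secret" URL pattern:

def modify_request(route):
    # copy the original headers and add our custom one:
    headers = {**route.request.headers, "x-secret-token": "123"}
    route.continue_(headers=headers)

# apply the handler only to URLs containing "secret":
page.route("**/*secret*", modify_request)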
Blocking Resources
Web scraping using headless browsers is really bandwidth intensive. The browser is downloading all of the images, fonts and other expensive resources our web scraper doesn't care about. To optimize this we can configure our Playwright instance to block these unnecessary resources:
from collections import Counter
from playwright.sync_api import sync_playwright
# block requests by resource type, e.g. image, stylesheet
BLOCK_RESOURCE_TYPES = [
'beacon',
'csp_report',
'font',
'image',
'imageset',
'media',
'object',
'texttrack',
    # we can even block stylesheets and scripts though it's not recommended:
# 'stylesheet',
# 'script',
# 'xhr',
]
# we can also block popular 3rd party resources like tracking:
BLOCK_RESOURCE_NAMES = [
'adzerk',
'analytics',
'cdn.api.twitter',
'doubleclick',
'exelator',
'facebook',
'fontawesome',
'google',
'google-analytics',
'googletagmanager',
]
def intercept_route(route):
"""intercept all requests and abort blocked ones"""
if route.request.resource_type in BLOCK_RESOURCE_TYPES:
print(f'blocking background resource {route.request} blocked type "{route.request.resource_type}"')
return route.abort()
if any(key in route.request.url for key in BLOCK_RESOURCE_NAMES):
print(f"blocking background resource {route.request} blocked name {route.request.url}")
return route.abort()
return route.continue_()
with sync_playwright() as pw:
browser = pw.chromium.launch(
headless=False,
# enable devtools so we can see total resource usage:
devtools=True,
)
context = browser.new_context(viewport={"width": 1920, "height": 1080})
page = context.new_page()
# enable intercepting for this page, **/* stands for all requests
page.route("**/*", intercept_route)
page.goto("https://www.twitch.tv/directory/game/Art")
page.wait_for_selector("div[data-target=directory-first-item]")
In the example above, we are defining an interception rule which tells Playwright to drop any unwanted background resource requests that are either of ignored type or contain ignored phrases in the URL (like google analytics).
We can see the amount of data saved in the browser Devtools' Network tab.
Avoiding Blocking
While Playwright uses a real browser, it's still possible to determine whether it's controlled by a real user or automated by an automation toolkit. For more on this, see our extensive article covering javascript fingerprinting and variable leaking.
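As a simple starting point, we can at least make our browser context resemble a common real-world setup by assigning a typical user agent string, locale and viewport. A sketch - the user agent value below is only an illustration:

context = browser.new_context(
    # pretend to be a common desktop Chrome setup:
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    locale="en-US",
    viewport={"width": 1920, "height": 1080},
)
page = context.new_page()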
ScrapFly's Alternative
Playwright is a powerful web scraping tool, however it can be difficult to scale up and manage in some web scraping scenarios - and this is where ScrapFly can be of assistance!
Using ScrapFly SDK we can replicate the same actions we did in Playwright:
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
# We can use a browser to render the page, screenshot it and return final HTML
result = client.scrape(ScrapeConfig(
"https://www.twitch.tv/directory/game/Art",
# enable browser rendering
render_js=True,
# we can wait for specific part to load just like with Playwright:
wait_for_selector="div[data-target=directory-first-item]",
# we can capture screenshots
screenshots={"everything": "fullpage"},
# for targets that block scrapers we can enable block bypass:
asp=True
))
# It's also possible to execute complex javascript scenarios like button clicking
# and text typing:
result = client.scrape(ScrapeConfig(
"https://www.twitch.tv/directory/game/Art",
# enable browser rendering
wait_for_selector="div[data-target=directory-first-item]",
render_js=True,
js_scenario=[
# wait to load
{"wait_for_selector": {"selector": 'input[autocomplete="twitch-nav-search"]'}},
# input search
{"fill": {"value": "watercolor", "selector": 'input[autocomplete="twitch-nav-search"]'}},
# click search button
{"click": {"selector": 'button[aria-label="Search Button"]'}},
# wait explicit amount of time
{"wait_for_navigation": {"timeout": 2000}}
]
))
Just like with Playwright we can control a web browser to navigate the website, click buttons, input text and return the final rendered HTML to us for parsing.
FAQ
To wrap this introduction up let's take a look at some frequently asked questions regarding web scraping with Playwright:
How to use a proxy with Playwright?
We can assign a proxy on a per-browser basis in Playwright:
from playwright.sync_api import sync_playwright
with sync_playwright() as pw:
browser = pw.chromium.launch(
headless=False,
# direct proxy server
proxy={"server": "11.11.11.1:9000"},
# or with username/password:
proxy={"server": "11.11.11.1:9000", "username": "A", "password": "B"},
)
page = browser.new_page()
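Proxies can also be assigned per browser context, which lets each browsing session use a different IP address within the same browser. A brief sketch:

# context-level proxy overrides the browser-level one:
context = browser.new_context(
    proxy={"server": "11.11.11.2:9000", "username": "A", "password": "B"},
)
page = context.new_page()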
How to speed up a Playwright Scraper?
We can greatly speed up scrapers using Playwright by ensuring that the headless browser is blocking the rendering of images and media. This can greatly reduce bandwidth and speed up scraping 2-5 times! For more see the Blocking Resources section.
Which headless browser is best to use for Playwright Scraping?
Headless Chrome performs the best when it comes to scraping with Playwright, though Firefox can often help with avoiding blocking and captchas as it's a less popular browser. For more see: How Javascript is Used to Block Web Scrapers? In-Depth Guide
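Switching engines in Playwright is just a matter of launching a different browser type - for example Firefox (assuming its binaries were installed via playwright install firefox):

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.firefox.launch(headless=False)
    page = browser.new_page()
    page.goto("https://twitch.tv/directory/game/Art")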
Summary
In this in-depth introduction, we learned how we can use the Playwright web browser automation toolkit for web scraping. We explored core features such as navigation, button clicking, input typing and data parsing through a real-life twitch.tv scraper example.
We've also taken a look at more advanced features like resource blocking, which can significantly reduce the bandwidth used by our browser-powered web scrapers. The same interception feature can also be used to inspect browser background requests to extract details like cookies or to modify connections.