How to Capture and Convert a Screenshot to PDF
Quick guide on how to effectively capture web screenshots as PDF documents
There are many ways to monitor web page changes, and one of the most popular techniques is screenshot tracking.
With this method, only the final visual representation is tracked. This makes it a convenient way to detect real, visible changes while ignoring updates to the underlying website code.
Screenshot tracking is used to track product page updates, real estate listing changes and other visually critical web pages, all of which can be done with automatic screenshots of websites.
In this guide we'll explore automated webpage screenshots using Python and a few of its key packages. We'll use headless browsers to capture screenshots of websites, including styles and dynamic content, and image analysis tools to find the changes. Let's dive in!
Capturing webpage screenshots programmatically with browser automation tools like Selenium or Playwright is useful in many scenarios, such as the product page and real estate listing tracking mentioned above.
Knowing how to track changes and highlight them programmatically is an essential step in any of these scenarios. That's why this guide focuses on tracking changes in a webpage by comparing two screenshots of the same website.
Let's set up our work environment with all the required packages and tools.
To capture webpage screenshots programmatically, we will be using Playwright, a web browser automation library (like Puppeteer and Selenium) with a growing web-scraping community. Playwright is available in multiple programming languages, including Python, which we will be using in this guide.
Make sure Python is installed on your device. If it isn't, you can install it from the official Python website.
Install Playwright with Python's package manager, pip, by running the following command in your terminal:
$ pip install playwright
Next, install the browser binaries of your choice. We will use Chromium:
$ playwright install chromium
# alternatively install `firefox` or `webkit` instead of `chromium`
We'll be using Playwright to capture screenshots, but to compare them we need a separate image processing tool.
To compare screenshots and track changes between them, we will be using ImageMagick through its Python binding, Wand. ImageMagick is a free, open-source software suite used for editing and manipulating digital images.
To install Wand with pip, run the following command in your terminal:
$ pip install wand
Note that because Wand is a binding for ImageMagick, ImageMagick itself must also be installed on your device.
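A quick way to check whether Wand can locate ImageMagick is to ask it for the library version. The sketch below assumes Wand's `wand.version.MAGICK_VERSION` constant; the `imagemagick_version` helper name is hypothetical and only used for illustration.

```python
def imagemagick_version():
    """Return the ImageMagick version string Wand is linked against, or None."""
    try:
        # Wand raises ImportError when it cannot find the ImageMagick library.
        # OSError is caught as well, as a precaution for loader failures.
        from wand.version import MAGICK_VERSION
    except (ImportError, OSError):
        return None
    return MAGICK_VERSION


version = imagemagick_version()
if version is None:
    print("Wand or ImageMagick is missing; install both before continuing")
else:
    print(f"Found {version}")
```

If this prints the "missing" message even though Wand is installed, the ImageMagick library itself is the likely culprit.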
To verify your Wand installation, try this basic Wand script that resizes an image:
from wand.image import Image

# Open an existing image
with Image(filename='input.jpg') as img:
    # Print the original size
    print(f'Original size: {img.size}')
    # Resize the image
    img.resize(200, 200)
    # Print the new size
    print(f'Resized size: {img.size}')
    # Save the image
    img.save(filename='output.jpg')
Besides resizing, Wand and ImageMagick support many other image operations. We'll use ImageMagick to compare captured screenshots and find the differences between them.
Next we need a tool to schedule the capturing process.
To automate the screenshot capturing process and monitor changes between screenshots, we will use the schedule library in Python. Schedule is a job scheduling library that allows us to run Python functions periodically using a friendly syntax.
To install schedule with pip, run the following command in your terminal:
$ pip install schedule
With schedule installed, we can run recurring screenshot capturing tasks from a single Python script. Use this code to test schedule:
import schedule
import time

def job():
    print("capturing screenshot")

# every 10 seconds
schedule.every(10).seconds.do(job)

# run an endless loop checking for tasks
while True:
    schedule.run_pending()
    time.sleep(1)
Now that we have all the tools required for our screenshot capturing project, we can begin. Let's start with the Playwright screenshot capturing script.
All browser automation tools support capturing webpage screenshots. In this guide, we will provide a simple example of using Playwright to take webpage screenshots.
If you're more familiar with Selenium or Puppeteer, the process is mostly the same, and our in-depth guides cover these alternatives.
Playwright runs in headless mode by default. This means that it launches a headless browser instance instead of a normal one.
A headless browser is a browser instance without visible GUI elements. This allows it to run much faster compared to its headful counterpart.
To capture a screenshot of a website using Playwright, we can use the Page.screenshot() method:
from pathlib import Path
from playwright.sync_api import sync_playwright

def get_screenshot(url, path):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        # request target web page
        page.goto(url)
        # screenshot as bytes
        image_bytes = page.screenshot()
        Path(path).write_bytes(image_bytes)

# for example:
get_screenshot("https://web-scraping.dev", "./screenshot.png")
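Above, we made a small function that launches a headless Chromium browser, opens the target URL, captures a screenshot as bytes and saves it to a file such as screenshot.png.
Next, let's use this function to take screenshots of two product variants from web-scraping.dev: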
get_screenshot("https://web-scraping.dev/product/1?variant=orange-small", "product-1.png")
and our second product variant:
get_screenshot("https://web-scraping.dev/product/1?variant=cherry-small", "product-2.png")
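Screenshots are easier to compare when they are captured under the same conditions. The sketch below is one possible variation of the capture function, not a required step: it fixes the viewport size, waits for network activity to settle and captures the full page. The helper names `url_to_filename` and `get_full_page_screenshot` are hypothetical, and the default viewport of 1280x720 is an arbitrary assumption.

```python
import re
from pathlib import Path


def url_to_filename(url: str, suffix: str = ".png") -> str:
    """Turn a URL into a filesystem-friendly file name (hypothetical helper)."""
    name = re.sub(r"^https?://", "", url)
    name = re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-")
    return name + suffix


def get_full_page_screenshot(url, directory=".", width=1280, height=720):
    """Capture a full-page screenshot with a fixed viewport and return its path."""
    # Imported lazily so the file-name helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    path = Path(directory) / url_to_filename(url)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        # A fixed viewport keeps the layout identical between runs.
        context = browser.new_context(viewport={"width": width, "height": height})
        page = context.new_page()
        # Wait until network activity settles so dynamic content has loaded.
        page.goto(url, wait_until="networkidle")
        # full_page=True captures the whole scrollable page, not only the viewport.
        page.screenshot(path=str(path), full_page=True)
        browser.close()
    return path


print(url_to_filename("https://web-scraping.dev/product/1?variant=orange-small"))
```

Deriving the file name from the URL keeps screenshots of different pages from overwriting each other when several pages are tracked.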
Now that we have captured the screenshots, we are ready to compare them and highlight the differences between them.
To compare the two screenshots and highlight changes, we first need to read the screenshot files using Wand's Image constructor.
Wand provides a .compare() method on the Image object that does exactly that! This method compares an image with another and returns a reconstructed image and the computed distortion. By default, the reconstructed image highlights the differences in red.
from wand.image import Image

def compare_images(image1, image2, diff_path):
    # Open the two images you want to compare
    with Image(filename=image1) as img1, Image(filename=image2) as img2:
        # Compare the images
        diff = img1.compare(img2, metric="root_mean_square")
        # The result is a tuple containing the difference image and the computed difference
        diff_image, diff_value = diff
        print(f"The difference between the images is: {diff_value}")
        # Save the difference image to the given path
        diff_image.save(filename=diff_path)
        return (diff_image, diff_value)

compare_images("product-1.png", "product-2.png", "difference.png")
This saves a difference image, difference.png, with the changed regions highlighted in red.
Now that we can compare two screenshots and get the computed difference between them, we need to automate this process to run regularly and notify us with detected screenshot changes.
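In practice, two screenshots of an unchanged page can still differ slightly because of anti-aliasing or animation frames, so treating any non-zero difference as a change may be too strict. The sketch below shows one way to apply a tolerance. The function names and the `0.01` threshold are assumptions for illustration, and the threshold assumes the root mean square distortion is reported on a 0 to 1 scale, so it should be tuned against real captures.

```python
def has_meaningful_change(diff_value: float, threshold: float = 0.01) -> bool:
    """Return True when the distortion score exceeds the noise threshold."""
    return diff_value > threshold


def compare_with_threshold(image1, image2, threshold=0.01):
    """Compare two image files and report whether they differ beyond the threshold."""
    # Imported lazily so the threshold helper works without Wand installed.
    from wand.image import Image

    with Image(filename=image1) as img1, Image(filename=image2) as img2:
        diff_image, diff_value = img1.compare(img2, metric="root_mean_square")
        # The difference image is not needed here, so release it right away.
        diff_image.close()
    return has_meaningful_change(diff_value, threshold)


# Example with hand-picked scores (not measured values):
print(has_meaningful_change(0.002))  # below the threshold
print(has_meaningful_change(0.08))   # above the threshold
```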
To streamline the process of capturing and comparing webpage screenshots, you can automate it in Python using the schedule library. You can schedule the script to run at regular intervals (e.g., daily or hourly), capture a new screenshot, compare it with the one from the last run, and send a notification if changes are detected.
To send a notification when changes are detected, you can integrate an email or messaging API (like Twilio, SendGrid, or Slack) into your script. For example, you could add the following function to send an email when a difference is found:
import smtplib
from email.mime.text import MIMEText

def send_change_email(diff_value):
    # diff_value is the distortion score returned by compare_images()
    msg = MIMEText(f"Changes detected. Distortion score: {diff_value:.4f}")
    msg["Subject"] = "Webpage Change Detected"
    msg["From"] = "your_email@example.com"
    msg["To"] = "recipient_email@example.com"
    # Placeholder host and port: adjust them to your email provider
    with smtplib.SMTP("smtp.example.com", 587) as server:
        # Most providers require an encrypted connection before login
        server.starttls()
        server.login("your_email@example.com", "your_password")
        server.send_message(msg)
In the example above, we create a small function for sending an email with our screenshot comparison details.
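A notification is often more useful when it includes the difference image itself. As a sketch of that idea, the standard library's `EmailMessage` class can build a message with the image attached. The `build_change_email` function and its parameters are hypothetical, and the PNG subtype assumes the difference image was saved as a PNG file.

```python
from email.message import EmailMessage
from pathlib import Path


def build_change_email(diff_value, diff_path, sender, recipient):
    """Build an email that reports the distortion score and attaches the diff image."""
    msg = EmailMessage()
    msg["Subject"] = "Webpage Change Detected"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"Changes detected. Distortion score: {diff_value:.4f}")
    # Attach the difference image so the change can be reviewed from the inbox.
    diff_file = Path(diff_path)
    msg.add_attachment(
        diff_file.read_bytes(),
        maintype="image",
        subtype="png",
        filename=diff_file.name,
    )
    return msg
```

The returned message can be passed to `server.send_message()` in the same way as the text-only message.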
Now that we have all the functions needed to capture the screenshots, compare them, and send an email with the difference between them, let's schedule the whole process to run every day:
import schedule
import time

def job():
    get_screenshot("https://example.com", "new-screenshot.png")
    diff_image, diff_value = compare_images(
        "new-screenshot.png", "old-screenshot.png", "diff-screenshot.png"
    )
    if diff_value > 0:
        # pass the numeric difference, not the image object
        send_change_email(diff_value)
    else:
        print("No change detected")

schedule.every().day.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
With this schedule script running, we can check for website changes through our page screenshot capture function, which runs once every day. If changes are detected, an email will be sent to notify us!
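One detail the scheduled job leaves open is how old-screenshot.png is created and kept up to date. A possible approach, sketched below with hypothetical helper names, is to treat the first capture as the baseline and to promote each new capture to baseline after it has been compared.

```python
from pathlib import Path


def needs_baseline(old_path) -> bool:
    """Return True when no previous screenshot exists yet (e.g. on the first run)."""
    return not Path(old_path).exists()


def promote_to_baseline(new_path, old_path) -> None:
    """Replace the previous screenshot with the newest capture."""
    # Path.replace overwrites the destination if it already exists.
    Path(new_path).replace(old_path)


# Inside job(), the helpers could be used roughly like this:
#
#   get_screenshot(url, "new-screenshot.png")
#   if needs_baseline("old-screenshot.png"):
#       promote_to_baseline("new-screenshot.png", "old-screenshot.png")
#       return  # nothing to compare against yet
#   ... compare and notify ...
#   promote_to_baseline("new-screenshot.png", "old-screenshot.png")
```

Promoting after every run means each comparison is against the previous run. Keeping a fixed baseline instead would report the cumulative drift from the original page.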
This concludes our screenshot tracking project. Scaling up projects like this can be difficult, though, and this is where Scrapfly can lend a hand!
So far, we've explored how to capture website screenshots using basic headless browser configurations. However, many modern websites employ anti-bot measures to prevent automated screenshot capture, making it challenging to scale your efforts. This is where ScrapFly comes into play!
ScrapFly provides web scraping, screenshot, and data extraction APIs designed for data collection at scale.
Using ScrapFly’s screenshot API is straightforward and can be done with a simple API request. Here’s an example of how to take screenshots using Python:
from pathlib import Path
import urllib.parse
import requests

# Base URL for ScrapFly's screenshot API
base_url = 'https://api.scrapfly.io/screenshot?'

# Define the parameters for the API request
params = {
    'key': 'Your ScrapFly API key',  # Your ScrapFly API key
    'url': 'https://web-scraping.dev/product/1?variant=cherry-small',  # URL of the webpage to capture
    'format': 'png',  # Desired screenshot format (e.g., png, jpeg)
    'capture': 'fullpage',  # Area to capture (specific element, fullpage, viewport)
    'resolution': '1920x1080',  # Screen resolution for the capture
    'country': 'us',  # Proxy country
    'rendering_wait': 5000,  # Wait time in milliseconds before capturing
    'options': [
        'dark_mode',  # Enable dark mode
        'block_banners',  # Block pop-up banners
        'print_media_format'  # Emulate print media format
    ],
    'auto_scroll': True  # Automatically scroll down the page before capturing
}

# Convert the list of options into a comma-separated string
params['options'] = ','.join(params['options'])
query_string = urllib.parse.urlencode(params)
full_url = base_url + query_string

# Make the API request to capture the screenshot
response = requests.get(full_url)
# Stop on HTTP errors so an error response is not saved as an image
response.raise_for_status()
image_bytes = response.content

# Save the screenshot to disk
Path("screenshot.png").write_bytes(image_bytes)
Scrapfly offers many of the same capabilities as Playwright in a more performant, scalable and reliable way!
To conclude this guide on tracking webpage changes using automated screenshots, let's address some frequently asked questions.
A screenshot API is a service that allows you to capture images of websites via HTTP requests, eliminating the need to manage headless browser instances directly. It enables customized screenshots using various headless browser features, such as setting the device viewport, resolution, JavaScript execution, banner blocking, and more.
Other browser automation tools like Selenium can also be used to take screenshots in Python. Check out our dedicated guide on how to take screenshots in Python.
Screenshot APIs are also a great alternative. We have already compared them for you in our guide to the best screenshot API.
Puppeteer is the go-to tool for browser automation in Node.js. It has a built-in method to capture screenshots. Take a look at our guide on taking screenshots with Puppeteer. Selenium and Playwright are also available in Node.js and can be used to capture screenshots.
In this tutorial, we've created a visual tracking tool using Python that monitors websites for changes.
We started by using Playwright and the screenshot() method to capture screenshots of web pages. Then we loaded the captured screenshots using Wand (which is a binding of ImageMagick) and compared them using the compare() method. Finally, we automated the process using the schedule library to run the script periodically and send email notifications when changes are detected.