To scrape images from a website, we can use Python with an HTML parsing tool like beautifulsoup to select all <img> elements, read their src attributes, and save the files they point to.
Here's an example using httpx and beautifulsoup (install using pip install httpx beautifulsoup4):
import asyncio
import httpx
from bs4 import BeautifulSoup
from pathlib import Path


async def download_image(url, filepath, client):
    """Download a single image and write it to filepath."""
    response = await client.get(url)
    response.raise_for_status()  # fail loudly instead of saving an error page
    filepath.write_bytes(response.content)
    print(f"Downloaded {url} to {filepath}")


async def scrape_images(url):
    """Find every <img> on the page and download them concurrently."""
    download_dir = Path('images')
    download_dir.mkdir(parents=True, exist_ok=True)

    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        download_tasks = []
        for img_tag in soup.find_all("img"):
            img_url = img_tag.get("src")  # get image url
            if img_url:
                img_url = response.url.join(img_url)  # turn url absolute
                img_filename = download_dir / Path(str(img_url)).name
                download_tasks.append(
                    download_image(img_url, img_filename, client)
                )
        # run all downloads concurrently
        await asyncio.gather(*download_tasks)


# example - scrape all scrapfly blog images:
url = "https://scrapfly.io/blog/"
asyncio.run(scrape_images(url))
Above, we use httpx.AsyncClient to first retrieve the target page HTML. Then, we extract the src attribute of every <img> element and turn it into an absolute URL. Finally, we download all images concurrently and save them to the ./images directory.
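One detail worth watching in the example above: the filename is taken from the full URL string, so a URL like https://example.com/logo.png?w=100 would be saved as logo.png?w=100, and two images with the same name on different paths would overwrite each other. Below is a minimal sketch of one way to handle this using only the standard library. The filename_from_url helper, its used set, and the "image" fallback name are hypothetical names introduced here for illustration; they are not part of httpx or beautifulsoup.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str, used: set[str], default: str = "image") -> str:
    """Hypothetical helper: derive a unique local filename from an image URL."""
    # Keep only the path component so "?w=100" or "#top" never reach the filename.
    path = unquote(urlsplit(url).path)
    name = PurePosixPath(path).name
    # Fall back to a default for URLs without a usable last path segment.
    if name in {"", ".", ".."}:
        name = default
    # If the name is already taken, append a counter before the extension.
    stem, suffix = PurePosixPath(name).stem, PurePosixPath(name).suffix
    candidate, counter = name, 1
    while candidate in used:
        candidate = f"{stem}-{counter}{suffix}"
        counter += 1
    used.add(candidate)
    return candidate


# usage: share one set across all images found on a page
seen: set[str] = set()
print(filename_from_url("https://example.com/img/logo.png?w=100", seen))
print(filename_from_url("https://cdn.example.com/other/logo.png", seen))
```

In the scraper, this could replace Path(str(img_url)).name by calling filename_from_url(str(img_url), seen) with a set created once per scrape_images call.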