How to scrape images from a website?

To scrape images from a website, we can use Python with an HTML parsing library like BeautifulSoup to select all <img> elements and then download each image's src URL.

Here's an example using httpx and BeautifulSoup (install both using pip install httpx beautifulsoup4):

import asyncio
import httpx
from bs4 import BeautifulSoup
from pathlib import Path


async def download_image(url, filepath, client):
    # fetch the image and write the raw response bytes to disk
    response = await client.get(url)
    filepath.write_bytes(response.content)
    print(f"Downloaded {url} to {filepath}")


async def scrape_images(url):
    # save all images to a local ./images directory
    download_dir = Path('images')
    download_dir.mkdir(parents=True, exist_ok=True)

    async with httpx.AsyncClient() as client:
        # retrieve the target page and parse its HTML
        response = await client.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        download_tasks = []
        for img_tag in soup.find_all("img"):
            img_url = img_tag.get("src")  # get the image URL
            if img_url:
                # resolve relative URLs against the page URL
                img_url = response.url.join(img_url)
                # name the file after the last segment of the URL path
                img_filename = download_dir / Path(str(img_url)).name
                download_tasks.append(
                    download_image(img_url, img_filename, client)
                )
        # run all downloads concurrently
        await asyncio.gather(*download_tasks)

# example - scrape all scrapfly blog images:
url = "https://scrapfly.io/blog/"
asyncio.run(scrape_images(url))

Above, we first use httpx.AsyncClient to retrieve the target page's HTML. Then, we extract the src attribute of every <img> element and resolve it to an absolute URL with response.url.join(). Finally, we download all images concurrently and save them to the ./images directory.
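
Note that asyncio.gather() starts every download at once, and many pages lazy-load images through attributes like data-src rather than src. Here's a minimal sketch of how both could be handled — the data-src attribute name, the download_image_safe helper, and the concurrency limit of 10 are assumptions to adapt to your target page, not part of the example above:

import asyncio
import httpx


def get_image_url(img_tag):
    # fall back to data-src for lazy-loaded images
    # (assumption - inspect the page to find the real attribute)
    return img_tag.get("src") or img_tag.get("data-src")


async def download_image_safe(url, filepath, client, semaphore):
    # cap concurrent requests and skip images that fail to download
    async with semaphore:
        try:
            response = await client.get(url)
            response.raise_for_status()  # error on 4xx/5xx responses
        except httpx.HTTPError as exc:
            print(f"Skipping {url}: {exc}")
            return
        filepath.write_bytes(response.content)


# inside scrape_images(), create one shared semaphore and pass it along:
# semaphore = asyncio.Semaphore(10)  # at most 10 downloads at a time
# download_tasks.append(download_image_safe(img_url, img_filename, client, semaphore))

Creating the semaphore inside the running coroutine (rather than at module level) ensures it binds to the active event loop on older Python versions.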

