Guide to List Crawling: Everything You Need to Know

List crawling is a specialized form of web scraping that focuses on extracting collections of similar items from websites. Whether you're gathering product catalogs, monitoring pricing across e-commerce platforms, or building a database of ranked content, list crawling provides the foundation for efficient and organized data collection.

In this article, we will explore practical techniques for crawling different types of web lists, from product catalogs and infinite scrolling pages to articles, tables, and search results.

What is List Crawling?

List crawling refers to the automated process of extracting collections of similar items from web pages.

Unlike general web scraping that might target diverse information from a page, list crawling specifically focuses on groups of structured data that follow consistent patterns such as product listings, search results, rankings, or tabular data.

Crawler Setup

Setting up a basic list crawler requires a few essential components. Python, with its rich ecosystem of libraries, offers an excellent foundation for building effective crawlers.

For our list crawling examples we'll use Python with the following libraries:

  • requests for sending HTTP requests
  • BeautifulSoup (beautifulsoup4) for parsing HTML
  • Playwright for browser automation and dynamically loaded pages

All of these can be installed using this pip command:

$ pip install beautifulsoup4 requests playwright

Once you have these libraries installed, take a look at this simple example list crawler that scrapes a single product listing page:

import requests
from bs4 import BeautifulSoup

def crawl_static_list(url):
    # Send HTTP request to the target URL
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all product items
    items = soup.select("div.row.product")

    # Extract data from each item
    results = []
    for item in items:
        title = item.select_one("h3.mb-0 a").text.strip()
        price = item.select_one("div.price").text.strip()
        results.append({"title": title, "price": price})

    return results

url = "https://web-scraping.dev/products"
data = crawl_static_list(url)
print(f"Found {len(data)} items")
for item in data[:3]:  # Print first 3 items as example
    print(f"Title: {item['title']}, Price: {item['price']}")
Example Output

Found 5 items
Title: Box of Chocolate Candy, Price: 24.99
Title: Dark Red Energy Potion, Price: 4.99
Title: Teal Energy Potion, Price: 4.99

In the above code, we're making an HTTP request to a target URL, parsing the HTML content using BeautifulSoup, and then extracting specific data points from each list item.

This approach works well for simple, static lists where all content is loaded immediately. For more complex scenarios like paginated or dynamically loaded lists, you'll need to extend this foundation with additional techniques we'll cover in subsequent sections.

Your crawler's effectiveness largely depends on how well you understand the structure of the target website. Taking time to inspect the HTML using browser developer tools will help you craft precise selectors that accurately target the desired elements.
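For instance, here is a minimal sketch of a more defensive version of the extraction loop above. The selectors mirror the earlier example, and the None fallbacks are an assumption for illustration rather than site-specific requirements:

from bs4 import BeautifulSoup

def parse_items(html):
    """Parse product items defensively, tolerating missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select("div.row.product"):
        # select_one returns None when an element is absent,
        # so guard before calling .text to avoid AttributeError
        title_el = item.select_one("h3.mb-0 a")
        price_el = item.select_one("div.price")
        results.append({
            "title": title_el.text.strip() if title_el else None,
            "price": price_el.text.strip() if price_el else None,
        })
    return results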

Let's now see how we can enhance our basic crawler with more advanced capabilities and handle different list crawling scenarios.

Power-Up with Scrapfly

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

scrapfly middleware

Here's an example of how to scrape a product with the Scrapfly web scraping API:

from scrapfly import ScrapflyClient, ScrapeConfig

# Create a ScrapflyClient instance
client = ScrapflyClient(key='YOUR-SCRAPFLY-KEY')
# Create scrape requests
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    # optional: set country to get localized results
    country="us",
    # optional: use cloud browsers
    render_js=True,
    # optional: scroll to the bottom of the page
    auto_scroll=True,
))

print(api_result.result["context"])  # metadata
print(api_result.result["config"])  # request data
print(api_result.scrape_result["content"])  # result html content

# parse data yourself
product = {
    "title": api_result.selector.css("h3.product-title::text").get(),
    "price": api_result.selector.css(".product-price::text").get(),
    "description": api_result.selector.css(".product-description::text").get(),
}
print(product)

# or let AI parser extract it for you!
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    # use AI models to find ALL product data available on the page
    extraction_model="product"
))
Example Output

{
    "title": "Box of Chocolate Candy",
    "price": "$9.99 ",
    "description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.",
}

Paginated List Crawling

Paginated lists split the data across multiple pages with numbered navigation. This technique is common in e-commerce, search results, and data directories.

One example of paginated pages is web-scraping.dev/products which splits products through several pages.

example of a paginated list
paginated list on web-scraping.dev/products

Example Crawler

Here's how to build a product list crawler that handles traditional pagination:

import requests
from bs4 import BeautifulSoup

# Get first page and extract pagination URLs
url = "https://web-scraping.dev/products"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
other_page_urls = set(a.attrs["href"] for a in soup.select(".paging>a") if a.attrs.get("href"))

# Extract product titles from first page
all_product_titles = [a.text.strip() for a in soup.select(".product h3 a")]

# Extract product titles from other pages
for url in other_page_urls:
    page_soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    all_product_titles.extend(a.text.strip() for a in page_soup.select(".product h3 a"))

# Print results
print(f"Total products found: {len(all_product_titles)}")
print("\nProduct Titles:")
for i, title in enumerate(all_product_titles, 1):
    print(f"{i}. {title}")
Example Output

Total products found: 30

Product Titles:

  1. Box of Chocolate Candy
  2. Dark Red Energy Potion
  3. Teal Energy Potion
  4. Red Energy Potion
  5. Blue Energy Potion
  6. Box of Chocolate Candy
  7. Dark Red Energy Potion
  8. Teal Energy Potion
  9. Red Energy Potion
  10. Blue Energy Potion
  11. Dragon Energy Potion
  12. Hiking Boots for Outdoor Adventures
  13. Women's High Heel Sandals
  14. Running Shoes for Men
  15. Kids' Light-Up Sneakers
  16. Classic Leather Sneakers
  17. Cat-Ear Beanie
  18. Box of Chocolate Candy
  19. Dark Red Energy Potion
  20. Teal Energy Potion
  21. Red Energy Potion
  22. Blue Energy Potion
  23. Dragon Energy Potion
  24. Hiking Boots for Outdoor Adventures
  25. Women's High Heel Sandals
  26. Running Shoes for Men
  27. Kids' Light-Up Sneakers
  28. Classic Leather Sneakers
  29. Cat-Ear Beanie
  30. Box of Chocolate Candy

In the above code, we fetch the first page and extract the pagination URLs from it. We then collect product titles from the first page and from each of the remaining pages, and finally print the total number of products found along with their titles.

Crawling Challenges

While crawling product lists, you'll encounter several challenges:

  1. Pagination Variations: Some sites use parameters like ?page=2 while others might use path segments like /page/2/ or even completely different URL structures.

  2. Paging Limits: Many sites restrict the maximum number of viewable pages (typically 20-50), even with thousands of products. Overcome this by using filters like price ranges to access the complete dataset, as demonstrated in our paging limit bypass tutorial.

  3. Changing Layouts: Product list layouts may vary across different categories or during site updates.

  4. Missing Data: Not all products will have complete information, requiring robust error handling.

Effective product list crawling requires adapting to these challenges with techniques like request throttling, robust selectors, and comprehensive error handling.
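As a minimal sketch of two of those techniques, the snippet below adds a polite delay between page requests and tolerates missing fields. The ?page=N URL pattern, the one-second delay, and the selectors are assumptions based on the earlier web-scraping.dev example:

import time
import requests
from bs4 import BeautifulSoup

def crawl_pages(page_urls, delay_seconds=1.0):
    """Crawl a list of page URLs with throttling and tolerant parsing."""
    products = []
    for url in page_urls:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code != 200:
            print(f"Skipping {url}: HTTP {response.status_code}")
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for item in soup.select(".product"):
            title_el = item.select_one("h3 a")
            price_el = item.select_one(".price")
            products.append({
                # missing data is tolerated instead of raising an error
                "title": title_el.text.strip() if title_el else None,
                "price": price_el.text.strip() if price_el else None,
            })
        time.sleep(delay_seconds)  # request throttling between pages
    return products

# assumed ?page=N pagination pattern
pages = [f"https://web-scraping.dev/products?page={i}" for i in range(1, 4)]
print(f"Collected {len(crawl_pages(pages))} products")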

Let's now explore how to handle more dynamic lists that load content as you scroll.

Endless List Crawling

Modern websites often implement infinite scrolling—a technique that continuously loads new content as the user scrolls down the page.

These "endless" lists present unique challenges for crawlers since the content isn't divided into distinct pages but is loaded dynamically via JavaScript.

One example of infinite data lists is the web-scraping.dev/testimonials page:

example of an endless list
endless list on web-scraping.dev/testimonials

Let's see how we can crawl it next.

Example Crawler

To tackle endless lists, the easiest method is to use a headless browser that can execute JavaScript and simulate scrolling. Here's an example using Playwright and Python:

# This example is using Playwright but it's also possible to use Selenium with similar approach
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://web-scraping.dev/testimonials/")

    # scroll to the bottom:
    _prev_height = -1
    _max_scrolls = 100
    _scroll_count = 0
    while _scroll_count < _max_scrolls:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Wait for new content to load (change this value as needed)
        page.wait_for_timeout(1000)
        # Check whether the scroll height changed - means more pages are there
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == _prev_height:
            break
        _prev_height = new_height
        _scroll_count += 1
    # now we can collect all loaded data:
    results = []
    for element in page.locator(".testimonial").element_handles():
        text = element.query_selector(".text").inner_html()
        results.append(text)
    print(f"scraped {len(results)} results")
Example Output
scraped 60 results

In the above code, we are using Playwright to control a browser and scroll to the bottom of the page to load all the testimonials. We are then collecting the text of each testimonial and printing the number of testimonials scraped. This approach effectively handles endless lists that load content dynamically.

Crawling Challenges

Endless list crawling comes with its own set of challenges:

  1. Speed: Browser crawling is much slower than API-based approaches. When possible, reverse engineer the site's API endpoints for direct data fetching, which is often thousands of times faster (as shown in our reverse engineering of endless paging guide); a hedged sketch of this idea follows this list.

  2. Resource Intensity: Running a headless browser consumes significantly more resources than simple HTTP requests.

  3. Element Staleness: As the page updates, previously found elements may become "stale" and unusable, requiring refetching.

  4. Scroll Triggers: Some sites use scroll-percentage triggers rather than scrolling to the bottom, requiring more nuanced scroll simulation.
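To illustrate the first point, here is a hedged sketch of fetching paginated data directly from a backend endpoint instead of scrolling a browser. The /api/testimonials URL and its JSON shape are hypothetical placeholders for whatever endpoint you discover in the browser's Network tab:

import requests

# hypothetical endpoint and response shape -- substitute whatever the
# page actually calls while it scrolls (visible in the Network tab)
API_URL = "https://example.com/api/testimonials"

def fetch_all(max_pages=50):
    results = []
    for page in range(1, max_pages + 1):
        response = requests.get(API_URL, params={"page": page})
        if response.status_code != 200:
            break
        items = response.json().get("items", [])
        if not items:  # an empty page means we've reached the end
            break
        results.extend(items)
    return results

print(f"fetched {len(fetch_all())} testimonials")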

Now that we've covered dynamic content loading, let's explore how to extract structured data from article-based lists, which present their own unique challenges.

List Article Crawling

Articles featuring lists (like "Top 10 Programming Languages" or "5 Best Travel Destinations") represent another valuable source of structured data. These lists are typically embedded within article content, organized under headings or with numbered sections.

Example Crawler

For this example, let's scrape Scrapfly's own top-10 listicle article using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://scrapfly.io/blog/top-10-web-scraping-libraries-in-python/")

# Check if the request was successful
if response.status_code != 200:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    libraries = []
else:
    # Parse the HTML content with BeautifulSoup
    # Using 'lxml' parser for better performance and more robust parsing
    soup = BeautifulSoup(response.text, 'lxml')
    
    # Find all h2 headings which represent the list items
    headings = soup.find_all('h2')
    libraries = []
    
    for heading in headings:
        # Get the heading text (library name)
        title = heading.text.strip()
        
        # Skip the "Summary" section
        if title.lower() == "summary":
            continue
        
        # Get the next paragraph for a brief description
        # In BeautifulSoup, we use .find_next() to get the next element
        next_paragraph = heading.find_next('p')
        description = next_paragraph.text.strip() if next_paragraph else ''
        
        libraries.append({
            "name": title,
            "description": description
        })

# Print the results
print("Top Web Scraping Libraries in Python:")
for i, lib in enumerate(libraries, 1):
    print(f"{i}. {lib['name']}")
    print(f"   {lib['description'][:100]}...")  # Print first 100 chars of description
Example Output

Top Web Scraping Libraries in Python:
1. HTTPX
   HTTPX is by far the most complete and modern HTTP client package for Python. It is inspired by the p...
2. Parsel and LXML
   LXML is a fast and feature-rich HTML/XML parser for Python. It is a wrapper around the C library lib...
3. BeautifulSoup
   Beautifulsoup (aka bs4) is another HTML parser library in Python. Though it's much more than that....
4. JMESPath and JSONPath
   JMESPath and JSONPath are two libraries that allow you to query JSON data using a query language sim...
5. Playwright and Selenium
   Headless browsers are becoming very popular in web scraping as a way to deal with dynamic javascript...
6. Cerberus and Pydantic
   An often overlooked process of web scraping is the data quality assurance step. Web scraping is a un...
7. Scrapfly Python SDK
   ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale....
8. Related Posts
   Learn how to efficiently find all URLs on a domain using Python and web crawling. Guide on how to cr...

In this example, we used the requests library to make an HTTP GET request to a blog post about the top web scraping libraries in Python. We then used BeautifulSoup to parse the HTML content of the page and extract the list of libraries and their descriptions. Finally, we printed the results to the console.

Crawling Challenges

Extracting data from list articles requires understanding the content structure and accounting for variations in formatting. Some articles may use numbering in headings, while others rely solely on heading hierarchy. A robust crawler should handle these variations and clean the extracted text to remove extraneous content.
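For example, a small helper like this hedged sketch can normalize headings whether or not they carry explicit numbering; the regular expression is an assumption covering common formats such as "1.", "2)", or "#3":

import re

def clean_heading(text):
    """Strip leading list numbering like '1.', '2)', or '#3' from a heading."""
    # assumed numbering formats; extend the pattern for other styles
    return re.sub(r"^(?:#?\d+[\.\)]?\s+)", "", text.strip())

print(clean_heading("1. HTTPX"))       # -> "HTTPX"
print(clean_heading("BeautifulSoup"))  # -> "BeautifulSoup" (unchanged)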

There are some tools that can assist you with listicle scraping:

  • newspaper4k (previously newspaper3k) implements article parsing from HTML and various helper functions that can help identify lists.
  • goose3 is another library that can extract structured data from articles, including lists.
  • trafilatura is another powerful HTML parser with many prebuilt functions for extracting structured data from articles.
  • parsel extracts data using powerful XPath and CSS selectors, allowing for very flexible and reliable extraction.
  • LLMs with RAG can be an easy way to extract data from list articles.

Let's see tabular data next, which presents yet another structure for list information.

Table List Crawling

Tables represent another common format for presenting list data on the web. Whether implemented as HTML <table> elements or styled as tables using CSS grids or other layout techniques, they provide a structured way to display related data in rows and columns.

Example Crawler

For this example, let's look at the table data section on the web-scraping.dev/product/1 page:

example of a table list
table list on web-scraping.dev/product/1

Here's how to extract data from HTML tables using the BeautifulSoup HTML parsing library:

from bs4 import BeautifulSoup
import requests

response = requests.get("https://web-scraping.dev/product/1")
html = response.text

soup = BeautifulSoup(html, "lxml")

# First, select the desired table element (the 2nd one on the page)
table = soup.find_all('table', {'class': 'table-product'})[1]

headers = []
rows = []

for i, row in enumerate(table.find_all('tr')):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all('th')]
    else:
        rows.append([el.text.strip() for el in row.find_all('td')])
        
print(headers)
print(rows)
Example Output

['Version', 'Package Weight', 'Package Dimension', 'Variants', 'Delivery Type']
[['Pack 1', '1,00 kg', '100x230 cm', '6 available', '1 Day shipping'], ['Pack 2', '2,11 kg', '200x460 cm', '6 available', '1 Day shipping'], ['Pack 3', '3,22 kg', '300x690 cm', '6 available', '1 Day shipping'], ['Pack 4', '4,33 kg', '400x920 cm', '6 available', '1 Day shipping'], ['Pack 5', '5,44 kg', '500x1150 cm', '6 available', '1 Day shipping']]

In the above code, we're identifying and parsing an HTML table, extracting the header cells from the first row and the data cells from the remaining rows. This approach gives you structured data that preserves the relationships between columns and rows.

Crawling Challenges

When crawling tables, it's important to look beyond the obvious <table> elements. Many modern websites implement table-like layouts using CSS grid, flexbox, or other techniques. Identifying these structures requires careful inspection of the DOM and adapting your selectors accordingly.

Most table structures can be handled with BeautifulSoup, CSS selectors, or XPath-powered algorithms, though for more generic solutions you can use LLMs and AI. One commonly used technique is to use an LLM to convert HTML to Markdown, which can often produce accurate tables from flexible HTML table structures.
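As an illustration, here is a minimal sketch for a table-like layout built from div elements. The markup and the .spec-row, .spec-name, and .spec-value class names are hypothetical and should be replaced with whatever structure you find in the actual DOM:

from bs4 import BeautifulSoup

# hypothetical CSS-grid style "table" -- class names are placeholders
html = """
<div class="spec-table">
  <div class="spec-row"><div class="spec-name">Weight</div><div class="spec-value">1,00 kg</div></div>
  <div class="spec-row"><div class="spec-name">Dimension</div><div class="spec-value">100x230 cm</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
specs = {}
for row in soup.select(".spec-table .spec-row"):
    name = row.select_one(".spec-name")
    value = row.select_one(".spec-value")
    if name and value:
        specs[name.text.strip()] = value.text.strip()

print(specs)  # {'Weight': '1,00 kg', 'Dimension': '100x230 cm'}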

Now, let's explore how to crawl search engine results pages for list-type content.

SERP List Crawling

Search Engine Results Pages (SERPs) offer a treasure trove of list-based content, presenting curated links to pages relevant to specific keywords. Crawling SERPs can help you discover list articles and other structured content across the web.

Example Crawler

Here's a basic approach to crawling Google search results:

Python:
import requests
from bs4 import BeautifulSoup
import urllib.parse

def crawl_google_serp(query, num_results=10):
    # Format the query for URL
    encoded_query = urllib.parse.quote(query)
    
    # Create Google search URL
    url = f"https://www.google.com/search?q={encoded_query}&num={num_results}"
    
    # Add headers to mimic a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Extract search results
    results = []
    
    # Target the organic search results
    for result in soup.select("div.g"):
        title_element = result.select_one("h3")
        
        if title_element:
            title = title_element.text
            
            # Extract URL
            link_element = result.select_one("a")
            link = link_element.get("href") if link_element else None
            
            # Extract snippet
            snippet_element = result.select_one("div.VwiC3b")
            snippet = snippet_element.text if snippet_element else None
            
            results.append({
                "title": title,
                "url": link,
                "snippet": snippet
            })
    
    return results

ScrapFly AI:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR-SCRAPFLY-KEY")

result = scrapfly.scrape(ScrapeConfig(
    url="https://www.google.com/search?q=python"
    # select country to get localized results
    country="us",
    # enable cloud browsers
    render_js=True,
    # scroll to the bottom of the page
    auto_scroll=True,
    # use AI to extract data 
    extraction_model="search_engine_results",
))

print(result.content)

In the above code, we're constructing a Google search query URL, sending an HTTP request with browser-like headers, and then parsing the HTML to extract organic search results. Each result includes the title, URL, and snippet text, which can help you identify list-type content for further crawling.

Crawling Challenges

It's worth noting that directly crawling search engines can be challenging due to their very strong anti-bot measures. For production applications, you may need more sophisticated techniques to avoid blocks; for an overview, see our blocking bypass introduction tutorial.
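As one small building block of that, here is a hedged sketch of rotating browser-like User-Agent headers between requests. The header strings are examples only; production setups typically add proxy rotation and full browser fingerprinting, which this sketch does not cover:

import random
import requests

# example User-Agent strings -- rotate a larger, up-to-date pool in practice
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def get_with_rotation(url):
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    # a proxies={"http": ..., "https": ...} argument could be added here
    return requests.get(url, headers=headers, timeout=30)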

Scrapfly can easily bypass all SERP blocking measures and return AI extracted data for any SERP page using AI Web Scraping API.

To wrap up, let's move on to some frequently asked questions about list crawling.

FAQ

Below are quick answers to common questions about list crawling techniques and best practices:

What is the difference between list crawling and general web scraping?

List crawling focuses on extracting structured data from lists, such as paginated content, infinite scrolls, and tables. General web scraping targets various elements across different pages, while list crawling requires specific techniques for handling pagination, scroll events, and nested structures.

How do I handle rate limiting when crawling large lists?

Use adaptive delays (1-3 seconds) and increase them if you get 429 errors. Implement exponential backoff for failed requests and rotate proxies to distribute traffic. A request queuing system helps maintain a steady and sustainable request rate.
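A minimal sketch of that retry logic might look like the following; the delay values and retry cap are assumptions chosen to illustrate the pattern:

import time
import requests

def get_with_backoff(url, max_retries=5, base_delay=1.0):
    """GET a URL, backing off exponentially on 429 responses."""
    for attempt in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code != 429:
            return response
        delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
        print(f"Rate limited, retrying in {delay:.0f}s")
        time.sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")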

How can I extract structured data from deeply nested lists?

Identify nesting patterns using developer tools. Use a recursive function to process items and their children while preserving relationships. CSS selectors, XPath, and depth-first traversal help extract data while maintaining hierarchy.
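Here is a hedged sketch of such a recursive walk over nested <ul>/<li> markup that preserves parent-child relationships; the sample HTML is a made-up illustration:

from bs4 import BeautifulSoup

def parse_nested_list(ul):
    """Recursively convert a <ul> element into a list of nested dicts."""
    items = []
    for li in ul.find_all("li", recursive=False):  # direct children only
        child_ul = li.find("ul")
        # text of this item, excluding its nested sub-list
        label = li.find(string=True, recursive=False)
        items.append({
            "label": label.strip() if label else li.get_text(strip=True),
            "children": parse_nested_list(child_ul) if child_ul else [],
        })
    return items

html = "<ul><li>Fruit<ul><li>Apple</li><li>Pear</li></ul></li><li>Vegetables</li></ul>"
soup = BeautifulSoup(html, "html.parser")
print(parse_nested_list(soup.find("ul")))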

Summary

List crawling is essential for extracting structured data from the web's many list formats. From product catalogs and social feeds to nested articles and data tables, each list type requires a tailored approach.

This guide has covered:

  • Setting up basic crawlers with Python libraries like BeautifulSoup and requests
  • Handling paginated lists that split content across multiple pages
  • Tackling endless scroll lists with headless browsers
  • Extracting structured data from article-based lists
  • Processing tabular data for row-column relationships
  • Crawling search engine results to discover more list content

The techniques demonstrated here, from HTTP requests for static content to browser automation for dynamic pages, provide powerful tools for transforming unstructured web data into valuable, actionable insights.
