How to Crawl Lists: Scraping Product Catalogs, Tables, and More with Python
Learn how to crawl lists of items with Python, from paginated product catalogs and infinite scrolling pages to list articles, tables, and search results.
List crawling is a specialized form of web scraping that focuses on extracting collections of similar items from websites. Whether you're gathering product catalogs, monitoring pricing across e-commerce platforms, or building a database of ranked content, list crawling provides the foundation for efficient and organized data collection.
In this article, we will explore practical techniques for crawling different types of web lists, from product catalogs and infinite scrolling pages to articles, tables, and search results.
List crawling refers to the automated process of extracting collections of similar items from web pages.
Unlike general web scraping that might target diverse information from a page, list crawling specifically focuses on groups of structured data that follow consistent patterns such as product listings, search results, rankings, or tabular data.
Setting up a basic list crawler requires a few essential components. Python, with its rich ecosystem of libraries, offers an excellent foundation for building effective crawlers.
For our list crawling examples we'll use Python with the following libraries:
requests - for sending HTTP requests to static pages
beautifulsoup4 - for parsing HTML and selecting elements (with the optional lxml parser used in some examples for speed)
playwright - for driving a real browser on dynamic, JavaScript-rendered pages
All of these can be installed using pip (Playwright also needs a one-time browser download):
$ pip install beautifulsoup4 requests playwright lxml
$ playwright install chromium
Once you have these libraries installed, here's a simple example list crawler that scrapes a single page of items:
import requests
from bs4 import BeautifulSoup

def crawl_static_list(url):
    # Send HTTP request to the target URL
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")
    # Find all product items
    items = soup.select("div.row.product")
    # Extract data from each item
    results = []
    for item in items:
        title = item.select_one("h3.mb-0 a").text.strip()
        price = item.select_one("div.price").text.strip()
        results.append({"title": title, "price": price})
    return results

url = "https://web-scraping.dev/products"
data = crawl_static_list(url)
print(f"Found {len(data)} items")
for item in data[:3]:  # Print first 3 items as example
    print(f"Title: {item['title']}, Price: {item['price']}")
Found 5 items
Title: Box of Chocolate Candy, Price: 24.99
Title: Dark Red Energy Potion, Price: 4.99
Title: Teal Energy Potion, Price: 4.99
In the above code, we're making an HTTP request to a target URL, parsing the HTML content using BeautifulSoup, and then extracting specific data points from each list item.
This approach works well for simple, static lists where all content is loaded immediately. For more complex scenarios like paginated or dynamically loaded lists, you'll need to extend this foundation with additional techniques we'll cover in subsequent sections.
Your crawler's effectiveness largely depends on how well you understand the structure of the target website. Taking time to inspect the HTML using browser developer tools will help you craft precise selectors that accurately target the desired elements.
Let's now see how we can enhance our basic crawler with more advanced capabilities and different list crawling scenarios.
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
Here's an example of how to scrape a product with the Scrapfly web scraping API:
from scrapfly import ScrapflyClient, ScrapeConfig

# Create a ScrapflyClient instance
client = ScrapflyClient(key='YOUR-SCRAPFLY-KEY')

# Create scrape requests
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    # optional: set country to get localized results
    country="us",
    # optional: use cloud browsers
    render_js=True,
    # optional: scroll to the bottom of the page
    auto_scroll=True,
))
print(api_result.result["context"])  # metadata
print(api_result.result["config"])  # request data
print(api_result.scrape_result["content"])  # result html content

# parse data yourself
product = {
    "title": api_result.selector.css("h3.product-title::text").get(),
    "price": api_result.selector.css(".product-price::text").get(),
    "description": api_result.selector.css(".product-description::text").get(),
}
print(product)

# or let AI parser extract it for you!
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    # use AI models to find ALL product data available on the page
    extraction_model="product",
))
{
  "title": "Box of Chocolate Candy",
  "price": "$9.99 ",
  "description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy."
}
Paginated lists split the data across multiple pages with numbered navigation. This technique is common in e-commerce, search results, and data directories.
One example of a paginated page is web-scraping.dev/products, which splits its products across several pages.
Here's how to build a product list crawler that handles traditional pagination:
import requests
from bs4 import BeautifulSoup

# Get first page and extract pagination URLs
url = "https://web-scraping.dev/products"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
other_page_urls = set(a.attrs["href"] for a in soup.select(".paging>a") if a.attrs.get("href"))

# Extract product titles from first page
all_product_titles = [a.text.strip() for a in soup.select(".product h3 a")]

# Extract product titles from other pages
for url in other_page_urls:
    page_soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    all_product_titles.extend(a.text.strip() for a in page_soup.select(".product h3 a"))

# Print results
print(f"Total products found: {len(all_product_titles)}")
print("\nProduct Titles:")
for i, title in enumerate(all_product_titles, 1):
    print(f"{i}. {title}")
Total products found: 30
Product Titles:
1. Box of Chocolate Candy
2. Dark Red Energy Potion
3. Teal Energy Potion
4. Red Energy Potion
5. Blue Energy Potion
6. Box of Chocolate Candy
7. Dark Red Energy Potion
8. Teal Energy Potion
9. Red Energy Potion
10. Blue Energy Potion
11. Dragon Energy Potion
12. Hiking Boots for Outdoor Adventures
13. Women's High Heel Sandals
14. Running Shoes for Men
15. Kids' Light-Up Sneakers
16. Classic Leather Sneakers
17. Cat-Ear Beanie
18. Box of Chocolate Candy
19. Dark Red Energy Potion
20. Teal Energy Potion
21. Red Energy Potion
22. Blue Energy Potion
23. Dragon Energy Potion
24. Hiking Boots for Outdoor Adventures
25. Women's High Heel Sandals
26. Running Shoes for Men
27. Kids' Light-Up Sneakers
28. Classic Leather Sneakers
29. Cat-Ear Beanie
30. Box of Chocolate Candy
In the above code, we first fetch the first page and extract the pagination URLs from it. Then, we extract product titles from the first page and from each of the other pages. Finally, we print the total number of products found along with their titles.
While crawling product lists, you'll encounter several challenges:
Pagination Variations: Some sites use parameters like ?page=2 while others might use path segments like /page/2/ or even completely different URL structures.
Paging Limits: Many sites restrict the maximum number of viewable pages (typically 20-50), even when they list thousands of products. You can overcome this by using filters such as price ranges to access the complete dataset, as demonstrated in our paging limit bypass tutorial.
Changing Layouts: Product list layouts may vary across different categories or during site updates.
Missing Data: Not all products will have complete information, requiring robust error handling.
Effective product list crawling requires adapting to these challenges with techniques like request throttling, robust selectors, and comprehensive error handling, as sketched below.
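As a minimal sketch of how missing data and throttling could be handled, the snippet below reuses the web-scraping.dev selectors from the earlier example; the safe_text helper and the delay value are illustrative assumptions rather than part of the original code.
import time
import requests
from bs4 import BeautifulSoup

def safe_text(parent, selector):
    # Hypothetical helper: return stripped text, or None if the element is missing
    element = parent.select_one(selector)
    return element.text.strip() if element else None

def crawl_page(url, delay=1.5):
    # Throttle requests with a fixed delay (tune or randomize as needed)
    time.sleep(delay)
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    for item in soup.select("div.row.product"):
        results.append({
            "title": safe_text(item, "h3.mb-0 a"),
            "price": safe_text(item, "div.price"),
        })
    return results

print(crawl_page("https://web-scraping.dev/products"))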
Let's now explore how to handle more dynamic lists that load content as you scroll.
Modern websites often implement infinite scrolling—a technique that continuously loads new content as the user scrolls down the page.
These "endless" lists present unique challenges for crawlers since the content isn't divided into distinct pages but is loaded dynamically via JavaScript.
One example of an infinite data list is the web-scraping.dev/testimonials page.
Let's see how we can crawl it next.
To tackle endless lists, the easiest method is to use a headless browser that can execute JavaScript and simulate scrolling. Here's an example using Playwright and Python:
# This example is using Playwright but it's also possible to use Selenium with a similar approach
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://web-scraping.dev/testimonials/")

    # scroll to the bottom:
    _prev_height = -1
    _max_scrolls = 100
    _scroll_count = 0
    while _scroll_count < _max_scrolls:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Wait for new content to load (change this value as needed)
        page.wait_for_timeout(1000)
        # Check whether the scroll height changed - means more pages are there
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == _prev_height:
            break
        _prev_height = new_height
        _scroll_count += 1

    # now we can collect all loaded data:
    results = []
    for element in page.locator(".testimonial").element_handles():
        text = element.query_selector(".text").inner_html()
        results.append(text)
    print(f"scraped {len(results)} results")
scraped 60 results
In the above code, we are using Playwright to control a browser and scroll to the bottom of the page to load all the testimonials. We are then collecting the text of each testimonial and printing the number of testimonials scraped. This approach effectively handles endless lists that load content dynamically.
Endless list crawling comes with its own set of challenges:
Speed: Browser crawling is much slower than API-based approaches. When possible, reverse engineer the site's API endpoints and fetch the data directly, which is often thousands of times faster, as shown in our guide on reverse engineering endless paging.
Resource Intensity: Running a headless browser consumes significantly more resources than simple HTTP requests.
Element Staleness: As the page updates, previously found elements may become "stale" and unusable, requiring refetching.
Scroll Triggers: Some sites use scroll-percentage triggers rather than loading content only at the very bottom, requiring more nuanced scroll simulation; see the sketch after this list.
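For such sites, here is a minimal sketch of step-wise scrolling with Playwright, so scroll-percentage triggers fire as they would for a real user. The testimonials URL is reused from the earlier example; the step size, wait time, and iteration cap are illustrative assumptions.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://web-scraping.dev/testimonials/")
    # Scroll one viewport at a time instead of jumping straight to the bottom
    for _ in range(50):  # safety cap on scroll steps
        at_bottom = page.evaluate(
            """() => {
                window.scrollBy(0, window.innerHeight);
                return window.scrollY + window.innerHeight >= document.body.scrollHeight;
            }"""
        )
        page.wait_for_timeout(500)  # give lazy-loaded content time to appear
        if at_bottom:
            break
    print(f"loaded {page.locator('.testimonial').count()} testimonials")
    browser.close()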
Now that we've covered dynamic content loading, let's explore how to extract structured data from article-based lists, which present their own unique challenges.
Articles featuring lists (like "Top 10 Programming Languages" or "5 Best Travel Destinations") represent another valuable source of structured data. These lists are typically embedded within article content, organized under headings or with numbered sections.
For this example, let's scrape Scrapfly's own top-10 listicle article using requests and beautifulsoup:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://scrapfly.io/blog/top-10-web-scraping-libraries-in-python/")

# Check if the request was successful
if response.status_code != 200:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    libraries = []
else:
    # Parse the HTML content with BeautifulSoup
    # Using 'lxml' parser for better performance and more robust parsing
    soup = BeautifulSoup(response.text, 'lxml')
    # Find all h2 headings which represent the list items
    headings = soup.find_all('h2')
    libraries = []
    for heading in headings:
        # Get the heading text (library name)
        title = heading.text.strip()
        # Skip the "Summary" section
        if title.lower() == "summary":
            continue
        # Get the next paragraph for a brief description
        # In BeautifulSoup, we use .find_next() to get the next element
        next_paragraph = heading.find_next('p')
        description = next_paragraph.text.strip() if next_paragraph else ''
        libraries.append({
            "name": title,
            "description": description
        })

# Print the results
print("Top Web Scraping Libraries in Python:")
for i, lib in enumerate(libraries, 1):
    print(f"{i}. {lib['name']}")
    print(f" {lib['description'][:100]}...")  # Print first 100 chars of description
Top Web Scraping Libraries in Python:
1. HTTPX
HTTPX is by far the most complete and modern HTTP client package for Python. It is inspired by the p...
2. Parsel and LXML
LXML is a fast and feature-rich HTML/XML parser for Python. It is a wrapper around the C library lib...
3. BeautifulSoup
Beautifulsoup (aka bs4) is another HTML parser library in Python. Though it's much more than that....
4. JMESPath and JSONPath
JMESPath and JSONPath are two libraries that allow you to query JSON data using a query language sim...
5. Playwright and Selenium
Headless browsers are becoming very popular in web scraping as a way to deal with dynamic javascript...
6. Cerberus and Pydantic
An often overlooked process of web scraping is the data quality assurance step. Web scraping is a un...
7. Scrapfly Python SDK
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale....
8. Related Posts
Learn how to efficiently find all URLs on a domain using Python and web crawling. Guide on how to cr...
In this example, we used the requests library to make an HTTP GET request to a blog post about the top web scraping libraries in Python. We then used BeautifulSoup to parse the HTML content of the page and extract the list of libraries and their descriptions. Finally, we printed the results to the console.
Extracting data from list articles requires understanding the content structure and accounting for variations in formatting. Some articles may use numbering in headings, while others rely solely on heading hierarchy. A robust crawler should handle these variations and clean the extracted text to remove extraneous content.
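As a minimal sketch of that kind of cleanup, the snippet below strips leading numbering such as "1." or "3)" from headings before storing them; the regex and the clean_heading name are illustrative assumptions, not part of the original example.
import re

def clean_heading(text):
    # Remove leading numbering like "1.", "2)", "10 -" plus surrounding whitespace
    return re.sub(r"^\s*\d+\s*[.)\-:]?\s*", "", text).strip()

print(clean_heading("3. BeautifulSoup"))  # -> "BeautifulSoup"
print(clean_heading("10) Playwright"))    # -> "Playwright"
print(clean_heading("HTTPX"))             # -> "HTTPX" (unchanged)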
There are also dedicated tools that can assist you with listicle scraping.
Let's see tabular data next, which presents yet another structure for list information.
Tables represent another common format for presenting list data on the web. Whether implemented as HTML <table> elements or styled as tables using CSS grids or other layout techniques, they provide a structured way to display related data in rows and columns.
For this example, let's look at the table data section on the web-scraping.dev/product/1 page.
Here's how to extract data from HTML tables using the BeautifulSoup HTML parsing library:
from bs4 import BeautifulSoup
import requests

response = requests.get("https://web-scraping.dev/product/1")
html = response.text
soup = BeautifulSoup(html, "lxml")

# First, select the desired table element (the 2nd one on the page)
table = soup.find_all('table', {'class': 'table-product'})[1]

headers = []
rows = []
for i, row in enumerate(table.find_all('tr')):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all('th')]
    else:
        rows.append([el.text.strip() for el in row.find_all('td')])

print(headers)
print(rows)
['Version', 'Package Weight', 'Package Dimension', 'Variants', 'Delivery Type']
[['Pack 1', '1,00 kg', '100x230 cm', '6 available', '1 Day shipping'], ['Pack 2', '2,11 kg', '200x460 cm', '6 available', '1 Day shipping'], ['Pack 3', '3,22 kg', '300x690 cm', '6 available', '1 Day shipping'], ['Pack 4', '4,33 kg', '400x920 cm', '6 available', '1 Day shipping'], ['Pack 5', '5,44 kg', '500x1150 cm', '6 available', '1 Day shipping']]
In the above code, we're identifying and parsing an HTML table, extracting the header row and the data rows separately. This gives you structured data that preserves the relationships between columns and rows; for tables without explicit header cells, you would adjust the row handling accordingly.
When crawling tables, it's important to look beyond the obvious <table> elements. Many modern websites implement table-like layouts using CSS grid, flexbox, or other techniques. Identifying these structures requires careful inspection of the DOM and adapting your selectors accordingly.
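As a minimal sketch of that idea, the snippet below parses a hypothetical div-based "table" built with CSS grid; the class names (.spec-table, .spec-row, .spec-name, .spec-value) are assumptions for illustration and would need to match whatever the target site actually uses.
from bs4 import BeautifulSoup

# Hypothetical markup: a grid of rows styled to look like a table
html = """
<div class="spec-table">
  <div class="spec-row"><div class="spec-name">Weight</div><div class="spec-value">1,00 kg</div></div>
  <div class="spec-row"><div class="spec-name">Dimensions</div><div class="spec-value">100x230 cm</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
specs = {}
for row in soup.select(".spec-table .spec-row"):
    name = row.select_one(".spec-name").text.strip()
    value = row.select_one(".spec-value").text.strip()
    specs[name] = value

print(specs)  # {'Weight': '1,00 kg', 'Dimensions': '100x230 cm'}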
Most table structures are easy to handle with BeautifulSoup using CSS selector or XPath powered logic, though for more generic solutions you can turn to LLMs and AI. One commonly used technique is to have an LLM convert HTML to Markdown, which can often produce accurate tables from flexible HTML table structures.
Now, let's explore how to crawl search engine results pages for list-type content.
Search Engine Results Pages (SERPs) offer a treasure trove of list-based content, presenting curated links to pages relevant to specific keywords. Crawling SERPs can help you discover list articles and other structured content across the web.
Here's a basic approach to crawling Google search results:
import requests
from bs4 import BeautifulSoup
import urllib.parse

def crawl_google_serp(query, num_results=10):
    # Format the query for URL
    encoded_query = urllib.parse.quote(query)
    # Create Google search URL
    url = f"https://www.google.com/search?q={encoded_query}&num={num_results}"
    # Add headers to mimic a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # Extract search results
    results = []
    # Target the organic search results
    for result in soup.select("div.g"):
        title_element = result.select_one("h3")
        if title_element:
            title = title_element.text
            # Extract URL
            link_element = result.select_one("a")
            link = link_element.get("href") if link_element else None
            # Extract snippet
            snippet_element = result.select_one("div.VwiC3b")
            snippet = snippet_element.text if snippet_element else None
            results.append({
                "title": title,
                "url": link,
                "snippet": snippet
            })
    return results
Alternatively, the same search results can be scraped through the Scrapfly API, which can also extract the SERP data with an AI model:
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR-SCRAPFLY-KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.google.com/search?q=python",
    # select country to get localized results
    country="us",
    # enable cloud browsers
    render_js=True,
    # scroll to the bottom of the page
    auto_scroll=True,
    # use AI to extract data
    extraction_model="search_engine_results",
))
print(result.content)
In the requests-based snippet, we're constructing a Google search query URL, sending an HTTP request with browser-like headers, and then parsing the HTML to extract organic search results. Each result includes the title, URL, and snippet text, which can help you identify list-type content for further crawling.
It's worth noting that directly crawling search engines can be challenging due to very strong anti-bot measures. For production applications, you may need more sophisticated techniques to avoid blocks; see our blocking bypass introduction tutorial for an overview.
Scrapfly can easily bypass all SERP blocking measures and return AI-extracted data for any SERP page using its AI Web Scraping API.
To wrap up - let's move on to some frequently asked questions about list crawling.
Below are quick answers to common questions about list crawling techniques and best practices:
How is list crawling different from general web scraping?
List crawling focuses on extracting structured data from lists, such as paginated content, infinite scrolls, and tables. General web scraping targets various elements across different pages, while list crawling requires specific techniques for handling pagination, scroll events, and nested structures.
How can I avoid rate limiting while crawling lists?
Use adaptive delays (1-3 seconds) and increase them if you get 429 errors. Implement exponential backoff for failed requests and rotate proxies to distribute traffic. A request queuing system helps maintain a steady and sustainable request rate.
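Here is a minimal sketch of retrying with exponential backoff around requests; the retry count and delay values are illustrative assumptions, and proxy rotation plus a full request queue are left out for brevity.
import time
import requests

def get_with_backoff(url, max_retries=5, base_delay=1.0):
    # Retry on 429/5xx responses, doubling the wait after each failure
    for attempt in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code not in (429, 500, 502, 503):
            return response
        wait = base_delay * (2 ** attempt)
        print(f"Got {response.status_code}, retrying in {wait:.1f}s")
        time.sleep(wait)
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")

response = get_with_backoff("https://web-scraping.dev/products")
print(response.status_code)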
How do I crawl nested or hierarchical lists?
Identify nesting patterns using developer tools. Use a recursive function to process items and their children while preserving relationships. CSS selectors, XPath, and depth-first traversal help extract data while maintaining hierarchy; a recursive example is sketched below.
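As a minimal sketch of that recursive approach, the snippet below walks a nested <ul>/<li> structure depth-first and keeps parent-child relationships; the sample HTML and the parse_list name are illustrative assumptions.
from bs4 import BeautifulSoup

html = """
<ul id="categories">
  <li>Electronics
    <ul>
      <li>Phones</li>
      <li>Laptops</li>
    </ul>
  </li>
  <li>Clothing</li>
</ul>
"""

def parse_list(ul):
    # Recursively convert a <ul> into a list of {"name", "children"} dicts
    items = []
    for li in ul.find_all("li", recursive=False):
        name = li.find(string=True, recursive=False)
        child_ul = li.find("ul", recursive=False)
        items.append({
            "name": name.strip() if name else "",
            "children": parse_list(child_ul) if child_ul else [],
        })
    return items

soup = BeautifulSoup(html, "html.parser")
print(parse_list(soup.find("ul", id="categories")))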
List crawling is essential for extracting structured data from the web's many list formats. From product catalogs and social feeds to nested articles and data tables, each list type requires a tailored approach.
This guide has covered crawling static lists, paginated product catalogs, infinite scrolling feeds, list articles, HTML tables, and search engine results pages.
The techniques demonstrated here, from HTTP requests for static content to browser automation for dynamic pages, provide powerful tools for transforming unstructured web data into valuable, actionable insights.