How to ignore non-HTML URLs when web crawling?

by scrapecrow Oct 31, 2022

When web crawling, we often want to avoid non-HTML URLs, which could slow down or break our scraper. To do this, we can employ two types of validation rules:

First, we can check the URL's file extension against a list of common non-HTML file formats:

import posixpath
from urllib.parse import urlparse

IGNORED_EXTENSIONS = [
    # archives
    '7z', '7zip', 'bz2', 'rar', 'tar', 'tar.gz', 'xz', 'zip',
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv', 'm4a', 'm4v', 'flv', 'webm',
    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',
    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk',
]

url = "https://example.com/foo.pdf"
# splitext returns the extension with a leading dot (e.g. ".pdf"),
# so strip it before comparing against IGNORED_EXTENSIONS
# (note: compound extensions like "tar.gz" would need extra handling)
extension = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
if extension not in IGNORED_EXTENSIONS:
    print("+ url is valid")
else:
    print("- url is invalid")

However, not all URLs include file extensions. Alternatively, we can check the Content-Type header by sending a HEAD request, which retrieves only the response metadata (the headers) without downloading the body, at the cost of one extra network round trip per URL:

import requests

VALID_TYPES = [
    "text/html",
    # we might also want to scrape plain text files:
    "text/plain",
    # or json files:
    "application/json",
    # or even javascript files
    "application/javascript",
]

url = "https://example.com/foo.pdf"
# requests.head() does not follow redirects by default, so enable that
response = requests.head(url, allow_redirects=True)
# the header may carry extra parameters (e.g. "text/html; charset=utf-8"),
# so compare only the media type itself
content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
if content_type in VALID_TYPES:
    print("+ url is valid")
else:
    print("- url is invalid")
