How to ignore non-HTML URLs when web crawling?

When web crawling, we often want to avoid non-HTML URLs, which could slow down or break our scraper. To do this, we can employ two types of validation rules:

First, we can check the URL's file extension against a list of common non-HTML file formats:

import posixpath
from urllib.parse import urlparse

IGNORED_EXTENSIONS = [
    # archives
    '7z', '7zip', 'bz2', 'rar', 'tar', 'tar.gz', 'xz', 'zip',
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv', 'm4a', 'm4v', 'flv', 'webm',
    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',
    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk',
]

url = "https://example.com/foo.pdf"
# splitext returns the extension with a leading dot (e.g. ".pdf"),
# so strip it before comparing against the list above
extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
if extension not in IGNORED_EXTENSIONS:
    print("+ url is valid")
else:
    print("- url is invalid")
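
For example, the same check can be applied to a whole batch of crawled links. A minimal sketch, reusing the IGNORED_EXTENSIONS list above (the example links are hypothetical):

# filter a hypothetical list of crawled links down to likely-HTML URLs
crawled_links = [
    "https://example.com/about",
    "https://example.com/report.pdf",
    "https://example.com/logo.png",
]
html_links = [
    link for link in crawled_links
    if posixpath.splitext(urlparse(link).path)[1].lower().lstrip(".") not in IGNORED_EXTENSIONS
]
print(html_links)  # only "https://example.com/about" survives the filter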

However, not all URLs include file extensions. Alternatively, we can check the Content-Type header of candidate URLs by sending HEAD requests, which retrieve only the response headers (the document's metadata) without downloading the body:

import requests

VALID_TYPES = [
    "text/html",
    # we might also want to scrape plain text files:
    "text/plain",
    # or json files:
    "application/json",
    # or even javascript files
    "application/javascript",
]

url = "https://example.com/foo.pdf"
response = requests.head(url)
# the Content-Type header often carries extra parameters, e.g. "text/html; charset=utf-8",
# so compare only the media type part
content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
if content_type in VALID_TYPES:
    print("+ url is valid")
else:
    print("- url is invalid")
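
In practice, the two checks are often combined: the cheap extension check rejects obviously unwanted URLs first, and a HEAD request confirms the rest. Here's a minimal sketch of such a helper, reusing the IGNORED_EXTENSIONS and VALID_TYPES lists defined above (the is_html_url name and the choice to treat a missing Content-Type header as invalid are our own assumptions):

import posixpath
import requests
from urllib.parse import urlparse

def is_html_url(url: str) -> bool:
    """Return True if the URL looks like it points to an HTML (or otherwise allowed) document."""
    # 1. cheap check: reject known non-HTML file extensions without any network request
    extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
    if extension in IGNORED_EXTENSIONS:
        return False
    # 2. confirm with a HEAD request: only the headers are downloaded
    response = requests.head(url)
    content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
    return content_type in VALID_TYPES

print(is_html_url("https://example.com/foo.pdf"))    # False - rejected by the extension check
print(is_html_url("https://example.com/"))           # result depends on the live response headers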

Provided by Scrapfly

This knowledgebase is provided by Scrapfly — a web scraping API that allows you to scrape any website without getting blocked and implements dozens of other web scraping conveniences. Check us out 👇