How to ignore non-HTML URLs when web crawling?

When web crawling, we often want to avoid non-HTML URLs (images, archives, media files and so on), which could slow down or break our scraper. To do this, we can employ two types of validation rules:

First, we can check the URL extension for common file formats:

import posixpath
from urllib.parse import urlparse

IGNORED_EXTENSIONS = {
    # archives ('.tar.gz' is covered by 'gz' because splitext() returns only the last suffix)
    '7z', '7zip', 'bz2', 'rar', 'tar', 'gz', 'xz', 'zip',
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv', 'm4a', 'm4v', 'flv', 'webm',
    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',
    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk',
}

url = ""  # the URL to check
# splitext() returns the extension with a leading dot (".png"), so strip it before the lookup
extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
if extension not in IGNORED_EXTENSIONS:
    print("+ url is valid")
else:
    print("- url is invalid")
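
The same check can be wrapped in a small helper so a crawler can call it for every link it discovers. This is only a minimal sketch: `has_ignored_extension` and the shortened `IGNORED` set are hypothetical names used for illustration, and `pathlib.PurePosixPath` is used here as an alternative to `posixpath.splitext`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical, shortened set for illustration; extend it with the full list above.
IGNORED = {"zip", "gz", "jpg", "jpeg", "png", "gif", "mp3", "mp4", "pdf", "css"}


def has_ignored_extension(url: str) -> bool:
    """Return True if the URL path ends in an extension we do not want to crawl."""
    # Parse first so the query string and fragment are not mistaken for an extension.
    path = urlparse(url).path
    # suffix is the last extension including the dot (e.g. ".PNG"), or "" if there is none.
    extension = PurePosixPath(path).suffix.lower().lstrip(".")
    return extension in IGNORED


for link in ["https://example.com/blog/post", "https://example.com/logo.PNG?v=2"]:
    print(link, "->", "skip" if has_ignored_extension(link) else "crawl")
```

Because only the last suffix is inspected, a path such as `/files/archive.tar.gz` is matched through the `gz` entry.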

However, not all URLs include a file extension. For those, we can check the Content-Type header with a HEAD request, which returns only the response headers and not the document body:

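import requests

VALID_TYPES = (
    "text/html",
    # we might also want to scrape plain text files:
    "text/plain",
    # or json files:
    "application/json",
    # or even javascript files:
    "application/javascript",
)

url = ""  # the URL to check
# requests.head() does not follow redirects by default, so enable them explicitly
response = requests.head(url, allow_redirects=True)
# drop parameters such as "; charset=utf-8" before comparing
content_type = response.headers.get('Content-Type', '').split(';')[0].strip().lower()
if content_type in VALID_TYPES:
    print("+ url is valid")
else:
    print("- url is invalid")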
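
Servers usually append parameters such as `; charset=utf-8` to the Content-Type header, and some omit the header entirely, so it helps to normalize the value before comparing it. The sketch below combines both rules, running the cheap extension check first and sending a HEAD request only when the URL passes it. The names `media_type`, `head_content_type` and `is_crawlable`, the shortened sets, and the use of the standard library's `urllib.request` in place of `requests` are all assumptions made to keep the example self-contained.

```python
import posixpath
import urllib.request
from urllib.parse import urlparse

# Hypothetical, shortened sets for illustration.
IGNORED = {"zip", "jpg", "png", "mp4", "pdf", "css"}
VALID = {"text/html"}


def media_type(header_value):
    """Reduce 'Text/HTML; charset=utf-8' to 'text/html'; a missing header becomes ''."""
    return (header_value or "").split(";")[0].strip().lower()


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header (or None)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def is_crawlable(url, fetch=head_content_type):
    """Apply the extension rule first, then the Content-Type rule."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
    if extension in IGNORED:
        return False  # rejected without any network traffic
    try:
        return media_type(fetch(url)) in VALID
    except (OSError, ValueError):
        return False  # unreachable or malformed URLs are skipped as well
```

Passing `fetch` as a parameter lets a crawler plug in its own HTTP client and makes the function easy to exercise with a stub, for example `is_crawlable(url, fetch=lambda u: "text/html")`.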
