How to ignore non HTML URLs when web crawling?

When web crawling we often want to avoid non-HTML URLs which could slow down or break our scraper. To do this we can employ two types of validation rules:

First, we can check the URL extension for common file formats:

import posixpath

IGNORED_EXTENSIONS = [
    # archives
    '7z', '7zip', 'bz2', 'rar', 'tar', 'tar.gz', 'xz', 'zip',
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv', 'm4a', 'm4v', 'flv', 'webm',
    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',
    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk',
]

url = "https://example.com/foo.pdf"
if posixpath.splitext(urlparse(url).path)[1].lower() not in self.IGNORED_EXTENSIONS:
    print("+ url is valid")
else:
    print("- url is invalid")

However, not all URLs include file extensions. Alternatively, we can check the Content-Type header of potential URLs using HEAD-type requests which only scrape the document's metadata:

import requests

VALID_TYPES = [
    "text/html",
    # we might also want to scrape plain text files:
    "text/plain",
    # or json files:
    "application/json",
    # or even javascript files
    "application/javascript",
]

url = "https://example.com/foo.pdf"
response = requests.head(url)
if response.headers['Content-Type'] in VALID_TYPES:
    print("+ url is valid")
else:
    print("- url is invalid")

Provided by Scrapfly

This knowledgebase is provided by Scrapfly data APIs, check us out! 👇