When web crawling, we often want to avoid non-HTML URLs, which could slow down or break our scraper. To do this, we can apply two types of validation rules:
First, we can check the URL extension for common file formats:
import posixpath
from urllib.parse import urlparse
IGNORED_EXTENSIONS = [
# archives
'7z', '7zip', 'bz2', 'rar', 'tar', 'tar.gz', 'xz', 'zip',
# images
'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
# audio
'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
# video
'3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv', 'm4a', 'm4v', 'flv', 'webm',
# office suites
'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',
# other
'css', 'pdf', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk',
]
url = "https://example.com/foo.pdf"
# splitext() returns the extension with a leading dot (e.g. ".pdf"),
# so strip it before comparing against the list above
extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
if extension not in IGNORED_EXTENSIONS:
    print("+ url is valid")
else:
    print("- url is invalid")
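To reuse this check across a crawler, we can wrap it in a small predicate. This is a minimal sketch assuming the imports and IGNORED_EXTENSIONS list from above; the helper name is our own, chosen just for illustration:
def has_ignored_extension(url: str) -> bool:
    """Return True if the URL path ends with a known non-HTML file extension."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
    return extension in IGNORED_EXTENSIONS

print(has_ignored_extension("https://example.com/report.pdf"))  # True - skip this URL
print(has_ignored_extension("https://example.com/products"))    # False - worth crawling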
However, not all URLs include file extensions. Alternatively, we can check the Content-Type
header of candidate URLs using HEAD requests, which retrieve only the document's metadata (the response headers) rather than the full body:
import requests
VALID_TYPES = [
"text/html",
# we might also want to scrape plain text files:
"text/plain",
# or json files:
"application/json",
# or even javascript files
"application/javascript",
]
url = "https://example.com/foo.pdf"
response = requests.head(url)
# the Content-Type header often carries parameters, e.g. "text/html; charset=utf-8",
# so compare only the media type itself
content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
if content_type in VALID_TYPES:
    print("+ url is valid")
else:
    print("- url is invalid")
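In practice, the two rules complement each other: the extension check costs nothing, while the HEAD request costs a network round trip, so it makes sense to apply them in that order. Here's a minimal sketch combining both, assuming the IGNORED_EXTENSIONS and VALID_TYPES lists defined above; the function name, the timeout value, and the error handling are our own choices:
def is_crawlable(url: str) -> bool:
    """Cheap extension check first, then a HEAD request to confirm the Content-Type."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
    if extension in IGNORED_EXTENSIONS:
        return False
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False  # unreachable URLs aren't worth queuing either
    content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
    return content_type in VALID_TYPES

print(is_crawlable("https://example.com/foo.pdf"))     # False - extension is ignored
print(is_crawlable("https://example.com/index.html"))  # True if the server returns text/html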