How to ignore non-HTML URLs when web crawling?

import posixpath
from urllib.parse import urlparse

# file extensions that almost never point to an HTML page
IGNORED_EXTENSIONS = [
    # archives (splitext() yields 'gz' for 'file.tar.gz', so 'gz' is listed rather than 'tar.gz')
    '7z', '7zip', 'bz2', 'rar', 'tar', 'gz', 'xz', 'zip',
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif', 'tiff',
    'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a', 'm4v', 'flv', 'webm',
    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',
    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk',
]

url = "https://example.com/foo.pdf"
# splitext() returns the extension with a leading dot (".pdf"), so strip it before comparing
extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
if extension not in IGNORED_EXTENSIONS:
    print("+ url is valid")
else:
    print("- url is invalid")
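The extension check can also be wrapped in a small helper and applied to a whole list of discovered links. The following is only a sketch: the function name `has_ignored_extension` and the shortened `NON_HTML_EXTENSIONS` set are made up for illustration and are not part of any library.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, shortened set for illustration; extend it as needed.
NON_HTML_EXTENSIONS = {"pdf", "zip", "gz", "jpg", "png", "mp4", "css"}


def has_ignored_extension(url, ignored=NON_HTML_EXTENSIONS):
    """Return True if the URL path ends in an extension from `ignored`."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    return extension in ignored


urls = [
    "https://example.com/foo.pdf",
    "https://example.com/archive.tar.gz?download=1",
    "https://example.com/blog/post",
    "https://example.com/index.HTML",
]
# keep only the URLs that do not look like binary or static assets
crawlable = [u for u in urls if not has_ignored_extension(u)]
print(crawlable)
```

Parsing the URL first matters here: without `urlparse`, a query string such as `?download=1` would end up inside the "extension" and the file would slip through the filter.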

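import requests

VALID_TYPES = [
    "text/html",
    # we might also want to scrape plain text files:
    "text/plain",
    # or json files:
    "application/json",
    # or even javascript files:
    "application/javascript",
]

url = "https://example.com/foo.pdf"
# a HEAD request fetches only the response headers, not the body
response = requests.head(url, allow_redirects=True)
# the header may carry parameters ("text/html; charset=utf-8") or be missing entirely
content_type = response.headers.get("Content-Type", "").split(";")[0].strip().lower()
if content_type in VALID_TYPES:
    print("+ url is valid")
else:
    print("- url is invalid")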
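The two checks can be combined so that the cheap extension test runs first and a HEAD request is sent only for URLs that pass it. Below is a minimal sketch using only the standard library rather than `requests`; the names `should_crawl`, `head_content_type`, `SKIP_EXTENSIONS` and `HTML_TYPES` are hypothetical, and the short extension set is a placeholder.

```python
import posixpath
from urllib.parse import urlparse
from urllib.request import Request, urlopen

SKIP_EXTENSIONS = {"pdf", "zip", "jpg", "png", "mp4"}  # hypothetical short list
HTML_TYPES = {"text/html"}


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header ('' if absent)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


def should_crawl(url, get_content_type=head_content_type):
    """Return True only if the URL passes both the extension and the header check."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
    if extension in SKIP_EXTENSIONS:
        return False  # rejected without any network traffic
    try:
        header = get_content_type(url)
    except OSError:  # urllib's URLError and HTTPError are OSError subclasses
        return False
    # drop parameters such as "; charset=utf-8" before comparing
    media_type = header.split(";")[0].strip().lower()
    return media_type in HTML_TYPES
```

Calling `should_crawl("https://example.com/foo.pdf")` returns `False` without sending a request. The `get_content_type` parameter exists so the header lookup can be swapped out, for example with a cached lookup or a stub in tests. Treating every failed request as "do not crawl" is a design choice of this sketch; a real crawler may prefer to retry.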

Provided by Scrapfly
