How to get the file type of a URL in Python?

To get the file type of a URL we have two options: check the URL string for a file suffix, or perform a HEAD request and read the Content-Type header:

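import mimetypes

import httpx

# The mimetypes module can guess the type from a file extension in the URL string.
# The example.com URLs below are placeholders, not the original article's URLs:
mimetypes.guess_type("https://example.com/file.pdf")
# ('application/pdf', None)
mimetypes.guess_type("https://example.com/file.mp3")
# ('audio/mpeg', None)
# URLs without a file extension cannot be guessed:
mimetypes.guess_type("https://example.com/file")
# (None, None)

# For URLs without an extension we can send a HEAD request, which retrieves only the response headers:
content_type = httpx.head("https://example.com/").headers["Content-Type"]
# e.g. 'text/html; charset=utf-8'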

When web scraping or crawling, knowing the content type before retrieving a URL's contents can save a lot of bandwidth and speed up the process. For example, when crawling we usually want to follow only HTML pages and skip media files.
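
As a rough illustration of that crawling filter, the two checks can be combined into one helper. This is a minimal sketch, not a definitive implementation: the names `head_content_type` and `is_html`, the `fetch` parameter, and the choice of which media types count as HTML are all assumptions made for this example. It also assumes a recent httpx release in which `follow_redirects` is the redirect option.

```python
import mimetypes
from typing import Callable


def head_content_type(url: str) -> str:
    """Return the Content-Type header of a HEAD response (needs network)."""
    import httpx  # imported lazily so the rest of the sketch runs without it

    response = httpx.head(url, follow_redirects=True)
    return response.headers.get("Content-Type", "")


def is_html(url: str, fetch: Callable[[str], str] = head_content_type) -> bool:
    """Guess from the URL suffix first; fall back to a HEAD request."""
    guessed, _ = mimetypes.guess_type(url)
    if guessed is None:
        # No usable suffix: ask the server, then drop parameters such as charset.
        guessed = fetch(url).split(";", 1)[0].strip().lower()
    # Hypothetical allow-list of media types treated as HTML pages.
    return guessed in ("text/html", "application/xhtml+xml")
```

In a crawler this could be used to filter a list of discovered links before requesting them, for example `[u for u in links if is_html(u)]`. Checking the suffix first means most media URLs are rejected without any network traffic, and the `fetch` argument makes it possible to swap in a different HTTP client.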

