How to get file type of an URL in Python?

To get the file type of an URL we have 2 options - check the URL string for file suffix or perform a HEAD request:

import mimetypes

# mimetypes module can analysize string for file extensions:
mimetypes.guess_type("http://example.com/file.pdf")
('application/pdf', None)
mimetypes.guess_type("http://example.com/song.mp3")
('audio/mpeg', None)


mimetypes.guess_type("http://example.com/file-without-extension")
(None, None)
# for files without extension we can make head request which only downloads the metadata
import httpx
response = httpx.head("http://httpbin.dev/html").headers['Content-Type']
'text/html; charset=utf-8'
httpx.head("https://wiki.mozilla.org/images/3/37/Mozilla_MDN_Guide.pdf").headers['Content-Type']
'application/pdf'

When web scraping and web crawling knowing content type before retrieving URL contents can save a lot of bandwidth and speed up the web scraping process. For example, when crawling we only want to follow HTML pages and avoid media files.

Question tagged: Web Crawling, Python

Related Posts

How to Scrape Sitemaps to Discover Scraping Targets

Usually to find scrape targets we look at site search or category pages but there's a better way - sitemaps! In this tutorial, we'll be taking a look at how to find and scrape sitemaps for target locations.

Creating Search Engine for any Website using Web Scraping

Guide for creating a search engine for any website using web scraping in Python. How to crawl data, index it and display it via js powered GUI.