There are two ways to get the file type of a URL: check the URL string for a file extension, or perform a HEAD request:
import mimetypes
# the mimetypes module can analyze a string for file extensions:
mimetypes.guess_type("http://example.com/file.pdf")
('application/pdf', None)
mimetypes.guess_type("http://example.com/song.mp3")
('audio/mpeg', None)
mimetypes.guess_type("http://example.com/file-without-extension")
(None, None)
# for files without an extension we can make a HEAD request, which only downloads the metadata (response headers)
import httpx
httpx.head("https://httpbin.dev/html").headers['Content-Type']
'text/html; charset=utf-8'
httpx.head("https://wiki.mozilla.org/images/3/37/Mozilla_MDN_Guide.pdf").headers['Content-Type']
'application/pdf'
When web scraping or crawling, knowing the content type before retrieving a URL's contents can save a lot of bandwidth and speed up the process. For example, when crawling we usually want to follow only HTML pages and skip media files.
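
The two checks can be combined into one helper that tries the cheap extension lookup first and only falls back to a HEAD request when the URL has no recognizable suffix. This is a minimal sketch, not a definitive implementation: the function names (`head_content_type`, `guess_content_type`, `is_html`) and the `fallback` parameter are hypothetical, and `follow_redirects=True` is an assumption, since httpx does not follow redirects by default.

```python
import mimetypes
from urllib.parse import urlparse


def head_content_type(url):
    """Ask the server for the Content-Type via a HEAD request (needs network)."""
    import httpx  # imported lazily so the suffix check works without httpx installed

    # Assumption: follow redirects so we read the headers of the final resource
    response = httpx.head(url, follow_redirects=True)
    return response.headers.get("Content-Type")


def guess_content_type(url, fallback=head_content_type):
    """Return the bare MIME type for a URL, or None if it cannot be determined."""
    # Look only at the path so query strings do not influence the guess
    guessed, _encoding = mimetypes.guess_type(urlparse(url).path)
    if guessed is None:
        header = fallback(url)
        # Drop parameters such as "; charset=utf-8"
        guessed = header.split(";")[0].strip().lower() if header else None
    return guessed


def is_html(url, fallback=head_content_type):
    """True when the URL most likely points to an HTML page worth crawling."""
    return guess_content_type(url, fallback) in ("text/html", "application/xhtml+xml")
```

A crawler could call `is_html(url)` before queueing a link. Keep in mind that the extension is only a hint: a server is free to return any content type for any URL, and some servers do not respond to HEAD requests the same way they respond to GET.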