How to get the file type of a URL in Python?

To get the file type of a URL we have two options: check the URL string for a file suffix or perform a HEAD request:

import mimetypes

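# The mimetypes module can guess a file type from the extension in a URL string
# (the example.com URLs below are placeholders):
mimetypes.guess_type("https://example.com/file.pdf")
# ('application/pdf', None)
mimetypes.guess_type("https://example.com/song.mp3")
# ('audio/mpeg', None)

# URLs without a file extension cannot be guessed this way:
mimetypes.guess_type("https://example.com/file")
# (None, None)

# For those we can make a HEAD request, which downloads only the response headers:
import httpx
url = "https://example.com/file"  # placeholder URL
content_type = httpx.head(url).headers["Content-Type"]
# e.g. 'text/html; charset=utf-8'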

When web scraping and web crawling, knowing the content type before retrieving a URL's contents can save a lot of bandwidth and speed up the scraping process. For example, when crawling we usually want to follow only HTML pages and skip media files.
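
As a rough illustration of that crawling filter, the two checks can be combined: guess from the URL suffix first, and fall back to a HEAD request only when the suffix tells us nothing. This is a minimal sketch, not a definitive implementation. The names `head_content_type` and `is_html_page` are hypothetical helpers made up for this example, and the `fetch_content_type` parameter is an assumption added so the network call can be swapped out.

```python
import mimetypes
from urllib.parse import urlsplit


def head_content_type(url: str) -> str:
    """Return the Content-Type header from a HEAD request ('' if absent)."""
    import httpx  # imported lazily so the suffix check works without httpx

    response = httpx.head(url, follow_redirects=True)
    return response.headers.get("Content-Type", "")


def is_html_page(url: str, fetch_content_type=head_content_type) -> bool:
    """Hypothetical crawler filter: True if the URL appears to serve HTML."""
    # Guess from the path only, so a query string does not hide the suffix.
    guessed, _ = mimetypes.guess_type(urlsplit(url).path)
    if guessed is not None:
        return guessed == "text/html"
    # No recognizable suffix: ask the server and drop parameters like charset.
    content_type = fetch_content_type(url)
    return content_type.split(";")[0].strip().lower() == "text/html"
```

With this, a crawler could call `is_html_page(url)` before queuing a link. Note that the suffix is only a hint: a server is free to return any content type for any URL, so the HEAD request is the more reliable (but slower) check.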

Question tagged: Web Crawling, Python
