# How to ignore non HTML URLs when web crawling?

 by [Bernardas Alisauskas](https://scrapfly.io/blog/author/bernardas) Oct 31, 2022 1 min read [\#crawling](https://scrapfly.io/blog/tag/crawling) 


 

 

When web crawling, we often want to skip non-HTML URLs that could slow down or break our scraper. To do this, we can apply two types of validation rules:

First, we can check the URL extension for common file formats:

```python
import posixpath
from urllib.parse import urlparse

IGNORED_EXTENSIONS = [
    # archives
    '7z', '7zip', 'bz2', 'rar', 'tar', 'tar.gz', 'xz', 'zip',
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv', 'm4a', 'm4v', 'flv', 'webm',
    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',
    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk',
]

url = "https://example.com/foo.pdf"
# splitext keeps the leading dot (".pdf") and only captures the final
# suffix, so strip the dot before comparing against the list
extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
if extension not in IGNORED_EXTENSIONS:
    print("+ url is valid")
else:
    print("- url is invalid")
```



However, not all URLs include file extensions. Alternatively, we can check the `Content-Type` header of potential URLs using HEAD requests, which retrieve only the response headers rather than the full document:

```python
import requests

VALID_TYPES = [
    "text/html",
    # we might also want to scrape plain text files:
    "text/plain",
    # or json files:
    "application/json",
    # or even javascript files
    "application/javascript",
]

url = "https://example.com/foo.pdf"
response = requests.head(url)
# the header may carry parameters, e.g. "text/html; charset=utf-8",
# so compare only the media type portion
content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
if content_type in VALID_TYPES:
    print("+ url is valid")
else:
    print("- url is invalid")
```
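In practice, the two checks complement each other: the extension test is free, while the HEAD request costs a network round-trip, so it pays to run the cheap check first. Here is a minimal sketch combining both (the `is_crawlable` helper and the trimmed lists are our own, for illustration):

```python
import posixpath
from urllib.parse import urlparse

import requests

# trimmed illustration sets — extend with the full lists above
IGNORED_EXTENSIONS = {"pdf", "zip", "jpg", "png", "mp4", "exe", "css"}
VALID_TYPES = {"text/html", "text/plain"}

def is_crawlable(url: str, check_headers: bool = False) -> bool:
    """Return True if the URL likely points to a crawlable document."""
    # cheap check first: file extension in the URL path
    extension = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
    if extension in IGNORED_EXTENSIONS:
        return False
    # optional network check, useful for extensionless URLs
    if check_headers:
        response = requests.head(url, allow_redirects=True, timeout=10)
        content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
        return content_type in VALID_TYPES
    return True

print(is_crawlable("https://example.com/report.pdf"))  # False
print(is_crawlable("https://example.com/about"))       # True
```

Reserving `check_headers=True` for URLs that pass the extension test keeps the number of extra HEAD requests low on large crawls.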



For managed crawling that handles URL filtering automatically, Scrapfly's [Crawler API](https://scrapfly.io/crawler-api) filters non-HTML content and deduplicates URLs, so you don't need to write custom filtering logic.



 

    






  



   


