MissingSchema
error can be seen when using Python requests module to scrape invalid URLs without protocol indicator (the http://
bit).
This usually happens when we accidently supply the scraper with relative URLs instead of absolute URLs:
import requests
requests.get("/product/25") # default redirect limit is 30
# will raise:
# MissingSchema: Invalid URL '/product/10': No scheme supplied. Perhaps you meant http:///product/10?
When web scraping, it's best to always ensure the scraped URLs are absolute using the urljoin()
function:
from urllib.parse import urljoin
import requests
response = requests.get("http://example.com")
urls = [ # lets assume we got this batch of product urls:
"/product/1",
"/product/2",
"/product/3",
]
for relative_url in urls:
absolute_url = urljoin(response.url, relative_url)
# this will result in: http://example.com/product/1
item_response = requests.get(absolute_url)
This knowledgebase is provided by Scrapfly — a web scraping API that allows you to scrape any website without getting blocked and implements a dozens of other web scraping conveniences. Check us out 👇