
How to Fix 403 Forbidden Errors When Web Scraping

You send a request with Python and get a 403 Forbidden. The page loads fine in your browser, but your script hits a wall. In web scraping, a 403 rarely means you lack permission to view the page. It means the server flagged your request as automated.

In this guide, we'll cover what causes 403 errors in web scraping, the five detection vectors that trigger them, and seven Python fixes from proper headers to TLS fingerprinting. Let's get started!

Key Takeaways

  • A 403 Forbidden in web scraping means the server detected your request as automated, not that you lack access
  • Five detection vectors cause most 403 errors: missing headers, User-Agent strings, IP reputation, TLS fingerprints, and cookie validation
  • Fix 403 errors by escalating through solutions: headers, User-Agent rotation, delays, proxies, sessions, headless browsers, and TLS matching
  • Start with the simplest fix and escalate only when needed
  • Use Scrapfly when manual fixes get too complex to maintain

What Is a 403 Forbidden Error?

HTTP 403 Forbidden tells the client it can't access a resource. Unlike 401 Unauthorized (which means "who are you?"), a 403 means "I know who you are, but you're not allowed in."

The HTTP spec defines 403 for cases where the server understands the request but refuses to authorize it. Common causes include:

  • The user account lacks permission for that resource
  • The client's IP address is on a blocklist
  • Rate limits have been exceeded
  • The connection looks automated

In traditional web development, 403 errors usually signal a permissions problem. A user tries to open an admin page without admin rights, and the server returns 403.

Web scraping is different. When your Python script gets a 403 from a public page, the server isn't checking login credentials. It's checking whether you look like a real browser or a bot. Anti-bot systems from Cloudflare, Akamai, and DataDome analyze your request and block anything that doesn't match a real browser.

Response headers can give clues about why you were blocked. Check for X- prefixed headers that mention rate limits or blocking reasons. Some services return error pages with details, while others give a generic "Access Denied."

While 403 errors happen for many reasons, web scrapers hit them because of bot detection. The rest of this guide focuses on diagnosing and fixing that.

Why Do Web Scrapers Get 403 Errors?

Anti-bot systems check multiple signals to separate humans from bots. When even one signal looks off, the server returns a 403. Here are the five main detection vectors.

Missing or Suspicious HTTP Headers

A bare httpx.get(url) call sends only a handful of default headers. A real Chrome browser sends 15 or more. Servers check for headers like Accept, Accept-Language, Accept-Encoding, and the Sec-Fetch-* family. Missing these headers is a red flag.

Some anti-bot systems also check header order. Chrome sends headers in a specific sequence. If your HTTP client sends them differently, the server notices the mismatch.

User-Agent Detection

Python's default User-Agent (python-httpx/0.x.x or python-requests/2.x) gets blocked everywhere. Every anti-bot system rejects these strings. Some websites also maintain blocklists of known bot User-Agents.

Changing your User-Agent alone often isn't enough. If the rest of your headers still look like a Python script, the server catches the mismatch. But a proper User-Agent is a required first step.

IP Reputation and Rate Limiting

Datacenter IP addresses raise suspicion. Most real users browse from residential IPs provided by their internet service provider. If your scraper runs on AWS, Google Cloud, or a similar provider, many anti-bot systems block it on sight.

Rate limiting adds another layer. Too many requests from one IP in a short window triggers blocking. Some servers return a 429 Too Many Requests for this, but others use 403 to hide the blocking reason.

TLS and Browser Fingerprinting

This is the top reason "it works in my browser but not in Python." Every TLS client has a unique fingerprint based on its handshake parameters. Python's TLS fingerprint looks nothing like Chrome's.

Anti-bot systems use JA3 and JA4 hashing to identify which TLS library made the request. They also check HTTP/2 settings and frame ordering. Even with perfect headers and a Chrome User-Agent, a Python TLS fingerprint gives you away.

Cookie Validation

Some sites set cookies on the first visit and expect them on later requests. If you skip the homepage and go straight to a product page, the missing cookies trigger a 403.

JavaScript-set cookies make this harder. Python HTTP clients can't run JavaScript, so they miss cookies that a browser would receive. This explains the common pattern where your first request works but follow-up requests return 403.

How to Fix 403 Forbidden When Web Scraping

These seven solutions go from simplest to most advanced. Start at the top and move down until your 403 errors stop. Each fix targets a specific detection vector, so matching your symptom to the right solution saves time.

Set Proper HTTP Headers

The quickest fix for 403 errors is sending browser-like headers. This works when the server checks for missing headers but doesn't inspect TLS fingerprints.

Header Example

import httpx

# Mimic Chrome's request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

response = httpx.get("https://web-scraping.dev/products", headers=headers)
print(response.status_code)

This header set covers what most servers check. Pay attention to the Sec-Fetch-* headers since many anti-bot systems flag requests that miss them.

Rotate User-Agents

If one User-Agent gets blocked after several requests, rotating through a list helps. Combine this with full header sets for the best results.

User-Agent Rotation

import httpx
import random

user_agents = [
    # Chrome on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    # Chrome on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    # Chrome on Linux
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:134.0) Gecko/20100101 Firefox/134.0",
    # Firefox on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.7; rv:134.0) Gecko/20100101 Firefox/134.0",
]

# Pick a random User-Agent for each request
headers = {"User-Agent": random.choice(user_agents)}
response = httpx.get("https://web-scraping.dev/products", headers=headers)
print(response.status_code)

Keep your User-Agent list updated. Outdated browser versions stand out in server logs and can trigger blocks on their own.

Add Request Delays and Throttling

Rapid-fire requests with consistent timing look nothing like human browsing. Adding random delays between requests helps avoid rate-limiting 403 errors.

Delay Example

import httpx
import time
import random

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
]

for url in urls:
    response = httpx.get(url)
    print(f"{url}: {response.status_code}")
    # Random delay between 2 and 5 seconds
    time.sleep(random.uniform(2, 5))

Use random.uniform() instead of a fixed delay so each pause is different. Consistent timing is a signal on its own. For async code, swap time.sleep() with asyncio.sleep().

Use Proxy Rotation

When 403 errors come from IP blocking or rate limiting, proxy rotation spreads your requests across many IP addresses. Residential proxies work best because their IPs match real internet users.

Proxy Rotation Example

import httpx
import random

proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

# Pick a new proxy for each request
for _ in range(3):
    proxy = random.choice(proxies)
    with httpx.Client(proxy=proxy) as client:
        response = client.get("https://web-scraping.dev/products")
        print(response.status_code)

Datacenter proxies are cheaper but get blocked more often. Residential proxies cost more but pass IP reputation checks. Free proxies are unreliable and often already blocklisted.

Handle Cookies and Sessions

If your scraper gets 403 on the second or third request, missing cookies are likely the cause. Use an httpx Client to keep cookies across requests.

Session Example

import httpx

# Client persists cookies across requests
with httpx.Client() as client:
    # Visit homepage to collect session cookies
    client.get("https://web-scraping.dev/")

    # Later requests include those cookies
    response = client.get("https://web-scraping.dev/products")
    print(response.status_code)
    print(dict(client.cookies))

The httpx Client works like a browser session. It stores cookies from each response and sends them with the next request. This fixes the "works once, then 403" pattern that many scrapers run into.

Use a Headless Browser

When a site requires JavaScript to set cookies or pass bot checks, HTTP clients alone won't work. A headless browser runs real JavaScript and passes most fingerprint checks.

Playwright Example

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate like a real browser
    page.goto("https://web-scraping.dev/products")

    # Get the rendered HTML
    html = page.content()
    print(f"Got {len(html)} characters of HTML")

    browser.close()

Headless browsers solve JavaScript fingerprinting and cookie issues. The tradeoff is speed and memory, so use them only when simpler fixes don't work.

Match TLS Fingerprints

This is the fix for "it works in my browser but not in Python." The curl_cffi library sends requests with Chrome's TLS fingerprint instead of Python's default. It solves TLS/JA3 detection without the overhead of a full browser.

curl_cffi Example

from curl_cffi import requests

# Impersonate Chrome's TLS fingerprint
response = requests.get(
    "https://web-scraping.dev/products",
    impersonate="chrome",
)
print(response.status_code)
print(response.text[:500])

The impersonate parameter tells curl_cffi to match Chrome's TLS handshake, HTTP/2 settings, and header order. You can also target exact versions like "chrome131". This library gives you the speed of HTTP clients with the fingerprint of a real browser.

When to Use Each Solution

Each solution targets a different detection method. Use this table to match your symptom with the right fix.

| Solution           | Fixes                      | Complexity | Speed Impact |
|--------------------|----------------------------|------------|--------------|
| Set proper headers | Missing headers detection  | Low        | None         |
| Rotate User-Agents | Basic bot detection        | Low        | None         |
| Add delays         | Rate limiting              | Low        | Slower       |
| Proxy rotation     | IP blocking, rate limiting | Medium     | Slight       |
| Handle sessions    | Cookie validation          | Medium     | None         |
| Headless browser   | JS fingerprinting          | High       | Much slower  |
| TLS fingerprinting | TLS/JA3 detection          | High       | None         |

Here's a quick decision guide based on what you're seeing:

  1. Getting 403 on the first request? Check your headers and User-Agent first.
  2. Getting 403 after several successes? Add delays and rotate proxies.
  3. Works in browser but not Python? Use curl_cffi for TLS matching or a headless browser.
  4. Getting 403 on return visits? Handle cookies with an httpx Client session.
  5. All of the above failing? Use Scrapfly to handle bypass for you.

Fix 403 Errors with Scrapfly

Scrapfly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product includes automatic bypass for anti-bot systems, achieved by:

  • Maintaining a fleet of real, reinforced web browsers with genuine fingerprint profiles.
  • Running millions of self-healing proxies with the highest possible trust scores.
  • Constantly evolving and adapting to new anti-bot systems.

We've been doing this publicly since 2020 with the best bypass on the market.

When manual fixes get too complex, Scrapfly handles 403 bypass for you. It routes requests through residential proxies, matches browser fingerprints, and runs JavaScript when needed.

Scrapfly Example

from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")

result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    # Turn on anti-bot bypass
    asp=True,
    # Use residential proxies
    proxy_pool="public_residential_pool",
    # Set target country
    country="us",
))
print(result.scrape_result['content'][:500])

One API call replaces headers, proxies, fingerprints, and session handling. Scrapfly picks the right bypass method for each target site.

FAQ

What is the difference between 401 Unauthorized and 403 Forbidden?

A 401 means the server doesn't know who you are, while a 403 means it knows you but won't let you in. In scraping, 401 points to a missing auth token and 403 usually means bot detection.

What is the difference between 403 Forbidden and 429 Too Many Requests?

A 429 explicitly signals a rate limit, while a 403 can mean the same thing but hides the reason. If you get 403 after many successful requests, rate limiting is the likely cause.

What is the difference between 403 Forbidden and 404 Not Found?

A 403 says the resource exists but you can't access it. A 404 says it doesn't exist. Some sites return 404 instead of 403 on purpose to hide the fact that a protected resource exists.

Why does my browser load the page but Python returns 403?

Your browser passes TLS fingerprint checks and runs JavaScript, but Python fails both. Fix this with curl_cffi for TLS matching or Playwright for full browser rendering.

How do I fix 403 Forbidden with Python requests?

Start by setting proper HTTP headers that look like Chrome. If that doesn't work, escalate through the seven solutions in this guide.

Does Cloudflare cause 403 errors when scraping?

Yes, Cloudflare uses TLS fingerprinting, JavaScript challenges, and behavior analysis to detect bots. See our Cloudflare bypass guide for targeted solutions.

Can rotating proxies fix 403 errors?

Only if the 403 comes from IP blocking or rate limiting. If the server blocks based on TLS fingerprints or missing headers, proxies alone won't help.

Why am I getting 403 after several successful requests?

Rate limiting or behavior analysis triggered a block after the server allowed your first few requests. Add random delays, rotate proxies, and vary your request patterns to fix it.

Summary

Most 403 Forbidden errors in web scraping come from bot detection, not permissions. The fix depends on what triggers the block. Start with proper HTTP headers and a real User-Agent. Add delays and proxy rotation for rate limiting. Use curl_cffi or a headless browser for TLS fingerprinting.

Match your symptom to the right solution. First-request 403 means bad headers. Mid-session 403 means rate limiting. "Works in browser only" means TLS fingerprinting.

For production scraping at scale, Scrapfly handles all these bypass techniques in a single API call. It picks the right method for each target and saves you from maintaining bypass code yourself.
