# Scrapfly Documentation

## Table of Contents

### Dashboard

- [Intro](https://scrapfly.io/docs)
- [Project](https://scrapfly.io/docs/project)
- [Account](https://scrapfly.io/docs/account)
- [Workspace & Team](https://scrapfly.io/docs/workspace-and-team)
- [Billing](https://scrapfly.io/docs/billing)

### Products

#### MCP Server

- [Getting Started](https://scrapfly.io/docs/mcp/getting-started)
- [Tools & API Spec](https://scrapfly.io/docs/mcp/tools)
- [Authentication](https://scrapfly.io/docs/mcp/authentication)
- [Examples & Use Cases](https://scrapfly.io/docs/mcp/examples)
- [FAQ](https://scrapfly.io/docs/mcp/faq)

##### Integrations

- [Overview](https://scrapfly.io/docs/mcp/integrations)
- [Claude Desktop](https://scrapfly.io/docs/mcp/integrations/claude-desktop)
- [Claude Code](https://scrapfly.io/docs/mcp/integrations/claude-code)
- [ChatGPT](https://scrapfly.io/docs/mcp/integrations/chatgpt)
- [Cursor](https://scrapfly.io/docs/mcp/integrations/cursor)
- [Cline](https://scrapfly.io/docs/mcp/integrations/cline)
- [Windsurf](https://scrapfly.io/docs/mcp/integrations/windsurf)
- [Zed](https://scrapfly.io/docs/mcp/integrations/zed)
- [Roo Code](https://scrapfly.io/docs/mcp/integrations/roo-code)
- [VS Code](https://scrapfly.io/docs/mcp/integrations/vscode)
- [LangChain](https://scrapfly.io/docs/mcp/integrations/langchain)
- [LlamaIndex](https://scrapfly.io/docs/mcp/integrations/llamaindex)
- [CrewAI](https://scrapfly.io/docs/mcp/integrations/crewai)
- [OpenAI](https://scrapfly.io/docs/mcp/integrations/openai)
- [n8n](https://scrapfly.io/docs/mcp/integrations/n8n)
- [Make](https://scrapfly.io/docs/mcp/integrations/make)
- [Zapier](https://scrapfly.io/docs/mcp/integrations/zapier)
- [Vapi AI](https://scrapfly.io/docs/mcp/integrations/vapi)
- [Agent Builder](https://scrapfly.io/docs/mcp/integrations/agent-builder)
- [Custom Client](https://scrapfly.io/docs/mcp/integrations/custom-client)


#### Web Scraping API

- [Getting Started](https://scrapfly.io/docs/scrape-api/getting-started)
- [API Specification]()
- [Monitoring](https://scrapfly.io/docs/monitoring)
- [Customize Request](https://scrapfly.io/docs/scrape-api/custom)
- [Debug](https://scrapfly.io/docs/scrape-api/debug)
- [Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection)
- [Proxy](https://scrapfly.io/docs/scrape-api/proxy)
- [Proxy Mode](https://scrapfly.io/docs/scrape-api/proxy-mode)
- [Proxy Mode - Screaming Frog](https://scrapfly.io/docs/scrape-api/proxy-mode/screaming-frog)
- [Proxy Mode - Apify](https://scrapfly.io/docs/scrape-api/proxy-mode/apify)
- [(Auto) Data Extraction](https://scrapfly.io/docs/scrape-api/extraction)
- [Javascript Rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering)
- [Javascript Scenario](https://scrapfly.io/docs/scrape-api/javascript-scenario)
- [SSL](https://scrapfly.io/docs/scrape-api/ssl)
- [DNS](https://scrapfly.io/docs/scrape-api/dns)
- [Cache](https://scrapfly.io/docs/scrape-api/cache)
- [Session](https://scrapfly.io/docs/scrape-api/session)
- [Webhook](https://scrapfly.io/docs/scrape-api/webhook)
- [Screenshot](https://scrapfly.io/docs/scrape-api/screenshot)
- [Errors](https://scrapfly.io/docs/scrape-api/errors)
- [Timeout](https://scrapfly.io/docs/scrape-api/understand-timeout)
- [Throttling](https://scrapfly.io/docs/throttling)
- [Troubleshoot](https://scrapfly.io/docs/scrape-api/troubleshoot)
- [Billing](https://scrapfly.io/docs/scrape-api/billing)
- [FAQ](https://scrapfly.io/docs/scrape-api/faq)

#### Crawler API

- [Getting Started](https://scrapfly.io/docs/crawler-api/getting-started)
- [API Specification]()
- [Retrieving Results](https://scrapfly.io/docs/crawler-api/results)
- [WARC Format](https://scrapfly.io/docs/crawler-api/warc-format)
- [Data Extraction](https://scrapfly.io/docs/crawler-api/extraction-rules)
- [Webhook](https://scrapfly.io/docs/crawler-api/webhook)
- [Billing](https://scrapfly.io/docs/crawler-api/billing)
- [Errors](https://scrapfly.io/docs/crawler-api/errors)
- [Troubleshoot](https://scrapfly.io/docs/crawler-api/troubleshoot)
- [FAQ](https://scrapfly.io/docs/crawler-api/faq)

#### Screenshot API

- [Getting Started](https://scrapfly.io/docs/screenshot-api/getting-started)
- [API Specification]()
- [Accessibility Testing](https://scrapfly.io/docs/screenshot-api/accessibility)
- [Webhook](https://scrapfly.io/docs/screenshot-api/webhook)
- [Billing](https://scrapfly.io/docs/screenshot-api/billing)
- [Errors](https://scrapfly.io/docs/screenshot-api/errors)

#### Extraction API

- [Getting Started](https://scrapfly.io/docs/extraction-api/getting-started)
- [API Specification]()
- [Rules Template](https://scrapfly.io/docs/extraction-api/rules-and-template)
- [LLM Extraction](https://scrapfly.io/docs/extraction-api/llm-prompt)
- [AI Auto Extraction](https://scrapfly.io/docs/extraction-api/automatic-ai)
- [Webhook](https://scrapfly.io/docs/extraction-api/webhook)
- [Billing](https://scrapfly.io/docs/extraction-api/billing)
- [Errors](https://scrapfly.io/docs/extraction-api/errors)
- [FAQ](https://scrapfly.io/docs/extraction-api/faq)

#### Proxy Saver

- [Getting Started](https://scrapfly.io/docs/proxy-saver/getting-started)
- [Fingerprints](https://scrapfly.io/docs/proxy-saver/fingerprints)
- [Optimizations](https://scrapfly.io/docs/proxy-saver/optimizations)
- [SSL Certificates](https://scrapfly.io/docs/proxy-saver/certificates)
- [Protocols](https://scrapfly.io/docs/proxy-saver/protocols)
- [Pacfile](https://scrapfly.io/docs/proxy-saver/pacfile)
- [Secure Credentials](https://scrapfly.io/docs/proxy-saver/security)
- [Billing](https://scrapfly.io/docs/proxy-saver/billing)

#### Cloud Browser API

- [Getting Started](https://scrapfly.io/docs/cloud-browser-api/getting-started)
- [Proxy & Geo-Targeting](https://scrapfly.io/docs/cloud-browser-api/proxy)
- [Unblock API](https://scrapfly.io/docs/cloud-browser-api/unblock)
- [File Downloads](https://scrapfly.io/docs/cloud-browser-api/file-downloads)
- [Session Resume](https://scrapfly.io/docs/cloud-browser-api/session-resume)
- [Human-in-the-Loop](https://scrapfly.io/docs/cloud-browser-api/human-in-the-loop)
- [Debug Mode](https://scrapfly.io/docs/cloud-browser-api/debug-mode)
- [Bring Your Own Proxy](https://scrapfly.io/docs/cloud-browser-api/bring-your-own-proxy)
- [Browser Extensions](https://scrapfly.io/docs/cloud-browser-api/extensions)

##### Integrations

- [Puppeteer](https://scrapfly.io/docs/cloud-browser-api/puppeteer)
- [Playwright](https://scrapfly.io/docs/cloud-browser-api/playwright)
- [Selenium](https://scrapfly.io/docs/cloud-browser-api/selenium)
- [Vercel Agent Browser](https://scrapfly.io/docs/cloud-browser-api/agent-browser)
- [Browser Use](https://scrapfly.io/docs/cloud-browser-api/browser-use)
- [Stagehand](https://scrapfly.io/docs/cloud-browser-api/stagehand)

- [Billing](https://scrapfly.io/docs/cloud-browser-api/billing)
- [Errors](https://scrapfly.io/docs/cloud-browser-api/errors)


### Tools

- [Antibot Detector](https://scrapfly.io/docs/tools/antibot-detector)

### SDK

- [Golang](https://scrapfly.io/docs/sdk/golang)
- [Python](https://scrapfly.io/docs/sdk/python)
- [Rust](https://scrapfly.io/docs/sdk/rust)
- [TypeScript](https://scrapfly.io/docs/sdk/typescript)
- [Scrapy](https://scrapfly.io/docs/sdk/scrapy)

### Integrations

- [Getting Started](https://scrapfly.io/docs/integration/getting-started)
- [LangChain](https://scrapfly.io/docs/integration/langchain)
- [LlamaIndex](https://scrapfly.io/docs/integration/llamaindex)
- [CrewAI](https://scrapfly.io/docs/integration/crewai)
- [Zapier](https://scrapfly.io/docs/integration/zapier)
- [Make](https://scrapfly.io/docs/integration/make)
- [n8n](https://scrapfly.io/docs/integration/n8n)

### Academy

- [Overview](https://scrapfly.io/academy)
- [Web Scraping Overview](https://scrapfly.io/academy/scraping-overview)
- [Tools](https://scrapfly.io/academy/tools-overview)
- [Reverse Engineering](https://scrapfly.io/academy/reverse-engineering)
- [Static Scraping](https://scrapfly.io/academy/static-scraping)
- [HTML Parsing](https://scrapfly.io/academy/html-parsing)
- [Dynamic Scraping](https://scrapfly.io/academy/dynamic-scraping)
- [Hidden API Scraping](https://scrapfly.io/academy/hidden-api-scraping)
- [Headless Browsers](https://scrapfly.io/academy/headless-browsers)
- [Hidden Web Data](https://scrapfly.io/academy/hidden-web-data)
- [JSON Parsing](https://scrapfly.io/academy/json-parsing)
- [Data Processing](https://scrapfly.io/academy/data-processing)
- [Scaling](https://scrapfly.io/academy/scaling)
- [Walkthrough Summary](https://scrapfly.io/academy/walkthrough-summary)
- [Scraper Blocking](https://scrapfly.io/academy/scraper-blocking)
- [Proxies](https://scrapfly.io/academy/proxies)

---

# Reverse Engineering Websites

The first step in scraping a website is understanding how it works. This goes by the intimidating name of "reverse engineering", but in practice it's usually quite simple!

Several reverse engineering topics apply to almost all web scraping, and getting the minor details right is what makes scrapers succeed and run reliably.

 In this section, we'll take a brief look at how the modern web works and the tools that can be used to reverse-engineer almost any website's behavior.

## URLs - What are They Anyway?

A URL is the address of a resource on the internet. This can be a webpage, an image, a video, a file, etc.

Understanding URL structure is important because in web scraping we often need to generate URLs ourselves, modify parameters, or extract information from them. Let's take a quick look.

| Part | Value |
|---|---|
| Protocol | `https` |
| Hostname | `web-scraping.dev` |
| Pathname | `/product/2` |
| Anchor | `#description` |
| Parameter `variant` | `one` |
| Parameter `currency` | `usd` |

This table breaks down a typical URL, `https://web-scraping.dev/product/2?variant=one&currency=usd#description`, into the parts that make it up.

The **parameters** part is the most critical one in web scraping. It is used to pass page configuration details like currency, language, product variant, etc. It's important to format these values exactly as seen on the target website, including ordering, encoding, and even capitalization, to prevent errors or scraper blocking.

 As for other parts:

- The **protocol** is almost always `https`, meaning traffic is end-to-end encrypted; when a site still serves an unsecured `http` endpoint, it can be easier to scrape.
- The **anchor** is never sent to the web server, so it's safe to ignore completely.
- The **hostname** can vary for different geographic locations of the website (e.g. `google.com` vs `google.co.uk`).
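The URL anatomy above can be taken apart and rebuilt with Python's standard `urllib.parse` module. A minimal sketch using the example URL from the table:

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

url = "https://web-scraping.dev/product/2?variant=one&currency=usd#description"
parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.netloc)    # web-scraping.dev
print(parts.path)      # /product/2
print(parse_qs(parts.query))  # {'variant': ['one'], 'currency': ['usd']}
print(parts.fragment)  # description

# rebuild the URL with a different currency, preserving parameter order
params = parse_qs(parts.query)
params["currency"] = ["eur"]
new_query = urlencode(params, doseq=True)
new_url = urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, parts.fragment))
print(new_url)
# https://web-scraping.dev/product/2?variant=one&currency=eur#description
```

Because `parse_qs` returns parameters in the order they appear, rebuilding the query string this way keeps the original ordering intact.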

## Requests and Responses

 The web is built on the HTTP protocol which is a simple request-response system. The client (scraper or web browser) sends a request to a given URL and the server returns a response.

 So, a web scraper is just a client that sends requests to a server and processes the responses. Here's a quick example using Python:

```python
import httpx

# send a request using httpx
response = httpx.request(
  method="GET",
  url="https://web-scraping.dev/products",
  headers={
    # note: these are mostly optional or already set as defaults
    "user-agent": "scrapfly academy",  # who's making the request?
    "accept-language": "en-US",  # what language do we want the response in? (if possible)
    "accept-encoding": "gzip, deflate, br",  # what encoding do we want the response in? (if possible)
    "accept": "text/html",  # what type of response do we want?
    "referer": "https://web-scraping.dev",  # where did we come from?
  },
)
print(response.status_code)
# 200 - success!
print(response.headers)
# {
#   "content-type": "text/html; charset=utf-8",
#   ...
# }
print(response.text)
# ...
```

### What are Requests?

Requests are made up of three parts: the **method**, the **URL**, and the **headers**.


For the **request method**, scrapers usually use `GET` to retrieve pages and `POST` to interact with dynamic elements like search boxes, forms, or buttons.

 The **headers** are used to pass additional meta information to the server like:

- What sort of device is making this request?
- What's the preferred language and encoding?
- Does the client have any custom preferences?
 
The server uses these details to decide how to format the response, so getting them right is important for receiving the expected results in web scraping.

### What are Responses?

Responses are made up of three parts: the **status code**, the **headers**, and the **body**.

  The **status code** is a 3-digit number that tells us if the request was successful or not:

- **2xx** - Success
- **3xx** - Redirection, meaning the URL for this page has changed
- **4xx** - Client Error, meaning the request is misconfigured (bad URL or headers) or the client is being blocked
- **5xx** - Server Error, meaning the server is down, can't handle the request, or is blocking the client
 
The **headers** return meta-information about the response, like how it's encoded or the language of the content. In scraping, the most useful detail here is the `Set-Cookie` header, which we'll cover later in the academy.

 The **body** is the actual content of the page which is usually an HTML or JSON document in web scraping.
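A scraper typically branches on these status families when deciding whether to parse the body, follow a redirect, or retry. A minimal sketch (the function name and return strings here are our own, not part of any API):

```python
def classify_status(code: int) -> str:
    """Map an HTTP status code to what it usually means for a scraper."""
    if 200 <= code < 300:
        return "success"       # parse the body
    if 300 <= code < 400:
        return "redirect"      # follow the Location header
    if 400 <= code < 500:
        return "client error"  # check the URL/headers, or we may be blocked
    if 500 <= code < 600:
        return "server error"  # retry later, or we may be blocked
    return "unknown"

print(classify_status(200))  # success
print(classify_status(301))  # redirect
print(classify_status(403))  # client error
print(classify_status(503))  # server error
```

In practice, clients like `httpx` expose the same information through the response object (e.g. `response.status_code`), so a dispatch like this usually lives in the scraper's retry logic.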

## Browser Devtools

Browser devtools ship with every modern web browser (like Chrome) and are the holy grail of web development - and web scraping is no exception!

 To open devtools use `F12` or right-click anywhere on the page and select `Inspect` or `Inspect Element`.

See [Using Devtools to Find Hidden APIs](https://scrapfly.io/blog/posts/how-to-scrape-hidden-apis/) for the most powerful devtools use: finding hidden web APIs for web scraping.

 We'll be covering devtools use in many forms throughout this academy but here's a quick overview of the most important features:

- **Network** tab shows all requests made by the browser and the responses it receives.
- **Elements** tab shows the HTML element tree.
- **Application** tab shows the web page's state: cookies, databases and other persistent data
 
---

See [Scrapfly Web Tools](https://scrapfly.io/web-scraping-tools) for our set of web tools made for web scraping developers.

Reverse engineering is a crucial skill in web scraping development, and it's learned through real examples and practice. Next, let's do exactly that!

## Next - Static Page Scraping

We'll explore reverse engineering in action, but first let's take a look at basic static page scraping - the simplest form of web scraping.

[Previous Page](https://scrapfly.io/academy/tools-overview) | [Next Page](https://scrapfly.io/academy/static-scraping)