Reverse Engineering Websites

The first step to scraping a website is to understand how it works. This goes by the intimidating name of "reverse engineering", but usually, it's relatively simple!

A few reverse engineering topics apply to almost all web scraping, and getting the minor details right is essential for scrapers to succeed and run reliably.

In this section, we'll take a brief look at how the modern web works and the tools that can be used to reverse-engineer almost any website's behavior.

URLs - What are They Anyway?

A URL is the address of a resource on the internet. This can be a webpage, an image, a video, a file, etc.

It's important to understand URL structure, as in web scraping we often need to generate URLs ourselves, modify parameters, or extract information from them. Let's take a quick look.

Here's a typical URL and the parts that make it up:

https://web-scraping.dev/product/2?variant=one&currency=usd#description

  • Protocol: https
  • Hostname: web-scraping.dev
  • Pathname: /product/2
  • Parameters: variant=one & currency=usd
  • Anchor: #description
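To illustrate, Python's standard library can break a URL into these exact parts:

```python
from urllib.parse import urlparse, parse_qs

url = "https://web-scraping.dev/product/2?variant=one&currency=usd#description"
parts = urlparse(url)
print(parts.scheme)           # protocol: https
print(parts.netloc)           # hostname: web-scraping.dev
print(parts.path)             # pathname: /product/2
print(parse_qs(parts.query))  # parameters: {'variant': ['one'], 'currency': ['usd']}
print(parts.fragment)         # anchor: description
```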

The parameters part is the most critical one in web scraping. It is used to pass page configuration details like currency, language, or product variant. Here, it's important to format these values as seen on the target website, including ordering, encoding, and even capitalization, to prevent errors or scraper blocking.
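When generating URLs ourselves, the standard library makes it easy to keep the parameter order and encoding explicit:

```python
from urllib.parse import urlencode

# a list of tuples keeps parameters in the same order
# and casing as seen on the target website
params = [("variant", "one"), ("currency", "usd")]
url = "https://web-scraping.dev/product/2?" + urlencode(params)
print(url)  # https://web-scraping.dev/product/2?variant=one&currency=usd
```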

As for other parts:

  • The protocol is almost always https, which means the traffic is encrypted, though if an unsecured http endpoint is available it's often easier to scrape.
  • The anchor doesn't even get sent to the web server so it's safe to completely ignore.
  • The hostname can vary for different geographic locations of the website (e.g. google.com vs google.co.uk).

Requests and Responses

The web is built on the HTTP protocol which is a simple request-response system. The client (scraper or web browser) sends a request to a given URL and the server returns a response.

So, a web scraper is just a client that sends requests to a server and processes the responses.
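As a minimal sketch (using the popular requests library here; any HTTP client works the same way):

```python
import requests

# send a GET request and read the response
response = requests.get("https://web-scraping.dev/product/2")
print(response.status_code)  # 200 on success
print(response.text[:100])   # the start of the HTML body
```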

What are Requests?

Requests are made up of 3 parts: the method, the URL and the headers.

For the request method, in scraping we usually use GET requests to retrieve pages and POST requests to interact with dynamic elements like search boxes, forms or buttons.
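With the requests library, the two methods look like this (the search endpoint and form field below are hypothetical, for illustration only; preparing the requests shows their structure without sending them):

```python
import requests

# a GET request simply names the resource to retrieve
get_request = requests.Request("GET", "https://web-scraping.dev/product/2").prepare()

# a POST request also carries data, e.g. a search form submission
post_request = requests.Request(
    "POST",
    "https://web-scraping.dev/api/search",  # hypothetical endpoint
    data={"query": "box"},                  # hypothetical form field
).prepare()

print(get_request.method)   # GET
print(post_request.method)  # POST
print(post_request.body)    # query=box (the form-encoded request body)
```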

The headers are used to pass additional meta information to the server like:

  • What sort of device is making this request?
  • What's the preferred language and encoding?
  • Does the client have any custom preferences?

Using these details, the server decides how to format the response, so getting these settings right is important for receiving the expected results in web scraping.
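With the requests library, custom headers can be attached to each request. The header values below are illustrative browser-like settings, not required magic strings:

```python
import requests

# headers describing the client to the server
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # what device?
    "Accept-Language": "en-US,en;q=0.9",                        # preferred language
    "Accept-Encoding": "gzip, deflate",                         # accepted compression
}
response = requests.get("https://web-scraping.dev/product/2", headers=headers)
print(response.status_code)
```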

What are Responses?

Responses are made up of 3 parts: the status code, the headers and the body.

The status code is a 3-digit number that tells us if the request was successful or not:

  • 2xx - Success
  • 3xx - Redirection, meaning the URL for this page has changed
  • 4xx - Client Error, meaning the request is misconfigured (bad URL or headers) or the client is being blocked
  • 5xx - Server Error, meaning the server is down, can't handle the request, or is blocking the client

The headers return meta-information about the response, like how it's encoded or the language of the content. In scraping, though, the most useful detail here is the Set-Cookie header, which we'll cover later in the academy.

The body is the actual content of the page which is usually an HTML or JSON document in web scraping.
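Continuing the requests example, all three response parts are easy to inspect:

```python
import requests

response = requests.get("https://web-scraping.dev/product/2")
# 1. the status code
print(response.status_code)                  # e.g. 200
# 2. the headers
print(response.headers.get("Content-Type"))  # e.g. text/html; charset=utf-8
# 3. the body
print(response.text[:80])                    # the start of the HTML document
# cookies sent via the Set-Cookie header are collected here:
print(response.cookies.get_dict())
```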

Browser Devtools

Browser devtools ship with every modern web browser (like Chrome) and are the holy grail of web development, and web scraping is no exception!

To open devtools use F12 or right-click anywhere on the page and select Inspect or Inspect Element.


We'll be covering devtools use in many forms throughout this academy but here's a quick overview of the most important features:

  • Network tab shows all requests made by the browser and the responses it receives.
  • Elements tab shows the HTML element tree.
  • Application tab shows the web page's state: cookies, databases and other persistent data.

Reverse engineering is a crucial skill in web scraping development, and it's learned through real examples and practice. Next, let's do exactly that!

Next - Static Page Scraping

We'll explore reverse engineering in action, but first, let's take a look at basic static page scraping, the simplest form of web scraping.

