Reverse Engineering Websites
The first step to scraping a website is to understand how it works. This goes by the scary name of "reverse engineering", but it's usually relatively simple!
Several reverse engineering topics apply to almost all website scraping, and it's important to get the minor details right for scrapers to succeed and function reliably.
In this section, we'll take a brief look at how the modern web works and the tools that can be used to reverse-engineer almost any website's behavior.
URLs - What are They Anyway?
A URL is the address of a resource on the internet. This can be a webpage, an image, a video, a file, etc.
It's important to understand URL structure because in web scraping we often need to generate URLs ourselves, modify some parameters or extract information from them. Let's take a quick look.
Protocol | https |
Hostname | web-scraping.dev |
Pathname | /product/2 |
Anchor | #description |
Parameter variant | one |
Parameter currency | usd |
This is a typical URL and the parts that make it up.
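To see the same breakdown programmatically, here is a minimal sketch using Python's standard `urllib.parse` module. The full URL string is an assumption, assembled from the parts listed in the table above.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed full URL, put together from the parts in the table above.
url = "https://web-scraping.dev/product/2?variant=one&currency=usd#description"

parts = urlsplit(url)
print(parts.scheme)    # protocol
print(parts.netloc)    # hostname
print(parts.path)      # pathname
print(parts.query)     # raw parameter string
print(parts.fragment)  # anchor, without the leading "#"

# parse_qs turns the raw parameter string into a dict of lists,
# because a parameter may legally appear more than once.
params = parse_qs(parts.query)
print(params)
```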
The parameters part is the most critical one in web scraping. It is used to pass page configuration details like currency, language, product variant etc. Here, it's important to format these values exactly as seen on the target website, including ordering, encoding and even capitalization, to prevent errors or scraper blocking.
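One way to keep control over parameter order and encoding is to build the query string from an ordered list of pairs. This is a sketch only: `build_url` is a hypothetical helper, and whether a given site expects spaces as `%20` or `+` is something to check against the real site's URLs.

```python
from urllib.parse import urlencode, quote

def build_url(base: str, params: list[tuple[str, str]]) -> str:
    """Join a base URL with parameters, keeping the given order.

    quote_via=quote encodes spaces as %20 (urlencode's default uses "+").
    Which style the target site expects is an assumption to verify.
    """
    query = urlencode(params, quote_via=quote)
    return f"{base}?{query}"

# Parameters are passed as a list of pairs so the order is explicit.
product_url = build_url(
    "https://web-scraping.dev/product/2",
    [("variant", "one"), ("currency", "usd")],
)
print(product_url)
```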
As for other parts:
- The protocol is almost always https, which means the connection is encrypted, but the unsecured http endpoint is often easier to scrape when it's available.
- The anchor doesn't even get sent to the web server, so it's safe to ignore completely.
- The hostname can vary for different geographic locations of the website (e.g. google.com vs google.co.uk).
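Since the anchor never reaches the server, a scraper can strip it before requesting a page. A small sketch with the standard library's `urldefrag`, using an assumed example URL:

```python
from urllib.parse import urldefrag

# Assumed example URL with an anchor at the end.
url = "https://web-scraping.dev/product/2?variant=one&currency=usd#description"

# urldefrag splits the URL into the part that is sent to the server
# and the anchor (fragment) that stays in the client.
request_url, anchor = urldefrag(url)
print(request_url)
print(anchor)
```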
Requests and Responses
The web is built on HTTP, a simple request-response protocol. The client (a scraper or a web browser) sends a request to a given URL and the server returns a response.
So, a web scraper is just a client that sends requests to a server and processes the responses. Here's a quick example using Python:
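The sketch below uses only the standard library's `urllib.request`. The `fetch` helper, its timeout and the target URL are illustrative assumptions, not a definitive implementation.

```python
from urllib.request import Request, urlopen

def fetch(url: str, timeout: float = 10.0) -> tuple[int, str]:
    """Send a GET request and return (status code, decoded body)."""
    request = Request(url)
    with urlopen(request, timeout=timeout) as response:
        # Assumes a UTF-8 body; undecodable bytes are replaced, not raised.
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body

# Example usage (needs network access):
# status, body = fetch("https://web-scraping.dev/product/2")
# print(status, len(body))
```

Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so a real scraper would usually wrap the call in error handling.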
What are Requests?
Requests are made up of 3 parts: the method, the URL and the headers. Some requests, such as POST, also carry a body.
For the request method in scraping we usually use GET requests to retrieve pages and POST to interact with dynamic elements like search boxes, forms or buttons.
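To make the difference concrete, here is a sketch that builds one request of each kind without sending anything. The `example.com/search` endpoint and the `query` form field are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: everything the server needs is in the URL itself.
get_request = Request("https://web-scraping.dev/product/2", method="GET")

# POST: extra data travels in the request body.
# The endpoint and the "query" field are hypothetical.
form_data = urlencode({"query": "chocolate"}).encode("utf-8")
post_request = Request("https://example.com/search", data=form_data, method="POST")

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```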
The headers are used to pass additional meta information to the server like:
- What sort of device is making this request?
- What's the preferred language and encoding?
- Does the client have any custom preferences?
Using these details, the server decides how to format the response, so these settings are important for getting the expected results in web scraping.
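As a rough sketch, headers can be attached to a request like this. The header values are sample values that resemble what a desktop browser sends; the right set for a given site is an assumption to verify in the browser's devtools.

```python
from urllib.request import Request

# Sample browser-like headers (illustrative values, not a required set).
headers = {
    # What sort of device is making this request?
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    # What's the preferred language and content type?
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
}

request = Request("https://web-scraping.dev/product/2", headers=headers)

# urllib stores header names capitalized, e.g. "Accept-language".
for name, value in request.header_items():
    print(f"{name}: {value}")
```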
What are Responses?
Responses are made up of 3 parts: the status code, the headers and the body.
The status code is a 3-digit number that tells us if the request was successful or not:
- 2xx - Success
- 3xx - Redirection, meaning the URL for this page has changed
- 4xx - Client Error, meaning the request is misconfigured (bad URL or headers) or the client is being blocked
- 5xx - Server Error, meaning the server is down, can't handle this request, or the client is being blocked
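The status code classes above can be sketched as a small helper. `describe_status` is a hypothetical function name, and the returned labels simply mirror the list above.

```python
def describe_status(status_code: int) -> str:
    """Map an HTTP status code to the class it belongs to."""
    if 200 <= status_code < 300:
        return "success"
    if 300 <= status_code < 400:
        return "redirection"
    if 400 <= status_code < 500:
        return "client error"  # bad request, or the client is blocked
    if 500 <= status_code < 600:
        return "server error"  # server trouble, or the client is blocked
    return "other"

for code in (200, 301, 403, 503):
    print(code, describe_status(code))
```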
The headers return meta-information about the response, like how it's encoded or the language of the content.
In scraping, though, the most useful detail here is the Set-Cookie header, which we'll cover later in the academy.
The body is the actual content of the page which is usually an HTML or JSON document in web scraping.
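A scraper typically decides how to handle the body based on the response's Content-Type header. The sketch below assumes a UTF-8 body, and `parse_body` is a hypothetical helper: JSON is parsed into Python objects and anything else is returned as text, ready for an HTML parser.

```python
import json

def parse_body(content_type: str, raw: bytes):
    """Decode a response body according to its Content-Type header.

    Assumes UTF-8; a real scraper would read the charset from the header.
    """
    text = raw.decode("utf-8")
    if "application/json" in content_type.lower():
        return json.loads(text)
    return text

print(parse_body("application/json; charset=utf-8", b'{"price": 9.99}'))
print(parse_body("text/html; charset=utf-8", b"<h1>Product</h1>"))
```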
Browser Devtools
Browser devtools come with every major web browser (like Chrome). They are the holy grail of web development, and scraping is no exception!
To open devtools, press F12 or right-click anywhere on the page and select Inspect or Inspect Element.
We'll be covering devtools use in many forms throughout this academy, but here's a quick overview of the most important features:
- Network tab shows all requests made by the browser and the responses it receives.
- Elements tab shows the HTML element tree.
- Application tab shows the web page's state: cookies, databases and other persistent data.
Reverse engineering is a crucial skill in web scraping development, and it's learned through real examples and practice. Next, let's do exactly that!
Next - Static Page Scraping
We'll explore reverse engineering in action, but first let's take a look at basic static page scraping, the simplest form of web scraping.