Static Page Scraping
Static HTML pages are the simplest form of pages encountered in web scraping. An easy way to confirm whether a page is static is to disable JavaScript in your browser and check whether the data is still present.
For static page scraping we only need an HTTP client to fetch the page and an HTML parser to extract the data fields we want.
An example of a static HTML page would be the web-scraping.dev/products page, which lists products without using any JavaScript. On the other hand, web-scraping.dev/testimonials is a dynamic page that uses JavaScript to load more results as the user scrolls down.
Let's focus on the static products listing page and take a look at how to scrape it using Python as a reference.
Example Scraper
For this example scraper we'll be using Python with httpx as our HTTP client. It can be installed using pip install httpx[http2]. Note that we are installing the http2 extra, as it's good practice to scrape with the latest HTTP version available.
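Here's a minimal sketch of such a fetch; the header values below are illustrative of what a typical browser sends:

```python
import httpx

# browser-like request headers: metadata about who is sending the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# http2=True enables HTTP/2, provided by the httpx[http2] extra
with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://web-scraping.dev/products")

print(response.status_code)  # 200 means the request succeeded
html = response.text         # the raw HTML of the page
```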
Above, we are using the HTTP protocol to pull page data from our example URL. For that, we use minimal configuration, only enabling http2. We're also setting some custom headers, which are a way for the requesting client to provide metadata about the request: who's sending it and from where? In return, we receive a response object which is either a success or an error. This is indicated by the status_code property, where codes in the 200-299 range mean success and others indicate an error.
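Continuing from the response object above, a basic status check could look like this sketch (raise_for_status is httpx's built-in error helper):

```python
# sketch: acting on the response status
if response.status_code == 200:
    html = response.text
else:
    response.raise_for_status()  # raises httpx.HTTPStatusError for 4xx/5xx codes
```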
In the next section we'll take a look at how to parse the HTML data we received and wrap up our scraper, though before that we recommend skimming over the types of HTTP challenges 👇
Challenges
Static page scraping introduces us to the first set of web scraping challenges, which relate to HTTP connections. We can divide them into two practical categories: technical challenges and blocking challenges.
Technical Challenges
To successfully retrieve pages, our HTTP requests must be valid. This means the correct URL must be used, and details like request headers and even the HTTP version can have an effect.
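For example, here's a quick sketch for confirming which protocol version the server actually negotiated with our client:

```python
import httpx

# sketch: check which HTTP version the server negotiated for our request
with httpx.Client(http2=True) as client:
    response = client.get("https://web-scraping.dev/products")
    print(response.http_version)  # "HTTP/2" if negotiated, "HTTP/1.1" otherwise
```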
Handling Page State
While HTTP itself is stateless (meaning requests 1 and 2 are independent of each other), web pages can build extra layers on top of it to track client state. Most commonly this is done through cookies.
Cookies are just normal headers that contain key=value data, though they have special, standardized functionality. The web server expects the client to store Set-Cookie response header values locally and send them back using the Cookie request header. This is how cookies are used for persistent state, and it can be a surprising challenge for new web scrapers.
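As a sketch, httpx.Client handles this cookie round-trip automatically; the /login endpoint below stands in for any page that sets a cookie:

```python
import httpx

# sketch: httpx.Client stores Set-Cookie values and replays them automatically
with httpx.Client() as client:
    # a hypothetical page that responds with a Set-Cookie header
    client.get("https://web-scraping.dev/login")
    print(dict(client.cookies))  # cookies captured from the response

    # follow-up requests automatically attach the stored Cookie header
    client.get("https://web-scraping.dev/products")
```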
Blocking Challenges
This also introduces us to web scraper blocking. Any unusual HTTP behavior can indicate that the request is not coming from a web browser user. So, it's important to replicate the HTTP behavior of a web browser as much as possible. This includes using the same HTTP version, headers, cookies, and even connection patterns.
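To illustrate, httpx identifies itself by default, which is an easy giveaway (the exact version string will vary):

```python
import httpx

# sketch: httpx's default User-Agent reveals that we're not a browser
print(httpx.Client().headers["user-agent"])  # e.g. "python-httpx/0.27.0"

# replacing it with a real browser string is a common first step
client = httpx.Client(
    http2=True,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    },
)
```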
We cover blocking in great detail in the Scraper Blocking section.
Related to HTTP Blocking:
User-Agent Header Explanation and Intro
This particular header is one of the most important request headers in web scraping.
How Headers are used to Identify Web Scrapers
Headers play a major role in scraper blocking; here's how it's done.
Next up - Parsing HTML
We've retrieved the HTML data of our example page, though raw HTML by itself is not very useful. Next, let's take a look at how to parse it using HTML parsing tools.