Static Page Scraping

Static HTML pages are the simplest form of page encountered in web scraping. An easy way to confirm whether a page is static is to disable JavaScript in your browser and check whether the data is still present.

For static page scraping we only need an HTTP client to fetch the page and an HTML parser to extract the data fields we want.

An example of a static HTML page would be the web-scraping.dev/products page, which lists products without using any JavaScript. On the other hand, web-scraping.dev/testimonials is a dynamic page that uses JavaScript to load more results as the user scrolls down.

Let's focus on the static products listing page and take a look at how to scrape it using Python as a reference.

Example Scraper

For this example scraper we'll be using Python with httpx as our HTTP client. It can be installed using pip install "httpx[http2]" (the quotes prevent some shells from interpreting the brackets). Note that we are installing the http2 extra, as it's good practice to scrape with the latest HTTP version available.
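
Here's a minimal sketch of such a request. The exact header values are illustrative - real scrapers typically mimic a specific browser here:

```python
import httpx

# illustrative metadata headers - who's sending the request and from where
headers = {
    "User-Agent": "example-scraper",
    "Accept-Language": "en-US,en;q=0.9",
}

# http2=True enables HTTP/2 (this requires the httpx[http2] extra)
with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://web-scraping.dev/products")

print(response.status_code)  # e.g. 200 on success
print(response.text[:500])   # the beginning of the page HTML
```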

Above, we use the HTTP protocol to pull page data from our example URL with minimal configuration, enabling http2 and setting some custom headers. Headers are a way for the requesting client to provide metadata about the request - who's sending it and from where?

In return, we receive a response object indicating either success or failure. This is signaled by the status_code property: codes in the 200 range mean success, while others indicate an error.
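
As a sketch, such a check could look like this (raise_for_status() is httpx's built-in shortcut for it):

```python
if response.status_code == 200:
    html = response.text  # success - we have the page HTML
else:
    # raises httpx.HTTPStatusError for 4xx/5xx responses
    response.raise_for_status()
```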

In the next section we'll take a look at how to parse the HTML data we received and wrap up our scraper. Before that, though, we really recommend skimming over the types of HTTP challenges 👇

Challenges

Static page scraping introduces us to the first set of web scraping challenges that relate to HTTP connections. We can divide them into two practical categories.

Technical Challenges

To successfully retrieve pages, our HTTP requests must be valid. This means the correct URL must be used, and request headers and even the HTTP version can play a role.

Handling Page State

While HTTP in itself is stateless (meaning requests 1 and 2 are independent of each other), web pages can build extra layers to track client state. Most commonly this is done through cookies.

Cookies are just normal headers that contain key=value data, though they have special standardized behavior: the web server expects the client to store Set-Cookie response header values and send them back using the Cookie request header. This is how cookies create persistent state, and it can be a surprising challenge for new web scrapers.
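
As a sketch of this in practice, httpx's Client does the cookie bookkeeping automatically (the login endpoint below is purely illustrative):

```python
import httpx

with httpx.Client() as client:
    # the server may respond with Set-Cookie headers;
    # the client stores them in its cookie jar
    client.get("https://web-scraping.dev/login")  # illustrative endpoint
    print(client.cookies)

    # stored cookies are automatically sent back via the
    # Cookie header on follow-up requests
    client.get("https://web-scraping.dev/products")
```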

Scrapeground Exercise: Cookies in Web Scraping

See this in-depth tutorial on Cookies in web scraping on Scrapfly Scrapeground. This example demonstrates how login systems use cookies to track user sessions.

Blocking Challenges

This also introduces us to web scraper blocking. Any unusual HTTP behavior can signal that the request is not coming from a real web browser user, so it's important to replicate a browser's HTTP behavior as closely as possible. This includes using the same HTTP version, headers, cookies, and even connection patterns.
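
As a rough sketch, that means configuring the client with browser-like defaults. The header values below are illustrative and go stale as browsers update:

```python
import httpx

# illustrative Chrome-like headers; real browsers send more of them,
# and the values change with every browser release
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# browsers speak HTTP/2 or newer, so the scraper should too
client = httpx.Client(http2=True, headers=BROWSER_HEADERS)
```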

We cover blocking in great detail in the Scraper Blocking section.

Next up - Parsing HTML

We got our HTML data, though raw HTML by itself is not very useful. Next, let's take a look at how to parse it using HTML parsing tools.

