Web Scraping With Python 101: Connection

One of the biggest revolutions of the 21st century is the realization of how valuable data can be. The great news is that the internet is full of public data for you to take advantage of, and that's exactly the purpose of web scraping: collecting this public data to bootstrap a project or a newly founded business.

In this multipart tutorial series we'll take an in-depth look at what web scraping is, how everything around it works and how we can write our own web scrapers in the Python programming language.

So what is web scraping?
As you might have guessed, web scraping is essentially public data collection and there are thousands of reasons why one might want to collect this public data, ranging from finding potential employees to competitive intelligence.

We at ScrapFly did extensive research into web scraping applications, and you can find our findings here: https://scrapfly.io/use-case

In this part of the series, we'll focus on the connection side of things: how can we create a program that downloads hundreds or thousands of public web pages for us to ingest later?

Next: Web Scraping With Python 102: Parsing

Connection: HTTP Protocol Fundamentals

To collect data from a public resource, we need to establish a connection with it first. Most of the web is served over the HTTP protocol, which is rather simple: we (the client) send a request for a specific document to the website (the server), and once the server processes our request it replies with the requested document - a very straightforward exchange!

illustration of a standard http exchange

As you can see in this illustration: we send a request object which consists of a method (aka type), location and headers; in turn we receive a response object which consists of a status code, headers and the document content itself. Let's take a quick look at each of these components, what they mean and how they are relevant in web scraping.

Understanding Requests and Responses

When it comes to web scraping we don't need to know every little detail about HTTP requests and responses, however it's good to have a general overview and to know which parts of this protocol are especially useful in web scraping. Let's take a look at exactly that!

Request Method

HTTP requests are conveniently divided into a few types that perform distinct functions:

  • GET requests are intended to request a document.
  • POST requests are intended to request a document by sending a document.
  • HEAD requests are intended to request a document's meta information.
  • PATCH requests are intended to update a document.
  • PUT requests are intended to either create a new document or update it.
  • DELETE requests are intended to delete a document.

When it comes to web scraping, we are mostly interested in collecting documents, so we'll be mostly working with GET and POST type requests. Additionally, HEAD requests can be useful in web scraping to optimize bandwidth - sometimes, before downloading the document, we might want to check its metadata to see whether it's worth the effort.
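For example, here's a minimal sketch using Python's built-in urllib (httpbin.org here is just a testing placeholder) that peeks at a document's type and size via a HEAD request before committing to a full download:

from urllib.request import Request, urlopen

# a HEAD request returns only the status line and headers - no document body
request = Request("http://httpbin.org/html", method="HEAD")
with urlopen(request) as response:
    print(response.status)
    # e.g. 200
    print(response.headers.get("Content-Type"))
    # e.g. text/html; charset=utf-8
    print(response.headers.get("Content-Length"))
    # document size in bytes, if the server provides it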

Request Location

To understand what a resource location is, first we should take a quick look at a URL's structure itself:

example of a URL structure

Here, we can visualize each part of a URL: we have the protocol (which, when it comes to HTTP, is either http or https), then we have the host (which is essentially the address of the server), and finally we have the location of the resource and some custom parameters.
If you're ever unsure of a URL's structure, you can always fire up Python and let it figure it out for you:

from urllib.parse import urlparse

urlparse("http://www.domain.com/path/to/resource?arg1=true&arg2=false")
# ParseResult(scheme='http', netloc='www.domain.com', path='/path/to/resource',
#             params='', query='arg1=true&arg2=false', fragment='')
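Going the other way, the standard library can also assemble a URL from its parts, which comes in handy when a scraper needs to build page URLs itself - a minimal sketch with placeholder values:

from urllib.parse import urlencode, urlunparse

# build the query string and reassemble the full URL from its components
query = urlencode({"arg1": "true", "arg2": "false"})
url = urlunparse(("http", "www.domain.com", "/path/to/resource", "", query, ""))
print(url)
# http://www.domain.com/path/to/resource?arg1=true&arg2=false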

Request Headers

While it might appear that request headers are just minor metadata details, in web scraping they are extremely important. Headers contain essential details about the request, like: who's requesting the data? What type of data are they expecting? Getting these wrong might result in the web scraper being denied access.

Let's take a look at some of the most important headers and what they mean:

User-Agent is an identity header that tells the server who's requesting the document.

# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

Whenever you visit a web page in your web browser, it identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers". This helps the server decide whether to serve or deny the client. In web scraping, we don't want to be denied content, so we have to blend in by faking our user agent to look like that of a browser.

There are many online databases that contain the latest user-agent strings of various platforms, like https://www.whatismybrowser.com/guides/the-latest-user-agent/chrome

Cookie headers are used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc. Cookies are a bit out of scope of this article, but we'll be covering them in the future.

Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content we're expecting. Generally when web scraping we want to mimic one of the popular web browsers; for example, the Chrome browser uses:

text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of a website/webapp.

These are a few of the most important observations; for more, see the extensive documentation over at https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

Response Status Code

Conveniently, all HTTP responses come with a status code that indicates whether the request is a success, a failure, or whether some alternative action is required (like a request to authenticate). Let's take a quick look at the status codes that are most relevant to web scraping:

  • 200 range codes generally mean success!
  • 300 range codes tend to mean redirection - in other words, if we request content at /product1.html it might have been moved to a new location like /products/1.html, and the server would inform us about that.
  • 400 range codes mean the request is malformed or denied. Our web scraper could be missing some headers, cookies or authentication details.
  • 500 range codes typically mean server issues. The website might be unavailable right now or is purposefully disabling access to our web scraper.

For more on HTTP status codes, see the official documentation at MDN: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
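To make this concrete, here's a rough sketch of how a scraper might react to these ranges, using the httpx client we'll install later in this article and httpbin.org's /status testing endpoint:

import httpx

response = httpx.get("http://httpbin.org/status/200")
if 200 <= response.status_code < 300:
    print("success:", len(response.text), "bytes of content received")
elif 300 <= response.status_code < 400:
    # the new location is given in the Location response header
    print("redirected to:", response.headers.get("location"))
elif 400 <= response.status_code < 500:
    print("denied or malformed - maybe we're missing headers, cookies or authentication?")
else:
    print("server issue - might be worth retrying later")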

Response Headers

When it comes to web scraping, response headers provide some important information for connection functionality and efficiency. For example, the Set-Cookie header asks our client to save some cookies for future requests, which might be vital for website functionality. Other headers such as Etag and Last-Modified are intended to help the client with caching to optimize resource usage.
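As a quick illustration (again using httpx and httpbin.org's cookie testing endpoints), a client session picks up Set-Cookie headers automatically and re-sends the stored cookies with later requests:

import httpx

with httpx.Client() as client:
    # this endpoint replies with a Set-Cookie header which the client stores
    client.get("http://httpbin.org/cookies/set?flavor=chocolate")
    print(dict(client.cookies))
    # {'flavor': 'chocolate'}
    # the stored cookie is now attached to every following request automatically
    response = client.get("http://httpbin.org/cookies")
    print(response.json())
    # {'cookies': {'flavor': 'chocolate'}}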

For the entire list of all http headers see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

Finally, just like with request headers, headers prefixed with X- are custom web functionality headers and depend on each individual website.


We took a brief overview of the core HTTP components, and now it's time to give it a go and see how HTTP works in practical Python!

Choosing an HTTP Client in Python

Before we start exploring HTTP connections in Python, we need to choose our HTTP client. Python already comes with a built-in HTTP client called urllib, however it's not as well suited for web scraping as the many other alternatives Python has to offer!

The HTTP protocol has 3 distinct versions, and different libraries support different versions:

  • HTTP1.1 is the simplest, text-based protocol, used widely by simpler programs. Implemented by urllib, requests, httpx, aiohttp
  • HTTP2 is a more complex/efficient binary-based protocol, mostly used by web browsers. Implemented by httpx
  • HTTP3/QUIC is the newest and most efficient version of the protocol, mostly used by web browsers. Implemented by aioquic, httpx (planned)

As you can see, python has a very healthy http client ecosystem! Fortunately, we don't need to worry about these at the beginning of our scraping career, but having this overview helps us to choose a good starting ground for our connections.

For this article we'll be sticking with httpx as it offers all the features required for web-scraping. That being said, other http clients like requests, aiohttp etc. can be used almost interchangeably.
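For instance, httpx can even talk HTTP/2 - here's a hedged sketch, assuming httpx was installed with its optional http2 extra (pip install "httpx[http2]") and that the target server supports the newer protocol:

import httpx

# request a page over HTTP/2; httpx falls back to HTTP/1.1 if the server doesn't support it
with httpx.Client(http2=True) as client:
    response = client.get("https://www.example.com/")
    print(response.http_version)
    # e.g. HTTP/2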

Let's see how we can utilize HTTP connections for scraping in Python. First, let's set up our working environment: we'll need Python version 3.7 or later and the httpx library:

$ python --version
Python 3.7.4
$ pip install httpx

With httpx installed, we have everything we need to start connecting and receiving our documents. Let's give it a go!

Exploring HTTP with httpx

Now that we have a basic understanding of the HTTP protocol and our working environment ready, let's finally see it in action!
In this section, we'll experiment with basic web scraping scenarios to further understand the HTTP protocol in practice.

For our example case study, we'll be using the http://httpbin.org request testing service, which accepts our requests and echoes back exactly what it received.

GET Requests

Let's start off with GET type requests, which are the most common type of request in web scraping.
To put it shortly, GET simply means: give me the document located at this location. For example, a GET https://www.httpbin.org/html request would be asking for the /html document from the httpbin.org server:

import httpx
response = httpx.get("https://www.httpbin.org/html")
html = response.text  # the document content itself
metadata = response.headers  # the response metadata (headers)
print(html)
print(metadata)

Here, we perform the most basic GET request possible. However, just requesting the document often might not be enough. As we've explored before, requests are made up of a request type, location, headers and optional content. So what about headers?

Request Metadata - Headers

We've already done a theoretical overview of request headers, and since they're so important in web scraping, let's take a look at how we can use them with our HTTP client:

import httpx
response = httpx.get('http://httpbin.org/headers')
print(response.text)

In this example we're using httpbin.org's header testing endpoint, which returns the HTTP inputs (headers, body) we sent as the response body. If we run this code without setting any specific headers, we can see that the client generates some basic ones automatically:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-httpx/0.19.0", 
  }
}

Even though we didn't explicitly provide any headers in our request, httpx generated the required basics for us. By using the headers argument, we can specify custom headers ourselves:

import httpx
response = httpx.get('http://httpbin.org/headers', headers={"User-Agent": "ScrapFly's Web Scraping Tutorial"})
print(response.text)
# will print:
# {
#   "headers": {
#     "Accept": "*/*", 
#     "Accept-Encoding": "gzip, deflate, br", 
#     "Host": "httpbin.org", 
#     "User-Agent": "ScrapFly's Web Scraping Tutorial", 
#     ^^^^^^^ - we changed this!
#   }
# }

As you can see above, we used a custom User-Agent header for this request, while the other headers remain automatically generated by our client. We'll talk more about headers in the "Challenges" section below, but for now most light web scraping tasks can work well with the headers httpx generates for us!

POST Requests

As we've discovered, GET type requests just mean "get me that document"; however, sometimes that might not be enough information for the server to serve the correct content.

Often, interactive web apps use POST type requests to transmit their data, which essentially means the server requires some document to be sent for it to return the correct document back. Usually this means we are sending a document containing some parameters and the server returns us content matching these parameters.

Common examples are various forms and user inputs. For example, search bars often submit the search query via a POST request and the server returns the matching results.

Let's take a quick look at how we can use POST requests in httpx:

import httpx
response = httpx.post("http://httpbin.org/post", json={"question": "Why is 6 afraid of 7?"})
print(response.text)
# will print:
# {
#   ...
#   "data": "{\"question\": \"Why is 6 afraid of 7?\"}", 
#   "headers": {
#     "Content-Type": "application/json", 
#      ...
#   }, 
# }

As you can see, if we submit this request, the server will receive some data and a header indicating that this data is of the application/json type. With this information, the server will do some thinking and return us a document in exchange. In this imaginary scenario, we submit a document with question data and the server would return us the answer.
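JSON is not the only option here: classic HTML forms usually submit URL-encoded data instead. As a rough sketch (again against httpbin.org's testing endpoint), httpx's data argument sends the same kind of form payload a browser would:

import httpx

# the data argument sends application/x-www-form-urlencoded content, just like an HTML form
response = httpx.post("http://httpbin.org/post", data={"query": "web scraping", "page": "1"})
print(response.json()["form"])
# {'query': 'web scraping', 'page': '1'}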

Putting It All Together

Now that we have briefly introduced ourselves to HTTP in Python, let's apply it practically!
In this section, we have a short challenge: we have multiple URLs whose HTML contents we want to retrieve. Let's see what sort of practical challenges we might encounter and how real web scraping programs function.

import httpx

# here is a list of urls, in this example we'll just use some placeholders
urls = [
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
]
# as discussed in the headers chapter we should always stick to browser-like headers for our
# requests to prevent being blocked
headers = {
    # let's use Chrome browser on Windows:
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}

# since we have multiple urls we want to scrape we should establish a persistent session
with httpx.Client(headers=headers) as session:
    for url in urls:
        response = session.get(url)
        html = response.text
        meta = response.headers
        print(html)
        print(meta)

As you can see, there's quite a bit going on here. Let's unpack the most important bits in greater detail:

Why are we using custom headers?
As we've discussed in the headers chapter, we must mask our web scraper to appear as a web browser to prevent being blocked. While httpbin.org doesn't block any requests, it's generally good practice to set at least the User-Agent and Accept headers when web scraping public targets.

What is httpx.Client?
We could skip it and call httpx.get() for each url instead:

for url in urls:
    response = httpx.get(url, headers=headers)

However, as we've covered earlier, HTTP is not a persistent protocol by default - meaning every time we call httpx.get() we establish a new connection with the server and only then exchange our request/response objects.
To optimize this exchange we can establish a session, which is usually referred to as Connection Pooling or HTTP persistent connections.
In other words, this session will establish the connection only once and exchange all 5 of our requests within it - that's 1 connection vs 5. And what if we're scraping thousands of pages?
Using HTTP client sessions not only optimizes our code but also provides some convenient shortcuts. As in our example, instead of providing our custom headers for every request, we can set them for the entire session!
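Speaking of thousands of pages: httpx also ships an asynchronous client, which lets a single pooled session fetch many pages concurrently rather than one after another. Here's a hedged sketch reusing the same placeholder URL and browser-like User-Agent as above:

import asyncio
import httpx

# the same browser-like identity we used in the synchronous example
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
}

async def scrape(urls):
    # one pooled session is shared by all concurrent requests
    async with httpx.AsyncClient(headers=HEADERS) as session:
        responses = await asyncio.gather(*(session.get(url) for url in urls))
        return [response.text for response in responses]

htmls = asyncio.run(scrape(["http://httpbin.org/html"] * 5))
print(len(htmls))
# 5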

In this chapter, we took a quick look at a basic HTTP content scraper and two small but important practical details: connection pooling and masking web scrapers with browser-like headers.

Before we continue with parsing, let's take a look at other challenges that are out of scope of this article but are worth keeping in mind.

Challenges

Web scraping is a unique niche because web scraper clients and web servers are in constant disagreement about their goals: web servers expect to serve humans, while web scrapers just want to collect machine-parsable content. This causes a lot of unique challenges: web scrapers get blocked because they don't look like humans; web content structure changes, so web scrapers need to be updated to understand it; web servers expect the client to be a browser, while web scrapers often don't have all the capabilities browsers do; and so on.

Fortunately, the web scraping community is pretty big and can often help solve these issues, so for more specific help it's worth checking out the popular web scraping discussion forums.

On top of that, we at ScrapFly have years of experience with these issues and have worked hard to provide a one-size-fits-all solution via our ScrapFly middleware API. Let's take a quick look at how ScrapFly solves many of these issues without any extra intervention from the user.

ScrapFly's Approach - Brains In The Middle

Here at ScrapFly we recognize the difficulties of web scraping and came up with an API solution that solves these issues for our users.
ScrapFly is essentially an intelligent negotiator that sits between your scraper and your target: instead of connecting to the target itself, your scraper asks the ScrapFly API to do it on its behalf and simply return the results:

ScrapFly acts as a smart rendering middleware

As you can see, this abstraction layer can greatly increase performance and reduce the complexity of many web-scrapers by deferring these issues to us!

Let's take a look at how our example scraper would look with the ScrapFly SDK. First, let's install the SDK with pip install scrapfly-sdk and kick things off:

from scrapfly import ScrapflyClient, ScrapeConfig

urls = [
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
]
with ScrapflyClient(key='<YOUR KEY>') as client:
    for url in urls:
        response = client.scrape(scrape_config=ScrapeConfig(
            url=url,
        ))
        html = response.scrape_result['content']

As you can see, our code with ScrapFly looks almost the same, except we got rid of a lot of complexity, such as faking our headers as we did in our httpx-based scraper. In addition, ScrapFly offers a plethora of additional features like proxies, bot detection avoidance, javascript rendering and much more!

For javascript rendering, see: https://scrapfly.io/blog/scraping-using-browsers

Summary

In this article, we took an in-depth dive into HTTP connections in the context of web scraping. We looked into what HTTP requests and responses are and which details are particularly relevant to web scraping. Then we got into Python itself: we overviewed the available libraries and wrote some basic code that collects HTML documents.

Finally, we took a quick overview of the main challenges web scrapers face during the connection step of the process and how ScrapFly's very own tool can help resolve these issues.

With this knowledge, you're ready to scrape the web! In the next post, we'll take a look at how we can extract valuable data from the HTML we've gathered and how we can put these two steps together into fully functional web scrapers:

Next: Web Scraping With Python 102: Parsing

Additional Reading

In this article we've covered hands-on web scraping; however, when scaling to hundreds of thousands of requests, reinventing the wheel can be a suboptimal and painful experience. For this, it might be worth taking a look into web scraping frameworks like Scrapy, which is a convenient abstraction layer around everything we've learned today and more!

Scrapy implements a lot of shortcuts and optimizations that would otherwise be difficult to implement by hand, such as request concurrency, retry logic and countless community extensions for handling various niche cases. That being said, to really enjoy Scrapy it's good to have a basic understanding of core web scraping ideas, so don't skip the next article!

Next: Web Scraping With Python 102: Parsing

