Web Scraping With Python Tutorial


One of the biggest revolutions of the 21st century is the realization of how valuable data can be. The great news is that the internet is full of public data for you to take advantage of, and that's exactly the purpose of web scraping: collecting this public data to bootstrap a new business or project.

In this practical introduction to web scraping in Python, we'll take a deep look at what exactly web scraping is, the technologies that power it, and some common challenges modern web scraping projects face.

For this, we'll explore the entire web scraping with Python process:
We'll start off by learning about HTTP and how to use HTTP clients in Python to collect web page data. Then we'll take a look at parsing HTML page data using CSS and XPATH selectors. Finally, we'll build an example web scraper in Python for producthunt.com product data to solidify what we've learned.

What is Web Scraping?

Web scraping is essentially public data collection via automated process. There are thousands of reasons why one might want to collect this public data, like finding potential employees or gathering competitive intelligence. We at ScrapFly did extensive research into web scraping applications, and you can find our findings here on our Web Scraping Use Cases page.

To scrape a website with python we're generally dealing with two types of problems: collecting the public data available online and then parsing this data for structured product information. In this article, we'll take a look at both of these steps and solidify the knowledge with an example project.

Connection: HTTP Fundamentals

To collect data from a public resource, we need to establish a connection with it first. Most of the web is served over HTTP, which is rather simple: we (the client) send a request to the website (the server) for a specific document and, once the server processes our request, it replies with the requested document - a very straightforward exchange!

illustration of a standard HTTP exchange

As you can see in this illustration: we send a request object which consists of a method (aka type), a location and headers. In turn, we receive a response object which consists of a status code, headers and the document content itself.
Let's take a quick look at each of these components, what they mean and how they are relevant in web scraping.

Understanding Requests and Responses

When it comes to web scraping we don't exactly need to know every little detail about HTTP requests and responses; however, it's good to have a general overview and to know which parts of this protocol are especially useful in web scraping. Let's take a look at exactly that!

Request Method

HTTP requests are conveniently divided into a few types that perform distinct functions:

  • GET requests are intended to request a document.
  • POST requests are intended to request a document by sending a document.
  • HEAD requests are intended to request a document's meta information.
  • PATCH requests are intended to update a document.
  • PUT requests are intended to either create a new document or update it.
  • DELETE requests are intended to delete a document.

When it comes to web scraping, we are mostly interested in collecting documents, so we'll be mostly working with GET and POST type requests. Additionally, HEAD requests can be useful in web scraping to optimize bandwidth - sometimes, before downloading the document, we might want to check its metadata to see whether it's worth the effort.
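For instance, here's a minimal sketch (using the httpx client we introduce later in this article) of checking a document's metadata with a HEAD request before committing to a full download:

import httpx

# HEAD returns only the response headers - no document body is downloaded
response = httpx.head("https://www.httpbin.org/html")
print(response.status_code)                    # e.g. 200
print(response.headers.get("Content-Type"))    # e.g. text/html; charset=utf-8
print(response.headers.get("Content-Length"))  # document size in bytes, if the server provides it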

Request Location

To understand resource location, first we should take a quick look at URL's structure itself:

example of a URL structure

Here, we can visualize each part of a URL: we have the protocol, which when it comes to HTTP is either http or https. Then, we have the host, which is essentially the address of the server, either a domain name or an IP address. Finally, we have the location of the resource and some custom parameters.
If you're ever unsure of a URL's structure, you can always fire up Python and let it figure it out for you:

from urllib.parse import urlparse
urlparse("http://www.domain.com/path/to/resource?arg1=true&arg2=false")
> ParseResult(scheme='http', netloc='www.domain.com', path='/path/to/resource', params='', query='arg1=true&arg2=false', fragment='')

Request Headers

While it might appear like request headers are just minor metadata details, in web scraping they are extremely important! Headers contain essential details about the request, like: who's requesting the data? What type of data are they expecting? Getting these wrong might result in the web scraper being denied access.

Let's take a look at some of the most important headers and what they mean:

User-Agent is an identity header that tells the server who's requesting the document.

# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

Whenever you visit a web page in your web browser, it identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers". This helps the server determine whether to serve or deny the client. In web scraping, of course, we don't want to be denied access, so we have to blend in by faking our user agent to look like that of a browser.

There are many online databases that contain the latest user-agent strings of various platforms, like the one provided by whatismybrowser.com

Cookie is used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc. Cookies are a bit out of scope of this article, but we'll be covering them in the future.

Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content we're expecting. Generally, when web scraping we want to mimic the values of a popular web browser. For example, the Chrome browser uses:

text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

For all standard values see content negotiation header list by MDN

X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of a website/webapp.

These are a few of the most important observations; for more, see the extensive full documentation over at https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

Response Status Code

Conveniently, all HTTP responses come with a status code that indicates whether the request is a success, a failure, or whether some alternative action is requested (like a request to authenticate or to follow a redirect).
Let's take a quick look at the status codes that are most relevant to web scraping:

  • 200 range codes generally mean success!
  • 300 range codes tend to mean redirection - in other words, if we request content at /product1.html it might have been moved to a new location like /products/1.html, and the server would inform us about that.
  • 400 range codes mean the request is malformed or denied. Our web scraper could be missing some headers, cookies or authentication details.
  • 500 range codes typically mean server issues. The website might be unavailable right now or is purposefully disabling access to our web scraper.

For all standard HTTP response codes see HTTP status list by MDN
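As a quick illustration, here's a small sketch (again using httpx, covered in the next section) of branching on these status code ranges; the exact handling logic is just one possible approach:

import httpx

response = httpx.get("http://httpbin.org/status/200")
if response.status_code == 200:
    print("success!")
elif 300 <= response.status_code < 400:
    # the document has moved - the new location is in the Location header
    print("redirected to:", response.headers.get("Location"))
elif response.status_code in (401, 403):
    print("denied - we might be missing headers, cookies or authentication")
else:
    print("something went wrong:", response.status_code)

# alternatively, raise_for_status() raises an httpx.HTTPStatusError for any 4xx/5xx response
try:
    httpx.get("http://httpbin.org/status/500").raise_for_status()
except httpx.HTTPStatusError as error:
    print("request failed:", error)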

Response Headers

When it comes to web scraping, response headers provide some important information for connection functionality and efficiency.
For example, Set-Cookie header requests our client to save some cookies for future requests, which might be vital for website functionality. Other headers such as Etag, Last-Modified are intended to help client with caching to optimize resource usage.

For all options see the standard HTTP response header list by MDN

Finally, just like with request headers, headers prefixed with X- are custom web functionality headers and depend on each individual website.
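To make this concrete, here's a small sketch of inspecting response headers with httpx, using httpbin.org's response-headers endpoint which echoes query parameters back as response headers:

import httpx

response = httpx.get("http://httpbin.org/response-headers?Set-Cookie=session=12345&ETag=abc123")
# response headers behave like a case-insensitive dictionary
print(response.headers.get("set-cookie"))  # session=12345
print(response.headers.get("ETag"))        # abc123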


We took a brief overview of the core HTTP components, and now it's time to see how HTTP works in practical Python!

HTTP Clients in Python

Before we start exploring HTTP connections in Python, we need to choose an HTTP client. Python comes with an HTTP client built in called urllib, however for web scraping we need something more feature-rich and easier to handle, so let's take a look at the popular community libraries.

The first thing to note when it comes to HTTP is that it has 3 distinct versions:

  • HTTP1.1 - the simplest, text-based protocol, used widely by simpler programs. Implemented by urllib, requests, httpx, aiohttp
  • HTTP2 - a more complex and efficient binary-based protocol, mostly used by web browsers. Implemented by httpx
  • HTTP3/QUIC - the newest and most efficient version of the protocol, mostly used by web browsers. Implemented by aioquic, httpx (planned)

As you can see, Python has a very healthy HTTP client ecosystem. When it comes to web scraping, HTTP1.1 is good enough for most cases; however, HTTP2/3 are very helpful for avoiding web scraper blocking.
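For example, httpx (which we'll settle on below) can speak HTTP2 once its optional h2 dependency is installed - a quick sketch:

import httpx

# HTTP/2 support requires an optional extra: pip install "httpx[http2]"
with httpx.Client(http2=True) as client:
    response = client.get("https://www.example.com/")
    print(response.http_version)  # "HTTP/2" if the server supports it, otherwise "HTTP/1.1"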

We'll be sticking with httpx as it offers all the features required for web scraping. That being said, other HTTP clients like the requests library can be used almost interchangeably.

Let's see how we can utilize HTTP connections for scraping in Python. First, let's set up our working environment. We'll need Python version 3.7+ and the httpx library:

$ python --version
Python 3.7.4
$ pip install httpx

With httpx installed, we have everything we need to start connecting and receiving our documents. Let's give it a go!

Exploring HTTP with httpx

Now that we have a basic understanding of HTTP and our working environment ready, let's see it in action!
In this section, we'll experiment with basic web scraping scenarios to further understand HTTP in practice.

For our example case study, we'll be using the http://httpbin.org request testing service, which accepts our requests and echoes back exactly what was sent.

GET Requests

Let's start off with GET type requests, which are the most common type of request in web scraping.
To put it shortly, GET often simply means "give me the document located at <location>". For example, a GET https://www.httpbin.org/html request would ask for the /html document from the httpbin.org server. Let's try it out by retrieving a producthunt.com product page:

import httpx
response = httpx.get("https://www.producthunt.com/posts/evernote")
html = response.text
metadata = response.headers
print(html)
print(metadata)

Here, we perform the most basic GET request possible. However, just requesting the document often might not be enough. As we've explored before, requests are made of a request type, location, headers and optional content. So what are headers?

Request Metadata - Headers

We've already done a theoretical overview of request headers and since they're so important in web scraping let's take a look at how we can use them with our HTTP client:

import httpx
response = httpx.get('http://httpbin.org/headers')
print(response.text)

In this example, we're using httpbin.org's testing endpoint for headers, which returns the HTTP inputs (headers, body) we sent as the response body. If we run this code, we can see that the client is generating some basic headers automatically:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-httpx/0.19.0", 
  }
}

Even though we didn't explicitly provide any headers in our request, httpx generated the required basics for us. By using the headers argument, we can specify custom headers ourselves:

import httpx
response = httpx.get('http://httpbin.org/headers', headers={"User-Agent": "ScrapFly's Web Scraping Tutorial"})
print(response.text)
# will print:
# {
#   "headers": {
#     "Accept": "*/*", 
#     "Accept-Encoding": "gzip, deflate, br", 
#     "Host": "httpbin.org", 
#     "User-Agent": "ScrapFly's Web Scraping Tutorial", 
#     ^^^^^^^ - we changed this!
#   }
# }

As you can see above, we used a custom User-Agent header for this request, while the other headers remain automatically generated by our client. We'll talk more about headers in the "Challenges" section below, but for now most basic web scraping can work well with the headers httpx generates for us.

POST Requests

As we've discovered, GET type requests just mean "get me that document", however sometimes that might not be enough information for the server to serve the correct content.

On the other hand, POST requests are the opposite: "take this document". Why would we want to give someone a document when web scraping? Some website operations require a complex set of parameters to process the request. For example, to render a search results page the website needs the query parameters of what to search for. So, as a web scraper, we would send a document containing search parameters and in return we'd get a document containing search results.

Let's take a quick look at how we can use POST requests in httpx:

import httpx
response = httpx.post("http://httpbin.org/post", json={"question": "Why is 6 afraid of 7?"})
print(response.text)
# will print:
# {
#   ...
#   "data": "{\"question\": \"Why is 6 afraid of 7?\"}", 
#   "headers": {
#     "Content-Type": "application/json", 
#      ...
#   }, 
# }

As you can see, if we submit this request, the server will receive some JSON data and a Content-Type header indicating the type of this document (application/json). With this information, the server will do some thinking and return us a document in exchange. In this imaginary scenario, we submit a document with question data, and the server would return us the answer.

Configuring Proxies

Making thousands of connections from a single address is an easy way to be identified as a web scraper, which might result in being blocked. Additionally, some websites are only available in certain regions of the world. This means we are at a great advantage if we can mask the origin of our connections by using a proxy.

The httpx library supports extensive proxy options for both HTTP and SOCKS5 type proxies:

import httpx
response = httpx.get(
    "http://httpbin.org/ip",
    # we can set a proxy for all requests
    proxies={"all://": "http://111.22.33.44:8500"},
    # or we could instead route only specific domains through a proxy:
    # proxies={"all://only-in-us.com": "http://us-proxy.com:8500"},
)

Introduction To Proxies in Web Scraping

For more on proxies in web scraping see our full introduction tutorial which explains different proxy types and how to correctly manage them in web scraping projects.

Introduction To Proxies in Web Scraping

Managing Cookies

Cookies are used to help the server track its clients. They enable persistent connection details such as login sessions or website preferences.
In web scraping, we can encounter websites that cannot function without cookies, so we must replicate them in our HTTP client connection. In httpx we can use the cookies argument:

import httpx

# we can either use dict objects
cookies = {"login-session": "12345"}
# or more advanced httpx.Cookies manager:
cookies = httpx.Cookies()
cookies.set("login-session", "12345", domain="httpbin.org")

response = httpx.get('https://httpbin.org/cookies', cookies=cookies)

Putting It All Together

Now that we have briefly introduced ourselves to HTTP clients in Python, let's apply it in practice and scrape some stuff!
In this section, we have a short challenge: we have multiple URLs that we want to retrieve the HTML contents of. Let's see what sort of practical challenges we might encounter and how real web scraping programs function.

import httpx

# here is a list of urls, in this example we'll just use some place holders
urls = [
    "http://httbin.org/html", 
    "http://httbin.org/html",
    "http://httbin.org/html",
    "http://httbin.org/html",
    "http://httbin.org/html",
]
# as discussed in headers chapter we should always stick to browser-like headers for our 
# requests to prevent being blocked
headers = {
    # lets use Chrome browser on Windows:
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}

# since we have multiple urls we want to scrape we should establish a persistent session
with httpx.Client(headers=headers) as session:
    for url in urls:
        response = session.get(url)
        html = response.text
        meta = response.headers
        print(html)
        print(meta)

As you can see, there's quite a bit going on here. Let's unpack the most important bits in greater detail:

Why are we using custom headers?
As we've discussed in the headers chapter, we must mask our web scraper to appear as a web browser to prevent being blocked. While httpbin.org is not blocking any requests, it's generally a good practice to set at least User-Agent and Accept headers when web-scraping public targets.

What is httpx.Client?
We could skip it and call httpx.get() for each url instead:

for url in urls:
    response = httpx.get(url, headers=headers)
# vs
with httpx.Client(headers=headers) as session:
    response = session.get(url)

However, as we've covered earlier, HTTP is not a persistent protocol - meaning every time we call httpx.get() we would connect with the server anew and only then exchange our request/response objects. To optimize this exchange, we can establish a session, which is usually referred to as Connection Pooling or HTTP persistent connection.

In other words, this session will establish the connection only once and continue exchanging our requests until we close it. Using sessions not only optimizes our code, but also provides some convenient shortcuts, like setting default headers and managing cookies and redirects automatically.
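For example, here's a small sketch showing how a session automatically carries cookies set by earlier responses over to later requests:

import httpx

with httpx.Client() as session:
    # this httpbin endpoint sets a cookie via the Set-Cookie response header
    session.get("http://httpbin.org/cookies/set?session-token=12345")
    # the session remembers the cookie and sends it back automatically
    response = session.get("http://httpbin.org/cookies")
    print(response.text)  # {"cookies": {"session-token": "12345"}}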


We've got a good grip on HTTP, so now let's take a look at the second part of web scraping: parsing!

Parsing HTML Content

HTML (HyperText Markup Language) is a text data structure that powers the web. The great thing about HTML structure is that it's intended to be machine-readable text content, which is great news for web-scraping as we can easily parse the data with code!

HTML is a tree type structure that lends itself easily to parsing. For example, let's take this simple HTML content:

<head>
    <title>My Website</title>
</head>
<body>
    <h1>Welcome to my website!</h1>
    <div class="content">
        <p>This is my website</p>
        <p>Isn't it great?</p>
    </div>
</body>

Here we see a basic HTML document that a simple website might serve. You can already see the tree-like structure just by the indentation of the text, but we can even go further and illustrate it:

Example of an HTML node tree. Note that branches are ordered left-to-right and each element can contain extra properties.

This tree structure is brilliant for web scraping as we can easily navigate the whole document.
For example, to find the title of the website, we can see that it's inside the <title> node, which is under the <head> tag. In other words - if we wanted to extract 1000 titles for 1000 different pages, we would write a rule to find head->title->text for every one of them.

When it comes to HTML parsing using path instructions, there are two standard ways to approach this: CSS selectors and XPATH selectors - let's take a look at them.

Using CSS and XPATH Selectors

There are two HTML parsing standards:

  • CSS selectors - simpler, more brief, less powerful
  • XPATH selectors - more complex, longer, very powerful

Generally, modern websites can be parsed with CSS selectors alone, however sometimes the HTML structure can be so complex that having that extra XPATH power makes things much easier. We'll be mixing both - we'll stick with CSS where we can and otherwise fall back to XPATH.
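As a tiny taste of both syntaxes, here's a sketch of selecting the same link two ways using parsel (which we'll install just below):

from parsel import Selector

tree = Selector('<div class="content"><a href="/about">About us</a></div>')
# CSS selector: short and readable
print(tree.css("div.content a::attr(href)").get())          # /about
# XPATH selector: more verbose but more expressive
print(tree.xpath("//div[@class='content']/a/@href").get())  # /about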

Parsing HTML with CSS Selectors

For more on CSS selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms

Parsing HTML with CSS Selectors
Parsing HTML with Xpath

For more on XPATH selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms

Parsing HTML with Xpath

Since Python's standard library doesn't include a fully-fledged HTML parser with selector support, we must choose a library which provides such capability. In Python, there are several options, but the two biggest libraries are beautifulsoup (beautifulsoup4) and parsel.

We'll be using the parsel HTML parsing package in this chapter, but since CSS and XPATH selectors are the de facto standard ways of parsing HTML, we can easily apply the same knowledge to the beautifulsoup library as well as to HTML parsing libraries in other programming languages.
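For comparison, here's a rough sketch of how a similar CSS selector lookup might look with beautifulsoup (assuming beautifulsoup4 is installed) - we won't use it further in this article:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="content"><a href="/about">About us</a></div>', "html.parser")
# select_one() accepts CSS selectors, much like parsel's .css()
link = soup.select_one("div.content a")
print(link["href"])     # /about
print(link.get_text())  # About us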

Let's install the python library parsel and do a quick introduction:

$ pip install parsel
$ pip show parsel
Name: parsel
Version: 0.6.0
...

For more on parsel see official documentation

Now with our package installed, let's give it a spin with this imaginary HTML content:

# for this example we're using a simple website page
HTML = """
<head>
    <title>My Website</title>
</head>
<body>
    <div class="content">
        <h1>First blog post</h1>
        <p>Just started this blog!</p>
        <a href="http://github.com/scrapfly">Checkout My Github</a>
        <a href="http://twitter.com/scrapfly_dev">Checkout My Twitter</a>
    </div>
</body>
"""
from parsel import Selector

# first we must build a parsable tree object from the HTML text string
tree = Selector(HTML)
# once we have the tree object we can start executing our selectors
# we can use css selectors:
github_link = tree.css('.content a::attr(href)').get()
# we can also use xpath selectors:
twitter_link = tree.xpath('//a[contains(@href,"twitter.com")]/@href').get()
title = tree.css('title').get()
article_text = ''.join(tree.css('.content ::text').getall()).strip()
print(title)
print(github_link)
print(twitter_link)
print(article_text)
# will print:
# <title>My Website</title>
# http://github.com/scrapfly
# http://twitter.com/scrapfly_dev
# First blog post
# Just started this blog!
# Checkout My Github

In this example, we used the parsel package to create a parse tree from existing HTML text. Then, we used the CSS and XPATH selector functions of this parse tree to extract the title, the Github link, the Twitter link and the article's text.

Example Project: producthunt.com scraper

In the previous sections we've covered how to download HTML documents using the httpx client and how to use CSS and XPATH selectors to parse HTML data using the parsel package. Now let's put all of this together and write a small scraper!

In this section we'll be scraping https://www.producthunt.com/ which is essentially a technical product directory where people submit and discuss new digital products.

Let's start with the scraper's source code:

import httpx
import json
from parsel import Selector

DEFAULT_HEADERS = {
    # lets use Chrome browser on Windows:
    "User-Agent": "Mozilla/4.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=-1.9,image/webp,image/apng,*/*;q=0.8",
}


def parse_product(response):
    tree = Selector(response.text)
    return {
        "url": str(response.url),
        'name': tree.css('h1 ::text').get(),
        'subtitle': tree.css('h2 ::text').get(),
        # votes are located under <span> which contains bigButtonCount in class names
        'votes': tree.css("span[class*='bigButtonCount']::text").get(),
        # tags is our most complex location
        # tag links are under div which contains topicPriceWrap class 
        # and tag links are only valid if they have /topic/ in them
        'tags': tree.xpath(
            "//div[contains(@class,'topicPriceWrap')]"
            "//a[contains(@href, '/topics/')]/text()"
        ).getall(),

    }

def scrape_products(urls):
    results = []
    with httpx.Client(headers=DEFAULT_HEADERS) as session:
        for url in urls:
            response = session.get(url)
            results.append(parse_product(response))
    return results

if __name__ == '__main__':
    results = scrape_products([
        "https://www.producthunt.com/posts/notion-8",
        "https://www.producthunt.com/posts/obsidian-4",
        "https://www.producthunt.com/posts/evernote",
    ])
    print(json.dumps(results, indent=2))

In this little scraper we provide a list of producthunt.com product urls and have our scraper collect and parse basic product data from each one of them.

Scraper results example
[
  {
    "url": "https://www.producthunt.com/posts/notion-8",
    "name": "Notion",
    "subtitle": "Artificial intelligence-powered email.",
    "tags": [
      "Android",
      "iPhone",
      "Email"
    ],
    "votes": "0,650"
  },
  {
    "url": "https://www.producthunt.com/posts/obsidian-4",
    "name": "Obsidian",
    "subtitle": "A powerful knowledge base that works on local Markdown files",
    "tags": [
      "Productivity",
      "Note"
    ],
    "votes": "0,706"
  },
  {
    "url": "https://www.producthunt.com/posts/evernote",
    "name": "Evernote",
    "subtitle": "Note taking made easy",
    "tags": [
      "Android",
      "iPhone",
      "iPad"
    ],
    "votes": "299"
  }
]

Thanks to Python's rich ecosystem, we've accomplished this single page scraper in around 40 lines of code - awesome!

Further, let's modify our script so it finds product urls by itself by scraping producthunt.com topics. For example, /topics/productivity contains a list of products that are intended to boost digital productivity:

from urllib.parse import urljoin

def parse_topic(response):
    tree = Selector(text=response.text)
    # get product relative urls:
    urls = tree.xpath("//li[contains(@class,'styles_item')]/a/@href").getall()
    # turn relative urls to absolute urls and return them
    return [urljoin(str(response.url), url) for url in urls]

def scrape_topic(topic):
    with httpx.Client(headers=DEFAULT_HEADERS) as session:
        response = session.get(f"https://www.producthunt.com/topics/{topic}")
        return parse_topic(response)

if __name__ == '__main__':
    urls = scrape_topic("productivity")
    results = scrape_products(urls)
    print(json.dumps(results, indent=2))

Now we have a full scraping loop: we retrieve product urls from a directory page and then scrape each of them individually!

We could further improve this scraper with paging support, as right now we're only scraping the first page of each topic, and implement error and failure handling as well as some tests. That being said, this is a good entry point into the web scraping world, as we've tried out many things covered in this article, like header faking, using client sessions and parsing HTML with CSS and XPATH selectors.
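As a starting point for the error handling part, here's a minimal retry helper sketch; the function name, retry count and backoff values are illustrative rather than part of the scraper above:

import time
import httpx

def get_with_retries(session: httpx.Client, url: str, retries: int = 3) -> httpx.Response:
    # hypothetical helper: retry transient failures with exponential backoff
    for attempt in range(retries):
        try:
            response = session.get(url)
            response.raise_for_status()  # raise an exception on 4xx/5xx responses
            return response
        except (httpx.TransportError, httpx.HTTPStatusError):
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s... between attempts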

Challenges

When it comes to web scraping challenges, we can put them into a few distinct categories:

Dynamic Content

In this article we used HTTP clients to retrieve data, however our Python environment is not a web browser and it can't execute the complex javascript-powered behavior some websites use. The most common example of this is dynamic data loading, where the page URL doesn't change but clicking a button changes some data on the page. To scrape this, we either need to reverse engineer the website's javascript behavior or use web browser automation with headless browsers.

Scraping Dynamic Websites Using Browser

For browser usage in web scraping see our full introduction article which covers the most popular tools Selenium, Puppeteer and Playwright

Scraping Dynamic Websites Using Browser

Connection Blocking

Unfortunately, not every website tolerates web scrapers and many block them outright. To avoid this, we need to ensure that our web scraper looks and behaves like a web browser user. We've taken a look at using web browser headers to accomplish this, but there's much more to it.

Parsing Content

Even though HTML content is machine parsable, many website developers don't create it with this intention. So, we might encounter HTML files that are really difficult to digest. XPATH and CSS selectors are really powerful, and combined with regular expressions or natural language parsing we can confidently extract any data an HTML document could present. If you're stuck with parsing, we highly recommend the #xpath and #css-selectors tags on stackoverflow.
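For example, parsel selectors can be chained with regular expressions via .re() and .re_first() - a quick sketch:

from parsel import Selector

tree = Selector('<p class="price">Price: $19.99 (was $29.99)</p>')
# extract just the numbers out of messy text content
print(tree.css("p.price::text").re_first(r"\$([\d.]+)"))  # 19.99
print(tree.css("p.price::text").re(r"\$([\d.]+)"))        # ['19.99', '29.99']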

Web Scraper Scaling

There's a lot of data online, and while scraping a few documents is easy, scaling that to thousands and millions of HTTP requests and documents can quickly introduce a lot of challenges, ranging from web scraper blocking to handling multiple concurrent connections.

For bigger scrapers we highly recommend taking advantage of Python's asynchronous ecosystem. Since HTTP connections involve a lot of waiting, async programming allows us to schedule and handle multiple connections concurrently. For example, in httpx we can manage both synchronous and asynchronous connections:

import httpx
import asyncio
from time import time

urls_20 = [f"http://httpbin.org/links/20/{i}" for i in range(20)]

def scrape_sync():
    _start = time()
    with httpx.Client() as session:
        for url in urls_20:
            session.get(url)
    return time() - _start

async def scrape_async():
    _start = time()
    async with httpx.AsyncClient() as session:
        await asyncio.gather(*[session.get(url) for url in urls_20])
    return time() - _start

if __name__ == "__main__":
    print(f"sync code finished in: {scrape_sync():.2f} seconds")
    print(f"async code finished in: {asyncio.run(scrape_async()):.2f} seconds")

Here, we have two functions that scrape 20 urls. One synchronous and one taking advantage of asyncio's concurrency. If we run them we can see a drastic speed difference:

sync code finished in: 7.58 seconds
async code finished in: 0.89 seconds

Fortunately, the web scraping community is pretty big and can often help solve these issues.

We at ScrapFly have years of experience with these issues and worked hard to provide a one-size-fits-all solution via our ScrapFly API, where many of these challenges are solved automatically!

ScrapFly

Here at ScrapFly we recognize the difficulties of web scraping and came up with an API solution that solves these issues for our users.
ScrapFly is essentially an intelligent middleware that sits between your scraper and your target. Your scraper, instead of connecting to the target itself, asks the ScrapFly API to do it and return the end result.

illustration of scrapfly's middleware

This abstraction layer can greatly increase performance and reduce the complexity of many web-scrapers by offloading common web scraping issues away from the scraper code!

Let's take a look at how our example scraper would look with the ScrapFly SDK. We can install the ScrapFly SDK using pip: pip install scrapfly-sdk, and the usage is similar to that of a regular HTTP client library:

from scrapfly import ScrapflyClient, ScrapeConfig

urls = [
    "http://httbin.org/html", 
    "http://httbin.org/html",
    "http://httbin.org/html",
    "http://httbin.org/html",
    "http://httbin.org/html",
]
with ScrapflyClient(key='<YOUR KEY>') as client:
    for url in urls:
        response = client.scrape(
            ScrapeConfig(url=url)
        )
        html = response.scrape_result['content']

As you can see, our code with ScrapFly looks almost the same, except we get rid of a lot of complexity such as faking our headers as we did in our httpx based scraper - ScrapFly does all this automatically!
We can even go further and enable a lot of optional features:

javascript rendering - use ScrapFly's automated browsers to render websites powered by javascript

This can be enabled by the render_js=True option:

from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://quotes.toscrape.com/js/page/2/" 
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            render_js=True,
            # ^^^^^^^ enabled 
        )
    )
    html = response.scrape_result['content']

smart proxies - use ScrapFly's 190M proxy pool to scrape hard to access websites

All ScrapFly requests go through a proxy, but we can further extend that by selecting different proxy types and proxy locations:

from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://quotes.toscrape.com/js/page/2/" 
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            # see https://scrapfly.io/dashboard/proxy for available proxy pools
            proxy_pool='public_mobile_pool',  # use mobile proxies
            country='US',  # use proxies located in the United States
        )
    )
    html = response.scrape_result['content']

anti scraping protection bypass - scrape anti-scraping service protected websites

This can be enabled by the asp=True option:

from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://quotes.toscrape.com/js/page/2/" 
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            # enable anti-scraping protection bypass
            asp=True
        )
    )
    html = response.scrape_result['content']

Scraping Frameworks: Scrapy

In this article we've covered hands-on web scraping, however when scaling to hundreds of thousands of requests, reinventing the wheel can be a suboptimal and painful experience. For this, it might be worth taking a look at web scraping frameworks like Scrapy, which is a convenient abstraction layer around everything we've learned today and more!

Web Scraping With Scrapy

For more on scrapy see our full introduction article which covers introduction, best practices, tips and tricks and an example project!

Web Scraping With Scrapy

Scrapy implements a lot of shortcuts and optimizations that otherwise would be difficult to implement by hand, such as request concurrency, retry logic and countless community extensions for handling various niche cases.

ScrapFly's python-sdk package integrates all of ScrapFly's powerful features into Scrapy's API:

# /spiders/scrapfly.py
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse


class ScrapFlySpider(ScrapflySpider):
    name = 'scrapfly'
    start_urls = [
        ScrapeConfig(url='https://www.example.com')
    ]

    def parse(self, response: ScrapflyScrapyResponse):
        yield ScrapflyScrapyRequest(
            scrape_config=ScrapeConfig(
                url=response.urljoin("/some-product-page"),  # placeholder link; in a real spider this would come from parsing the response
                # we can enable javascript rendering via browser automation
                render_js=True,
                # we can get around anti bot protection
                asp=True,
                # specific proxy country
                country='us',
                # change proxy type to mobile proxies
                proxy_pool="public_mobile_pool",
            ),
            callback=self.parse_report
        )
    
# settings.py
SCRAPFLY_API_KEY = 'YOUR API KEY'
CONCURRENT_REQUESTS = 2

FAQ

We've covered a lot in this article, but web scraping is such a vast subject that we just can't fit everything into a single piece. However, we can answer some frequently asked questions people have about web scraping in Python:

Is Python Good for Web Scraping?

Building a web scraper in Python is quite easy! Unsurprisingly, it's by far the most popular language used in web scraping.
Python is an easy yet powerful language with rich ecosystems in the data parsing and HTTP connection areas. Since web scraping scaling is mostly IO-based (waiting for connections to complete takes up most of the program's runtime), Python performs exceptionally well as it supports the asynchronous code paradigm natively! So, Python for web scraping is fast, accessible and has a huge community.

What is the best HTTP client library for Python?

Currently, the best option for web scraping in our opinion is the httpx library as it supports synchronous and asynchronous python as well as being easy to configure for avoiding web scraper blocking. Alternatively, the requests library is a good choice for beginners as it has the easiest API.

How to speed up python web scraping?

The easiest way to speed up web scraping in Python is to use an asynchronous HTTP client such as httpx and use asynchronous functions (coroutines) for all HTTP-connection-related code.

How to prevent python web scraping blocking?

One of the most common challenges when using Python to scrape a website is blocking. This happens because scrapers inherently behave differently compared to a web browser so they can be detected and blocked.
The goal is to ensure that HTTP connections from a Python web scraper look similar to those of a web browser like Chrome or Firefox. This involves all connection aspects: using http2 instead of http1.1, using the same headers as the web browser, treating cookies the same way a browser does, etc. For more, see our How to Scrape Without Getting Blocked Tutorial.
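Putting those two ideas together in httpx might look roughly like this (header values are borrowed from the earlier examples; HTTP2 needs the optional extra pip install "httpx[http2]"):

import httpx

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# speak HTTP/2 and send browser-like headers on every request
with httpx.Client(http2=True, headers=BROWSER_HEADERS) as client:
    response = client.get("https://www.example.com/")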

Why can't my scraper see the data my browser does?

When we're using HTTP clients like requests, httpx etc., we scrape only the raw page source, which often looks different from the page source in the browser. This is because the browser runs all the javascript present in the page, which can change it. Our Python scraper has no javascript capabilities, so we either need to reverse engineer the javascript code or control a web browser instance. See our Scraping Dynamic Websites Using Browser article for more.

What are the best tools used in web scraper development?

There are a lot of great tools out there, though when it comes to the best web scraping tools in Python, the most important one must be the web browser's developer tools. This suite of tools can be accessed in the majority of web browsers (in Chrome, Firefox and Safari via the F12 key or right click > "inspect element").
This toolset is vital for understanding how the website works. It allows us to inspect the HTML tree, test our xpath/css selectors as well as track network activity - all of which are brilliant tools for developing web scrapers.

We recommend getting familiar with these tools by reading official documentation page.

Summary

In this python web scraping tutorial we've covered the basics of everything you need to know to start web scraping in Python.

We've introduced ourselves to the HTTP protocol, which is the backbone of all internet connections. We explored GET and POST requests, and the importance of request headers.
Then, we've taken a look at HTML parsing: using CSS and XPATH selectors to parse data from raw HTML content.
Finally, we solidified this knowledge with an example project where we scraped product details from producthunt.com.

This web scraping tutorial should start you on the right path, but it's just the tip of the web scraping iceberg! Check out ScrapFly API for dealing with advanced web scraping challenges like scaling and blocking.

Related Posts

Crawling with Python

Introduction to web crawling with Python. What is web crawling? How does it differ from web scraping? And a deep dive into code, building our own crawler and an example project crawling Shopify-powered websites.

How to Scrape Zoominfo

Practical tutorial on how to web scrape public company and people data from Zoominfo.com using Python and how to avoid being blocked using ScrapFly API.

How to Scrape Google Maps

We'll take a look at how to find businesses through Google Maps' search system and how to scrape their details using either Selenium, Playwright or ScrapFly's javascript rendering feature - all of that in Python.

How to Scrape Angel.co

Tutorial for web scraping AngelList - angel.co - tech startup company and job directory using Python. Practical code and best practices.