In this Python web scraping tutorial we'll take a deep dive into what makes Python the number one language for web scraping, covering both the basics and best practices.
In this introduction we'll cover these major subjects:
HTTP protocol - what are HTTP requests and responses and how to use them to collect data from websites.
Data parsing - how to parse collected HTML and JSON files to extract structured data.
To wrap up, we'll solidify our knowledge with an example project by scraping job listing data from remotepython.com/jobs/ - a job listing board for remote Python jobs.
What is Web Scraping?
One of the biggest revolutions of the 21st century is the realization of how valuable data can be - and the internet is full of free public data!
Web scraping is an automated process to collect public web data. There are thousands of reasons why one might want to collect this public data, like finding potential employees or gathering competitive intelligence. We at ScrapFly did extensive research into web scraping applications, and you can find our findings here on our Web Scraping Use Cases page.
To scrape a website with Python we're generally dealing with two types of problems: collecting the public data available online and then parsing this data to extract structured information.
So, how to scrape data from a website using Python? In this article, we'll cover everything you need to know - let's dive in!
Setup
In this tutorial, we'll cover two popular web scraping libraries:
httpx - an HTTP client library, the backbone of any web scraper. Another popular alternative is the requests library, though we'll stick with httpx as it's much better suited for web scraping.
parsel - an HTML parsing library supporting CSS and XPath selectors, which we'll use to extract data from the retrieved pages.
Before we dive in deep let's take a quick look at a simple web scraper:
import httpx
from parsel import Selector

# Retrieve html page
response = httpx.get("https://www.remotepython.com/jobs/")
# check whether request was a success
assert response.status_code == 200
# parse HTML for specific information:
selector = Selector(text=response.text)
for job in selector.css('.box-list .item'):
    title = job.css('h3 a::text').get()
    relative_url = job.css('h3 a::attr(href)').get()
    print(title)
    print(response.url.join(relative_url))
    print('--------------------------')
Example Output
Back-End / Data / DevOps Engineer
https://www.remotepython.com/jobs/8173028f333140e1b6d74f70dc42a52a/
--------------------------
Lead Software Engineer (Python)
https://www.remotepython.com/jobs/a63708cb43df422dbe76938c843ed1fb/
--------------------------
Senior Back End Engineer
https://www.remotepython.com/jobs/de4dab9efc7b435b860cd3003a122c63/
--------------------------
Full Stack Python Developer - remote
https://www.remotepython.com/jobs/98c317bf6f8b4610a4476407cff32b2d/
--------------------------
Remote Python Developer
https://www.remotepython.com/jobs/dadf4aacff444043b601f6665b53889c/
--------------------------
Python Developer
https://www.remotepython.com/jobs/0f52fc0bb2a04a0db67238b63df6d5aa/
--------------------------
Senior Software Engineer
https://www.remotepython.com/jobs/e0e51ee44bb443e98dde0d9d8390a933/
--------------------------
Remote Senior Back End Developer (Python)
https://www.remotepython.com/jobs/a6bcd1b264134ef8b6715f2aa05da00f/
--------------------------
Full Stack Software Engineer
https://www.remotepython.com/jobs/dac24df8ef2a47e6ad41bf05343d74bd/
--------------------------
Remote Python & JavaScript Full Stack Developer
https://www.remotepython.com/jobs/f9d92f4a5743457d9f7fae31a3ebc057/
--------------------------
Sr. Back-End Developer
https://www.remotepython.com/jobs/3c70ed5dd269402f83a54f93e35add9c/
--------------------------
Backend Engineer
https://www.remotepython.com/jobs/ecca5fc4a9194387b19c3bcd491216df/
--------------------------
Miscellaneous tasks for existing Python website, Django CMS and Vue 2
https://www.remotepython.com/jobs/6edf140866784803a862574861cae487/
--------------------------
Senior Django Developer
https://www.remotepython.com/jobs/7b04bdee004a4dab9598cc4dfc0ae029/
--------------------------
Sr. Backend Python Engineer
https://www.remotepython.com/jobs/6b7920f8cd6943ad8fe45c634c3daed6/
--------------------------
This quick scraper will collect all job titles and URLs on the first page of our example target. Pretty easy! Let's take a deeper look at all of these details.
HTTP Fundamentals
To collect data from a public resource, we need to establish a connection with it first.
Most of the web is served over HTTP which is a rather simple data exchange protocol:
We (the client) send a request to the website (the server) for a specific document. The server processes the request and replies with a response that will either contain the web data or an error message. A very straightforward exchange!
So, we send a request object which is made up of 3 parts:
method - one of a few possible types.
headers - metadata about our request.
location - what document we want to retrieve.
In turn, we receive a response object which consists of:
status code - one of a few possibilities indicating success or failure.
headers - metadata about the response.
content - the page data, like HTML or JSON.
Let's take a quick look at each of these components, what they mean and how they are relevant in web scraping.
Requests and Responses
When it comes to web scraping we only need to understand some HTTP essentials. Let's take a quick look.
Request Methods
HTTP requests are conveniently divided into a few types (called methods) that perform distinct functions.
The most common types used in web scraping are:
GET - request a document.
POST - request a document by sending a document.
HEAD - request a document's meta information, such as when it was last updated.
In web scraping, we'll mostly be using GET-type requests as we want to retrieve the documents.
POST requests are also quite common when scraping interactive parts of the web pages like forms, search or paging.
HEAD requests are used for optimization - scrapers can request meta information and then decide whether downloading the whole page is worth it.
Other methods aren't used as often, but it's good to be aware of them (a quick httpx sketch of the common methods follows this list):
PATCH - update an existing document.
PUT - either create a new document or update it.
DELETE - delete a document.
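To make these methods more concrete, here's a minimal httpx sketch (using the httpbin.org testing service, which we'll rely on again later) showing the common methods in action:
import httpx

# GET - retrieve a document
response = httpx.get("https://httpbin.org/html")
print(response.status_code)

# HEAD - retrieve only the document's metadata (headers), without the body
response = httpx.head("https://httpbin.org/html")
print(response.headers.get("content-type"))

# POST - send data (e.g. a search form) and receive the resulting document
response = httpx.post("https://httpbin.org/post", data={"query": "python"})
print(response.status_code)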
Request Location
Request location is defined by a URL (Uniform Resource Locator), which is made up of a few key parts:
Here, we can visualize each part of a URL:
Protocol - when it comes to HTTP, it's either http or https.
Host - the address of the server that is either a domain name or an IP address.
Location - unique path where the resource is located.
If you're ever unsure of a URL's structure, you can always fire up python and let it figure it out for you:
from urllib.parse import urlparse

urlparse("http://www.domain.com/path/to/resource?arg1=true&arg2=false")
# which will print:
ParseResult(
    scheme='http',
    netloc='www.domain.com',
    path='/path/to/resource',
    params='',
    query='arg1=true&arg2=false',
    fragment=''
)
Request Headers
While it might appear like request headers are just minor metadata details, in web scraping they are extremely important.
Headers contain essential details about the request - who's requesting the data? What type of data is expected? Using wrong or incomplete headers might result in an error or even get the web scraper blocked.
Let's take a look at some of the most important headers and what they mean.
User-Agent
This is the client's identity header. It tells the server what type of client is making the request: is it a desktop web browser? or a phone app?
# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
Whenever you visit a web page in your web browser, it identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers".
This helps the server decide whether to serve or deny the client. In web scraping, of course, we don't want to be denied access, so we have to blend in by faking our user agent to look like that of a browser.
Cookie
Cookies are used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc. All of this cookie information is exchanged through the Cookie header.
Accept
Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content the client is expecting to receive.
Generally, when web scraping we want to mimic the values of one of the popular web browsers. For example, here are roughly the values a Chrome browser sends (exact values vary by browser version and settings):
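Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9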
X- Prefixed Headers
These are special custom headers that could mean anything. They are important to keep an eye on when web scraping, as they might configure important functionality of the website/webapp.
Response Status Code
Conveniently, all HTTP responses come with a status code that indicates whether the request was a success, a failure, or whether more details are needed (like a login or auth token).
Let's take a quick look at the status codes that are most relevant to web scraping (a short httpx sketch follows this list):
200 range codes generally mean success!
300 range codes tend to mean redirection - in other words, if we request /product1.html it might be moved to a new location like /products/1.html which 300 status responses would tell us about.
400 range codes mean the request is malformed or denied. Our web scraper could be missing some headers, cookies or authentication details.
500 range codes typically mean server issues. The website might be unavailable right now or is purposefully disabling access to our web scraper.
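Here's a rough sketch of how these status codes typically surface in httpx-based code (httpbin.org endpoints are used purely for demonstration; in older httpx versions the redirect flag is named allow_redirects):
import httpx

# a successful request returns a 200-range code
response = httpx.get("https://httpbin.org/html")
print(response.status_code)  # 200

# 300-range redirects can be followed automatically
response = httpx.get("https://httpbin.org/redirect/1", follow_redirects=True)
print(response.status_code, response.url)  # 200 after following the redirect

# 400/500-range codes can be turned into exceptions for easier handling
response = httpx.get("https://httpbin.org/status/403")
try:
    response.raise_for_status()
except httpx.HTTPStatusError as error:
    print(f"request failed: {error}")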
Response Headers
When it comes to web scraping, response headers provide some important information about connection functionality and efficiency, though we rarely need to work with them directly in basic web scraping.
The most notable response header in web scraping is the Set-Cookie header which asks our client to save some cookies for future requests. Cookies can be vital for website functionality so it's important to manage them when web scraping.
🧙♂️ popular HTTP clients like httpx.Client manage cookies automatically for us!
The X- prefixed headers are custom headers set by the website which can contain extra response details or secret tokens.
Finally, there are cache-related headers that are useful for scraper optimization (illustrated right after this list):
Etag header often indicates the content hash of the response, letting the scraper know whether the content has changed since the last scrape.
Last-Modified header tells when the page last changed its content.
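For example, here's a hedged sketch of how a scraper could skip re-downloading unchanged pages by sending these values back as the standard If-None-Match / If-Modified-Since request headers (httpbin.org's /cache endpoint simulates this behavior):
import httpx

first = httpx.get("https://httpbin.org/cache")
etag = first.headers.get("etag")
last_modified = first.headers.get("last-modified")

# send the cache validators back - an unchanged page answers 304 with no body
second = httpx.get(
    "https://httpbin.org/cache",
    headers={
        "If-None-Match": etag or "",
        "If-Modified-Since": last_modified or "",
    },
)
if second.status_code == 304:
    print("content unchanged since the last scrape - skip parsing")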
We've taken a brief overview of the core HTTP components, and now it's time to see how HTTP works in practical Python!
HTTP Clients in Python
Before we start exploring HTTP connections in python, we need to choose an HTTP client. Let's take a look at what is the best web scraping library in Python when it comes to handling HTTP connections.
Python comes with a built-in HTTP client called urllib, though it's not very well suited for web scraping. Fortunately, the community offers several great alternatives:
httpx (recommended) - most feature rich client, offering http2 support and asynchronous client.
requests - most popular client as it's one of the easiest to use.
aiohttp - very fast asynchronous client and server.
So, what makes a good HTTP client for web scraping?
The first thing to note is the HTTP version. There are 3 popular versions used on the web:
HTTP1.1 - the simplest, text-based protocol, used widely by simpler programs. Implemented by urllib, requests, httpx, aiohttp.
HTTP2 - a more complex and efficient binary-based protocol, mostly used by web browsers. Implemented by httpx.
HTTP3/QUIC - the newest and most efficient version of the protocol, mostly used by web browsers. Implemented by aioquic, httpx (planned).
When it comes to web scraping, HTTP1.1 is good enough for most cases, though HTTP2/3 are very helpful for avoiding web scraper blocking as most real web users browse over HTTP2+.
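For example, enabling HTTP2 in httpx is a one-line change - a minimal sketch (HTTP2 support requires the optional extra, installed via pip install "httpx[http2]"):
import httpx

# http2=True enables HTTP2 negotiation (requires the httpx[http2] extra)
client = httpx.Client(http2=True)
response = client.get("https://www.python.org/")
# the negotiated protocol depends on what the server supports:
print(response.http_version)  # e.g. "HTTP/2"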
We'll be sticking with httpx as it offers all the features required for web scraping. That being said, other HTTP clients like the requests library can be used almost interchangeably.
Exploring HTTP with httpx
Now that we have a basic understanding of HTTP let's see it in action!
In this section, we'll experiment with basic web scraping scenarios to further understand HTTP in practice. For our example case study, we'll be using the http://httpbin.org request testing service, which echoes back exactly what we send to it.
GET Requests
Let's start off with GET-type requests, which are the most common type of requests in web scraping.
To put it shortly, GET often simply means: give me the document located at URL.
For example, a GET https://www.httpbin.org/html request asks the httpbin.org server for the /html document.
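A minimal sketch of that request with httpx - printing the status code, body and headers - produces the output shown below:
import httpx

response = httpx.get("https://www.httpbin.org/html")
print(response.status_code)
print(response.text)
print(response.headers)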
200
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>Herman Melville - Moby-Dick</h1>
<div>
<p>
Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm. No murmur, no impatience, no petulance did come from him. Silent, slow, and solemn; bowing over still further his chronically broken back, he toiled away, as if toil were life itself, and the heavy beating of his hammer the heavy beating of his heart. And so it was.—Most miserable! A peculiar walk in this old man, a certain slight but painful appearing yawing in his gait, had at an early period of the voyage excited the curiosity of the mariners. And to the importunity of their persisted questionings he had finally given in; and so it came to pass that every one now knew the shameful story of his wretched fate. Belated, and not innocently, one bitter winter's midnight, on the road running between two country towns, the blacksmith half-stupidly felt the deadly numbness stealing over him, and sought refuge in a leaning, dilapidated barn. The issue was, the loss of the extremities of both feet. Out of this revelation, part by part, at last came out the four acts of the gladness, and the one long, and as yet uncatastrophied fifth act of the grief of his life's drama. He was an old man, who, at the age of nearly sixty, had postponedly encountered that thing in sorrow's technicals called ruin. He had been an artisan of famed excellence, and with plenty to do; owned a house and garden; embraced a youthful, daughter-like, loving wife, and three blithe, ruddy children; every Sunday went to a cheerful-looking church, planted in a grove. But one night, under cover of darkness, and further concealed in a most cunning disguisement, a desperate burglar slid into his happy home, and robbed them all of everything. And darker yet to tell, the blacksmith himself did ignorantly conduct this burglar into his family's heart. It was the Bottle Conjuror! Upon the opening of that fatal cork, forth flew the fiend, and shrivelled up his home. Now, for prudent, most wise, and economic reasons, the blacksmith's shop was in the basement of his dwelling, but with a separate entrance to it; so that always had the young and loving healthy wife listened with no unhappy nervousness, but with vigorous pleasure, to the stout ringing of her young-armed old husband's hammer; whose reverberations, muffled by passing through the floors and walls, came up to her, not unsweetly, in her nursery; and so, to stout Labor's iron lullaby, the blacksmith's infants were rocked to slumber. Oh, woe on woe! Oh, Death, why canst thou not sometimes be timely? 
Hadst thou taken this old blacksmith to thyself ere his full ruin came upon him, then had the young widow had a delicious grief, and her orphans a truly venerable, legendary sire to dream of in their after years; and all of them a care-killing competency.
</p>
</div>
</body>
</html>
Headers({'date': 'Thu, 24 Nov 2022 09:48:41 GMT', 'content-type': 'text/html; charset=utf-8', 'content-length': '3741', 'connection': 'keep-alive', 'server': 'gunicorn/19.9.0', 'access-control-allow-origin': '*', 'access-control-allow-credentials': 'true'})
Here, we perform a basic GET request, though real web scraper requests tend to be a bit more complex. Next, let's take a look at request headers.
Request Metadata - Headers
We've already done a theoretical overview of request headers and since they're so important in web scraping let's take a look at how we can use them with our HTTP client:
In this example, we're using the httpbin.org testing endpoint for headers, which returns the inputs we send (headers, body) back to us in the response body. If we run a plain request without providing any headers ourselves, we can see that the client generates some basic ones automatically - a minimal sketch (exact default values depend on your httpx version):
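import httpx

# request the /headers endpoint without setting any headers ourselves
response = httpx.get("http://httpbin.org/headers")
print(response.text)
# will print something like:
# {
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate, br",
#     "Host": "httpbin.org",
#     "User-Agent": "python-httpx/0.23.0"
#   }
# }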
Even though we didn't explicitly provide any headers in our request, httpx generated the required basics for us.
To add some custom headers we can use the headers argument:
import httpx

response = httpx.get(
    'http://httpbin.org/headers',
    headers={"User-Agent": "ScrapFly's Web Scraping Tutorial"},
)
print(response.text)
# will print:
{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Host": "httpbin.org",
    "User-Agent": "ScrapFly's Web Scraping Tutorial",
    # ^^^^^^^ - we changed this!
  }
}
As you can see above, we used a custom User-Agent header for this request, while other headers remain automatically generated by our client.
POST Requests
As we've discovered, GET-type requests just mean "get me that document". However, sometimes that might not be enough information for the server to serve the correct content - that's where POST-type requests come in.
POST-type requests essentially mean "take this document". Though, why would we want to give someone a document when web scraping?
Some website operations require a complex set of parameters to process the request. For example, to render a search result page the website might need dozens of different parameters like search query, page number and various filters. The only way to provide such a huge set of parameters is to send them as a document using POST requests.
Let's take a quick look at how we can use POST requests in httpx:
import httpx

response = httpx.post("http://httpbin.org/post", json={"question": "Why is 6 afraid of 7?"})
print(response.text)
# will print:
# {
#   ...
#   "data": "{\"question\": \"Why is 6 afraid of 7?\"}",
#   "headers": {
#     "Content-Type": "application/json",
#     ...
#   },
# }
As you can see, if we submit this request, the server will receive some JSON data and a Content-Type header indicating the type of this document (in this case, application/json). With this information, the server will do some thinking and return a document matching our request data.
Configuring Proxies
Proxy servers help to disguise the client's original address by routing the network through a middleman server.
Many websites don't tolerate web scrapers and can block them after a few requests. So, proxies can be used to distribute requests through several proxy identities - an easy way to avoid blocking. In addition, some websites are only available in certain regions; proxies can help access those too.
Httpx supports extensive proxy options for both HTTP and SOCKS5 type proxies:
import httpx

response = httpx.get(
    "http://httpbin.org/ip",
    # we can set a proxy for all requests
    proxies={"all://": "http://111.22.33.44:8500"},
    # or we can set proxies for specific domains only, e.g.:
    # proxies={"all://only-in-us.com": "http://us-proxy.com:8500"},
)
Managing Cookies
Cookies are used to help the server track the state of its clients. It enables persistent connection features such as login sessions or website preferences (currency, language etc.).
In web scraping, we can encounter websites that cannot function without cookies so we must replicate them in our HTTP client connection. In httpx we can use the cookies argument:
import httpx

# we can either use dict objects
cookies = {"login-session": "12345"}
# or the more advanced httpx.Cookies manager:
cookies = httpx.Cookies()
cookies.set("login-session", "12345", domain="httpbin.org")

response = httpx.get('https://httpbin.org/cookies', cookies=cookies)
# new cookies can also be set by the server:
print(response.cookies)
Tip: Automatic Cookie Tracking
Most HTTP clients can track cookies automatically through session objects. In httpx it's done through httpx.Client:
import httpx

session = httpx.Client()
# this mock request will ask server to set some cookies for us:
response1 = session.get('http://httpbin.org/cookies/set/mycookie/123')
print(response1.cookies)
# now we don't need to set cookies manually, session keeps track of them
response2 = session.get('http://httpbin.org/cookies')
# we can see the automatic cookies in the response.request object:
print(response2.request.headers['cookie'])
# 'mycookie=123'
Putting It All Together
Now that we've briefly introduced ourselves to HTTP clients in Python, let's apply everything we've learned.
In this section, we have a short challenge: we have multiple URLs whose HTML we want to retrieve. Let's see what sort of practical challenges we might encounter and how real web scraping programs function:
import httpx

# as discussed in the headers chapter, we should always stick to browser-like headers
# for our requests to prevent being blocked
headers = {
    # let's use Chrome browser on Windows:
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}
# here is a list of urls, in this example we'll just use some placeholders
urls = [
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
]
# since we have multiple urls to scrape we should establish a persistent session
session = httpx.Client(headers=headers)
for url in urls:
    response = session.get(url)
    html = response.text
    meta = response.headers
    print(html)
    print(meta)
The first thing we do is set some request headers to prevent being instantly blocked.
While httpbin.org doesn't block any requests, it's generally good practice to set at least the User-Agent and Accept headers when web scraping public targets.
What is httpx.Client?
We could skip it and call httpx.get() for each url instead:
for url in urls:
    response = httpx.get(url, headers=headers)
# vs
with httpx.Client(headers=headers) as session:
    for url in urls:
        response = session.get(url)
However, standalone calls like httpx.get() don't persist the underlying connection - every call basically starts a new independent connection, which is terribly inefficient.
To optimize this exchange we can establish a session. This is usually referred to as "Connection Pooling" or HTTP persistent connection.
In other words, a session will establish the connection only once and continue exchanging our requests over it until we close it. A session client not only makes the connection more efficient but also provides many convenient features like global header settings, automatic cookie management and so on.
Tip: Inspect Web Traffic
To fully understand how a website works for web scraping purposes, we can use the web browser's devtools suite.
The developer tools' network tab keeps track of every network request our browser makes. This can help to understand how to scrape the website, especially when working with POST-type requests.
Parsing HTML Content
HTML is a text data structure that powers the web. The great thing about HTML structure is that it's intended to be machine-readable text content. This is great news for web-scraping as we can parse data with code just as easily as we do it with our eyes!
HTML is a tree-type structure that lends itself easily to parsing. For example, let's take a simple HTML page along these lines (the exact markup here is illustrative):
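<html>
  <head>
    <title>My Website</title>
  </head>
  <body>
    <div class="content">
      <h1>First blog post</h1>
      <p>Just started this blog!</p>
    </div>
    <div>
      <a class="link" href="http://github.com/scrapfly">Github link</a>
      <a class="link" href="http://twitter.com/scrapfly_dev">Twitter link</a>
    </div>
  </body>
</html>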
Here we see a basic HTML document that a simple website might serve. You can already see the tree-like structure just by indentation of the text, but we can even go further and illustrate it:
This tree structure of HTML is brilliant for web-scraping as we can easily navigate the whole document with a set of simple instructions.
For example, to find the links in this HTML we can see that they are under the body->div->a node where class==link. These rules are usually expressed through two standard ways: CSS selectors and XPath - let's take a look at them.
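For instance, the link rule above could be written roughly as (illustrative selectors, both matching the same links):
# CSS selector:
div > a.link
# equivalent XPath selector:
//div/a[@class="link"]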
Using CSS and XPATH Selectors
There are two HTML parsing standards:
CSS selectors - simpler, briefer, less powerful
XPATH selectors - more complex, longer, very powerful
Generally, modern websites can be parsed with CSS selectors alone. However, sometimes the HTML structure can be so complex that having that extra XPath power makes things much easier. We'll be mixing both: we'll stick with CSS where we can and fall back to XPath otherwise.
Since Python has no built-in HTML parser, we must choose a library that provides such capability. In Python, there are several options, but the two biggest libraries are beautifulsoup (beautifulsoup4) and parsel.
We'll be using parsel HTML parsing package in this chapter, but since CSS and XPath selectors are de facto standard ways of parsing HTML we can easily apply the same knowledge to BeautifulSoup library as well as other HTML parsing libraries in other programming languages.
Let's see a quick example of how Parsel can be used in Python to parse HTML using CSS selectors and XPath:
# for this example we're using a simple website page
HTML = """
<head>
<title>My Website</title>
</head>
<body>
<div class="content">
<h1>First blog post</h1>
<p>Just started this blog!</p>
<a href="http://github.com/scrapfly">Checkout My Github</a>
<a href="http://twitter.com/scrapfly_dev">Checkout My Twitter</a>
</div>
</body>
"""
from parsel import Selector

# first we must build a parsable tree object from the HTML text string
tree = Selector(HTML)
# once we have the tree object we can start executing our selectors
# we can use css selectors:
github_link = tree.css('.content a::attr(href)').get()
# we can also use xpath selectors:
twitter_link = tree.xpath('//a[contains(@href,"twitter.com")]/@href').get()
title = tree.css('title').get()
article_text = ''.join(tree.css('.content ::text').getall()).strip()

print(title)
print(github_link)
print(twitter_link)
print(article_text)
# will print:
# <title>My Website</title>
# http://github.com/scrapfly
# http://twitter.com/scrapfly_dev
# First blog post
# Just started this blog!
# Checkout My Github
# Checkout My Twitter
In this example, we used the parsel package to create a parse tree from the HTML text. Then, we used CSS and XPath selector functions of this parse tree to extract the title, Github link, Twitter link and the article's text.
Tip: Use Browser's Devtools
When web scraping a specific target, we can use the web browser's developer tools suite to quickly visualize the website's HTML structure and build our CSS and XPath selectors.
Example Project
We've covered how to download HTML documents using httpx client and how to use CSS and XPath selectors to parse HTML data using Parsel. Now let's put all of this together in an example project!
For our real-world project, we'll be scraping remotepython.com/jobs/ which contains remote job listings for Python.
We'll be scraping all of the job listings present on the website, which involves retrieving each listing page and parsing the job data from it:
import httpx
import json
from parsel import Selector

# first we need to configure default headers to avoid being blocked
DEFAULT_HEADERS = {
    # let's use Chrome browser on Windows:
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}

# then we should create a persistent HTTP client:
client = httpx.Client(headers=DEFAULT_HEADERS)

# to start, let's scrape the first page
response_first = client.get("https://www.remotepython.com/jobs/")

# and create a function to parse job listings from a page - we'll use this for all pages
def parse_jobs(response: httpx.Response):
    selector = Selector(text=response.text)
    parsed = []
    # find all job boxes and iterate through them:
    for job in selector.css('.box-list .item'):
        # note that web pages use relative urls (e.g. /jobs/1234)
        # which we can convert to absolute urls (e.g. remotepython.com/jobs/1234)
        relative_url = job.css('h3 a::attr(href)').get()
        absolute_url = str(response.url.join(relative_url))  # str() so json.dumps can serialize it
        # the rest of the data can be parsed using CSS or XPath selectors:
        parsed.append({
            "url": absolute_url,
            "title": job.css('h3 a::text').get(),
            "company": job.css('h5 .color-black::text').get(),
            "location": job.css('h5 .color-white-mute::text').get(),
            "date": job.css('div>.color-white-mute::text').get('').split(': ')[-1],
            "short_description": job.xpath('.//h5/following-sibling::p[1]/text()').get("").strip(),
        })
    return parsed

results = parse_jobs(response_first)
# print results as pretty json:
print(json.dumps(results, indent=2))
Example Output
[
  {
    "url": "https://www.remotepython.com/jobs/8173028f333140e1b6d74f70dc42a52a/",
    "title": "Back-End / Data / DevOps Engineer ",
    "company": "Publisher Discovery",
    "location": "Bristol, UK, United Kingdom",
    "date": "Nov. 23, 2022",
    "short_description": "Publisher Discovery is hiring a remote Back-End & Data Engineer to help build, run and evolve the pipelines and platform that underpin our business insights technology.\r\n\r\nWe \u2026"
  },
  {
    "url": "https://www.remotepython.com/jobs/a63708cb43df422dbe76938c843ed1fb/",
    "title": "Lead Software Engineer (Python) ",
    "company": "Hashtrust Technologies",
    "location": "gurgaon, India",
    "date": "Nov. 23, 2022",
    "short_description": "Job Description:\r\n\r\nHashtrust Technologies is looking for a Lead Software Engineer (Python) with system architecture experience to work with our clients, design solutions, develop\u2026"
  },
  {
    "url": "https://www.remotepython.com/jobs/de4dab9efc7b435b860cd3003a122c63/",
    "title": "Senior Back End Engineer ",
    "company": "Cube Software",
    "location": "New York City, United States",
    "date": "Nov. 22, 2022",
    "short_description": "We're on a mission to help every company hit their numbers.\r\n\r\nThe world has evolved, but business planning has not. Most Finance teams still manage their planning and analysi\u2026"
  },
  ... etc
]
This short scraper collects the first page of results. Let's extend it further to collect the remaining pages:
import json
from parsel import Selector

# to scrape other pages we need to find their links and repeat the scrape process:
other_page_urls = Selector(text=response_first.text).css('.pagination a::attr(href)').getall()
for url in other_page_urls:
    # we need to turn relative urls (like ?page=2) into absolute urls (like http://remotepython.com/jobs?page=2)
    absolute_url = response_first.url.join(url)
    response = client.get(absolute_url)
    results.extend(parse_jobs(response))

print(json.dumps(results, indent=2))
Above, we extract the remaining page URLs and scrape them the same way we scraped the first page.
This wraps up our short example project though we leave you with an extra challenge - how to scrape detailed job listing data?
Common Scraping Challenges
Let's take a look at some popular web scraping challenges and what are the ways to address them.
Dynamic Content
Some websites rely heavily on javascript, which might appear difficult to scrape in Python. There are several ways to approach dynamic data scraping, such as reverse engineering the javascript behavior or controlling a real web browser instance (see the FAQ section below for more on this).
Scraping Speed
There's a lot of data online, and while scraping a few pages is easy, scaling that to thousands and millions of HTTP requests and documents can quickly introduce a lot of challenges, ranging from web scraper blocking to handling multiple concurrent connections.
For bigger scrapers we highly recommend taking advantage of Python's asynchronous ecosystem. Since HTTP connections involve a lot of waiting, async programming allows us to schedule and handle multiple connections concurrently. For example, in httpx we can manage both synchronous and asynchronous connections:
import httpx
import asyncio
from time import time

urls_20 = [f"http://httpbin.org/links/20/{i}" for i in range(20)]

def scrape_sync():
    _start = time()
    with httpx.Client() as session:
        for url in urls_20:
            session.get(url)
    return time() - _start

async def scrape_async():
    _start = time()
    async with httpx.AsyncClient() as session:
        await asyncio.gather(*[session.get(url) for url in urls_20])
    return time() - _start

if __name__ == "__main__":
    print(f"sync code finished in: {scrape_sync():.2f} seconds")
    print(f"async code finished in: {asyncio.run(scrape_async()):.2f} seconds")
Here, we have two functions that scrape 20 URLs: one synchronous and one taking advantage of asyncio's concurrency. If we run them, we can see a drastic speed difference, as the async version completes all 20 requests concurrently.
We at ScrapFly have years of experience with these issues and have worked hard to provide a one-size-fits-all solution via our ScrapFly API, where many of these challenges are solved automatically!
ScrapFly
Web scraping with Python is surprisingly accessible, though scaling up web scraping operations can be difficult - and this is where ScrapFly can lend a hand!
Let's take a look at how our example scraper would look in ScrapFly SDK.
We can install ScrapFly SDK using pip: pip install scrapfly-sdk and the usage is almost identical to our httpx and parsel example project:
import json
from urllib.parse import urljoin
from scrapfly import ScrapflyClient, ScrapeApiResponse, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")

first_page = scrapfly.scrape(
    ScrapeConfig(
        url="https://www.remotepython.com/jobs/",
        # we can set the proxy country to appear as if we're connecting from the US
        country="US",
        # for harder to scrape targets we can enable "anti-scraping protection bypass" if needed:
        # asp=True,
    )
)

def parse_jobs(result: ScrapeApiResponse):
    parsed = []
    # note: scrapfly results have parsel.Selector built-in already!
    for job in result.selector.css(".box-list .item"):
        parsed.append(
            {
                "url": urljoin(result.context["url"], job.css("h3 a::attr(href)").get()),
                "title": job.css("h3 a::text").get(),
                "company": job.css("h5 .color-black::text").get(),
                "location": job.css("h5 .color-white-mute::text").get(),
                "date": job.css("div>.color-white-mute::text").get("").split(": ")[-1],
                "short_description": job.xpath(".//h5/following-sibling::p[1]/text()").get("").strip(),
            }
        )
    return parsed

results = parse_jobs(first_page)
other_page_urls = first_page.selector.css(".pagination a::attr(href)").getall()
for url in other_page_urls:
    absolute_url = urljoin(first_page.context["url"], url)
    response = scrapfly.scrape(ScrapeConfig(url=absolute_url))
    results.extend(parse_jobs(response))

print(json.dumps(results, indent=2))
As you can see, our code with ScrapFly looks almost the same, except that we get rid of a lot of complexity such as faking our headers as we did in our httpx-based scraper - ScrapFly does all of this automatically!
We can even go further and enable a lot of optional features:
Javascript Rendering
ScrapFly can render javascript-powered pages through browser automation by enabling the render_js option:
from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://quotes.toscrape.com/js/page/2/"
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            render_js=True,
            # ^^^^^^^ enabled
        )
    )
    html = response.scrape_result['content']
Smart Proxies
All ScrapFly requests go through smart proxies but we can further extend that by selecting different proxy types and proxy locations:
from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://quotes.toscrape.com/js/page/2/"
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            # see https://scrapfly.io/dashboard/proxy for available proxy pools
            proxy_pool='public_mobile_pool',  # use mobile proxies
            country='US',  # use proxies located in the United States
        )
    )
    html = response.scrape_result['content']
In this article, we've covered hands-on web scraping with Python. However, when scaling to hundreds of thousands of requests, reinventing the wheel can be a suboptimal and difficult experience.
For big web scraping projects, it might be worth taking a look into web scraping frameworks like Scrapy which provides many helper functions and features for various topics we've covered today!
Scrapy implements a lot of shortcuts and optimizations that otherwise would be difficult to implement by hand, such as request concurrency, retry logic and countless community extensions for handling various niche cases.
ScrapFly's python-sdk package integrates all of ScrapFly's powerful features into Scrapy's API:
# /spiders/scrapfly.py
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse

class ScrapFlySpider(ScrapflySpider):
    name = 'scrapfly'
    start_urls = [
        ScrapeConfig(url='https://www.example.com')
    ]

    def parse(self, response: ScrapflyScrapyResponse):
        # for each page url found in the response (for illustration, every link):
        for url in response.css('a::attr(href)').getall():
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(
                    url=response.urljoin(url),
                    # we can enable javascript rendering via browser automation
                    render_js=True,
                    # we can get around anti bot protection
                    asp=True,
                    # specific proxy country
                    country='us',
                    # change proxy type to mobile proxies
                    proxy_pool="public_mobile_pool",
                ),
                callback=self.parse_report,
            )

# settings.py
SCRAPFLY_API_KEY = 'YOUR API KEY'
CONCURRENT_REQUESTS = 2
FAQ
We've covered a lot in this article but web scraping is such a vast subject that we just can't fit everything into a single article. However, we can answer some frequently asked questions people have about web scraping in Python:
Is Python Good for Web Scraping?
Building a web scraper in Python is quite easy! Unsurprisingly, it's by far the most popular language used in web scraping.
Python is an easy yet powerful language with rich ecosystems for data parsing and HTTP connections. Since web scraping at scale is mostly IO-bound (waiting for connections to complete takes up most of the program's runtime), Python performs exceptionally well as it supports the asynchronous code paradigm natively! So, Python for web scraping is fast, accessible and has a huge community.
What is the best HTTP client library for Python?
Currently, the best option for web scraping in our opinion is the httpx library as it supports synchronous and asynchronous python as well as being easy to configure for avoiding web scraper blocking. Alternatively, the requests library is a good choice for beginners as it has the easiest API.
How to speed up python web scraping?
The easiest way to speed up web scraping in Python is to use an asynchronous HTTP client such as httpx and use asynchronous functions (coroutines) for all HTTP-connection-related code.
How to prevent python web scraping blocking?
One of the most common challenges when using Python to scrape a website is blocking. This happens because scrapers inherently behave differently compared to a web browser, so they can be detected and blocked.
The goal is to ensure that HTTP connections from a Python web scraper look similar to those of a web browser like Chrome or Firefox. This involves all connection aspects: using http2 instead of http1.1, using the same headers as the web browser, treating cookies the same way a browser does etc. For more, see How to Scrape Without Getting Blocked? In-Depth Tutorial
Why can't my scraper see the data my browser does?
When we're using HTTP clients like requests, httpx etc. we scrape only the raw page source, which often looks different from the page source in the browser. This is because the browser runs all the javascript present in the page, which can change it. Our Python scraper has no javascript capabilities, so we either need to reverse engineer the javascript code or control a web browser instance. See our other tutorials for more.
What are the best tools used in web scraper development?
There are a lot of great tools out there, though when it comes to the best web scraping tools in Python, the most important one has to be the web browser's developer tools. This suite can be accessed in the majority of web browsers (Chrome, Firefox, Safari) via the F12 key or by right clicking and selecting "inspect element".
This toolset is vital for understanding how the website works. It allows us to inspect the HTML tree, test our xpath/css selectors and track network activity - all of which are brilliant tools for developing web scrapers.
Summary
In this python web scraping tutorial, we've covered everything you need to know to start web scraping in Python.
We've introduced ourselves to the HTTP protocol, which is the backbone of all internet connections. We explored GET and POST requests and the importance of request headers for avoiding blocking.
Then, we've taken a look at parsing HTML in Python: how to use CSS and XPath selectors to parse data from raw HTML content to legible datasets.
Finally, we solidified this knowledge with an example project where we scraped job listings displayed on remotepython.com. We used Chrome developer tools to inspect the structure of the website to build our CSS selectors and scraped each page of job results.
This web scraping tutorial should start you on the right path, but it's just the tip of the web scraping iceberg! For more see our other posts tagged with Python. In particular, we recommend getting familiar with the crawling process next:
Learn about the fundamentals of parsing data, across formats like JSON, XML, HTML, and PDFs. Learn how to use Python parsers and AI models for efficient data extraction.