Web Scraping with Python: An Introduction
Tutorial on how to scrape web page data using pure Python: making HTTP requests, parsing HTML and avoiding being blocked.
One of the biggest revolutions of the 21st century is the realization of how valuable data can be. The great news is that the internet is full of great, public data for you to take advantage of, and that's exactly the purpose of web scraping: collecting this public data to bootstrap a newly founded business or project.
In this practical introduction to web scraping in Python, we'll take a deep look at what exactly web scraping is, the technologies that power it, and some common challenges modern web scraping projects face.
For this, we'll explore the entire web scraping process in Python:
We'll start off by learning about HTTP and how to use HTTP clients in python to collect web page data. Then we'll take a look at parsing HTML page data using CSS and XPATH selectors. Finally, we'll build an example web scraper with Python for producthunt.com product data to solidify what we've learned.
Web scraping is essentially public data collection via an automated process. There are thousands of reasons why one might want to collect this public data, like finding potential employees or gathering competitive intelligence. We at ScrapFly did extensive research into web scraping applications, and you can find our findings here on our Web Scraping Use Cases page.
To scrape a website with Python we're generally dealing with two types of problems: collecting the public data available online and then parsing this data into structured information. In this article, we'll take a look at both of these steps and solidify the knowledge with an example project.
To collect data from a public resource, we need to establish a connection with it first. Most of the web is served over HTTP, which is rather simple: we (the client) send a request to the website (the server) for a specific document; once the server processes our request, it replies with the requested document - a very straightforward exchange!
illustration of a standard HTTP exchange
As you can see in this illustration: we send a request object which consists of a method (aka type), location and headers. In turn, we receive a response object which consists of a status code, headers and the document content itself.
Let's take a quick look at each of these components, what they mean and how they are relevant in web scraping.
When it comes to web scraping we don't exactly need to know every little detail about HTTP requests and responses, however it's good to have a general overview and to know which parts of this protocol are especially useful in web scraping. Let's take a look at exactly that!
HTTP requests are conveniently divided into a few types that perform distinct functions:
- GET requests are intended to request a document.
- POST requests are intended to request a document by sending a document.
- HEAD requests are intended to request the document's meta information.
- PATCH requests are intended to update a document.
- PUT requests are intended to either create a new document or update it.
- DELETE requests are intended to delete a document.
When it comes to web scraping, we are mostly interested in collecting documents, so we'll mostly be working with GET and POST type requests. To add, HEAD requests can be useful in web scraping to optimize bandwidth - sometimes, before downloading a document, we might want to check its metadata to see whether it's worth the effort.
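For example, here's a quick sketch (using the httpx client which we'll install later in this article) of how a HEAD request could be used to peek at a document's metadata before committing to a full download - the httpbin.org URL is just a test page:
import httpx

# ask only for the document's metadata - no body is transferred
response = httpx.head("https://www.httpbin.org/html")
print(response.status_code)                     # e.g. 200
print(response.headers.get("Content-Type"))     # e.g. text/html; charset=utf-8
print(response.headers.get("Content-Length"))   # document size in bytes, if the server provides it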
To understand resource location, first we should take a quick look at the structure of a URL itself:
example of an URL structure
Here, we can visualize each part of a URL: we have the protocol, which when it comes to HTTP is either http or https. Then, we have the host, which is essentially the address of the server - either a domain name or an IP address. Finally, we have the location of the resource and some custom parameters.
If you're ever unsure of a URL's structure, you can always fire up python and let it figure it out for you:
from urllib.parse import urlparse
urlparse("http://www.domain.com/path/to/resource?arg1=true&arg2=false")
> ParseResult(scheme='http', netloc='www.domain.com', path='/path/to/resource', params='', query='arg1=true&arg2=false', fragment='')
While it might appear like request headers are just minor metadata details, in web scraping they are extremely important! Headers contain essential details about the request, like: who's requesting the data? What type of data are they expecting? Getting these wrong might result in the web scraper being denied access.
Let's take a look at some of the most important headers and what they mean:
User-Agent is an identity header that tells the server who's requesting the document.
# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
Whenever you visit a web page in your web browser, it identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers". This helps the server determine whether to serve or deny the client. In web scraping, of course, we don't want to be denied access, so we have to blend in by faking our user agent to look like that of a browser.
There are many online databases that contain the latest user-agent strings of various platforms, like the one provided by whatismybrowser.com
Cookie is used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc. Cookies are a bit out of the scope of this article, but we'll be covering them in the future.
Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content we're expecting. Generally when web scraping we want to mimic the values of a popular web browser. For example, Chrome uses:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
For all standard values see the content negotiation header list by MDN
X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of a website/webapp.
These are a few of the most important headers; for more, see the extensive documentation over at https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers
Conveniently, all HTTP responses come with a status code that indicates whether the request is a success, a failure, or whether some alternative action is requested (like a request to authenticate or to follow a redirect).
Let's take a quick look at status codes that are most relevant to web scraping:
- 200 range codes generally mean success - we got the document we asked for.
- 300 range codes generally mean redirection - for example, if we request a document at /product1.html it might be moved to a new location like /products/1.html and the server would inform us about that.
- 400 range codes mean the request is malformed or denied - our web scraper might be missing some headers, cookies or authentication details.
- 500 range codes typically mean server issues - the website might be unavailable right now or is purposefully blocking our web scraper.
For all standard HTTP response codes see the HTTP status list by MDN
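To make this concrete, here's a small illustrative sketch (again using httpx, which we'll set up shortly) of how a scraper might branch on status codes - the specific reactions are just examples, not fixed rules:
import httpx

response = httpx.get("https://www.httpbin.org/status/403")
if response.status_code == 200:
    print("success - we can parse the document")
elif response.status_code in (301, 302):
    print("redirect - follow the Location header or enable redirect following")
elif response.status_code in (403, 429):
    print("we're likely being blocked or throttled - slow down or change identity")
else:
    print("something else went wrong:", response.status_code)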
When it comes to web scraping, response headers provide some important information for connection functionality and efficiency.
For example, the Set-Cookie header requests our client to save some cookies for future requests, which might be vital for website functionality. Other headers such as Etag and Last-Modified are intended to help the client with caching to optimize resource usage.
For all options see the standard HTTP response header list by MDN
Finally, just like with request headers, headers prefixed with X- are custom web functionality headers and depend on each individual website.
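As a quick illustration (the endpoint and header names here are just examples), httpx exposes response headers as a case-insensitive, dict-like object and parses Set-Cookie values for us:
import httpx

response = httpx.get("https://www.httpbin.org/html")
# headers are case-insensitive: "content-type" and "Content-Type" are the same key
print(response.headers["content-type"])   # e.g. text/html; charset=utf-8
print(response.headers.get("etag"))       # None if the server didn't send one
# any cookies set via Set-Cookie headers end up in response.cookies
print(dict(response.cookies))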
We've taken a brief overview of the core HTTP components, and now it's time to see how HTTP works in practice with Python!
Before we start exploring HTTP connections in Python, we need to choose an HTTP client. Python comes with an HTTP client built in called urllib; however, for web scraping we need something more feature-rich and easier to handle, so let's take a look at popular community libraries.
The first thing to note about HTTP is that it has 3 distinct versions:
- HTTP1.1 - the most common, plain-text based version that virtually every Python HTTP client supports.
- HTTP2 - a newer binary protocol with better performance; in Python it's supported by httpx.
- HTTP3/QUIC - the newest, UDP-based version; Python client support for it is still limited.
As you can see, Python has a very healthy HTTP client ecosystem. When it comes to web scraping, HTTP1.1 is good enough for most cases; however, HTTP2/3 are very helpful for avoiding web scraper blocking.
We'll be sticking with httpx as it offers all the features required for web scraping. That being said, other HTTP clients like the requests library can be used almost interchangeably.
Let's see how we can utilize HTTP connections for scraping in Python. First, let's set up our working environment. We'll need Python version 3.7 or later and the httpx library:
$ python --version
Python 3.7.4
$ pip install httpx
With httpx installed, we have everything we need to start connecting and receiving our documents. Let's give it a go!
Now that we have a basic understanding of HTTP and our working environment ready, let's see it in action!
In this section, we'll experiment with basic web-scraping scenarios to further understand HTTP in practice.
For our example case study, we'll be using the http://httpbin.org request testing service, which allows us to send requests and returns exactly what it receives.
Let's start off with GET type requests, which are the most common type of requests in web scraping.
To put it shortly, GET often simply means: give me the document located at this URL. So a GET https://www.httpbin.org/html request would be asking for the /html document from the httpbin.org server:
import httpx
response = httpx.get("https://www.producthunt.com/posts/evernote")
html = response.text
metadata = response.headers
print(html)
print(metadata)
Here, we perform the most basic GET request possible. However, just requesting the document often might not be enough. As we've explored before, requests are made of a request type, location, headers and optional content. So what are headers?
We've already done a theoretical overview of request headers and since they're so important in web scraping let's take a look at how we can use them with our HTTP client:
import httpx
response = httpx.get('http://httpbin.org/headers')
print(response.text)
In this example we're using httpbin.org's testing endpoint for headers; it returns the HTTP inputs (headers, body) we sent as the response body. If we run this code, we can see that the client generates some basic headers automatically:
{
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Host": "httpbin.org",
"User-Agent": "python-httpx/0.19.0",
}
}
Even though we didn't explicitly provide any headers in our request, httpx generated the required basics for us. By using the headers argument, we can specify custom headers ourselves:
import httpx
response = httpx.get('http://httpbin.org/headers', headers={"User-Agent": "ScrapFly's Web Scraping Tutorial"})
print(response.text)
# will print:
# {
# "headers": {
# "Accept": "*/*",
# "Accept-Encoding": "gzip, deflate, br",
# "Host": "httpbin.org",
# "User-Agent": "ScrapFly's Web Scraping Tutorial",
# ^^^^^^^ - we changed this!
# }
# }
As you can see above, we used a custom User-Agent header for this request, while other headers remain automatically generated by our client. We'll talk more about headers in the "Challenges" section below, but for now most minor web scraping can work well with the headers httpx generates for us.
As we've discovered, GET type requests just mean "get me that document"; however, sometimes that might not be enough information for the server to serve the correct content.
POST requests, on the other hand, are the opposite: "take this document". Why would we want to give someone a document when web scraping? Some website operations require a complex set of parameters to process the request. For example, to render a search results page the website needs the query parameters of what to search for. So, as web scrapers, we would send a document containing search parameters and in return we'd get a document containing search results.
Let's take a quick look at how we can use POST requests in httpx:
import httpx
response = httpx.post("http://httpbin.org/post", json={"question": "Why is 6 afraid of 7?"})
print(response.text)
# will print:
# {
# ...
# "data": "{\"question\": \"Why is 6 afraid of 7?\"}",
# "headers": {
# "Content-Type": "application/json",
# ...
# },
# }
As you can see, if we submit this request, the server will receive some JSON data and a Content-Type header indicating the type of this document (application/json). With this information, the server will do some thinking and return us a document in exchange. In this imaginary scenario, we submit a document with question data, and the server would return us the answer.
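Note that JSON isn't the only kind of document we can send - many search and login forms expect classic form-encoded data instead. As a rough sketch, in httpx that's the data argument, which sets the Content-Type header to application/x-www-form-urlencoded (the field names below are made up):
import httpx

# send form-encoded data instead of JSON
response = httpx.post("http://httpbin.org/post", data={"query": "web scraping", "page": "1"})
print(response.text)
# httpbin echoes our values back under the "form" key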
Making thousands of connections from a single address is an easy way to be identified as a web scraper, which might result in being blocked. To add, some websites are only available in certain regions of the world. This means we are at a great advantage if we can mask the origin of our connections by using a proxy.
Httpx supports extensive proxy options for both HTTP and SOCKS5 type proxies:
import httpx
response = httpx.get(
    "http://httpbin.org/ip",
    proxies={
        # we can set a proxy for all requests
        "all://": "http://111.22.33.44:8500",
        # or we can set a proxy for specific domains only
        "all://only-in-us.com": "http://us-proxy.com:8500",
    },
)
For more on proxies in web scraping see our full introduction tutorial which explains different proxy types and how to correctly manage them in web scraping projects.
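To give a rough idea of how this looks in practice, here is a minimal proxy rotation sketch - the proxy addresses are placeholders, and real projects usually need smarter selection than a random choice:
import random
import httpx

# a hypothetical pool of proxy addresses - replace with your own working proxies
PROXY_POOL = [
    "http://111.22.33.44:8500",
    "http://111.22.33.45:8500",
    "http://111.22.33.46:8500",
]

def get_via_random_proxy(url):
    # pick a different proxy for every request to spread our traffic around
    proxy = random.choice(PROXY_POOL)
    with httpx.Client(proxies={"all://": proxy}) as client:
        return client.get(url)

response = get_via_random_proxy("http://httpbin.org/ip")
print(response.text)  # httpbin echoes back the IP it saw - i.e. the proxy's IP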
Cookies are used to help the server track its clients. They enable persistent connection details such as login sessions or website preferences.
In web scraping, we can encounter websites that cannot function without cookies, so we must replicate them in our HTTP client connection. In httpx we can use the cookies argument:
import httpx
# we can either use dict objects
cookies = {"login-session": "12345"}
# or more advanced httpx.Cookies manager:
cookies = httpx.Cookies()
cookies.set("login-session", "12345", domain="httpbin.org")
response = httpx.get('https://httpbin.org/cookies', cookies=cookies)
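As a small illustrative example, a client session (covered in the next section) stores whatever cookies the server sets and re-sends them automatically - the login-session cookie below is made up:
import httpx

with httpx.Client() as session:
    # this httpbin endpoint replies with a Set-Cookie header, which the session stores
    session.get("http://httpbin.org/cookies/set/login-session/12345")
    # the stored cookie is now attached to further requests automatically
    response = session.get("http://httpbin.org/cookies")
    print(response.text)  # should show {"cookies": {"login-session": "12345"}}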
Now that we have briefly introduced ourselves to HTTP clients in Python, let's apply it in practice and scrape some stuff!
In this section, we have a short challenge: we have multiple URLs that we want to retrieve the HTML contents of. Let's see what sort of practical challenges we might encounter and how real web scraping programs function.
import httpx
# here is a list of urls, in this example we'll just use some placeholders
urls = [
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
    "http://httpbin.org/html",
]
# as discussed in headers chapter we should always stick to browser-like headers for our
# requests to prevent being blocked
headers = {
# lets use Chrome browser on Windows:
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}
# since we have multiple urls we want to scrape we should establish a persistent session
with httpx.Client(headers=headers) as session:
for url in urls:
response = session.get(url)
html = response.text
meta = response.headers
print(html)
print(meta)
As you can see, there's quite a bit going on here. Let's unpack the most important bits in greater detail:
Why are we using custom headers?
As we've discussed in the headers chapter, we must mask our web scraper to appear as a web browser to prevent being blocked. While httpbin.org doesn't block any requests, it's generally a good practice to set at least the User-Agent and Accept headers when web scraping public targets.
What is httpx.Client?
We could skip it and call httpx.get() for each url instead:
for url in urls:
response = httpx.get(url, headers=headers)
# vs
with httpx.Client(headers=headers) as session:
response = session.get(url)
However, as we've covered earlier, HTTP is not a persistent protocol - meaning every time we call httpx.get() we would connect with the server anew and only then exchange our request/response objects. To optimize this exchange we can establish a session, which is usually referred to as Connection Pooling or HTTP persistent connection.
In other words, this session will establish the connection only once and continue exchanging our requests until we close it. Using sessions not only optimizes our code, but also provides some convenient shortcuts like setting global headers and managing cookies and redirects automatically.
We've got a good grip on HTTP so now, let's take a look at the second part of web scraping: parsing!
HTML (HyperText Markup Language) is a text data structure that powers the web. The great thing about HTML structure is that it's intended to be machine-readable text content, which is great news for web-scraping as we can easily parse the data with code!
HTML is a tree type structure that lends itself easily to parsing. For example, let's take this simple HTML content:
<head>
<title>My Website</title>
</head>
<body>
<h1>Welcome to my website!</h1>
<div class="content">
<p>This is my website</p>
<p>Isn't it great?</p>
</div>
</body>
Here we see a basic HTML document that a simple website might serve. You can already see the tree like structure just by indentation of the text, but we can even go further and illustrate it:
Example of a HTML node tree. Note that branches are ordered left-to-right and each element can contain extra properties.
This tree structure is brilliant for web-scraping as we can easily navigate the whole document.
For example, to find the title of the website, we can see that it's under the <title> node, which itself is under the <head> node. In other words - if we wanted to extract the titles of 1000 different pages, we would write a rule to find head->title->text for every one of them.
When it comes to HTML parsing using path instructions, there are two standard ways to approach this: CSS selectors and XPATH selectors - let's take a look at them.
There are two HTML parsing standards:
- CSS selectors - simpler and briefer, but less powerful.
- XPATH selectors - more verbose, but more powerful.
Generally, modern websites can be parsed with CSS selectors alone; however, sometimes the HTML structure can be so complex that having that extra XPATH power makes things much easier. We'll be mixing both: we'll stick with CSS where we can and otherwise fall back to XPATH.
For more on CSS selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms
For more on XPATH selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms
Since Python's standard library has no HTML parser that supports these selectors, we must choose a library which provides such capability. In Python, there are several options, but the two biggest libraries are beautifulsoup (beautifulsoup4) and parsel.
We'll be using the parsel HTML parsing package in this chapter, but since CSS and XPATH selectors are the de facto standard ways of parsing HTML, we can easily apply the same knowledge to the beautifulsoup library as well as HTML parsing libraries in other programming languages.
Let's install the parsel Python library and do a quick introduction:
$ pip install parsel
$ pip show parsel
Name: parsel
Version: 0.6.0
...
For more on parsel see the official documentation
Now with our package installed, let's give it a spin with this imaginary HTML content:
# for this example we're using a simple website page
HTML = """
<head>
<title>My Website</title>
</head>
<body>
<div class="content">
<h1>First blog post</h1>
<p>Just started this blog!</p>
<a href="http://github.com/scrapfly">Checkout My Github</a>
<a href="http://twitter.com/scrapfly_dev">Checkout My Twitter</a>
</div>
</body>
"""
from parsel import Selector
# first we must build parsable tree object from HTML text string
tree = Selector(HTML)
# once we have the tree object we can start executing our selectors
# we can use css selectors:
github_link = tree.css('.content a::attr(href)').get()
# we can also use xpath selectors:
twitter_link = tree.xpath('//a[contains(@href,"twitter.com")]/@href').get()
title = tree.css('title').get()
article_text = ''.join(tree.css('.content ::text').getall()).strip()
print(title)
print(github_link)
print(twitter_link)
print(article_text)
# will print:
# <title>My Website</title>
# http://github.com/scrapfly
# http://twitter.com/scrapfly_dev
# First blog post
# Just started this blog!
# Checkout My Github
# Checkout My Twitter
In this example we used the parsel package to create a parse tree from existing HTML text. Then, we used the CSS and XPATH selector functions of this parse tree to extract the title, Github link, Twitter link and the article's text.
In the previous section we covered how to download HTML documents using the httpx client, and in this section we've figured out how to use CSS and XPATH selectors to parse HTML data using the parsel package. Now let's put all of this together and write a small scraper!
In this section we'll be scraping https://www.producthunt.com/ which is essentially a technical product directory where people submit and discuss new digital products.
Let's start with the scraper's source code:
import httpx
import json
from parsel import Selector
DEFAULT_HEADERS = {
# lets use Chrome browser on Windows:
"User-Agent": "Mozilla/4.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=-1.9,image/webp,image/apng,*/*;q=0.8",
}
def parse_product(response):
tree = Selector(response.text)
return {
"url": str(response.url),
        'name': tree.css('h1 ::text').get(),
        'subtitle': tree.css('h2 ::text').get(),
# votes are located under <span> which contains bigButtonCount in class names
'votes': tree.css("span[class*='bigButtonCount']::text").get(),
# tags is our most complex location
# tag links are under div which contains topicPriceWrap class
# and tag links are only valid if they have /topic/ in them
'tags': tree.xpath(
"//div[contains(@class,'topicPriceWrap')]"
"//a[contains(@href, '/topics/')]/text()"
).getall(),
}
def scrape_products(urls):
results = []
with httpx.Client(headers=DEFAULT_HEADERS) as session:
for url in urls:
            response = session.get(url)
results.append(parse_product(response))
return results
if __name__ == '__main__':
results = scrape_products([
"https://www.producthunt.com/posts/notion-8",
"https://www.producthunt.com/posts/obsidian-4",
"https://www.producthunt.com/posts/evernote",
])
print(json.dumps(results, indent=2))
In this little scraper we provide a list of producthunt.com product urls and have our scraper collect and parse basic product data from each one of them. Running it should produce output similar to this:
[
{
"url": "https://www.producthunt.com/posts/notion-8",
"name": "Notion",
"subtitle": "Artificial intelligence-powered email.",
"tags": [
"Android",
"iPhone",
"Email"
],
"votes": "0,650"
},
{
"url": "https://www.producthunt.com/posts/obsidian-4",
"name": "Obsidian",
"subtitle": "A powerful knowledge base that works on local Markdown files",
"tags": [
"Productivity",
"Note"
],
"votes": "0,706"
},
{
"url": "https://www.producthunt.com/posts/evernote",
"name": "Evernote",
"subtitle": "Note taking made easy",
"tags": [
"Android",
"iPhone",
"iPad"
],
"votes": "299"
}
]
Thanks to Python's rich ecosystem, we've accomplished this single page scraper in under 40 lines of code - awesome!
Further, let's modify our script, so it finds product urls by itself by scraping producthunt.com topics. For example, /topics/productivity contains a list of products that are intended to boost digital productivity:
from urllib.parse import urljoin
def parse_topic(response):
tree = Selector(text=response.text)
# get product relative urls:
urls = tree.xpath("//li[contains(@class,'styles_item')]/a/@href").getall()
# turn relative urls to absolute urls and return them
    return [urljoin(str(response.url), url) for url in urls]
def scrape_topic(topic):
with httpx.Client(headers=DEFAULT_HEADERS) as session:
response = session.get(f"https://www.producthunt.com/topics/{topic}")
return parse_topic(response)
if __name__ == '__main__':
urls = scrape_topic("productivity")
results = scrape_products(urls)
print(json.dumps(results, indent=2))
Now we have a full scraping loop: we retrieve product urls from a directory page and then scrape each of them individually!
We could further improve this scraper with paging support, as now we're only scraping the first page of topics, and implement error and failure handling as well as some tests. That being said, this is a good entry point into the web scraping world as we tried out many things we've covered in this article, like header faking, using client sessions and parsing HTML with CSS/XPATH selectors.
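For instance, a basic retry helper could look something like this sketch - the retry count, backoff and status handling are arbitrary illustrative choices, not producthunt-specific values - and it could then replace the plain session.get(url) calls in scrape_products:
import time
import httpx

def get_with_retries(session: httpx.Client, url: str, retries: int = 3) -> httpx.Response:
    # naive retry helper: retry on connection errors and non-200 responses
    for attempt in range(retries):
        try:
            response = session.get(url)
            if response.status_code == 200:
                return response
        except httpx.TransportError:
            pass  # connection failed - fall through and retry
        time.sleep(attempt + 1)  # back off a bit more with every failed attempt
    raise RuntimeError(f"failed to scrape {url} in {retries} attempts")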
When it comes to web scraping challenges, we can put them into a few distinct categories:
In this article we used HTTP clients to retrieve data; however, our Python environment is not a web browser and it can't execute the complex javascript-powered behavior some websites use. The most common example of this is dynamic data loading, where the page URL doesn't change but clicking a button changes some data on the page. To scrape this we either need to reverse engineer the website's javascript behavior or use browser automation with headless browsers.
For browser usage in web scraping see our full introduction article which covers the most popular tools Selenium, Puppeteer and Playwright
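To give a flavor of what browser automation looks like, here is a minimal sketch using Playwright, one of the tools covered in that article (it assumes pip install playwright and playwright install chromium have been run; the demo URL is a public test page):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # this demo page renders its content with javascript, so plain HTTP clients see no data
    page.goto("https://quotes.toscrape.com/js/")
    html = page.content()  # the full HTML after javascript has executed
    browser.close()
print(len(html))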
Unfortunately, not every website tolerates web scraping, and many block scrapers outright. To avoid this, we need to ensure that our web scraper looks and behaves like a real web browser user. We've taken a look at using web browser headers to accomplish this, but there's much more to it.
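One easy win, for example, is switching to HTTP2, since most real browsers use it while many HTTP libraries default to HTTP1.1. In httpx this is a single flag (it requires the optional extra: pip install "httpx[http2]"); whether it actually helps depends entirely on the target website:
import httpx

client = httpx.Client(
    http2=True,  # negotiate HTTP2 with servers that support it
    headers={
        # pair it with browser-like headers as covered earlier
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    },
)
response = client.get("https://www.producthunt.com/")
print(response.http_version)  # prints "HTTP/2" when the server supports it
client.close()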
Even though HTML content is machine parsable, many website developers don't create it with that intention. So, we might encounter HTML files that are really difficult to digest. XPATH and CSS selectors are really powerful, and combined with regular expressions or natural language parsing we can confidently extract any data an HTML document could present. If you're stuck with parsing, we highly recommend the #xpath and #css-selectors tags on stackoverflow.
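As a rough sketch of that regular-expression fallback: many pages embed their data as JSON inside a <script> tag, and when selectors get unwieldy we can pull it out directly - the window.__DATA__ variable name here is entirely made up for illustration:
import re
import json

# imagine HTML where the page data is embedded as a javascript variable
html = '<script>window.__DATA__ = {"product": {"name": "Evernote", "votes": 299}};</script>'

# grab everything between the made-up marker and the closing semicolon
match = re.search(r"window\.__DATA__ = (\{.*?\});", html)
if match:
    data = json.loads(match.group(1))
    print(data["product"]["name"])  # Evernote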
There's a lot of data online, and while scraping a few documents is easy, scaling that to thousands and millions of HTTP requests and documents can quickly introduce a lot of challenges, ranging from web scraper blocking to handling multiple concurrent connections.
For bigger scrapers we highly recommend taking advantage of Python's asynchronous ecosystem. Since HTTP connections involve a lot of waiting, async programming allows us to schedule and handle multiple connections concurrently. For example, in httpx we can manage both synchronous and asynchronous connections:
import httpx
import asyncio
from time import time
urls_20 = [f"http://httpbin.org/links/20/{i}" for i in range(20)]
def scrape_sync():
_start = time()
with httpx.Client() as session:
for url in urls_20:
session.get(url)
return time() - _start
async def scrape_async():
_start = time()
async with httpx.AsyncClient() as session:
await asyncio.gather(*[session.get(url) for url in urls_20])
return time() - _start
if __name__ == "__main__":
print(f"sync code finished in: {scrape_sync():.2f} seconds")
print(f"async code finished in: {asyncio.run(scrape_async()):.2f} seconds")
Here, we have two functions that scrape 20 urls: one synchronously and one taking advantage of asyncio's concurrency. If we run them we can see a drastic speed difference:
sync code finished in: 7.58 seconds
async code finished in: 0.89 seconds
Fortunately, the web scraping community is pretty big and can often help solve these issues, our favorite resources are:
We at ScrapFly have years of experience with these issues and worked hard to provide a one-size-fits-all solution via our ScrapFly API, where many of these challenges are solved automatically!
Here at ScrapFly we recognize the difficulties of web scraping and came up with an API solution that solves these issues for our users.
ScrapFly is essentially an intelligent middleware that sits between your scraper and your target. Instead of connecting to the target itself, your scraper asks the ScrapFly API to do it and return the end results.
This abstraction layer can greatly increase performance and reduce the complexity of many web-scrapers by offloading common web scraping issues away from the scraper code!
Let's take a look at how our example scraper would look in ScrapFly SDK. We can install ScrapFly SDK using pip: pip install scrapfly-sdk
and the usage is similar to that of a regular HTTP client library:
from scrapfly import ScrapflyClient, ScrapeConfig
urls = [
"http://httbin.org/html",
"http://httbin.org/html",
"http://httbin.org/html",
"http://httbin.org/html",
"http://httbin.org/html",
]
with ScrapflyClient(key='<YOUR KEY>') as client:
for url in urls:
response = client.scrape(
ScrapeConfig(url=url)
)
html = response.scrape_result['content']
As you can see, our code with ScrapFly looks almost the same, except we get rid of a lot of complexity such as faking our headers as we did in our httpx-based scraper - ScrapFly does all of this automatically!
We can even go further and enable a lot of optional features (click to expand for details):
javascript rendering - use ScrapFly's automated browsers to render websites powered by javascript
This can be enabled by the render_js=True option:
from scrapfly import ScrapflyClient, ScrapeConfig
url = "https://quotes.toscrape.com/js/page/2/"
with ScrapflyClient(key='<YOUR KEY>') as client:
response = client.scrape(
ScrapeConfig(
            url=url,
render_js=True
# ^^^^^^^ enabled
)
)
html = response.scrape_result['content']
smart proxies - use ScrapFly's 190M proxy pool to scrape hard to access websites
All ScrapFly requests go through proxy but we can further extend that by selecting different proxy types and proxy locations:
from scrapfly import ScrapflyClient, ScrapeConfig
url = "https://quotes.toscrape.com/js/page/2/"
with ScrapflyClient(key='<YOUR KEY>') as client:
response = client.scrape(
ScrapeConfig(
            url=url,
# see https://scrapfly.io/dashboard/proxy for available proxy pools
proxy_pool='public_mobile_pool', # use mobile proxies
country='US', # use proxies located in the United States
)
)
html = response.scrape_result['content']
anti scraping protection bypass - scrape anti-scraping service protected websites
This can be enabled by the asp=True option:
from scrapfly import ScrapflyClient, ScrapeConfig
url = "https://quotes.toscrape.com/js/page/2/"
with ScrapflyClient(key='<YOUR KEY>') as client:
response = client.scrape(
ScrapeConfig(
            url=url,
# enable anti-scraping protection bypass
asp=True
)
)
html = response.scrape_result['content']
In this article we've covered hands-on web scraping; however, when scaling to hundreds of thousands of requests, reinventing the wheel can be a suboptimal and painful experience. For this it might be worth taking a look into web scraping frameworks like Scrapy, which is a convenient abstraction layer around everything we've learned today and more!
For more on scrapy see our full introduction article which covers introduction, best practices, tips and tricks and an example project!
Scrapy implements a lot of shortcuts and optimizations that otherwise would be difficult to implement by hand, such as request concurrency, retry logic and countless community extensions for handling various niche cases.
ScrapFly's python-sdk package implements all of ScrapFly's powerful features into Scrapy's API:
# /spiders/scrapfly.py
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyMiddleware, ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse
class ScrapFlySpider(ScrapflySpider):
name = 'scrapfly'
start_urls = [
ScrapeConfig(url='https://www.example.com')
]
    def parse(self, response: ScrapflyScrapyResponse):
        # follow links found on the page (placeholder selector for illustration)
        for url in response.css("a::attr(href)").getall():
            yield ScrapflyScrapyRequest(
                scrape_config=ScrapeConfig(
                    url=response.urljoin(url),
                    # we can enable javascript rendering via browser automation
                    render_js=True,
                    # we can get around anti bot protection
                    asp=True,
                    # specific proxy country
                    country='us',
                    # change proxy type to mobile proxies
                    proxy_pool="public_mobile_pool",
                ),
                callback=self.parse_report,
            )
# settings.py
SCRAPFLY_API_KEY = 'YOUR API KEY'
CONCURRENT_REQUESTS = 2
We've covered a lot in this article but web scraping is such a vast subject that we just can't fit everything into a single article. However, we can answer some frequently asked questions people have about web scraping in Python:
Building a web scraper in Python is quite easy! Unsurprisingly, it's by far the most popular language used in web scraping.
Python is an easy yet powerful language with rich ecosystems in data parsing and HTTP connection areas. Since web scraping scaling is mostly IO based (waiting for connections to complete takes the most of the program's runtime), Python performs exceptionally well as it supports asynchronous code paradigm natively! So, Python for web scraping is fast, accessible and has a huge community.
Currently, the best option for web scraping in our opinion is the httpx library as it supports synchronous and asynchronous python as well as being easy to configure for avoiding web scraper blocking. Alternatively, the requests library is a good choice for beginners as it has the easiest API.
The easiest way to speed up web scraping in Python is to use an asynchronous HTTP client such as httpx and use asynchronous functions (coroutines) for all HTTP connection related code.
One of the most common challenges when using Python to scrape a website is blocking. This happens because scrapers inherently behave differently compared to a web browser so they can be detected and blocked.
The goal is to ensure that HTTP connections from a Python web scraper look similar to those of a web browser like Chrome or Firefox. This involves all connection aspects: using http2 instead of http1.1, using the same headers as the web browser, treating cookies the same way a browser does, etc. For more see our How to Scrape Without Getting Blocked Tutorial
When we're using HTTP clients like requests, httpx etc., we scrape only the raw page source, which often looks different from the page source in the browser. This is because the browser runs all the javascript that is present in the page, which can change it. Our Python scraper has no javascript capabilities, so we either need to reverse engineer the javascript code or control a web browser instance. See the browser automation introduction mentioned above for more.
There are a lot of great tools out there, though when it comes to the best web scraping tools in Python, the most important one must be the web browser's developer tools. This suite of tools can be accessed in the majority of web browsers (Chrome, Firefox, Safari) via the F12 key or right click > "inspect element".
This toolset is vital for understanding how the website works. It allows us to inspect the HTML tree, test our xpath/css selectors as well as track network activity - all of which are brilliant tools for developing web scrapers.
We recommend getting familiar with these tools by reading the official documentation page.
In this python web scraping tutorial we've covered the basics of everything you need to know to start web scraping in Python.
We've introduced ourselves with the HTTP protocol which is the backbone of all internet connections. We explored GET and POST requests, and the importance of request headers.
Then, we've taken a look at HTML parsing: using CSS and XPATH selectors to parse data from raw HTML content.
Finally, we solidified this knowledge with an example project where we scraped product details from producthunt.com.
This web scraping tutorial should start you on the right path, but it's just the tip of the web scraping iceberg! Check out ScrapFly API for dealing with advanced web scraping challenges like scaling and blocking.