Asynchronous web scraping is a programming technique for running multiple scrape tasks concurrently, so that they effectively run in parallel.
Asynchronous programming is especially important in web scraping because scraping programs spend a lot of time waiting: every time a web scraper requests a web page, it has to wait for the response. This waiting time adds up quickly, especially when scraping a large number of web pages.
For example, let's take a look at this synchronous scraping example in Python:
import httpx
from time import time
_start = time()
pages = [
    "https://httpbin.dev/delay/2",
    "https://httpbin.dev/delay/2",
    "https://httpbin.dev/delay/2",
    "https://httpbin.dev/delay/2",
    "https://httpbin.dev/delay/2",
]
for page in pages:
    httpx.get(page)
print(f"finished scraping {len(pages)} pages in {time() - _start:.2f} seconds")
"finished scraping 5 pages in 15.46 seconds"
Here we have a list of 5 web pages, each of which takes 2 seconds to respond. If we run this code, we'll see that it completes in ~15 seconds every time: the five 2-second delays add up, and every request also pays its own connection overhead.
This is because our code waits for each response to arrive in full before sending the next request, even though the program itself does nothing during that time but wait for the server to respond.
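The same effect can be reproduced without any network access. The sketch below is a minimal illustration, not part of the original example: it assumes a hypothetical `fake_fetch` function that stands in for an HTTP request by sleeping for a fixed `DELAY`. Run one after another, the waits simply add up.

```python
import time

# Hypothetical per-request wait in seconds; stands in for the server's response time.
DELAY = 0.2


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request: block for DELAY seconds."""
    time.sleep(DELAY)
    return f"response from {url}"


def scrape_sequentially(urls):
    """Fetch each URL one after another; return the results and the elapsed time."""
    start = time.perf_counter()
    # Each call blocks until it finishes, so the next one cannot start earlier.
    results = [fake_fetch(url) for url in urls]
    return results, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    results, elapsed = scrape_sequentially(urls)
    print(f"fetched {len(results)} pages in {elapsed:.2f} seconds")
```

With 5 simulated pages the elapsed time should be roughly `5 * DELAY`, mirroring how the real requests above queue up behind each other.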
In contrast, asynchronous web scraping runs multiple scrape tasks concurrently:
import httpx
import asyncio
from time import time
async def run():
    _start = time()
    async with httpx.AsyncClient() as client:
        pages = [
            "https://httpbin.dev/delay/2",
            "https://httpbin.dev/delay/2",
            "https://httpbin.dev/delay/2",
            "https://httpbin.dev/delay/2",
            "https://httpbin.dev/delay/2",
        ]
        # run all requests concurrently using asyncio.gather
        await asyncio.gather(*[client.get(page) for page in pages])
    print(f"finished scraping {len(pages)} pages in {time() - _start:.2f} seconds")

asyncio.run(run())
"finished scraping 5 pages in 2.93 seconds"
This Python example uses httpx.AsyncClient and asyncio to overlap the waiting time by sending all requests concurrently instead of one after another. As a result, the code completes in 2-3 seconds every time.
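To see why the total drops to roughly the duration of a single request, here is a minimal network-free sketch. It is an illustration under stated assumptions, not the article's scraper: the hypothetical `fake_fetch` coroutine simulates a request with `asyncio.sleep`, and `DELAY` is an arbitrary stand-in for the server's response time.

```python
import asyncio
import time

# Hypothetical per-request wait in seconds; stands in for the server's response time.
DELAY = 0.2


async def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request: wait DELAY seconds without blocking."""
    await asyncio.sleep(DELAY)
    return f"response from {url}"


async def scrape_concurrently(urls):
    """Start all simulated requests at once; return the results and the elapsed time."""
    start = time.perf_counter()
    # gather runs the coroutines concurrently and returns results in input order.
    results = await asyncio.gather(*(fake_fetch(url) for url in urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    results, elapsed = asyncio.run(scrape_concurrently(urls))
    print(f"fetched {len(results)} pages in {elapsed:.2f} seconds")
```

Because every simulated request waits at the same time, the elapsed time should stay close to a single `DELAY` rather than five of them. Note that this is concurrency on a single thread: the speedup comes from overlapping the waits, not from doing more work at once.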
Asynchronous programming is an ideal fit for web scraping and one of the easiest ways to speed it up.