Speed and efficiency are essential in data programming, especially when handling large-scale data extraction. Two concepts that often come into play when optimizing these processes are concurrency and parallelism. While they might sound similar, they serve distinct purposes and offer different advantages.
In this article, we'll explore the details of parallelism vs concurrency, highlight their key differences, and demonstrate how they can be leveraged in Python and JavaScript to enhance performance across various data programming tasks. Let's get started!
Concurrency refers to the ability of a system to make progress on multiple tasks during overlapping time periods, without necessarily executing them at the same instant. It's about dealing with many tasks at once, allowing each task to make progress without blocking the others.
In the context of web scraping, concurrency enables your scraper to initiate multiple HTTP requests simultaneously, handling them as responses come in. This means your scraper doesn't have to wait for one request to finish before starting another, leading to a much faster data extraction process.
Key Takeaway: Concurrency is about efficiently managing multiple tasks by avoiding blocking. It allows tasks to overlap, especially when waiting for I/O, making better use of system resources.
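As a minimal sketch of this overlap, the snippet below runs two coroutines on a single thread and records the order of events. The names worker and run_workers are hypothetical, and asyncio.sleep stands in for a real I/O wait such as an HTTP request.

```python
import asyncio


async def worker(name, delay, log):
    log.append(f"{name} start")
    # Simulated I/O wait: control returns to the event loop here,
    # so the other task can start while this one is waiting.
    await asyncio.sleep(delay)
    log.append(f"{name} done")


async def run_workers():
    log = []
    # Both workers are scheduled together; A waits longer than B.
    await asyncio.gather(worker("A", 0.05, log), worker("B", 0.01, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run_workers()))
    # → ['A start', 'B start', 'B done', 'A done']
```

Task B starts before task A has finished and completes first, even though only one thread is running. Neither task blocks the other while it waits.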
For some tasks, concurrency can be limiting because it only eliminates waiting; it does not multiply the amount of work being executed at any given instant. This is where parallelism comes into play.
Parallelism, on the other hand, involves performing multiple tasks simultaneously, typically by utilizing multiple CPU cores or processors. It's about executing multiple operations at the exact same time, which can significantly speed up computation-heavy tasks.
In web scraping, parallelism can be utilized for tasks like parsing large volumes of data or processing complex datasets. By distributing these tasks across multiple cores, you can achieve faster data processing and reduce the overall scraping time.
Key Takeaway: Parallelism is about executing multiple tasks at the same exact time, leveraging multiple CPU cores to enhance performance.
With a clear understanding of both concurrency and parallelism, it's crucial to highlight the distinctions between these two concepts to apply them effectively in your projects.
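Understanding the distinction between concurrency and parallelism is crucial for optimizing your web scraping strategies: concurrency overlaps tasks, especially while they wait on I/O, whereas parallelism executes tasks at the exact same time on multiple CPU cores.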
Understanding these differences allows you to choose the right approach based on the nature of your web scraping tasks.
Understanding the distinction between CPU-bound and IO-bound tasks is essential when deciding whether to use concurrency or parallelism in your scraping or programming workflows:
CPU-bound tasks rely heavily on processing power and computation, consuming significant CPU resources. Examples include mathematical calculations, data parsing, or image processing. Parallelism is highly effective in these cases because it distributes the workload across multiple CPU cores, speeding up task execution.
IO-bound tasks spend most of their time waiting for input/output operations, such as making HTTP requests, reading from databases, or interacting with files. Concurrency is better suited for IO-bound tasks because it allows the system to handle multiple operations at once by overlapping the waiting times, improving overall efficiency without the need for extra computational power.
By identifying whether your task is CPU-bound or IO-bound, you can apply the correct optimization approach to maximize performance.
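One rough way to tell the two apart is to compare the CPU time a task consumes with the wall-clock time it takes. The sketch below is only a heuristic: cpu_fraction, simulated_io, and simulated_cpu are hypothetical names, and time.sleep stands in for a real I/O wait.

```python
import time


def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # counts CPU time only, not time spent sleeping
    func(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def simulated_io():
    # Stand-in for an HTTP request or database read: mostly waiting
    time.sleep(0.2)


def simulated_cpu():
    # Stand-in for parsing or number crunching: mostly computing
    sum(i * i for i in range(1_000_000))


if __name__ == "__main__":
    print(f"IO-like task CPU fraction:  {cpu_fraction(simulated_io):.2f}")
    print(f"CPU-like task CPU fraction: {cpu_fraction(simulated_cpu):.2f}")
```

A fraction close to 0 suggests the task is mostly waiting and is a candidate for concurrency. A fraction close to 1 suggests it is mostly computing and is a candidate for parallelism.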
Now that we've covered the key differences between concurrency and parallelism, it's time to put these concepts into practice. We'll walk through examples of implementing both concurrency and parallelism in Python and JavaScript to improve your web scraping tasks.
In Python, concurrency is primarily achieved using the asyncio library or threading. Below is an example demonstrating how asyncio can handle multiple HTTP requests concurrently, showcasing significant speed improvements for IO-bound tasks.
import asyncio
import httpx
import time

# Asynchronous function to fetch the content of a URL
async def fetch(url):
    async with httpx.AsyncClient(timeout=10.0) as client:
        response = await client.get(url)
        return response.text

# Concurrently fetch multiple URLs using asyncio.gather
async def concurrent_fetch(urls):
    tasks = [fetch(url) for url in urls]
    return await asyncio.gather(*tasks)

# Synchronous version to demonstrate performance difference
def sync_fetch(urls):
    results = []
    for url in urls:
        response = httpx.get(url)
        results.append(response.text)
    return results

def run_concurrent():
    urls = ["http://httpbin.org/delay/2"] * 100  # Use the same delay for simplicity
    start_time = time.time()
    # Running fetch requests concurrently
    asyncio.run(concurrent_fetch(urls))
    duration = time.time() - start_time
    print(f"Concurrent fetch completed in {duration:.2f} seconds")

def run_sync():
    urls = ["http://httpbin.org/delay/2"] * 100  # Use the same delay for simplicity
    start_time = time.time()
    # Running fetch requests synchronously
    sync_fetch(urls)
    duration = time.time() - start_time
    print(f"Synchronous fetch completed in {duration:.2f} seconds")

if __name__ == "__main__":
    print("Running concurrent version:")
    # Concurrent fetch completed in 2.05 seconds
    run_concurrent()
    print("Running synchronous version:")
    # Synchronous fetch completed in 200.15 seconds
    run_sync()
Using asyncio for concurrent tasks, especially IO-bound tasks like HTTP requests, results in a massive speedup compared to processing the same requests synchronously.
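Threading, the other option mentioned above, can be sketched with ThreadPoolExecutor from the standard library. This is an illustrative example under stated assumptions: fake_request and threaded_fetch are hypothetical names, the example.com URLs are placeholders, and time.sleep stands in for the network wait so the snippet runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(url):
    # Stand-in for a blocking HTTP call; replace with a real client call
    time.sleep(0.1)
    return f"response from {url}"


def threaded_fetch(urls, max_workers=10):
    # Each worker thread blocks on its own request, so the waits overlap
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(fake_request, urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    start = time.perf_counter()
    results = threaded_fetch(urls)
    elapsed = time.perf_counter() - start
    print(f"{len(results)} responses in {elapsed:.2f} seconds")
```

With ten workers and ten simulated requests of 0.1 seconds each, the waits overlap, so the total should be close to a single request's delay rather than the one second a sequential loop would need. Threads are a reasonable choice when the HTTP client you rely on has no async API.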
In Python, parallelism can be achieved using the concurrent.futures module, specifically the ProcessPoolExecutor, which allows you to run multiple processes in parallel, each utilizing separate CPU cores. Below is an example that demonstrates how to perform a CPU-intensive task, like computing Fibonacci numbers recursively, in parallel.
from concurrent.futures import ProcessPoolExecutor
import time

# Deliberately slow recursive implementation to keep the CPU busy
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)

def run_parallel(numbers):
    start_time = time.time()
    with ProcessPoolExecutor() as executor:
        # Each number is computed in a separate worker process
        results = list(executor.map(fibonacci, numbers))
    duration = time.time() - start_time
    t = f"Parallel processing completed in {duration:.2f} seconds"
    return results, t

def run_non_parallel(numbers):
    start_time = time.time()
    results = [fibonacci(n) for n in numbers]
    duration = time.time() - start_time
    t = f"Non-parallel processing completed in {duration:.2f} seconds"
    return results, t

if __name__ == "__main__":
    numbers = [30, 31, 32, 33, 34]  # Change these values as needed
    parallel_results, parallel_time = run_parallel(numbers)
    non_parallel_results, non_parallel_time = run_non_parallel(numbers)
    # Compare results
    print(f"Parallel results: {parallel_time}")
    # Example Output: Parallel processing completed in 1.50 seconds
    print(f"Non-parallel results: {non_parallel_time}")
    # Example Output: Non-parallel processing completed in 7.50 seconds
In this example, parallel processing speeds up the computation of Fibonacci numbers by distributing the workload across multiple processors, resulting in a substantial performance improvement compared to the sequential approach.
In JavaScript, concurrency is managed using async/await and Promises. Below is an example that performs multiple HTTP requests concurrently, similar to the Python example.
const axios = require("axios");

async function fetch(url) {
  const response = await axios.get(url);
  return response.data;
}

async function concurrentFetch(urls) {
  const startTime = Date.now();
  // Start all requests at once, then wait for every one to finish
  const promises = urls.map((url) => fetch(url));
  await Promise.all(promises);
  const duration = (Date.now() - startTime) / 1000;
  console.log(`Concurrent fetch completed in ${duration.toFixed(2)} seconds`);
}

// Generate 100 URLs with varying delays (cycling between 1 to 5 seconds)
const urls = Array.from(
  { length: 100 },
  (_, i) => `http://httpbin.org/delay/${(i % 5) + 1}`
);

concurrentFetch(urls);
// Output: Concurrent fetch completed in X.XX seconds (depending on delay and execution)
Similar to the Python example, the concurrent HTTP requests complete in approximately the time of the longest single delay (5 seconds), showcasing effective concurrency in JavaScript.
In JavaScript (Node.js), you can achieve parallelism using the worker_threads module. This allows you to run CPU-intensive tasks in separate threads, taking advantage of multi-core processors. Below is an example that shows how to calculate Fibonacci numbers in parallel.
const {
  Worker,
  isMainThread,
  parentPort,
  workerData,
} = require("worker_threads");

function fibonacci(n) {
  if (n <= 1) {
    return n;
  } else {
    return fibonacci(n - 1) + fibonacci(n - 2);
  }
}

function runNonParallel(numbers) {
  const startTime = Date.now();
  const results = numbers.map(fibonacci);
  const duration = (Date.now() - startTime) / 1000;
  const t = `Non-parallel processing completed in ${duration.toFixed(
    2
  )} seconds`;
  return t;
}

// Run this same file as a worker and resolve with the value it posts back
function runWorker(workerData) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(__filename, { workerData });
    worker.on("message", resolve);
    worker.on("error", reject);
    worker.on("exit", (code) => {
      if (code !== 0)
        reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}

if (isMainThread) {
  const numbers = [30, 31, 32, 33, 34]; // Change these values as needed

  const runParallel = async (numbers) => {
    const startTime = Date.now();
    const promises = numbers.map((number) => runWorker(number));
    await Promise.all(promises);
    const duration = (Date.now() - startTime) / 1000;
    const t = `Parallel processing completed in ${duration.toFixed(2)} seconds`;
    return t;
  };

  (async () => {
    const parallelResults = await runParallel(numbers);
    const nonParallelResults = runNonParallel(numbers);
    // Compare results
    console.log(parallelResults);
    console.log(nonParallelResults);
  })();
} else {
  // Worker thread: compute the value for the number passed in and send it back
  const result = fibonacci(workerData);
  parentPort.postMessage(result);
}
The script utilizes worker_threads to run the fibonacci function in parallel for different numbers. This allows CPU-bound tasks like calculating Fibonacci numbers to run simultaneously across multiple worker threads.
Concurrency excels in reducing wait times for I/O-bound tasks, while parallelism dramatically improves execution times for CPU-bound tasks.
When comparing concurrency and parallelism, it's essential to consider the nature of the tasks being executed:
Concurrency is generally more efficient for I/O-bound tasks, such as making multiple web requests or reading files. By allowing tasks to overlap, concurrency can significantly reduce wait times. For example, when fetching multiple URLs concurrently, the total time taken aligns closely with the longest individual request duration rather than the sum of all delays. This can lead to a total execution time of approximately 5 seconds when fetching five URLs with delays ranging from 1 to 5 seconds.
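The "longest delay, not the sum" behavior can be checked without any network access. In the sketch below, the 1 to 5 second delays are scaled down to 0.1 to 0.5 seconds so it runs quickly. The names fake_fetch and timed_gather are hypothetical, and asyncio.sleep stands in for the request.

```python
import asyncio
import time


async def fake_fetch(delay):
    # Stand-in for an HTTP request that takes `delay` seconds to respond
    await asyncio.sleep(delay)
    return delay


async def timed_gather(delays):
    start = time.perf_counter()
    # All simulated requests wait at the same time
    await asyncio.gather(*(fake_fetch(d) for d in delays))
    return time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.1, 0.2, 0.3, 0.4, 0.5]
    elapsed = asyncio.run(timed_gather(delays))
    print(f"longest delay: {max(delays):.2f}s, sum of delays: {sum(delays):.2f}s")
    print(f"measured total: {elapsed:.2f}s")
```

The measured total should land near the longest delay (0.5 seconds) rather than the sum (1.5 seconds), which mirrors the five-URL scenario described above.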
For CPU-bound tasks, such as complex calculations or data processing, parallelism often outperforms concurrency. By utilizing multiple CPU cores, parallelism can execute tasks simultaneously, resulting in much faster execution times. For instance, processing a list of numbers to calculate their squares in parallel can reduce total execution time significantly compared to sequential or concurrent approaches, especially with larger datasets.
Scaling your web scraping operations requires robust tools that can handle concurrency and parallelism efficiently. This is where ScrapFly comes into play.
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
ScrapFly allows you to run scrapes concurrently, right out of the box. We leverage asyncio in Python to efficiently manage concurrent requests.
There are several ways to achieve concurrency in Python. You can also use:
Before using concurrency in ScrapFly, ensure you have the concurrency module installed:
pip install 'scrapfly-sdk[concurrency]'
Here’s an example of how you can perform concurrent scraping with ScrapFly:
import asyncio
import logging as logger
from sys import stdout

from scrapfly import ScrapeConfig, ScrapflyClient

# Send the SDK's debug logs to stdout
scrapfly_logger = logger.getLogger('scrapfly')
scrapfly_logger.setLevel(logger.DEBUG)
scrapfly_logger.addHandler(logger.StreamHandler(stdout))

# Initialize the ScrapFly client with a concurrency limit of 2
scrapfly = ScrapflyClient(key='your_api_key_here', max_concurrency=2)

async def main():
    targets = [
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True)
    ]
    # Results are yielded as each scrape completes
    async for result in scrapfly.concurrent_scrape(scrape_configs=targets):
        print(result)

asyncio.run(main())
In this example, ScrapFly efficiently handles concurrent scrapes using asyncio. The max_concurrency parameter controls how many requests can be executed concurrently, helping optimize resource usage and speed.
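To illustrate what a concurrency limit does in general, here is a minimal sketch using asyncio.Semaphore. It is not ScrapFly's implementation: limited_gather, demo, and fake_scrape are hypothetical names, and asyncio.sleep stands in for a real scrape.

```python
import asyncio


async def limited_gather(coroutine_factories, max_concurrency):
    # The semaphore lets at most `max_concurrency` tasks run their body at once
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in coroutine_factories))


async def demo(n_tasks=8, max_concurrency=2):
    active = 0  # tasks currently inside fake_scrape
    peak = 0    # highest number of simultaneously active tasks observed

    async def fake_scrape():
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # simulated request
        active -= 1
        return "ok"

    results = await limited_gather([fake_scrape] * n_tasks, max_concurrency)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(f"{len(results)} results, peak concurrency: {peak}")
    # → 8 results, peak concurrency: 2
```

All eight tasks are scheduled immediately, but only two are ever in flight at the same time. The rest wait at the semaphore until a slot frees up, which keeps the load on the target and on your own resources bounded.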
To wrap up this guide, let's have a look at some frequently asked questions about concurrency and parallelism:
Concurrency refers to the ability to handle multiple tasks at the same time, often by managing tasks that may be waiting for I/O operations, such as HTTP requests. In contrast, parallelism involves executing multiple tasks simultaneously, typically on separate CPU cores.
In Python, you can use asyncio with libraries like httpx or aiohttp to send multiple HTTP requests concurrently. In JavaScript, async/await and Promise.all() help in handling concurrent web scraping tasks.
In Python, modules like multiprocessing or concurrent.futures allow you to distribute tasks across multiple CPU cores. In Node.js, the worker_threads module can be used to achieve parallelism for heavy computations.
You should use concurrency when your web scraping tasks are primarily I/O-bound, such as making multiple HTTP requests or waiting for responses from servers. This approach allows you to maximize resource utilization by overlapping waiting times. On the other hand, use parallelism when your tasks are CPU-bound, such as complex data processing or parsing, where you can benefit from utilizing multiple CPU cores to perform computations simultaneously.
Understanding the difference between concurrency and parallelism is crucial for optimizing web scraping performance. Concurrency enables efficient handling of multiple I/O-bound tasks, allowing your scrapers to manage simultaneous requests without delays. Meanwhile, parallelism accelerates the processing of CPU-bound tasks by leveraging multiple cores. By effectively applying these concepts, you can significantly enhance the speed and reliability of your web scraping operations.