In this section, we'll walk through the most important web scraping features step-by-step.
After completing this walkthrough you should be proficient enough to tackle scraping almost any website
with Scrapfly, so let's dive in!
We'll scrape the page, see some optional parameters and then extract the product details using
CSS selectors.
from scrapfly import ScrapflyClient, ScrapeConfig
# to enable debug logs use the built-in Python logging module:
import logging
logging.getLogger("scrapfly").setLevel(logging.DEBUG)
# Create a ScrapflyClient instance
client = ScrapflyClient(key='')
# Create scrape requests
api_result = client.scrape(ScrapeConfig(url="https://web-scraping.dev/product/1"))
print(api_result.result["context"]) # metadata
print(api_result.result["config"]) # request data
print(api_result.scrape_result["content"]) # result html content
product = {
"title": api_result.selector.css("h3.product-title::text").get(),
"price": api_result.selector.css(".product-price::text").get(),
"description": api_result.selector.css(".product-description::text").get(),
}
print(product)
{
"title": "Box of Chocolate Candy",
"price": "$9.99 ",
"description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.",
}
Above, we first requested Scrapfly API to scrape the product page for us.
Then, we used the selector attribute to parse the product details
using CSS Selectors.
This example is quite simple, but what if we need more complex request configurations?
Next, let's take a look at available scraping request options.
Request Customization
All SDK requests are configured through ScrapeConfig object attributes.
Most attributes mirror API parameters.
For more information see the request customization documentation.
Here's a quick demo example:
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key='')
api_result = client.scrape(ScrapeConfig(
url="https://httpbin.dev/post",
# request method can be GET (default), POST, HEAD etc.
method="POST",
# attach request body
data='{"name": "scrapfly python"}',
# attach custom headers
headers={
"Content-Type": "application/json",
"Authorization": "Bearer 123",
},
))
print(api_result.content)
Using ScrapeConfig we can not only configure outgoing scrape requests
but also enable Scrapfly-specific features.
Developer Features
There are a few important developer features that can be enabled to make the onboarding process a bit easier.
The debug parameter can be enabled
to produce more details in the web log output and the cache
parameters are great for exploring the APIs while onboarding:
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key='')
api_result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/product/1",
# when debug is set, scrapfly will record and save more details
debug=True,
render_js=True, # with debug and render_js it'll save a screenshot
# cache can be enabled to save bandwidth and credits and to speed up development
cache=True,
cache_ttl=3600, # cache time to live in seconds
cache_clear=False, # set to True to clear cache at any time.
))
print(api_result.content)
By enabling debug we can see that the monitoring dashboard produces more details and even captures screenshots for review!
The next feature set allows us to supercharge our scrapers with web browsers, so let's take a look.
Using Web Browsers
Scrapfly can scrape using real web browsers, enabled through the render_js
parameter. When enabled, instead of making a plain HTTP request, Scrapfly will render the page in a real browser
and return the rendered page content together with browser data like captured background requests and database contents.
This makes Scrapfly scrapers incredibly powerful and customizable! Let's take a look at some examples.
To illustrate this, let's take a look at the example page web-scraping.dev/reviews
which requires javascript to load its review data.
To scrape this kind of page we can use Scrapfly's web browsers, and we can approach it in two ways:
Rendering Javascript
The first approach is to simply wait for the page to load and scrape the content:
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key='')
api_result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/product/1",
render_js=True,
# wait for page element to appear
wait_for_selector=".review",
# or wait for a specific time
# rendering_wait=3000, # 3 seconds
))
reviews = []
for review in api_result.selector.css('.review'):
    reviews.append({
        "date": review.css('span::text').get(),
        "text": review.css('p::text').get(),
        "stars": len(review.css('svg')),
    })
print(reviews)
# prints:
[
{
"date": "2022-07-22",
"text": "Absolutely delicious! The orange flavor is my favorite.",
"stars": 5
},
{
"date": "2022-08-16",
"text": "I bought these as a gift, and they were well received. Will definitely purchase again.",
"stars": 4
},
{
"date": "2022-09-10",
"text": "Nice variety of flavors. The chocolate is rich and smooth.",
"stars": 5
},
{
"date": "2022-10-02",
"text": "The cherry flavor is amazing. Will be buying more.",
"stars": 5
},
{
"date": "2022-11-05",
"text": "A bit pricey, but the quality of the chocolate is worth it.",
"stars": 4
}
]
This approach is quite simple as we get exactly what we see in our own web browser, making the development process easier.
XHR Capture
The second approach is to capture the background requests that generate this data on load directly:
import json
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key='')
api_result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/reviews",
render_js=True,
# wait for page element to appear
wait_for_selector=".review",
))
# the browser captures all XHR calls as a list
all_xhr_calls = api_result.scrape_result['browser_data']['xhr_call']
# find the right call by inspecting call['body'] or call['url']
reviews_call = next(call for call in all_xhr_calls if 'GetReviews' in call['body'])
reviews = []
# pull the results from XHR response body which is a JSON string
reviews_response_data = json.loads(reviews_call['response']['body'])
for edge in reviews_response_data['data']['reviews']['edges']:
    reviews.append(edge['node'])
print(reviews)
[
{
"rid": "teal-potion-4",
"text": "Unique flavor and great energy boost. It's the perfect gamer's drink!",
"rating": 5,
"date": "2023-05-18"
},
{
"rid": "red-potion-4",
"text": "Good flavor and keeps me energized. The bottle design is really fun.",
"rating": 5,
"date": "2023-05-17"
},
# ... 18 more items
]
The advantage of this approach is that we capture JSON data directly and don't need to parse anything!
Though it is a bit more complex and requires some web development knowledge.
Browser Control
Finally, we can fully control the entire browser. For example, we can use
Javascript Scenarios to
enter username and password and click the login button to authenticate on
web-scraping.dev/login:
1. Go to web-scraping.dev/login
2. Wait for the page to load
3. Enter the username into the Username input
4. Enter the password into the Password input
5. Click the login button
6. Wait for the page to load
To achieve this using javascript scenarios all we have to do is describe these steps as a JSON template:
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key='')
# define your scenario, see docs for available options https://scrapfly.io/docs/scrape-api/javascript-scenario
scenario = [
{"fill": {"selector": "input[name=username]", "value":"user123"}},
{"fill": {"selector": "input[name=password]", "value":"password"}},
{"click": {"selector": "button[type='submit']"}},
{"wait_for_navigation": {"timeout": 5000}}
]
api_result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/login",
# enable the browser and pass the scenario defined above
render_js=True,
js_scenario=scenario,
))
print(api_result.selector.css("#secret-message ::text").get())
# prints:
"🤫"
Javascript scenarios really simplify the browser automation process, though we can take this even further!
Javascript Execution
For more experienced web developers, full javascript environment access is available through the
js parameter.
For example, let's execute some javascript parsing code using the querySelectorAll() method:
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key='')
js = """
return Array.from(
document.querySelectorAll('.review > p')
).map(
(el) => el.textContent
)
"""
api_result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/reviews",
# enable js and pass your JS script as string
render_js=True,
js=js,
# note that we can also wait for specific elements to load before js execution
wait_for_selector=".review",
))
print(api_result.scrape_result["browser_data"]["javascript_evaluation_result"])
# will print
[
"Unique flavor and great energy boost. It's the perfect gamer's drink!",
"Good flavor and keeps me energized. The bottle design is really fun.",
"Excellent energy drink for gamers. The tropical flavor is refreshing.",
"It’s fun, tastes good, and the energy boost is helpful during intense gaming sessions.",
"The best sneakers I've bought in a long time. Stylish, comfortable and the leather quality is top-notch.",
"The cherry cola flavor is a win. Keeps me energized and focused.",
"These shoes are a hit with my twins. They love the lights and they seem to be quite durable.",
"Excellent boots for outdoor adventures. They offer great support and are very comfortable.",
"Nice design and good quality, but a bit too tight for me. Otherwise, it's a pretty cool beanie.",
"Really enjoyed the citrus flavor and the energy boost it gives.",
"Great concept, but the sizing is a bit off. Order a size larger.",
"Delicious chocolates, and the box is pretty substantial. It'd make a nice gift.",
"The boots are durable, but the laces could be better quality.",
"This potion is a game changer. Love the energy boost and the flavor.",
"Light, comfortable, and nice design. I'm buying another pair.",
"Highly recommend! These shoes are great for evening walks. Kids love them!",
"It's like a health potion for gamers! The energy boost is spot on.",
"Really helps me focus during intense gaming marathons. The teal color is a nice touch.",
"The shoes are nice but they didn't fit me well. I had to exchange for a larger size.",
"Great taste, and the energy kick is awesome. Feels just like a strength potion."
]
Here the browser executed the requested snippet of javascript and returned the results.
With custom request options and cloud browsers you're really in control of every web scraping step!
Next, let's see the features that allow access to any web page without being blocked: proxies and ASP.
Bypass Blocking
Scraper blocking can be very difficult to understand, so Scrapfly provides a single setting that simplifies
bypassing it. The Anti Scraping Protection (asp)
parameter will automatically configure requests and bypass most anti-scraping protection systems:
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key='')
api_result = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/product/1",
# Enable Anti Scraping Protection bypass:
asp=True,
))
While ASP can bypass most anti-scraping protection systems like Cloudflare, Datadome etc.,
some blocking techniques are based on geographic location or proxy type.
Proxy Country
All Scrapfly requests go through a proxy, drawn from millions of IPs available in over 50 countries.
Some websites, however, are only available in specific regions or block connections from some countries more than others.
For that, the country parameter can be used
to define which countries' proxies are used.
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key='')
api_result = client.scrape(ScrapeConfig(
url="https://tools.scrapfly.io/api/info/ip",
# Set which proxy countries can be used for this request:
country="US,CA",
# you can also block countries with a `-` prefix, for example block mexico:
# country = "-MX"
))
print(api_result.content)
{"country":"us","ip":"1.14.131.41"}
Here, by querying Scrapfly's IP analysis API tool, we can see which proxy country Scrapfly used.
Proxy Type
Further, Scrapfly offers two types of IPs: datacenter and residential. For targets that are harder to reach,
residential proxies can perform much better. By setting the proxy_pool
parameter to the residential pool we can switch to these stronger proxies:
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key='')
api_result = client.scrape(ScrapeConfig(
url="https://tools.scrapfly.io/api/info/ip",
# See for available pools: https://scrapfly.io/dashboard/proxy
proxy_pool="public_residential_pool",
))
Concurrent Scraping
The Python SDK is asynchronous through Python's built-in asyncio module, which means
each API call can be run concurrently for significantly faster scraping.
There are two ways to approach this:
await client.async_scrape() - can be used instead of the client.scrape()
method to add async scrape calls to your async application.
client.concurrent_scrape() - can be used as an async generator to batch many scrapes.
See this example implementation:
import asyncio
from scrapfly import ScrapflyClient, ScrapeConfig
async def main():
    client = ScrapflyClient(key="")
    # create a pool of scrape configurations:
    configs = [
        ScrapeConfig(url="https://httpbin.dev/html")
        for i in range(10)
    ]
    results = []
    errors = []
    # execute them concurrently using the async for generator:
    async for result in client.concurrent_scrape(configs, concurrency=5):
        if isinstance(result, Exception):
            errors.append(result)
        else:
            results.append(result)
    # alternatively - use async_scrape() with asyncio.gather to create your own concurrency control:
    manual_results = await asyncio.gather(*[client.async_scrape(config) for config in configs])
    print(f"Results: {results}")
    print(f"Errors: {errors}")
asyncio.run(main())
Here we used the asynchronous generator to scrape multiple pages concurrently. We can set the
concurrency parameter to a desired limit; if omitted, your account's
max concurrency limit will be used.
This covers the core functionality of Scrapfly's Web Scraping API, though there are many more features available.
For more, see the full API specification.
Extraction API
Now that we know how to scrape data using Scrapfly's Web Scraping API, we can start parsing it for information,
and for that Scrapfly's Extraction API
is an ideal choice.
The Extraction API offers 3 ways to parse data: LLM prompts, AI auto extraction and custom extraction rules.
All of these are available through the extract() method and the ExtractionConfig object
of the Python SDK. Let's take a look at some examples.
LLM Prompts
The Extraction API allows prompting any text content
using LLM prompts. The prompts can be used
to summarize content, answer questions about it or generate structured data like JSON or CSV.
As an example, see this freeform prompt used with the Python SDK:
from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig
client = ScrapflyClient(key="")
# First retrieve your html or scrape it using web scraping API
html = client.scrape(ScrapeConfig(url="https://web-scraping.dev/product/1")).content
# Then, extract data using extraction_prompt parameter:
api_result = client.extract(ExtractionConfig(
body=html,
content_type="text/html",
extraction_prompt="extract main product price only",
))
print(api_result.result)
{
"content_type": "text/html",
"data": "9.99",
}
LLMs are great for freeform or creative questions, but for extracting known data types like products, reviews etc.
there's a better option - AI Auto Extraction. Let's take a look at that next.
Auto Extraction
Scrapfly's Extraction API also includes a number of predefined models that can be used to
automatically extract common objects
like products, reviews, articles etc. without the need to write custom extraction rules.
The predefined models are available through the extraction_model parameter
of the ExtractionConfig object. For example, let's use the product model:
from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig
client = ScrapflyClient(key="")
# First retrieve your html or scrape it using web scraping API
html = client.scrape(ScrapeConfig(url="https://web-scraping.dev/product/1")).content
# Then, extract data using extraction_model parameter:
api_result = client.extract(ExtractionConfig(
body=html,
content_type="text/html",
extraction_model="product",
))
print(api_result.result)
Auto Extraction is powerful but can be limited for unique niche scenarios where manual extraction rules can be a better fit.
For that, let's take a look at Extraction Templates next, which let you define your own extraction rules through a JSON schema.
Extraction Templates
For more specific data extraction, the Scrapfly Extraction API allows defining custom extraction rules.
This is done through a JSON schema which defines how data is selected using XPath or CSS selectors
and how it is processed through pre-defined processors and formatters.
This is a great tool for developers who are already familiar with data parsing in web scraping.
See this example:
from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig
client = ScrapflyClient(key="")
# First retrieve your html or scrape it using web scraping API
html = client.scrape(ScrapeConfig(
url="https://web-scraping.dev/reviews",
render_js=True,
wait_for_selector=".review"
)).content
# Then create your extraction template
template = {
"source": "html",
"selectors": [
{
"name": "date_posted",
# use css selectors
"type": "css",
"query": "[data-testid='review-date']::text",
"multiple": True, # one or multiple?
# post process results with formatters
"formatters": [ {
"name": "datetime",
"args": {"format": "%Y, %b %d — %A"}
} ]
}
]
}
api_result = client.extract(ExtractionConfig(
body=html,
content_type="text/html",
ephemeral_template=template,
))
print(api_result.result)
# will print
{
"data": {
"date_posted": [
"2023, May 18 — Thursday",
"2023, May 17 — Wednesday",
"2023, May 16 — Tuesday",
"2023, May 15 — Monday",
"2023, May 15 — Monday",
"2023, May 12 — Friday",
"2023, May 10 — Wednesday",
"2023, May 01 — Monday",
"2023, May 01 — Monday",
"2023, Apr 25 — Tuesday",
"2023, Apr 25 — Tuesday",
"2023, Apr 18 — Tuesday",
"2023, Apr 12 — Wednesday",
"2023, Apr 11 — Tuesday",
"2023, Apr 10 — Monday",
"2023, Apr 10 — Monday",
"2023, Apr 09 — Sunday",
"2023, Apr 07 — Friday",
"2023, Apr 07 — Friday",
"2023, Apr 05 — Wednesday"
]
},
"content_type": "application/json"
}
For all available selectors, formatters and extractors see
Templates documentation.
Above we define a template that selects review dates using CSS selectors and then re-formats them to a new
date format using datetime formatters.
With this we can now scrape any page and extract any data we need!
To wrap this up let's take a look at another data capture format next - Screenshot API.
Screenshot API
While it's possible to capture screenshots using the Web Scraping API, Scrapfly also
includes a dedicated screenshot API
that significantly streamlines the screenshot capture process.
The Screenshot API can be accessed through the SDK's screenshot() method
and configured through the ScreenshotConfig configuration object.
Here's a basic example:
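(A minimal sketch, modeled on the ScreenshotConfig usage shown in the auto-scroll example further below; the target URL and output filename are just placeholders.)
from scrapfly import ScrapflyClient, ScreenshotConfig
client = ScrapflyClient(key="")
# capture a screenshot of a single page
api_result = client.screenshot(ScreenshotConfig(
    url="https://web-scraping.dev/product/1",
))
print(api_result.metadata)  # json metadata
# the binary image can be saved to a file
with open("screenshot.png", "wb") as f:
    f.write(api_result.image)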
Here, all we did was provide a URL to capture and the API returned a screenshot.
The Screenshot API also inherits many features from the Web Scraping API, like cache
and webhook, which are fully functional.
Resolution
Next, we can heavily customize how the screenshot is captured. For example, we can change
the viewport size from the default 1920x1080 to any other resolution, like 540x1200,
to simulate mobile views:
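Here's a rough sketch of such a request. It assumes ScreenshotConfig exposes the API's resolution and capture parameters as strings; check the Screenshot API documentation for the exact option names and values.
from scrapfly import ScrapflyClient, ScreenshotConfig
client = ScrapflyClient(key="")
api_result = client.screenshot(ScreenshotConfig(
    url="https://web-scraping.dev/product/1",
    # assumed parameter: change the browser viewport from the default 1920x1080
    resolution="540x1200",
    # capture the full page height rather than just the visible viewport
    capture="fullpage",
))
print(api_result.metadata)  # json metadata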
Here, by setting the capture parameter to fullpage, we've captured the entire page.
Though, if the page requires scrolling to load more content, we can capture that too using another parameter.
Auto Scroll
Just like with the Web Scraping API we can force automatic scroll on the page to load dynamic elements that
load on scrolling. In this example, we're capturing a screenshot of
web-scraping.dev/testimonials which loads new testimonial
entries when the user scrolls the page:
from scrapfly import ScrapflyClient, ScreenshotConfig
client = ScrapflyClient(key="")
# capture a full-page screenshot, auto scrolling to load all content first
api_result = client.screenshot(ScreenshotConfig(
url="https://web-scraping.dev/testimonials",
capture="fullpage",
auto_scroll=True, # scroll to the bottom
))
print(api_result.image) # binary image
print(api_result.metadata) # json metadata
Here the browser auto-scrolled to the very bottom and loaded all of the testimonials before capturing the screenshot.
Next, we can capture only specific areas of the page. Let's take a look at how.
Capture Areas
To capture specific areas we can use XPath or CSS selectors to define what to capture.
For this, the capture parameter is used with the selector for an element to capture.
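As a sketch of how that could look (the CSS selector below is purely illustrative and assumes the product page has a reviews container with that id):
from scrapfly import ScrapflyClient, ScreenshotConfig
client = ScrapflyClient(key="")
api_result = client.screenshot(ScreenshotConfig(
    url="https://web-scraping.dev/product/1",
    # instead of "fullpage", pass a CSS selector of the element to capture
    capture="#reviews",
))
print(api_result.metadata)  # json metadata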
Here, using a CSS selector, we can restrict the capture to only the areas that are relevant to us.
Finally, for more capture configurations we can use screenshot options. Let's take a look at those next.
Capture Options
Capture options can apply various page modifications to capture the page in a specific way.
For example, using the block_banners option we can block cookie banners, and
using dark_mode we can apply a dark theme to the captured page:
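Here's a brief sketch of how these could be enabled. It assumes ScreenshotConfig accepts the API's options parameter as a list of option names, so treat the exact spelling as an assumption and verify it against the Screenshot API docs.
from scrapfly import ScrapflyClient, ScreenshotConfig
client = ScrapflyClient(key="")
api_result = client.screenshot(ScreenshotConfig(
    url="https://web-scraping.dev/product/1",
    # assumed usage: apply page modifications before the capture
    options=["block_banners", "dark_mode"],
))
print(api_result.metadata)  # json metadata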
This concludes our onboarding tutorial, though Scrapfly has many more features and options available.
To explore them, see the getting started pages and API specification of each API, as all of these features
are available in every Scrapfly SDK and package!