Web Scraping With Scrapfly
The Scrapfly Python SDK is powerful yet intuitive. It supports modern Python features like asyncio and type hints, which makes it a breeze to develop production-level scrapers. To start, take a look at our introduction and overview video:
If you're not ready to code yet, check out Scrapfly's Visual API Player.
Quickstart Python SDK Example
Let's take a look at a classic web scraping example by scraping Hacker News, a popular discussion forum for tech news. We'll be scraping submission data from the first page of entries.
All SDK examples can be found on SDK's Github repository: github.com/scrapfly/python-scrapfly/tree/master/examples
To start, we'll be using Python 3.7 and scrapfly-sdk, which is available on the pypi.org Python package index. It can be installed using the pip terminal command:
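For example, from a terminal:

```shell
pip install scrapfly-sdk
```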
Optional Extras
Scrapfly SDK also comes with some optional extras:
- Concurrency module:
pip install 'scrapfly-sdk[concurrency]'
- Scrapy module:
pip install 'scrapfly-sdk[scrapy]'
- Performance module (brotli compression and msgpack serialization):
pip install 'scrapfly-sdk[speedups]'
- All modules:
pip install 'scrapfly-sdk[all]'
- Parsing integration:
pip install parsel
First Scrape
We will scrape Hacker News to retrieve all article submissions of the day.
To extract data from HTML content, we will use parsel, which integrates automatically with scrapfly-sdk.
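A minimal sketch of such a scraper is below. The API key is a placeholder, and the CSS selectors assume Hacker News' current markup (`tr.athing` submission rows with `.titleline` links), which may change:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")  # placeholder key
result = client.scrape(ScrapeConfig(url="https://news.ycombinator.com/"))

# result.selector is a parsel Selector built from the response HTML
for row in result.selector.css("tr.athing"):
    print({
        "rank": row.css("span.rank::text").get(),
        "title": row.css(".titleline a::text").get(),
        "url": row.css(".titleline a::attr(href)").get(),
    })
```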
Above, we first requested Scrapfly API to scrape the page for us.
Then, we used the selector attribute to access the HTML tree of the result.
From there on, we used a combination of CSS and XPath selectors to extract details from the HTML elements.
This scraper should produce 30 Article results:
Logging
Scrapfly logs data through Python's default logging library and can be controlled by retrieving the scrapfly logger:
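For example, using only the standard library, the scrapfly logger can be turned up to DEBUG while the rest of the application stays at INFO:

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    level=logging.INFO,  # default level for all loggers
)
# retrieve the "scrapfly" logger and make it more verbose
logging.getLogger("scrapfly").setLevel(logging.DEBUG)
```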
Request Customization
All SDK requests are configured through ScrapeConfig object attributes.
Most attributes mirror API parameters.
For more information, see request customization.
Here's a quick demo example:
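A sketch of what such a demo might look like. The httpbin.dev endpoint, header, and body values are illustrative, and the key is a placeholder:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")  # placeholder key
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/post",
    method="POST",                    # HTTP method to use
    data={"query": "example"},        # request body
    headers={"X-Example": "demo"},    # additional request headers
))
print(result.content)  # response body as text
```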
All scraping data is configured through the ScrapeConfig object, and here's the example output from our demo above:
Proxy Pool Selection
The SDK has access to all Scrapfly proxy pools, which can be selected through the country and proxy_pool parameters:
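A sketch, assuming a pool identifier such as public_residential_pool (check the proxy documentation for the pools available on your plan):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")  # placeholder key
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/ip",
    proxy_pool="public_residential_pool",  # residential proxy pool
    country="US",                          # proxies located in the United States
))
```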
Proxy Geo Targeting
The country parameter indicates the origin country of the selected proxy. This can be used to scrape geographically locked content that's only available in certain countries:
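A sketch, using the illustrative httpbin.dev/ip endpoint, which echoes the IP address the request originated from:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")  # placeholder key
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/ip",
    country="GB",  # route through a proxy located in the United Kingdom
))
print(result.content)  # should show a UK IP address
```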
Expected output:
Session
Persistent scraping sessions are also available in the Python SDK through the ScrapeConfig.session parameter:
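A sketch; the session name is an arbitrary string of your choosing, and the httpbin.dev cookie endpoints are illustrative:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")  # placeholder key
# both requests use the "demo" session, reusing cookies and proxy
for url in [
    "https://httpbin.dev/cookies/set/foo/bar",
    "https://httpbin.dev/cookies",
]:
    result = client.scrape(ScrapeConfig(url=url, session="demo"))
    print(result.content)
```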
The above should produce results similar to:
Download File / Image
To download binary data such as images, PDFs or other files, send requests as usual; the result data can be saved to a file using the helper ScrapflyClient.sink() method:
Let's take a look at an example of how to download a Jeppesen chart of VQPR airport in PDF format.
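A sketch of the download flow; the chart URL below is hypothetical, so substitute the real document location:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")  # placeholder key
# hypothetical PDF url - replace with the chart you want to download
result = client.scrape(ScrapeConfig(url="https://example.com/charts/VQPR.pdf"))
client.sink(result)  # write the binary response body to a local file
```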
This is the expected log output, which shows the file has been stored locally:
Cache
Caching is especially useful when working with the Scrapfly SDK. It's a great tool for the development and testing of scrapers as it significantly speeds up the scraping process.
The cache is controlled through the cache and cache_ttl (expiration) parameters:
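A sketch, with an illustrative target URL and a placeholder key:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")  # placeholder key
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/html",
    cache=True,       # serve repeated requests from Scrapfly's cache
    cache_ttl=3600,   # cached entry expires after one hour (in seconds)
))
```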
Expected output:
Screenshot
The screenshot feature is available through the screenshot parameter, which takes a dictionary of screenshot selections where the key is the screenshot name and the value is the area to capture. The area can be a selector, or fullpage to capture everything.
Alternatively, the quick shortcut .screenshot() method can be used to capture the whole page:
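A sketch of both approaches. The ScrapeConfig keyword is written as screenshots in recent SDK versions, and the element selector below is illustrative:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")  # placeholder key
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    render_js=True,  # screenshots require browser rendering
    screenshots={
        "everything": "fullpage",   # capture the entire page
        "products": "#products",    # capture one element (illustrative selector)
    },
))

# shortcut: capture the whole page in one call
screenshot = client.screenshot(url="https://web-scraping.dev/products")
```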
Anti Scraping Protection (ASP)
Anti Scraping Protection bypass can be enabled through the asp parameter, which will automatically configure requests and bypass most anti-scraping protection systems:
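A sketch, with an illustrative protected target and a placeholder key:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")  # placeholder key
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",  # illustrative target
    asp=True,  # enable Anti Scraping Protection bypass
))
```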
Reporter
The Scrapfly SDK comes with a configurable reporter out of the box. It allows collecting most Scrapfly events, like status codes and errors, through configurable callbacks. This feature is crucial for long-term web scraping project support and we highly recommend taking a look at it:
Scrapfly includes some pre-built reporters:
- PrintReporter - prints to stdout
- ChainReporter - base for chaining multiple reporting callbacks
- NoopReporter - empty reporter (default)
- SentryReporter - Sentry integration; when installed with the Python Sentry SDK and configured, the SDK will capture exceptions with enriched context data to your Sentry instance.
Prebuilt imports:
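These could be imported and attached along the following lines; the module path and the reporter keyword argument are assumptions based on the SDK's layout:

```python
from scrapfly import ScrapflyClient
from scrapfly.reporter import PrintReporter

# print every reported event (status codes, errors, ...) to stdout
# (reporter= keyword is an assumption - check the SDK reference)
client = ScrapflyClient(key="YOUR SCRAPFLY KEY", reporter=PrintReporter())
```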