Web Scraping With Scrapfly

Scrapfly Python SDK is powerful but intuitive. It supports modern Python features like asyncio and type hints, which make it a breeze to develop production-level scrapers. To start, take a look at our introduction and overview video.

If you're not ready to code yet, check out Scrapfly's Visual API Player.

Quickstart Python SDK Example

Let's take a look at a classic web scraping example by scraping Hacker News, a popular discussion forum for tech news. We'll be scraping submission data from the first page of entries.

All SDK examples can be found in the SDK's GitHub repository: github.com/scrapfly/python-scrapfly/tree/master/examples

To start, we'll be using Python 3.7 and scrapfly-sdk, which is available on the pypi.org Python package index. It can be installed using the pip terminal command:
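```shell
pip install scrapfly-sdk
```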

Optional Extras

Scrapfly SDK also comes with some optional extras:

  • Concurrency module: pip install 'scrapfly-sdk[concurrency]'
  • Scrapy module: pip install 'scrapfly-sdk[scrapy]'
  • Performance module: pip install 'scrapfly-sdk[speedups]' (brotli compression and msgpack serialization)
  • All modules: pip install 'scrapfly-sdk[all]'
  • Parsing integration: pip install parsel

First Scrape

We will scrape Hacker News to retrieve all article submissions of the day. To extract data from the HTML content, we will use parsel, which automatically integrates with scrapfly-sdk.
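Here's a minimal sketch of such a scraper. The Article dataclass and the CSS/XPath selectors are illustrative and based on Hacker News markup at the time of writing, so they may need adjusting:

```python
from dataclasses import dataclass

from scrapfly import ScrapflyClient, ScrapeConfig


@dataclass
class Article:
    rank: str
    title: str
    url: str


client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
result = client.scrape(ScrapeConfig(url="https://news.ycombinator.com/"))

articles = []
# every submission on the front page is a <tr class="athing"> row
for row in result.selector.css("tr.athing"):
    articles.append(Article(
        rank=row.css("span.rank::text").get(default="").strip("."),
        title=row.xpath(".//span[@class='titleline']/a/text()").get(),
        url=row.xpath(".//span[@class='titleline']/a/@href").get(),
    ))
print(f"scraped {len(articles)} articles")
```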


Above, we first requested Scrapfly API to scrape the page for us. Then, we used the selector attribute to access the HTML tree of the result. From there on, we used a combination of CSS and XPath selectors to extract details from the HTML elements.

This scraper should produce 30 Article results:

Logging

Scrapfly logs data through Python's standard logging library, and log output can be controlled by retrieving the scrapfly logger:
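For example, to enable verbose output during development (assuming the SDK registers its logger under the "scrapfly" name):

```python
import logging

# print debug-level messages from the SDK to the console
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("scrapfly").setLevel(logging.DEBUG)
```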

Request Customization

All SDK requests are configured through ScrapeConfig object attributes. Most attributes mirror API parameters. For more information, see the request customization documentation.

Here's a quick demo example:
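Here's a sketch of a customized request; httpbin.dev is only an illustrative target and the header/form values are arbitrary:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/anything",
    method="POST",                        # HTTP method to use
    data={"query": "hello"},              # form data sent with the request
    headers={"X-Example": "customized"},  # extra request headers
    render_js=True,                       # render the page in a headless browser
    country="us",                         # proxy origin country
))
print(result.scrape_result["content"])
```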

All scrape parameters are configured through the ScrapeConfig object, and here's the example output from our demo above:

Proxy Pool Selection

The SDK has access to all Scrapfly proxy pools, which can be selected through the country and proxy_pool parameters:
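For example, to route a request through residential proxies instead of the default datacenter pool (the pool identifier and the httpbin.dev target below are illustrative):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/ip",
    proxy_pool="public_residential_pool",  # switch away from the default datacenter pool
))
print(result.scrape_result["content"])
```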

Proxy Geo Targeting

The country parameter indicates the origin country of the selected proxy. This can be used to scrape geographically locked content that's only available in certain countries:
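A quick sketch using an illustrative httpbin.dev endpoint that echoes the requesting IP address:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
# route the request through a proxy located in Canada
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/ip",
    country="ca",
))
print(result.scrape_result["content"])  # the returned IP should geolocate to Canada
```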

Expected output:

Session

Persistent scraping sessions are also available in the Python SDK through the ScrapeConfig.session parameter:
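A minimal sketch reusing the same hypothetical session name across two requests (the httpbin.dev cookie endpoints are illustrative):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
# both requests share the "demo" session, so cookies and the assigned
# proxy/fingerprint persist between them
client.scrape(ScrapeConfig(url="https://httpbin.dev/cookies/set?name=scrapfly", session="demo"))
result = client.scrape(ScrapeConfig(url="https://httpbin.dev/cookies", session="demo"))
print(result.scrape_result["content"])
```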

The above should produce results similar to:

Download File / Image

To download binary data such as images, PDFs or other files, send requests as usual; the resulting data can then be saved to a file using the helper ScrapflyClient.sink() method.

Let's take a look at an example of how to download a Jeppesen chart of VQPR airport in PDF format.
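Here's a rough sketch; the chart URL below is a hypothetical placeholder for the real document location, and the name argument passed to sink() is assumed to control the output file name:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
# hypothetical URL standing in for the actual chart PDF
result = client.scrape(ScrapeConfig(url="https://example.com/charts/VQPR.pdf"))
# sink() writes the binary response body to a local file
client.sink(result, name="VQPR-chart")
```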

This is the expected log output, which shows the file has been stored locally:

Cache

Caching is especially useful when working with the Scrapfly SDK. It's a great tool for developing and testing scrapers as it significantly speeds up repeated requests.

The cache is controlled through the cache and cache_ttl (expiration) parameters:
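For example (the target URL is illustrative):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/html",
    cache=True,      # serve repeated requests from Scrapfly's cache
    cache_ttl=3600,  # keep the cached copy for one hour (in seconds)
))
```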

Expected output:

Screenshot

The screenshot feature is available through the screenshots parameter, which takes a dictionary of screenshot selections where the key is the screenshot name and the value is the area to capture. The area can be a CSS selector or fullpage to capture everything.
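A sketch assuming screenshots require browser rendering (render_js); the target URL and the div.products selector are illustrative:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    render_js=True,  # screenshots need a browser to render the page
    screenshots={
        "everything": "fullpage",    # capture the entire page
        "products": "div.products",  # capture only the area matched by this selector
    },
))
```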

Alternatively, a quick shortcut .screenshot() method can be used to capture the whole page:
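As a rough sketch, assuming the helper accepts the target url and an optional screenshot name:

```python
from scrapfly import ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
# capture a full-page screenshot of the target URL (name keyword is an assumption)
client.screenshot(url="https://web-scraping.dev/products", name="products")
```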

Anti Scraping Protection (ASP)

Anti Scraping Protection bypass can be enabled through the asp parameter, which will automatically configure requests and bypass most anti-scraping protection systems:
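A minimal sketch against an illustrative target:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY API KEY")
result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    asp=True,  # enable Anti Scraping Protection bypass
))
print(result.scrape_result["content"][:100])
```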

Reporter

Scrapfly SDK comes out of the box with a configurable reporter. It allows collection of most Scrapfly events like status codes, errors, etc. through configurable callbacks. This feature is crucial for long-term web scraping project support and we highly recommend taking a look at it:

Scrapfly includes some pre-built reporters:

  • PrintReporter - prints to stdout
  • ChainReporter - base for chaining multiple reporting callbacks
  • NoopReporter - empty reporter (default)
  • SentryReporter - Sentry integration; when the Python Sentry SDK is installed and configured, the SDK will capture exceptions with enriched context data to your Sentry instance.

Prebuilt imports:
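A minimal wiring sketch; the exact module paths of the prebuilt reporters and the reporter keyword of ScrapflyClient are assumptions that may vary between SDK versions:

```python
from scrapfly import ScrapflyClient, ScrapeConfig
from scrapfly.reporter import ChainReporter, NoopReporter, PrintReporter  # prebuilt reporters

# attach a reporter when constructing the client;
# PrintReporter prints every reported event to stdout
client = ScrapflyClient(key="YOUR SCRAPFLY API KEY", reporter=PrintReporter())
client.scrape(ScrapeConfig(url="https://httpbin.dev/html"))
```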

Summary