Python SDK

Introduction

The Python SDK gives you a handy abstraction for interacting with the Scrapfly API. The full Python API specification is available here: https://scrapfly.github.io/python-scrapfly/docs/scrapfly

Many things are handled automatically for you, such as:

  • Automatic base64 encoding of JS snippets
  • Error handling
  • JSON encoding of the body if Content-Type: application/json (see the sketch below)
  • URL encoding of the body and setting Content-Type: application/x-www-form-urlencoded if no content type is specified
  • Converting binary responses into a Python BytesIO object
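
For example, the automatic body encoding means you can pass a plain Python dict and let the SDK serialize it based on the content type. A minimal sketch, assuming ScrapeConfig accepts method, headers and data keyword arguments (verify the exact parameter names against the full specification linked above):

json_body.py | Python
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

# Assumption: ScrapeConfig exposes method, headers and data parameters.
# With Content-Type: application/json, the dict passed as data is JSON-encoded
# by the SDK before the request is sent.
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://httpbin.org/anything',
    method='POST',
    headers={'content-type': 'application/json'},
    data={'hello': 'world'}
))

print(api_response.scrape_result['content'])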

Installation

The source code of the Python SDK is available on GitHub, and the scrapfly-sdk package is available through PyPI.

Pip install | Bash
pip install scrapfly-sdk

You can also install the extra package scrapfly-sdk[speedups] to get the benefits of brotli compression and msgpack serialization.

Pip install | Bash
pip install 'scrapfly-sdk[speedups]'

Scrape

Step by step guide

Follow the step-by-step guide with practical examples.

Start!
intro.py | Python
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://httpbin.org/anything'))

# Automatically retry errors marked as "retryable" and wait the recommended delay before retrying
api_response:ScrapeApiResponse = scrapfly.resilient_scrape(scrape_config=ScrapeConfig(url='https://httpbin.org/anything'))

# scrape result: content, iframes, response headers, response cookies, state, screenshots, ssl, dns, etc.
print(api_response.scrape_result)

# html content
print(api_response.scrape_result['content'])

# context of the scrape: session, webhook, asp, cache, debug
print(api_response.context)

# raw api result
print(api_response.content)

# True if the scrape responded with an HTTP status >= 200 and < 300
print(api_response.success)

# API status code /!\ Not the HTTP status code of the scrape itself!
print(api_response.status_code)

# Convert the API scrape result into a well-known requests.Response object
print(api_response.upstream_result_into_response())
        
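Since upstream_result_into_response() hands you back a standard requests.Response object, you can reuse any code already written against the requests library. A small usage sketch continuing from the api_response obtained above; the attributes used below are plain requests.Response attributes, nothing Scrapfly-specific:

requests_response.py | Python
# Convert the scrape result into a requests.Response and use it as usual
response = api_response.upstream_result_into_response()

print(response.status_code)   # upstream HTTP status code
print(response.headers)       # upstream response headers
print(response.text[:200])    # first 200 characters of the body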

Discover the full Python SDK specification: https://scrapfly.github.io/python-scrapfly/docs/scrapfly

Using Context

context.py | Python
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

with scrapfly as scraper:
    response:ScrapeApiResponse = scraper.scrape(ScrapeConfig(url='https://httpbin.org/anything', country='fr'))
        

Download Binary Response

download.py | Python
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://www.intel.com/content/www/us/en/ethernet-controllers/82599-10-gbe-controller-datasheet.html'))
scrapfly.sink(api_response) # you can specify path and name via named arguments
        
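If you want to control where the file is written, the same call takes the path and name named arguments mentioned in the comment above. A small sketch, assuming those are the exact keyword names:

download_named.py | Python
# Assumption: path and name are the keyword arguments referred to above
scrapfly.sink(api_response, path='downloads', name='82599-datasheet')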

Error Handling

Error handling is a big part of scraping, so we designed a system that reflects what went wrong so you can handle it properly from your scraper.

Errors, with their related codes and explanations, are documented and available here if you want to know more.

error.py | Python
error.message              # error message
error.code                 # error code
error.retry_delay          # recommended delay to wait before retrying, if retryable
error.retry_times          # recommended number of retries, if retryable
error.resource             # related resource: Proxy, ASP, Webhook, Spider
error.is_retryable         # True or False
error.documentation_url    # documentation explaining the error in detail
error.api_response         # API Response object
error.http_status_code     # HTTP status code
        

By default, if the upstream website that you scrape responds with a bad HTTP code, the SDK will raise UpstreamHttpClientError or UpstreamHttpServerError depending on the HTTP status code. You can disable this behavior by setting the raise_on_upstream_error attribute to false: ScrapeConfig(raise_on_upstream_error=False)
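
Putting this together, a typical pattern is to catch the upstream errors mentioned above and inspect the attributes listed in error.py. A minimal sketch, assuming UpstreamHttpClientError and UpstreamHttpServerError can be imported from the top-level scrapfly package (check the full specification for the exact import path):

error_handling.py | Python
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
# Assumption: the error classes are importable from the top-level package
from scrapfly import UpstreamHttpClientError, UpstreamHttpServerError

scrapfly = ScrapflyClient(key='')

try:
    api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
        url='https://httpbin.org/status/404'
    ))
except (UpstreamHttpClientError, UpstreamHttpServerError) as error:
    print(error.message)
    print(error.http_status_code)
    if error.is_retryable:
        print('retry in %s seconds' % error.retry_delay)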

Account

You can retrieve your account information:

account.py | Python
from scrapfly import ScrapflyClient

scrapfly = ScrapflyClient(key='')
print(scrapfly.client.account())
        

Keep Alive HTTP Session

Take advantage of Keep-Alive connections:

context.py | Python
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

with scrapfly as client:
    api_response:ScrapeApiResponse = client.scrape(scrape_config=ScrapeConfig(
        url='https://news.ycombinator.com/',
        render_js=True,
        screenshots={
            'main': 'fullpage'
        }
    ))

    # more scrape calls reusing the same HTTP connection
        

Concurrency out of the box

You can run scrapes concurrently out of the box. We use asyncio for that.

There are many other ways to achieve concurrency in Python that you can also check out.

First of all, ensure you have installed the concurrency extra:

Pip Install | Bash
pip install 'scrapfly-sdk[concurrency]'
concurrency.py | Python
import asyncio
import logging as logger
from sys import stdout

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

# enable debug logging on the scrapfly logger and send it to stdout
scrapfly_logger = logger.getLogger('scrapfly')
scrapfly_logger.setLevel(logger.DEBUG)
scrapfly_logger.addHandler(logger.StreamHandler(stdout))

scrapfly = ScrapflyClient(key='', max_concurrency=2)

async def main():
    results = await scrapfly.concurrent_scrape(scrape_configs=[
        ScrapeConfig(url='http://httpbin.org/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.org/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.org/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.org/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.org/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.org/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.org/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.org/anything', render_js=True)
    ])

    # results is local to main(), so print it here rather than at module level
    print(results)

asyncio.run(main())
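
concurrent_scrape returns the scrape results as a list, and each entry can be inspected like any other scrape response. A small follow-up sketch, assuming each item in the list is a ScrapeApiResponse as returned by scrape():

concurrency_results.py | Python
import asyncio

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='', max_concurrency=2)

async def main():
    results = await scrapfly.concurrent_scrape(scrape_configs=[
        ScrapeConfig(url='http://httpbin.org/anything'),
        ScrapeConfig(url='http://httpbin.org/anything')
    ])

    # Assumption: each item is a ScrapeApiResponse, as returned by scrape()
    for api_response in results:
        if api_response.success:
            print(api_response.scrape_result['content'][:100])
        else:
            print(api_response.status_code)

asyncio.run(main())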