Python SDK
Introduction
The Python SDK gives you a handy abstraction to interact with the Scrapfly API. The full Python API specification is available here: https://scrapfly.github.io/python-scrapfly/docs/scrapfly
Many details are handled for you automatically, such as:
- Automatic base64 encoding of JS snippets
- Error handling
- JSON encoding of the body if Content-Type: application/json
- URL encoding of the body and setting Content-Type: application/x-www-form-urlencoded if no content type is specified
- Converting binary responses into a Python BytesIO object
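As a minimal sketch of what this means in practice (the data, headers, js and render_js parameters follow the ScrapeConfig options; see the full specification linked above):
from scrapfly import ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key='')
# The dict body is JSON-encoded because of the application/json content type,
# and the JS snippet is base64-encoded automatically before the request is sent.
api_response = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://httpbin.dev/anything',
    method='POST',
    data={'hello': 'world'},
    headers={'content-type': 'application/json'},
    render_js=True,
    js='document.title'
))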
Installation
The source code of the Python SDK is available on GitHub, and the scrapfly-sdk package is available through PyPI.
pip install 'scrapfly-sdk'
You can also install the extra package scrapfly-sdk[speedups] to get brotli compression and msgpack serialization benefits.
pip install 'scrapfly-sdk[speedups]'
Scrape
Step by step guide
Follow the step-by-step guide with practical examples covering Scrapfly's major features.
If you plan to scrape a protected website, make sure to enable Anti Scraping Protection.
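For instance, a minimal sketch of enabling it (assuming the asp flag of ScrapeConfig; the target URL is hypothetical):
from scrapfly import ScrapeConfig
# asp=True enables Anti Scraping Protection for this scrape (hypothetical target URL)
config = ScrapeConfig(url='https://example.com/protected-page', asp=True)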
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))
# Automatically retries errors marked "retryable", waiting the recommended delay before retrying
api_response:ScrapeApiResponse = scrapfly.resilient_scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))
# Automatically retries based on the response status code
api_response:ScrapeApiResponse = scrapfly.resilient_scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/status/500'), retry_on_status_code=[500])
# Scrape result: content, iframes, response headers, response cookies, states, screenshots, ssl, dns, etc.
print(api_response.scrape_result)
# html content
print(api_response.scrape_result['content'])
# Context of the scrape: session, webhook, asp, cache, debug
print(api_response.context)
# raw api result
print(api_response.content)
# True if the scrape responded with a 2xx HTTP status (>= 200, < 300)
print(api_response.success)
# Scrapfly API status code /!\ not the status code of the scraped website!
print(api_response.status_code)
# Upstream website status code
print(api_response.upstream_status_code)
# Convert the API scrape result into the well-known requests.Response object
print(api_response.upstream_result_into_response())
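Once converted, you can work with it like any requests.Response, for example (a short usage sketch):
response = api_response.upstream_result_into_response()
print(response.status_code)  # upstream status code
print(response.headers)      # upstream response headers
print(response.text)         # upstream body as text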
Discover the full Python specification:
- Client : https://scrapfly.github.io/python-scrapfly/scrapfly/client.html
- ScrapeConfig : https://scrapfly.github.io/python-scrapfly/scrapfly/scrape_config.html
- API response : https://scrapfly.github.io/python-scrapfly/scrapfly/api_response.html
Using Context
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
with scrapfly as scraper:
    response:ScrapeApiResponse = scraper.scrape(ScrapeConfig(url='https://httpbin.dev/anything', country='fr'))
How to configure Scrape Query
You can check the ScrapeConfig implementation to see all available options, available here.
All parameters listed in this documentation can be used when you construct the scrape config object.
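As a minimal sketch (country and render_js are used elsewhere on this page; cache and headers are assumed to follow the documented ScrapeConfig options, check the implementation linked above for the exhaustive list):
from scrapfly import ScrapeConfig
# a ScrapeConfig combining a few common options
config = ScrapeConfig(
    url='https://httpbin.dev/anything',
    country='fr',       # proxy country
    render_js=True,     # enable a headless browser for JavaScript rendering
    cache=True,         # enable the cache layer
    headers={'referer': 'https://example.com'}
)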
Download Binary Response
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://www.intel.com/content/www/us/en/ethernet-controllers/82599-10-gbe-controller-datasheet.html'))
scrapfly.sink(api_response) # you can specify path and name via named arguments
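For example, a minimal sketch using those named arguments (the directory and file name are hypothetical):
# save the binary result under downloads/ with the file name 'datasheet'
scrapfly.sink(api_response, path='downloads', name='datasheet')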
Error Handling
Error handling is a big part of scraping, so we designed a system that reflects what went wrong so you can handle it properly in your scraper. Here is a simple snippet to handle errors on your own:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse, UpstreamHttpClientError, \
ScrapflyScrapeError, UpstreamHttpServerError
scrapfly = ScrapflyClient(key='')
try:
    api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
        url='https://httpbin.dev/status/404',
    ))
except UpstreamHttpClientError as e: # upstream HTTP status >= 400 < 500
    print(e.api_response.scrape_result['error'])
    raise e
except UpstreamHttpServerError as e: # upstream HTTP status >= 500
    print(e.api_response.scrape_result['error'])
    raise e
# UpstreamHttpError can be used to catch all errors related to the upstream website
except ScrapflyScrapeError as e:
    print(e.message)
    print(e.code)
    raise e
Errors with their related codes and explanations are documented and available here, if you want to know more.
- scrapfly.UpstreamHttpClientError: the upstream website that you scrape responds with an HTTP code >= 400 < 500
- scrapfly.UpstreamHttpServerError: the upstream website that you scrape responds with an HTTP code >= 500 < 600
- scrapfly.ApiHttpClientError: the Scrapfly API responds with an HTTP code >= 400 < 500
- scrapfly.ApiHttpServerError: the Scrapfly API responds with an HTTP code >= 500 < 600
- scrapfly.ScrapflyProxyError: error related to Proxy
- scrapfly.ScrapflyThrottleError: error related to Throttle
- scrapfly.ScrapflyAspError: error related to ASP
- scrapfly.ScrapflyScheduleError: error related to Schedule
- scrapfly.ScrapflyWebhookError: error related to Webhook
- scrapfly.ScrapflySessionError: error related to Session
- scrapfly.TooManyConcurrentRequest: the maximum number of concurrent requests allowed by your plan was reached
- scrapfly.QuotaLimitReached: the quota limit of your plan or project was reached
error.message # Message
error.code # Error code
error.retry_delay # Recommended time to wait before retrying, if retryable
error.retry_times # Recommended number of retries, if retryable
error.resource # Related resource: Proxy, ASP, Webhook, Spider
error.is_retryable # True or False
error.documentation_url # Documentation explaining the error in detail
error.api_response # API response object
error.http_status_code # HTTP status code
By default, if the upstream website that you scrape responds with a bad HTTP code, the SDK raises UpstreamHttpClientError or UpstreamHttpServerError depending on the HTTP status code.
You can disable this behavior by setting the raise_on_upstream_error attribute to false: ScrapeConfig(raise_on_upstream_error=False)
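A minimal sketch of handling the upstream status yourself with this option, using the upstream_status_code attribute shown earlier:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
# no exception is raised for upstream errors; check the status manually instead
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://httpbin.dev/status/404',
    raise_on_upstream_error=False
))
if api_response.upstream_status_code >= 400:
    print('upstream error:', api_response.upstream_status_code)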
If you want to report errors to your app for monitoring / tracking purposes on your side, check out the reporter feature.
Account
You can retrieve account information
from scrapfly import ScrapflyClient
scrapfly = ScrapflyClient(key='')
print(scrapfly.account())
Keep Alive HTTP Session
Take advantage of the Keep-Alive connection.
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
with scrapfly as client:
    api_response:ScrapeApiResponse = client.scrape(scrape_config=ScrapeConfig(
        url='https://news.ycombinator.com/',
        render_js=True,
        screenshots={
            'main': 'fullpage'
        }
    ))
    # more scrape calls
Concurrency out of the box
You can run scrapes concurrently out of the box. We use asyncio for that.
In Python, there are many ways to achieve concurrency; asyncio is the one used by the SDK.
First of all, ensure you have installed the concurrency module:
pip install 'scrapfly-sdk[concurrency]'
import asyncio
import logging as logger
from sys import stdout
scrapfly_logger = logger.getLogger('scrapfly')
scrapfly_logger.setLevel(logger.DEBUG)
scrapfly_logger.addHandler(logger.StreamHandler(stdout))
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='', max_concurrency=2)
async def main():
    targets = [
        ScrapeConfig(url='http://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='http://httpbin.dev/anything', render_js=True)
    ]
    async for result in scrapfly.concurrent_scrape(scrape_configs=targets):
        print(result)
asyncio.run(main())