Python SDK
The Python SDK gives you a handy abstraction for interacting with the Scrapfly API. It includes all Scrapfly features and many convenient shortcuts:
- Automatic base64 encoding of JS snippets
- Error handling
- Body JSON encoding if `Content-Type: application/json`
- Body URL encoding and `Content-Type: application/x-www-form-urlencoded` set if no content type is specified
- Conversion of binary responses into a Python `BytesIO` object
Step by Step Introduction
For a hands-on introduction see our Scrapfly SDK introduction page!
The full Python API specification is available here: https://scrapfly.github.io/python-scrapfly/docs/scrapfly
For more on using the Python SDK with Scrapfly, select the "Python SDK" option in the Scrapfly docs top bar.
Installation
The source code of the Python SDK is available on GitHub, and the scrapfly-sdk package is available through PyPI.
You can also install the extra package scrapfly[speedups] to get the benefits of brotli compression and msgpack serialization.
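For instance, using the PyPI package name given above (the extra is installed against that same package name):

```shell
pip install scrapfly-sdk

# optional: brotli compression and msgpack serialization speedups
pip install 'scrapfly-sdk[speedups]'
```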
Scrape
If you plan to scrape a protected website, make sure to enable Anti Scraping Protection (ASP).
Discover the full Python specification:
- Client : https://scrapfly.github.io/python-scrapfly/scrapfly/client.html
- ScrapeConfig : https://scrapfly.github.io/python-scrapfly/scrapfly/scrape_config.html
- API response : https://scrapfly.github.io/python-scrapfly/scrapfly/api_response.html
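A minimal scrape call, sketched from the client and ScrapeConfig references above (the API key and target URL are placeholders):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR-API-KEY")  # placeholder key
result = scrapfly.scrape(ScrapeConfig(
    url="https://httpbin.dev/html",  # placeholder target
    asp=True,  # Anti Scraping Protection, for protected websites
))
print(result.content)  # page body (see the API response reference above)
```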
Using Context
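The client can also be used as a context manager, so the underlying HTTP session is cleaned up automatically. A sketch, assuming ScrapflyClient implements the context-manager protocol as this section's title suggests:

```python
from scrapfly import ScrapflyClient, ScrapeConfig

# assumption: ScrapflyClient supports `with`, closing its HTTP session on exit
with ScrapflyClient(key="YOUR-API-KEY") as client:
    result = client.scrape(ScrapeConfig(url="https://httpbin.dev/html"))
    print(result.content)
```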
How to configure Scrape Query
You can check the ScrapeConfig implementation, available here, for all available options.
All parameters listed in this documentation can be used when you construct the scrape config object.
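For illustration, a ScrapeConfig built with a few commonly documented options (parameter names follow the ScrapeConfig reference linked above; treat this as a sketch, not an exhaustive list):

```python
from scrapfly import ScrapeConfig

config = ScrapeConfig(
    url="https://httpbin.dev/html",  # placeholder target
    country="us",      # proxy geolocation
    render_js=True,    # render the page in a headless browser
    asp=True,          # enable Anti Scraping Protection
    cache=True,        # serve from Scrapfly's cache when possible
)
```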
Download Binary Response
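As noted in the shortcuts list above, binary responses are converted into a BytesIO object. A sketch of saving such a response to disk (the URL and the field access are illustrative):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR-API-KEY")
result = scrapfly.scrape(ScrapeConfig(url="https://httpbin.dev/image/png"))

# for binary content types the SDK exposes the body as a BytesIO object
with open("image.png", "wb") as f:
    f.write(result.scrape_result["content"].getvalue())
```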
Error Handling
Error handling is a big part of scraping, so we designed an error system that reflects what went wrong so you can handle it properly in your scraper. Here is a simple snippet to handle errors on your own:
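The snippet below is a sketch using the error classes documented in this section; a placeholder URL stands in for a failing target.

```python
from scrapfly import (
    ScrapflyClient,
    ScrapeConfig,
    ScrapflyError,
    UpstreamHttpClientError,
    UpstreamHttpServerError,
    ApiHttpClientError,
)

scrapfly = ScrapflyClient(key="YOUR-API-KEY")

try:
    result = scrapfly.scrape(ScrapeConfig(url="https://httpbin.dev/status/404"))
except UpstreamHttpClientError as e:
    print("target website responded with a 4xx error:", e)
except UpstreamHttpServerError as e:
    print("target website responded with a 5xx error:", e)
except ApiHttpClientError as e:
    print("Scrapfly API client error:", e)
except ScrapflyError as e:
    # assumed common base class for Scrapfly SDK errors
    print("scrape failed:", e)
```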
Errors with their related codes and explanations are documented and available here, if you want to know more.
- scrapfly.UpstreamHttpClientError: the upstream website you scrape responded with an HTTP code >= 400 and < 500
- scrapfly.UpstreamHttpServerError: the upstream website you scrape responded with an HTTP code >= 500 and < 600
- scrapfly.ApiHttpClientError: the Scrapfly API responded with an HTTP code >= 400 and < 500
- scrapfly.ApiHttpServerError: the Scrapfly API responded with an HTTP code >= 500 and < 600
- scrapfly.ScrapflyProxyError: error related to the proxy
- scrapfly.ScrapflyThrottleError: error related to throttling
- scrapfly.ScrapflyAspError: error related to ASP
- scrapfly.ScrapflyScheduleError: error related to the scheduler
- scrapfly.ScrapflyWebhookError: error related to webhooks
- scrapfly.ScrapflySessionError: error related to sessions
- scrapfly.TooManyConcurrentRequest: the maximum number of concurrent requests allowed by your plan was reached
- scrapfly.QuotaLimitReached: the quota limit of your plan or project was reached
By default, if the upstream website that you scrape responds with a bad HTTP code, the SDK will raise UpstreamHttpClientError or UpstreamHttpServerError depending on the HTTP status code.
You can disable this behavior by setting the raise_on_upstream_error attribute to False: ScrapeConfig(raise_on_upstream_error=False)
If you want to report errors to your app for monitoring or tracking purposes on your side, check out the reporter feature.
Account
You can retrieve your account information:
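A sketch, assuming the client exposes an account() method as in the client reference linked above:

```python
from scrapfly import ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR-API-KEY")
account_info = scrapfly.account()  # assumed method returning account/subscription details
print(account_info)
```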
Keep Alive HTTP Session
Take advantage of a Keep-Alive connection.
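A sketch: reusing a single client keeps the underlying HTTP session open, so consecutive calls skip connection setup (URLs are placeholders):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

# one client instance = one underlying HTTP session reused across calls
with ScrapflyClient(key="YOUR-API-KEY") as client:
    for page in range(1, 4):
        result = client.scrape(ScrapeConfig(url=f"https://httpbin.dev/html?page={page}"))
        print(result.content[:80])  # first characters of each page body
```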
Concurrency out of the box
You can run scrapes concurrently out of the box. We use asyncio for that.
In Python, there are many other ways to achieve concurrency that you can also check.
First of all, ensure you have installed the concurrency module.
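A sketch, assuming the client exposes the concurrent_scrape coroutine from the client reference linked above (the max_concurrency argument and the URLs are illustrative):

```python
import asyncio

from scrapfly import ScrapflyClient, ScrapeConfig

async def main():
    # max_concurrency is an assumed client option; check the client reference
    client = ScrapflyClient(key="YOUR-API-KEY", max_concurrency=2)
    configs = [
        ScrapeConfig(url=f"https://httpbin.dev/html?page={i}")  # placeholder targets
        for i in range(4)
    ]
    # assumed async generator yielding results as each scrape completes
    async for result in client.concurrent_scrape(scrape_configs=configs):
        print(result.content[:80])

asyncio.run(main())
```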
Webhook Server
The Scrapfly Python SDK offers a built-in webhook server feature, allowing developers to easily set up and handle webhooks for receiving notifications and data from Scrapfly services. This documentation provides an overview of the create_server function within the SDK, along with an example of its usage.
Example Usage
To expose the local server to the internet we use ngrok; you need a free ngrok account to run the example.
Below is an example demonstrating how to use the create_server function to set up a webhook server:
- Install dependencies:
pip install ngrok flask scrapfly
- Export your ngrok auth token in your terminal:
export NGROK_AUTHTOKEN=MY_NGROK_TOKEN
- Create a webhook on your Scrapfly dashboard with any endpoint (for example from https://webhook.site). Since the ngrok endpoint is only known at runtime and is random on each run, we will edit the endpoint once ngrok advertises it in a later step.
- Retrieve your webhook signing secret
- Run the command:
python webhook_server.py --signing-secret=MY_SIGNING_SECRET
- Once the server is running, copy the exposed URL advertised below the log line "====== LISTENING ON ======"
- Edit your webhook URL and replace it with the advertised URL
With the ngrok free plan, a new random tunnel URL is assigned on each server start, so you need to edit the webhook URL every time.
In this example, the webhook server is set up using create_server, with a callback function webhook_callback defined to handle incoming webhook payloads. The signing secret is provided as a command-line argument, and ngrok is used for exposing the local server to the internet for testing.
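A minimal sketch of what webhook_server.py could look like; the create_server import path and signature are assumptions to verify against the SDK reference, and the ngrok calls follow the ngrok Python package:

```python
# webhook_server.py -- a hedged sketch, not the SDK's reference implementation
import argparse

import ngrok  # pip install ngrok

# assumed import path; verify against the SDK reference
from scrapfly.webhook import create_server


def webhook_callback(request):
    # called for each verified webhook delivery; inspect the payload here
    print("received webhook:", request.json)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--signing-secret", required=True)
    args = parser.parse_args()

    port = 5000
    # reads NGROK_AUTHTOKEN from the environment, as exported in step 2
    listener = ngrok.forward(port, authtoken_from_env=True)
    print("====== LISTENING ON ======")
    print(listener.url())

    # assumed signature: create_server verifies payload signatures with the
    # signing secret and dispatches valid payloads to the callback (Flask app)
    server = create_server(callback=webhook_callback, signing_secrets=[args.signing_secret])
    server.run(port=port)
```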
External Integration
LlamaIndex
LlamaIndex, formerly known as GPT Index, is a data framework designed to facilitate the connection between large language models (LLMs) and a wide variety of data sources. It provides tools to effectively ingest, index, and query data within these models.
Integrate Scrapfly with LlamaIndex
Langchain
LangChain is a robust framework designed for developing applications powered by language models. It focuses on enabling the creation of applications that can leverage the capabilities of large language models (LLMs) for a variety of use cases.
Integrate Scrapfly with Langchain