Web Scraping With Scrapfly
If you don't want to start coding right away and would rather discover the Scrapfly API without opening your text editor, you can use our Visual API Player, a playground for interacting with our API.
You can use our API with any language; check our Getting Started guide to start using the HTTP API.
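For example, a direct call to the HTTP API with the requests library could look like the following sketch. The endpoint and the key/url parameters match the API URLs visible in the log output later in this guide; the API key and target URL are placeholders, and the Getting Started guide has the full parameter list.
# Minimal sketch of calling the Scrapfly HTTP API directly, without the SDK.
# Assumes the JSON response shape shown in the log output later in this guide.
import requests

response = requests.get(
    'https://api.scrapfly.io/scrape',
    params={
        'key': 'YOUR_API_KEY',                  # placeholder: your Scrapfly API key
        'url': 'https://httpbin.dev/anything',  # placeholder: the URL to scrape
    },
)

data = response.json()
print(data['result']['content'])  # upstream response body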
Quickstart with our Python SDK
We will scrape the well-known website Hacker News and extract articles with their metadata, using the Scrapfly Python SDK. All the following code examples are available on GitHub: https://github.com/scrapfly/python-scrapfly/tree/master/examples
The source code of the Python SDK is available on GitHub, and the scrapfly-sdk package is available through PyPI.
pip install 'scrapfly-sdk'
You can install extra packages:
- Concurrency module:
pip install 'scrapfly-sdk[concurrency]'
- Scrapy module:
pip install 'scrapfly-sdk[scrapy]'
- Performance module (brotli compression and msgpack serialization):
pip install 'scrapfly-sdk[speedups]'
- All modules:
pip install 'scrapfly-sdk[all]'
First Scrape
We will scrape Hacker News to retrieve all articles of the day. To extract data from the HTML content, we will use BeautifulSoup.
pip install beautifulsoup4
import re
from contextlib import suppress
from dataclasses import dataclass
from typing import Optional
from pprint import pprint
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from bs4 import BeautifulSoup
scrapfly = ScrapflyClient(key='')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://news.ycombinator.com/'))
soup = BeautifulSoup(api_response.scrape_result['content'], "html.parser")
@dataclass
class Article:
    title:Optional[str]=None
    rank:Optional[int]=None
    link:Optional[str]=None
    user:Optional[str]=None
    score:Optional[int]=None
    comments:Optional[int]=None

    def is_valid(self) -> bool:
        if self.title is None or self.link is None:
            return False
        return True

articles = []

# each article is a tr.athing row; rank, title and link are in this row
for item in soup.find("table", {"class": "itemlist"}).find_all("tr", {"class": "athing"}):
    article = Article()
    article.rank = int(item.find("span", {"class": "rank"}).get_text().replace('.', ''))
    article.link = item.find("a", {"class": "storylink"})['href']
    article.title = item.find("a", {"class": "storylink"}).get_text()

    # score, user and comment count live in the following metadata row
    metadata = item.next_sibling()[1]
    score = metadata.find("span", {"class": "score"})

    if score is not None:
        with suppress(IndexError):
            article.score = int(re.findall(r"\d+", score.get_text())[0])

    user = metadata.find("a", {"class": {"hnuser"}})

    if user is not None:
        article.user = user.get_text()

    with suppress(IndexError):
        article.comments = int(re.findall(r"(\d+)\scomment?", metadata.get_text())[0])

    if article.is_valid():
        articles.append(article)

pprint(articles)
If you get the error ModuleNotFoundError: No module named 'dataclasses', it is because this example uses Python dataclasses, which require Python 3.7 or later. Replace the dataclass with a regular class if you need to support Python <= 3.6 (see the sketch after the sample output below).
This is what you should get after running the script above:
❯ python hackernews.py
INFO | scrapfly.client:scrape:121 - --> GET Scrapping https://news.ycombinator.com/
INFO | scrapfly.client:scrape:127 - <-- [200 OK] https://news.ycombinator.com/ | 0.97s
[Article(title='Apple unveils M1, its first system-on-a-chip for portable Mac computers', rank=1, link='https://9to5mac.com/2020/11/10/apple-unveils-m1-its-first-system-on-a-chip-for-portable-mac-computers/', user='runesoerensen', score=1035, comments=1102),
Article(title='How to get root on Ubuntu 20.04 by pretending nobody’s /home', rank=2, link='https://securitylab.github.com/research/Ubuntu-gdm3-accountsservice-LPE', user='generalizations', score=339, comments=88),
Article(title="On Apple's Piss-Poor Documentation", rank=3, link='https://www.caseyliss.com/2020/11/10/on-apples-pisspoor-documentation', user='ingve', score=994, comments=298),
Article(title='The election of the doge', rank=4, link='https://generalist.academy/2020/11/06/the-election-of-the-doge/', user='flannery', score=31, comments=12),
Article(title='Introducing the next generation of Mac', rank=5, link='https://www.apple.com/newsroom/2020/11/introducing-the-next-generation-of-mac/', user='redm', score=515, comments=757),
Article(title='.NET 5.0', rank=6, link='https://devblogs.microsoft.com/dotnet/announcing-net-5-0/', user='benaadams', score=652, comments=345),
Article(title='Using Rust to Scale Elixir for 11M Concurrent Users', rank=7, link='https://blog.discord.com/using-rust-to-scale-elixir-for-11-million-concurrent-users-c6f19fc029d3', user='aloukissas', score=123, comments=16),
Article(title='The Lonely Work of Moderating Hacker News (2019)', rank=8, link='https://www.newyorker.com/news/letter-from-silicon-valley/the-lonely-work-of-moderating-hacker-news', user='bluu00', score=486, comments=213),
Article(title='Origins of the youtube-dl project', rank=9, link='https://rg3.name/202011071352.html', user='rg3', score=426, comments=39),
Article(title='The intriguing maps that reveal alternate histories', rank=10, link='https://www.bbc.com/future/article/20201104-the-intriguing-maps-that-reveal-alternate-histories', user='chwolfe', score=14, comments=None),
Article(title='Getting a biometric security key right', rank=11, link='https://www.yubico.com/blog/getting-a-biometric-security-key-right/', user='tiziano88', score=90, comments=30),
Article(title='Graduate Level Math Notes (Harvard)', rank=12, link='https://github.com/Dongryul-Kim/harvard_notes', user='E-Reverance', score=14, comments=1),
Article(title='Eleven Years of Go', rank=13, link='https://blog.golang.org/11years', user='mfrw', score=300, comments=119),
Article(title='Burnout can exacerbate work stress, further promoting a vicious circle', rank=14, link='https://www.uni-mainz.de/presse/aktuell/12451_ENG_HTML.php', user='rustoo', score=240, comments=158),
Article(title='Firejail – Sandbox Linux Applications', rank=15, link='https://github.com/netblue30/firejail', user='thushanfernando', score=111, comments=33),
Article(title='Zoom lied to users about end-to-end encryption for years, FTC says', rank=16, link='https://arstechnica.com/tech-policy/2020/11/zoom-lied-to-users-about-end-to-end-encryption-for-years-ftc-says/', user='eddieoz', score=1402, comments=353),
Article(title='Artificial intelligence may be making people buy things', rank=17, link='https://www.bbc.co.uk/news/technology-54522442', user='edward', score=24, comments=5),
Article(title='Vigil: The eternal morally vigilant programming language', rank=18, link='https://github.com/munificent/vigil', user='headalgorithm', score=113, comments=19),
Article(title='My experience as a poll worker in Pennsylvania', rank=19, link='https://portal.drewdevault.com/2020/11/10/2020-Election-worker.gmi', user='catacombs', score=301, comments=256),
Article(title='Puppetaria: Accessibility-First Puppeteer Scripts', rank=20, link='https://developers.google.com/web/updates/2020/11/puppetaria', user='feross', score=7, comments=None),
Article(title='InfluxDB is betting on Rust and Apache Arrow for next-gen data store', rank=21, link='https://www.influxdata.com/blog/announcing-influxdb-iox/', user='mhall119', score=162, comments=69),
Article(title='Substack (YC W18) is hiring to build a better business model for writing', rank=22, link='https://substack.com/jobs', user=None, score=None, comments=None),
Article(title='One pollster’s explanation for why the polls got it wrong', rank=23, link='https://www.vox.com/policy-and-politics/2020/11/10/21551766/election-polls-results-wrong-david-shor', user='satchet', score=135, comments=379),
Article(title='Body found in Canada identified as neo-nazi spam king', rank=24, link='https://krebsonsecurity.com/2020/11/body-found-in-canada-identified-as-neo-nazi-spam-king/', user='parsecs', score=99, comments=51),
Article(title='Times Change: Inside the New York Times’ Heated Reckoning With Itself', rank=25, link='https://nymag.com/intelligencer/2020/11/inside-the-new-york-times-heated-reckoning-with-itself.html', user='magda_wang', score=72, comments=94),
Article(title='eBPF – The Future of Networking and Security', rank=26, link='https://cilium.io/blog/2020/11/10/ebpf-future-of-networking/', user='genbit', score=158, comments=27),
Article(title='Separating User Database and Authorization from Apps with Istio and FusionAuth', rank=27, link='https://reachablegames.com/oidc-fusionauth-istio/', user='mooreds', score=24, comments=1),
Article(title="Oh, the irony: iFixit tools in Apple's lab", rank=28, link='https://twitter.com/iFixit/status/1326264991192764416', user='miles', score=8, comments=1),
Article(title='A Cyrillic orthography for the Polish language', rank=29, link='http://steen.free.fr/cyrpol/index.html', user='keiferski', score=42, comments=20),
Article(title='Indistinguishability Obfuscation from Well-Founded Assumptions', rank=30, link='https://www.quantamagazine.org/computer-scientists-achieve-crown-jewel-of-cryptography-20201110/', user='theafh', score=161, comments=82)]
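As mentioned above, on Python <= 3.6 you can swap the Article dataclass for a regular class. A minimal sketch of that replacement (only the class definition changes; the parsing loop stays the same):
# Plain-class equivalent of the Article dataclass, for Python <= 3.6.
from typing import Optional

class Article:
    def __init__(self, title:Optional[str]=None, rank:Optional[int]=None,
                 link:Optional[str]=None, user:Optional[str]=None,
                 score:Optional[int]=None, comments:Optional[int]=None):
        self.title = title
        self.rank = rank
        self.link = link
        self.user = user
        self.score = score
        self.comments = comments

    def is_valid(self) -> bool:
        return self.title is not None and self.link is not None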
Logging
Customize logging configuration
import logging as logger
from sys import stdout

scrapfly_logger = logger.getLogger('scrapfly')
scrapfly_logger.setLevel(logger.DEBUG)
scrapfly_logger.addHandler(logger.StreamHandler(stdout))  # attach the handler so log records are written to stdout
Request Customization
More information about request customization
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
print('===== HEAD request =====')
api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url='https://httpbin.dev/anything',
    method='HEAD'
))
# HEAD requests have no body, so the API response is truncated to the strict minimum (headers, status, reason of upstream)
print(api_response.result)
print('======== Default Body Url Encode ========')
api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url='https://httpbin.dev/anything',
    method='POST',
    data={'hello': 'world'},
    headers={'X-Scrapfly': 'Yes'}
))
print(api_response.scrape_result['content'])
print('======== Json content-type ======== ')
api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url='https://httpbin.dev/anything',
    method='POST',
    data={'hello': 'world'},
    headers={'X-Scrapfly': 'Yes', 'content-type': 'application/json'}
))
print(api_response.scrape_result['content'])
This is what you should get after running the script above:
❯ python custom.py
===== HEAD request =====
INFO | scrapfly.client:scrape:129 - --> HEAD Scrapping https://httpbin.dev/anything
INFO | scrapfly.client:scrape:137 - <-- [200 OK] https://api.scrapfly.io/scrape?key=dfc249f758da45b2be8a182a3fc454d9&url=https%3A%2F%2Fhttpbin.dev%2Fanything&country=DE | 0s
{'result': {'request_headers': {}, 'response_headers': {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': '*', 'Access-Control-Allow-Methods': '*', 'Access-Control-Allow-Origin': '*', 'Charset': 'utf-8', 'Content-Encoding': 'br', 'Content-Length': '762', 'Content-Type': 'application/json', 'Date': 'Wed, 11 Nov 2020 12:15:00 GMT', 'Server': 'gunicorn/19.9.0', 'Strict-Transport-Security': 'max-age=360; includeSubDomains; preload', 'Vary': 'Accept-Encoding', 'X-Frame-Options': 'DENY', 'X-Request-Id': 'b9df6c22-c70c-4f0c-9c30-d48f7324eaeb', 'X-Scrapfly-Response-Time': '1036.63', 'X-Xss-Protection': '1; mode=block'}, 'status_code': 200, 'reason': 'OK', 'format': 'text', 'content': ''}, 'context': {}, 'config': {'cookies': {}, 'headers': {}, 'url': 'https://httpbin.dev/anything', 'retry': True, 'method': 'HEAD', 'country': 'DE', 'render_js': False, 'cache': False, 'cache_clear': False, 'asp': False, 'session': None, 'debug': False, 'cache_ttl': None, 'tags': None, 'correlation_id': None, 'body': None, 'data': None, 'graphql': None, 'js': None, 'rendering_wait': None, 'raise_on_upstream_error': True, 'screenshots': None, 'key': None, 'dns': False, 'ssl': False}}
======== Default Body Url Encode ========
INFO | scrapfly.client:scrape:129 - --> POST Scrapping https://httpbin.dev/anything
INFO | scrapfly.client:scrape:144 - <-- [200 OK] https://httpbin.dev/anything | 1.04s
{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"Cache-Control": "max-age=0",
"Content-Length": "0",
"Content-Type": "application/x-www-form-urlencoded",
"Dnt": "1",
"Host": "httpbin.dev",
"Sec-Gpc": "1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-5fabd5c6-70e321736e4b5ce91b3cdc34",
"X-Scrapfly": "Yes"
},
"json": null,
"method": "POST",
"origin": "185.192.36.197",
"url": "https://httpbin.dev/anything"
}
======== Json content-type ========
INFO | scrapfly.client:scrape:129 - --> POST Scrapping https://httpbin.dev/anything
INFO | scrapfly.client:scrape:144 - <-- [200 OK] https://httpbin.dev/anything | 2.41s
{
"args": {},
"data": "{\"hello\": \"world\"}",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"Cache-Control": "max-age=0",
"Content-Length": "18",
"Content-Type": "application/json",
"Dnt": "1",
"Host": "httpbin.dev",
"Sec-Gpc": "1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-5fabd5c9-3ebc7c3b2306a76c05230200",
"X-Scrapfly": "Yes"
},
"json": {
"hello": "world"
},
"method": "POST",
"origin": "5.180.82.117",
"url": "https://httpbin.dev/anything"
}
Proxy Pool Selection
You can select the proxy pool you want. If you want to learn more about proxy networks, you can check our blog post about it.
from pprint import pprint
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
# Using datacenter network
api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(url='https://httpbin.dev/anything', proxy_pool='public_datacenter_pool'))
# Using residential network
api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(url='https://httpbin.dev/anything', proxy_pool='public_residential_pool'))
Session
More information about session
Use a session to keep scrapes coherent (same IP, cookies, referer).
from pprint import pprint
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(url='https://httpbin.dev/cookies/set?k1=v1&k2=v2&k3=v3', session='test'))
print("=== Initiated Session ===")
pprint(api_response.context['session'])
api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(url='https://httpbin.dev/anything', session='test'))
print("=== Request headers ===")
pprint(api_response.scrape_result['request_headers'])
print("=== Response === ")
print(api_response.scrape_result['content'])
This is what you should get after running the script above:
❯ python session.py
INFO | scrapfly.client:scrape:129 - --> GET Scrapping https://httpbin.dev/cookies/set?k1=v1&k2=v2&k3=v3
INFO | scrapfly.client:scrape:135 - <-- [200 OK] https://httpbin.dev/cookies/set?k1=v1&k2=v2&k3=v3 | 0.56s
=== Initiated Session ===
{'cookie_jar': [{'comment': None,
'domain': 'httpbin.dev',
'expires': None,
'http_only': False,
'max_age': None,
'name': 'k1',
'path': '/',
'secure': False,
'size': 4,
'value': 'v1',
'version': None},
{'comment': None,
'domain': 'httpbin.dev',
'expires': None,
'http_only': False,
'max_age': None,
'name': 'k2',
'path': '/',
'secure': False,
'size': 4,
'value': 'v2',
'version': None},
{'comment': None,
'domain': 'httpbin.dev',
'expires': None,
'http_only': False,
'max_age': None,
'name': 'k3',
'path': '/',
'secure': False,
'size': 4,
'value': 'v3',
'version': None}],
'correlation_id': 'default',
'created_at': datetime.datetime(2020, 11, 11, 8, 55, 1),
'expire_at': datetime.datetime(2020, 11, 18, 8, 59, 19),
'identity': '1d39102a8372e6cf765f747fba879324607bf034',
'last_used_at': datetime.datetime(2020, 11, 11, 8, 59, 19),
'name': 'test',
'referer': 'https://httpbin.dev/cookies'}
INFO | scrapfly.client:scrape:129 - --> GET Scrapping https://httpbin.dev/anything
INFO | scrapfly.client:scrape:135 - <-- [200 OK] https://httpbin.dev/anything | 0.49s
=== Request headers ===
{'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'accept-encoding': 'gzip, deflate', 'accept-language': 'en-US,en;q=0.9', 'sec-gpc': '1', 'cache-control': 'max-age=0', 'dnt': '1', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36', 'referer': 'https://httpbin.dev/cookies', 'cookie': 'k1=v1; k2=v2; k3=v3'}
{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"Cache-Control": "max-age=0",
"Cookie": "k1=v1; k2=v2; k3=v3",
"Dnt": "1",
"Host": "httpbin.dev",
"Referer": "https://httpbin.dev/cookies",
"Sec-Gpc": "1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-5faba7e8-4bc059c944854909180ebb64"
},
"json": null,
"method": "GET",
"origin": "5.181.127.128",
"url": "https://httpbin.dev/anything"
}
Download File / Image
In this example we will download a Jeppesen chart of the VQPR airport in PDF format.
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
scrape_config = ScrapeConfig(url='https://vau.aero/navdb/chart/VQPR.pdf')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config)
print(api_response.scrape_result['content']) # BytesIO Object
scrapfly.sink(api_response) # create file
This is what you should get after running the script above:
❯ python download.py
INFO | scrapfly.client:scrape:129 - --> GET Scrapping https://vau.aero/navdb/chart/VQPR.pdf
INFO | scrapfly.client:scrape:135 - <-- [200 OK] https://vau.aero/navdb/chart/VQPR.pdf | 2.18s
<_io.BytesIO object at 0x7f892fdc0b30>
INFO | scrapfly.client:sink:232 - file VQPR.pdf created
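If you want to control the output path and filename yourself instead of letting sink() create the file, here is a minimal sketch that writes the BytesIO content manually (the target filename is just an example):
# Write the downloaded binary content yourself instead of using scrapfly.sink().
# scrape_result['content'] is a BytesIO object for binary responses, as shown above.
with open('VQPR.pdf', 'wb') as f:
    f.write(api_response.scrape_result['content'].getvalue())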
Cache
More information about cache
Caching is a powerful feature that avoids calling the upstream website on every request. Just set a TTL: the cache entry expires automatically and the upstream website is transparently scraped again on the next call. Cached results are also returned much faster than a regular scrape.
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
scrape_config = ScrapeConfig(url='https://news.ycombinator.com/', cache=True, cache_ttl=500)
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config)
print(api_response.context['cache'])
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config)
print(api_response.context['cache'])
This is what you should get after running the script above:
❯ python cache_hackernews.py
INFO | scrapfly.client:scrape:129 - --> GET Scrapping https://news.ycombinator.com/
INFO | scrapfly.client:scrape:135 - <-- [200 OK] https://news.ycombinator.com/ | 1.3s
{'state': 'MISS', 'entry': {'created_at': datetime.datetime(2020, 11, 11, 7, 48, 40), 'last_used_at': datetime.datetime(2020, 11, 11, 7, 48, 40), 'fingerprint': '85b37058ce05b829196617d43940024f8616d863', 'ttl': 500, 'expires_at': datetime.datetime(2020, 11, 11, 7, 57)}}
INFO | scrapfly.client:scrape:129 - --> GET Scrapping https://news.ycombinator.com/
INFO | scrapfly.client:scrape:135 - <-- [200 OK] https://news.ycombinator.com/ | 0.27s
{'state': 'HIT', 'entry': {'created_at': datetime.datetime(2020, 11, 11, 7, 48, 40), 'last_used_at': datetime.datetime(2020, 11, 11, 7, 48, 41), 'fingerprint': '85b37058ce05b829196617d43940024f8616d863', 'ttl': 500, 'expires_at': datetime.datetime(2020, 11, 11, 7, 57)}}
Proxy Geo Targeting
More information about request customization
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://api.myip.com',
    country="nl"
))
print(api_response.context['proxy'])
print(api_response.scrape_result['content'])
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://api.myip.com',
    country="de"
))
print(api_response.context['proxy'])
print(api_response.scrape_result['content'])
This is what you should get after running the script above:
❯ python proxy_geo.py
INFO | scrapfly.client:scrape:129 - --> GET Scrapping https://api.myip.com
INFO | scrapfly.client:scrape:135 - <-- [200 OK] https://api.myip.com | 0.53s
{'zone': 'public_pool', 'country': 'nl', 'ipv4': '193.149.21.166'}
{"ip":"193.149.21.166","country":"Netherlands","cc":"NL"}
INFO | scrapfly.client:scrape:129 - --> GET Scrapping https://api.myip.com
INFO | scrapfly.client:scrape:135 - <-- [200 OK] https://api.myip.com | 0.43s
{'zone': 'public_pool', 'country': 'de', 'ipv4': '45.131.94.159'}
{"ip":"45.131.94.159","country":"Germany","cc":"DE"}
Screenshot
More information about screenshot
Taking a screenshot of the Hacker News website.
Simple usage
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
scrapfly.screenshot(url='https://news.ycombinator.com/', name="hackernews.jpg")
Advanced usage
If you need to customize the request (headers, cookies, etc.) via ScrapeConfig:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://news.ycombinator.com/',
    render_js=True,
    screenshots={
        'hackernews': 'fullpage'
    }
))
# will save hackernews.jpg
# you can provide a path argument to specify the directory
scrapfly.save_screenshot(api_response=api_response, name='hackernews')
Anti Scraping Protection (ASP)
More information about Anti Scraping Protection
By default, everything is done behind the scenes to emulate a real user. If you still get detected or a protection is triggered anyway, you can activate ASP to unblock the request and bypass the protection.
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://amazon.com',
    render_js=True,
    asp=True
))
Reporter
The Scrapfly SDK ships out of the box with a configurable reporter: a simple callback you can use to report bad status codes, errors, or anything else you want. We strongly advise you to use it to collect and log information so you can prepare a support request.
from time import sleep
from typing import Optional
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse, ScrapflyError
from scrapfly.reporter import PrintReporter, ChainReporter
# from scrapfly.reporter.sentry import SentryReporter if sentry-sdk installed
def my_reporter(error:Optional[Exception]=None, scrape_api_response:Optional[ScrapeApiResponse]=None):
    if scrape_api_response is not None and scrape_api_response.scrape_result['status_code'] >= 400:
        print('whoops from my custom reporter')
        # schedule a retry for later, store some logs / metrics, anything you want

    if error is not None:
        if isinstance(error, ScrapflyError):
            # custom action depending on the error code
            if error.code in ['ERR::SCRAPE::OPERATION_TIMEOUT', 'ERR::SCRAPE::TOO_MANY_CONCURRENT_REQUEST']:
                sleep(30)
            elif error.code in ['ERR::THROTTLE::MAX_API_CREDIT_BUDGET_EXCEEDED', 'ERR::THROTTLE::MAX_CONCURRENT_REQUEST_EXCEEDED', 'ERR::THROTTLE::MAX_REQUEST_RATE_EXCEEDED']:
                sleep(60)
        else:
            pass  # handle non-Scrapfly errors
scrapfly = ScrapflyClient(
    key='',
    reporter=ChainReporter(my_reporter, PrintReporter())
)
response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/status/404'))
print(response)
We have created some pre-built reporters:
- PrintReporter: simply prints useful data to stdout
- ChainReporter: allows chaining multiple callbacks
- NoopReporter: the default one, it does nothing
- SentryReporter: Sentry integration; as soon as the Python Sentry SDK is installed and configured, it captures exceptions with enriched data
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from scrapfly.reporter import PrintReporter
scrapfly = ScrapflyClient(
    key='',
    reporter=PrintReporter()
)
response = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/status/404'))
Prebuilt imports:
from scrapfly.reporter import PrintReporter, ChainReporter, NoopReporter
from scrapfly.reporter.sentry import SentryReporter