Web Scraping With Python

If you don't want to start coding right now, you can explore Scrapfly features through our API player, a visual sandbox where you can call the API and play with it.

We will scrape the well-known website Hacker News and extract articles with their metadata, using the Scrapfly Python SDK.

You can use Scrapfly with the language of your choice. If no SDK is available in your language, our API follows the OpenAPI convention; the full specification is available here.

All of the following code examples are available on GitHub: https://github.com/scrapfly/python-scrapfly/tree/master/examples

Install the SDK

The source code of the Python SDK is available on GitHub, and the scrapfly-sdk package is available through PyPI.

pip install 'scrapfly-sdk'

You can install extra packages:

  • Concurrency module: pip install 'scrapfly-sdk[concurrency]' (a plain thread-pool illustration follows this list)
  • Scrapy module: pip install 'scrapfly-sdk[scrapy]'
  • Performance module: pip install 'scrapfly-sdk[speedups]' (brotli compression and msgpack serialization)
  • All modules: pip install 'scrapfly-sdk[all]'
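
The concurrency extra adds support for concurrent scraping. As a rough illustration of the idea only (this is not the extra's actual API), here is a plain thread-pool sketch using just the standard library, assuming the client can safely be shared across threads:

from concurrent.futures import ThreadPoolExecutor

from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key='')
urls = ['https://httpbin.org/anything', 'https://news.ycombinator.com/']

# run blocking scrape() calls in parallel worker threads
with ThreadPoolExecutor(max_workers=2) as executor:
    responses = list(executor.map(lambda url: scrapfly.scrape(ScrapeConfig(url=url)), urls))

for api_response in responses:
    print(api_response.scrape_result['status_code'])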

First Scrape

We will scrape Hacker News to retrieve all the articles of the day. To extract data from the HTML content, we will use BeautifulSoup.

pip install beautifulsoup4

import re
from contextlib import suppress
from dataclasses import dataclass
from typing import Optional
from pprint import pprint

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from bs4 import BeautifulSoup

scrapfly = ScrapflyClient(key='')

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://news.ycombinator.com/'))

soup = BeautifulSoup(api_response.scrape_result['content'], "html.parser")

@dataclass
class Article:
    title:Optional[str]=None
    rank:Optional[int]=None
    link:Optional[str]=None
    user:Optional[str]=None
    score:Optional[int]=None
    comments:Optional[int]=None

    def is_valid(self) -> bool:
        if self.title is None or self.link is None:
            return False

        return True

articles = []

for item in soup.find("table", {"class": "itemlist"}).find_all("tr", {"class": "athing"}):
    article = Article()

    # the rank is rendered as "1." - strip the trailing dot before casting
    article.rank = int(item.find("span", {"class": "rank"}).get_text().replace('.', ''))
    article.link = item.find("a", {"class": "storylink"})['href']
    article.title = item.find("a", {"class": "storylink"}).get_text()

    # score, user and comment count live in the <tr> that follows the "athing" row;
    # next_sibling() lists its tags and [1] is the "subtext" cell
    metadata = item.next_sibling()[1]
    score = metadata.find("span", {"class": "score"})

    if score is not None:
        with suppress(IndexError):
            article.score = int(re.findall(r"\d+", score.get_text())[0])

    user = metadata.find("a", {"class": {"hnuser"}})

    if user is not None:
        article.user = user.get_text()

    with suppress(IndexError):
        # e.g. "88 comments" -> 88; items without comments produce no match
        article.comments = int(re.findall(r"(\d+)\scomment?", metadata.get_text())[0])

    if article.is_valid():
        articles.append(article)

pprint(articles)

If you get the error ModuleNotFoundError: No module named 'dataclasses', it is because this example uses Python dataclasses, which require Python 3.7 at minimum. Replace the dataclass with a regular class if you are on Python <= 3.6, as shown below.
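
A minimal plain-class equivalent of the Article dataclass (same fields, same is_valid logic) could look like this:

class Article:
    # plain-class version for Python <= 3.6, where the dataclasses module is unavailable
    def __init__(self, title=None, rank=None, link=None, user=None, score=None, comments=None):
        self.title = title
        self.rank = rank
        self.link = link
        self.user = user
        self.score = score
        self.comments = comments

    def is_valid(self):
        return self.title is not None and self.link is not None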

This is what you should get after running the script above:

❯ python hackernews.py
INFO     | scrapfly.client:scrape:121 - --> GET Scrapping https://news.ycombinator.com/
INFO     | scrapfly.client:scrape:127 - <-- [200 OK] https://news.ycombinator.com/ | 0.97s
[Article(title='Apple unveils M1, its first system-on-a-chip for portable Mac computers', rank=1, link='https://9to5mac.com/2020/11/10/apple-unveils-m1-its-first-system-on-a-chip-for-portable-mac-computers/', user='runesoerensen', score=1035, comments=1102),
 Article(title='How to get root on Ubuntu 20.04 by pretending nobody’s /home', rank=2, link='https://securitylab.github.com/research/Ubuntu-gdm3-accountsservice-LPE', user='generalizations', score=339, comments=88),
 Article(title="On Apple's Piss-Poor Documentation", rank=3, link='https://www.caseyliss.com/2020/11/10/on-apples-pisspoor-documentation', user='ingve', score=994, comments=298),
 Article(title='The election of the doge', rank=4, link='https://generalist.academy/2020/11/06/the-election-of-the-doge/', user='flannery', score=31, comments=12),
 Article(title='Introducing the next generation of Mac', rank=5, link='https://www.apple.com/newsroom/2020/11/introducing-the-next-generation-of-mac/', user='redm', score=515, comments=757),
 Article(title='.NET 5.0', rank=6, link='https://devblogs.microsoft.com/dotnet/announcing-net-5-0/', user='benaadams', score=652, comments=345),
 Article(title='Using Rust to Scale Elixir for 11M Concurrent Users', rank=7, link='https://blog.discord.com/using-rust-to-scale-elixir-for-11-million-concurrent-users-c6f19fc029d3', user='aloukissas', score=123, comments=16),
 Article(title='The Lonely Work of Moderating Hacker News (2019)', rank=8, link='https://www.newyorker.com/news/letter-from-silicon-valley/the-lonely-work-of-moderating-hacker-news', user='bluu00', score=486, comments=213),
 Article(title='Origins of the youtube-dl project', rank=9, link='https://rg3.name/202011071352.html', user='rg3', score=426, comments=39),
 Article(title='The intriguing maps that reveal alternate histories', rank=10, link='https://www.bbc.com/future/article/20201104-the-intriguing-maps-that-reveal-alternate-histories', user='chwolfe', score=14, comments=None),
 Article(title='Getting a biometric security key right', rank=11, link='https://www.yubico.com/blog/getting-a-biometric-security-key-right/', user='tiziano88', score=90, comments=30),
 Article(title='Graduate Level Math Notes (Harvard)', rank=12, link='https://github.com/Dongryul-Kim/harvard_notes', user='E-Reverance', score=14, comments=1),
 Article(title='Eleven Years of Go', rank=13, link='https://blog.golang.org/11years', user='mfrw', score=300, comments=119),
 Article(title='Burnout can exacerbate work stress, further promoting a vicious circle', rank=14, link='https://www.uni-mainz.de/presse/aktuell/12451_ENG_HTML.php', user='rustoo', score=240, comments=158),
 Article(title='Firejail – Sandbox Linux Applications', rank=15, link='https://github.com/netblue30/firejail', user='thushanfernando', score=111, comments=33),
 Article(title='Zoom lied to users about end-to-end encryption for years, FTC says', rank=16, link='https://arstechnica.com/tech-policy/2020/11/zoom-lied-to-users-about-end-to-end-encryption-for-years-ftc-says/', user='eddieoz', score=1402, comments=353),
 Article(title='Artificial intelligence may be making people buy things', rank=17, link='https://www.bbc.co.uk/news/technology-54522442', user='edward', score=24, comments=5),
 Article(title='Vigil: The eternal morally vigilant programming language', rank=18, link='https://github.com/munificent/vigil', user='headalgorithm', score=113, comments=19),
 Article(title='My experience as a poll worker in Pennsylvania', rank=19, link='https://portal.drewdevault.com/2020/11/10/2020-Election-worker.gmi', user='catacombs', score=301, comments=256),
 Article(title='Puppetaria: Accessibility-First Puppeteer Scripts', rank=20, link='https://developers.google.com/web/updates/2020/11/puppetaria', user='feross', score=7, comments=None),
 Article(title='InfluxDB is betting on Rust and Apache Arrow for next-gen data store', rank=21, link='https://www.influxdata.com/blog/announcing-influxdb-iox/', user='mhall119', score=162, comments=69),
 Article(title='Substack (YC W18) is hiring to build a better business model for writing', rank=22, link='https://substack.com/jobs', user=None, score=None, comments=None),
 Article(title='One pollster’s explanation for why the polls got it wrong', rank=23, link='https://www.vox.com/policy-and-politics/2020/11/10/21551766/election-polls-results-wrong-david-shor', user='satchet', score=135, comments=379),
 Article(title='Body found in Canada identified as neo-nazi spam king', rank=24, link='https://krebsonsecurity.com/2020/11/body-found-in-canada-identified-as-neo-nazi-spam-king/', user='parsecs', score=99, comments=51),
 Article(title='Times Change: Inside the New York Times’ Heated Reckoning With Itself', rank=25, link='https://nymag.com/intelligencer/2020/11/inside-the-new-york-times-heated-reckoning-with-itself.html', user='magda_wang', score=72, comments=94),
 Article(title='eBPF – The Future of Networking and Security', rank=26, link='https://cilium.io/blog/2020/11/10/ebpf-future-of-networking/', user='genbit', score=158, comments=27),
 Article(title='Separating User Database and Authorization from Apps with Istio and FusionAuth', rank=27, link='https://reachablegames.com/oidc-fusionauth-istio/', user='mooreds', score=24, comments=1),
 Article(title="Oh, the irony: iFixit tools in Apple's lab", rank=28, link='https://twitter.com/iFixit/status/1326264991192764416', user='miles', score=8, comments=1),
 Article(title='A Cyrillic orthography for the Polish language', rank=29, link='http://steen.free.fr/cyrpol/index.html', user='keiferski', score=42, comments=20),
 Article(title='Indistinguishability Obfuscation from Well-Founded Assumptions', rank=30, link='https://www.quantamagazine.org/computer-scientists-achieve-crown-jewel-of-cryptography-20201110/', user='theafh', score=161, comments=82)]

Logging

Customize the logging configuration:

import logging as logger
from sys import stdout

scrapfly_logger = logger.getLogger('scrapfly')
scrapfly_logger.setLevel(logger.DEBUG)
scrapfly_logger.addHandler(logger.StreamHandler(stdout))  # attach the handler so log records reach stdout
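
If you also want timestamps and level names in the output, attach a formatter to the handler (standard library logging API) instead of the bare handler above:

# handler with a timestamped format for scrapfly log records
handler = logger.StreamHandler(stdout)
handler.setFormatter(logger.Formatter('%(asctime)s | %(name)s | %(levelname)s | %(message)s'))
scrapfly_logger.addHandler(handler)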

Request Customization

More information about request customization

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

print('===== HEAD request =====')

api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url='https://httpbin.org/anything',
    method='HEAD'
))

# HEAD responses have no body, so the API response is truncated to the strict minimum (upstream headers, status and reason)
print(api_response.result)

print('======== Default Body Url Encode ========')
api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url='https://httpbin.org/anything',
    method='POST',
    data={'hello': 'world'},
    headers={'X-Scrapfly': 'Yes'}
))

print(api_response.scrape_result['content'])

print('======== Json content-type ======== ')

api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url='https://httpbin.org/anything',
    method='POST',
    data={'hello': 'world'},
    headers={'X-Scrapfly': 'Yes', 'content-type': 'application/json'}
))

print(api_response.scrape_result['content'])

This is what you should get after running the script above:

❯ python custom.py
===== HEAD request =====
INFO     | scrapfly.client:scrape:129 - --> HEAD Scrapping https://httpbin.org/anything
INFO     | scrapfly.client:scrape:137 - <-- [200 OK] https://api.scrapfly.local/scrape?key=dfc249f758da45b2be8a182a3fc454d9&url=https%3A%2F%2Fhttpbin.org%2Fanything&country=DE | 0s
{'result': {'request_headers': {}, 'response_headers': {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': '*', 'Access-Control-Allow-Methods': '*', 'Access-Control-Allow-Origin': '*', 'Charset': 'utf-8', 'Content-Encoding': 'br', 'Content-Length': '762', 'Content-Type': 'application/json', 'Date': 'Wed, 11 Nov 2020 12:15:00 GMT', 'Server': 'gunicorn/19.9.0', 'Strict-Transport-Security': 'max-age=360; includeSubDomains; preload', 'Vary': 'Accept-Encoding', 'X-Frame-Options': 'DENY', 'X-Request-Id': 'b9df6c22-c70c-4f0c-9c30-d48f7324eaeb', 'X-Scrapfly-Response-Time': '1036.63', 'X-Xss-Protection': '1; mode=block'}, 'status_code': 200, 'reason': 'OK', 'format': 'text', 'content': ''}, 'context': {}, 'config': {'cookies': {}, 'headers': {}, 'url': 'https://httpbin.org/anything', 'retry': True, 'method': 'HEAD', 'country': 'DE', 'render_js': False, 'cache': False, 'cache_clear': False, 'asp': False, 'session': None, 'debug': False, 'cache_ttl': None, 'tags': None, 'correlation_id': None, 'body': None, 'data': None, 'graphql': None, 'js': None, 'rendering_wait': None, 'raise_on_upstream_error': True, 'screenshots': None, 'key': None, 'dns': False, 'ssl': False}}
======== Default Body Url Encode ========
INFO     | scrapfly.client:scrape:129 - --> POST Scrapping https://httpbin.org/anything
INFO     | scrapfly.client:scrape:144 - <-- [200 OK] https://httpbin.org/anything | 1.04s
{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Content-Length": "0",
    "Content-Type": "application/x-www-form-urlencoded",
    "Dnt": "1",
    "Host": "httpbin.org",
    "Sec-Gpc": "1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5fabd5c6-70e321736e4b5ce91b3cdc34",
    "X-Scrapfly": "Yes"
  },
  "json": null,
  "method": "POST",
  "origin": "185.192.36.197",
  "url": "https://httpbin.org/anything"
}

======== Json content-type ========
INFO     | scrapfly.client:scrape:129 - --> POST Scrapping https://httpbin.org/anything
INFO     | scrapfly.client:scrape:144 - <-- [200 OK] https://httpbin.org/anything | 2.41s
{
  "args": {},
  "data": "{\"hello\": \"world\"}",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Content-Length": "18",
    "Content-Type": "application/json",
    "Dnt": "1",
    "Host": "httpbin.org",
    "Sec-Gpc": "1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5fabd5c9-3ebc7c3b2306a76c05230200",
    "X-Scrapfly": "Yes"
  },
  "json": {
    "hello": "world"
  },
  "method": "POST",
  "origin": "5.180.82.117",
  "url": "https://httpbin.org/anything"
}

Session

More information about sessions

Use a session to keep scrapes coherent across requests (same IP, same cookies, same referer):

from pprint import pprint
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')

api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(url='https://httpbin.org/cookies/set?k1=v1&k2=v2&k3=v3', session='test'))
print("=== Initiated Session ===")
pprint(api_response.context['session'])

api_response:ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(url='https://httpbin.org/anything', session='test'))

print("=== Request headers ===")
pprint(api_response.scrape_result['request_headers'])

print("=== Response === ")
print(api_response.scrape_result['content'])

This is what you should get after running the script above:

❯ python session.py
INFO     | scrapfly.client:scrape:129 - --> GET Scrapping https://httpbin.org/cookies/set?k1=v1&k2=v2&k3=v3
INFO     | scrapfly.client:scrape:135 - <-- [200 OK] https://httpbin.org/cookies/set?k1=v1&k2=v2&k3=v3 | 0.56s
=== Initiated Session ===
{'cookie_jar': [{'comment': None,
                 'domain': 'httpbin.org',
                 'expires': None,
                 'http_only': False,
                 'max_age': None,
                 'name': 'k1',
                 'path': '/',
                 'secure': False,
                 'size': 4,
                 'value': 'v1',
                 'version': None},
                {'comment': None,
                 'domain': 'httpbin.org',
                 'expires': None,
                 'http_only': False,
                 'max_age': None,
                 'name': 'k2',
                 'path': '/',
                 'secure': False,
                 'size': 4,
                 'value': 'v2',
                 'version': None},
                {'comment': None,
                 'domain': 'httpbin.org',
                 'expires': None,
                 'http_only': False,
                 'max_age': None,
                 'name': 'k3',
                 'path': '/',
                 'secure': False,
                 'size': 4,
                 'value': 'v3',
                 'version': None}],
 'correlation_id': 'default',
 'created_at': datetime.datetime(2020, 11, 11, 8, 55, 1),
 'expire_at': datetime.datetime(2020, 11, 18, 8, 59, 19),
 'identity': '1d39102a8372e6cf765f747fba879324607bf034',
 'last_used_at': datetime.datetime(2020, 11, 11, 8, 59, 19),
 'name': 'test',
 'referer': 'https://httpbin.org/cookies'}
INFO     | scrapfly.client:scrape:129 - --> GET Scrapping https://httpbin.org/anything
INFO     | scrapfly.client:scrape:135 - <-- [200 OK] https://httpbin.org/anything | 0.49s
=== Request headers ===
{'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'accept-encoding': 'gzip, deflate', 'accept-language': 'en-US,en;q=0.9', 'sec-gpc': '1', 'cache-control': 'max-age=0', 'dnt': '1', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36', 'referer': 'https://httpbin.org/cookies', 'cookie': 'k1=v1; k2=v2; k3=v3'}
{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Cookie": "k1=v1; k2=v2; k3=v3",
    "Dnt": "1",
    "Host": "httpbin.org",
    "Referer": "https://httpbin.org/cookies",
    "Sec-Gpc": "1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5faba7e8-4bc059c944854909180ebb64"
  },
  "json": null,
  "method": "GET",
  "origin": "5.181.127.128",
  "url": "https://httpbin.org/anything"
}

Download File / Image

In this example, we will download a Jeppesen chart of VQPR airport in PDF format.

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

scrape_config = ScrapeConfig(url='https://vau.aero/navdb/chart/VQPR.pdf')

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config)
print(api_response.scrape_result['content']) # BytesIO Object

scrapfly.sink(api_response) # writes the file (VQPR.pdf) to disk

This is what you should get after running the script above:

❯ python download.py
INFO     | scrapfly.client:scrape:129 - --> GET Scrapping https://vau.aero/navdb/chart/VQPR.pdf
INFO     | scrapfly.client:scrape:135 - <-- [200 OK] https://vau.aero/navdb/chart/VQPR.pdf | 2.18s
<_io.BytesIO object at 0x7f892fdc0b30>
INFO     | scrapfly.client:sink:232 - file VQPR.pdf created
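
Alternatively, since the content is a BytesIO object, you can persist it yourself with plain Python (filename chosen to match the example):

# write the downloaded PDF to disk manually instead of using sink()
with open('VQPR.pdf', 'wb') as f:
    f.write(api_response.scrape_result['content'].getvalue())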

Cache

More information about cache

The cache is a powerful feature that avoids hitting the upstream website on every call. Just set a TTL; when the cached entry expires, the upstream website is transparently scraped again. Cached results are also returned much faster than a regular scrape.

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='')
scrape_config = ScrapeConfig(url='https://news.ycombinator.com/', cache=True, cache_ttl=500)

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config)
print(api_response.context['cache'])

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config)
print(api_response.context['cache'])

This is what you should get after running the script above:

❯ python cache_hackernews.py
INFO     | scrapfly.client:scrape:129 - --> GET Scrapping https://news.ycombinator.com/
INFO     | scrapfly.client:scrape:135 - <-- [200 OK] https://news.ycombinator.com/ | 1.3s
{'state': 'MISS', 'entry': {'created_at': datetime.datetime(2020, 11, 11, 7, 48, 40), 'last_used_at': datetime.datetime(2020, 11, 11, 7, 48, 40), 'fingerprint': '85b37058ce05b829196617d43940024f8616d863', 'ttl': 500, 'expires_at': datetime.datetime(2020, 11, 11, 7, 57)}}
INFO     | scrapfly.client:scrape:129 - --> GET Scrapping https://news.ycombinator.com/
INFO     | scrapfly.client:scrape:135 - <-- [200 OK] https://news.ycombinator.com/ | 0.27s
{'state': 'HIT', 'entry': {'created_at': datetime.datetime(2020, 11, 11, 7, 48, 40), 'last_used_at': datetime.datetime(2020, 11, 11, 7, 48, 41), 'fingerprint': '85b37058ce05b829196617d43940024f8616d863', 'ttl': 500, 'expires_at': datetime.datetime(2020, 11, 11, 7, 57)}}
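
If you need a fresh copy before the TTL expires, the scrape config also exposes a cache_clear flag (it appears in the config dump shown earlier); a minimal sketch, assuming the flag invalidates the stored entry:

# force a refresh: invalidate the cached entry and re-scrape the upstream website
scrape_config = ScrapeConfig(url='https://news.ycombinator.com/', cache=True, cache_ttl=500, cache_clear=True)
api_response = scrapfly.scrape(scrape_config)
print(api_response.context['cache'])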

Proxy Geo Targeting

More information about proxy geo targeting

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://api.myip.com',
    country="nl"
))

print(api_response.context['proxy'])
print(api_response.scrape_result['content'])

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://api.myip.com',
    country="de"
))

print(api_response.context['proxy'])
print(api_response.scrape_result['content'])

This is what you should get after running the script above:

❯ python proxy_geo.py
INFO     | scrapfly.client:scrape:129 - --> GET Scrapping https://api.myip.com
INFO     | scrapfly.client:scrape:135 - <-- [200 OK] https://api.myip.com | 0.53s
{'zone': 'public_pool', 'country': 'nl', 'ipv4': '193.149.21.166'}
{"ip":"193.149.21.166","country":"Netherlands","cc":"NL"}
INFO     | scrapfly.client:scrape:129 - --> GET Scrapping https://api.myip.com
INFO     | scrapfly.client:scrape:135 - <-- [200 OK] https://api.myip.com | 0.43s
{'zone': 'public_pool', 'country': 'de', 'ipv4': '45.131.94.159'}
{"ip":"45.131.94.159","country":"Germany","cc":"DE"}

Screenshot

More information about screenshots

Taking a screenshot of the Hacker News website

Simple usage

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')
scrapfly.screenshot(url='https://news.ycombinator.com/', name="hackernews.jpg")

Advanced usage

If you need to customize the request (headers, cookies, etc.), go through ScrapeConfig:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://news.ycombinator.com/',
    render_js=True,
    screenshots={
        'hackernews': 'fullpage'
    }
))

# will save hackernews.jpg
# you can provide the path argument to specify the target directory (see the example below)
scrapfly.save_screenshot(api_response=api_response, name='hackernews')
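
For example, to save the screenshot into a dedicated directory (the directory name here is purely illustrative):

# saves screenshots/hackernews.jpg; the directory name is hypothetical
scrapfly.save_screenshot(api_response=api_response, name='hackernews', path='screenshots')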


Anti Scraping Protection (ASP)

More information about Anti Scraping Protection

By default, everything is done behind the scenes to emulate a real user. If you are detected or a protection is triggered anyway, you can activate ASP to unblock the request and bypass the protection.

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='')

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://amazon.com',
    render_js=True,
    asp=True
))
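
The response is then handled like any other scrape; for example, assuming the protection was bypassed successfully:

# an ASP-unblocked response exposes the same scrape_result as a regular scrape
print(api_response.scrape_result['content'])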

GraphQL Content Extraction

Query content with GraphQL directly. Check out the documentation for the full specification of this feature.

import logging as logger
from sys import stdout

scrapfly_logger = logger.getLogger('scrapfly')
scrapfly_logger.setLevel(logger.DEBUG)
scrapfly_logger.addHandler(logger.StreamHandler(stdout))  # attach the handler so log records reach stdout

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

# Checkout our full documentation here : https://scrapfly.io/docs/scrape-api/graphql

scrapfly = ScrapflyClient(key='')


query = """
{
    page {
        news: query(selector: "table.itemlist > tbody > tr.athing") {
            title: text(selector: "td.title > a")
            website: text(selector: "td.title > span.comhead")
            link: attr(selector: "td.title > a", name: "href")
        }
    }
}
"""

# no need to specify headers, etc.; if not specified, our API has got your back!
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
    url='https://news.ycombinator.com/',
    graphql=query,
    render_js=True
))

print(api_response.content)

This will print:

{
    "page": {
        "news": [
            {
                "title": "Bill and Melinda Gates revealed as largest private farmland owners in US",
                "website": "(landreport.com)",
                "link": "https://landreport.com/2021/01/bill-gates-americas-top-farmland-owner/"
            },
            {
                "title": "U.S. insurrectionists received $500k in bitcoins from French donor",
                "website": "(chainalysis.com)",
                "link": "https://blog.chainalysis.com/reports/capitol-riot-bitcoin-donation-alt-right-domestic-extremism"
            },
            ...
        ]
    }
}