One of the biggest challenges in web scraping is blocking, which can be caused by hundreds of different reasons. However, all of these reasons boil down to a single fact: web scraper connections look different from those of a web browser.
What makes web scraper connections so easy to identify?
The easiest way to detect a web scraping connection is request header analysis. Headers are part of every connection and include important metadata. If our web scraper connects with headers unlike those of a web browser, it can be easily identified. To avoid this, we need to understand how headers work, how web browsers present them and how we can replicate them in our web scraping code. To summarize:
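As a minimal sketch of the idea above: the default headers sent by an HTTP library instantly give a scraper away (e.g. `User-Agent: python-requests/...`), so we replace them with a browser-like set. The header values below are illustrative snapshots of what a Chrome browser might send, not authoritative, and should be kept up to date:

```python
import requests

# Browser-like headers (illustrative values - real browsers update these
# with every release, so they should be refreshed periodically):
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

# Prepare the request locally (no network call) to inspect the final headers
# that would be sent over the wire:
prepared = requests.Request(
    "GET", "https://example.com/", headers=BROWSER_HEADERS
).prepare()
print(prepared.headers["User-Agent"])
```

Without this step, the library's own identifying `User-Agent` would be sent instead, making the scraper trivial to spot.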
Another piece of metadata included with every HTTP connection is the IP address itself. When web scraping, we often use proxies to avoid sending an inhuman number of requests through a single connection. This is great for avoiding detection by traffic analysis, but not all IP addresses are equal - some perform much better in web scraping than others. To summarize:
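To illustrate proxy rotation, here is a small sketch that distributes requests across a pool of proxies. The proxy URLs are hypothetical placeholders, and the `proxies` mapping follows the format used by the `requests` library:

```python
import itertools

# Hypothetical proxy pool - residential and mobile IPs generally perform
# better than datacenter IPs, which are often blocked outright:
PROXY_POOL = [
    "http://user:pass@residential-1.example.com:8000",
    "http://user:pass@residential-2.example.com:8000",
    "http://user:pass@mobile-1.example.com:8000",
]

# round-robin rotation spreads traffic so no single IP address carries
# an inhuman number of requests:
rotation = itertools.cycle(PROXY_POOL)

def next_proxies() -> dict:
    """Return a proxies mapping in the format the requests library expects."""
    proxy = next(rotation)
    return {"http": proxy, "https": proxy}

# each call hands out the next proxy in the pool:
print(next_proxies()["https"])
```

In a real scraper, each request would be made with `requests.get(url, proxies=next_proxies())`, though smarter strategies (weighting healthy proxies, retiring blocked ones) are usually needed at scale.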
Transport Layer Security (TLS) is an end-to-end encryption protocol used in all HTTPS connections. The TLS handshake process can lead to web scraper identification and fingerprinting. This is because every HTTP client - be it a programming library or a web browser - performs the initial TLS handshake slightly differently. These minor differences are collected and compiled into a fingerprint called JA3.
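We can see one ingredient of this fingerprint locally. A JA3 fingerprint is built from handshake details such as the TLS version, the offered cipher suites and extensions. This sketch uses Python's standard `ssl` module to list the cipher suites its default context would offer - a list that differs from what Chrome or Firefox offer, which is exactly what makes an HTTP client fingerprintable:

```python
import ssl

# Inspect the cipher suites Python's default TLS context would offer
# during the handshake (one of the inputs to a JA3 fingerprint):
ctx = ssl.create_default_context()
ciphers = [cipher["name"] for cipher in ctx.get_ciphers()]
print(len(ciphers), ciphers[:3])
```

Because the cipher list (and its order) is characteristic of the TLS library being used, a server can distinguish a default Python client from a real browser even when all HTTP headers are perfectly spoofed.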
Some of these variables can instantly identify us as non-human connections, and some can provide unique tracking artifacts for fingerprinting. That being said, most of these leaks can be plugged or spoofed, meaning not all hope is lost!
Many of these leaks (such as the navigator.webdriver variable) are patched in scraper-controlled browsers.
As you can see, avoiding web scraper blocking is an enormous subject - there are so many things that can identify us as a web scraper!
This is why the ScrapFly web scraping API was created - abstracting away the logic that deals with web scraping blocking results in much cleaner and easier-to-maintain code.
ScrapFly automatically resolves most of these blocking issues, and it also has several optional features which can be used to access even the most hard-to-reach websites. Let's take a brief look at how these features can be used in our python-sdk:
from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://quotes.toscrape.com/js/page/2/"
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            render_js=True
            # ^^^^^^^ enabled
        )
    )
    html = response.scrape_result['content']
The Anti Scraping Protection Bypass feature defeats various anti-scraping services that use fingerprinting and browser analysis to block web scrapers:
from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://some-protected-website.com/"
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            asp=True
            # ^^^^^^^ enabled
        )
    )
    html = response.scrape_result['content']
The Smart Proxies feature allows selecting from multiple proxy types, such as mobile and residential proxies, as well as pinning proxies to specific countries:
from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://httpbin.org/ip"
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            # see https://scrapfly.io/dashboard/proxy for available proxy pools
            proxy_pool='public_mobile_pool',  # use mobile proxies
            country='US',  # use proxies located in the United States
        )
    )
    html = response.scrape_result['content']
And much more - see our full documentation and try it out yourself for free!
Avoiding web scraper blocking can be an exhausting task, so check out ScrapFly, which solves all of that for you!