Anti Scraping Protection (ASP)
SLA and Service Interruption
Please note that:
- We cannot provide an ETA for service restoration because of the R&D involved. However, given the volume we handle and the number of corporate accounts we have, most incidents are resolved within 1 business day on well-known anti-bot solutions; on average, resolution takes around 3 to 7 business days.
- The API Credit cost may fluctuate if a website introduces new protection(s) or migrates to another anti-bot solution, because the underlying resources required to handle it can change (residential network, browser usage, custom solution).
- SLA plans are available from a minimum commitment of $50k/month.
Scrapfly's Anti-Scraping Protection is designed to unblock protected websites that are inaccessible to bots. We accomplish this by incorporating various concepts that help maintain a coherent fingerprint, making it as close to that of a real user as possible when scraping a website.
If you are interested in understanding the technical aspects of how we achieve this undetectability, we have published a series of articles on the subject available in the learning resources section below.
Usage and Abuse Limitation
- Automated Online Payment
- Account Creation
- Spam Posts
- Vote Falsification
- Credit Card Testing
- Login Brute Force
- Referral / Gifting systems
- Ad fraud
- Banks
- Ticketing (Automated Buying System)
- Betting, Casino, Gambling
Scrapfly is capable of identifying and resolving obstacles posed by commonly used anti-scraping measures. Our platform also provides support for custom anti-scraping measures implemented by popular websites. Scrapfly ASP bypass does not require any extra input from you, and you will receive successful responses automatically.
Usage
When ASP is enabled, anti-bot solution vendors are automatically detected and everything needed to bypass them is managed for you.
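require "uri"
require "net/http"

# Scrape https://httpbin.dev/anything with ASP enabled (asp=true).
# The target URL is URL-encoded; replace __API_KEY__ with your API key.
url = URI("https://api.scrapfly.io/scrape?asp=true&key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything")

https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)
response = https.request(request)

# Print the raw API response body.
puts response.read_body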
https://api.scrapfly.io/scrape?asp=true&key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything
ASP fine-tunes some parameters regardless of user configuration. These mutations can increase the price; check the pricing section for more details.
- Proxy Pool: ASP can access private pools we maintain (residential/datacenter) and may upgrade to an undetected network, etc.
- Browser Usage: If disabled, ASP might enable it.
- Headers: Browser headers you set will be ignored, except a non-generic browser user-agent. Headers are auto fine-tuned based on the resource type (image, file, html, etc.) and the referer. We can also add custom headers if the target or the challenge method requires them.
  - referer: auto generated if not present. You can pass none as the header value to not pass any referer header to the target website.
  - cookie: ASP automatically handles session usage and reuses challenge cookies for faster results.
  - accept: can be changed depending on the type of resource (images, script, json, xhr, etc.).
  - content-type: based on the request body and the target website format.
  - user-agent: Set a custom user-agent only when the target requires it; otherwise the user agent is already managed by ASP and will match the fingerprint used for the scrape.
    - Chrome-based user agents are ignored and replaced by the one provided for the fingerprint.
    - Non-Chrome user agents are left untouched.
- Country: Based on the target website's location and usual traffic, we fine-tune the country.
  - If you set a specific country, we will respect it.
- OS: To align the fingerprint end-to-end, we may change the OS and related headers based on the exit proxy hardware.
- Body: JSON is re-encoded to produce the same serialized output as a browser.
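As an illustration of the header rules above, here is a minimal sketch that builds a scrape URL with ASP enabled and asks ASP not to send any referer to the target. The `headers[referer]` parameter and the `none` value come from this page; the helper name `asp_scrape_uri` is hypothetical and only used for this example.

```ruby
require "uri"

# Hypothetical helper: build a scrape URL with ASP enabled.
# Target headers are passed as headers[<name>] query parameters and
# URL-encoded by URI.encode_www_form.
def asp_scrape_uri(api_key:, target_url:, headers: {})
  params = { "asp" => "true", "key" => api_key, "url" => target_url }
  headers.each { |name, value| params["headers[#{name}]"] = value }

  uri = URI("https://api.scrapfly.io/scrape")
  uri.query = URI.encode_www_form(params)
  uri
end

# "none" tells ASP not to send any referer header to the target website.
uri = asp_scrape_uri(
  api_key: "__API_KEY__",
  target_url: "https://httpbin.dev/anything",
  headers: { "referer" => "none" }
)
puts uri
```

Note that `URI.encode_www_form` also percent-encodes the square brackets of the parameter name (`headers%5Breferer%5D`), which is equivalent once the query string is decoded.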
ASP Limitations
While popular anti-bot vendors can be bypassed without any additional effort, there are still some areas that require manual configuration of calls.
For best results, it's important to understand how the target websites work and replicate their behavior in scraping calls. ASP bypass handles bot detection, and it's up to the user to configure last-mile settings to avoid identification through usage patterns.
How to Avoid Anti-Bot Detection on POST Requests
Avoiding anti-bot detection on a POST request can be tricky, but there are some key areas to focus on:
- Mimic a real user's behavior: Anti-bot systems often check for unusual behavior that may indicate a bot, such as a high number of requests from the same IP address or at the same time. You can mimic a real user's behavior by visiting some pages first to retrieve navigation cookies and referer URLs.
- Handling CSRF: Cross-Site Request Forgery (CSRF) tokens are a common measure used by websites that also blocks bots. Retrieve the token the website issues and send it back with your POST request.
  For more, see these tutorials and resources:
  - CSRF header tutorial on Scrapfly's Scrapeground.
  - Introduction to headers in scraper blocking blog post.
- Use realistic headers: Anti-bot systems can detect bots by looking at the headers of the requests. You should try to replicate the headers of a real user's request as closely as possible. This includes the Accept, Content-Type, Referer and Origin headers. Make sure to correctly configure the values of Accept and Content-Type according to the content you expect (json, html).
  For more, see these tutorials and resources:
  - Referer header tutorial on Scrapfly's Scrapeground.
  - Introduction to headers in scraper blocking blog post.
- Authentication: If the website requires authentication, make sure you include the correct credentials in your request. This might involve logging in to the website first, then including the session cookie or token in your POST request. ASP cannot manage authentication for you; if the API or website requires it, you must handle it on your side.
  For more, see these tutorials and resources:
  - Cookies authentication tutorial on Scrapfly's Scrapeground.
Overall, the key to bypassing anti-bot measures on a POST request is to replicate the headers, cookies, and authentication of a regular browser request as closely as possible. This requires careful inspection of the website's code and network traffic to identify the required elements.
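The points above can be sketched as a POST scrape request carrying realistic headers. This is only a sketch under assumptions: it assumes the HTTP method and body of the API call are forwarded to the target, the helper name `build_post_scrape` and the payload are hypothetical, and the header values are placeholders to replace with those observed in real browser traffic.

```ruby
require "uri"
require "net/http"
require "json"

# Hypothetical helper: build (but do not send) a POST scrape request.
# Assumption: the method and body of the API call are forwarded to the target.
def build_post_scrape(api_key:, target_url:, payload:, target_headers: {})
  params = { "asp" => "true", "key" => api_key, "url" => target_url }
  target_headers.each { |name, value| params["headers[#{name}]"] = value }

  uri = URI("https://api.scrapfly.io/scrape")
  uri.query = URI.encode_www_form(params)

  request = Net::HTTP::Post.new(uri)
  request.body = JSON.generate(payload)
  request
end

request = build_post_scrape(
  api_key: "__API_KEY__",
  target_url: "https://httpbin.dev/anything",
  payload: { "query" => "shoes" }, # placeholder body
  target_headers: {
    "content-type" => "application/json",
    "accept" => "application/json",
    "referer" => "https://httpbin.dev/",
    "origin" => "https://httpbin.dev"
  }
)

# To actually send it:
# Net::HTTP.start(request.uri.host, request.uri.port, use_ssl: true) do |http|
#   puts http.request(request).read_body
# end
```

If the target also expects a CSRF token or a session cookie, it would be added to `target_headers` in the same way after retrieving it from a previous page load.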
Website with Private/Hidden API
Scraping a private API can be a bit more challenging than scraping public APIs. Here are some recommendations to follow:
- Make sure you have permission: Before scraping any private API, make sure you have the necessary permission from the website owner or API provider. Scraping a private API without permission can result in legal consequences, depending on the type of data exposed.
- Mimic a real user: When scraping a private API, it's important to mimic a real user as closely as possible. This means sending the same headers and parameters that a real user would send when accessing the API.
- Use authentication: Most private APIs require some form of authentication, such as a token or API key. Make sure you obtain the necessary credentials and use them in your requests.
- Monitor for changes: Private APIs can change over time, so it's important to monitor for any changes in the API's structure or authentication requirements. If you notice any changes, update your scraping code accordingly.
Overall, scraping private APIs requires more attention to detail and careful configuration of requests. Following these recommendations can help ensure a successful and ethical scraping process.
Maximize Your Success Rate
Network Quality
In many cases, datacenter IPs are sufficient. However, anti-bot vendors may check the origin of the IP when protecting websites, to determine if the traffic comes from a datacenter or a regular connection. In such cases, residential networks can provide a better IP reputation, as they are registered under a regular ASN that helps control the origin of the IP.
- Introduction To Proxies in Web Scraping
- How to Avoid Web Scraping Blocking: IP Address Guide
- Learn how to change the network type
API Usage: proxy_pool=public_residential_pool, check out the related documentation
Use a Browser
Most anti-bot systems check the browser fingerprint and the JavaScript engine to generate a proof of legitimacy.
API Usage: render_js=true, check out the related documentation
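A minimal sketch of how the two options above (residential network and browser rendering) can be toggled on a scrape call. The `proxy_pool` and `render_js` parameters come from this page; the helper name `scrape_query` and its keyword arguments are hypothetical.

```ruby
require "uri"

# Hypothetical helper: build the query string of a scrape call, optionally
# switching to the residential pool and/or enabling browser rendering.
def scrape_query(api_key:, target_url:, residential: false, browser: false)
  params = { "asp" => "true", "key" => api_key, "url" => target_url }
  params["proxy_pool"] = "public_residential_pool" if residential
  params["render_js"] = "true" if browser
  URI.encode_www_form(params)
end

query = scrape_query(
  api_key: "__API_KEY__",
  target_url: "https://httpbin.dev/anything",
  residential: true,
  browser: true
)
puts "https://api.scrapfly.io/scrape?#{query}"
```

Both options affect the API cost, so a reasonable approach is to start with both disabled and enable them only when a target keeps blocking.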
Verify Cookies and Headers
Observe the headers and cookies of regular calls that are successful to figure out whether you need to add extra headers or retrieve specific cookies to authenticate. You can use your browser's developer tools and inspect the network activity.
API Usage: headers[referer]=https%3A%2F%2Fexample.com (value is URL-encoded), check out the related documentation
Navigation Coherence
To ensure navigation coherence when scraping unofficial APIs, you may need to obtain cookies from your navigation. One way to do this is by enabling session and rendering JavaScript during the initial scrape to retrieve cookies. Once the cookies are stored in your session, you can continue scraping without rendering JavaScript while still applying the previously obtained cookies for consistency. The following Scrapfly features help you achieve that:
API Usage: session=my-unique-session-name, check out the related documentation
API Usage: render_js=true, check out the related documentation
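The two-step flow described above can be sketched as follows: a first call with a session and JavaScript rendering to collect cookies, then a cheaper call that reuses the same session without rendering. The `session` and `render_js` parameters come from this page; the helper name `session_scrape_uri` and the `https://example.com/api/items` endpoint are hypothetical.

```ruby
require "uri"

# Hypothetical helper: build a scrape URL bound to a named session.
def session_scrape_uri(api_key:, target_url:, session:, render_js:)
  params = {
    "asp" => "true",
    "key" => api_key,
    "url" => target_url,
    "session" => session
  }
  params["render_js"] = "true" if render_js

  uri = URI("https://api.scrapfly.io/scrape")
  uri.query = URI.encode_www_form(params)
  uri
end

session_name = "my-unique-session-name"

# Step 1: visit a regular page with a browser so cookies land in the session.
warm_up = session_scrape_uri(
  api_key: "__API_KEY__",
  target_url: "https://example.com/",
  session: session_name,
  render_js: true
)

# Step 2: call the hidden API with the same session, without rendering.
api_call = session_scrape_uri(
  api_key: "__API_KEY__",
  target_url: "https://example.com/api/items", # hypothetical endpoint
  session: session_name,
  render_js: false
)

puts warm_up
puts api_call
```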
Geo Blocking
When browsing certain websites, users may encounter blocks based on their IP location. Scrapfly can bypass this issue by default, as it selects a random country from its pool. However, specifying the country based on the location of the website can be a helpful way to avoid geo-blocking.
API Usage: country=us, check out the related documentation
Pricing
Our Anti-Scraping Protection (ASP) solution is a sophisticated tool that bypasses the anti-scraping measures protecting websites.
It is designed to adapt to various anti-scraping measures implemented on different websites.
To achieve this, ASP dynamically fine-tunes your configuration parameters based on the target and the anti-scraping solution in place, and this can have an impact on pricing.
The main factors affecting the API cost are:
- Browser Usage
- Proxy Pool
- Target/Shield
You will find the pricing grid for browser usage and proxy network type. Specific targets/shields have additional fees that are not publicly documented; only a few very specific ones carry fees, and otherwise there is no additional cost (those fees are displayed in the cost section of your log). To get the full cost details, check the dedicated troubleshooting section.
To ensure predictability and control of your spending, we recommend creating an account and gradually monitoring the usage costs as you increase your volume.
You can also set a budget on a scrape call with cost_budget=25.
Once you have determined the actual cost, you can check our set of tools to make it more predictable and ensure you stay within budget.
It's totally free on non-blocked scrapes
If you scrape various websites and don't know which ones are protected, just keep ASP enabled: no extra cost is applied to non-protected traffic.
Furthermore, when ASP is enabled, many things are handled automatically through parameter fine-tuning to prevent detection, which results in savings.
Integration
Related Errors
All related errors are listed below. You can see the full description and an example error response in the Errors section. You can also check the troubleshooting section if you have timeout issues with ASP.
- ERR::ASP::CAPTCHA_ERROR - Something went wrong with the captcha. We will work on fixing the problem as soon as possible
- ERR::ASP::CAPTCHA_TIMEOUT - The budgeted time to solve the captcha has been reached
- ERR::ASP::SHIELD_ERROR - The ASP encountered an unexpected problem. We will fix it as soon as possible. Our team has been alerted
- ERR::ASP::SHIELD_EXPIRED - The ASP shield previously set has expired, you must retry
- ERR::ASP::SHIELD_NOT_ELIGIBLE - The feature requested is not eligible while using the ASP for the given protection/target
- ERR::ASP::SHIELD_PROTECTION_FAILED - The ASP shield failed to solve the challenge against the anti-scraping protection
- ERR::ASP::TIMEOUT - The ASP took too much time to solve or respond
- ERR::ASP::UNABLE_TO_SOLVE_CAPTCHA - Despite our efforts, we were unable to solve the captcha. It can happen sporadically, please retry
- ERR::ASP::UPSTREAM_UNEXPECTED_RESPONSE - The response given by the upstream after challenge resolution is not expected. Our team has been alerted
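Two of the errors above explicitly suggest retrying. As a rough sketch, client code could retry only on those codes and give up on the others. The helper names below are hypothetical, and how the error code is extracted from the API response is deliberately left to the caller, since the response format is described in the Errors section rather than here.

```ruby
# Codes whose descriptions above explicitly suggest a retry.
RETRYABLE_ASP_ERRORS = [
  "ERR::ASP::SHIELD_EXPIRED",
  "ERR::ASP::UNABLE_TO_SOLVE_CAPTCHA"
].freeze

def retryable_asp_error?(code)
  RETRYABLE_ASP_ERRORS.include?(code)
end

# Run the block up to max_attempts times. The block receives the attempt
# number and must return nil on success or an error code string on failure.
def with_asp_retries(max_attempts: 3)
  attempts = 0
  loop do
    attempts += 1
    code = yield(attempts)
    return { ok: true, attempts: attempts } if code.nil?

    # Stop on non-retryable errors or when the attempt budget is used up.
    unless retryable_asp_error?(code) && attempts < max_attempts
      return { ok: false, attempts: attempts, code: code }
    end
  end
end

# Usage with a simulated sequence of outcomes: one expired shield, then success.
outcomes = ["ERR::ASP::SHIELD_EXPIRED", nil]
result = with_asp_retries { |attempt| outcomes[attempt - 1] }
puts result.inspect
```

Whether timeouts or other errors are worth retrying depends on the target, so the retryable list is intentionally limited to the codes this page marks as retryable.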