Anti Scraping Protection (ASP)

Service Level Agreement (SLA) and Service Interruption

Service interruptions may occasionally occur for reasons outside Scrapfly's control. Anti-scraping protection technology is constantly evolving, and Scrapfly engineers work hard to keep up with the latest changes. Implementing a reliable, production-grade remedy can take anywhere from hours to weeks. Bear this in mind and design your software accordingly when using this feature.

Please note that:

We can't provide an ETA for service restoration because of the R&D involved. However, given the volume we handle and the number of corporate accounts we have, most incidents are resolved within 1 business day for well-known anti-bot solutions, and within around 3 to 7 business days on average otherwise.
The API credit cost may fluctuate if a website introduces new protections or migrates to another anti-bot solution, because the underlying resources required to handle it can change (residential network, browser usage, custom solution).
SLA plans are available from a minimum commitment of $50k/month.

Scrapfly's Anti-Scraping Protection is designed to unblock protected websites that are inaccessible to bots. We accomplish this by incorporating various concepts that help maintain a coherent fingerprint, making it as close to that of a real user as possible when scraping a website.

To use ASP, enable the asp parameter in your API call.

Scrapfly is capable of identifying and resolving obstacles posed by commonly used anti-scraping measures. Our platform also provides support for custom anti-scraping measures implemented by popular websites. Scrapfly ASP bypass does not require any extra input from you, and you will receive successful responses automatically.

If you are interested in understanding the technical aspects of how we achieve this undetectability, we have published a series of articles on the subject available in the learning resources section below.

Usage and Abuse Limitation

To summarize our TOS, the following uses are prohibited:

Automated Online Payment
Account Creation
Spam Posts
Vote Falsification
Credit Card Testing
Login Brute Force
Referral / Gifting systems
Ads fraud
Banks
Ticketing (Automated Buying System)
Betting, Casino, Gambling

ASP can be authorized for use by cybersecurity firms (red teams) after they obtain approval from the relevant parties for the specific domains they wish to test.

Usage

When ASP is enabled, the anti-bot vendor is detected automatically and everything needed to bypass it is handled for you.

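require "uri"
require "net/http"

# asp=true enables Anti Scraping Protection; the target URL is percent-encoded once.
url = URI("https://api.scrapfly.io/scrape?asp=true&key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything")

# The API is served over HTTPS, so TLS must be enabled on the connection.
https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)

# Send the request and print the raw response body.
response = https.request(request)
puts response.read_body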

https://api.scrapfly.io/scrape?asp=true&key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything
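The target URL must be percent-encoded exactly once; encoding it twice turns `%3A` into `%253A` and the API receives the wrong target. A minimal sketch of building the call URL safely is shown below. The helper name `build_scrape_url` is hypothetical and not part of any Scrapfly SDK; only the `key`, `url`, and `asp` parameters come from this page.

```ruby
require "uri"

# Hypothetical helper (not part of any Scrapfly SDK): builds a scrape URL and
# percent-encodes every parameter value exactly once.
def build_scrape_url(api_key, target_url, params = {})
  query = { "key" => api_key, "url" => target_url, "asp" => "true" }.merge(params)
  "https://api.scrapfly.io/scrape?" + URI.encode_www_form(query)
end

# Pass the raw, unencoded target URL; the helper takes care of the encoding.
puts build_scrape_url("__API_KEY__", "https://httpbin.dev/anything")
```

Passing an already-encoded URL to this helper would reproduce the double-encoding problem, so always give it the raw target URL.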

ASP fine-tunes some parameters regardless of user configuration. These adjustments can increase the credit price of a request; check the pricing section for more details. Some examples are listed below:

Proxy Pool: ASP can access exclusive private proxy pools specific to scraped targets or upgrade to a better general proxy pool.
Browser Usage: ASP might enable a browser to bypass pages that require JavaScript.
Headers: Some browser headers set by you might be ignored or modified. Headers based on resource type (image, file, html, etc.) and referer can be fine-tuned as well. We can also add custom headers if the target or the challenge method requires them.
- referer: auto-generated if not present. You can pass none as the header value to avoid sending any referer header to the target website.
- cookie: ASP automatically handles session usage and reuses challenge cookies for faster results.
- accept: can be changed depending on the type of resource (images, script, json, xhr, etc.).
- content-type: set based on the request body and the target website's format.
- user-agent: Set a custom user-agent only when the target website requires it, as the user agent is already managed by ASP for optimal bypass.
  - Chrome-based user agents are ignored and replaced by the one provided for the fingerprint.
  - Non-Chrome user agents are left untouched.
Country: Based on the target website's location and usual traffic, ASP might fine-tune the proxy country. If you set the country explicitly, ASP will respect it.
OS: To align the fingerprint for optimal bypass, we may change the OS and related headers based on the exit proxy hardware.
Body: JSON bodies are re-encoded to produce the same serialized output as a real web browser.
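As an illustration of the header rules above, the following sketch builds a call that suppresses the referer and sets a non-Chrome user agent. It assumes custom headers are passed as `headers[name]=value` query parameters, as in the `headers[referer]` usage shown elsewhere on this page. The helper name `scrape_url_with_headers` and the `MyCrawler/1.0` user agent are made up for the example.

```ruby
require "uri"

# Hypothetical helper: maps a hash of header names to headers[name]=value
# query parameters and encodes everything once.
def scrape_url_with_headers(api_key, target_url, headers)
  query = { "key" => api_key, "url" => target_url, "asp" => "true" }
  headers.each { |name, value| query["headers[#{name}]"] = value }
  "https://api.scrapfly.io/scrape?" + URI.encode_www_form(query)
end

puts scrape_url_with_headers(
  "__API_KEY__",
  "https://httpbin.dev/anything",
  "referer" => "none",             # "none" asks ASP not to send a referer header
  "user-agent" => "MyCrawler/1.0"  # placeholder non-Chrome value, left untouched by ASP
)
```

The square brackets in the parameter names are percent-encoded as `%5B` and `%5D` by `URI.encode_www_form`, which is standard form encoding.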

ASP Limitations

While popular anti-bot vendors can be bypassed without any additional effort, there are still some areas that require manual configuration of calls.

For best results, it's important to understand how the target websites work and replicate their behavior in scraping calls. ASP handles bot detection, but it's up to the user to configure last-mile settings to avoid identification through usage patterns.

How to avoid anti-bot detection on POST requests

Avoiding anti-bot detection on a POST request can be tricky, but there are some key areas to focus on:

Mimic a real user's behavior: Anti-bot systems often check for unusual behavior that may indicate a bot, such as a high number of requests from the same IP address or at the same time. You can mimic a real user's behavior by visiting some pages first to retrieve navigation cookies and referer URLs.
Handling CSRF: CSRF tokens, which protect against Cross-Site Request Forgery (CSRF), are required by many websites and also act as an anti-bot measure. Retrieve the token from the page before submitting the POST request.

For more, see these tutorials and resources:
- CSRF header tutorial on Scrapfly's Scrapeground.
- introduction to headers in scraper blocking blog post.
Use realistic headers: Anti-bot systems can detect bots by looking at the headers of the requests. Replicate the headers of a real user's request as closely as possible, including the Accept, Content-Type, Referer and Origin headers. Make sure the values of Accept and Content-Type match the content you send and expect (json, html).

For more, see these tutorials and resources:
- Referer header tutorial on Scrapfly's Scrapeground.
- introduction to headers in scraper blocking blog post.
Authentication: If the website requires authentication, make sure you include the correct credentials in your request. This might involve logging in to the website first, then including the session cookie or token in your POST request. ASP cannot manage authentication for you; if the API or website requires it, you must handle it on your side.

For more, see these tutorials and resources:
- Cookies authentication tutorial on Scrapfly's Scrapeground.

Overall, the key to bypassing anti-bot measures on a POST request is to replicate the headers, cookies, and authentication of a regular browser request as closely as possible. This requires careful inspection of the website's code and network traffic to identify the required elements.
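The points above can be sketched as a single POST call with realistic headers. This example assumes that the scrape endpoint forwards the HTTP method and body of the API call to the target, and that headers are passed as `headers[name]=value` parameters; verify both against the API reference. The helper name `build_post_scrape`, the payload, and the header values are hypothetical.

```ruby
require "uri"
require "net/http"
require "json"

# Hypothetical helper: builds (but does not send) a POST scrape request whose
# JSON body and headers mimic what a browser would send to the target.
def build_post_scrape(api_key, target_url, payload, headers)
  query = { "key" => api_key, "url" => target_url, "asp" => "true" }
  headers.each { |name, value| query["headers[#{name}]"] = value }
  uri = URI("https://api.scrapfly.io/scrape?" + URI.encode_www_form(query))

  request = Net::HTTP::Post.new(uri)
  request.body = JSON.generate(payload)
  request
end

request = build_post_scrape(
  "__API_KEY__",
  "https://httpbin.dev/anything",
  { "query" => "shoes" },
  "content-type" => "application/json",
  "accept" => "application/json",
  "origin" => "https://httpbin.dev",
  "referer" => "https://httpbin.dev/"
)
puts request.method
puts request.body

# To send it:
#   http = Net::HTTP.new(request.uri.host, request.uri.port)
#   http.use_ssl = true
#   puts http.request(request).read_body
```

Session cookies or CSRF tokens obtained from earlier navigation would be added to the same headers hash before building the request.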

Website with Private/Hidden API

Scraping a private API can be a bit more challenging than scraping public APIs. Here are some recommendations to follow:

Make sure you have permission: Before scraping any private API, make sure you have the necessary permission from the website owner or API provider. Scraping a private API without permission can result in legal consequences.
Mimic a real user: When scraping a private API, it's important to mimic a real user as closely as possible. This means sending the same headers and parameters that a real user would send when accessing the API.
Use authentication: Most private APIs require some form of authentication, such as a token or API key. Make sure you obtain the necessary credentials and use them in your requests.
Monitor for changes: Private APIs can change over time, so it's important to monitor for any changes in the API's structure or authentication requirements. If you notice any changes, update your scraping code accordingly.

Overall, scraping private APIs requires more attention to detail and careful configuration of requests. Following these recommendations can help ensure a successful and ethical scraping process.

Maximize Your Success Rate

Network Quality

In many cases, datacenter IPs are sufficient. However, anti-bot vendors may check the origin of the IP when protecting websites, to determine if the traffic comes from a datacenter or a regular connection. In such cases, residential networks can provide a better IP reputation, as they are registered under a regular ASN that helps control the origin of the IP.

API Usage: proxy_pool=public_residential_pool, check out the related documentation

Use a Browser

Most anti-bot systems check the browser fingerprint and JavaScript engine to generate detection metrics.

API Usage: render_js=true, check out the related documentation

Verify Cookies and Headers

Observe the headers and cookies of regular calls that succeed to figure out whether you need to add extra headers or retrieve specific cookies to authenticate. You can use your browser's dev tools to inspect the network activity.

API Usage: headers[referer]=https%3A%2F%2Fexample.com (value is URL-encoded), check out the related documentation

Navigation Coherence

To ensure navigation coherence when scraping unofficial APIs, you may need to obtain cookies from your navigation. One way to do this is by enabling session and rendering JavaScript during the initial scrape to retrieve cookies. Once the cookies are stored in your session, you can continue scraping without rendering JavaScript while still applying the previously obtained cookies for consistency. The following Scrapfly features help achieve that:

API Usage: session=my-unique-session-name, check out the related documentation

API Usage: render_js=true, check out the related documentation
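The two-step flow described above can be sketched as follows: a first call with a session and JavaScript rendering to collect cookies, then a lighter call that reuses the same session without rendering. The helper name `session_scrape_url` and the `https://example.com/api/items` endpoint are hypothetical; only the `session` and `render_js` parameters come from this page.

```ruby
require "uri"

# Hypothetical helper: builds a scrape URL bound to a named session,
# optionally with JavaScript rendering enabled.
def session_scrape_url(api_key, target_url, session, render_js:)
  query = { "key" => api_key, "url" => target_url, "asp" => "true", "session" => session }
  query["render_js"] = "true" if render_js
  "https://api.scrapfly.io/scrape?" + URI.encode_www_form(query)
end

session = "my-unique-session-name"

# Step 1: render the landing page in a browser so the session stores its cookies.
puts session_scrape_url("__API_KEY__", "https://example.com/", session, render_js: true)

# Step 2: call the hidden API with the same session, without rendering JavaScript.
puts session_scrape_url("__API_KEY__", "https://example.com/api/items", session, render_js: false)
```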

Geo Blocking

When browsing certain websites, users may encounter blocks based on their IP location. Scrapfly can bypass this issue by default, as it selects a random country from its pool. However, specifying the country based on the location of the website can be a helpful way to avoid geo-blocking.

API Usage: country=us, check out the related documentation
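The success-rate parameters from this section can be combined in one call. The sketch below uses a hypothetical helper, `tuned_scrape_url`, with illustrative option names; the parameter names and values (`country`, `proxy_pool=public_residential_pool`, `render_js`) are the ones listed above.

```ruby
require "uri"

# Hypothetical helper: adds the optional success-rate parameters only when requested.
def tuned_scrape_url(api_key, target_url, country: nil, residential: false, render_js: false)
  query = { "key" => api_key, "url" => target_url, "asp" => "true" }
  query["country"] = country if country
  query["proxy_pool"] = "public_residential_pool" if residential
  query["render_js"] = "true" if render_js
  "https://api.scrapfly.io/scrape?" + URI.encode_www_form(query)
end

puts tuned_scrape_url("__API_KEY__", "https://example.com/", country: "us", residential: true, render_js: true)
```

Since residential proxies and browser rendering affect the credit cost, it may be reasonable to enable them only for targets where a plain call fails.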

Pricing

Our Anti-Scraping Protection (ASP) solution is designed to adapt to the various anti-scraping measures implemented on different websites. To achieve this, ASP dynamically fine-tunes your configuration parameters based on the target and the anti-scraping solution in place, and this can have an impact on pricing.

The main factors affecting the API cost are:

Browser Usage
Proxy Pool
Target/Shield

You will find the pricing grid for browser usage and proxy network type. Fees for specific targets/shields are not publicly documented; only very specific ones carry a fee, and otherwise there is no additional cost (those fees are displayed in the cost section of your log). To get the full detail of the cost, you can check the dedicated troubleshooting section.

To ensure predictability and control of your spending, we recommend creating an account and monitoring the usage costs as you gradually increase your volume. You can also set an API budget on a scrape call with cost_budget=25. Once you have determined the actual cost, you can check our set of tools to make it more predictable and stay within budget.

ASP is free on non-blocked scrapes
If you scrape various websites and don't know which ones are protected, just keep ASP enabled; no extra cost is applied to non-protected traffic.

Furthermore, when ASP is enabled, many things are handled automatically through the fine-tuning of parameters to prevent detection, which results in savings.

Integration

ASP example with Python SDK

Related Errors

All related errors are listed below. You can see the full description and an example error response in the Errors section. You can also check the troubleshooting section if you have timeout issues with ASP.

ERR::ASP::CAPTCHA_ERROR - Something went wrong with the captcha. We will fix the problem as soon as possible
ERR::ASP::CAPTCHA_TIMEOUT - The time budgeted to solve the captcha has been reached
ERR::ASP::SHIELD_ERROR - ASP encountered an unexpected problem. We will fix it as soon as possible. Our team has been alerted
ERR::ASP::SHIELD_EXPIRED - The ASP shield previously set has expired; you must retry
ERR::ASP::SHIELD_NOT_ELIGIBLE - The requested feature is not eligible while using ASP for the given protection/target
ERR::ASP::SHIELD_PROTECTION_FAILED - The ASP shield failed to solve the challenge against the anti-scraping protection
ERR::ASP::TIMEOUT - ASP took too much time to solve or respond
ERR::ASP::UNABLE_TO_SOLVE_CAPTCHA - Despite our efforts, we were unable to solve the captcha. This can happen sporadically; please retry
ERR::ASP::UPSTREAM_UNEXPECTED_RESPONSE - The response given by the upstream after challenge resolution is not the expected one. Our team has been alerted
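Two of the errors above explicitly ask for a retry. A minimal retry loop built around that hint is sketched below. It is an assumption-based sketch and not an official retry policy: the names `RETRYABLE_ASP_ERRORS` and `with_asp_retries` are hypothetical, and extracting the error code from the API response is left to the caller's block.

```ruby
# Error codes whose descriptions above ask for a retry. Treating only these
# as retryable is an assumption made for this sketch.
RETRYABLE_ASP_ERRORS = [
  "ERR::ASP::SHIELD_EXPIRED",
  "ERR::ASP::UNABLE_TO_SOLVE_CAPTCHA"
].freeze

# Hypothetical helper: calls the block until it succeeds, hits a non-retryable
# error, or runs out of attempts. The block receives the attempt number and
# must return nil on success or an ASP error code string on failure.
# Returns :ok on success, otherwise the last error code.
def with_asp_retries(max_attempts: 3)
  attempt = 0
  loop do
    attempt += 1
    error_code = yield(attempt)
    return :ok if error_code.nil?
    return error_code unless RETRYABLE_ASP_ERRORS.include?(error_code) && attempt < max_attempts
  end
end

# Simulated scrape: the first attempt fails with an expired shield, the second succeeds.
outcome = with_asp_retries { |attempt| attempt < 2 ? "ERR::ASP::SHIELD_EXPIRED" : nil }
puts outcome
```

In real use, the block would perform the scrape call and return the error code found in the response, if any.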