Anti Scraping Protection (ASP)

SLA and Service Interruption
Service interruptions may occasionally occur for reasons outside Scrapfly's control. Anti-scraping protection technology evolves constantly, and Scrapfly engineers work hard to keep up with the latest changes. Implementing a reliable, production-grade remedy can take anywhere from hours to weeks. Bear this in mind and design your software accordingly when using this feature.

Please note that:
  • We cannot provide an ETA for service restoration because of the R&D involved. However, given the volume we handle and the number of corporate accounts we serve, most incidents involving well-known anti-bot solutions are resolved within 1 business day; other incidents take around 3 to 7 business days on average.
  • The API Credit cost may fluctuate if a website introduces new protection(s) or migrates to another anti-bot solution, because the underlying resources required to handle it can change (residential network, browser usage, custom solution).
  • SLA plans are available from a minimum commitment of $50k/month.

Scrapfly's Anti-Scraping Protection is designed to unblock protected websites that are inaccessible to bots. We accomplish this by incorporating various concepts that help maintain a coherent fingerprint, making it as close to that of a real user as possible when scraping a website.

If you are interested in understanding the technical aspects of how we achieve this undetectability, we have published a series of articles on the subject available in the learning resources section below.

Usage and Abuse Limitation
To summarize our TOS, the following uses are prohibited:
  • Automated Online Payment
  • Account Creation
  • Spam Posts
  • Vote Falsification
  • Credit Card Testing
  • Login Brute Force
  • Referral / Gifting systems
  • Ads fraud
  • Banks
  • Ticketing (Automated Buying System)
  • Betting, Casino, Gambling
ASP may be authorized for cybersecurity firms (red teams) once they have obtained approval from the relevant parties for the specific domains they wish to test.

Scrapfly is capable of identifying and resolving obstacles posed by commonly used anti-scraping measures. Our platform also provides support for custom anti-scraping measures implemented by popular websites. Scrapfly ASP bypass does not require any extra input from you, and you will receive successful responses automatically.

Usage

When ASP is enabled, the anti-bot vendor is detected automatically and everything needed to bypass it is handled for you.

require "uri"
require "net/http"

# asp=true enables Anti Scraping Protection; the target URL is percent-encoded.
url = URI("https://api.scrapfly.io/scrape?asp=true&key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything")

# Open an HTTPS connection to the Scrapfly API host.
https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)

# Send the request and print the raw response body.
response = https.request(request)
puts response.read_body

Request URL breakdown:

  • Host: api.scrapfly.io
  • Path: /scrape
  • asp = true
  • key = __API_KEY__
  • url = https://httpbin.dev/anything (percent-encoded in the query string)

ASP fine-tunes some parameters regardless of user configuration. These changes can increase the price; see the Pricing section for details.

  • Proxy Pool: ASP can access private pools that we maintain (residential/datacenter) and can upgrade to an undetected network.
  • Browser Usage: If disabled, ASP might enable it.
  • Headers: Browser headers that you set are ignored, except for a non-generic browser user-agent and the referer; the rest are auto-tuned based on the resource type (image, file, HTML, etc.). We may also add custom headers if the target or the challenge method requires them.
    • referer: auto-generated if not present. Pass none as the header value to send no referer header to the target website.
    • cookie: ASP handles session usage automatically and reuses challenge cookies for faster results.
    • accept: may be changed according to the type of resource (images, script, JSON, XHR, etc.).
    • content-type: set based on the request body and the target website's format.
    • user-agent: Set a custom user-agent only when the target requires it; otherwise the user agent is already managed by ASP and matches the fingerprint used for the scrape.
      • Chrome-based user agents are ignored and replaced by the one provided for the fingerprint.
      • Non-Chrome user agents are left untouched.
  • Country: Based on the target website's location and usual traffic, we fine-tune the country. If you set a specific country, we respect it.
  • OS: To align the fingerprint end to end, we may change the OS and related headers based on the exit proxy hardware.
  • Body: JSON is re-encoded to produce the same serialized output as a browser.
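
The header rules above can be sketched as follows. This is a minimal illustration, not an official client: scrape_uri is a hypothetical helper, the headers[name]=value syntax is taken from the headers[referer] example later in this page and is assumed to apply to user-agent as well, and MyCrawler/1.0 is a placeholder value.

```ruby
require "uri"

# Hypothetical helper for this example (not part of a Scrapfly SDK): builds a
# scrape URL with ASP enabled plus any extra query parameters.
def scrape_uri(target, extra = {})
  query = { "key" => "__API_KEY__", "asp" => "true", "url" => target }.merge(extra)
  URI("https://api.scrapfly.io/scrape?" + URI.encode_www_form(query))
end

target = "https://httpbin.dev/anything"

# Pass "none" so that no referer header is sent to the target website.
no_referer = scrape_uri(target, "headers[referer]" => "none")

# A non-Chrome user agent is left untouched by ASP. "MyCrawler/1.0" is a
# placeholder; set a custom value only when the target requires it.
custom_ua = scrape_uri(target, "headers[user-agent]" => "MyCrawler/1.0")

puts no_referer
puts custom_ua
```

URI.encode_www_form percent-encodes the square brackets in the parameter names; they decode back to headers[referer] and headers[user-agent] on the receiving side.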

ASP Limitations

While popular anti-bot vendors can be bypassed without any additional effort, there are still some areas that require manual configuration of calls.

For best results, it's important to understand how the target website works and to replicate its behavior in your scraping calls. ASP handles bot detection; it is up to you to configure the last-mile settings that avoid identification through usage patterns.

How to Avoid Anti-Bot Detection on POST Requests

Avoiding anti-bot detection on a POST request can be tricky, but there are some key areas to focus on:

  1. Mimic a real user's behavior: Anti-bot systems often check for unusual behavior that may indicate a bot, such as a high number of requests from the same IP address or at the same time. You can mimic a real user's behavior by visiting a few pages first to collect navigation cookies and referer URLs.

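  2. Handle CSRF: Cross-Site Request Forgery (CSRF) protection is common on websites and rejects POST requests that lack a valid token. Retrieve the token from a previous page (typically a hidden form field, meta tag, or cookie) and send it with your POST request.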

  3. Use realistic headers: Anti-bot systems can detect bots by looking at the headers of the requests. Replicate the headers of a real user's request as closely as possible, including the Accept, Content-Type, Referer and Origin headers. Set Accept and Content-Type to match the content you send and expect (JSON, HTML).

  4. Authentication: If the website requires authentication, make sure you include the correct credentials in your request. This might involve logging in to the website first, then including the session cookie or token in your POST request. ASP cannot manage authentication for you; if the API or website requires it, you must handle it on your side.

Overall, the key to bypassing anti-bot measures on a POST request is to replicate the headers, cookies, and authentication of a regular browser request as closely as possible. This requires careful inspection of the website's code and network traffic to identify the required elements.
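
A minimal sketch of such a POST call is shown below. It rests on assumptions that this page does not confirm: that the scrape endpoint forwards the HTTP method and body of the API call to the target, and that every target header can be passed as headers[name]=value. The build_post_scrape helper, the x-csrf-token header name, the payload and the token value are all hypothetical placeholders. The request is only built and inspected here, not sent.

```ruby
require "uri"
require "net/http"
require "json"

# Assumptions (not confirmed by the text above): the scrape endpoint forwards
# the HTTP method and body of the API call to the target, and target headers
# are passed as headers[name]=value. The CSRF header name is hypothetical.
def build_post_scrape(target, payload, referer:, csrf_token:)
  ref = URI(referer)
  query = URI.encode_www_form(
    "key" => "__API_KEY__",
    "asp" => "true",
    "url" => target,
    "headers[content-type]" => "application/json",
    "headers[accept]" => "application/json",
    "headers[referer]" => referer,
    "headers[origin]" => "#{ref.scheme}://#{ref.host}",
    "headers[x-csrf-token]" => csrf_token
  )
  request = Net::HTTP::Post.new(URI("https://api.scrapfly.io/scrape?#{query}"))
  request.body = JSON.generate(payload)
  request
end

request = build_post_scrape(
  "https://httpbin.dev/anything",
  { "query" => "shoes" },
  referer: "https://httpbin.dev/search",
  csrf_token: "token-from-previous-page"
)

puts request.method
puts request.body
```

The request can then be sent with Net::HTTP#request over an SSL-enabled connection, as in the GET example in the Usage section.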

Website with Private/Hidden API

Scraping a private API can be a bit more challenging than scraping public APIs. Here are some recommendations to follow:

  1. Make sure you have permission: Before scraping any private API, make sure you have the necessary permission from the website owner or API provider. Scraping a private API without permission can result in legal consequences, depending on the type of data exposed.
  2. Mimic a real user: When scraping a private API, it's important to mimic a real user as closely as possible. This means sending the same headers and parameters that a real user would send when accessing the API.
  3. Use authentication: Most private APIs require some form of authentication, such as a token or API key. Make sure you obtain the necessary credentials and use them in your requests.
  4. Monitor for changes: Private APIs can change over time, so it's important to monitor for any changes in the API's structure or authentication requirements. If you notice any changes, update your scraping code accordingly.

Overall, scraping private APIs requires more attention to detail and careful configuration of requests. Following these recommendations can help ensure a successful and ethical scraping process.

Maximize Your Success Rate

Network Quality

In many cases, datacenter IPs are sufficient. However, anti-bot vendors may check the origin of the IP when protecting websites, to determine whether the traffic comes from a datacenter or a regular connection. In such cases, residential networks can provide a better IP reputation, as they are registered under a regular ASN, which is what vendors check to determine the origin of the IP.

API Usage: proxy_pool=public_residential_pool, check out the related documentation

Use a Browser

Most anti-bot solutions check the browser fingerprint and JavaScript engine to generate a proof of legitimacy.

API Usage: render_js=true, check out the related documentation
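
One way to combine the network and browser settings is an escalation ladder that starts with the cheapest configuration and only moves to the next tier when a scrape is blocked. The sketch below just builds the URL for each tier; scrape_uri and TIERS are hypothetical names, and detecting a blocked response is left out. ASP may already apply these upgrades on its own, so treat this as an illustration of the parameters rather than a required pattern.

```ruby
require "uri"

# Hypothetical helper for this example (not part of a Scrapfly SDK).
def scrape_uri(target, extra = {})
  query = { "key" => "__API_KEY__", "asp" => "true", "url" => target }.merge(extra)
  URI("https://api.scrapfly.io/scrape?" + URI.encode_www_form(query))
end

# Escalation ladder, cheapest first. The first tier adds nothing and relies
# on the API defaults.
TIERS = [
  {},
  { "proxy_pool" => "public_residential_pool" },
  { "proxy_pool" => "public_residential_pool", "render_js" => "true" }
].freeze

attempts = TIERS.map { |tier| scrape_uri("https://httpbin.dev/anything", tier) }
attempts.each_with_index { |uri, i| puts "attempt #{i + 1}: #{uri}" }
```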

Verify Cookies and Headers

Observe the headers and cookies of regular calls that succeed to figure out whether you need to add extra headers or retrieve specific cookies to authenticate. You can use your browser's developer tools to inspect the network activity.

API Usage: headers[referer]=https%3A%2F%2Fexample.com (value is URL-encoded), check out the related documentation

To ensure navigation coherence when scraping unofficial APIs, you may need to obtain cookies from your navigation. One way to do this is to enable a session and render JavaScript during the initial scrape to retrieve cookies. Once the cookies are stored in your session, you can continue scraping without rendering JavaScript while still applying the previously obtained cookies for consistency. The following Scrapfly features make this possible:

API Usage: session=my-unique-session-name, check out the related documentation
API Usage: render_js=true, check out the related documentation
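
The two-step flow described above can be sketched as follows. The scrape_uri helper and the target paths are hypothetical placeholders; only the session and render_js parameters come from this page. The code builds the URLs without sending them.

```ruby
require "uri"

SESSION = "my-unique-session-name"

# Hypothetical helper for this example (not part of a Scrapfly SDK).
def scrape_uri(target, extra = {})
  query = { "key" => "__API_KEY__", "asp" => "true", "url" => target }.merge(extra)
  URI("https://api.scrapfly.io/scrape?" + URI.encode_www_form(query))
end

# Step 1: render JavaScript once so that navigation cookies are stored in the
# named session.
warm_up = scrape_uri("https://httpbin.dev/anything",
                     "session" => SESSION, "render_js" => "true")

# Step 2: reuse the same session without a browser; the cookies collected in
# step 1 are applied to these calls. The paths are placeholders.
follow_ups = ["/anything/api?page=1", "/anything/api?page=2"].map do |path|
  scrape_uri("https://httpbin.dev#{path}", "session" => SESSION)
end

puts warm_up
follow_ups.each { |uri| puts uri }
```

Keeping the session name identical across both steps is what ties the follow-up calls to the cookies gathered during the rendered scrape.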

Geo Blocking

When browsing certain websites, users may encounter blocks based on their IP location. Scrapfly can bypass this issue by default, as it selects a random country from its pool. However, specifying the country based on the location of the website can be a helpful way to avoid geo-blocking.

API Usage: country=us, check out the related documentation
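
As a rough sketch of choosing the country from the website's location, the example below maps the target's top-level domain to a country code and leaves the parameter out when no mapping exists, so the default selection still applies. The TLD_COUNTRY table, both helper names and the example.fr / example.co.uk targets are hypothetical, and the codes other than us are assumed to follow the same two-letter lowercase format.

```ruby
require "uri"

# Hypothetical mapping from top-level domain to country code.
TLD_COUNTRY = { "fr" => "fr", "de" => "de", "uk" => "gb", "jp" => "jp" }.freeze

# Returns a country code for the target, or nil when the TLD is not mapped.
def country_for(target)
  TLD_COUNTRY[URI(target).host.split(".").last]
end

# Hypothetical helper: adds country only when one could be derived.
def scrape_uri(target)
  query = { "key" => "__API_KEY__", "asp" => "true", "url" => target }
  country = country_for(target)
  query["country"] = country if country
  URI("https://api.scrapfly.io/scrape?" + URI.encode_www_form(query))
end

puts scrape_uri("https://example.fr/")
puts scrape_uri("https://example.co.uk/")
puts scrape_uri("https://httpbin.dev/anything")
```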

Pricing

Our Anti-Scraping Protection (ASP) solution is a sophisticated tool for bypassing the anti-scraping measures that websites put in place. It is designed to adapt to the various anti-scraping measures implemented on different websites. To achieve this, ASP dynamically fine-tunes your configuration parameters based on the target and the anti-scraping solution in place, and this can have an impact on pricing.

The main factors that affect the API cost are:

  • Browser Usage
  • Proxy Pool
  • Target/Shield

The pricing grid covers browser usage and proxy network type. Fees for specific targets/shields are not publicly documented; only a few very specific ones carry a fee, and otherwise there is no additional cost (these fees are displayed in the cost section of your log). For the full cost breakdown, see the dedicated troubleshooting section.

To ensure predictability and control of your spending, we recommend creating an account and monitoring the usage costs as you gradually increase your volume. You can also set an API budget on a scrape call with cost_budget=25. Once you have determined the actual cost, you can check our set of tools to make it more predictable and stay within budget.
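
A minimal sketch of a budgeted call is shown below. The scrape_uri helper is hypothetical; cost_budget=25 is the value from the paragraph above. What the API returns when the budget is exceeded is not covered here.

```ruby
require "uri"

# Hypothetical helper for this example (not part of a Scrapfly SDK): adds
# cost_budget only when a budget is given.
def scrape_uri(target, cost_budget: nil)
  query = { "key" => "__API_KEY__", "asp" => "true", "url" => target }
  query["cost_budget"] = cost_budget.to_s if cost_budget
  URI("https://api.scrapfly.io/scrape?" + URI.encode_www_form(query))
end

capped = scrape_uri("https://httpbin.dev/anything", cost_budget: 25)
uncapped = scrape_uri("https://httpbin.dev/anything")

puts capped
puts uncapped
```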

It's totally free on non-blocked scrapes

If you scrape various websites and don't know which ones are protected, just keep ASP enabled; no extra cost is applied to non-protected traffic.

Furthermore, when ASP is enabled, many things are handled automatically through parameter fine-tuning to prevent detection, which results in savings.

Integration

All related errors are listed below. You can see the full description and an example error response in the Errors section. You can also check the troubleshooting section if you have timeout issues with ASP.

Learning Resources
