Troubleshooting
Troubleshooting Scrapfly scrapers is relatively easy as Scrapfly provides extensive logging. Let's take a look at some examples.
To start, the Monitoring dashboard displays all scrape requests and their details. Inspecting it is the first step in troubleshooting, and we'll cover this extensively in the following sections.
For additional details, enable the debug parameter, which provides more information about scrape requests and even captures screenshots when render_js is enabled.
Replicating with Player
The first step to figuring out why a scrape request is not working as expected is to replicate it in the Web Player, which allows easy and reliable scrape configuration.
If the Web Player works as expected, then the issue is likely related to the API call configuration. Ensure that the API call is configured as per the API Specification. In particular, note that some API parameters need to be URL encoded.
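As a minimal sketch of URL encoding API parameters, the snippet below builds an API call URL with Python's standard library. The endpoint, the `key` and `url` parameter names, and the placeholder key are assumptions made for illustration; verify them against the API Specification.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names, for illustration only.
API_ENDPOINT = "https://api.scrapfly.io/scrape"


def build_api_url(api_key, target_url, **options):
    """Return an API call URL with every parameter value URL encoded."""
    params = {"key": api_key, "url": target_url, **options}
    # urlencode percent-encodes reserved characters such as ?, & and %
    return f"{API_ENDPOINT}?{urlencode(params)}"


api_url = build_api_url(
    "YOUR_API_KEY",  # placeholder, not a real key
    "https://web-scraping.dev/product/2?variant=one&COLOR=dark%20blue",
    render_js="true",
)
print(api_url)
```

Note that the target URL's own `%20` becomes `%2520` once encoded, so the target website still receives `dark%20blue` after the API decodes the parameter.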
Timeout
If you receive a lot of ERR::SCRAPE::OPERATION_TIMEOUT errors:
- If retry=false is set, you need to increase the timeout parameter.
- If asp=true is set together with retry=false, some protections take time to bypass. We recommend setting timeout=60000 or even timeout=90000.
- If the URL you hit is redirected, such as a URL ending with / or www, try to avoid the redirect by providing the direct URL. Each redirection slows the scrape.
- If you use asp=true, try to set by default the parameters that ASP fine-tunes to bypass the protection. For example, if you use proxy_pool=public_datacenter_pool whereas all scrapes under ASP end up using residential proxies (you can see this on the log page, in the right section), setting the residential pool yourself will be faster. The same applies to browser usage. The less ASP has to fine-tune your parameters, the faster the scrape is.
See the full documentation on the timeout mechanism for more details.
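The timeout advice above can be sketched as a small parameter builder. The helper name `timeout_params` and the string formatting of the values are hypothetical; only the `asp`, `retry` and `timeout` parameter names and the 60000 value come from the text.

```python
# Minimum timeout recommended in the text for asp=true with retry=false.
ASP_MIN_TIMEOUT_MS = 60000


def timeout_params(asp, retry, timeout_ms=None):
    """Build timeout-related scrape parameters (hypothetical helper)."""
    params = {
        "asp": "true" if asp else "false",
        "retry": "true" if retry else "false",
    }
    if not retry:
        if asp:
            # Protection bypass can be slow, so never go below the minimum.
            timeout_ms = max(timeout_ms or 0, ASP_MIN_TIMEOUT_MS)
        if timeout_ms:
            params["timeout"] = str(timeout_ms)
    return params


print(timeout_params(asp=True, retry=False))
print(timeout_params(asp=True, retry=False, timeout_ms=90000))
```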
Confirming Scrape Instructions
Another way to troubleshoot this is to check the Monitoring dashboard and validate that Scrapfly received the same scrape instructions as you intended.
If there's a mismatch, it's likely that the API call was incorrectly configured. If that's not the case then let's proceed further and take a look at potential causes.
Note that Scrapfly can alter parts of scrape details to bypass scraper blocking when Anti Scraping Protection bypass is enabled.
Unexpected Results
The results you see in your browser are not always easy to replicate in a scraper. Here are the main causes that could explain missing or different data:
Status Code Overview
The first step is to ensure that the scrape request resulted in a successful response. To do this, check the result.status_code field of the response.
Here's a quick summary of the most common status codes and the common causes of each:
200
All 200 range codes (200, 201, 202 etc.) are considered success. If that's the case skip to the Missing Scrape Details section below.
400
This status indicates a client error (Bad Request) and likely means the scrape request is misconfigured:
- Check the scrape URL
- Check the scrape request headers
- Check the scrape request body if it's a POST or PUT type request
- See the Missing Scrape Details section below
404
The page is not found, which likely means the scrape URL is invalid. A typo in the URL is the most likely cause, though this can also mean request misconfiguration. See the Missing Scrape Details section below.
410
The page is gone, which likely means the scrape URL has become invalid. This is common when scraping pages with an expiration date, like second-hand listings or advertisements.
405
The page doesn't accept the current HTTP method. This can be caused by sending a POST-type request to a GET endpoint and vice versa.
406
The page doesn't accept the current content type. This can be caused by sending a JSON-type request to an HTML endpoint and vice versa, or by setting an invalid Accept header.
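The status code summary above can be turned into a quick triage helper. This is a sketch that assumes the API response has already been decoded into a dictionary with a `result.status_code` field; the `diagnose_status` name and the hint wording are illustrative.

```python
# Hints condensed from the status code summary above.
STATUS_HINTS = {
    400: "client error: check the URL, headers and request body",
    404: "not found: check the URL for typos or misconfiguration",
    405: "method not allowed: check GET vs POST",
    406: "not acceptable: check the content type and Accept header",
    410: "gone: the page has likely expired",
}


def diagnose_status(api_response):
    """Return a troubleshooting hint for a decoded API response dict."""
    status = api_response["result"]["status_code"]
    if 200 <= status < 300:
        return "success: if data is missing, see Missing Scrape Details"
    return STATUS_HINTS.get(status, f"no hint for status code {status}")


# Made-up sample responses for illustration.
print(diagnose_status({"result": {"status_code": 201}}))
print(diagnose_status({"result": {"status_code": 405}}))
```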
Missing Scrape Details
The most common reason for missing data is that the scraper is missing configuration details that are required to fully load the page.
When replicating requests from a web browser in Scrapfly, it's important to match all request details.
URL Parameters
URL parameters are everything after the ? symbol, and ideally they should match what we see in the browser.
For example, in the URL https://web-scraping.dev/product/2?variant=one&COLOR=dark%20blue we should keep both the variant=one and COLOR=dark%20blue parameters in the scrape request exactly as they appear in the URL, including:
- Parameter order. Here, "variant" then "COLOR"
- Parameter name spelling and case. Here, "COLOR" is uppercase
- Parameter value encoding and formatting (if any). Here, the color value is url encoded as "dark%20blue"
Some non-functional parameters, like analytics tracking parameters, can be left out.
These parameters often appear as nonsensical IDs (like ?tid=cfa44df). To confirm that a parameter is non-functional, remove it and check that the website behaves the same.
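A minimal sketch of removing a tracking parameter while leaving the order, case and encoding of the remaining parameters untouched. The `drop_params` helper is hypothetical. It works on the raw query string on purpose: decoding and re-encoding (for example with `parse_qsl` and `urlencode`) would turn `dark%20blue` into `dark+blue`.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_params(url, names_to_drop):
    """Remove the named query parameters without re-encoding the rest."""
    parts = urlsplit(url)
    kept = [
        pair
        for pair in parts.query.split("&")
        # Compare only the parameter name, case-sensitively.
        if pair and pair.split("=", 1)[0] not in names_to_drop
    ]
    return urlunsplit(parts._replace(query="&".join(kept)))


cleaned = drop_params(
    "https://web-scraping.dev/product/2?variant=one&tid=cfa44df&COLOR=dark%20blue",
    {"tid"},
)
print(cleaned)
```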
Request Headers
While Scrapfly configures all headers related to fingerprinting and blocking, it cannot predict all custom headers for all websites.
This is particularly important when websites use custom headers, which are usually identified by the X- prefix. These should be replicated and included in scrape requests.
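As a sketch, custom headers can be picked out of a header set copied from the browser's developer tools by their X- prefix. The `custom_headers` helper and the sample header names and values are made up for illustration.

```python
def custom_headers(browser_headers):
    """Keep only headers whose name starts with X- (case-insensitive)."""
    return {
        name: value
        for name, value in browser_headers.items()
        if name.lower().startswith("x-")
    }


# Made-up headers as they might be copied from a browser request.
copied = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
    "X-CSRF-Token": "example-token",
    "x-requested-with": "XMLHttpRequest",
}
print(custom_headers(copied))
```

Not every custom header uses the X- prefix, so the remaining headers are still worth comparing by hand.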
Related
Scrapeground CSRF Exercise
Cookies
Some pages can be cookie-locked and require cookies set from previous requests. The easiest way to handle this is to use Scrapfly Session and request the pages in the required order.
Related
Scrapeground Cookies Exercise
Dynamic Websites
The website could be loaded dynamically through browser JavaScript. If you're scraping without the render_js parameter, Scrapfly does not execute JavaScript, which can cause this difference in data.
If you are using render_js, ensure the scraper waits for the website to fully load by using the wait_for_selector or rendering_wait parameters.
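A small sketch of how these rendering parameters might be combined. The `rendering_params` helper is hypothetical, and the value formats (a `"true"` string, a CSS selector, and a wait expressed in milliseconds) are assumptions to be checked against the API Specification.

```python
def rendering_params(wait_for_selector=None, rendering_wait_ms=None):
    """Build JavaScript rendering parameters (hypothetical helper)."""
    params = {"render_js": "true"}
    if wait_for_selector:
        # Wait until this element appears on the page.
        params["wait_for_selector"] = wait_for_selector
    if rendering_wait_ms:
        # Assumed to be a fixed wait in milliseconds.
        params["rendering_wait"] = str(rendering_wait_ms)
    return params


# ".product" is a made-up selector for illustration.
print(rendering_params(wait_for_selector=".product"))
```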
Geo Location
Data could be different or missing because of a different proxy country.
By default, Scrapfly selects a fitting proxy randomly, which might not always match the desired scrape region. Try changing the proxy country parameter to match your region, e.g. country=us.
Scraper Blocking
Another cause could be anti-bot measures which are designed to block scrapers. For this, make sure Anti Scraping Protection bypass is enabled.
Additionally, you can improve success chances by:
Errors
Scrapfly uses a comprehensive error code system that indicates exactly what went wrong with a scrape request. Error details can be accessed through the result.error field of the Scrapfly response.
This field contains all of the information needed to troubleshoot the error.
In particular, the error code and http_code can be used to troubleshoot any error page.
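A sketch of reading the error details from a decoded response. It assumes `result.error` is a dictionary with `code` and `http_code` keys; the `summarize_error` helper and the sample payload are illustrative, and the sample values are not meant to reflect a real code pairing.

```python
def summarize_error(api_response):
    """Return error code details, or None when no error is present."""
    error = api_response.get("result", {}).get("error")
    if not error:
        return None
    code = error.get("code", "")
    # Codes such as ERR::SCRAPE::OPERATION_TIMEOUT are separated by "::".
    parts = code.split("::")
    return {
        "code": code,
        "http_code": error.get("http_code"),
        "category": parts[1] if len(parts) > 1 else None,
    }


# Made-up sample payload for illustration only.
sample = {
    "result": {
        "error": {"code": "ERR::SCRAPE::OPERATION_TIMEOUT", "http_code": 422}
    }
}
print(summarize_error(sample))
```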
The HTTP and error codes are defined in the Errors documentation or can be looked up using the command.
For example,
will look up all errors related to HTTP status code 422
and
will look up all pages related to this error.
Scraping Costs
To troubleshoot billing issues, first check the cost field in the Monitoring dashboard. This field breaks down all of the credit costs used for this scrape request.
Note that the scrape cost can vary for each scrape request depending on the scrape details. Many costs, like Anti Scraping Protection bypass, are only billed when their use is required, and the technology is reused when possible at a reduced charge.