Javascript Rendering

Scrapfly's headless browser feature is designed for web scraping targets that rely on javascript-rendered content. Each Scrapfly scrape request with the render_js option enabled runs on a dedicated cloud browser instance that is optimized for web scraping and responds quickly and reliably to web scraping actions.

Scrapfly's cache feature is backed by a global private CDN for efficiency, and the solution handles proxy peering out of the box.

Scraping using cloud browsers is slower and requires more proxy resources. The scraping time depends on factors such as the proxy location, website hosting distance, content size, number of resources, and amount of javascript.

When Javascript Rendering is enabled, Scrapfly also tracks web resources like:

  • Intermediate HTTP queries (request/response)
  • Local Storage
  • Session Storage
  • Screenshot (on demand)
  • Remote javascript execution result (on demand)
  • Websockets (upgrade request and data frames)

Javascript rendering also enables advanced Scrapfly features like Javascript Scenarios, which allow issuing common browser control commands like button clicks and form inputs.

When to Use Javascript Rendering?

Many modern websites require javascript to work and load content through javascript-powered techniques like XHR. The most reliable way to tell is to scrape without javascript rendering, check whether the desired content is present, and compare the result with javascript rendering enabled. Scrapfly's web player is a convenient way to experiment with various configurations in real time; the sketch below shows the same comparison done programmatically.
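
As a minimal sketch of that comparison, the target product URL and the marker text are illustrative assumptions, not values from this page:

require "uri"
require "net/http"

API_KEY = "__API_KEY__"
TARGET  = "https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1"  # URL-encoded target page (illustrative)
MARKER  = "Customer Reviews"                              # illustrative: text expected only after javascript runs

# Perform the same scrape with and without render_js and compare the results
def scrape(render_js)
  params = "key=#{API_KEY}&url=#{TARGET}"
  params += "&render_js=true" if render_js
  url = URI("https://api.scrapfly.io/scrape?#{params}")
  https = Net::HTTP.new(url.host, url.port)
  https.use_ssl = true
  https.request(Net::HTTP::Get.new(url)).read_body
end

puts "marker present without render_js: #{scrape(false).include?(MARKER)}"
puts "marker present with render_js:    #{scrape(true).include?(MARKER)}"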

Wait For Your Content

A rendering delay is vital when scraping slowly loading content, as dynamic pages take time to load fully.

For slower elements, Scrapfly browsers can be instructed to wait longer through an explicit delay or by waiting for a specific element to appear:

In this example, Scrapfly will wait 5s before extracting the content of the page. The rendering_wait parameter is expressed in milliseconds. The maximum allowed time to wait is 25s.

require "uri"
require "net/http"

url = URI("https://api.scrapfly.io/scrape?render_js=true&rendering_wait=5000&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")

https = Net::HTTP.new(url.host, url.port);
https.use_ssl = true

request = Net::HTTP::Get.new(url)

response = https.request(request)
puts response.read_body

When providing selectors, note that:

  • CSS selectors and XPath expressions are case-sensitive
  • Characters like ~, :, / need to be escaped with \\, e.g. #selector:1234 becomes #selector\\:1234

In this example, Scrapfly will wait until the product reviews load, indicated by the visible presence of an element matching the #reviews CSS selector. The selector watcher times out after 15s. Selectors are case-sensitive and need to be urlencoded (see the encoding sketch after this example).

require "uri"
require "net/http"

url = URI("https://api.scrapfly.io/scrape?render_js=true&wait_for_selector=%23reviews&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")

https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)

response = https.request(request)
puts response.read_body
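
The %23reviews value above is simply the URL-encoded form of the #reviews selector. As a minimal sketch, selectors (including the XPath and xhr: selectors used in the following examples) can be encoded in Ruby with CGI.escape:

require "cgi"

# URL-encode a selector before placing it in the wait_for_selector parameter
puts CGI.escape("#reviews")  # => "%23reviews"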

Alternatively, XPath selectors can be used; they are often preferred for their more advanced querying features, like selecting elements based on text content. For example, we can wait for reviews containing the word "delicious" to load: //div[contains(@class,"review")]/p[contains(text(),"delicious")]

require "uri"
require "net/http"

url = URI("https://api.scrapfly.io/scrape?render_js=true&wait_for_selector=%2F%2Fdiv%5Bcontains%28%40class%2C%22review%22%29%5D%2Fp%5Bcontains%28text%28%29%2C%22delicious%22%29%5D&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")

https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)

response = https.request(request)
puts response.read_body

To wait for an intermediate request to respond, you have to prefix the selector with xhr:

require "uri"
require "net/http"

url = URI("https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&asp=true&render_js=true&wait_for_selector=xhr%3A%2Fapi%2Fgraphql&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Freviews")

https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)

response = https.request(request)
puts response.read_body

The XHR pattern matches as soon as it is found in the URL of an XHR request. The pattern is case-sensitive and supports the wildcard * character, e.g. xhr:/page/*/reviews*

Javascript Execution

Scrapfly provides a way to inject javascript code into Scrapfly browsers through the js parameter.

The provided javascript code has to be base64 encoded; it is executed after the rendering wait and before wait_for_selector (if any).

The return value of the provided javascript code is also returned by the Scrapfly API (as long as it is serializable) under result.browser_data.javascript_evaluation_result.

For example, this JS script, when run on the web-scraping.dev product page, will retrieve all review texts:

return Array.from(document.querySelectorAll('.review > p')).map((el) => el.textContent)

It should be encoded as base64:
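
cmV0dXJuIEFycmF5LmZyb20oZG9jdW1lbnQucXVlcnlTZWxlY3RvckFsbCgnLnJldmlldyA-IHAnKSkubWFwKChlbCkgPT4gZWwudGV4dENvbnRlbnQp

A minimal Ruby sketch producing this URL-safe base64 value (note the - character in place of + so the value can be placed directly in the js parameter):

require "base64"

js_code = "return Array.from(document.querySelectorAll('.review > p')).map((el) => el.textContent)"

# URL-safe base64 ("-" and "_" instead of "+" and "/")
puts Base64.urlsafe_encode64(js_code)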

and finally it can be passed to the Scrapfly API as demonstrated in this example:

require "uri"
require "net/http"

url = URI("https://api.scrapfly.io/scrape?render_js=true&js=cmV0dXJuIEFycmF5LmZyb20oZG9jdW1lbnQucXVlcnlTZWxlY3RvckFsbCgnLnJldmlldyA-IHAnKSkubWFwKChlbCkgPT4gZWwudGV4dENvbnRlbnQp&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")

https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)

response = https.request(request)
puts response.read_body

The API returns the result under the result.browser_data.javascript_evaluation_result key, which contains the script's return value.
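
Continuing from the example above, the value can be read from the JSON response like this (a minimal sketch):

require "json"

# "response" is the Net::HTTP response from the example above
data = JSON.parse(response.body)
puts data.dig("result", "browser_data", "javascript_evaluation_result")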

Snippets

Here are some common javascript execution snippets used in web scraping:

  • Scroll to the bottom of the page so that lazily loaded content is fully rendered (see the sketch below)
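
As a minimal sketch, such a snippet can be base64 encoded and passed through the js parameter; the scroll script, target URL, and rendering_wait value here are illustrative:

require "uri"
require "net/http"
require "base64"
require "cgi"

# Scroll to the bottom of the page to trigger lazily loaded content
scroll_js = "window.scrollTo(0, document.body.scrollHeight)"
js_param = CGI.escape(Base64.urlsafe_encode64(scroll_js))

url = URI("https://api.scrapfly.io/scrape?render_js=true&rendering_wait=2000&js=#{js_param}&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")
https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true
puts https.request(Net::HTTP::Get.new(url)).read_body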

Screenshot

When using a browser, you can also capture screenshots. A dedicated page for the screenshot feature is available here.

Resource Tracking

Scrapfly tracks all background requests performed by the scraped websites (XHR). The captured data includes all necessary request details like the URL, headers, method and the sent body. This data is available under result.browser_data.

Scrapfly also tracks Local Storage and Session Storage content, which is available under the local_storage_data and session_storage_data keys of result.browser_data.
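
Continuing from any of the request examples above, the captured data can be read from the JSON response (a minimal sketch):

require "json"

# "response" is the Net::HTTP response from one of the examples above
data = JSON.parse(response.body)
browser_data = data.dig("result", "browser_data") || {}

puts browser_data["local_storage_data"]    # Local Storage content
puts browser_data["session_storage_data"]  # Session Storage content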

Limitations

The javascript rendering feature is only available with the GET method; you can't use a browser to send POST, PATCH, PUT, or HEAD requests.

The following XHR / fetch resources are not tracked:

  • Fonts: .woff, .woff2, .otf, .ttf
  • Media: .webm, .oga, .aac, .m4a, .mp3, .wav, .mp4
  • Image: .svg, .png, .gif, .jpg, .jpeg, .ico
  • Style: .css
  • Other: .pbf

You can retrieve an emitted XHR call with its associated URL, headers, body, and method. We do not attach the response. If you need the response content, you can simply call the XHR URL directly, as sketched below.
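
For example, a tracked XHR call could be replayed like this (a minimal sketch; the URL is a hypothetical value taken from the tracked request data, not a documented endpoint):

require "uri"
require "net/http"

# Hypothetical URL taken from a tracked XHR call in result.browser_data
xhr_url = URI("https://web-scraping.dev/api/reviews?product_id=1")

https = Net::HTTP.new(xhr_url.host, xhr_url.port)
https.use_ssl = true
puts https.request(Net::HTTP::Get.new(xhr_url)).read_body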

It's not possible to directly download media/image content with a browser: the browser loads the image URL into an HTML document and displays it with an img tag. Without the browser, you will retrieve the base64 of the binary content instead.

You can see the full description and examples of error responses in the Errors section.

Pricing

Using JavaScript rendering costs 5 Scrape API credits against your quota. Keep in mind that JavaScript rendering is slow and uses a lot of data and resources; for maximum performance, avoid it when it's not required.

The API response contains the X-Scrapfly-Api-Cost header, which indicates the billed amount.
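
Continuing from the request examples above, the billed amount can be read directly from that header (Net::HTTP header lookup is case-insensitive):

# "response" is the Net::HTTP response from one of the examples above
puts response["X-Scrapfly-Api-Cost"]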

Summary