Web Scraping With R

The R programming language is one of the most popular languages in modern data science, and web scraping can be used to efficiently generate datasets right in R itself! The R ecosystem is equipped with all the features we need for quality web scraping: HTTP clients, HTML parsers and various data-processing utilities.

In this article we'll take a deep dive into web scraping in R the right way. We'll cover fast asynchronous HTTP connections, how to avoid basic blocking and how to parse HTML documents for data. Finally, we'll solidify this knowledge with an example scraper of job listing information from https://www.indeed.com/!

Making a Connection

Web scraping generally consists of two steps: getting data and parsing data. In this section we'll focus on getting data, which is done via HTTP connections.

To retrieve a public resource, we (the client) must connect to the server and ask it for the document. This HTTP interaction is referred to as a request and a response:

illustration of a standard HTTP exchange

As you can see in the illustration above, this protocol involves many different parts like the request method, location and headers, but before we start exploring these bits we should choose an HTTP client in R!

HTTP clients: crul

To handle our HTTP connections we need an HTTP client library, and the R language primarily has two competing ones:

  • httr - an older de facto standard client that comes with all of the basic http functions.
  • crul - a newer take on HTTP protocol with modern client API and asynchronous functionality.

In this article we'll stick with crul, as it offers vital optional functionality for web scraping - asynchronous (parallel) requests - as well as a more accessible client API.

The reason we want asynchronous connections is that the HTTP protocol involves a lot of waiting - every time the client makes a request, it must wait for the server to respond, blocking our code in the meantime. The async feature lets us send multiple requests at the same time and consolidate that waiting!
In other words, if one request takes 0.1 seconds of processing and 2 seconds of waiting for the data to travel, 10 synchronous requests would take around 21 seconds, while 10 asynchronous requests take about 3 seconds (1 second of processing plus the 2 seconds of shared waiting) - a colossal speed improvement!

Let's prepare our R environment with crul package which we'll use to get familiar with the HTTP protocol:

> install.packages("crul", repos = "https://dev.ropensci.org")

Understanding Requests and Responses

When it comes to web scraping, we don't need to know every little detail about HTTP requests and responses; however, it's good to have a general overview and to know which parts of this protocol are especially useful in web scraping. Let's take a look at exactly that!

Request Method

HTTP requests are conveniently divided into a few types that perform distinct functions:

  • GET requests are intended to request a document.
  • POST requests are intended to request a document by sending a document.
  • HEAD requests are intended to request a document's meta information.
  • PATCH requests are intended to update a document.
  • PUT requests are intended to either create a new document or update it.
  • DELETE requests are intended to delete a document.

When it comes to web scraping, we are mostly interested in collecting documents, so we'll mostly be working with GET and POST type requests. Additionally, HEAD requests can be useful for optimizing bandwidth - sometimes, before downloading a document, we might want to check its metadata to decide whether it's worth the effort.
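As a quick sketch of that idea - using the crul client we'll install below, with httpbin.org as a stand-in server - a HEAD request returns only the response headers, letting us inspect a document's type before committing to the full download:

```r
library("crul")

# HEAD returns headers only - no document body is transferred
response <- HttpClient$new("https://httpbin.org/html")$head()
print(response$status_code)
# the content-type header tells us what kind of document a GET would return:
print(response$response_headers[["content-type"]])
```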

Request Location

To understand what a resource location is, we should first take a quick look at the structure of a URL itself:

Example of a URL structure

Here we can visualize each part of a URL: the protocol (for HTTP this is either http or https), the host (essentially the address of the server), and finally the location of the resource along with some custom parameters.
If you're ever unsure of a URL's structure, you can always fire up an R interactive shell (R in the terminal) and let crul figure it out for you via its url_parse function:

$ R
> crul::url_parse("http://www.domain.com/path/to/resource?arg1=true&arg2=false")
$scheme
[1] "http"
$domain
[1] "www.domain.com"
$port
[1] NA
$path
[1] "path/to/resource"
$parameter
$parameter$arg1
[1] "true"
$parameter$arg2
[1] "false"
$fragment
[1] NA

Crul comes with two main URL parsing utils: url_parse, which breaks a URL into its parts as shown above, and url_build, which assembles a URL from a base, path and query parameters.
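The latter comes in handy when generating scrape targets, as it saves us from error-prone string pasting. A small sketch (the indeed-style values are just for illustration):

```r
library("crul")

# compose a search url from its parts:
url <- url_build(
    "https://uk.indeed.com",
    path = "jobs",
    query = list(q = "r", l = "Scotland")
)
print(url)
```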

Request Headers

While request headers might appear to be just minor metadata details, in web scraping they are extremely important. Headers contain essential details about the request: who's requesting the data? What type of data are they expecting? Getting these wrong might result in the web scraper being denied access.

Let's take a look at some of the most important headers and what they mean:

User-Agent is an identity header that tells the server who's requesting the document.

# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

Whenever you visit a web page, your web browser identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers". This helps the server determine whether to serve or deny the client. In web scraping we don't want to be denied content, so we have to blend in by faking our user agent to look like that of a browser.

There are many online databases that contain the latest user-agent strings of various platforms.

Cookie is used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc. Cookies are a bit out of the scope of this article, but we'll be covering them in the future.

Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content we're expecting. Generally when web scraping we want to mimic the values of a popular web browser; for example, the Chrome browser uses:

text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of the scraped website/webapp.

These are a few of the most important observations; for more, see the extensive documentation on HTTP headers.

Response Status Code

Conveniently, all HTTP responses come with a status code that indicates whether the request was a success, a failure, or whether some alternative action is requested (such as a request to authenticate). Let's take a quick look at the status codes most relevant to web scraping:

  • 200 range codes generally mean success!
  • 300 range codes tend to mean redirection - in other words, if we request content at /product1.html it might have moved to a new location like /products/1.html, and the server informs us about that.
  • 400 range codes mean the request is malformed or denied. Our web scraper could be missing some headers, cookies or authentication details.
  • 500 range codes typically mean server issues. The website might be unavailable right now or might be purposefully disabling access to our web scraper.

For more on HTTP status codes, see the documentation.
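As a sketch of how these ranges translate into scraper logic (the helper function here is our own, not part of crul):

```r
# hypothetical helper mapping a status code to a scraper action:
classify_status <- function(code) {
    if (code >= 200 && code < 300) "success"
    else if (code >= 300 && code < 400) "redirect - follow the new location"
    else if (code >= 400 && code < 500) "denied - check headers, cookies or auth"
    else if (code >= 500) "server issue - retry later or back off"
    else "informational or unknown"
}

print(classify_status(200))  # "success"
print(classify_status(404))  # "denied - check headers, cookies or auth"
```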

Response Headers

When it comes to web scraping, response headers provide some important information for connection functionality and efficiency. For example, the Set-Cookie header asks our client to save some cookies for future requests, which might be vital for website functionality. Other headers, such as Etag and Last-Modified, are intended to help the client with caching to optimize resource usage.
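To illustrate the caching headers in action, httpbin.org provides an /etag endpoint: echoing the received Etag value back through an If-None-Match request header should yield a 304 (Not Modified) response with no body, saving bandwidth. A sketch against the test service, not a real website:

```r
library("crul")

# first request - the server includes an Etag header with the response:
first <- HttpClient$new("https://httpbin.org/etag/abc123")$get()
etag <- first$response_headers[["etag"]]

# conditional request - ask for the document only if it has changed:
second <- HttpClient$new(
    "https://httpbin.org/etag/abc123",
    headers = list("If-None-Match" = etag)
)$get()
print(second$status_code)
```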

For the entire list of HTTP headers, see the documentation.

Finally, just like with request headers, headers prefixed with an X- are custom web functionality headers.


We've taken a brief overview of the core HTTP components, and now it's time to give it a go and see how HTTP works in practical R!

Making GET Requests

Now that we're familiar with the HTTP protocol and how it's used in web-scraping let's put it to practice using R's crul library.

Let's start off with a basic GET request:

library("crul")
response <- HttpClient$new('https://httpbin.org/headers')$get()

# response url - it can be different from above if redirect happened
print(response$url)
# status code:
print(response$status_code)
# check whether response succeeded, i.e. status code <= 201
print(response$success())
# response headers:
print(response$response_headers)

# response binary content
print(response$content)
# response content as text:
print(response$parse())
# can also load text json response to R's named list:
jsonlite::fromJSON(response$parse())

Here we're using the http://httpbin.org/ HTTP testing service - in this case, its /headers endpoint, which returns the request headers the server received from us.
When run, this script should print basic details about the request we made:

> library("crul")
> response <- HttpClient$new('https://httpbin.org/headers')$get()
> 
> # response url - it can be different from above if redirect happened
> print(response$url)
[1] "https://httpbin.org/headers"
> # status code:
> print(response$status_code)
[1] 200
> # check whether response succeeded, i.e. status code <= 201
> print(response$success())
[1] TRUE
> # response headers:
> print(response$response_headers)
$status
[1] "HTTP/2 200"
$date
[1] "Wed, 02 Mar 2022 08:28:04 GMT"
$`content-type`
[1] "application/json"
$`content-length`
[1] "286"
$server
[1] "gunicorn/19.9.0"
$`access-control-allow-origin`
[1] "*"
$`access-control-allow-credentials`
[1] "true"
> # response binary content
> print(response$content)
  [1] 7b 0a 20 20 22 68 65 61 64 65 72 73 22 3a 20 7b 0a 20 20 20 20 22 41 63 63
 [26] 65 70 74 22 3a 20 22 61 70 70 6c 69 63 61 74 69 6f 6e 2f 6a 73 6f 6e 2c 20
 [51] 74 65 78 74 2f 78 6d 6c 2c 20 61 70 70 6c 69 63 61 74 69 6f 6e 2f 78 6d 6c
 [76] 2c 20 2a 2f 2a 22 2c 20 0a 20 20 20 20 22 41 63 63 65 70 74 2d 45 6e 63 6f
[101] 64 69 6e 67 22 3a 20 22 67 7a 69 70 2c 20 64 65 66 6c 61 74 65 22 2c 20 0a
[126] 20 20 20 20 22 48 6f 73 74 22 3a 20 22 68 74 74 70 62 69 6e 2e 6f 72 67 22
[151] 2c 20 0a 20 20 20 20 22 55 73 65 72 2d 41 67 65 6e 74 22 3a 20 22 6c 69 62
[176] 63 75 72 6c 2f 37 2e 38 31 2e 30 20 72 2d 63 75 72 6c 2f 34 2e 33 2e 32 20
[201] 63 72 75 6c 2f 31 2e 32 2e 30 22 2c 20 0a 20 20 20 20 22 58 2d 41 6d 7a 6e
[226] 2d 54 72 61 63 65 2d 49 64 22 3a 20 22 52 6f 6f 74 3d 31 2d 36 32 31 66 32
[251] 61 39 34 2d 33 31 32 61 66 38 33 62 33 33 63 37 32 35 34 65 33 33 34 36 39
[276] 64 30 39 22 0a 20 20 7d 0a 7d 0a
> # response content as text:
> print(response$parse())
No encoding supplied: defaulting to UTF-8.
[1] "{\n  \"headers\": {\n    \"Accept\": \"application/json, text/xml, application/xml, */*\", \n    \"Accept-Encoding\": \"gzip, deflate\", \n    \"Host\": \"httpbin.org\", \n    \"User-Agent\": \"libcurl/7.81.0 r-curl/4.3.2 crul/1.2.0\", \n    \"X-Amzn-Trace-Id\": \"Root=1-621f2a94-312af83b33c7254e33469d09\"\n  }\n}\n"
> # can also load text json response to R's named list:
> jsonlite::fromJSON(response$parse())
No encoding supplied: defaulting to UTF-8.
$headers
$headers$Accept
[1] "application/json, text/xml, application/xml, */*"
$headers$`Accept-Encoding`
[1] "gzip, deflate"
$headers$Host
[1] "httpbin.org"
$headers$`User-Agent`
[1] "libcurl/7.81.0 r-curl/4.3.2 crul/1.2.0"

Making POST Requests

Sometimes our web scraper might need to submit some sort of form to retrieve HTML results. For example, search queries often use POST requests with the query details as either JSON or form data values:

library("crul")

# send form type post request:
response <- HttpClient$new('https://httpbin.org/post')$post(
    body = list("query" = "cats", "page" = 1),
    encode = "form"
)
print(jsonlite::fromJSON(response$parse()))

# or json type post request:
response <- HttpClient$new('https://httpbin.org/post')$post(
    body = list("query" = "cats", "page" = 1),
    encode = "json"
)
print(jsonlite::fromJSON(response$parse()))

Ensuring Headers

As we've covered before, our requests must provide metadata about themselves, which helps the server determine what content to return. Often this metadata can be used to identify web scrapers and block them. Modern web browsers automatically include specific metadata details with every request, so if we don't want to stand out as a web scraper we should replicate this behavior.

Primarily, the User-Agent and Accept headers are often dead giveaways, so we should set them to some common values. This can be done either globally or on a per-request basis:

library("crul")

# we can set headers for every request
response <- HttpClient$new("https://httpbin.org/headers",
    headers = list(
        "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
    )
)$get()
print(jsonlite::fromJSON(response$parse())$headers)

# or set headers for the whole script (recommended)
set_headers(
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
)
response <- HttpClient$new("https://httpbin.org/headers")$get()
print(jsonlite::fromJSON(response$parse())$headers)

In the example above we're setting our headers to mimic the Chrome web browser on the Windows platform. This simple change can prevent a lot of web scraping blocking and is recommended for every web scraper.

Tracking cookies

Sometimes when web scraping we care about persistent connection state. For websites where we need to log in or configure the site (like changing the display currency), cookies are a vital part of the web scraping process.

Crul supports cookie tracking on a per-HttpClient basis, meaning all requests made through one client object share cookies:

library("crul")

session <- HttpClient$new('http://httpbin.org/')
# set some cookies:
resp_set_cookies <- session$get('/cookies/set/foo/bar')
# see current cookies:
resp_retrieve_cookies <- session$get('/cookies')
print(resp_retrieve_cookies$parse())

In the example above, we're using httpbin.org's /cookies/set endpoint to set some cookies for the session. Once the cookies are set, we're redirected to a page that displays the sent cookies:

{
  "cookies": {
    "foo": "bar"
  }
}

Now that we know our way around HTTP in R and crul, let's take a look at connection speed: how can we make these connections faster and more efficient?

Asynchronous (Parallel) Requests

Since the HTTP protocol is a data exchange protocol between two parties, there's a lot of waiting involved. In other words, when our client sends a request it needs to wait for it to travel all the way to the server and come back, which stalls our program. Why should our program sit idle and wait for a request to travel around the globe? This is called an IO (input/output) block.

We chose R's crul package over httr for this particular feature - it makes asynchronous requests very accessible:

library("crul")

start = Sys.time()
responses <- Async$new(
    urls = c(
        "http://httpbin.org/links/4/0",
        "http://httpbin.org/links/4/1",
        "http://httpbin.org/links/4/2",
        "http://httpbin.org/links/4/3"
    )
)$get()
print(responses)
print(Sys.time() - start)

In the example above we are batching multiple urls and executing them together. Alternatively, we can go even further and execute a mix of different requests:

library("crul")

start = Sys.time()
requests <- AsyncVaried$new(
        HttpRequest$new("http://httpbin.org/links/4/0")$get(),
        HttpRequest$new("http://httpbin.org/links/4/1")$get(),
        HttpRequest$new("http://httpbin.org/links/4/2")$get(),
        HttpRequest$new("http://httpbin.org/links/4/3", headers=list("User-Agent"="different"))$get(),
        HttpRequest$new("http://httpbin.org/post")$post(body=list(query="cats", page = 1))
)
requests$request()  # execute all queued requests in parallel
responses <- requests$responses()
print(responses)
print(Sys.time() - start)

The above approach allows us to mix requests of varying types and parameters.


Now that we're familiar and comfortable with the HTTP protocol in R's crul, let's take a look at how we can make sense of the HTML data we're retrieving. In the next section we'll take a look at HTML parsing using CSS and XPATH selectors in R!

Parsing HTML: Rvest

Retrieving HTML documents is only one part of the web scraping process - we also have to parse them for the data we're looking for. Luckily, the HTML format is designed to be machine parsable, so we can take advantage of this and use the special CSS and XPATH selector languages to find the exact parts of the page to extract.

We've covered both CSS and XPATH selectors in great detail in previous articles:

Parsing HTML with CSS Selectors

For more on CSS selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms

Parsing HTML with Xpath

For more on xpath selectors see our in depth introduction article which covers xpath syntax, usage and various tips and tricks.


In R there's one library that supports both CSS and XPATH selectors: rvest.
Let's take a look at some common rvest use cases:

library("rvest")

tree <- read_html('
<div class="links">
  <a href="https://twitter.com/@scrapfly_dev">Twitter</a>
  <a href="https://www.linkedin.com/company/scrapfly/">LinkedIn</a>
</div>
')

# we can execute basic css selectors and pull all text values:
print(tree %>% html_element("div.links") %>% html_text())
# "[1] "\n  Twitter\n  LinkedIn\n""

# we can also execute xpath selectors:
print(tree %>% html_element(xpath="//div[@class='links']") %>% html_text())
# "[1] "\n  Twitter\n  LinkedIn\n""

# html_text2 - outputs are cleaned of trailing/leading whitespace characters:
print(tree %>% html_element("div") %>% html_text2())
# "[1] "Twitter LinkedIn""

# we can select attribute of a single element:
print(tree %>% html_element("div") %>% html_attr('class'))
# "links"

# or attributes of multiple elements:
print(tree %>% html_elements("div.links a") %>% html_attr('href'))
# [1] "https://twitter.com/@scrapfly_dev"         
# [2] "https://www.linkedin.com/company/scrapfly/"

The key mechanism here is R's pipe operator %>%, which allows us to process our HTML tree through multiple processors like XPATH or CSS selectors and text or attribute extractors.

Rvest also comes with some special parsing functionalities inspired by data science use cases. For example, it allows us to convert HTML tables to R's data frames:

library("rvest")

tree <- read_html('
<div class="table-wrapper">
<table>
  <tr>
    <th>model</th>
    <th>year</th>
  </tr>
  <tr>
    <td>Mazda</td>
    <td>2011</td>
  </tr>
  <tr>
    <td>Toyota</td>
    <td>1992</td>
  </tr>
</table>
</div>
')

tree %>% html_element('.table-wrapper') %>% html_table()
## A tibble: 2 × 2
#   model   year
#   <chr>  <int>
# 1 Mazda   2011
# 2 Toyota  1992

In the example above, the html_table() pipe function automatically extracts the whole table from the given selection if it includes a <table> node. It picks up table headers from <th> nodes and even converts values to appropriate types (in this example, the year values were converted to integers).

The best way to really explore rvest is with an example project, so let's do just that!

Putting It All Together: indeed.com scraper

To solidify our knowledge we'll write a short web scraper for https://uk.indeed.com/.
We'll be scraping job listing data for a given search query:

  1. We'll scrape the indeed.com search page for a given search query and location, like https://uk.indeed.com/jobs?q=ruby&l=Scotland
  2. find all job listings on the page by using rvest
  3. find the total amount of jobs available for the query
  4. repeat the above for the remaining pages of the query as a batch request (all pages after the first will be retrieved together)
  5. combine all results and return them

We'll write a fast and easy-to-understand scraper by separating our logic into single-purpose functions.
Let's start our scraper from the bottom up by defining our constants and the search parse function:

library("crul")
library("rvest")
library("glue")

HEADERS <- list(
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding" = "gzip, deflate, br",
    "Accept-Language" = "en-US,en;q=0.9"
)

Here, we're defining our HEADERS constant. To avoid having our scraper blocked, we need to ensure it looks like a web browser, so we're simply copying the headers a Chrome browser would use on a Windows computer.

Next, we can define our search parse function:

parse_search <- function(response){
    # build rvest tree
    tree <- read_html(response$parse())
    # find total jobs available
    total_jobs <- tree %>% html_element("#searchCountPages") %>% html_text2()
    total_jobs <- strtoi(stringr::str_match(total_jobs, "(\\d+) jobs")[,2])
    # find displayed job container boxes:
    jobs <- tree %>% html_elements('#mosaic-zone-jobcards .result')
    # extract job listing from each job container box
    parsed <- list()
    for (job in jobs){
        parsed <- append(parsed, list(
            title = job %>% html_element('h2') %>% html_text2(),
            company = job %>% html_element('.companyOverviewLink') %>% html_text2(),
            location = job %>% html_element('.companyLocation') %>% html_text2(),
            date = job %>% html_element(xpath=".//span[contains(@class,'date')]/text()") %>% html_text2(),
            url =  url_build(response$url, job %>% html_attr('href')),
            company_url = url_build(response$url, job %>% html_element(xpath='.//a[contains(@class,"companyOverviewLink")]') %>% html_attr('href'))
        ))
    }
    # return parsed jobs and total job count in the query
    print(glue("found total {length(jobs)} jobs from total of {total_jobs}"))
    list(jobs=parsed, total=total_jobs)
}

Here we have a function that takes a response object, builds an rvest HTML tree and, using a combination of CSS and XPATH selectors, extracts the job listing details. An important bit to note here is that we first select the job container boxes and then iterate through them extracting job details - isolating context like this is an important HTML parsing technique which helps avoid structural issues.

We can already test this code by explicitly scraping the first search page:

url <- "https://uk.indeed.com/jobs?q=r&l=scotland"
response <- HttpClient$new(url, headers=HEADERS)$get()
print(parse_search(response))

This will return the results of the first query page and the count of total listings:

found total 15 jobs from total of 330
$jobs
$jobs$title
[1] "Parts Supervisor"
$jobs$company
[1] "BMW Group Retail"
$jobs$location
[1] "Edinburgh"
$jobs$date
[1] "4 days ago"
$jobs$url
[1] "https://uk.indeed.com/rc/clk?jk=ad7698381a9870de&fccid=bf564b9e3f3db3fc&vjs=3"
$jobs$company_url
[1] "https://uk.indeed.com/cmp/BMW-Group"

<...>

$total
[1] 330

Now we can finish our scraper with a scraping loop that scrapes the first page, parses the results and then scrapes the remaining pages in parallel:

scrape_search_page <- function(query, location, offset=0, limit=10){
    # first we need to create all urls we'll be scraping based on offset and limit arguments
    # 0:10 creates the first page url, and 10:80 creates urls for pages 2-8, as there are 10 results per page
    print(glue("scraping {query} at {location} in range {offset}:{limit}"))
    urls <- list()
    for (i in seq(offset, limit - 10, by=10)){
        urls <- append(urls, glue("https://uk.indeed.com/jobs?q={query}&l={location}&start={i}"))
    }

    # then we want to retrieve these urls in parallel:
    print(glue("scraping search page urls: {urls}"))
    responses <- Async$new(
        urls = urls,
        headers = HEADERS
    )$get()
    
    # finally we want to unpack results of each individual page into final dataset
    found_jobs <- list()
    total_jobs <- NULL
    for (response in responses){
        page_results <- parse_search(response)
        found_jobs <- append(found_jobs, page_results$jobs)
        total_jobs <- page_results$total
    }
    # we return jobs we parsed and total jobs presented in the search page:
    list(jobs=found_jobs, total=total_jobs)
}

scrape_search <- function(query, location){
    # this is our main function that scrapes all pages of the query explicitly
    # first, we scrape the first page
    first_page_results <- scrape_search_page(query, location)
    found_jobs <- first_page_results$jobs
    total_jobs <- first_page_results$total

    # then we scrape remaining pages: 
    print(glue("scraped first page, found {length(found_jobs)} jobs; continuing with remaining: {total_jobs}"))
    remaining_page_results <- scrape_search_page(query, location, offset = 10, limit = total_jobs)
    # finally, we return combined dataset
    append(found_jobs, remaining_page_results$jobs)
}

Here we've defined our scraping loop, which utilizes crul's parallel connection feature. As you can see, the common theme is: get the first page, examine its context, and use that context to quickly retrieve the remaining page results in parallel!

If we run our script:

start = Sys.time()
print(scrape_search("r", "Scotland"))
print(Sys.time() - start)

We can see that we scraped over 300 job listings in under 10 seconds! We can easily embed this tiny scraper into our R workflow without needing to worry about caching or data storage, as each dataset scrape takes only a few seconds to refresh.
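Since the scraper returns jobs as flat lists of fields, folding them into an R data frame for analysis or export is straightforward. A sketch with made-up sample rows standing in for live scrape_search output:

```r
# hypothetical sample rows in place of live scraper output:
jobs <- list(
    list(title = "Data Scientist", company = "Acme", location = "Edinburgh"),
    list(title = "R Developer", company = "Widgets Ltd", location = "Glasgow")
)

# bind each row-list into a single data frame:
jobs_df <- do.call(rbind, lapply(jobs, as.data.frame))
print(jobs_df)

# persist the dataset for later use:
write.csv(jobs_df, "jobs.csv", row.names = FALSE)
```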


Our example, while complete, is not production ready: web scraping suffers from many challenges like web scraper blocking and connection issues/throttling, and dealing with those in the R language is a bit out of the scope of this tutorial. Instead, let's take a look at ScrapFly's web scraping API and how we could use it in R to abstract these issues away to the ScrapFly service so we don't have to deal with them ourselves!

ScrapFly

ScrapFly offers a valuable middleware service to web scraper developers which solves many complex connection issues that can make web content hard to reach or even unreachable.

ScrapFly service does the heavy lifting for you!

Often web scrapers get blocked, throttled or have insufficient http client abilities to reach the desired content. For this, ScrapFly provides an API service that can execute connection logic for you.

ScrapFly contains many features such as:

  • Smart Proxy Selection - all ScrapFly requests are proxied through either datacenter or residential proxies. This allows scrapers to access geographically locked content and greatly reduces chances of being blocked or throttled.
  • Javascript Rendering - as many websites rely on javascript to render their content, HTTP clients such as R's crul are often not enough to scrape dynamic data displayed on these websites. For this we can use ScrapFly as an abstraction layer that requests and renders the dynamic content for us.
  • Anti Scrape Protection Solution - many modern websites don't want to be scraped and use various services or captcha that prevent scrapers from accessing content - ScrapFly can solve this.
  • Many other tools like DNS analysis, SSL analysis and Screenshot capture

Let's take a look at how we can use ScrapFly's API to scrape content with all of these features in R!

library(crul)

SCRAPFLY_KEY <- "YOUR_API_KEY"

scrapfly <- function(url,
                     render_js = FALSE,
                     wait_for_selector = NULL,
                     asp = FALSE,
                     country = "us",
                     proxy_pool = "public_datacenter_pool") {
    url_build("https://api.scrapfly.io/scrape", query = list(
        key = SCRAPFLY_KEY,
        url = url,
        asp = if (asp) "true" else "false",
        render_js = if (render_js) "true" else "false",
        wait_for_selector = wait_for_selector,
        country = country,
        proxy_pool = proxy_pool
    ))
}


response <- HttpClient$new(scrapfly('https://httpbin.org/headers'))$get()
response_with_javascript_render <- HttpClient$new(scrapfly('https://httpbin.org/headers', render_js = TRUE))$get()
data <- jsonlite::fromJSON(response$parse())
print(data$result$content)
print(data$result$response_headers)
print(data$result$cookies)
print(data$config)

In the example above, we define a function that turns a regular web url into a ScrapFly API url, which we can pass to our requests to take advantage of ScrapFly's features with no extra effort.

For more features, usage scenarios and other info see ScrapFly's documentation!

Summary

In this extensive introduction article on web scraping in R, we've taken a look at HTTP connection strategies using the crul library: how to submit GET and POST requests, change headers, keep track of cookies and, finally, how to do all that efficiently using asynchronous (parallel) requests.
With the HTTP bits figured out, we took a look at HTML parsing using rvest, which supports both CSS and XPATH selector based element extraction.

Finally, we put everything together in a small example web scraper of https://uk.indeed.com/ and took a quick look at some common web scraping challenges and how we can use the ScrapFly web scraping API to solve them for us!
