The R programming language is one of the most popular languages in modern data science, and web scraping can be used to efficiently generate datasets right in R itself!
The R ecosystem is equipped with all the features we need for web scraping: HTTP clients, HTML parsers and various data-processing utilities.
In this article, we'll take a deep dive into web scraping in R - the right way.
We'll cover fast asynchronous HTTP connections, how to avoid basic blocking and how to parse HTML files for data.
Finally, we'll solidify this knowledge with an example scraper that collects job listing information from https://www.indeed.com/!
Web scraping generally consists of two steps: getting data and parsing data. In this section, we'll focus on getting data, which is done via HTTP connections.
To retrieve public resources we (the client) must connect to the server and hope the server returns the document data. This HTTP interaction is referred to as requests and responses:
As you can see in the illustration above, this protocol involves lots of different parts like the request method, location, headers etc. But before we start exploring these bits, we should choose an HTTP client in R!
To handle our HTTP connections we need an HTTP client library, and the R language primarily has two competing options: crul and httr.
In this article, we'll stick with crul as it offers optional functionality that is vital for web scraping - asynchronous (parallel) requests, which make fast scraping possible.
HTTP involves a lot of waiting - every time a client makes a request, it must wait for the server to respond, blocking our code in the meantime. So, if we can make multiple concurrent connections - we can skip the blocked waiting!
For example, a single synchronous request takes 0.1 seconds of actual processing and 2 seconds of waiting, so 10 of them would take 21 seconds in total; 10 asynchronous requests, on the other hand, still take 10 × 0.1 seconds of processing but only 1 × 2 seconds of waiting, since the waits overlap - a colossal difference!
Let's prepare our R environment with the crul package, which we'll use to get familiar with the HTTP protocol:
> install.packages("crul", repos = "https://dev.ropensci.org")
When it comes to web scraping we don't need to know every little detail about HTTP requests and responses; however, it's good to have a general overview and to know which parts of this protocol are especially useful in web scraping. Let's take a look at exactly that!
HTTP requests are conveniently divided into a few types that perform distinct functions. Most commonly in web scraping we'll encounter these types of requests:
GET requests are intended to request a document.
POST requests are intended to request a document by sending a document.
HEAD requests are intended to request a document's meta information.
When it comes to web scraping, we are mostly interested in collecting documents, so we'll mostly be working with GET and POST type requests. Additionally, HEAD requests can be useful in web scraping to optimize bandwidth - sometimes, before downloading a document, we might want to check its metadata to see whether it's worth the effort.
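To illustrate, here's a minimal sketch of using a HEAD request to check a document's metadata before committing to the full download (httpbin.org's /html endpoint is used purely as a stand-in):
library("crul")
# inspect document metadata with a cheap HEAD request first
client <- HttpClient$new("https://httpbin.org/html")
meta <- client$head()
print(meta$status_code)
print(meta$response_headers$`content-type`)
# only download the full document if it looks like HTML:
if (grepl("text/html", meta$response_headers$`content-type`)) {
response <- client$get()
}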
Other methods, less commonly encountered in web scraping but still good to be aware of, are:
PATCH requests are intended to update a document.
PUT requests are intended to either create a new document or update it.
DELETE requests are intended to delete a document.
Every HTTP request needs to tell the server what resource is being requested, and this is done through a URL (Uniform Resource Locator) which has a detailed structure:
Here, we can visualize each part of a URL: the scheme (protocol), which is either http or https, followed by the domain, an optional port, the path to the resource, query parameters and an optional fragment.
If you're ever unsure of a URL's structure, you can always fire up the R interactive shell (R in the terminal) and let crul's url_parse function figure it out for you:
$ R
> crul::url_parse("http://www.domain.com/path/to/resource?arg1=true&arg2=false")
$scheme
[1] "http"
$domain
[1] "www.domain.com"
$port
[1] NA
$path
[1] "path/to/resource"
$parameter
$parameter$arg1
[1] "true"
$parameter$arg2
[1] "false"
$fragment
[1] NA
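crul also offers the reverse operation, url_build, which assembles a URL from its parts - a quick sketch that should reproduce the URL we just parsed:
> crul::url_build("http://www.domain.com", "path/to/resource", query = list(arg1 = "true", arg2 = "false"))
[1] "http://www.domain.com/path/to/resource?arg1=true&arg2=false"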
While request headers might appear to be just minor metadata details, in web scraping they are extremely important. Headers contain essential details about the request, like: who's requesting the data? What type of data are they expecting? Getting these wrong might result in the web scraper being denied access.
Let's take a look at some of the most important headers and what they mean:
User-Agent is an identity header that tells the server who's requesting the document.
# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
Whenever you visit a web page, your web browser identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers". This helps the server determine whether to serve or deny the client. In web scraping, we don't want to be denied content, so we have to blend in by faking our user agent to look like that of a browser.
There are many online databases that contain the latest user-agent strings of various platforms, like this one.
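For illustration, here's a small sketch that picks one of a few browser user-agent strings at random for each request - the exact strings here are just examples:
library("crul")
# a couple of example browser user-agent strings to rotate between
user_agents <- c(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15"
)
# attach a randomly picked user-agent string to the request:
response <- HttpClient$new(
"https://httpbin.org/headers",
headers = list("User-Agent" = sample(user_agents, 1))
)$get()
print(jsonlite::fromJSON(response$parse())$headers$`User-Agent`)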
Cookie headers are used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc. Cookies are a bit out of the scope of this article, but we'll be covering them in the future.
Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content we're expecting. Generally when web scraping we want to mimic the values of a popular web browser. For example, the Chrome browser uses:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
For more, see
X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of the scraped website/webapp.
These are a few of the most important headers; for more, see the extensive documentation page on HTTP headers.
Conveniently, all HTTP responses come with a status code that indicates whether the request was a success, a failure, or whether some alternative action is requested (like a request to authenticate). Let's take a quick look at the status codes that are most relevant to web scraping:
200 range codes mean success.
300 range codes mean redirection - for example, if we request a document at /product1.html it might have been moved to a new location like /products/1.html, and the server would inform us about that.
400 range codes mean the request is malformed or denied (e.g. 403 forbidden or 404 not found) - these often indicate blocking when web scraping.
500 range codes mean the server failed to process the request.
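Since status codes tell us whether a response is usable, a common pattern is to retry failed requests. Here's a minimal sketch (the helper name and retry count are our own choices):
library("crul")
# retry a GET request a few times if the status code indicates failure
get_with_retries <- function(url, retries = 3) {
for (attempt in 1:retries) {
response <- HttpClient$new(url)$get()
# 2xx status codes mean success - return the response
if (response$status_code < 300) {
return(response)
}
print(paste("attempt", attempt, "failed with status", response$status_code))
Sys.sleep(1)  # brief pause before trying again
}
response
}
response <- get_with_retries("https://httpbin.org/status/200")
print(response$status_code)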
When it comes to web scraping, response headers provide some important information for connection functionality and efficiency. For example, the Set-Cookie header requests that our client save some cookies for future requests, which might be vital for website functionality. Other headers, such as Etag and Last-Modified, are intended to help the client with caching to optimize resource usage.
Finally, just like with request headers, headers prefixed with X- are custom web functionality headers.
We've taken a brief overview of the core HTTP components, and now it's time to give it a go and see how HTTP works in practical R!
Now that we're familiar with the HTTP protocol and how it's used in web scraping, let's put it to practice using R's crul library.
Let's start off with a basic GET request:
library("crul")
response <- HttpClient$new('https://httpbin.org/headers')$get()
# response url - it can be different from above if redirect happened
print(response$url)
# status code:
print(response$status_code)
# check whether response succeeded, i.e. status code <= 201
print(response$success())
# response headers:
print(response$response_headers)
# response binary content
print(response$content)
# response content as text:
print(response$parse())
# can also load text json response to R's named list:
jsonlite::fromJSON(response$parse())
Here we're using the http://httpbin.org/ HTTP testing service - in this case, the /headers endpoint, which shows the request headers the server received from us.
When run, this script should print basic details about the request we made:
> library("crul")
> response <- HttpClient$new('https://httpbin.org/headers')$get()
>
> # response url - it can be different from above if redirect happened
> print(response$url)
[1] "https://httpbin.org/headers"
> # status code:
> print(response$status_code)
[1] 200
> # check whether response succeeded, i.e. status code <= 201
> print(response$success())
[1] TRUE
> # response headers:
> print(response$response_headers)
$status
[1] "HTTP/2 200"
$date
[1] "Wed, 02 Mar 2022 08:28:04 GMT"
$`content-type`
[1] "application/json"
$`content-length`
[1] "286"
$server
[1] "gunicorn/19.9.0"
$`access-control-allow-origin`
[1] "*"
$`access-control-allow-credentials`
[1] "true"
> # response binary content
> print(response$content)
[1] 7b 0a 20 20 22 68 65 61 64 65 72 73 22 3a 20 7b 0a 20 20 20 20 22 41 63 63
[26] 65 70 74 22 3a 20 22 61 70 70 6c 69 63 61 74 69 6f 6e 2f 6a 73 6f 6e 2c 20
[51] 74 65 78 74 2f 78 6d 6c 2c 20 61 70 70 6c 69 63 61 74 69 6f 6e 2f 78 6d 6c
[76] 2c 20 2a 2f 2a 22 2c 20 0a 20 20 20 20 22 41 63 63 65 70 74 2d 45 6e 63 6f
[101] 64 69 6e 67 22 3a 20 22 67 7a 69 70 2c 20 64 65 66 6c 61 74 65 22 2c 20 0a
[126] 20 20 20 20 22 48 6f 73 74 22 3a 20 22 68 74 74 70 62 69 6e 2e 6f 72 67 22
[151] 2c 20 0a 20 20 20 20 22 55 73 65 72 2d 41 67 65 6e 74 22 3a 20 22 6c 69 62
[176] 63 75 72 6c 2f 37 2e 38 31 2e 30 20 72 2d 63 75 72 6c 2f 34 2e 33 2e 32 20
[201] 63 72 75 6c 2f 31 2e 32 2e 30 22 2c 20 0a 20 20 20 20 22 58 2d 41 6d 7a 6e
[226] 2d 54 72 61 63 65 2d 49 64 22 3a 20 22 52 6f 6f 74 3d 31 2d 36 32 31 66 32
[251] 61 39 34 2d 33 31 32 61 66 38 33 62 33 33 63 37 32 35 34 65 33 33 34 36 39
[276] 64 30 39 22 0a 20 20 7d 0a 7d 0a
> # response content as text:
> print(response$parse())
No encoding supplied: defaulting to UTF-8.
[1] "{\n \"headers\": {\n \"Accept\": \"application/json, text/xml, application/xml, */*\", \n \"Accept-Encoding\": \"gzip, deflate\", \n \"Host\": \"httpbin.org\", \n \"User-Agent\": \"libcurl/7.81.0 r-curl/4.3.2 crul/1.2.0\", \n \"X-Amzn-Trace-Id\": \"Root=1-621f2a94-312af83b33c7254e33469d09\"\n }\n}\n"
> # can also load text json response to R's named list:
> jsonlite::fromJSON(response$parse())
No encoding supplied: defaulting to UTF-8.
$headers
$headers$Accept
[1] "application/json, text/xml, application/xml, */*"
$headers$`Accept-Encoding`
[1] "gzip, deflate"
$headers$Host
[1] "httpbin.org"
$headers$`User-Agent`
[1] "libcurl/7.81.0 r-curl/4.3.2 crul/1.2.0"
Sometimes our web scraper might need to submit some sort of form to retrieve HTML results. For example, search queries often use POST requests with query details as either JSON or form data values:
library("crul")
# send form type post request:
response <- HttpClient$new('https://httpbin.org/post')$post(
body = list("query" = "cats", "page" = 1),
encode = "form",
)
print(jsonlite::fromJSON(response$parse()))
# or json type post request:
response <- HttpClient$new('https://httpbin.org/post')$post(
body = list("query" = "cats", "page" = 1),
encode = "json",
)
print(jsonlite::fromJSON(response$parse()))
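Query parameters work similarly for GET requests - crul's get() accepts a query argument. A quick sketch using httpbin's /get endpoint, which echoes the received arguments back to us:
library("crul")
# send a GET request with URL query parameters:
response <- HttpClient$new('https://httpbin.org/get')$get(
query = list("query" = "cats", "page" = 1)
)
# httpbin echoes the query arguments it received:
print(jsonlite::fromJSON(response$parse())$args)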
As we've covered before, our requests must provide metadata about themselves, which helps the server determine what content to return. Often, this metadata can be used to identify web scrapers and block them. Modern web browsers automatically include specific metadata details with every request, so if we don't want to stand out as a web scraper we should replicate this behavior.
Primarily, the User-Agent and Accept headers are often dead giveaways, so we should set them to some common values. This can be done either globally or on a per-request basis:
library("crul")
# we can set headers for every request
response <- HttpClient$new("https://httpbin.org/headers",
headers = list(
"User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
)
)$get()
print(jsonlite::fromJSON(response$parse())$headers)
# or set headers for the whole script (recommended)
set_headers(
"User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
)
response <- HttpClient$new("https://httpbin.org/headers")$get()
print(jsonlite::fromJSON(response$parse())$headers)
In the example above we're setting our headers to mimic the Chrome web browser on the Windows platform. This simple change can prevent a lot of web scraping blocking and is recommended for every web scraper.
Sometimes when web scraping we care about persistent connection state. For websites where we need to log in or configure the site (like changing the currency), cookies are a vital part of the web scraping process.
Crul supports cookie tracking on a per-HttpClient basis, meaning all requests made through one client object share cookies:
library("crul")
session <- HttpClient$new('http://httpbin.org/')
# set some cookies:
resp_set_cookies <- session$get('/cookies/set/foo/bar')
# see current cookies:
resp_retrieve_cookies <- session$get('/cookies')
print(resp_retrieve_cookies$parse())
In the example above, we're using httpbin.org's /cookies endpoint to set some cookies for the session. Once cookies are set, we're redirected to a page that displays the cookies that were sent:
{
"cookies": {
"foo": "bar"
}
}
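Alternatively, cookies can be set explicitly through the standard Cookie request header - a quick sketch, again using httpbin's /cookies endpoint:
library("crul")
# send cookies explicitly via the Cookie header:
response <- HttpClient$new(
'http://httpbin.org/cookies',
headers = list("Cookie" = "foo=bar; currency=USD")
)$get()
# httpbin displays the cookies it received:
print(response$parse())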
Now that we know our way around HTTP in R and crul, let's take a look at connection speed! How can we make these connections faster and more efficient?
Since HTTP is a data exchange protocol between two parties, there's a lot of waiting involved. In other words, when our client sends a request it needs to wait for it to travel all the way to the server and come back, which stalls our program. Why should our program sit idly and wait for a request to travel around the globe? This is called an IO (input/output) block.
We chose R's crul package over httr for this particular feature - it makes asynchronous requests very accessible:
library("crul")
start = Sys.time()
responses <- Async$new(
urls = c(
"http://httpbin.org/links/4/0",
"http://httpbin.org/links/4/1",
"http://httpbin.org/links/4/2",
"http://httpbin.org/links/4/3"
)
)$get()
print(responses)
print(Sys.time() - start)
In the example above we are batching multiple URLs to execute them together. Alternatively, we can go even further and execute several different requests:
library("crul")
start = Sys.time()
responses <- AsyncVaried$new(
HttpRequest$new("http://httpbin.org/links/4/0")$get(),
HttpRequest$new("http://httpbin.org/links/4/1")$get(),
HttpRequest$new("http://httpbin.org/links/4/2")$get(),
HttpRequest$new("http://httpbin.org/links/4/3", headers=list("User-Agent"="different"))$get(),
HttpRequest$new("http://httpbin.org/post")$post(body=list(query="cats", page = 1))
)$request()
print(responses)
print(Sys.time() - start)
This approach allows us to mix requests of varying types and parameters in a single batch.
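For the batched Async approach, the result is a list of HttpResponse objects which we can unpack just like single responses - a minimal sketch:
library("crul")
# retrieve a couple of pages in parallel and unpack the results
responses <- Async$new(
urls = c(
"http://httpbin.org/links/4/0",
"http://httpbin.org/links/4/1"
)
)$get()
for (response in responses) {
print(response$url)
print(response$status_code)
# parse the body of successful responses only:
if (response$status_code == 200) {
print(nchar(response$parse()))
}
}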
Now that we're familiar and comfortable with the HTTP protocol in R's crul, let's take a look at how we can make sense of the HTML data we're retrieving. In the next section we'll take a look at HTML parsing using CSS and XPath selectors in R!
Retrieving HTML documents is only one part of the web scraping process - we also have to parse them for the data we're looking for. Luckily, the HTML format is designed to be machine parsable, so we can take advantage of this and use special CSS or XPath selector languages to find the exact parts of the page to extract.
We've covered both CSS and XPath selectors in great detail in previous articles:
In R, there's one library that supports both CSS and XPath selectors: rvest
Let's take a look at some common rvest use cases:
library("rvest")
tree <- read_html('
<div class="links">
<a href="https://twitter.com/@scrapfly_dev">Twitter</a>
<a href="https://www.linkedin.com/company/scrapfly/">LinkedIn</a>
</div>
')
# we can execute basic css selectors and pull all text values:
print(tree %>% html_element("div.links") %>% html_text())
# "[1] "\n Twitter\n LinkedIn\n""
# we can also execute xpath selectors:
print(tree %>% html_element(xpath="//div[@class='links']") %>% html_text())
# "[1] "\n Twitter\n LinkedIn\n""
# html_text2 - outputs are cleaned of trailing/leading space characters:
print(tree %>% html_element("div") %>% html_text2())
# "[1] "Twitter LinkedIn""
# we can select attribute of a single element:
print(tree %>% html_element("div") %>% html_attr('class'))
# "links"
# or attributes of multiple elements:
print(tree %>% html_elements("div.links a") %>% html_attr('href'))
# [1] "https://twitter.com/@scrapfly_dev"
# [2] "https://www.linkedin.com/company/scrapfly/"
The primary function here is R's pipe symbol %>%, which allows us to process our HTML tree through multiple processors like XPath or CSS selectors and text or attribute extractors.
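To show how these extractors chain together, here's a quick sketch that collects the link texts and URLs from the same example tree into a small data frame:
library("rvest")
# build a small data frame of link names and href attributes
tree <- read_html('
<div class="links">
<a href="https://twitter.com/@scrapfly_dev">Twitter</a>
<a href="https://www.linkedin.com/company/scrapfly/">LinkedIn</a>
</div>
')
links <- tree %>% html_elements("div.links a")
print(data.frame(
name = links %>% html_text2(),
url = links %>% html_attr("href")
))
# should print a two-row data frame: Twitter and LinkedIn with their URLs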
Rvest also comes with some special parsing functionality inspired by data science use cases. For example, it allows us to convert HTML tables to R data frames:
library("rvest")
tree <- read_html('
<div class="table-wrapper">
<table>
<th>
<td>model</td>
<td>year</td>
</th>
<tr>
<td>Mazda</td>
<td>2011</td>
</tr>
<tr>
<td>Toyota</td>
<td>1992</td>
</tr>
</table>
</div>
')
tree %>% html_element('.table-wrapper') %>% html_table()
## A tibble: 2 × 2
# X1 X2
# <chr> <int>
# 1 Mazda 2011
# 2 Toyota 1992
In the example above, the html_table() pipe function automatically extracts the whole table from the given selector if it contains a <table> node. It picks up table headers from <th> nodes and even converts values to appropriate types (in this example, the year values were converted to integers).
The best way to really explore rvest is with an example project, so let's do just that!
To solidify our knowledge we'll write a short web scraper for https://uk.indeed.com/.
We'll be scraping job listing data from a given search query for R jobs:
https://uk.indeed.com/jobs?q=r&l=Scotland
For this, we'll use crul for HTTP and rvest for HTML parsing.
We'll write a fast and easy-to-understand scraper by separating our logic into single-purpose functions.
Let's start our scraper from the bottom up by defining our constants and our result parsing function:
library("crul")
library("rvest")
HEADERS <- list(
"User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding" = "gzip, deflate, br",
"Accept-Language" = "en-US,en;q=0.9"
)
Here, we're defining our HEADERS constant. To avoid having our scraper blocked, we need to ensure it looks like a web browser. For this, we're simply copying the headers a Chrome browser would use on a Windows computer.
Next, we can define our parsing function. We'll use CSS selectors to extract the job details and for that, we can use Chrome developer tools - take a look at this quick introduction:
In R, our parser logic would look something like this:
parse_search <- function(response){
# build rvest tree
tree <- read_html(response$parse())
# find total jobs available
total_jobs <- tree %>% html_element("#searchCountPages") %>% html_text2()
total_jobs <- strtoi(stringr::str_match(total_jobs, "(\\d+) jobs")[,2])
# find displayed job container boxes:
jobs <- tree %>% html_elements('#mosaic-zone-jobcards .result')
# extract job listing from each job container box
parsed <- list()
for (job in jobs){
parsed <- append(parsed, list(
title = job %>% html_element('h2') %>% html_text2(),
company = job %>% html_element('.companyOverviewLink') %>% html_text2(),
location = job %>% html_element('.companyLocation') %>% html_text2(),
date = job %>% html_element(xpath=".//span[contains(@class,'date')]/text()") %>% html_text2(),
url = url_build(response$url, job %>% html_attr('href')),
company_url = url_build(response$url, job %>% html_element(xpath='.//a[contains(@class,"companyOverviewLink")]') %>% html_attr('href'))
))
}
# return parsed jobs and total job count in the query
print(glue("found total {length(jobs)} jobs from total of {total_jobs}"))
list(jobs=parsed, total=total_jobs)
}
Here we have a function that takes a response object, builds an rvest HTML tree and, using a combination of CSS and XPath selectors, extracts the job listing details.
We do this by first selecting the job listing container boxes (the li elements). Then, we iterate through each one of them and extract the job details.
We can test this code by explicitly scraping one search page:
url <- "https://uk.indeed.com/jobs?q=r&l=scotland")
response <- HttpClient$new(url, header=HEADERS)$get()
print(parse_search(response))
This will return the results of the first query page and the count of total listings:
found total 15 jobs from total of 330
$jobs
$jobs$title
[1] "Parts Supervisor"
$jobs$company
[1] "BMW Group Retail"
$jobs$location
[1] "Edinburgh"
$jobs$date
[1] "4 days ago"
$jobs$url
[1] "https://uk.indeed.com/rc/clk?jk=ad7698381a9870de&fccid=bf564b9e3f3db3fc&vjs=3"
$jobs$company_url
[1] "https://uk.indeed.com/cmp/BMW-Group"
<...>
$total
[1] 330
To finish our scraper let's add a scraping loop that will scrape the first page, parse the results and then scrape the remaining pages in parallel:
scrape_search_page <- function(query, location, offset=0, limit=10){
# first we need to create all urls we'll be scraping based on offset and limit arguments
# 0:10 will create first page scrape url, and 10:80 will create 1-8 pages since there are 10 results per page
print(glue("scraping {query} at {location} in range {offset}:{limit}"))
urls <- list()
for (i in seq(offset, limit - 10, by=10)){
urls <- append(urls, glue("https://uk.indeed.com/jobs?q={query}&l={location}&start={i}"))
}
# then we want to retrieve these urls in parallel:
print(glue("scraping search page urls: {urls}"))
responses <- Async$new(
urls = urls,
headers = HEADERS
)$get()
# finally we want to unpack results of each individual page into final dataset
found_jobs <- list()
total_jobs <- NULL
for (response in responses){
page_results <- parse_search(response)
found_jobs <- append(found_jobs, page_results$jobs)
total_jobs <- page_results$total
}
# we return jobs we parsed and total jobs presented in the search page:
list(jobs=found_jobs, total=total_jobs)
}
scrape_search <- function(query, location){
# this is our main function that scrapes all pages of the query explicitly
# first, we scrape the first page
first_page_results <- scrape_search_page(query, location)
found_jobs <- first_page_results$jobs
total_jobs <- first_page_results$total
# then we scrape remaining pages:
print(glue("scraped first page, found {length(found_jobs)} jobs; continuing with remaining: {total_jobs}"))
remaining_page_results <- scrape_search_page(query, location, offset = 10, limit = total_jobs)
# finally, we return combined dataset
append(found_jobs, remaining_page_results$jobs)
}
Here we've defined our scraping loop which utilizes the parallel connection feature of crul. Let's try our scraper and time it:
start = Sys.time()
print(scrape_search("r", "Scotland"))
print(Sys.time() - start)
We can see that we scraped over 300 job listings in under 10 seconds! We can easily embed this tiny scraper into our R workflow without needing to worry about caching or data storage, as each dataset scrape only takes a few seconds to refresh.
Our example, while complete, is not production ready, as web scraping suffers from many challenges like web scraper blocking and connection issues.
For this, let's take a look at ScrapFly's web scraping API and how we could use it in R to abstract these issues away to the ScrapFly service, so we don't have to deal with them ourselves!
Web scraping with R can be surprisingly accessible; however, scaling up R scrapers can be difficult due to the R ecosystem's lack of infrastructure for bypassing blocking and scaling. This is where ScrapFly can lend a hand!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
Let's take a look how can we use ScrapFly's API to scrape content with all of these features in R!
library(crul)
SCRAPFLY_KEY <- "YOUR_API_KEY"
scrapfly <- function(url,
render_js = FALSE,
wait_for_selector = NULL,
asp = FALSE,
country = "us",
proxy_pool = "public_datacenter_pool") {
url_build("https://api.scrapfly.io/scrape", query = list(
key = SCRAPFLY_KEY,
url = url,
asp = if (asp) "true" else "false",
render_js = if (render_js) "true" else "false",
wait_for_selector = wait_for_selector,
country = country,
proxy_pool = proxy_pool
))
}
response <- HttpClient$new(scrapfly('https://httpbin.org/headers'))$get()
response_with_javascript_render <- HttpClient$new(scrapfly('https://httpbin.org/headers', render_js = TRUE))$get()
data <- jsonlite::fromJSON(response$parse())
print(data$result$content)
print(data$result$response_headers)
print(data$result$cookies)
print(data$config)
In the example above, we define a function that turns a regular web URL into a ScrapFly API URL, which we can pass to our requests to take advantage of ScrapFly's features with no extra effort.
For more features, usage scenarios and other info see ScrapFly's documentation!
In this extensive introduction to web scraping in R we've taken a look at HTTP connection strategies using the crul library: how to submit GET and POST requests, change headers, keep track of cookies and, finally, how to do all that efficiently using asynchronous (parallel) requests.
With HTTP bits figured out, we've taken a look at HTML parsing using rvest which supports both CSS and XPath selector-based element extraction.
Finally, we put everything together in a small example web scraper for https://uk.indeed.com/ and took a quick look at some common web scraping challenges and how we can use the ScrapFly web scraping API to solve them for us!