In this web scraping tutorial we'll take a look at PHP and how it can be used to scrape the web. While JavaScript and Python are the most popular languages for web scraping, PHP has most of the same tools available, which we'll take a deep look at today.
We'll start with an overview of scraping basics like how to send HTTP requests and how to parse HTML - all of this using two of the most popular PHP web scraping libraries: Guzzle and DomCrawler.
Finally, we'll wrap everything up with a real-life example project by scraping product information from https://www.producthunt.com/.
Web scraping is public data collection and there are thousands of reasons why one might want to collect this public data, ranging from finding potential employees to competitive intelligence.
We at ScrapFly did extensive research into web scraping applications; see our web scraping use cases article.
PHP is well known for being one of the most popular server-side web languages, which means it's great for embedded real-time scrapers! Not only that, PHP runs on many systems and is easily accessible.
We need two tools: an HTTP client and an HTML parser.
Both of these tools are available in PHP in the form of several community libraries, though in this tutorial we'll focus on two libraries in particular: Guzzle and DomCrawler.
We'll split this tutorial into two parts, each reflecting one of these tools: first, we'll take a look at retrieving data using Guzzle, and then we'll parse the retrieved documents using the DOM parsing capabilities of DomCrawler.
PHP offers numerous HTTP clients; however, the two most commonly used ones are the standard library's curl client and the most popular community client, Guzzle.
There are many differences between these two clients, but when it comes to web scraping, the main ones come down to ease of use, speed and available features.
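For a rough feel of the curl side of that comparison, here's a minimal sketch of fetching a single page with the built-in curl extension (with Guzzle, as we'll see shortly, the same fetch is a one-line $client->get() call; the httpbin.org address is just the test service used throughout this article):
// fetch a single page using PHP's built-in curl extension
$ch = curl_init('https://httpbin.org/html');
// return the body from curl_exec() instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// follow redirects like a browser would
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$body = curl_exec($ch);
if ($body === false) {
    printf("request failed: %s\n", curl_error($ch));
} else {
    printf("status: %d\n", curl_getinfo($ch, CURLINFO_HTTP_CODE));
    printf("body length: %d bytes\n", strlen($body));
}
curl_close($ch);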
So, to summarize, Guzzle is easier to use and often faster, while the curl library is more feature-rich but more difficult to use and harder to optimize. We'll stick with Guzzle for the time being, but before we take it for a spin, let's do a quick overview of what HTTP is anyway.
To collect data from a public resource, we need to establish a connection with it first. Most of the web is served over HTTP which is rather simple: we (the client) send a request for a specific document to the website (the server) and once the server processes our request it replies with a response (the document) - a very straightforward exchange!
As you can see in this illustration: we send a request object which consists of a method (aka type), location and headers. In turn, we receive a response object which consists of the status code, headers and document content itself.
Let's take a quick look at each of these components, what they mean and how they are relevant to web scraping.
When it comes to web scraping we don't exactly need to know every little detail about HTTP protocol though we should be familiar with the concept of requests and responses.
HTTP requests are conveniently divided into a few types that perform distinct functions:
- GET requests are intended to request a document.
- POST requests are intended to request a document by sending a document.
- HEAD requests are intended to request the document's meta information.

We'll mostly encounter these three in web scraping. We'll be using GET to retrieve web pages, POST to submit search forms and other web page actions and HEAD to poke web pages and see whether they're worth scraping.
Other request methods that are rarely encountered in web scraping are:
- PATCH requests are intended to update a document.
- PUT requests are intended to either create a new document or update it.
- DELETE requests are intended to delete a document.

A URL (Uniform Resource Locator) indicates which resource we are requesting. We can think of it as an ID made from several different parts:
Here, we can visualize each part of a URL: we have the protocol, which when it comes to HTTP is either http or https; then we have the host, which is the address (or domain) of the server; and finally, the location of the resource and some custom parameters.
If you're ever unsure of a URL's structure, you can always fire up PHP's interactive shell (php -a) and let it figure it out for you:
php > var_dump(parse_url("https://www.domain.com/path/to/resource?arg1=true&arg2=false"));
array(4) {
'scheme' =>
string(4) "http"
'host' =>
string(14) "www.domain.com"
'path' =>
string(17) "/path/to/resource"
'query' =>
string(20) "arg1=true&arg2=false"
}
While request headers might appear to be just minor metadata details, in web scraping they are extremely important.
Headers contain essential details about the request, such as who is requesting the data and what type of data they are expecting. Getting these wrong might result in the web scraper being denied access or receiving an error response.
Let's take a look at some of the most important headers and what they mean:
User-Agent is an identity header that tells the server who's requesting the document.
# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
Whenever you visit a web page, your web browser identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers".
This helps the server to determine whether to serve or deny the client. In web scraping, we (obviously) don't want to be denied access, so we have to blend in by faking our user agent to look like that of a browser.
There are many online databases that contain the latest user-agent strings of various platforms, like the user agent database by whatismybrowser.com
Cookie is used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc.
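Jumping slightly ahead to the Guzzle client we'll set up in the next section, here's a minimal sketch of handling cookies through a cookie jar; the cookie names and values are made-up examples, and https://httpbin.org/cookies simply echoes back whatever cookies it receives:
use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;

// pre-fill a cookie jar with example cookies for the httpbin.org domain
$cookies = CookieJar::fromArray(
    ['session_id' => 'example-session', 'currency' => 'USD'],
    'httpbin.org'
);
// the client will send these cookies and store any new ones it receives via Set-Cookie
$client = new Client(['cookies' => $cookies]);
$response = $client->get('https://httpbin.org/cookies');
echo $response->getBody()->getContents();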
Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content we're expecting. Generally, when web scraping we want to mimic the values of one of the popular web browsers; for example, the Chrome browser uses:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
For more, see MDN default accepted values documentation
X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of the scraped website/webapp.
These are a few of the most important observations; for more, see the extensive documentation over at MDN's standard HTTP header documentation
Conveniently, all HTTP responses come with a status code that indicates whether this request is a success, failure or some alternative action is required (like authentication).
Let's take a quick look at the status codes that are most relevant to web scraping:
- 200 range codes generally mean success.
- 300 range codes tend to mean redirection; for example, if we request /product1.html it might be moved to a new location like /products/1.html and the server will point us there.
- 400 range codes mean the request is malformed or denied, which in web scraping often means the scraper has been blocked.
- 500 range codes typically mean server problems, so the request is usually worth retrying later.

For more on HTTP status codes, see HTTP status documentation at MDN
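For example, here's a sketch of how a scraper might branch on status codes using the Guzzle client covered in the next section; the http_errors option disables exceptions for 4xx/5xx responses so we can inspect the code ourselves, and the branching logic is just illustrative:
use GuzzleHttp\Client;

$client = new Client(['http_errors' => false]);
// httpbin.org/status/<code> replies with the requested status code
$response = $client->get('https://httpbin.org/status/404');
$status = $response->getStatusCode();
if ($status >= 200 && $status < 300) {
    echo "success - safe to parse the body\n";
} elseif ($status === 403 || $status === 429) {
    echo "denied or throttled - slow down or change identity\n";
} elseif ($status >= 500) {
    echo "server error - usually worth retrying later\n";
} else {
    printf("unexpected status: %d\n", $status);
}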
When it comes to web scraping, response headers provide some important information for connection functionality and efficiency. For example, the Set-Cookie header requests our client to save some cookies for future requests, which might be vital for website functionality. Other headers such as Etag and Last-Modified are intended to help the client with caching to optimize resource usage.
Finally, just like with request headers, headers prefixed with X- are custom web functionality headers.
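All of these can be read straight off the response object; here's a small sketch using the Guzzle client introduced in the next section (httpbin.org may not send ETag or Last-Modified, in which case those lines simply print empty values):
use GuzzleHttp\Client;

$client = new Client();
$response = $client->get('https://httpbin.org/html');
// getHeaders() returns all headers; getHeaderLine() returns one header as a string
echo json_encode($response->getHeaders(), JSON_PRETTY_PRINT) . "\n";
printf("ETag: %s\n", $response->getHeaderLine('ETag'));
printf("Last-Modified: %s\n", $response->getHeaderLine('Last-Modified'));
printf("Set-Cookie: %s\n", $response->getHeaderLine('Set-Cookie'));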
We took a brief overview of the core HTTP components, and now it's time to give it a go and see how HTTP works in practical PHP!
In this section, we'll be using the Guzzle HTTP client and explore how it's used in common web scraping tasks.
First, we need to create a Client object, also referred to as a connection pooling session or an HTTP persistent connection session. We'll be using this object to handle our configuration and send out requests:
use GuzzleHttp\Client;
$client = new Client();
$url = 'https://httpbin.org/html';
$response = $client->get($url);
// ^^^ Here we're using GET request but similarly we can use HEAD or POST
printf("POST request to %s", $url);
printf("status: %s\n", $response->getStatusCode());
printf("headers: %s\n", json_encode($response->getHeaders(), JSON_PRETTY_PRINT));
printf("body: %s", $response->getBody()->getContents());
// alternative to print full response structure use:
var_dump($response);
Here we're using the https://httpbin.org/ HTTP testing service to retrieve a simple HTML page. When run, this script should print out the status code (e.g. 200), the headers (meta information) and the body (document data).
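As mentioned earlier, a HEAD request is a cheap way to poke a page before committing to a full download; a quick sketch:
use GuzzleHttp\Client;

$client = new Client();
// HEAD returns the same status and headers as GET but with an empty body
$response = $client->head('https://httpbin.org/html');
printf("status: %d\n", $response->getStatusCode());
printf("content-type: %s\n", $response->getHeaderLine('Content-Type'));
printf("content-length: %s\n", $response->getHeaderLine('Content-Length'));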
Sometimes our web scraper might need to submit some sort of form to retrieve HTML results. For example, search queries often use POST requests with query details as JSON values:
use GuzzleHttp\Client;
$client = new Client();
$url = 'https://httpbin.org/post';
$response = $client->post(
$url,
['json' => ['query' => 'foobar', 'page' => 2]]
// ^^^^^ using json argument we can pass an associative array which will be sent as a json type POST request
// alternatively we can use form type request:
// ['form_params' => ['query' => 'foobar', 'page' => 2]]
);
printf("POST request to %s", $url);
printf("status: %s\n", $response->getStatusCode());
printf("headers: %s\n", json_encode($response->getHeaders(), JSON_PRETTY_PRINT));
printf("body: %s", $response->getBody()->getContents());
Guzzle is smart enough to convert our PHP associative array into correct JSON or form values for form submission. Based on the json or form_params argument, it'll prepare the request with the appropriate Content-Type and Content-Length headers and convert the body value from an associative array to either JSON or a form object.
As we've covered before, our requests must provide metadata about themselves, which helps the server determine what content to return.
Often, this metadata can be used to identify web scrapers and block them. Modern web browsers automatically include specific metadata details with every request so if we wish to not stand out as a web scraper we should replicate this behavior.
Primarily, the User-Agent and Accept headers are often dead giveaways, so when creating our Client we can set them to values a normal Chrome browser would use:
$client = new Client([
'headers' => [
'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
]
]);
This will ensure that every request the client makes will include these default headers.
Note that this is just the tip of the iceberg when it comes to bot blocking and request headers; however, just setting the User-Agent and Accept headers should make us much harder to detect!
Now that we know how to properly make requests using Guzzle, let's take a look at how we can make them much faster by using an asynchronous code structure.
Since HTTP is a data exchange protocol between two parties, there's a lot of waiting involved.
In other words, when our client sends a request it needs to wait for it to travel all the way to the server and come back which stalls our program. Why should our program sit idly and wait for requests to travel around the globe? This is called an IO (input/output) block.
The main way to deal with IO blocks in PHP is to use asynchronous promises or callbacks. In other words, when we make a request, the HTTP client returns a "promise" object that will turn into content sometime in the future. This allows us to concurrently schedule multiple requests, which makes our web scraper significantly faster!
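Before comparing synchronous and asynchronous scraping, here's the promise idea in its smallest form, sketched with Guzzle's getAsync():
use GuzzleHttp\Client;
use Psr\Http\Message\ResponseInterface;

$client = new Client();
// getAsync() returns a promise immediately instead of blocking until the response arrives
$promise = $client->getAsync('https://httpbin.org/html');
$promise->then(
    // called once the response arrives
    function (ResponseInterface $response) {
        printf("resolved with status %d\n", $response->getStatusCode());
    },
    // called if the request fails
    function ($reason) {
        printf("failed: %s\n", $reason);
    }
);
// nothing has been printed yet - the promise only resolves once we wait on it (or on a pool of promises)
$promise->wait();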
Let's take a look at synchronous code making 10 requests:
use GuzzleHttp\Client;
$client = new Client();
$_start = microtime(true);
// Array of 10 urls:
$urls = array_fill(0, 10, 'https://httpbin.org/html');
// Create promise objects from urls array:
$responses = array_map(
function ($url) use ($client) {
return $client->get($url);
},
$urls
);
printf("finished %d requests in %.2f seconds\n", count($responses), microtime(true) - $_start);
Here we are making 10 requests to https://httpbin.org/html and if we run the code it would take around 5 seconds to complete. It doesn't sound like much, but this scales almost linearly: if we make 100 requests that'll be 50 seconds; 1000 requests will take over 8 minutes!
Instead, let's use asynchronous programming with promises:
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;
// initiate http client
$client = new Client([
// Client config allows us to set fail conditions
// for example we can set request timeout options:
'connect_timeout' => 5,
'timeout' => 2.00,
// we can also decide whether 4xx/5xx responses should throw and be rejected;
// setting http_errors to false passes them to the fulfilled callback instead:
'http_errors' => false,
]);
// create 10 Request objects:
$urls = array_fill(0, 10, 'https://httpbin.org/html');
$requests = array_map(function ($url) {
return new Request('GET', $url);
}, $urls);
// define our callbacks:
// This will be called for every successful response
function handleSuccess(Response $response, $index)
{
global $urls;
printf("success: %s\n", $urls[$index]);
}
function handleFailure($reason, $index)
{
global $urls;
printf(
"failed: %s, \n reason: %s\n",
$urls[$index],
$reason,
);
}
// scrape our requests
$_start = microtime(true);
$pool = new Pool($client, $requests, [
// we can set concurrency limit to prevent scraping too fast which might cause our scraper to be blocked
'concurrency' => 20,
'fulfilled' => 'handleSuccess',
'rejected' => 'handleFailure',
]);
$pool->promise()->wait();
printf("finished %d requests in %.2f seconds\n", count($urls), microtime(true) - $_start);
Here, we have reworked our code from a synchronous structure to a promise + callback/errorback structure. We are creating 10 Request objects and passing them to a request pool which will send them all together.
We also provide 2 functions to our pool: what to do with each successful request and what to do with each failed request. Ideally, we'd want to log/retry failed ones and parse data from good ones.
Here, the same 10 requests finish in 1-2 seconds which is at least 5 times faster than our synchronous example from before. When making thousands of requests the async approach can often be a hundred times faster!
In this section we've covered how we can retrieve HTML documents and how to do it quickly while avoiding being blocked. Next, let's take a look at how we can extract data from HTML and finally put everything together into one cohesive example.
HTML (HyperText Markup Language) is a text data structure that powers the web. The great thing about it is that it's intended to be machine-readable text content, which is great news for web scraping as we can easily parse the data with code!
The HTML DOM (Document Object Model) is a tree-type structure that lends itself easily to machine parsing. For example, let's take this simple HTML content:
<head>
<title>
</title>
</head>
<body>
<h1>Introduction</h1>
<div>
<p>some description text: </p>
<a class="link" href="http://example.com">example link</a>
</div>
</body>
Here we see an extremely basic HTML document that a simple website might serve. You can already see the tree-like structure just by indentation of the text, but we can even go further and illustrate it:
This tree structure is brilliant for web scraping as we can easily navigate the whole document.
For example, to find the title of the website, we can see that it's under the <title> HTML element, which in turn is under the <head> element, and so on.
In other words, if we wanted to extract 1000 titles from 1000 different pages, we would write a rule to find head->title->text for every one of them.
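As a quick preview of the DomCrawler library introduced below, that head->title->text rule translates almost directly into an XPath query (using a tiny inline document for illustration):
use Symfony\Component\DomCrawler\Crawler;

// the "head -> title -> text" rule expressed as an XPath selector
$tree = new Crawler('<html><head><title>My Website</title></head><body></body></html>');
echo $tree->filterXPath('//head/title')->text();
// My Website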
When it comes to HTML parsing, there are two standard ways to write these rules: CSS selectors and XPath selectors; let's dive further and see how we can use them to parse web-scraped data!
We'll be using DomCrawler as our HTML document parser; it supports both CSS and XPath selectors, which we covered in depth in previous articles: Parsing HTML with CSS Selectors and Parsing HTML with Xpath
Let's start with a simple XPath selector-based parsing example:
use Symfony\Component\DomCrawler\Crawler;
// example html document
$html = <<<'HTML'
<head>
<title>My Website</title>
</head>
<body>
<div class="content">
<h1>First blog post</h1>
<p>Just started this blog!</p>
<a href="https://scrapfly.io/blog">Checkout My Blog</a>
</div>
</body>
HTML;
// first we build our Crawler tree
$tree = new Crawler($html);
// then we can run xpaths against it:
printf($tree->filterXPath('//a/@href')->text());
// https://scrapfly.io/blog
In the example above, we defined an example HTML document, built a tree object (Crawler) and used a simple XPath selector to extract the href attribute of the first link.
However, often CSS selectors can be a more elegant solution. For this, we can install the optional dependency symfony/css-selector which provides CSS selector support to our Crawler object as well:
printf($tree->filter('a')->attr('href'));
// https://scrapfly.io/blog
There's much more to DomCrawler than just XPath and CSS selectors, but for web scraping we're mostly interested in these two features. Now that we're familiar with them, let's build a real web scraper!
It's time to put everything we've learned into an example PHP website scraper. In this section, we'll be scraping https://www.producthunt.com/ which essentially is a technical product directory where people submit and discuss new tech products.
Our scraper should find product urls (e.g. https://www.producthunt.com/products/slack#slack) from a product directory (e.g. https://www.producthunt.com/topics/developer-tools) and scrape each product for fields: title, subtitle, votes and tags:
Let's see the full scraping script and then take a look at individual actions/components:
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;
use Symfony\Component\DomCrawler\Crawler;
// initiate http client
$client = new Client([
'connect_timeout' => 10,
'timeout' => 10.00,
'http_errors' => true,
'headers' => [
'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
]
]);
// global storage where all results will be added to:
$results = [];
// First we define our main scraping loop:
function scrape($urls, $callback, $errback)
{
// create a Request object for every url:
$requests = array_map(function ($url) {
return new Request('GET', $url);
}, $urls);
global $client;
$pool = new Pool($client, $requests, [
'concurrency' => 5,
'fulfilled' => $callback,
'rejected' => $errback,
]);
$pool->promise()->wait();
}
// Then, we define our callbacks:
// 1. This will be called for every product scrape:
function parseProduct(Response $response, $index)
{
$tree = new Crawler($response->getBody()->getContents());
$result = [
// we can use xpath selectors:
'title' => $tree->filterXpath('//h1')->text(),
'subtitle' => $tree->filterXpath('//h2')->text(),
// or css selectors:
'votes' => $tree->filter("span[class*='bigButtonCount']")->text(),
// to get multiple elements we need to use each() mapping:
'tags' => $tree->filterXpath(
"//div[contains(@class,'topicPriceWrap')]
//a[contains(@href, '/topics/')]/text()"
)->each(function ($node, $i) {
return $node->text();
}),
];
global $results;
array_push($results, $result);
}
// 2. This will be called for every directory scrape:
function parseDirectory(Response $response, $index)
{
$tree = new Crawler($response->getBody()->getContents());
$urls = $tree->filter("div[class*='item'] a[class*=comments]")->each(
function ($node, $i) {
return 'https://www.producthunt.com' . $node->attr('href');
}
);
scrape(
$urls,
'parseProduct',
'logFailure',
);
}
// 3. This will be called for every failed request be it product or directory:
function logFailure($reason, $index)
{
printf("failed: %s\n", $reason);
}
// Finally, we can define our scrape logic and run the scraper:
$start_urls = [
// define urls where to find product urls, like topic directory:
"https://www.producthunt.com/topics/developer-tools",
];
$_start = microtime(true);
scrape($start_urls, 'parseDirectory', 'logFailure');
printf("scraped %d results in %.2f seconds\n", count($results), microtime(true) - $_start);
echo json_encode($results, JSON_PRETTY_PRINT);
This looks pretty lengthy, so let's break it down and take a look at the individual steps we're doing here:
- First, we create a Client object which will handle all of our connections.
- For every scraped product page, parseProduct() will be called, which will extract data from the HTML and append it to the $results storage variable.
- parseDirectory() will be called when a directory scrape succeeds, and it will schedule scrapes for all found products.
- Every failed request is passed to the logFailure() function. Ideally, in production, we want to implement some sort of retry functionality or store failures in the database to retry later (for now, let's just log them).
- Finally, the scraper is kicked off with start_urls, which contains URLs to product directories and schedules the entire scrape logic.

If we run this script we should see output something like this:
scraped 20 results in 9.25 seconds
[
{
"title": "Unsplash 5.0",
"subtitle": "Free (do whatever you want) high-resolution photos.",
"votes": "7,003",
"tags": [
"Web App",
"Design Tools",
"Photography"
]
},
{
"title": "Sublime Text 3.0",
"subtitle": "The long awaited version 3 of the popular code editor",
"votes": "5,579",
"tags": [
"Linux",
"Windows",
"Mac"
]
},
...
Web scraping with PHP can be surprisingly straightforward; however, scaling up PHP scrapers can still be difficult and this is where ScrapFly can lend a hand!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
Let's take a quick look at how can we enable ScrapFly middleware in a PHP web scraper:
use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;
$SCRAPFLY_KEY = 'YOUR_API_KEY';
// function that creates a ScrapFly request for a given url
// see more on:
// https://scrapfly.io/docs/scrape-api/getting-started?language=php
function scrapflyRequest($url, array $config = [])
{
global $SCRAPFLY_KEY;
$query = array_merge([
'key' => $SCRAPFLY_KEY,
// note: http_build_query() below takes care of url-encoding this value
'url' => $url,
], $config);
$req = new Request(
'GET',
'https://api.scrapfly.io/scrape?' . http_build_query($query)
);
var_dump($req);
return $req;
}
$client = new Client();
$response = $client->send(
scrapflyRequest('https://www.producthunt.com/products/slack#slack', [
// will use browser to render page with javascript: https://scrapfly.io/docs/scrape-api/javascript-rendering
'render_js' => 'true',
// select proxy location: https://scrapfly.io/docs/scrape-api/proxy
'country' => 'us',
// use custom proxy pools like residential or mobile proxies: https://scrapfly.io/docs/scrape-api/proxy
// 'proxy_pool' => 'public_residential_proxy',
// use anti bot bypass: https://scrapfly.io/docs/scrape-api/anti-scraping-protection
'asp' => 'true',
// return DNS data: https://scrapfly.io/docs/scrape-api/dns
'dns' => 'true',
// return SSL data: https://scrapfly.io/docs/scrape-api/ssl
'ssl' => 'true',
// debug request: https://scrapfly.io/docs/scrape-api/debug
'debug' => 'true',
])
);
$data = json_decode($response->getBody()->getContents());
var_dump($data->result->content);
In this example, we're making a simple request to https://www.producthunt.com/products/slack#slack through ScrapFly with special options like proxy location, javascript rendering and many more! Using ScrapFly allows us to focus on creating web scrapers rather than various connectivity issues and spider blocks - give it a go!
Let's wrap this article up with some frequently asked questions regarding web scraping in PHP:
Can PHP control a real web browser for scraping?
Yes, php-webdriver can be used as a Selenium client to launch a real web browser and retrieve web data using web browser actions instead of the Guzzle HTTP client we used today.
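As a rough sketch, assuming php-webdriver is installed and a Selenium or ChromeDriver server is already listening on localhost:4444 (the address is just an example), driving a real browser looks something like this:
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// connect to a running Selenium/ChromeDriver instance (example address)
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());
$driver->get('https://www.producthunt.com/products/slack');
// the fully rendered HTML can then be parsed with DomCrawler just like a Guzzle response body
$html = $driver->getPageSource();
printf("page length: %d\n", strlen($html));
$driver->quit();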
What is the difference between web crawling and web scraping?
Web crawling involves a few extra components that help the scraper to discover web pages. In this tutorial, we've covered scraping, as we provided the URLs to scrape directly. A web crawler, on the other hand, is a program that can find product URLs by itself by exploring the given website.
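To illustrate the distinction, a bare-bones crawl step might look like the sketch below: fetch a page, collect its links and keep the ones that look like product pages (the /products/ URL pattern check is just an illustrative assumption):
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$response = $client->get('https://www.producthunt.com/topics/developer-tools');
$tree = new Crawler($response->getBody()->getContents());
// collect the href of every link on the page
$links = $tree->filter('a')->each(function (Crawler $node) {
    return $node->attr('href');
});
// keep only links that look like product pages (illustrative pattern)
$productUrls = array_filter($links, function ($href) {
    return $href !== null && strpos($href, '/products/') === 0;
});
// a crawler would now add these to its queue and repeat the process on each of them
print_r(array_values(array_unique($productUrls)));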
In this extensive introduction article, we've taken an overview look at basic web scraping in PHP. We quickly introduced ourselves to the HTTP protocol and the HTML tree structure. Further, we've taken a look at the two most popular web scraping libraries: Guzzle, which is a modern HTTP client, and DomCrawler, which allows us to parse data from HTML documents using either XPath or CSS selectors.
Finally, we wrapped everything up with some examples and a small product data scraper for https://www.producthunt.com/.
That's just the beginning of your web scraping journey. We haven't covered a lot of challenges in web scraping like access blocking, proxies, dynamic content and many scaling techniques - there's still a lot to learn, but this introduction should be a good starting point.
To wrap this up, we'll take a look at ScrapFly's middleware service, which automatically resolves common web scraping issues like blocking and dynamic data rendering - give it a shot for free!