Web Scraping With PHP 101

In this web scraping tutorial we'll take a look at PHP and how it can be used to scrape the web. While JavaScript and Python are the most popular languages for web scraping, PHP has most of the same tools available, which we'll take a deep look into today.

We'll start with an overview of scraping basics like how to send HTTP requests and how to parse HTML - all of this using two of the most popular PHP web scraping libraries: Guzzle and DomCrawler.

Finally, we'll wrap everything up with a real-life example project by scraping product information from https://www.producthunt.com/.

What is web scraping?

Web scraping is public data collection and there are thousands of reasons why one might want to collect this public data, ranging from finding potential employees to competitive intelligence.

We at ScrapFly did extensive research into web scraping applications - see our web scraping use cases article.

Why Web Scrape with PHP?

PHP is well known for being one of the most popular server-side web languages, which means it's great for embedded real-time scrapers! Not only that, PHP runs on many systems and is easily accessible.

Setup

We need two tools: an HTTP client and an HTML parser.
Both of these tools are available in PHP in the form of several community libraries, though in this tutorial we'll focus on two libraries in particular:

  • Guzzle - HTTP client library that helps us retrieve web page contents.
  • DomCrawler - HTML parsing client that helps us extract specific details we want from full web page HTML documents.
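Both libraries can be installed through Composer. Assuming a standard Composer setup, the install command would look something like this (symfony/css-selector is an optional extra we'll use later for CSS selector support):

composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector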

We'll split this tutorial into two parts, each reflecting one of these tools: first, we'll take a look at retrieving HTML documents using Guzzle, and then we'll parse these documents using the DOM parsing capabilities of DomCrawler.

Making Requests

PHP offers numerous HTTP clients, however the two most commonly used ones are the standard library's curl client and the most popular community client, Guzzle.
There are many differences between these two clients, but when it comes to web scraping the main ones are:

  • Guzzle offers a better user experience
    A more modern and user-friendly library API allows us to handle exceptions, retries and failures much easier making Guzzle web scrapers easier to maintain.
  • Guzzle offers async support
    We'll talk more about synchronous vs asynchronous code, but essentially an asynchronous client allows us to make multiple requests in parallel, making our web scraper much faster!
  • Guzzle doesn't support the newer http2 or http3 (QUIC) protocols
    Currently, there are 3 active HTTP protocols: http1.1, http2 and http3 (aka QUIC). While performance gains for web scraping in http2/3 aren't significant, http1 connections can result in the web scraper being blocked. That being said, we'll cover some ways to get around this later in the article.

So, to summarize, Guzzle is easier to use and often faster, while the curl library is more feature-rich but more difficult to use and harder to optimize. We'll stick with Guzzle for the time being, but before we take it for a spin let's do a quick overview of what HTTP actually is.
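To illustrate the difference, here's a rough sketch of the same GET request written with the standard curl extension and with Guzzle (httpbin.org is just a stand-in target):

use GuzzleHttp\Client;

// curl: manual handle setup, option flags and cleanup
$ch = curl_init('https://httpbin.org/html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$body = curl_exec($ch);
curl_close($ch);

// Guzzle: one client object and a single method call
$client = new Client();
$body = $client->get('https://httpbin.org/html')->getBody()->getContents();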

HTTP Protocol Fundamentals

To collect data from a public resource, we need to establish a connection with it first. Most of the web is served over HTTP, which is rather simple: we (the client) send a request for a specific document to the website (the server), and once the server processes our request it replies with a response (the document) - a very straightforward exchange!

illustration of a standard http exchange

As you can see in this illustration: we send a request object which consists of a method (aka type), location and headers. In turn, we receive a response object which consists of the status code, headers and the document content itself.
Let's take a quick look at each of these components, what they mean and how they are relevant in web scraping.

What are Requests and Responses?

When it comes to web scraping, we don't exactly need to know every little detail about the HTTP protocol, though we should be familiar with the concept of requests and responses.

Request Method

HTTP requests are conveniently divided into a few types that perform distinct functions:

  • GET requests are intended to request a document.
  • POST requests are intended to request a document by sending a document.
  • HEAD requests are intended to request the document's meta information.

We'll mostly encounter these three in web scraping. We'll be using GET to retrieve web pages, POST to submit search forms and other web page actions and HEAD to poke web pages and see whether they're worth scraping.
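For example, a quick HEAD request lets us check a page's status and content type without downloading the whole body - here's a minimal sketch using Guzzle (which we'll properly introduce later; httpbin.org is a stand-in target):

use GuzzleHttp\Client;

$client = new Client();
// HEAD returns only the status and headers - no document body
$response = $client->head('https://httpbin.org/html');
printf("status: %d\n", $response->getStatusCode());
printf("content type: %s\n", $response->getHeaderLine('Content-Type'));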

Other request methods that are rarely encountered in web scraping are:

  • PATCH requests are intended to update a document.
  • PUT requests are intended to either create a new document or update it.
  • DELETE requests are intended to delete a document.

It's unlikely we'll see these when web scraping, but it's good to know what they are nevertheless.

Request Location

A URL (Uniform Resource Locator) indicates which resource we are requesting. We can think of it as an ID made from several different parts:

illustration showing general URL structure
Example of a URL structure

Here, we can visualize each part of a URL: we have the protocol which when it comes to HTTP is either http or https, then we have the host which is the address (or domain) of the server, and finally, we have the location of the resource and some custom parameters.
If you're ever unsure of a URL's structure, you can always fire up PHP's interactive shell (php -a) and let it figure it out for you:

php > var_dump(parse_url("https://www.domain.com/path/to/resource?arg1=true&arg2=false"));
array(4) {
  'scheme' =>
  string(5) "https"
  'host' =>
  string(14) "www.domain.com"
  'path' =>
  string(17) "/path/to/resource"
  'query' =>
  string(20) "arg1=true&arg2=false"
}

Request Headers

While it might appear that request headers are just minor metadata details, in web scraping they are extremely important.
Headers contain essential details about the request, like who's requesting the data and what type of data they're expecting. Getting these wrong might result in the web scraper being denied access or receiving an error response.

Let's take a look at some of the most important headers and what they mean:

User-Agent is an identity header that tells the server who's requesting the document.

# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

Whenever you visit a web page, your web browser identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers".
This helps the server to determine whether to serve or deny the client. In web scraping, we (obviously) don't want to be denied access, so we have to blend in by faking our user agent to look like that of a browser.

There are many online databases that contain the latest user-agent strings of various platforms, like the user agent database by whatismybrowser.com

Cookie is used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc.
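In Guzzle, cookies can be managed through a cookie jar so they persist between requests - a minimal sketch (the cookie name and value are placeholders):

use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;

// pre-populate a cookie jar for the target domain (placeholder values):
$jar = CookieJar::fromArray(['session' => 'example-session-id'], 'httpbin.org');
$client = new Client(['cookies' => $jar]);
// the Cookie header is now attached automatically, and any Set-Cookie
// response headers are stored back into the jar for later requests
$response = $client->get('https://httpbin.org/cookies');
echo $response->getBody()->getContents();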

Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content we're expecting. Generally, when web scraping we want to mimic that of a popular web browser; for example, the Chrome browser uses:

text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of the scraped website/webapp.

These are a few of the most important observations; for more, see the extensive documentation over at MDN's standard HTTP header documentation.

Response Status Code

Conveniently, all HTTP responses come with a status code that indicates whether the request was a success, a failure, or whether some alternative action is required (like authentication).
Let's take a quick look at the status codes that are most relevant to web scraping:

  • 200 range codes mean success!
  • 300 range codes mean redirection - the page is somewhere else now.
    In other words, if we request the content of /product1.html it might be moved to a new location like /products/1.html.
  • 400 range codes mean the request is malformed or denied.
    This usually happens if our web scraper is missing some headers, cookies or authentication details.
  • 500 range codes typically mean server issues.
    The website might be unavailable right now or is purposefully disabling access to our web scraper.

For more on HTTP status codes, see HTTP status documentation at MDN
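In Guzzle, the status code is available straight from the response object, and client options control how redirects and error codes are handled - a minimal sketch (option values picked for illustration):

use GuzzleHttp\Client;

$client = new Client([
    'allow_redirects' => false, // don't follow 300 range redirects automatically
    'http_errors'     => false, // don't throw exceptions on 400/500 range codes
]);
$response = $client->get('https://httpbin.org/status/404');
$status = $response->getStatusCode();
if ($status >= 200 && $status < 300) {
    echo "success - safe to parse the body\n";
} elseif ($status >= 300 && $status < 400) {
    printf("redirected to: %s\n", $response->getHeaderLine('Location'));
} else {
    printf("denied or failed with code %d\n", $status);
}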

Response Headers

When it comes to web scraping, response headers provide some important information about connection functionality and efficiency. For example, the Set-Cookie header requests our client to save some cookies for future requests, which might be vital for website functionality. Other headers such as Etag and Last-Modified are intended to help the client with caching to optimize resource usage.

Finally, just like with request headers, headers prefixed with an X- are custom web functionality headers.
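Response headers can be read straight from the Guzzle response object - a short sketch inspecting caching and cookie headers (httpbin.org again used as a stand-in):

use GuzzleHttp\Client;

$client = new Client();
$response = $client->get('https://httpbin.org/html');
// caching hints the server may provide (empty string if the header is absent):
printf("ETag: %s\n", $response->getHeaderLine('ETag'));
printf("Last-Modified: %s\n", $response->getHeaderLine('Last-Modified'));
// cookies the server asks us to store for future requests:
if ($response->hasHeader('Set-Cookie')) {
    printf("Set-Cookie: %s\n", implode('; ', $response->getHeader('Set-Cookie')));
}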


We've taken a brief overview of the core HTTP components, and now it's time to give it a go and see how HTTP works in practical PHP!

Making GET Requests

In this section, we'll be using the Guzzle HTTP client and exploring how it's used in common web scraping tasks.

First, we need to create a Client object, also referred to as a connection pooling session or an HTTP persistent connection session. We'll be using this object to handle our configuration and send out requests:

use GuzzleHttp\Client;

$client = new Client();
$url = 'https://httpbin.org/html';
$response = $client->get($url);
//                   ^^^ Here we're using GET request but similarly we can use HEAD or POST
printf("POST request to %s", $url);
printf("status: %s\n", $response->getStatusCode());
printf("headers: %s\n", json_encode($response->getHeaders(), JSON_PRETTY_PRINT));
printf("body: %s", $response->getBody()->getContents());
// alternative to print full response structure use:
var_dump($response);

Here we're using the https://httpbin.org/ HTTP testing service to retrieve a simple HTML page. When run, this script should print out the status code (e.g. 200), the headers (meta information) and the body (document data).

Making POST requests

Sometimes our web scraper might need to submit some sort of form to retrieve HTML results. For example, search queries often use POST requests with query details as JSON values:

use GuzzleHttp\Client;
$client = new Client();
$url = 'https://httpbin.org/post';
$response = $client->post(
    $url,
    ['json' => ['query' => 'foobar', 'page' => 2]]
//   ^^^^^ using json argument we can pass an associative array which will be sent as a json type POST request
//  alternatively we can use form type request:
//  ['form_params' => ['query' => 'foobar', 'page' => 2]]
);
printf("POST request to %s", $url);
printf("status: %s\n", $response->getStatusCode());
printf("headers: %s\n", json_encode($response->getHeaders(), JSON_PRETTY_PRINT));
printf("body: %s", $response->getBody()->getContents());

Guzzle is smart enough to convert our PHP associative array into correct JSON or form values for form submission. Based on the json argument, it'll prepare the request with the appropriate Content-Type/Content-Length headers and convert the body value from an associative array to either JSON or a form object.

Setting Headers to Prevent Blocking

As we've covered before, our requests must provide metadata about themselves, which helps the server determine what content to return.
Often, this metadata can be used to identify web scrapers and block them. Modern web browsers automatically include specific metadata details with every request, so if we don't wish to stand out as a web scraper we should replicate this behavior.

Primarily, the User-Agent and Accept headers are often dead giveaways, so when creating our Client we can set them to values a normal Chrome browser would use:

$client = new Client([
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    ]
]);

This will ensure that every request the client makes will include these default headers.

Note that this is just the tip of the iceberg when it comes to bot blocking and request headers, however just setting User-Agent and Accept headers should make us much harder to detect!

For more on how headers are used in web scraper blocking, see our complete overview tutorial: How Headers Are Used to Block Web Scrapers and How to Fix It.

Now that we know how to properly make requests using Guzzle, let's take a look at how we can make them much faster by using an asynchronous code structure.

Speed Up with Asynchronous Requests

Since the HTTP protocol is a data exchange protocol between two parties, there's a lot of waiting involved.
In other words, when our client sends a request it needs to wait for it to travel all the way to the server and come back, which stalls our program. Why should our program sit idly and wait for requests to travel around the globe? This is called an IO (input/output) block.

The main way to deal with IO blocks in PHP is to use asynchronous promises or callbacks. In other words, when we make a request the HTTP client returns us a "promise" object that will turn into content sometime in the future. This allows us to concurrently schedule multiple requests, which makes our web scraper significantly faster!
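To make the idea of a promise concrete, here's a minimal sketch using Guzzle's getAsync() (depending on your guzzlehttp/promises version, Utils::unwrap() may instead be available as the \GuzzleHttp\Promise\unwrap() function):

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client();
// getAsync() returns immediately with a promise instead of blocking:
$promises = [
    'page1' => $client->getAsync('https://httpbin.org/html'),
    'page2' => $client->getAsync('https://httpbin.org/html'),
];
// both requests are now in flight; wait for them to resolve together:
$responses = Utils::unwrap($promises);
printf("page1 status: %d\n", $responses['page1']->getStatusCode());
printf("page2 status: %d\n", $responses['page2']->getStatusCode());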

Let's take a look at synchronous code making 10 requests:

use GuzzleHttp\Client;

$client = new Client();

$_start = microtime(true);
// Array of 10 urls:
$urls = array_fill(0, 10, 'https://httpbin.org/html');
// Create promise objects from urls array:
$responses = array_map(
  function ($url) use ($client) {
    return $client->get($url);
  },
  $urls
);
printf("finished %d requests in %.2f seconds\n", count($responses), microtime(true) - $_start);

Here we are making 10 requests to https://httpbin.org/html and if we run the code it would take around 5 seconds to complete. It doesn't sound like much but this scales almost linearly: if we make 100 requests that'll be 50 seconds; 1000 requests - will be over 8 minutes!

Instead, let's use asynchronous programming with promises:

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;

// initiate http client
$client = new Client([
    // Client config allows us to set fail conditions
    // for example we can set request timeout options:
    'connect_timeout' => 5,
    'timeout'         => 2.00,
    // note: with 'http_errors' set to false, 400/500 responses won't throw exceptions
    // and will reach the fulfilled callback - set it to true to treat them as failures instead:
    'http_errors'     => false,
]);

// create 10 Request objects:
$urls = array_fill(0, 10, 'https://httpbin.org/html');
$requests = array_map(function ($url) {
    return new Request('GET', $url);
}, $urls);

// define our callbacks:
// This will be called for every successful response
function handleSuccess(Response $response, $index)
{
    global $urls;
    printf("success: %s\n", $urls[$index]);
}

function handleFailure($reason, $index)
{
    global $urls;
    printf(
        "failed: %s, \n  reason: %s\n",
        $urls[$index],
        $reason,
    );
}

// scrape our requests
$_start = microtime(true);
$pool = new Pool($client, $requests, [
    // we can set concurrency limit to prevent scraping too fast which might cause our scraper to be blocked
    'concurrency' => 20,
    'fulfilled' => 'handleSuccess',
    'rejected' => 'handleFailure',
]);
$pool->promise()->wait();
printf("finished %d requests in %.2f seconds\n", count($urls), microtime(true) - $_start);

Here, we have reworked our code from synchronous code to promise + callback/errorback structure. We are creating 10 Request objects and passing them to a request pool which will send them all together.
We also provide 2 functions to our pool: what to do with each successful request and what to do with each failed request. Ideally, we'd want to log/retry failed ones and parse data from good ones.

Here, the same 10 requests finish in 1-2 seconds which is at least 5 times faster than our synchronous example from before. When making thousands of requests the async approach can often be a hundred times faster!
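Ideally, failed requests would also get a second chance. Building on the snippet above (handleSuccess(), $client and $urls come from that example; handleFailureCollect() and $failedUrls are new names introduced here, and would be passed as the 'rejected' callback of the first pool), a minimal retry pass could collect the failed URLs and run one more pool over just those:

use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

// collect failed urls instead of only logging them:
$failedUrls = [];
function handleFailureCollect($reason, $index)
{
    global $urls, $failedUrls;
    $failedUrls[] = $urls[$index];
    printf("failed (will retry): %s\n", $urls[$index]);
}

// ...after the first $pool->promise()->wait() call:
if (!empty($failedUrls)) {
    $retryRequests = array_map(function ($url) {
        return new Request('GET', $url);
    }, $failedUrls);
    $retryPool = new Pool($client, $retryRequests, [
        'concurrency' => 5,
        'fulfilled' => 'handleSuccess',
        'rejected' => function ($reason, $index) {
            printf("gave up after retry: %s\n", $reason);
        },
    ]);
    $retryPool->promise()->wait();
}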


In this section, we've covered how we can retrieve HTML documents and how to do it quickly while avoiding being blocked. Next, let's take a look at how we can extract data from HTML and finally put everything together into one cohesive example.

Parsing HTML Content

HTML (HyperText Markup Language) is a text data structure that powers the web. The great thing about it is that it's intended to be machine-readable text content, which is great news for web scraping as we can easily parse the data with code!

The HTML DOM (Document Object Model) is a tree-type structure that lends itself easily to machine parsing. For example, let's take this simple HTML content:

<head>
  <title>My Website</title>
</head>
<body>
  <h1>Introduction</h1>
  <div>
    <p>some description text: </p>
    <a class="link" href="http://example.com">example link</a>
  </div>
</body>

Here we see an extremely basic HTML document that a simple website might serve. You can already see the tree-like structure just by indentation of the text, but we can even go further and illustrate it:

illustration of a html node tree
example of a HTML node tree. Note that branches are ordered (left-to-right)

This tree structure is brilliant for web scraping as we can easily navigate the whole document.
For example, to find the title of the website, we can see that it's inside the <title> HTML element, which in turn is inside the <head> element etc.
In other words - if we wanted to extract 1000 titles for 1000 different pages, we would write a rule to find head->title->text for every one of them.

When it comes to HTML parsing, there are two standard ways to write these rules: CSS selectors and XPath selectors - let's dive further and see how we can use them to parse web-scraped data!

Using DomCrawler

We'll be using DomCrawler as our HTML document parser; it supports both CSS and XPath selectors, which we covered in depth in previous articles: Parsing HTML with CSS Selectors and Parsing HTML with XPath.

Let's start with a simple XPath selector-based parsing example:

use Symfony\Component\DomCrawler\Crawler;

// example html document
$html = <<<'HTML'
<head>
    <title>My Website</title>
</head>
<body>
    <div class="content">
        <h1>First blog post</h1>
        <p>Just started this blog!</p>
        <a href="https://scrapfly.io/blog">Checkout My Blog</a>
    </div>
</body>
HTML;

// first we build our Crawler tree
$tree = new Crawler($html);
// then we can run xpaths against it:
printf($tree->filterXPath('//a/@href')->text());
// https://scrapfly.io/blog

In the example above, we defined an example HTML document, built a tree object (Crawler) and used a simple XPATH selector to extract the href attribute of the first link.

However, CSS selectors can often be a more elegant solution. For this, we can install the optional symfony/css-selector dependency, which adds CSS selector support to our Crawler object as well:

printf($tree->filter('a')->attr('href'));
// https://scrapfly.io/blog

There's much more to DomCrawler than just XPath and CSS selectors but for web scraping, we're mostly interested in these two features. Now that we're familiar with them, let's build a real web scraper!

Example Project

It's time to put everything we've learned into an example PHP website scraper. In this section, we'll be scraping https://www.producthunt.com/ which essentially is a technical product directory where people submit and discuss new tech products.

Our scraper should find product urls (e.g. https://www.producthunt.com/products/slack#slack) from a product directory (e.g. https://www.producthunt.com/topics/developer-tools) and scrape each product page for the following fields: title, subtitle, votes and tags.

Let's see the full scraping script and then take a look at individual actions/components:

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;
use Symfony\Component\DomCrawler\Crawler;

// initiate http client
$client = new Client([
    'connect_timeout' => 10,
    'timeout'         => 10.00,
    'http_errors'     => true,
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    ]
]);
// global storage where all results will be added to:
$results = [];

// First we define our main scraping loop:
function scrape($urls, $callback, $errback)
{
    // create 10 Request objects:
    $requests = array_map(function ($url) {
        return new Request('GET', $url);
    }, $urls);
    global $client;
    $pool = new Pool($client, $requests, [
        'concurrency' => 5,
        'fulfilled' => $callback,
        'rejected' => $errback,
    ]);
    $pool->promise()->wait();
}


// Then, we define our callbacks:
// 1. This will be called for every product scrape:
function parseProduct(Response $response, $index)
{
    $tree = new Crawler($response->getBody()->getContents());
    $result = [
        // we can use xpath selectors:
        'title' => $tree->filterXpath('//h1')->text(),
        'subtitle' => $tree->filterXpath('//h2')->text(),
        // or css selectors:
        'votes' => $tree->filter("span[class*='bigButtonCount']")->text(),
        // to get multiple elements we need to use each() mapping:
        'tags' => $tree->filterXpath(
            "//div[contains(@class,'topicPriceWrap')]
            //a[contains(@href, '/topics/')]/text()"
        )->each(function ($node, $i) {
            return $node->text();
        }),
    ];
    global $results;
    array_push($results, $result);
}
// 2. This will be called for every directory scrape:
function parseDirectory(Response $response, $index)
{
    $tree = new Crawler($response->getBody()->getContents());
    $urls = $tree->filter("div[class*='item'] a[class*=comments]")->each(
        function ($node, $i) {
            return 'https://www.producthunt.com' . $node->attr('href');
        }
    );
    scrape(
        $urls,
        'parseProduct',
        'logFailure',
    );
}


// 3. This will be called for every failed request be it product or directory:
function logFailure($reason, $index)
{
    printf("failed: %s\n", $reason);
}

// Finally, we can define our scrape logic and run the scraper:
$start_urls = [
    // define urls where to find product urls, like topic directory:
    "https://www.producthunt.com/topics/developer-tools",
];

$_start = microtime(true);
scrape($start_urls, 'parseDirectory', 'logFailure');
printf("scraped %d results in %.2f seconds\n", count($results), microtime(true) - $_start);
echo json_encode($results, JSON_PRETTY_PRINT);

This looks pretty lengthy, so let's break it down and take a look at the individual steps we're doing here:

  1. We establish our global Client object which will handle all connections.
  2. Then we define our asynchronous scraper function that takes a list of URLs to scrape and 2 functions (or function names) that will be called for successes or failures. This is our abstract scraping executor.
  3. Further, we define our parsing callbacks. When a product scrape succeeds, parseProduct() will be called, which will extract data from the HTML and append it to the $results storage variable.
  4. We do the same thing with parseDirectory(), which will be called when a directory scrape succeeds and will scrape all found products.
  5. We also need a common failure handler, which is our logFailure() function. Ideally, in production, we'd want to implement some sort of retry functionality or store failures in a database to retry later (for now, let's just log them).
  6. Finally, we finish everything off with a tiny script that executes our logic. We define start_urls, which contains URLs to product directories, and schedule the entire scrape logic.

If we run this script we should see output something like this:

scraped 20 results in 9.25 seconds
[
    {
        "title": "Unsplash 5.0",
        "subtitle": "Free (do whatever you want) high-resolution photos.",
        "votes": "7,003",
        "tags": [
            "Web App",
            "Design Tools",
            "Photography"
        ]
    },
    {
        "title": "Sublime Text 3.0",
        "subtitle": "The long awaited version 3 of the popular code editor",
        "votes": "5,579",
        "tags": [
            "Linux",
            "Windows",
            "Mac"
        ]
    },
    ...

ScrapFly API in PHP

Web scraping with PHP can be surprisingly straightforward, however scaling up PHP scrapers can still be difficult - and this is where ScrapFly can lend a hand!

scrapfly middleware

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

Let's take a quick look at how we can enable the ScrapFly middleware in a PHP web scraper:

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;

$SCRAPFLY_KEY = 'YOUR_API_KEY';

// function that creates a ScrapFly request for a given url
// see more on:
// https://scrapfly.io/docs/scrape-api/getting-started?language=php
function scrapflyRequest($url, array $config = [])
{
    global $SCRAPFLY_KEY;
    $query = array_merge([
        'key' => $SCRAPFLY_KEY,
        'url' => $url, // http_build_query will url-encode the value for us
    ], $config);
    return new Request(
        'GET',
        'https://api.scrapfly.io/scrape?' . http_build_query($query)
    );
}

$client = new Client();
$response = $client->send(
    scrapflyRequest('https://www.producthunt.com/products/slack#slack', [
        // will use browser to render page with javascript: https://scrapfly.io/docs/scrape-api/javascript-rendering
        'render_js' => 'true',
        // select proxy location: https://scrapfly.io/docs/scrape-api/proxy
        'country' => 'us',
        // use custom proxy pools like residential or mobile proxies: https://scrapfly.io/docs/scrape-api/proxy
        // 'proxy_pool' => 'public_residential_proxy',
        // use anti bot bypass: https://scrapfly.io/docs/scrape-api/anti-scraping-protection
        'asp' => 'true',
        // return DNS data: https://scrapfly.io/docs/scrape-api/dns
        'dns' => 'true',
        // return SSL data: https://scrapfly.io/docs/scrape-api/ssl
        'ssl' => 'true',
        // debug request: https://scrapfly.io/docs/scrape-api/debug
        'debug' => 'true',
    ])
);
$data = json_decode($response->getBody()->getContents());
var_dump($data->result->content);

In this example, we're making a simple request to https://www.producthunt.com/products/slack#slack through ScrapFly with special options like proxy location, javascript rendering and many more! Using ScrapFly allows us to focus on creating web scrapers rather than various connectivity issues and spider blocks - give it a go!

FAQ

Let's wrap this article up with some frequently asked questions regarding web scraping in PHP:

Can headless browsers be used in PHP scrapers?

Yes, php-webdriver can be used as a Selenium client to launch a real web browser and retrieve web data using browser actions instead of the Guzzle HTTP client we used today.
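A minimal sketch of that approach (assuming the php-webdriver/webdriver package is installed and a Selenium or chromedriver server is listening on localhost:4444):

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// connect to the locally running webdriver server (assumed address):
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());
$driver->get('https://www.producthunt.com/products/slack#slack');
// the rendered page source can then be fed to DomCrawler just like a Guzzle response body:
$html = $driver->getPageSource();
$driver->quit();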

What's the difference between Crawling and Scraping?

Web crawling involves a few extra components that help the scraper to discover web pages. In this tutorial, we've covered scraping as we provided URLs to scrape directly. On the other hand, a web crawler would be a program that can find product URLs by itself by exploring the given website.
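For illustration, the discovery step of a simple crawler could reuse DomCrawler to collect every on-site link from a downloaded page and feed those URLs back into the scraping queue - a rough sketch (the $html variable stands for a previously retrieved page):

use Symfony\Component\DomCrawler\Crawler;

// the second argument sets the base uri so relative links can be resolved:
$tree = new Crawler($html, 'https://www.producthunt.com/');
$found = $tree->filter('a')->each(function (Crawler $node) {
    return $node->link()->getUri();
});
// keep only on-site links; these would be de-duplicated and queued for scraping:
$onSite = array_filter($found, function ($url) {
    return strpos($url, 'https://www.producthunt.com/') === 0;
});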

Summary

In this extensive introduction article, we've taken an overview of basic web scraping in PHP. We quickly introduced ourselves to the HTTP protocol and the HTML tree structure. Further, we've taken a look at the two most popular web scraping libraries: Guzzle, which is a modern HTTP client, and DomCrawler, which allows us to parse data from HTML documents using either XPath or CSS selectors.
Finally, we wrapped everything up with some examples and a small product data scraper of https://www.producthunt.com/.

That's just the beginning of your web scraping journey. We haven't covered a lot of challenges in web scraping, like access blocking, proxies, dynamic content and many scaling techniques - there's still a lot to learn, but this introduction should be a good starting point.

To wrap this up, we'll take a look at ScrapFly's middleware service, which automatically resolves common web scraping issues like blocking and dynamic data rendering - give it a shot for free!