Web Scraping With PHP 101


Web scraping is becoming an increasingly popular way to bootstrap projects with publicly available data, and although web scrapers are often written in JavaScript or Python, other languages are just as capable.

In this PHP web scraping tutorial we'll take a deep dive into the available scraping tools, best practices, and tips and tricks. Finally, we'll finish off with a real-life example project: scraping product information from https://www.producthunt.com/.

So what is web scraping?
As you might have guessed, web scraping is essentially automated public data collection, and there are thousands of reasons why one might want to collect this public data, ranging from finding potential employees to gathering competitive intelligence.

We at ScrapFly did extensive research into web scraping applications; see our web scraping use cases article.

Web Scraping Scene In PHP

PHP is well known as one of the most popular server-side web languages, which makes it a great fit for embedded, real-time scrapers! Not only that, PHP runs on many systems and is easily accessible.

When it comes to basic web scraping we essentially need two tools: an HTTP client and an HTML parser. Both are available in PHP in the form of several community libraries.
In this tutorial, we'll focus on two libraries in particular:

  • Guzzle - an HTTP client library that helps us retrieve web page contents.
  • DomCrawler - an HTML parsing library that helps us extract the exact details we want from full HTML documents.

We'll split this tutorial into two parts, each reflecting one of these tools. First, let's start off by taking a look at Guzzle and how we can use it to effectively scrape thousands of HTML documents.
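Both libraries (plus the optional CSS selector support we'll use later with DomCrawler) are distributed through Composer. Assuming a Composer-managed project, they can be installed with:

composer require guzzlehttp/guzzle
composer require symfony/dom-crawler
composer require symfony/css-selector

The code examples in this article assume that Composer's vendor/autoload.php has been included.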

Making Requests

PHP offers numerous HTTP clients; however, the two most commonly used ones are the standard library's cURL client and the most popular community client, Guzzle.
There are many differences between these two clients, but when it comes to web scraping the main ones are:

  • Guzzle offers a better user experience
    Since web scraping involves a lot of exception handling, retrying and general failure management, a more accessible API means more stable, better web scrapers.
  • Guzzle offers async support
    We'll talk more about synchronous vs asynchronous code later, but essentially an asynchronous client allows us to make multiple requests in parallel, making our web scraper much faster!
  • Guzzle doesn't support the newer HTTP/2 or HTTP/3 (QUIC) protocols
    Currently there are 3 active HTTP protocols: HTTP/1.1, HTTP/2 and HTTP/3 (aka QUIC). While the performance gains of HTTP/2 and HTTP/3 aren't significant for web scraping, HTTP/1.1 connections can result in a web scraper being blocked. That being said, we'll cover some ways to get around this later in the article.

So, to summarize: Guzzle is easier to use and often faster, while the cURL library is more feature-rich but harder to use and harder to optimize. We'll stick with Guzzle for the time being, but before we take it for a spin we should familiarize ourselves with the HTTP protocol itself, at least the bits that are important in web scraping.

HTTP Protocol Fundamentals

To collect data from a public resource, we need to establish a connection with it first. Most of the web is served over the HTTP protocol, which is rather simple: we (the client) send a request for a specific document to the website (the server), and once the server processes our request it replies with the requested document - a very straightforward exchange!

illustration of a standard http exchange

As you can see in this illustration: we send a request object which consists of a method (aka type), a location and headers; in turn we receive a response object which consists of a status code, headers and the document content itself. Let's take a quick look at each of these components, what they mean and how they are relevant in web scraping.

Understanding Requests and Responses

When it comes to web scraping we don't need to know every little detail about HTTP requests and responses; however, it's good to have a general overview and to know which parts of this protocol are especially useful in web scraping. Let's take a look at exactly that!

Request Method

HTTP requests are conveniently divided into a few types that perform distinct functions:

  • GET requests are intended to request a document.
  • POST requests are intended to request a document by sending a document.
  • HEAD requests are intended to request a document's meta information.
  • PATCH requests are intended to update a document.
  • PUT requests are intended to either create a new document or update it.
  • DELETE requests are intended to delete a document.

When it comes to web scraping, we are mostly interested in collecting documents, so we'll mostly be working with GET and POST type requests. Additionally, HEAD requests can be useful for optimizing bandwidth - sometimes, before downloading a document, we might want to check its metadata to see whether it's worth the full download, as sketched below.
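For example, here's a minimal sketch (using the Guzzle client introduced later in this article, against the httpbin.org testing service) of checking a document's type and size before committing to a full download:

use GuzzleHttp\Client;

$client = new Client();
// HEAD returns only the response headers - no document body is transferred:
$response = $client->head('https://httpbin.org/html');
printf("type: %s\n", $response->getHeaderLine('Content-Type'));
printf("size: %s bytes\n", $response->getHeaderLine('Content-Length'));
// only follow up with a full GET request if the metadata looks worthwhile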

Request Location

To understand what a resource location is, we should first take a quick look at the structure of a URL itself:

Example of a URL structure

Here, we can see each part of a URL: the protocol, which for the web is either http or https; the host, which is essentially the address of the server; and finally the location of the resource plus some custom parameters.
If you're ever unsure of a URL's structure, you can always fire up PHP's interactive shell (php -a) and let it figure it out for you:

php > var_dump(parse_url("https://www.domain.com/path/to/resource?arg1=true&arg2=false"));
array(4) {
  'scheme' =>
  string(5) "https"
  'host' =>
  string(14) "www.domain.com"
  'path' =>
  string(17) "/path/to/resource"
  'query' =>
  string(20) "arg1=true&arg2=false"
}
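
The query string itself can be unpacked further with PHP's built-in parse_str() - handy when reverse engineering which parameters a website expects:

php > parse_str("arg1=true&arg2=false", $params);
php > var_dump($params);
array(2) {
  'arg1' =>
  string(4) "true"
  'arg2' =>
  string(5) "false"
}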

Request Headers

While it might appear like request headers are just minor metadata details, in web scraping they are extremely important. Headers contain essential details about the request, like: who's requesting the data? What type of data are they expecting? Getting these wrong might result in the web scraper being denied access.

Let's take a look at some of the most important headers and what they mean:

User-Agent is an identity header that tells the server who's requesting the document.

# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

Whenever you visit a web page, your web browser identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers". This helps the server to determine whether to serve or deny the client. In web scraping, we don't want to be denied content, so we have to blend in by faking our user agent to look like that of a browser.

There are many online databases that contain the latest user-agent strings of various platforms, like the user agent database by whatismybrowser.com

Cookie headers are used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences, etc. Cookies are mostly out of scope of this article, but we'll be covering them in the future.
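That said, Guzzle can manage cookies for us automatically. Here's a minimal sketch using a shared cookie jar, where any Set-Cookie values returned by the server are stored and re-sent on subsequent requests (httpbin.org is used here purely as a testing service):

use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;

$jar = new CookieJar();
$client = new Client(['cookies' => $jar]);
// this endpoint sets a cookie through a Set-Cookie response header:
$client->get('https://httpbin.org/cookies/set?foo=bar');
// the jar now holds the cookie and attaches it to follow-up requests:
$response = $client->get('https://httpbin.org/cookies');
echo $response->getBody()->getContents();  // {"cookies": {"foo": "bar"}}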

Accept headers (also Accept-Encoding, Accept-Language, etc.) contain information about what sort of content we're expecting. Generally when web scraping we want to mimic that of one of the popular web browsers; for example, the Chrome browser uses:

text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of the scraped website/webapp.

These are a few of the most important observations; for more, see the extensive documentation in MDN's standard HTTP header documentation.

Response Status Code

Conveniently, all HTTP responses come with a status code that indicates whether the request was a success, a failure, or whether some alternative action is requested (like a request to authenticate). Let's take a quick look at the status codes that are most relevant to web scraping:

  • 200 range codes generally mean success!
  • 300 range codes tend to mean redirection - in other words, if we request content at /product1.html it might have been moved to a new location, like /products/1.html, and the server would inform us about that.
  • 400 range codes mean the request is malformed or denied. Our web scraper could be missing some headers, cookies or authentication details.
  • 500 range codes typically mean server issues. The website might be unavailable right now or might be purposefully disabling access to our web scraper.

For more on HTTP status codes, see the HTTP status documentation at MDN
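
As a quick sketch of how this plays out in practice with Guzzle (the HTTP client we'll introduce in the next section), we can disable automatic exceptions for 4xx/5xx responses and branch on the status code ourselves:

use GuzzleHttp\Client;

$client = new Client(['http_errors' => false]);  // don't throw exceptions on 4xx/5xx
$response = $client->get('https://httpbin.org/status/404');
$status = $response->getStatusCode();
if ($status >= 200 && $status < 300) {
    // success - parse the document
} elseif ($status >= 400 && $status < 500) {
    // denied or malformed - check our headers, cookies or authentication details
} elseif ($status >= 500) {
    // server issue - worth retrying later
}
printf("got status %d\n", $status);
// note: 300 range redirects are followed by Guzzle automatically by default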

Response Headers

When it comes to web scraping, response headers provide some important information for connection functionality and efficiency. For example, the Set-Cookie header requests that our client save some cookies for future requests, which might be vital for website functionality. Other headers, such as Etag and Last-Modified, are intended to help the client with caching to optimize resource usage.

Finally, just like with request headers, headers prefixed with an X- are custom web functionality headers.
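
Reading response headers in Guzzle is straightforward, since responses are PSR-7 objects. A small sketch:

use GuzzleHttp\Client;

$client = new Client();
$response = $client->get('https://httpbin.org/html');
// a single header as a string:
printf("content type: %s\n", $response->getHeaderLine('Content-Type'));
// all headers as an associative array of value lists:
foreach ($response->getHeaders() as $name => $values) {
    printf("%s: %s\n", $name, implode(', ', $values));
}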


We've taken a brief overview of the core HTTP components, and now it's time to see how HTTP works in practical PHP!

Making GET Requests

Now that we're familiar with the HTTP protocol and how it's used in web scraping, let's take a look at how we access it in PHP. In this section, we'll be using the Guzzle HTTP client and exploring how it's used in common web scraping tasks.

First, we need to create a Client object, also referred to as a connection-pooling or HTTP persistent connection session. We'll be using this object to handle our configuration and send out requests:

use GuzzleHttp\Client;

$client = new Client();
$url = 'https://httpbin.org/html';
$response = $client->get($url);
//                   ^^^ Here we're using GET request but similarly we can use HEAD or POST
printf("POST request to %s", $url);
printf("status: %s\n", $response->getStatusCode());
printf("headers: %s\n", json_encode($response->getHeaders(), JSON_PRETTY_PRINT));
printf("body: %s", $response->getBody()->getContents());
// alternatively, to print the full response structure, use:
var_dump($response);

Here we're using the https://httpbin.org/ HTTP testing service to retrieve a simple HTML page. When run, this script should print out the status code (e.g. 200), the headers (meta information) and the body (document data).

Making POST requests

Sometimes our web scraper might need to submit some sort of form to retrieve HTML results. For example, search queries often use POST requests with query details as JSON values:

use GuzzleHttp\Client;
$client = new Client();
$url = 'https://httpbin.org/post';
$response = $client->post(
    $url,
    ['json' => ['query' => 'foobar', 'page' => 2]]
//   ^^^^^ using json argument we can pass an associative array which will be sent as a json type POST request
//  alternatively we can use form type request:
//  ['form_params' => ['query' => 'foobar', 'page' => 2]]
);
printf("POST request to %s", $url);
printf("status: %s\n", $response->getStatusCode());
printf("headers: %s\n", json_encode($response->getHeaders(), JSON_PRETTY_PRINT));
printf("body: %s", $response->getBody()->getContents());

Guzzle is smart enough to convert our PHP associative array into correct JSON or form values for submission. Based on the json (or form_params) argument, it'll prepare the request with the appropriate Content-Type/Content-Length headers and convert the body value from an associative array to either JSON or a form object.

Ensuring Headers

As we've covered before, our requests must provide metadata about themselves, which helps the server determine what content to return. Often, this metadata can be used to identify web scrapers and block them. Modern web browsers automatically include specific metadata details with every request, so if we don't wish to stand out as a web scraper we should replicate this behavior.

Primarily, the User-Agent and Accept headers are often dead giveaways, so when creating our Client we can set them to values a normal Chrome browser would use:

$client = new Client([
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    ]
]);

This will ensure that every request the client makes includes these default headers.
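
We can also adjust headers for an individual request; headers passed in the request options are merged with the client defaults, with the per-request values taking priority for the same header name. A small sketch (the Referer value here is just an illustration):

$response = $client->get('https://httpbin.org/headers', [
    'headers' => [
        // extra header for this request only:
        'Referer' => 'https://www.google.com/',
    ],
]);
// httpbin.org/headers echoes back the headers it received:
echo $response->getBody()->getContents();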

Note that this is just the tip of the iceberg when it comes to bot blocking and request headers; however, just setting the User-Agent and Accept headers should make us much harder to detect!

Now that we know how to properly make requests using Guzzle, let's take a look at how we can make them much faster by using an asynchronous code structure.

Asynchronous Requests

Since the HTTP protocol is a data exchange protocol between two parties, there's a lot of waiting involved. In other words, when our client sends a request it needs to wait for it to travel all the way to the server and come back, which stalls our program. Why should our program sit idly and wait for a request to travel around the globe? This is called an IO (input/output) block.

The main way to deal with IO blocks in PHP is to use asynchronous promises or callbacks. In other words, when we make a request the HTTP client returns a "promise" object that will turn into content sometime in the future. This allows us to concurrently schedule multiple requests, which makes our web scraper significantly faster!
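
Before comparing the two approaches at scale, here's a minimal sketch of what a single Guzzle promise looks like:

use GuzzleHttp\Client;

$client = new Client();
// getAsync() returns a promise immediately instead of blocking:
$promise = $client->getAsync('https://httpbin.org/html');
$promise->then(
    function ($response) {   // called once the response arrives
        printf("got status: %s\n", $response->getStatusCode());
    },
    function ($reason) {     // called if the request fails
        printf("request failed: %s\n", $reason->getMessage());
    }
);
// wait() drives the transfer and resolves the promise:
$promise->wait();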

Let's take a look at synchronous code making 10 requests:

use GuzzleHttp\Client;

$client = new Client();

$_start = microtime(true);
// Array of 10 urls:
$urls = array_fill(0, 10, 'https://httpbin.org/html');
// Send the requests one by one - each call blocks until its response arrives:
$responses = array_map(
  function ($url) use ($client) {
    return $client->get($url);
  },
  $urls
);
printf("finished %d requests in %.2f seconds\n", count($responses), microtime(true) - $_start);

Here we are making 10 requests to https://httpbin.org/html, and if we run the code it takes around 5 seconds to complete. It doesn't sound like much, but this scales almost linearly: 100 requests would take around 50 seconds, and 1,000 requests would take over 8 minutes!

Instead, let's use asynchronous programming with promises:

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;

// initiate http client
$client = new Client([
    // Client config allows us to set fail conditions
    // for example we can set request timeout options:
    'connect_timeout' => 5,
    'timeout'         => 2.00,
    // we can also choose whether 4xx/5xx responses should be treated as failures (exceptions);
    // here we disable that, so such responses are handled by the 'fulfilled' callback instead:
    'http_errors'     => false,
]);

// create 10 Request objects:
$urls = array_fill(0, 10, 'https://httpbin.org/html');
$requests = array_map(function ($url) {
    return new Request('GET', $url);
}, $urls);

// define our callbacks:
// This will be called for every successful response
function handleSuccess(Response $response, $index)
{
    global $urls;
    printf("success: %s\n", $urls[$index]);
}

function handleFailure($reason, $index)
{
    global $urls;
    printf(
        "failed: %s, \n  reason: %s\n",
        $urls[$index],
        $reason,
    );
}

// scrape our requests
$_start = microtime(true);
$pool = new Pool($client, $requests, [
    // we can set concurrency limit to prevent scraping too fast which might cause our scraper to be blocked
    'concurrency' => 20,
    'fulfilled' => 'handleSuccess',
    'rejected' => 'handleFailure',
]);
$pool->promise()->wait();
printf("finished %d requests in %.2f seconds\n", count($urls), microtime(true) - $_start);

Here, we have reworked our code from synchronous calls to a promise + callback/errback structure. We are creating 10 Request objects and passing them to a request pool which will send them all together.
We also provide two functions to our pool: what to do with each successful request and what to do with each failed request. Ideally, we'd want to log/retry the failed ones and parse data from the good ones.

Here, the same 10 requests finish in 1-2 seconds, which is at least 5 times faster than our synchronous example from before. When making thousands of requests the difference can be even bigger, often a hundred times faster!


In this section we've covered how to retrieve HTML documents and how to do it quickly while avoiding being blocked. Next, let's take a look at how we can extract data from HTML and finally put everything together into one cohesive example.

Parsing HTML Content

HTML (HyperText Markup Language) is a text data structure that powers the web. The great thing about it is that it's intended to be machine-readable text content, which is great news for web scraping as we can easily parse the data with code!

HTML is a tree-type structure that lends itself easily to parsing. For example, let's take this simple HTML content:

<head>
    <title>My Website</title>
</head>
<body>
    <h1>Welcome to my website!</h1>
    <div class="content">
        <p>This is my website</p>
        <p>Isn't it great?</p>
    </div>
</body>

Here we see an extremely basic HTML document that a simple website might serve. You can already see the tree-like structure just from the indentation of the text, but we can go even further and illustrate it:

example of a HTML node tree. Note that branches are ordered (left-to-right)

This tree structure is brilliant for web scraping as we can easily navigate the whole document.
For example, to find the title of the website, we can see that it's under the <head> and <title> nodes. In other words, if we wanted to extract 1,000 titles from 1,000 different pages, we would write a rule to find head->title->text for every one of them.

When it comes to HTML parsing, there are two standard ways to write these rules: CSS selectors and XPath selectors. Let's dive further and see how we can use them to parse web-scraped data!

Using DomCrawler

We'll be using DomCrawler as our HTML document parser since it supports both CSS and XPath selectors. We have an extensive tutorial on XPath and CSS selectors which fully applies to PHP and DomCrawler, so we won't be exploring XPath syntax in this article, but we'll be sticking with it for our HTML parsing.

Let's start off with a simple XPATH selector based parsing example:

use Symfony\Component\DomCrawler\Crawler;

// example html document
$html = <<<'HTML'
<head>
    <title>My Website</title>
</head>
<body>
    <div class="content">
        <h1>First blog post</h1>
        <p>Just started this blog!</p>
        <a href="https://scrapfly.io/blog">Checkout My Blog</a>
    </div>
</body>
HTML;

// first we build our Crawler tree
$tree = new Crawler($html);
// then we can run xpaths against it:
printf($tree->filterXPath('//a/@href')->text());
// https://scrapfly.io/blog

In the example above, we defined an example HTML document, built a tree object (Crawler) and used a simple XPath selector to extract the href attribute of the first link.

However, CSS selectors are often a more elegant solution. For this we can install the optional symfony/css-selector dependency, which adds CSS selector support to our Crawler object as well:

printf($tree->filter('a::attr(href)')->text());
// https://scrapfly.io/blog
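
We'll also frequently need to extract multiple matching elements rather than just the first one. DomCrawler's each() method maps a callback over every match; continuing with the same example document:

// extract the text of every element inside the content div:
$texts = $tree->filterXPath('//div[@class="content"]/*')->each(
    function (Crawler $node, $i) {
        return $node->text();
    }
);
var_dump($texts);
// "First blog post", "Just started this blog!", "Checkout My Blog"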

There's much more to DomCrawler than just XPath and CSS selectors, but for web scraping we're mostly interested in these two features. Now that we're familiar with them, let's build a real web scraper!

Putting It All Together: Example Project

It's time to put everything we've learned into an example PHP website scraper. In this section we'll be scraping https://www.producthunt.com/, which is essentially a tech product directory where people submit and discuss new tech products.
We'll write a simple PHP scraper that collects product data and walk through all the parts to solidify our knowledge.

Our scraper should find product URLs (e.g. https://www.producthunt.com/posts/slack) in a product directory (e.g. https://www.producthunt.com/topics/developer-tools) and scrape each product for the following fields: title, subtitle, votes and tags:

producthunt.com parsing illustration

Let's see the full scraping script and then take a look at individual actions/components:

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Psr7\Response;
use Symfony\Component\DomCrawler\Crawler;

// initiate http client
$client = new Client([
    'connect_timeout' => 10,
    'timeout'         => 10.00,
    'http_errors'     => true,
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    ]
]);
// global storage where all results will be added to:
$results = [];

// First we define our main scraping loop:
function scrape($urls, $callback, $errback)
{
    // create Request objects from the given urls:
    $requests = array_map(function ($url) {
        return new Request('GET', $url);
    }, $urls);
    global $client;
    $pool = new Pool($client, $requests, [
        'concurrency' => 5,
        'fulfilled' => $callback,
        'rejected' => $errback,
    ]);
    $pool->promise()->wait();
}


// Then, we define our callbacks:
// 1. This will be called for every product scrape:
function parseProduct(Response $response, $index)
{
    $tree = new Crawler($response->getBody()->getContents());
    $result = [
        // we can use xpath selectors:
        'title' => $tree->filterXpath('//h1')->text(),
        'subtitle' => $tree->filterXpath('//h2')->text(),
        // or css selectors:
        'votes' => $tree->filter("span[class*='bigButtonCount']")->text(),
        // to get multiple elements we need to use each() mapping:
        'tags' => $tree->filterXpath(
            "//div[contains(@class,'topicPriceWrap')]
            //a[contains(@href, '/topics/')]/text()"
        )->each(function ($node, $i) {
            return $node->text();
        }),
    ];
    global $results;
    array_push($results, $result);
}
// 2. This will be called for every directory scrape:
function parseDirectory(Response $response, $index)
{
    $tree = new Crawler($response->getBody()->getContents());
    $urls = $tree->filter("div[class*='item'] a[class*=comments]")->each(
        function ($node, $i) {
            return 'https://www.producthunt.com' . $node->attr('href');
        }
    );
    scrape(
        $urls,
        'parseProduct',
        'logFailure',
    );
}


// 3. This will be called for every failed request be it product or directory:
function logFailure($reason, $index)
{
    printf("failed: %s\n", $reason);
}

// Finally, we can define our scrape logic and run the scraper:
$start_urls = [
    // define urls where to find product urls, like topic directory:
    "https://www.producthunt.com/topics/developer-tools",
];

$_start = microtime(true);
scrape($start_urls, 'parseDirectory', 'logFailure');
printf('scraped %d results in %.2f seconds', count($results), microtime(true) - $_start);
echo "\n";
echo json_encode($results, JSON_PRETTY_PRINT);

This looks pretty lengthy, so let's break it down and take a look at the individual steps we're performing here:

  1. We establish our global Client object which will handle all connections.
  2. Then we define our asynchronous scrape() function, which takes the URLs to scrape and 2 functions (or function names) that will be called when a URL scrape succeeds or fails. This is our abstract scraping executor.
  3. Further, we define our parsing callbacks. When a product scrape succeeds, parseProduct() will be called, which extracts data from the HTML and appends it to the $results storage variable.
  4. We do the same thing with parseDirectory(), which will be called when a directory scrape succeeds and schedules a scrape of all found products.
  5. We also need a common failure handler, which is our logFailure() function. Ideally, in production we'd want to implement some sort of retry functionality or store failures in a database to retry later (see the sketch after this list), but for now let's just log them.
  6. Finally, we finish everything off with a tiny script that executes our logic. We define start_urls, which contains URLs to product directories, and schedule the entire scrape logic.
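
As a rough sketch of the retry idea mentioned in step 5, we could extend the logFailure() defined above to collect failed URLs (Guzzle's transfer exceptions generally carry the original request) and feed them back into scrape() one more time:

use GuzzleHttp\Exception\TransferException;

// collect failed urls instead of only logging them:
$failures = [];
function logFailure($reason, $index)
{
    printf("failed: %s\n", $reason);
    if ($reason instanceof TransferException && method_exists($reason, 'getRequest')) {
        global $failures;
        array_push($failures, (string) $reason->getRequest()->getUri());
    }
}

// after the initial scrape() run, retry failed product pages once
// (directory failures would need their own handling):
if (count($failures) > 0) {
    $retryUrls = $failures;
    $failures = [];
    scrape($retryUrls, 'parseProduct', 'logFailure');
}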

If we run the full scraper script, we should see output something like:

scraped 20 results in 9.25 seconds
[
    {
        "title": "Unsplash 5.0",
        "subtitle": "Free (do whatever you want) high-resolution photos.",
        "votes": "7,003",
        "tags": [
            "Web App",
            "Design Tools",
            "Photography"
        ]
    },
    {
        "title": "Sublime Text 3.0",
        "subtitle": "The long awaited version 3 of the popular code editor",
        "votes": "5,579",
        "tags": [
            "Linux",
            "Windows",
            "Mac"
        ]
    },
    ...

Summary

In this extensive introduction article, we've taken an overview of basic web scraping in PHP. We quickly introduced ourselves to the HTTP protocol and the HTML tree structure. Further, we've taken a look at the two most popular web scraping libraries: Guzzle, which is a modern HTTP client, and DomCrawler, which allows us to parse data out of HTML documents with either XPath or CSS selectors.
Finally, we wrapped everything up with some examples and a small product data scraper for https://www.producthunt.com/.

That's just the beginning of your web scraping journey. We haven't covered a lot of the challenges in web scraping, like access blocking, proxies, dynamic content and many scaling techniques - there's still a lot to learn, but this introduction should be a good starting point!

To wrap this up, we'll take a look at ScrapFly's middleware service, which automatically resolves the issues that are out of scope of this tutorial. ScrapFly can render dynamic content and get around various web scraper blocks with no extra user effort, which makes producing production-level web scrapers a breeze!

ScrapFly API in PHP

ScrapFly offers a middleware service which can solve a lot of web scraping challenges for you. Let's take a quick look at ScrapFly's solutions and how we can apply them in our PHP web scraper:

Javascript Rendering
Since PHP cannot render JavaScript embedded in the HTML body, our web scraper will often see different results compared to our browser. To solve this we either need to reverse engineer the JavaScript behavior in our PHP code or use a browser emulator to render the JavaScript before we start parsing.
ScrapFly offers a javascript rendering service which uses an automated browser to fully render the web page and then pass the contents to our scraper!

Anti Scraping Protection Solutions
While PHP offers great connection tools, they are a bit lacking in the modern web context, so PHP web scrapers can be detected by various anti-bot measures. To web scrape pages protected by captchas or other anti-bot solutions, the ScrapFly middleware also provides an anti scraping protection solution.

Smart Proxies
Some web content can only be accessed in specific countries, meaning our web scraper has to use a proxy connection to access it. ScrapFly provides various proxy options like country selection and smart proxy pools which select the right proxies for outgoing requests to prevent being blocked.

Let's take a quick look at how we can enable the ScrapFly middleware in a PHP web scraper:

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;

$SCRAPFLY_KEY = 'YOUR_API_KEY';

// function that creates a ScrapFly API request for a given url
// see more on:
// https://scrapfly.io/docs/scrape-api/getting-started?language=php
function scrapflyRequest($url, array $config = [])
{
    global $SCRAPFLY_KEY;
    $query = array_merge([
        'key' => $SCRAPFLY_KEY,
        // note: http_build_query() below handles url-encoding of the target url
        'url' => $url,
    ], $config);
    $req = new Request(
        'GET',
        'https://api.scrapfly.io/scrape?' . http_build_query($query)
    );
    var_dump($req);
    return $req;
}

$client = new Client();
$response = $client->send(
    scrapflyRequest('https://www.producthunt.com/posts/slack', [
        // will use browser to render page with javascript: https://scrapfly.io/docs/scrape-api/javascript-rendering
        'render_js' => 'true',
        // select proxy location: https://scrapfly.io/docs/scrape-api/proxy
        'country' => 'us',
        // use custom proxy pools like residential or mobile proxies: https://scrapfly.io/docs/scrape-api/proxy
        // 'proxy_pool' => 'public_residential_proxy',
        // use anti bot bypass: https://scrapfly.io/docs/scrape-api/anti-scraping-protection
        'asp' => 'true',
        // return DNS data: https://scrapfly.io/docs/scrape-api/dns
        'dns' => 'true',
        // return SSL data: https://scrapfly.io/docs/scrape-api/ssl
        'ssl' => 'true',
        // debug request: https://scrapfly.io/docs/scrape-api/debug
        'debug' => 'true',
    ])
);
$data = json_decode($response->getBody()->getContents());
var_dump($data->result->content);

In this example, we're making a simple request to https://www.producthunt.com/posts/slack through ScrapFly with special options like proxy location, javascript rendering and many more! Using ScrapFly allows us to focus on creating web scrapers rather than fighting various connectivity issues and scraper blocks - give it a go!