Web Scraping With PHP 101
Introduction to web scraping with PHP: how to handle HTTP connections, parse HTML files for data, best practices, tips, and an example project.
Guzzle is a popular HTTP client used for web scraping with PHP, and since proxies are an integral part of web scraping, here's a quick introduction on how to use proxies with Guzzle:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
// Proxy pattern is:
// scheme://username:password@IP:PORT
// For example:
// no auth HTTP proxy:
$my_proxy = "http://160.11.12.13:1020";
// proxy with authentication
$my_proxy = "http://my_username:my_password@160.11.12.13:1020";
// Note that the username and password should be URL-encoded if they contain URL-sensitive characters like "@":
$my_proxy = 'http://'.urlencode('foo@bar.com').':'.urlencode('password@123').'@160.11.12.13:1020';
$client = new Client([
    // Base URI is used with relative requests
    'base_uri' => 'https://httpbin.dev',
    // You can set any number of default request options.
    'timeout' => 2.0,
    'proxy' => [
        'http' => $my_proxy,  // This proxy will be applied to all "http" URLs
        'https' => $my_proxy, // This proxy will be applied to all "https" URLs
        'no' => ['.example.com'], // These hosts will bypass the proxy entirely
    ]
]);
$response = $client->request('GET', '/ip');
$body = $response->getBody();
print($body);
Note that Guzzle does not support SOCKS proxies; for SOCKS support the main alternatives are using PHP's curl extension directly or the Buzz HTTP client.
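As a minimal sketch of the curl alternative (the proxy address and credentials below are placeholders), a SOCKS5 proxy could be used like this:
<?php
// Fetch a page through a SOCKS5 proxy with PHP's built-in curl extension.
$ch = curl_init('https://httpbin.dev/ip');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// the socks5:// scheme tells curl to treat this address as a SOCKS5 proxy
curl_setopt($ch, CURLOPT_PROXY, 'socks5://160.11.12.13:1020');
// credentials, if the proxy requires authentication
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'my_username:my_password');
$body = curl_exec($ch);
curl_close($ch);
print($body);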
Note that Guzzle proxy can also be set through the standard *_PROXY
environment variables:
$ export HTTP_PROXY="http://160.11.12.13:1020"
$ export HTTPS_PROXY="http://160.11.12.13:1020"
$ export ALL_PROXY="socks://160.11.12.13:1020"
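With these variables exported in the shell that runs the script, a client created without an explicit 'proxy' option will pick them up as defaults. A minimal sketch:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
// No 'proxy' option here - Guzzle falls back to the HTTP_PROXY/HTTPS_PROXY
// environment variables exported above.
$client = new Client(['base_uri' => 'https://httpbin.dev']);
print($client->request('GET', '/ip')->getBody());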
When web scraping, it's best to rotate proxies for each request. For more details, see our article: How to Rotate Proxies in Web Scraping.
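As a minimal sketch of the idea (the proxy addresses below are placeholders), a proxy can be picked from a small pool and passed as a per-request option, which overrides the client defaults:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
// A small pool of proxies to rotate through; replace with real addresses.
$proxy_pool = [
    'http://160.11.12.13:1020',
    'http://160.11.12.14:1020',
    'http://160.11.12.15:1020',
];
$client = new Client(['base_uri' => 'https://httpbin.dev']);
foreach (range(1, 3) as $i) {
    // pick a random proxy for every request
    $proxy = $proxy_pool[array_rand($proxy_pool)];
    $response = $client->request('GET', '/ip', ['proxy' => $proxy]);
    print($response->getBody() . "\n");
}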
This knowledgebase is provided by Scrapfly data APIs, check us out! 👇