Scraping Tools and Languages
Web scraping can be done in almost any programming language, and we cover hands-on introductions to the most popular ones:
Python [recommended]
The most popular and accessible language for web scraping. The best overall choice, with many great libraries and built-in tools.
Typescript [recommended]
Second most popular choice for web scraping. Strong web dev ecosystem but weaker web scraping packages.
PHP
A classic web backend language that has all the right tools for scraping, but it lacks data tooling and can be difficult to work with.
R
Popular choice for statisticians and data scientists. Strong data processing ecosystem but weaker web scraping packages.
Ruby
In many ways similar to Python with strong web development ecosystem but weaker web scraping packages.
While web scraping can be done in almost any language, not every language is equally fit for this diverse niche. Primarily, existing library support matters: a modern HTTP client, a browser control client, and HTML and JSON parsers are all important for successful scraping. For this, Python and Javascript (Typescript) are generally considered the best overall options.
Scrapfly's Python and Typescript SDKs come with batteries included and handle all the hard parts for you!
Web Scraping Libraries
Web scraping covers a variety of niches based on the scraped target, though we can divide web scraping libraries into a few primary categories:
HTTP Clients
HTTP clients are used to make HTTP requests to the target website. They are the most basic building block of web scraping and are used to fetch the HTML of the target page.
HTTP clients are also used to communicate with the Scrapfly API, though Scrapfly uses its own special HTTP client that is optimized for scraping.
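For example, fetching a page with Python's httpx library might look like this (a minimal sketch; the URL and header values are placeholders):

```python
# A minimal sketch using the httpx library (any HTTP client works similarly).
# The target URL and User-Agent are placeholders for illustration.
import httpx

response = httpx.get(
    "https://example.com/",
    headers={"User-Agent": "my-scraper/1.0"},  # identify the client
    follow_redirects=True,
)
response.raise_for_status()  # fail early on 4xx/5xx responses
html = response.text  # raw HTML, ready for parsing
print(html[:200])
```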
Browser Control Clients
As an alternative to HTTP clients, browser automation clients allow scrapers to control real web browsers. This is useful for scraping dynamic web pages that require Javascript to display the desired data.
It's not without its downsides, however: browsers are extremely resource-intensive and complex, which makes them difficult to manage at scale.
There are multiple browser automation tools available, though these three are the most popular in web scraping (a short Playwright sketch follows the list):
Playwright
The most modern and feature-rich option, available in many languages.
Selenium
Classic choice with the biggest web scraping community.
Puppeteer
The predecessor to Playwright. Only available in NodeJS, with a sizable scraping community around it.
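As a quick illustration, here is a minimal Playwright sketch in Python that renders a page and grabs the resulting HTML (the URL is a placeholder):

```python
# A minimal Playwright sketch (sync API) that loads a dynamic page
# and retrieves its fully rendered HTML. The URL is a placeholder.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/")
    page.wait_for_load_state("networkidle")  # wait for JS-driven content
    html = page.content()  # rendered HTML, ready for parsing
    browser.close()
```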
Scrapfly's Javascript Rendering and Javascript Scenario features are the next evolution of browser automation.
HTML Parsers
HTML parsers are used to parse scraped HTML and extract the desired data. Parsing is not only used to process results but also for parts of the web scraping logic itself, like finding page links to follow when crawling or indicating which elements the scraper should interact with when using interaction features (like clicking buttons).
There are two primary technologies used for HTML parsing, each with its own query language (compared in the sketch after this list):
CSS Selectors
Designed for selecting HTML elements for styling, but can also be used to select elements for scraping and web automation.
XPath
A more powerful alternative to CSS Selectors, often used where more complex data selection is needed.
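To illustrate the difference, here's a small sketch using the parsel library (the selector engine behind Scrapy); the HTML snippet is made up for the example:

```python
# A small sketch comparing CSS selectors and XPath using the parsel
# library. The HTML snippet is made up for illustration.
from parsel import Selector

html = '<div class="product"><a href="/item/1">Widget</a><span class="price">$9.99</span></div>'
sel = Selector(text=html)

# CSS selectors: concise element selection
print(sel.css(".product a::text").get())        # Widget
print(sel.css(".product a::attr(href)").get())  # /item/1

# XPath: the same selections, plus more complex expressions when needed
print(sel.xpath('//div[@class="product"]/a/text()').get())          # Widget
print(sel.xpath('//span[contains(@class, "price")]/text()').get())  # $9.99
```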
Alternatively, many HTML parsing clients implement native methods and functions that perform very similarly to XPath or CSS Selectors:
Introduction to BeautifulSoup
The most popular HTML parsing library for Python.
It has its own native methods like find and find_all, as well as CSS selector support.
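For instance, a quick BeautifulSoup sketch (the HTML snippet is made up):

```python
# A quick BeautifulSoup sketch showing find, find_all and CSS
# selector support. The HTML snippet is made up for illustration.
from bs4 import BeautifulSoup

html = "<div><p class='intro'>Hello</p><p>World</p></div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p").text)                   # Hello (first match)
print([p.text for p in soup.find_all("p")])  # ['Hello', 'World']
print(soup.select_one("p.intro").text)       # Hello (CSS selector)
```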
Scrapfly Python and Typescript SDKs include access to both XPath and CSS selectors through the scrape_response.selector property.
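As a rough sketch, assuming the Python SDK's ScrapflyClient and ScrapeConfig interface (check the SDK docs for the exact signatures):

```python
# A sketch assuming the Python SDK's ScrapflyClient/ScrapeConfig
# interface; the API key and URL are placeholders.
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR_API_KEY")
result = client.scrape(ScrapeConfig(url="https://example.com/"))

# .selector exposes both CSS and XPath querying on the scraped HTML
print(result.selector.css("h1::text").get())
print(result.selector.xpath("//h1/text()").get())
```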
JSON Parsers
JSON is becoming an increasingly popular web data format, and many modern scraped pages contain JSON data that is used to render the page. Often these scraped datasets are big and complex, requiring a dedicated JSON parser to extract the desired data.
There are many ways to process JSON, but tools that mirror HTML parsing techniques like CSS Selectors and XPath are the most popular in web scraping:
JMESPath
With client support for almost every programming language, JMESPath can reshape, clean up and parse JSON (a short sketch follows below).
JSONPath
Inspired by XPath, JSONPath mirrors many of the same features but for JSON instead of HTML.
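As a quick illustration, here's a JMESPath sketch in Python that filters and reshapes a made-up dataset:

```python
# A small JMESPath sketch reshaping a made-up JSON dataset.
import jmespath

data = {
    "products": [
        {"name": "Widget", "price": 9.99, "stock": 3},
        {"name": "Gadget", "price": 19.99, "stock": 0},
    ]
}

# select in-stock products and reshape them into flat objects
query = "products[?stock > `0`].{title: name, cost: price}"
print(jmespath.search(query, data))  # [{'title': 'Widget', 'cost': 9.99}]
```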
Utility
There are too many powerful utility libraries that benefit web scrapers to list them all, but here are some categories to keep an eye out for and some examples:
- URL formatting — creating, mixing and modifying URLs can get surprisingly complex (see the urllib.parse sketch after this list).
- Regular Expression helpers and extensions — regex is a powerful tool for text processing but can be difficult to use.
- Data Parsing utilities — there's a lot of free form data in web scraping which can be difficult to navigate without proper tooling.
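For example, Python's built-in urllib.parse already covers most URL formatting needs (a minimal sketch):

```python
# A quick sketch of URL handling with Python's built-in urllib.parse.
from urllib.parse import urljoin, urlencode, urlparse

# resolve a relative link found while crawling against the page URL
print(urljoin("https://example.com/shop/", "../item?id=1"))
# https://example.com/item?id=1

# build a query string safely (handles encoding of special characters)
print("https://example.com/search?" + urlencode({"q": "web scraping", "page": 2}))
# https://example.com/search?q=web+scraping&page=2

# inspect URL components
print(urlparse("https://example.com/item?id=1").path)  # /item
```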
Scraping Frameworks
While web scraping frameworks are becoming less popular as the modern web grows more complex, they can still be a good choice for many web scraping projects.
Scrapy is by far the most popular framework for web scraping at scale (a minimal spider sketch follows the list), though here are some things to look out for when evaluating a framework:
- Modern HTTP client — http2 support, async support, proxy support, etc.
- Active development — web scraping is a rapidly evolving subject.
- Concurrency — as web scraping is an IO-bound task, concurrency is a must for scaling any web scraping project.
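As a rough illustration, a minimal Scrapy spider might look like this (quotes.toscrape.com is a public scraping sandbox, used here as an example target):

```python
# A minimal Scrapy spider sketch that scrapes quotes and follows
# pagination links. quotes.toscrape.com is a public scraping sandbox.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow the pagination link to crawl the whole site
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

A standalone spider like this can be run with scrapy runspider, without scaffolding a full Scrapy project.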
Scrapfly SDK includes a Scrapy integration, powering up basic Scrapy spiders with all of the Scrapfly functionality!
While there's a lot of research to do when choosing the right web scraping environment, we can pick it up as we go. Further on we'll be using Python and Typescript for our examples, but before that, let's take a look at some common web scraping terms.