In today’s web-driven world, data is the cornerstone of every major application and decision-making process. Web scraping provides developers with the tools to access and harness that data.
In this article, we’ll explore jsoup, a popular Java library for parsing and scraping web content. Whether you're a beginner or an experienced developer, this guide will provide the foundations and best practices for using jsoup effectively.
What is Jsoup?
At its core, jsoup is a Java library designed for parsing, manipulating, and extracting data from HTML documents. It allows developers to work with web content as if they were using a browser's developer tools. With its intuitive API, jsoup simplifies tasks like data extraction, HTML manipulation, and even cleanup, making it a go-to tool for many Java developers.
Installing Jsoup
Getting started with jsoup is straightforward. Add jsoup as a dependency to your project using a build tool like Maven or Gradle.
To install jsoup using Maven, add the following to your pom.xml file:
The .connect() method returns a Connection instance, which has a .response() method that gives you access to the HTTP response details once the request has been executed. You can use .statusCode() to check the response status and handle errors gracefully:
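// ignoreHttpErrors(true) stops jsoup from throwing on 4xx/5xx so the status can be inspected
Connection connection = Jsoup.connect("https://web-scraping.dev/products").ignoreHttpErrors(true);
connection.execute();
if (connection.response().statusCode() == 200) {
    System.out.println("Success!");
} else {
    System.out.println("Failed to connect.");
}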
By combining these methods, you can mimic browser behavior to scrape data effectively.
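As a minimal sketch of what such a combination might look like, the class below chains several Connection settings before sending the request. The user agent string, header values, and timeout are illustrative assumptions rather than recommended values, and the class and method names are hypothetical.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class BrowserLikeRequest {
    // Build (but do not send) a request that carries browser-like settings.
    // All values below are illustrative assumptions.
    static Connection buildConnection(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
                .header("Accept-Language", "en-US,en;q=0.9")
                .referrer("https://web-scraping.dev/")
                .timeout(10_000)          // milliseconds
                .ignoreHttpErrors(true);  // inspect 4xx/5xx instead of throwing
    }

    public static void main(String[] args) throws Exception {
        Connection connection = buildConnection("https://web-scraping.dev/products");
        // execute() sends the request and returns the raw response
        Connection.Response response = connection.execute();
        if (response.statusCode() == 200) {
            System.out.println("Success: " + response.parse().title());
        } else {
            System.out.println("Failed with status " + response.statusCode());
        }
    }
}
```

Keeping the configuration in one helper method makes it easy to reuse the same settings for every page a scraper visits.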
Jsoup Example Scraper
To illustrate jsoup’s capabilities, let’s scrape product data from the first page of web-scraping.dev/products.
For this example, we'll use Gradle to manage our dependencies. Create a new project and add the following dependencies to your build.gradle.kts file:
Scrape each product URL for product name and price
Collect all results and display them
Our jsoup Java scraper should look something like this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.HashMap;

public class JsoupScraper {

    public static HashMap<String, String> scrapeProduct(String url) throws Exception {
        // Scrape a single product page from web-scraping.dev
        Document doc = Jsoup.connect(url).get();
        HashMap<String, String> productData = new HashMap<>();
        productData.put("title", doc.select("h3").text());
        productData.put("price", doc.select(".product-price").text());
        productData.put("price_full", doc.select(".product-price-full").text());
        productData.put("url", url);
        return productData;
    }

    public static void main(String[] args) throws Exception {
        // Fetch the product directory page
        Document doc = Jsoup.connect("https://web-scraping.dev/products").get();
        // This is where we'll store our results
        ArrayList<HashMap<String, String>> products = new ArrayList<>();

        // Iterate through product elements, find product url and scrape each product
        Elements productElements = doc.select(".products .product");
        for (Element product : productElements) {
            // Get the product URL
            String url = product.select("h3 > a").attr("href");
            System.out.println("Scraping product: " + url);
            // Scrape each product and store result
            HashMap<String, String> productData = scrapeProduct(url);
            products.add(productData);
        }

        // Pretty print the product data
        System.out.println("Product Data:");
        for (HashMap<String, String> product : products) {
            System.out.println(product);
        }
    }
}
Example Output
$ gradle run
> Task :run
Scraping product: https://web-scraping.dev/product/1
Scraping product: https://web-scraping.dev/product/2
Scraping product: https://web-scraping.dev/product/3
Scraping product: https://web-scraping.dev/product/4
Scraping product: https://web-scraping.dev/product/5
Product Data:
{price=$9.99, price_full=$12.99, title=Box of Chocolate Candy Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Red Energy Potion Hiking Boots for Outdoor Adventures Kids' Light-Up Sneakers Blue Energy Potion, url=https://web-scraping.dev/product/1}
{price=$4.99, price_full=, title=Dark Red Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Red Energy Potion Cat-Ear Beanie Running Shoes for Men Classic Leather Sneakers, url=https://web-scraping.dev/product/2}
{price=$4.99, price_full=, title=Teal Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Dragon Energy Potion Women's High Heel Sandals Dark Red Energy Potion Running Shoes for Men, url=https://web-scraping.dev/product/3}
{price=$4.99, price_full=, title=Red Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Women's High Heel Sandals Blue Energy Potion Dark Red Energy Potion Cat-Ear Beanie, url=https://web-scraping.dev/product/4}
{price=$4.99, price_full=, title=Blue Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Women's High Heel Sandals Blue Energy Potion Hiking Boots for Outdoor Adventures Classic Leather Sneakers, url=https://web-scraping.dev/product/5}
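The jsoup scraper above scraped 5 products with their titles and prices. Note that each title carries extra text ("Variants Features ... Similar Products ...") because the h3 selector matches every h3 heading on the product page, and .text() joins them all; a narrower query such as doc.selectFirst("h3") would return only the product name. To break this down further, let's look at each of jsoup's HTML parsing capabilities.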
Parsing HTML with jsoup
Jsoup's Java HTML parser can be used to parse and modify scraped HTML content.
Finding data with CSS Selectors
Jsoup's .select() method takes a CSS selector and returns every matching element in the HTML document. To get only the first match, chain .first() onto the result:
Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
// find all images using css selector for matching elements with "product-img" class
Elements images = doc.select(".product-img");
// print only the first one using first()
System.out.println(images.first());
// prints: <img src="https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.webp" class="img-responsive product-img active">
Selecting attributes and values
Extract attributes and inner text using .text() and .attr().
To get the text content of an HTML element using jsoup, the text() method can be used:
Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
Elements variants = doc.select(".variants .variant");
// text of the first variant
System.out.println(variants.first().text());
// prints: orange, small
// or text of all variants
System.out.println(variants.text());
// prints: orange, small orange, medium orange, large cherry, small cherry, medium cherry, large
To get the value of an HTML attribute set on an element, the attr() method can be used:
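As a small illustration of attr(), the sketch below parses an inline HTML string instead of a live page so it can run offline. The markup, class names, and file paths are hypothetical, loosely modelled on the product image shown earlier.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AttrExample {
    // Hypothetical markup used only for this illustration
    static final String HTML =
            "<div class=\"product\">"
          + "<img class=\"product-img\" src=\"/assets/products/box.webp\" alt=\"Chocolate box\">"
          + "<a class=\"details\" href=\"/product/1\">Details</a>"
          + "</div>";

    static Document parse() {
        // The base URI lets jsoup resolve relative links later on
        return Jsoup.parse(HTML, "https://web-scraping.dev/");
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element img = doc.selectFirst(".product-img");
        System.out.println(img.attr("src"));       // prints: /assets/products/box.webp
        System.out.println(img.attr("alt"));       // prints: Chocolate box
        // attr() returns an empty string when the attribute is missing
        System.out.println(img.hasAttr("title"));  // prints: false
        System.out.println(img.attr("title").isEmpty()); // prints: true

        Element link = doc.selectFirst("a.details");
        // absUrl() resolves a relative attribute value against the base URI
        System.out.println(link.absUrl("href"));   // prints: https://web-scraping.dev/product/1
    }
}
```

Because attr() never returns null, checking hasAttr() first is the safer way to distinguish a missing attribute from an empty one.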
These utilities enhance your ability to manage and present HTML effectively.
Jsoup Limitations
Despite its strengths, jsoup has some limitations that developers should consider:
Lack of HTTP/2 support: Jsoup only supports basic HTTP/1.1 requests. For HTTP/2 and advanced networking capabilities, consider using libraries like OkHttp. OkHttp is a popular HTTP client for Java; check out our comprehensive guide on OkHttp to learn more about its capabilities.
No headless browser functionality: Jsoup doesn’t execute JavaScript, which limits its ability to scrape dynamic web pages. Tools like Selenium or Puppeteer can help in these scenarios.
Detectability: Jsoup’s requests can be easily identified as non-human by websites, making it less ideal for scraping heavily protected content.
For advanced use cases, combining jsoup with tools like OkHttp or Scrapfly can help overcome these challenges.
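One way to combine the two, sketched here under the assumption that OkHttp handles the networking and jsoup only parses the downloaded HTML, is shown below. The class and method names are hypothetical, and the .product-price selector is borrowed from the scraper earlier in this article.

```java
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class OkHttpJsoupExample {
    // Parse already-downloaded HTML; the base URI lets jsoup resolve relative links
    static Document parseHtml(String html, String baseUri) {
        return Jsoup.parse(html, baseUri);
    }

    // Download a page with OkHttp and hand the body to jsoup
    static Document fetch(OkHttpClient client, String url) throws IOException {
        Request request = new Request.Builder().url(url).build();
        try (Response response = client.newCall(request).execute()) {
            if (!response.isSuccessful()) {
                throw new IOException("Unexpected status " + response.code() + " for " + url);
            }
            return parseHtml(response.body().string(), url);
        }
    }

    public static void main(String[] args) throws IOException {
        OkHttpClient client = new OkHttpClient();
        Document doc = fetch(client, "https://web-scraping.dev/product/1");
        System.out.println(doc.select(".product-price").text());
    }
}
```

Keeping the parsing step separate from the download also means the same jsoup code can process HTML obtained from any other source, such as a scraping API response.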
Power Up with Scrapfly
Jsoup is great for small to medium-scale scraping tasks when scraping static pages. However, it falls short when it comes to JavaScript-rendered content and scraper blocking due to IP blocks or bot detection.
Here is a simple example of how you can use OkHttp with Scrapfly's Scraping API.
import okhttp3.HttpUrl;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

import java.io.IOException;

public class OkHttpExample {
    public static void main(String[] args) {
        OkHttpClient client = new OkHttpClient();
        HttpUrl.Builder urlBuilder = HttpUrl.parse("https://api.scrapfly.io/scrape")
                .newBuilder();
        // Required parameters: your API key and URL to scrape
        urlBuilder.addQueryParameter("key", "YOUR_API_KEY");
        urlBuilder.addQueryParameter("url", "https://web-scraping.dev/product/1");
        // Optional parameters:
        // enable anti scraping protection bypass
        urlBuilder.addQueryParameter("asp", "true");
        // use proxies from specific countries
        urlBuilder.addQueryParameter("country", "US,CA,DE");
        // enable headless browser
        urlBuilder.addQueryParameter("render_js", "true");
        // see more on scrapfly docs: https://scrapfly.io/docs/scrape-api/getting-started#spec

        // Build and send the request
        String url = urlBuilder.build().toString();
        Request request = new Request.Builder()
                .url(url)
                .build();
        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful()) {
                System.out.println("Response Body: " + response.body().string());
                System.out.println("Status Code: " + response.code());
            } else {
                System.out.println("Request Failed: " + response.code());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
FAQ
Can jsoup capture website screenshots?
No, jsoup cannot capture website screenshots. For such needs, you’ll require a headless browser driven by a tool like Selenium, or a specialized API. Consider using Scrapfly’s Screenshot API, which simplifies capturing full-page images with minimal setup.
Does jsoup handle JavaScript-rendered content?
No, jsoup cannot execute JavaScript or interact with dynamic content. It works only with static HTML. To scrape JavaScript-rendered pages, you’ll need tools like Selenium or Puppeteer, or services like Scrapfly, which offer JavaScript execution capabilities.
Does jsoup support multi-threaded scraping?
Jsoup itself doesn’t provide built-in multi-threading, but you can use Java’s concurrency utilities (e.g., ExecutorService) to scrape multiple pages simultaneously. Just ensure you manage thread safety and network limits to avoid being blocked by the target website.
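A minimal sketch of this approach, assuming a fixed-size thread pool and a caller-supplied fetch function, could look like the following. The class name, method name, and stand-in fetcher are hypothetical; the fetch function is kept generic so the example runs without network access.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ConcurrentScraper {
    // Runs `fetch` for every URL on a fixed-size pool and returns results in input order.
    static Map<String, String> scrapeAll(List<String> urls, Function<String, String> fetch, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all tasks first so they run concurrently
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetch.apply(url)));
            }
            // Then collect the results, blocking on each future in turn
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for " + urls.get(i), e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("Scraping failed for " + urls.get(i), e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "https://web-scraping.dev/product/1",
                "https://web-scraping.dev/product/2",
                "https://web-scraping.dev/product/3");
        // Stand-in fetcher; a real scraper would download and parse the page here
        Map<String, String> results = scrapeAll(urls, url -> "fetched " + url, 2);
        results.forEach((url, result) -> System.out.println(url + " -> " + result));
    }
}
```

With jsoup, the fetch function could wrap a call such as Jsoup.connect(url).get().title() and rethrow IOException as an unchecked exception, since Function cannot throw checked exceptions. The pool size acts as a simple cap on concurrent requests to the target website.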
Summary
Jsoup is a versatile and lightweight library for scraping and parsing web content. It excels at handling static HTML and provides utilities for cleaning, prettifying, and manipulating content. While it has limitations, combining it with tools like OkHttp or Scrapfly unlocks advanced capabilities, making it a powerful addition to any web scraping toolkit.
Whether you’re building a basic scraper or a robust data pipeline, jsoup provides the flexibility and functionality to get started quickly. Experiment with its features and extend its capabilities with complementary tools to suit your needs. Happy scraping!