Web Scraping and HTML Parsing with Jsoup and Java

In today’s web-driven world, data is the cornerstone of every major application and decision-making process. Web scraping provides developers with the tools to access and harness that data.

In this article, we’ll explore jsoup, a popular Java library for parsing and scraping web content. Whether you're a beginner or an experienced developer, this guide will provide the foundations and best practices for using jsoup effectively.

What is Jsoup?

At its core, jsoup is a Java library designed for parsing, manipulating, and extracting data from HTML documents. It allows developers to work with web content as if they were using a browser's developer tools. With its intuitive API, jsoup simplifies tasks like data extraction, HTML manipulation, and even cleanup, making it a go-to tool for many Java developers.
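To make the parse-then-query workflow concrete, here is a minimal sketch that parses an inline HTML string instead of a live page. The markup and the class name ParseStringExample are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseStringExample {

    // Hypothetical markup used only for this illustration
    static final String HTML =
            "<html><head><title>Demo Shop</title></head>"
          + "<body><h1 class=\"name\">Blue Mug</h1>"
          + "<a href=\"/mug\">details</a></body></html>";

    // Parse the HTML and return the text of the first <h1>, or "" if there is none
    static String headingText(String html) {
        Document doc = Jsoup.parse(html);
        Element heading = doc.selectFirst("h1");
        return heading == null ? "" : heading.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.title());                        // text of <title>
        System.out.println(headingText(HTML));                  // text of the first <h1>
        System.out.println(doc.selectFirst("a").attr("href"));  // raw href attribute
    }
}
```

The same Document API is used whether the HTML comes from a string, a file, or a network request.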

Installing Jsoup

Getting started with jsoup is straightforward. Add jsoup as a dependency to your project using a build tool like Maven or Gradle.

To install jsoup using Maven, add the following to your pom.xml file:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.16.1</version>
</dependency>

To install jsoup using Gradle, add the following to the dependencies block of your build.gradle.kts file:

implementation("org.jsoup:jsoup:1.16.1")

Once installed, you’re ready to explore jsoup’s powerful scraping and parsing capabilities.

Scraping with Jsoup

Jsoup doesn't only parse HTML; it also provides a simple and robust way to connect to web pages and fetch their HTML directly.

You can use the Jsoup.connect() method to request a page, then call .get() to download and parse its HTML in one step:

Document doc = Jsoup.connect("https://example.com").get();

The .connect() method also allows you to customize requests by specifying headers, cookies, and HTTP methods:

Connection connection = Jsoup.connect("https://example.com")
                            .method(Connection.Method.GET)
                            .userAgent("Mozilla/5.0")
                            .header("Authorization", "Bearer token")
                            .cookie("session_id", "abc123");

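The .connect() method returns a Connection instance. Nothing is sent until you call .execute() (or .get()/.post()); .execute() returns a Connection.Response whose .statusCode() lets you check the response status. By default jsoup throws an HttpStatusException for non-2xx responses, so call .ignoreHttpErrors(true) if you want to inspect the status code yourself and handle errors gracefully:

Connection.Response response = connection.ignoreHttpErrors(true).execute();
if (response.statusCode() == 200) {
  System.out.println("Success!");
} else {
  System.out.println("Failed to connect: " + response.statusCode());
}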

By combining these methods, you can mimic browser behavior to scrape data effectively.
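As a rough sketch of how these pieces might fit together, the helper below configures a connection and turns HTTP and network failures into an empty result instead of an uncaught exception. The class name SafeFetchExample, the 10-second timeout, and the user agent string are assumptions chosen for illustration, not recommended values.

```java
import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.Optional;

public class SafeFetchExample {

    // Build a configured connection; no request is sent at this point
    static Connection buildConnection(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")   // placeholder user agent
                .timeout(10_000)            // milliseconds, an arbitrary choice
                .followRedirects(true);
    }

    // Fetch and parse a page, returning empty on HTTP or network errors
    static Optional<Document> fetch(String url) {
        try {
            return Optional.of(buildConnection(url).get());
        } catch (HttpStatusException e) {
            // Thrown by default when the server answers with a non-2xx status
            System.err.println("HTTP " + e.getStatusCode() + " for " + e.getUrl());
        } catch (IOException e) {
            // Timeouts, DNS failures, unsupported content types, and similar
            System.err.println("Request failed: " + e.getMessage());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        fetch("https://web-scraping.dev/products")
                .ifPresent(doc -> System.out.println(doc.title()));
    }
}
```

Returning an Optional keeps the calling code free of try/catch blocks; a real scraper may prefer to retry or log failures differently.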

Jsoup Example Scraper

To illustrate jsoup’s capabilities, let’s scrape product data from the first page of web-scraping.dev/products.

For this example, we'll use Gradle to manage our dependencies. Create a new project and add the following to your build.gradle.kts file:

plugins {
    id("java")
    id("application")
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("org.jsoup:jsoup:1.16.1")
}

application {
    mainClass.set("JsoupScraper")
}

Then let's create a small scraper in the src/main/java/JsoupScraper.java file that will:

  • Scrape web-scraping.dev/products page and find all product URLs
  • Scrape each product URL for product name and price
  • Collect all results and display them

Our jsoup Java scraper should look something like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.HashMap;

public class JsoupScraper {

    public static HashMap<String, String> scrapeProduct(String url) throws Exception {
        // Scrape a single product page from web-scraping.dev
        Document doc = Jsoup.connect(url).get();
        HashMap<String, String> productData = new HashMap<>();

        // selectFirst() takes only the first <h3>, which is the product title;
        // select("h3").text() would join the text of every <h3> on the page
        productData.put("title", doc.selectFirst("h3").text());
        productData.put("price", doc.select(".product-price").text());
        productData.put("price_full", doc.select(".product-price-full").text());
        productData.put("url", url);

        return productData;
    }

    public static void main(String[] args) throws Exception {
        // Fetch the product directory page
        Document doc = Jsoup.connect("https://web-scraping.dev/products").get();

        // This is where we'll store our results
        ArrayList<HashMap<String, String>> products = new ArrayList<>();

        // Iterate through product elements, find product url and scrape each product
        Elements productElements = doc.select(".products .product");
        for (Element product : productElements) {
            // Get the product URL
            String url = product.select("h3 > a").attr("href");
            System.out.println("Scraping product: " + url);
            // Scrape each product and store result
            HashMap<String, String> productData = scrapeProduct(url);
            products.add(productData);
        }

        // Print the collected product data
        System.out.println("Product Data:");
        for (HashMap<String, String> product : products) {
            System.out.println(product);
        }
    }
}

Example Output

$ gradle run

> Task :run
Scraping product: https://web-scraping.dev/product/1
Scraping product: https://web-scraping.dev/product/2
Scraping product: https://web-scraping.dev/product/3
Scraping product: https://web-scraping.dev/product/4
Scraping product: https://web-scraping.dev/product/5
Product Data:
{price=$9.99, price_full=$12.99, title=Box of Chocolate Candy, url=https://web-scraping.dev/product/1}
{price=$4.99, price_full=, title=Dark Red Energy Potion, url=https://web-scraping.dev/product/2}
{price=$4.99, price_full=, title=Teal Energy Potion, url=https://web-scraping.dev/product/3}
{price=$4.99, price_full=, title=Red Energy Potion, url=https://web-scraping.dev/product/4}
{price=$4.99, price_full=, title=Blue Energy Potion, url=https://web-scraping.dev/product/5}

Above, our jsoup scraper collected 5 products with their titles and prices. To break this down further, let's take a look at each of jsoup's HTML parsing capabilities.

Parsing HTML with jsoup

Jsoup's Java HTML parser can be used to parse and modify scraped HTML content.

Finding data with CSS Selectors

Jsoup's .select() method takes a CSS selector and returns all matching elements in the HTML document. To get only the first match, call .first() on the result:

Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
// find all images using css selector for matching elements with "product-img" class
Elements images = doc.select(".product-img");
// print only the first one using first()
System.out.println(images.first());
// prints: <img src="https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.webp" class="img-responsive product-img active">
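For a few more selector patterns, here is a small offline sketch. The markup is hypothetical and only loosely modeled on a product listing; the class and method names are ours.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;

public class SelectorExample {

    // Hypothetical markup used only for this illustration
    static final String HTML =
            "<div class=\"products\">"
          + "<div class=\"product\"><h3><a href=\"/product/1\">Candy</a></h3>"
          + "<span class=\"price\">$9.99</span></div>"
          + "<div class=\"product\"><h3><a href=\"/product/2\">Potion</a></h3>"
          + "<span class=\"price\">$4.99</span></div>"
          + "</div>";

    // Descendant selector plus child combinator: links directly inside a product's <h3>
    static List<String> productNames(String html) {
        return Jsoup.parse(html).select(".product h3 > a").eachText();
    }

    // Attribute selector: the link whose href ends with the given suffix, or null
    static Element linkEndingWith(String html, String suffix) {
        return Jsoup.parse(html).selectFirst("a[href$=" + suffix + "]");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.select(".products .product").size()); // number of products
        System.out.println(productNames(HTML));
        System.out.println(linkEndingWith(HTML, "2").text());
        // selectFirst() returns null when nothing matches, so check before using it
        System.out.println(doc.selectFirst(".does-not-exist") == null);
    }
}
```

Note that selectFirst() is a shortcut for select(...).first() and, like it, yields null when there is no match.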

Selecting attributes and values

Use .text() to extract an element's inner text and .attr() to read its attributes.

To get the text content of an html element using jsoup, the text() method can be used:

Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
Elements variants = doc.select(".variants .variant");
// text of the first variant
System.out.println(variants.first().text());
// prints: orange, small

// or text of all variants
System.out.println(variants.text());
// prints: orange, small orange, medium orange, large cherry, small cherry, medium cherry, large

To get the value of an html attribute set on an element, the attr() method can be used:

Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
Elements images = doc.select(".product-img");
System.out.println(images.first().attr("src"));
// prints: https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.webp
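Two related helpers are worth knowing here: eachText() and eachAttr() return one value per matched element instead of a single joined string, and the abs: prefix (or absUrl()) resolves relative links against the document's base URL. The sketch below assumes made-up markup and passes the base URL explicitly, since the HTML is parsed from a string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

public class AttrTextExample {

    // Hypothetical markup with relative links
    static final String HTML =
            "<div class=\"product\"><a href=\"/product/1\">Candy</a></div>"
          + "<div class=\"product\"><a href=\"/product/2\">Potion</a></div>";

    // Base URL used to resolve relative links (set automatically by Jsoup.connect())
    static final String BASE_URL = "https://web-scraping.dev/products";

    // One text value per matched element
    static List<String> names(Document doc) {
        return doc.select(".product a").eachText();
    }

    // One absolute URL per matched element, thanks to the "abs:" prefix
    static List<String> links(Document doc) {
        return doc.select(".product a").eachAttr("abs:href");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML, BASE_URL);
        System.out.println(names(doc));
        System.out.println(links(doc));
        // The same resolution for a single element
        System.out.println(doc.selectFirst("a").absUrl("href"));
    }
}
```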

Changing the DOM

Jsoup also allows modifying the DOM using .text() and .attr().

When passed a string argument, the .text() method replaces the inner text of the HTML element.

doc.select("h1").first().text("Updated Title");

The .attr() method also accepts a second string argument, which is set as the attribute's value in the HTML.

doc.select("img").first().attr("src", "new-image.jpg");

This versatility lets you work with HTML dynamically, much like in a browser.
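Here is a small sketch that combines several edits on one document and prints the result. The input markup, the "lead" class, and the footer text are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifyDomExample {

    // Hypothetical markup used only for this illustration
    static final String HTML =
            "<h1>Old Title</h1><img src=\"old.jpg\"><p>Intro</p>"
          + "<div class=\"ad\">Buy now</div>";

    // Apply a handful of DOM edits and return the resulting body HTML
    static String rewrite(String html) {
        Document doc = Jsoup.parse(html);
        doc.selectFirst("h1").text("Updated Title");         // replace inner text
        doc.select("img").attr("src", "new-image.jpg");      // set attribute on every match
        doc.select("p").addClass("lead");                    // add a CSS class
        doc.select(".ad").remove();                          // drop unwanted elements
        doc.body().appendElement("footer").text("The end");  // append a new element
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(rewrite(HTML));
    }
}
```

All changes happen on the in-memory Document; nothing is sent back to the website.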

Jsoup Utilities

Jsoup comes equipped with handy utilities to simplify common HTML tasks.

Cleanup HTML

Use Jsoup.clean() to sanitize HTML, removing unsafe tags and attributes:

String cleanHtml = Jsoup.clean("<script>alert(1)</script><p>Safe content</p>", Safelist.basic());
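To see what the sanitizer keeps and drops, the sketch below runs a slightly dirtier input through two safelists and also checks validity with Jsoup.isValid(). The input string and class name are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanExample {

    // Hypothetical untrusted input: a script tag and an inline event handler
    static final String DIRTY =
            "<script>alert(1)</script><p onclick=\"steal()\">Safe <b>content</b></p>";

    // Keep basic text formatting tags, drop everything else
    static String sanitize(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    public static void main(String[] args) {
        // <script> and onclick are removed; <p> and <b> survive
        System.out.println(sanitize(DIRTY));
        // Safelist.none() strips all tags and keeps only the text
        System.out.println(Jsoup.clean(DIRTY, Safelist.none()));
        // isValid() reports whether the input already conforms to the safelist
        System.out.println(Jsoup.isValid(DIRTY, Safelist.basic()));
    }
}
```

Safelist.basic() is only one of the presets; which one fits depends on how much markup you want to allow through.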

Prettify HTML

Pretty-printing is enabled by default, and you can toggle it through the document's output settings to format raw HTML for readability:

doc.outputSettings().prettyPrint(true);
System.out.println(doc.html());
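For comparison, this sketch renders the same fragment with pretty-printing on and off. The fragment and the indent of 2 spaces are arbitrary choices for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PrettyPrintExample {

    // Hypothetical fragment used only for this illustration
    static final String HTML = "<div><p>One</p><p>Two</p></div>";

    // Render the body with pretty-printing switched on or off
    static String render(String html, boolean pretty) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings()
                .prettyPrint(pretty)
                .indentAmount(2);   // spaces per nesting level when pretty-printing
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println("Pretty:");
        System.out.println(render(HTML, true));
        System.out.println("Compact:");
        System.out.println(render(HTML, false));
    }
}
```

Turning pretty-printing off is useful when you need the output on a single line, for example when storing HTML in a log or a CSV field.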

Escape and Unescape HTML

Handle special characters with Entities.escape() and Entities.unescape():

String escaped = Entities.escape("<div>Content</div>");
String unescaped = Entities.unescape("&lt;div&gt;Content&lt;/div&gt;");
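A quick way to convince yourself the two calls are inverses is a round trip, sketched below with a made-up input string.

```java
import org.jsoup.nodes.Entities;

public class EscapeExample {

    // Escape and then unescape; the result should match the input
    static String roundTrip(String raw) {
        return Entities.unescape(Entities.escape(raw));
    }

    public static void main(String[] args) {
        String raw = "Tom & Jerry <b>bold</b>";  // hypothetical input
        String escaped = Entities.escape(raw);
        System.out.println(escaped);              // angle brackets and & become entities
        System.out.println(roundTrip(raw).equals(raw));
    }
}
```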

These utilities enhance your ability to manage and present HTML effectively.

Jsoup Limitations

Despite its strengths, jsoup has some limitations that developers should consider:

  • Lack of HTTP/2 support: jsoup's built-in connection (as of the 1.16.1 release used in this article) only supports basic HTTP/1.1 requests. For HTTP/2 and advanced networking capabilities, consider using a library like OkHttp, a popular HTTP client for Java. Check out our comprehensive guide on OkHttp to learn more about its capabilities.
  • No headless browser functionality: Jsoup doesn’t execute JavaScript, which limits its ability to scrape dynamic web pages. Tools like Selenium or Puppeteer can help in these scenarios.
  • Detectability: Jsoup’s requests can be easily identified as non-human by websites, making it less ideal for scraping heavily protected content.

For advanced use cases, combining jsoup with tools like OkHttp or Scrapfly can help overcome these challenges.
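One way to combine jsoup with a separate HTTP client is to let the client download the page and hand the HTML to Jsoup.parse(). The sketch below uses the JDK's built-in java.net.http.HttpClient (Java 11+) as a stand-in for OkHttp so it needs no extra dependency; the class name, timeout, and user agent are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class Http2FetchExample {

    // Parsing step: jsoup only needs the HTML text and a base URL for relative links
    static Document parse(String html, String baseUrl) {
        return Jsoup.parse(html, baseUrl);
    }

    // Download with the JDK client, then parse with jsoup
    static Document fetch(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0")   // placeholder user agent
                .timeout(Duration.ofSeconds(10))       // arbitrary timeout
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Protocol: " + response.version());
        return parse(response.body(), url);
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)    // prefer HTTP/2, fall back if unsupported
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        Document doc = fetch(client, "https://web-scraping.dev/product/1");
        System.out.println(doc.title());
    }
}
```

The same split works with OkHttp: replace the body of fetch() with an OkHttp call and keep parse() unchanged.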

Power Up with Scrapfly

Jsoup is great for small to medium-scale scraping tasks when scraping static pages. However, it falls short when it comes to JavaScript-rendered content and scraper blocking due to IP blocks or bot detection.

Scrapfly provides web scraping, screenshot, and extraction APIs for data collection at scale.

Here is a simple example of how you can use OkHttp with Scrapfly's Scraping API.

import okhttp3.HttpUrl;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

import java.io.IOException;

public class OkHttpExample {
    public static void main(String[] args) {
        OkHttpClient client = new OkHttpClient();
        HttpUrl.Builder urlBuilder = HttpUrl.parse("https://api.scrapfly.io/scrape")
                .newBuilder();
        // Required parameters: your API key and URL to scrape
        urlBuilder.addQueryParameter("key", "YOUR_API_KEY");
        urlBuilder.addQueryParameter("url", "https://web-scraping.dev/product/1");
        // Optional parameters:
        // enable anti scraping protection bypass
        urlBuilder.addQueryParameter("asp", "true");
        // use proxies from specific countries
        urlBuilder.addQueryParameter("country", "US,CA,DE");
        // enable headless browser
        urlBuilder.addQueryParameter("render_js", "true");
        // see more on scrapfly docs: https://scrapfly.io/docs/scrape-api/getting-started#spec

        // Build and send the request
        String url = urlBuilder.build().toString();
        Request request = new Request.Builder()
                .url(url)
                .build();
        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful()) {
                System.out.println("Response Body: " + response.body().string());
                System.out.println("Status Code: " + response.code());
            } else {
                System.out.println("Request Failed: " + response.code());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

FAQ

Can jsoup capture website screenshots?

No, jsoup cannot capture website screenshots. For such needs, you’ll require a headless browser like Selenium or a specialized API. Consider using Scrapfly’s Screenshot API, which simplifies capturing full-page images with minimal setup.

Does jsoup handle JavaScript-rendered content?

No, jsoup cannot execute JavaScript or interact with dynamic content. It works only with static HTML. To scrape JavaScript-rendered pages, you’ll need tools like Selenium or Puppeteer, or services like Scrapfly, which offer JavaScript execution capabilities.

Does jsoup support multi-threaded scraping?

Jsoup itself doesn’t provide built-in multi-threading, but you can use Java’s concurrency utilities (e.g., ExecutorService) to scrape multiple pages simultaneously. Just ensure you manage thread safety and network limits to avoid being blocked by the target website.
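A minimal sketch of this approach with a fixed thread pool is shown below. The PageFetcher interface is a hypothetical abstraction introduced here so the fetching step can be swapped out (for testing or for a different HTTP client); the pool size of 3 is an arbitrary choice.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentScrapeExample {

    // Hypothetical abstraction over "download and parse one page"
    interface PageFetcher {
        Document fetch(String url) throws IOException;
    }

    // Scrape the <title> of every URL using a bounded pool of worker threads
    static Map<String, String> scrapeTitles(List<String> urls, PageFetcher fetcher, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.fetch(url).title();
                futures.add(pool.submit(task));
            }
            // Collect results in the original URL order
            Map<String, String> titles = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                titles.put(urls.get(i), futures.get(i).get());
            }
            return titles;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
                "https://web-scraping.dev/product/1",
                "https://web-scraping.dev/product/2",
                "https://web-scraping.dev/product/3");
        Map<String, String> titles =
                scrapeTitles(urls, url -> Jsoup.connect(url).get(), 3);
        titles.forEach((url, title) -> System.out.println(url + " -> " + title));
    }
}
```

Each task creates its own Connection and Document, so no jsoup objects are shared between threads; the pool size doubles as a simple cap on concurrent requests.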

Summary

Jsoup is a versatile and lightweight library for scraping and parsing web content. It excels at handling static HTML and provides utilities for cleaning, prettifying, and manipulating content. While it has limitations, combining it with tools like OkHttp or Scrapfly unlocks advanced capabilities, making it a powerful addition to any web scraping toolkit.

Whether you’re building a basic scraper or a robust data pipeline, jsoup provides the flexibility and functionality to get started quickly. Experiment with its features and extend its capabilities with complementary tools to suit your needs. Happy scraping!
