# Web Scraping and HTML Parsing with Jsoup and Java

 by [Mostafa](https://scrapfly.io/blog/author/mostafa) Dec 11, 2024 10 min read [\#data-parsing](https://scrapfly.io/blog/tag/data-parsing) [\#java](https://scrapfly.io/blog/tag/java) 


In today’s web-driven world, data is the cornerstone of every major application and decision-making process. Web scraping provides developers with the tools to access and harness that data.

In this article, we'll explore [jsoup](https://jsoup.org/), a popular Java library for parsing and scraping web content. Whether you're a beginner or an experienced developer, this guide will provide the foundations and best practices for using jsoup effectively.

## Key Takeaways

Learn Java web scraping with the jsoup library: HTML parsing, data extraction, and web content manipulation in robust Java development workflows.

- Use jsoup for efficient HTML parsing and data extraction from web pages with Java using CSS selectors
- Configure HTTP connections and authentication for secure web scraping operations with custom headers and cookies
- Implement error handling and retry logic for reliable web scraping workflows with status code checking
- Use specialized tools like ScrapFly for automated Java web scraping with anti-blocking features and proxy rotation
- Implement proper data validation and cleaning for structured data extraction workflows with HTML sanitization
- Configure Maven and Gradle dependencies for jsoup integration in Java projects with proper version management


## What is Jsoup?

At its core, jsoup is a Java library designed for **parsing**, **manipulating**, and **extracting** data from **HTML documents**. It allows developers to work with web content as if they were using a browser's developer tools. With its intuitive API, jsoup simplifies tasks like data extraction, HTML manipulation, and even cleanup, making it a go-to tool for many Java developers.
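To get a feel for the API before fetching anything over the network, here's a minimal sketch that parses a hand-written HTML string (the markup and class name below are purely illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseExample {
    public static void main(String[] args) {
        // Parse raw HTML into a Document; no network request is involved
        String html = "<html><body><h1>Hello, jsoup!</h1><a href='/about'>About</a></body></html>";
        Document doc = Jsoup.parse(html);

        // Query the document with CSS selectors, much like browser dev tools
        System.out.println(doc.select("h1").text());      // Hello, jsoup!
        System.out.println(doc.select("a").attr("href")); // /about
    }
}
```

The same `Document` API applies whether the HTML comes from a string, a local file, or a live HTTP response.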

## Installing Jsoup

Getting started with jsoup is straightforward. Add jsoup as a dependency to your project using a build tool like **Maven** or **Gradle**:

To install jsoup using Maven, add the following to your `pom.xml` file:

```xml
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.16.1</version>
</dependency>
```



To install jsoup using Gradle, add the following to the dependencies block of your `build.gradle.kts` file:

```kotlin
implementation("org.jsoup:jsoup:1.16.1")
```



Once installed, you’re ready to explore jsoup’s powerful scraping and parsing capabilities.

## Scraping with Jsoup

Jsoup doesn't just parse HTML; it also provides a simple and robust way to connect to web pages and fetch their HTML directly.

You can use the `.connect()` method that Jsoup provides to get the HTML content of a website and parse it to extract data from it.

```java
Document doc = Jsoup.connect("https://example.com").get();
```



The `.connect()` method also allows you to customize requests by specifying headers, cookies, and HTTP methods:

```java
Connection connection = Jsoup.connect("https://example.com")
                            .method(Connection.Method.GET)
                            .userAgent("Mozilla/5.0")
                            .header("Authorization", "Bearer token")
                            .cookie("session_id", "abc123");
```



The `.connect()` method returns a `Connection` instance. The request must first be executed with `.execute()`, which returns a `Connection.Response` holding the HTTP response details. You can use `.statusCode()` to check the response status and handle errors gracefully. Note that by default jsoup throws an exception on HTTP error statuses, so call `.ignoreHttpErrors(true)` if you want to inspect the status code yourself:

```java
Connection.Response response = connection.ignoreHttpErrors(true).execute();
if (response.statusCode() == 200) {
  System.out.println("Success!");
} else {
  System.out.println("Failed with status code: " + response.statusCode());
}
```



By combining these methods, you can mimic browser behavior to scrape data effectively.

## Jsoup Example Scraper

To illustrate jsoup’s capabilities, let’s scrape product data from the first page of [web-scraping.dev/products](https://web-scraping.dev/products).

For this example we'll use Gradle to manage our dependencies. Create a new project and add the following to your `build.gradle.kts` file:

```kotlin
plugins {
    id("java")
    id("application")
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("org.jsoup:jsoup:1.16.1")
}

application {
    mainClass.set("JsoupScraper")
}
```



Then let's create a small scraper in a `src/main/java/JsoupScraper.java` file that will:

- Scrape [web-scraping.dev/products](https://web-scraping.dev/products) page and find all product URLs
- Scrape each product URL for product name and price
- Collect all results and display them

Our jsoup Java scraper should look something like this:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.HashMap;

public class JsoupScraper {

    public static HashMap<String, String> scrapeProduct(String url) throws Exception {
        // Scrape a single product page from web-scraping.dev
        Document doc = Jsoup.connect(url).get();
        HashMap<String, String> productData = new HashMap<>();

        productData.put("title", doc.select("h3").text());
        productData.put("price", doc.select(".product-price").text());
        productData.put("price_full", doc.select(".product-price-full").text());
        productData.put("url", url);

        return productData;
    }

    public static void main(String[] args) throws Exception {
        // Fetch the product directory page
        Document doc = Jsoup.connect("https://web-scraping.dev/products").get();

        // This is where we'll store our results
        ArrayList<HashMap<String, String>> products = new ArrayList<>();

        // Iterate through product elements, find product url and scrape each product
        Elements productElements = doc.select(".products .product");
        for (Element product : productElements) {
            // Get the product URL
            String url = product.select("h3 > a").attr("href");
            System.out.println("Scraping product: " + url);
            // Scrape each product and store result
            HashMap<String, String> productData = scrapeProduct(url);
            products.add(productData);
        }

        // Pretty print the product data
        System.out.println("Product Data:");
        for (HashMap<String, String> product : products) {
            System.out.println(product);
        }
    }
}
```



Example output:

```shell
$ gradle run

> Task :run
Scraping product: https://web-scraping.dev/product/1
Scraping product: https://web-scraping.dev/product/2
Scraping product: https://web-scraping.dev/product/3
Scraping product: https://web-scraping.dev/product/4
Scraping product: https://web-scraping.dev/product/5
Product Data:
{price=$9.99, price_full=$12.99, title=Box of Chocolate Candy Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Red Energy Potion Hiking Boots for Outdoor Adventures Kids' Light-Up Sneakers Blue Energy Potion, url=https://web-scraping.dev/product/1}
{price=$4.99, price_full=, title=Dark Red Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Red Energy Potion Cat-Ear Beanie Running Shoes for Men Classic Leather Sneakers, url=https://web-scraping.dev/product/2}
{price=$4.99, price_full=, title=Teal Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Dragon Energy Potion Women's High Heel Sandals Dark Red Energy Potion Running Shoes for Men, url=https://web-scraping.dev/product/3}
{price=$4.99, price_full=, title=Red Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Women's High Heel Sandals Blue Energy Potion Dark Red Energy Potion Cat-Ear Beanie, url=https://web-scraping.dev/product/4}
{price=$4.99, price_full=, title=Blue Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Women's High Heel Sandals Blue Energy Potion Hiking Boots for Outdoor Adventures Classic Leather Sneakers, url=https://web-scraping.dev/product/5}
```



Above is our jsoup scraper, which collected 5 products with their titles and prices. To break this down a bit further, let's take a look at each HTML parsing capability of jsoup.

## Parsing HTML with jsoup

Jsoup's Java HTML parser can be used to parse and modify scraped HTML content.

### Finding data with CSS Selectors

Jsoup's `.select()` method takes a [CSS Selector](https://scrapfly.io/blog/posts/parsing-html-with-css) and finds all matching elements in the HTML document. For example, to take only the first match, `.select().first()` can be used:

```java
Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
// find all images using css selector for matching elements with "product-img" class
Elements images = doc.select(".product-img");
// print only the first one using first()
System.out.println(images.first());
// prints: <img src="https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.webp" class="img-responsive product-img active">
```
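jsoup supports most of the standard CSS selector syntax, including attribute selectors, compound class selectors, and pseudo-selectors. Here's a sketch against a small inline document; the markup and class names are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorExample {
    public static void main(String[] args) {
        // Hypothetical markup used purely for illustration
        Document doc = Jsoup.parse(
            "<div class='products'>"
          + "<a class='product' href='/p/1' data-id='1'>First</a>"
          + "<a class='product sale' href='/p/2' data-id='2'>Second</a>"
          + "</div>");

        // Class selector: every element with class "product"
        System.out.println(doc.select(".product").size());            // 2
        // Attribute selector: elements where data-id equals "2"
        System.out.println(doc.select("[data-id=2]").text());         // Second
        // Compound selector: elements with both classes
        System.out.println(doc.select(".product.sale").attr("href")); // /p/2
        // Pseudo-selector: the first element child of its parent
        System.out.println(doc.select(".products a:first-child").text()); // First
    }
}
```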



### Selecting attributes and values

Extract inner text and attribute values using `.text()` and `.attr()`.

To get the text content of an HTML element, use the `.text()` method:

```java
Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
Elements variants = doc.select(".variants .variant");
// text of first variants
System.out.println(variants.first().text());
// prints: orange, small

// or text of all variants
System.out.println(variants.text());
// prints: orange, small orange, medium orange, large cherry, small cherry, medium cherry, large
```



To get the value of an HTML attribute set on an element, use the `.attr()` method:

```java
Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
Elements images = doc.select(".product-img");
System.out.println(images.first().attr("src"));
// prints: https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.webp
```



### Changing the DOM

Jsoup also allows modifying the DOM using the same `.text()` and `.attr()` methods.

When given a string argument, `.text()` replaces the inner text of the HTML element:

```java
doc.select("h1").first().text("Updated Title");
```



Similarly, `.attr()` accepts a second string argument that is set as the attribute's value:

```java
doc.select("img").first().attr("src", "new-image.jpg");
```



This versatility lets you work with HTML dynamically, much like in a browser.
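Putting both together, here's a small offline sketch (parsing a made-up HTML string rather than a live page) showing that modifications are reflected when the document is re-queried or serialized:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DomEditExample {
    public static void main(String[] args) {
        // Illustrative document parsed from a string, so no network is needed
        Document doc = Jsoup.parse("<h1>Old Title</h1><img src='old.jpg'>");

        // Replace the heading text and the image source in place
        doc.select("h1").first().text("Updated Title");
        doc.select("img").first().attr("src", "new-image.jpg");

        // The in-memory DOM now reflects both changes
        System.out.println(doc.select("h1").text());       // Updated Title
        System.out.println(doc.select("img").attr("src")); // new-image.jpg
    }
}
```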

## Jsoup Utilities

Jsoup comes equipped with handy utilities to simplify common HTML tasks.

### Cleanup HTML

Use `Jsoup.clean()` to sanitize HTML, removing unsafe tags and attributes:

```java
// Safelist is imported from org.jsoup.safety.Safelist
String cleanHtml = Jsoup.clean("<script>alert(1)</script><p>Safe content</p>", Safelist.basic());
```



### Prettify HTML

Format raw HTML for readability using:

```java
doc.outputSettings().prettyPrint(true);
System.out.println(doc.html());
```



### Escape and Unescape HTML

Handle special characters with `Entities.escape()` and `Entities.unescape()`:

```java
// Entities is imported from org.jsoup.nodes.Entities
String escaped = Entities.escape("<div>Content</div>");
// escaped: &lt;div&gt;Content&lt;/div&gt;

String unescaped = Entities.unescape("&lt;div&gt;Content&lt;/div&gt;");
// unescaped: <div>Content</div>
```



These utilities enhance your ability to manage and present HTML effectively.

## Jsoup Limitations

Despite its strengths, jsoup has some limitations that developers should consider:

- **Lack of HTTP/2 support:** Jsoup only supports basic HTTP/1.1 requests. For HTTP/2 and advanced networking capabilities, consider using libraries like [OkHttp](https://square.github.io/okhttp/). OkHttp is a popular HTTP client for Java; check out our [comprehensive guide on OkHttp](https://scrapfly.io/blog/posts/guide-to-okhttp-java-kotlin) to learn more about its capabilities.
- **No headless browser functionality:** Jsoup doesn’t execute JavaScript, which limits its ability to scrape dynamic web pages. Tools like Selenium or Puppeteer can help in these scenarios.
- **Detectability:** Jsoup’s requests can be easily identified as non-human by websites, making it less ideal for scraping heavily protected content.

For advanced use cases, combining jsoup with tools like **OkHttp** or **Scrapfly** can help overcome these challenges.

## Power Up with Scrapfly

Jsoup is great for small to medium-scale scraping tasks on static pages. However, it falls short when it comes to JavaScript-rendered content and scraper blocking caused by IP bans or bot detection.

Check out [Scrapfly's web scraping API](https://scrapfly.io/web-scraping-api) for all the details.



Here is a simple example of how you can use OkHttp with Scrapfly's Scraping API:

```java
import okhttp3.HttpUrl;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

import java.io.IOException;

public class OkHttpExample {
    public static void main(String[] args) {
        OkHttpClient client = new OkHttpClient();
        HttpUrl.Builder urlBuilder = HttpUrl.parse("https://api.scrapfly.io/scrape")
                .newBuilder();
        // Required parameters: your API key and URL to scrape
        urlBuilder.addQueryParameter("key", "YOUR_API_KEY");
        urlBuilder.addQueryParameter("url", "https://web-scraping.dev/product/1");
        // Optional parameters:
        // enable anti scraping protection bypass
        urlBuilder.addQueryParameter("asp", "true");
        // use proxies from specific countries
        urlBuilder.addQueryParameter("country", "US,CA,DE");
        // enable headless browser
        urlBuilder.addQueryParameter("render_js", "true");
        // see more on scrapfly docs: https://scrapfly.io/docs/scrape-api/getting-started#spec

        // Build and send the request
        String url = urlBuilder.build().toString();
        Request request = new Request.Builder()
                .url(url)
                .build();
        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful()) {
                System.out.println("Response Body: " + response.body().string());
                System.out.println("Status Code: " + response.code());
            } else {
                System.out.println("Request Failed: " + response.code());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```





## FAQ

**Can jsoup capture website screenshots?**

No, jsoup cannot capture website screenshots. For such needs, you’ll require a headless browser like Selenium or a specialized API. Consider using **Scrapfly’s Screenshot API**, which simplifies capturing full-page images with minimal setup.







**Does jsoup handle JavaScript-rendered content?**

No, jsoup cannot execute JavaScript or interact with dynamic content. It works only with static HTML. To scrape JavaScript-rendered pages, you’ll need tools like Selenium or Puppeteer, or services like Scrapfly, which offer JavaScript execution capabilities.







**Does jsoup support multi-threaded scraping?**

Jsoup itself doesn’t provide built-in multi-threading, but you can use Java’s concurrency utilities (e.g., ExecutorService) to scrape multiple pages simultaneously. Just ensure you manage thread safety and network limits to avoid being blocked by the target website.
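As a rough sketch of that pattern, the example below runs parsing tasks on a small fixed thread pool. Inline HTML strings stand in for pages you would normally fetch with `Jsoup.connect(url).get()`; the class and variable names are hypothetical:

```java
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentScrapeExample {
    public static void main(String[] args) throws Exception {
        // Stand-ins for pages you would fetch with Jsoup.connect(url).get()
        List<String> pages = List.of(
            "<h3>Product A</h3>", "<h3>Product B</h3>", "<h3>Product C</h3>");

        // A small fixed pool bounds concurrency (and the risk of getting blocked)
        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<Future<String>> futures = new ArrayList<>();
        for (String html : pages) {
            futures.add(pool.submit(() -> Jsoup.parse(html).select("h3").text()));
        }

        // Futures preserve submission order, so results print A, B, C
        for (Future<String> f : futures) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

With real requests, each task should create its own `Jsoup.connect()` call rather than sharing a single `Connection`, and you'd want per-host rate limiting on top.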









## Summary

Jsoup is a versatile and lightweight library for scraping and parsing web content. It excels at handling static HTML and provides utilities for cleaning, prettifying, and manipulating content. While it has limitations, combining it with tools like OkHttp or Scrapfly unlocks advanced capabilities, making it a powerful addition to any web scraping toolkit.

Whether you’re building a basic scraper or a robust data pipeline, jsoup provides the flexibility and functionality to get started quickly. Experiment with its features and extend its capabilities with complementary tools to suit your needs. Happy scraping!



 
