How to turn HTML to text in Python?

by scrapecrow Oct 31, 2022

When web scraping, we might need to represent scrape HTML data as plain text. For this we can use BeautifulSoup's get_text() method which extracts all visible HTML text and most importantly ignores invisible details such as <script> elements:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<body>
    <article>
    <h1>Article title</h1>
    <p>first paragraph and a <a>link</a></p>
    <script>var invisible="javascript variable";</script>
    </article>
</body>
""")
# if possible it's best to restrict html to a specific element
element = soup.find('article')
text = element.get_text()
print(text)
"""
Article title
first paragraph and a link
"""

Related Articles

How to Parse Web Data with Python and Beautifulsoup

Beautifulsoup is one the most popular libraries in web scraping. In this tutorial, we'll take a hand-on overview of how to use it, what is it good for and explore a real -life web scraping example.

BEAUTIFULSOUP
DATA-PARSING
PYTHON
How to Parse Web Data with Python and Beautifulsoup

Guide to List Crawling: Everything You Need to Know

In-depth look at list crawling - how to extract valuable data from list-formatted content like tables, listicles and paginated pages.

CRAWLING
BEAUTIFULSOUP
PYTHON
Guide to List Crawling: Everything You Need to Know

Ultimate Guide to JSON Parsing in Python

Learn JSON parsing in Python with this ultimate guide. Explore basic and advanced techniques using json, and tools like ijson and nested-lookup

DATA-PARSING
PYTHON
Ultimate Guide to JSON Parsing in Python

Guide to Parsel - the Best HTML Parsing in Python

Learn to extract data from websites with Parsel, a Python library for HTML parsing using CSS selectors and XPath.

DATA-PARSING
PARSEL
Guide to Parsel - the Best HTML Parsing in Python

JSONL vs JSON

Learn the differences between JSON and JSONLines, their use cases, and efficiency. Why JSONLines excels in web scraping and real-time processing

DATA-PARSING
JSONL vs JSON

Web Scraping and HTML Parsing with Jsoup and Java

Learn how to harness the power of jsoup, a lightweight and efficient Java library for web scraping and HTML parsing.

DATA-PARSING
JAVA
Web Scraping and HTML Parsing with Jsoup and Java