How to turn HTML to text in Python?

When web scraping, we might need to represent scrape HTML data as plain text. For this we can use BeautifulSoup's get_text() method which extracts all visible HTML text and most importantly ignores invisible details such as <script> elements:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<body>
    <article>
    <h1>Article title</h1>
    <p>first paragraph and a <a>link</a></p>
    <script>var invisible="javascript variable";</script>
    </article>
</body>
""")
# if possible it's best to restrict html to a specific element
element = soup.find('article')
text = element.get_text()
print(text)
"""
Article title
first paragraph and a link
"""

Related Posts

Quick Intro to Parsing JSON with JMESPath in Python

Introduction to JMESPath - JSON query language which is used in web scraping to parse JSON datasets for scrape data.

How to Scrape Hidden Web Data

The visible HTML doesn't always represent the whole dataset available on the page. In this article, we'll be taking a look at scraping of hidden web data. What is it and how can we scrape it using Python?

How to Ensure Web Scrapped Data Quality

Ensuring consitent web scrapped data quality can be a difficult and exhausting task. In this article we'll be taking a look at two populat tools in Python - Cerberus and Pydantic - and how can we use them to validate data.