How to turn HTML to text in Python?

When web scraping, we might need to represent scraped HTML data as plain text. For this we can use BeautifulSoup's get_text() method, which joins the text of an element and all of its descendants and, most importantly, in recent BeautifulSoup versions skips non-text content such as the contents of <script> elements:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<body>
    <article>
    <h1>Article title</h1>
    <p>first paragraph and a <a>link</a></p>
    <script>var invisible="javascript variable";</script>
    </article>
</body>
""", "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
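# if possible it's best to restrict html to a specific element
element = soup.find('article')
text = element.get_text()
# get_text() keeps the whitespace of the source markup,
# so trim each line and drop the empty ones
text = "\n".join(line.strip() for line in text.splitlines() if line.strip())
print(text)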
"""
Article title
first paragraph and a link
"""
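
If we want the text on a single line, or don't want to rely on get_text() skipping <script> contents, a small wrapper can help. The sketch below is only an illustration: html_to_text and its tag_name parameter are hypothetical names that are not part of BeautifulSoup, while find(), find_all(), decompose() and the separator and strip arguments of get_text() are regular BeautifulSoup calls:

```python
from bs4 import BeautifulSoup


def html_to_text(html: str, tag_name: str = "article") -> str:
    """Hypothetical helper: text of the first <tag_name> element as one line."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(tag_name)
    if element is None:
        return ""
    # explicitly drop elements that never hold readable text,
    # so the result does not depend on the installed bs4 version
    for tag in element.find_all(["script", "style"]):
        tag.decompose()
    # separator is placed between text fragments; strip=True trims each
    # fragment and discards the ones that contain only whitespace
    return element.get_text(separator=" ", strip=True)


html = """
<body>
    <article>
    <h1>Article title</h1>
    <p>first paragraph and a <a>link</a></p>
    <script>var invisible="javascript variable";</script>
    </article>
</body>
"""
print(html_to_text(html))
# Article title first paragraph and a link
```

Note that the separator is inserted between every text fragment, including around inline tags, so markup like <a>link</a>. would come out as "link ." with this approach.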
