How to turn HTML to text in Python?

by scrapecrow Oct 31, 2022

When web scraping, we might need to represent scraped HTML data as plain text. For this, we can use BeautifulSoup's get_text() method, which extracts the text of an element and all of its descendants and, most importantly, skips invisible details such as the contents of <script> elements (in recent BeautifulSoup versions):

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<body>
    <article>
    <h1>Article title</h1>
    <p>first paragraph and a <a>link</a></p>
    <script>var invisible="javascript variable";</script>
    </article>
</body>
""", "html.parser")  # name the parser explicitly to avoid a GuessedAtParserWarning
# if possible, it's best to restrict the HTML to a specific element
element = soup.find('article')
text = element.get_text()
print(text)
"""
Article title
first paragraph and a link
"""
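
The output of get_text() keeps the whitespace of the source HTML, and whether <script> or <style> contents are skipped can depend on the installed BeautifulSoup version. As a minimal sketch (the helper name html_to_text and its tag_name parameter are made up for this illustration, not part of BeautifulSoup), we can remove such elements explicitly and use the separator and strip arguments of get_text() to normalize whitespace:

```python
from bs4 import BeautifulSoup

html = """
<body>
    <article>
    <h1>Article title</h1>
    <p>first paragraph and a <a>link</a></p>
    <script>var invisible="javascript variable";</script>
    <style>p { color: red; }</style>
    </article>
</body>
"""


def html_to_text(html: str, tag_name: str = "article") -> str:
    """Hypothetical helper: return whitespace-normalized text of one element."""
    soup = BeautifulSoup(html, "html.parser")
    # fall back to the whole document if the element is not found
    element = soup.find(tag_name) or soup
    # explicitly drop elements that never render as text,
    # so the result does not rely on version-specific get_text() behavior
    for tag in element.find_all(["script", "style", "noscript", "template"]):
        tag.decompose()
    # strip=True trims each text fragment and skips whitespace-only ones;
    # separator=" " joins the fragments with single spaces
    return element.get_text(separator=" ", strip=True)


print(html_to_text(html))
```

Note that a newline separator would also put inline elements such as the <a> link on their own line, so a space is the safer choice when paragraph structure doesn't matter.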

