When web scraping, we might need to represent scraped HTML data as plain text. For this, we can use BeautifulSoup's get_text() method, which extracts the text of an element and its descendants and, most importantly, leaves out non-visible details such as the contents of <script> elements:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<body>
<article>
<h1>Article title</h1>
<p>first paragraph and a <a>link</a></p>
<script>var invisible="javascript variable";</script>
</article>
</body>
""", "html.parser")  # name the parser explicitly so bs4 does not have to guess one
# if possible it's best to restrict html to a specific element
element = soup.find('article')
text = element.get_text()
print(text)
"""
Article title
first paragraph and a link
"""
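The text returned by get_text() keeps the whitespace of the source HTML, so it is often worth normalizing. The following is a minimal sketch, not the only way to do this: the helper name `element_text` and its `tag_name` parameter are hypothetical, and it assumes the built-in "html.parser" is acceptable. It removes <script> and <style> tags explicitly with decompose(), which is an added precaution so that the result does not depend on the parser or library version, and then uses the `separator` and `strip` arguments of get_text() to collapse the output:

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<body>
<article>
<h1>Article title</h1>
<p>first paragraph and a <a>link</a></p>
<script>var invisible="javascript variable";</script>
</article>
</body>
"""


def element_text(html, tag_name="article"):
    """Return whitespace-normalized text of the first <tag_name> element.

    Hypothetical helper for illustration; falls back to the whole
    document when the element is not found.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(tag_name)
    if element is None:
        element = soup
    # Drop non-visible containers explicitly instead of relying on
    # the parser to mark their contents as non-text.
    for hidden in element.find_all(["script", "style"]):
        hidden.decompose()
    # strip=True trims each text fragment and skips empty ones;
    # separator=" " joins the remaining fragments with single spaces.
    return element.get_text(separator=" ", strip=True)


print(element_text(SAMPLE_HTML))
# Article title first paragraph and a link
```

Note that the separator is placed between every text fragment, including inline ones, so choosing `separator="\n"` here would put the word "link" on its own line rather than keeping it inside its paragraph.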
This knowledgebase is provided by Scrapfly, a web scraping API that allows you to scrape any website without getting blocked and implements dozens of other web scraping conveniences.