When web scraping, we often need to represent scraped HTML as plain text. For this we can use BeautifulSoup's get_text()
method, which collects the text of an element and all of its descendants and, most importantly, skips non-visible content such as <script>
elements (in recent BeautifulSoup versions with the html.parser or lxml parsers):
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<body>
<article>
<h1>Article title</h1>
<p>first paragraph and a <a>link</a></p>
<script>var invisible="javascript variable";</script>
</article>
</body>
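""", "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
# where possible, restrict extraction to a specific element rather than the whole page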
element = soup.find('article')
text = element.get_text()
print(text)
"""
Article title
first paragraph and a link
"""
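The raw get_text() output keeps the whitespace of the original markup, and whether <script> content is skipped depends on the parser and the BeautifulSoup version. As a minimal sketch, assuming bs4 is installed, the snippet below removes <script> and <style> elements explicitly and then joins the remaining text fragments with single spaces. The helper name visible_text and the sample <style> rule are illustrative and not part of the original example; separator and strip are existing get_text() parameters.

```python
from bs4 import BeautifulSoup


def visible_text(html: str) -> str:
    """Hypothetical helper: return the text of `html` as one space-joined line."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-visible elements explicitly so the result does not
    # depend on how a given parser or bs4 version treats them.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator joins the text fragments; strip=True trims each fragment
    # and discards the ones that are only whitespace.
    return soup.get_text(separator=" ", strip=True)


html = """
<article>
<h1>Article title</h1>
<style>p { color: red; }</style>
<p>first paragraph and a <a>link</a></p>
<script>var invisible="javascript variable";</script>
</article>
"""

text = visible_text(html)
print(text)
```

Removing the elements before calling get_text() costs one extra pass over the tree, but it keeps the output the same if the parser is later swapped, for example to html5lib.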
This knowledgebase is provided by Scrapfly data APIs.