Web Scraping With Go
Learn web scraping with Golang, from native HTTP requests and HTML parsing to a step-by-step guide to using Colly, the Go web crawling package.
To save an HTML table to an Excel spreadsheet we can use Python with BeautifulSoup4 and xlsxwriter + HTTP client like requests.
$ pip install bs4 xlsxwriter requests
Then, we can scrape the web page, find table data using bs4 and write it to .xlsx
file using `xlsxwriter``:
from bs4 import BeautifulSoup
import requests
import xlsxwriter
# 1. Retrieve HTML and create BeautifulSoup object
response = requests.get("https://www.w3schools.com/html/html_tables.asp")
soup = BeautifulSoup(response.text)
# 2. Find the table and extract headers and rows:
table = soup.find('table', {"id": "customers"})
header = []
rows = []
for i, row in enumerate(table.find_all('tr')):
if i == 0:
header = [el.text.strip() for el in row.find_all('th')]
else:
rows.append([el.text.strip() for el in row.find_all('td')])
# 3. save to it a XLSX file:
workbook = xlsxwriter.Workbook('output.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write_row(0, 0, header)
for i, row in enumerate(rows):
worksheet.write_row(i+1, 0, row)
workbook.close()
BeautifulSoup is a very powerful HTML parser giving us full control when it comes to parsing HTML tables. Unlike many automated scripts we can direct it to extract HTML table values from any table structure.
This knowledgebase is provided by Scrapfly — a web scraping API that allows you to scrape any website without getting blocked and implements a dozens of other web scraping conveniences. Check us out 👇