Python has many different HTTP clients that can be used for web scraping. However, not all of them support HTTP2, which can be vital in avoiding web scraper blocking: modern browsers mostly use HTTP2, so plain HTTP/1.1 traffic can stand out as automated.
Here are the most popular HTTP clients that support HTTP2:
import httpx
# httpx: high-level client; HTTP2 is an optional extra (pip install "httpx[http2]")
with httpx.Client(http2=True) as client:
    response = client.get("https://httpbin.dev/anything")
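import socket
import ssl
import h2.connection
import h2.config
# h2: low-level HTTP2 protocol implementation; opening the TLS socket and moving bytes is left to you
ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["h2"])  # HTTP2 is negotiated through TLS ALPN
sock = ctx.wrap_socket(socket.create_connection(("httpbin.dev", 443)), server_hostname="httpbin.dev")
config = h2.config.H2Configuration(client_side=True)
conn = h2.connection.H2Connection(config=config)
conn.initiate_connection()  # queues the HTTP2 connection preface
headers = [(":method", "POST"), (":scheme", "https"), (":authority", "httpbin.dev"), (":path", "/anything")]
conn.send_headers(stream_id=1, headers=headers)
conn.send_data(1, b"hello", end_stream=True)
sock.sendall(conn.data_to_send())
events = conn.receive_data(sock.recv(65535))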
Since h2 is a low-level protocol implementation that leaves connection handling to you, it's best to stick to httpx for HTTP2, though if you have a complex use case, h2 can be adapted to extensible libraries like twisted.
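Enabling `http2=True` does not guarantee that a request actually travels over HTTP2: the server may only offer HTTP/1.1, and httpx raises an ImportError if its optional `h2` dependency is missing. A small guard along these lines can make both cases explicit. This is a sketch only; the helper names `http2_ready` and `fetch_over_http2` are hypothetical and not part of httpx.

```python
import importlib.util


def http2_ready() -> bool:
    """True when httpx and h2 (httpx's optional HTTP2 dependency) are importable."""
    return all(importlib.util.find_spec(name) is not None for name in ("httpx", "h2"))


def fetch_over_http2(url: str):
    """Fetch url with httpx and fail loudly if HTTP2 was not negotiated."""
    if not http2_ready():
        raise RuntimeError("HTTP2 support missing; try: pip install 'httpx[http2]'")
    import httpx  # imported lazily so http2_ready() works even without httpx

    with httpx.Client(http2=True) as client:
        response = client.get(url)
    # httpx reports the negotiated protocol, e.g. "HTTP/1.1" or "HTTP/2"
    if response.http_version != "HTTP/2":
        raise RuntimeError(f"server answered with {response.http_version}, not HTTP/2")
    return response
```

Calling `fetch_over_http2("https://httpbin.dev/anything")` would then either return the response or raise with a reason, instead of silently falling back to HTTP/1.1.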
For more on using HTTPX for web scraping, see our hands-on introduction article, which covers everything you need to know to get started.
This knowledgebase is provided by Scrapfly, a web scraping API that allows you to scrape any website without getting blocked and implements dozens of other web scraping conveniences.