# Creating Search Engine for any Website using Web Scraping

 by [Bernardas Alisauskas](https://scrapfly.io/blog/author/bernardas) Apr 10, 2026 17 min read [\#crawling](https://scrapfly.io/blog/tag/crawling) [\#data-parsing](https://scrapfly.io/blog/tag/data-parsing) [\#seo](https://scrapfly.io/blog/tag/seo) 


In this tutorial we'll take a look at how to create a search engine for any website by web scraping its data, parsing and indexing the sections of interest, and wrapping it all up with an intuitive GUI.

We'll be using the [lunr.js](https://lunrjs.com/) JavaScript search engine to serve our search index, and Python for data scraping and index generation. The same techniques can be applied to scrape search engines themselves, like [Google](https://scrapfly.io/blog/posts/how-to-scrape-google) or [Bing](https://scrapfly.io/blog/posts/how-to-scrape-bing-search-using-python), to build your own SERP data collection tools.

As an example project, we'll scrape [ScrapFly's documentation website](https://scrapfly.io/docs) and create a search engine for each of its documentation pages. In this exercise we'll learn about crawling, parsing index sections from HTML documents, and putting all of this together as a search engine. Let's dive in!

## Key Takeaways

Master search engine development with web scraping techniques, data indexing, and JavaScript search functionality for comprehensive website search solutions.

- Implement web scraping with Python to crawl and extract content from target websites for search indexing
- Configure lunr.js JavaScript search engine for client-side full-text search functionality
- Parse HTML documents and extract relevant content sections for search index generation
- Build responsive web interfaces with search functionality and result display
- Use specialized tools like ScrapFly for automated content collection with anti-blocking features
- Apply advanced crawling techniques with depth limits and URL discovery for comprehensive content collection


## What is Lunrjs?

Lunr.js is a small, full-text search library for use in the browser. It indexes JSON documents and provides a simple search interface for retrieving documents that best match text queries.



Since lunr uses JSON documents for its index and runs in the browser from a single JSON file, we can easily integrate it with web scraping!

We'll scrape HTML data from our source, parse it into a JSON structure and feed that into the Lunrjs front-end: creating our very own search engine!

## Data Scraper

To collect our search engine data, we'll first have to write a scraper that retrieves the documents for indexing. In this example, we'll use ScrapFly documentation pages: <https://scrapfly.io/docs>

For our scraper we'll use Python with a few community packages:

- [httpx](https://pypi.org/project/httpx/) package for our HTTP connections
- [parsel](https://pypi.org/project/parsel/) package for parsing HTML data for values we want to be considered in our index.
- [lunr](https://pypi.org/project/lunr/) package for building our lunr index.
- [loguru](https://pypi.org/project/loguru/) \[optional\] package for easy, pretty logs so we can follow along easier.

We can install these libraries via the `pip` command:

```shell
$ pip install httpx parsel lunr loguru
```



### Using Crawling

For collecting index data we'll be using a web scraping technique called crawling. Crawling is essentially a web scraping loop where our program continuously collects documents, finds more URLs to scrape and repeats the process until nothing new is found.
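That loop can be sketched without any HTTP at all; the hypothetical `site` dict below stands in for pages and their outgoing links:

```python
# toy sketch of a crawl loop: `site` maps each url to the links found on it
site = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/b", "/docs"],  # cycles are fine - the seen set breaks them
    "/docs/b": ["/docs/c"],
    "/docs/c": [],
}

def crawl(start, links, max_depth=10):
    urls_seen = {start}      # urls we've already discovered
    urls_to_crawl = [start]  # the current crawl frontier
    visited = []
    depth = 0
    while urls_to_crawl and depth <= max_depth:
        found = set()
        for url in urls_to_crawl:
            visited.append(url)               # stand-in for an HTTP GET
            found.update(links.get(url, []))  # stand-in for link extraction
        # only crawl urls we haven't seen before (sorted for determinism)
        urls_to_crawl = sorted(found - urls_seen)
        urls_seen |= found
        depth += 1
    return visited

print(crawl("/docs", site))  # → ['/docs', '/docs/a', '/docs/b', '/docs/c']
```

The real crawler below follows the same frontier/seen-set structure, just with real HTTP requests and an XPath rule for link extraction.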



In Python we can illustrate this process using our `httpx` and `parsel` tools:

```python
import asyncio
from typing import List

import httpx
from loguru import logger as log
from parsel import Selector


def find_urls(resp: httpx.Response, xpath: str) -> set:
    """find crawlable urls in a response from an xpath"""
    found = set()
    urls = Selector(text=resp.text).xpath(xpath).getall()
    for url in urls:
        url = httpx.URL(resp.url).join(url.split("#")[0])
        if url.host != resp.url.host:
            log.debug(f"skipping url of a different hostname: {url.host}")
            continue
        found.add(str(url))
    return found


async def crawl(url, follow_xpath: str, session: httpx.AsyncClient, max_depth=10) -> List[httpx.Response]:
    """Crawl source with provided follow rules"""
    urls_seen = set()
    urls_to_crawl = [url]
    all_responses = []
    depth = 0
    while urls_to_crawl:
        # first we want to protect ourselves from accidental infinite crawl loops
        if depth > max_depth:
            log.error(
                f"max depth reached with {len(urls_to_crawl)} urls left in the crawl queue"
            )
            break
        log.info(f"scraping: {len(urls_to_crawl)} urls")
        responses = await asyncio.gather(*[session.get(url) for url in urls_to_crawl])
        found_urls = set()
        for resp in responses:
            all_responses.append(resp)
            found_urls = found_urls.union(find_urls(resp, xpath=follow_xpath))
        # find more urls to crawl that we haven't visited before:
        urls_to_crawl = found_urls.difference(urls_seen)
        urls_seen = urls_seen.union(found_urls)
        depth += 1
    log.info(f"found {len(all_responses)} responses")
    return all_responses
```



In the example above, we provide our crawler with crawling rules and a starting point. The asynchronous crawler keeps following discovered URLs in breadth-first fashion until it finds everything it can.

Let's run it against our example target - ScrapFly docs:

```python
# Example use:
async def run():
    limits = httpx.Limits(max_connections=3)
    headers = {"User-Agent": "ScrapFly Blog article"}
    async with httpx.AsyncClient(limits=limits, headers=headers) as session:
        responses = await crawl(
            # our starting point url
            url="https://scrapfly.io/docs", 
            # xpath to discover urls to crawl
            follow_xpath="//ul[contains(@class,'nav')]//li/a/@href",
            session=session,
        )

if __name__ == "__main__":
    asyncio.run(run())
```



We can see that this crawler quickly retrieves 23 pages for us:

```shell
2022-05-26 | INFO     | __main__:crawl:33 - scraping: 1 urls
2022-05-26 | INFO     | __main__:crawl:33 - scraping: 22 urls
2022-05-26 | INFO     | __main__:crawl:43 - found 23 responses
```



With pages collected we can start parsing them for data we'll use in our search engine.

### HTML cleanup

Since we index HTML values, we get the advantage of rich text in our results: highlights, links and pictures. However, for that we have to clean up our HTML data to avoid polluting the index with unnecessary values.

```python
from httpx import Response
from parsel import Selector
from urllib.parse import urljoin

def get_clean_html_tree(
    resp: Response, remove_xpaths=(".//figure", ".//*[contains(@class,'carousel')]")
):
    """cleanup HTML tree from domain specific details like classes"""
    sel = Selector(text=resp.text)
    for remove_xp in remove_xpaths:
        for rm_node in sel.xpath(remove_xp):
            rm_node.remove()
    allowed_attributes = ["src", "href", "width", "height"]
    for el in sel.xpath("//*"):
        for k in list(el.root.attrib):
            if k in allowed_attributes:
                continue
            el.root.attrib.pop(k)
        # turn all link to absolute
        if el.root.attrib.get("href"):
            el.root.attrib["href"] = urljoin(str(resp.url), el.root.attrib["href"])
        if el.root.attrib.get("src"):
            el.root.attrib["src"] = urljoin(str(resp.url), el.root.attrib["src"])
    return sel
```



Here we have our cleanup function, which removes unnecessary HTML node attributes and converts relative links to absolute ones. For a quality search engine it's important to sanitize data like this to prevent false positives.
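To see the attribute whitelisting in isolation, here's a stdlib-only sketch with `html.parser` (the real cleanup above uses parsel; this hypothetical `AttributeStripper` just illustrates the idea):

```python
from html.parser import HTMLParser

ALLOWED = {"src", "href", "width", "height"}

class AttributeStripper(HTMLParser):
    """re-emit HTML, keeping only whitelisted attributes"""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        kept = [(k, v) for k, v in attrs if k in ALLOWED]
        rendered = "".join(f' {k}="{v}"' for k, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

stripper = AttributeStripper()
stripper.feed('<a href="/docs" class="btn" data-x="1">Docs</a>')
print("".join(stripper.out))  # → <a href="/docs">Docs</a>
```

The `class` and `data-x` attributes are dropped while `href` survives, which is exactly the kind of noise reduction we want before indexing.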

### Index Sections

With our HTML documents collected we can start parsing them for the search index. For this part, let's split each page into sections we'll use for index generation. This will produce a much better index for our search engine.

For sections, let's split the page by headers:



We'll divide every page into sections separated by headings, so a single page will produce several index targets. This is great for our web search engine, as headings usually indicate topics or subtopics, and we can usually link directly to that part of the page using the `#` fragment syntax.
For example, when we visit <https://scrapfly.io/docs/project#introduction> our browser automatically scrolls to the Introduction heading. This behavior is controlled by the HTML node's `id` attribute.
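The stdlib's `urldefrag` is a handy way to see both halves of this: the crawler strips fragments so duplicate URLs collapse, while the parser re-attaches heading ids so results deep-link to their section (a small sketch, not the article's exact code):

```python
from urllib.parse import urldefrag

# crawling: strip the fragment so /docs/project#introduction and
# /docs/project count as the same document
url, fragment = urldefrag("https://scrapfly.io/docs/project#introduction")
print(url)  # → https://scrapfly.io/docs/project

# indexing: re-attach the heading's id so search results can deep-link
section_id = "introduction"  # taken from the heading node's id attribute
location = f"{url}#{section_id}"
print(location)  # → https://scrapfly.io/docs/project#introduction
```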

To split HTML by sections we can use a simple parsing algorithm in Python:

```python
def parse(responses: List[Response]) -> List[dict]:
    """parse responses for index documents"""
    log.info(f"parsing documents from {len(responses)} responses")
    documents = []
    for resp in responses:
        sel = get_clean_html_tree(resp)

        sections = []
        # some pages might have multiple article bodies:
        for article in sel.xpath("//article"):
            section = []
            for node in article.xpath("*"):
                # separate page by <hX> nodes
                if re.search(r"h\d", node.root.tag) and len(section) > 1:
                    sections.append(section)
                    section = [node]
                else:
                    section.append(node)
            if section:
                sections.append(section)

        page_title = sel.xpath("//h1/text()").get("").strip()
        for section in sections:
            data = {
                "title": f"{page_title} | "
                + "".join(section[0].xpath(".//text()").getall()).strip(),
                "text": "".join(s.get() for s in section[1:]).strip(),
            }
            url_with_id_pointer = (
                str(resp.url) + "#" + (section[0].xpath("@id").get() or data["title"])
            )
            data["location"] = url_with_id_pointer
            documents.append(data)
    return documents
```



The parsing code above splits our HTML tree by any heading element (`h1`, `h2`, etc.) and turns each section into an index document made up of a title (the `hX` node's text) and the section's HTML body.
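The same grouping logic can be sketched on plain `(tag, text)` pairs without parsel (note this variant starts a new section on any non-leading heading, a slight simplification of the `len(section) > 1` check above):

```python
import re

# every <hX> starts a new section; other nodes join the current one
nodes = [
    ("h1", "Project"),
    ("p", "overview text"),
    ("h2", "Introduction"),
    ("p", "intro text"),
    ("h2", "Billing"),
    ("p", "billing text"),
]

sections, section = [], []
for tag, text in nodes:
    if re.fullmatch(r"h\d", tag) and section:
        sections.append(section)  # flush the finished section
        section = []
    section.append((tag, text))
if section:
    sections.append(section)

titles = [sec[0][1] for sec in sections]  # first node of each section is its heading
print(titles)  # → ['Project', 'Introduction', 'Billing']
```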

## Building Index

We have our scraper and parser ready - it's time to build our index. Our index will consist of the JSON documents (article sections) we extracted previously:

```json
[
  {
    "location": "url of the section with its #fragment",
    "title": "title of the section",
    "text": "html value of the section"
  },
  ...
]
```



There are a few ways to build our `lunrjs` index, but the simplest is to use the `lunr` Python package:

```python
import json
from typing import List

from loguru import logger as log
from lunr import lunr

def build_index(docs: List[dict]):
    """build lunrjs index from provided list of documents"""
    log.info(f"building index from {len(docs)} documents")
    config = {
        "lang": ["en"],
        "min_search_length": 1,
    }
    page_dicts = {"docs": docs, "config": config}
    idx = lunr(
        ref="location",
        fields=("title", "text"),
        documents=docs,
        languages=["en"],
    )
    page_dicts["index"] = idx.serialize()
    return json.dumps(page_dicts, sort_keys=True, separators=(",", ":"), indent=2)
```



This function takes in a list of documents and generates a lunr index. Let's give it a shot!
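The serialized file ends up with three top-level keys the front-end reads back: the raw docs (for rendering results), the config, and the prebuilt index. A minimal sketch of that layout (the `index` value is a placeholder here):

```python
import json

page_dicts = {
    "docs": [{"location": "/docs#intro", "title": "Intro", "text": "<p>hello</p>"}],
    "config": {"lang": ["en"], "min_search_length": 1},
    "index": {},  # placeholder - idx.serialize() output goes here
}
# compact separators keep the file small; sort_keys makes output deterministic
raw = json.dumps(page_dicts, sort_keys=True, separators=(",", ":"))
loaded = json.loads(raw)
print(sorted(loaded))  # → ['config', 'docs', 'index']
```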

## Putting everything together

We've defined all of our components:

- a crawler, which collects HTML documents.
- a parser, which splits each HTML document into sections.
- an index builder, which turns section documents into a single lunrjs JSON index.

Our final project code:

```python
import asyncio
import json
import re
from typing import List
from urllib.parse import urljoin

import httpx
from httpx import Response
from loguru import logger as log
from lunr import lunr
from parsel import Selector


def find_urls(resp: httpx.Response, xpath: str) -> set:
    """find crawlable urls in a response from an xpath"""
    found = set()
    urls = Selector(text=resp.text).xpath(xpath).getall()
    for url in urls:
        url = httpx.URL(resp.url).join(url.split("#")[0])
        if url.host != resp.url.host:
            log.debug(f"skipping url of a different hostname: {url.host}")
            continue
        found.add(str(url))
    return found


async def crawl(url, follow_xpath: str, session: httpx.AsyncClient, max_depth=10) -> List[httpx.Response]:
    """Crawl source with provided follow rules"""
    urls_seen = set()
    urls_to_crawl = [url]
    all_responses = []
    depth = 0
    while urls_to_crawl:
        # first we want to protect ourselves from accidental infinite crawl loops
        if depth > max_depth:
            log.error(
                f"max depth reached with {len(urls_to_crawl)} urls left in the crawl queue"
            )
            break
        log.info(f"scraping: {len(urls_to_crawl)} urls")
        responses = await asyncio.gather(*[session.get(url) for url in urls_to_crawl])
        found_urls = set()
        for resp in responses:
            all_responses.append(resp)
            found_urls = found_urls.union(find_urls(resp, xpath=follow_xpath))
        # find more urls to crawl that we haven't visited before:
        urls_to_crawl = found_urls.difference(urls_seen)
        urls_seen = urls_seen.union(found_urls)
        depth += 1
    log.info(f"found {len(all_responses)} responses")
    return all_responses


def get_clean_html_tree(
    resp: Response, remove_xpaths=(".//figure", ".//*[contains(@class,'carousel')]")
):
    """cleanup HTML tree from domain specific details like classes"""
    sel = Selector(text=resp.text)
    for remove_xp in remove_xpaths:
        for rm_node in sel.xpath(remove_xp):
            rm_node.remove()
    allowed_attributes = ["src", "href", "width", "height"]
    for el in sel.xpath("//*"):
        for k in list(el.root.attrib):
            if k in allowed_attributes:
                continue
            el.root.attrib.pop(k)
        # turn all link to absolute
        if el.root.attrib.get("href"):
            el.root.attrib["href"] = urljoin(str(resp.url), el.root.attrib["href"])
        if el.root.attrib.get("src"):
            el.root.attrib["src"] = urljoin(str(resp.url), el.root.attrib["src"])
    return sel


def parse(responses: List[Response]) -> List[dict]:
    """parse responses for index documents"""
    log.info(f"parsing documents from {len(responses)} responses")
    documents = []
    for resp in responses:
        sel = get_clean_html_tree(resp)

        sections = []
        # some pages might have multiple article bodies:
        for article in sel.xpath("//article"):
            section = []
            for node in article.xpath("*"):
                # separate page by <hX> nodes
                if re.search(r"h\d", node.root.tag) and len(section) > 1:
                    sections.append(section)
                    section = [node]
                else:
                    section.append(node)
            if section:
                sections.append(section)

        page_title = sel.xpath("//h1/text()").get("").strip()
        for section in sections:
            data = {
                "title": f"{page_title} | "
                + "".join(section[0].xpath(".//text()").getall()).strip(),
                "text": "".join(s.get() for s in section[1:]).strip(),
            }
            url_with_id_pointer = (
                str(resp.url) + "#" + (section[0].xpath("@id").get() or data["title"])
            )
            data["location"] = url_with_id_pointer
            documents.append(data)
    return documents


def build_index(docs: List[dict]):
    """build lunrjs index from provided list of documents"""
    log.info(f"building index from {len(docs)} documents")
    config = {
        "lang": ["en"],
        "min_search_length": 1,
    }
    page_dicts = {"docs": docs, "config": config}
    idx = lunr(
        ref="location",
        fields=("title", "text"),
        documents=docs,
        languages=["en"],
    )
    page_dicts["index"] = idx.serialize()
    return json.dumps(page_dicts, sort_keys=True, separators=(",", ":"), indent=2)


async def run():
    """
    example run function:
    establishes http session, crawls html documents, 
    turns them into index documents and compiles lunr index
    """
    limits = httpx.Limits(max_connections=3)
    timeout = httpx.Timeout(20.0)
    headers = {"User-Agent": "ScrapFly Blog article"}
    async with httpx.AsyncClient(
        limits=limits, headers=headers, timeout=timeout
    ) as session:
        responses = await crawl(
            # our starting point url
            url="https://scrapfly.io/docs",
            # xpath to discover urls to crawl
            follow_xpath="//ul[contains(@class,'nav')]//li/a/@href",
            session=session,
        )
        documents = parse(responses)
        with open("search_index.json", "w") as f:
            f.write(build_index(documents))


if __name__ == "__main__":
    asyncio.run(run())
```



If we run this code, `search_index.json` will be generated. It's not much use to us without a front-end, so let's put one up!

### Front End Explorer

For our search front-end we put together a simple viewer for the purpose of this tutorial; it can be [found on our github](https://github.com/Granitosaurus/simple-lunrjs-display).

Let's clone this front-end and give it our generated index:

```shell
# create a project directory and enter it
$ mkdir docs-search && cd docs-search
# clone our front-end
$ git clone https://github.com/Granitosaurus/simple-lunrjs-display
$ cd simple-lunrjs-display
# replace search_index.json with the one we generated
$ cp ../search_index.json .
# start a http server to see our search engine live!
$ python -m http.server --bind 127.0.0.1
```



Now, if we go to `http://127.0.0.1:8000` we can explore our search!



## Use ScrapFly to Handle Dynamic Pages and Blocking

When crawling pages we often encounter two issues:

- To see the full content of a dynamic page, we need to render javascript.
- Some pages might block our scraper from accessing them.

This is where Scrapfly can help you to scale up your dynamic page scrapers!



Let's modify our scraper to use ScrapFly API. For this, we'll be using [scrapfly-sdk](https://scrapfly.io/docs/sdk/python) python package and ScrapFly's [anti scraping protection bypass](https://scrapfly.io/docs/scrape-api/anti-scraping-protection) and [Javascript Rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering) features.
First, let's install `scrapfly-sdk` using pip:

```shell
$ pip install scrapfly-sdk
```



Then all we have to do is replace our `httpx` client with `ScrapflyClient`:

```python
import asyncio
import json
import re
from typing import List
from urllib.parse import urljoin

from httpx import URL
from loguru import logger as log
from lunr import lunr
from parsel import Selector
from requests import Response
from scrapfly import ScrapflyClient, ScrapeConfig


def find_urls(resp: Response, xpath: str) -> set:
    """find crawlable urls in a response from an xpath"""
    found = set()
    urls = Selector(text=resp.text).xpath(xpath).getall()
    for url in urls:
        url = urljoin(str(resp.url), url.split("#")[0])
        if URL(url).host != URL(str(resp.url)).host:
            log.debug(f"skipping url of a different hostname: {URL(url).host}")
            continue
        found.add(str(url))
    return found


async def crawl(url, follow_xpath: str, session: ScrapflyClient, max_depth=10) -> List[Response]:
    """Crawl source with provided follow rules"""
    urls_seen = set()
    urls_to_crawl = [url]
    all_responses = []
    depth = 0
    while urls_to_crawl:
        # first we want to protect ourselves from accidental infinite crawl loops
        if depth > max_depth:
            log.error(
                f"max depth reached with {len(urls_to_crawl)} urls left in the crawl queue"
            )
            break
        log.info(f"scraping: {len(urls_to_crawl)} urls")
        responses = await session.concurrent_scrape([
            ScrapeConfig(
                url=url,
                # to render javascript for dynamic pages
                render_js=True,
                # enable anti bot protection bypass to avoid blocking
                asp=True,
            )
            for url in urls_to_crawl
        ])
        responses = [scrapfly_response.upstream_result_into_response() for scrapfly_response in responses]
        found_urls = set()
        for resp in responses:
            all_responses.append(resp)
            found_urls = found_urls.union(find_urls(resp, xpath=follow_xpath))
        # find more urls to crawl that we haven't visited before:
        urls_to_crawl = found_urls.difference(urls_seen)
        urls_seen = urls_seen.union(found_urls)
        depth += 1
    log.info(f"found {len(all_responses)} responses")
    return all_responses


def get_clean_html_tree(
    resp: Response, remove_xpaths=(".//figure", ".//*[contains(@class,'carousel')]")
):
    """cleanup HTML tree from domain specific details like classes"""
    sel = Selector(text=resp.text)
    for remove_xp in remove_xpaths:
        for rm_node in sel.xpath(remove_xp):
            rm_node.remove()
    allowed_attributes = ["src", "href", "width", "height"]
    for el in sel.xpath("//*"):
        for k in list(el.root.attrib):
            if k in allowed_attributes:
                continue
            el.root.attrib.pop(k)
        # turn all links to absolute
        if el.root.attrib.get("href"):
            el.root.attrib["href"] = urljoin(str(resp.url), el.root.attrib["href"])
        if el.root.attrib.get("src"):
            el.root.attrib["src"] = urljoin(str(resp.url), el.root.attrib["src"])
    return sel


def parse(responses: List[Response]) -> List[dict]:
    """parse responses for index documents"""
    log.info(f"parsing documents from {len(responses)} responses")
    documents = []
    for resp in responses:
        sel = get_clean_html_tree(resp)

        sections = []
        # some pages might have multiple article bodies:
        for article in sel.xpath("//article"):
            section = []
            for node in article.xpath("*"):
                # separate page by <hX> nodes
                if re.search(r"h\d", node.root.tag) and len(section) > 1:
                    sections.append(section)
                    section = [node]
                else:
                    section.append(node)
            if section:
                sections.append(section)

        page_title = sel.xpath("//h1/text()").get("").strip()
        for section in sections:
            data = {
                "title": f"{page_title} | "
                + "".join(section[0].xpath(".//text()").getall()).strip(),
                "text": "".join(s.get() for s in section[1:]).strip(),
            }
            url_with_id_pointer = (
                str(resp.url) + "#" + (section[0].xpath("@id").get() or data["title"])
            )
            data["location"] = url_with_id_pointer
            documents.append(data)
    return documents


def build_index(docs: List[dict]):
    """build lunrjs index from provided list of documents"""
    log.info(f"building index from {len(docs)} documents")
    config = {
        "lang": ["en"],
        "min_search_length": 1,
    }
    page_dicts = {"docs": docs, "config": config}
    idx = lunr(
        ref="location",
        fields=("title", "text"),
        documents=docs,
        languages=["en"],
    )
    page_dicts["index"] = idx.serialize()
    return json.dumps(page_dicts, sort_keys=True, separators=(",", ":"), indent=2)


async def run():
    """
    example run function:
    establishes scrapfly session, crawls html documents,
    turns them into index documents and compiles lunr index
    """
    with ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY", max_concurrency=2) as session:
        responses = await crawl(
            # our starting point url
            url="https://scrapfly.io/docs",
            # xpath to discover urls to crawl
            follow_xpath="//ul[contains(@class,'nav')]//li/a/@href",
            session=session,
        )
        documents = parse(responses)
        with open("search_index.json", "w") as f:
            f.write(build_index(documents))


if __name__ == "__main__":
    asyncio.run(run())
```

## Summary

In this tutorial we used Python and the [lunrjs](https://lunrjs.com/) library to create a search engine from web-scraped data. We started by writing a crawler for our source, which scrapes all HTML data recursively. Then we learned about index creation by parsing HTML documents into sections, which we fed into our lunrjs index generator to prebuild the index.

Search engines like this are a great way to present web-scraped data and build personal knowledge bases. It's even easier when using the ScrapFly API to render dynamic pages and avoid scraper blocking, so give it a shot!



 
