# How to Turn Web Scrapers into Data APIs

 by [Bernardas Alisauskas](https://scrapfly.io/blog/author/bernardas) Apr 10, 2026 14 min read [\#api](https://scrapfly.io/blog/tag/api) [\#project](https://scrapfly.io/blog/tag/project) [\#python](https://scrapfly.io/blog/tag/python) 


Delivering web scraped data in real time can be a real challenge. Fortunately, turning a Python web scraper into a data-on-demand API requires very few ingredients.

In this practical tutorial, we'll build a simple [Yahoo Finance](https://finance.yahoo.com/) scraper and wrap it in a web API powered by [FastAPI](https://fastapi.tiangolo.com/) to scrape and deliver thousands of stock data details on demand!

## Key Takeaways

In this tutorial you'll learn how to build a web scraper API with Python and FastAPI that delivers scraped data in real time.

- Implement FastAPI web services with asynchronous request handling for high-performance scraping APIs
- Configure real-time data scraping with caching and webhook support for efficient data delivery
- Handle concurrent requests and rate limiting for scalable scraping API endpoints
- Implement proper error handling and retry logic for reliable API responses
- Use specialized tools like ScrapFly for automated API management with anti-blocking features
- Apply advanced techniques for background task processing, webhook delivery, and cache management

## Why FastAPI?

[FastAPI](https://fastapi.tiangolo.com/) is an **asynchronous** API framework for Python, which makes it ideal for a web scraping API. By employing an async code structure we can build very fast programs with very little overhead in very little code. In this article, our full Yahoo stock data API with caching and webhook support will take less than 100 lines of code!
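To illustrate why async matters for scraping, here's a minimal sketch (not part of the API itself) where `fake_scrape` stands in for a slow HTTP request. Twenty concurrent tasks complete in roughly the time of one:

```python
import asyncio
from time import time

async def fake_scrape(symbol):
    # simulate a slow network request (0.1 seconds)
    await asyncio.sleep(0.1)
    return symbol

async def main():
    start = time()
    # run 20 "scrapes" concurrently - total time is ~0.1s, not ~2s sequentially
    results = await asyncio.gather(*[fake_scrape("AAPL") for _ in range(20)])
    return len(results), time() - start

count, elapsed = asyncio.run(main())
print(count, round(elapsed, 2))
```

This concurrency is exactly what lets one API process serve many simultaneous scrape requests.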

## Setup

In this tutorial, we'll be using [FastAPI](https://fastapi.tiangolo.com/) as our API server. To serve our API we'll be using [uvicorn](https://pypi.org/project/uvicorn/), and to make it easier to follow what's going on we'll use the rich logging package [loguru](https://pypi.org/project/loguru/).

For scraping, we'll be using [httpx](https://pypi.org/project/httpx/) as our HTTP client and [parsel](https://pypi.org/project/parsel/) as our HTML parser.

All of these tools are Python packages and can be installed using the `pip` console command:

```shell
$ pip install fastapi uvicorn httpx parsel loguru
```



With our tools ready let's take a look at FastAPI basics.

## FastAPI Quick Start

FastAPI has some amazing [documentation resources](https://fastapi.tiangolo.com/tutorial/) but for our scraper service, we only need the very basics.

To start we'll be working in a single python module `main.py`:

```python
# main.py
from fastapi import FastAPI
from loguru import logger as log

# create API app object
app = FastAPI()

# attach route to our API app
@app.get("/scrape/stock/{symbol}")
async def scrape_stock(symbol: str):
    return {
        "stock": symbol,
        "price": "placeholder", 
    }
```



Above, we define the first details of our data API with a single route: `/scrape/stock/<symbol>`. This route has a single parameter - symbol - which indicates what stock should be scraped. For example, Apple's stock symbol is `AAPL`.

We can run our API and see our placeholder results:

```shell
$ uvicorn main:app --reload
```



Now, if we go to `http://127.0.0.1:8000/scrape/stock/aapl` we should see our placeholder results as JSON:

```python
import httpx

print(httpx.get("http://127.0.0.1:8000/scrape/stock/aapl").json())
# {
#     "stock": "aapl",
#     "price": "placeholder",
# }
```



So, every time someone connects to this API endpoint our `scrape_stock` function will be called and the results will be returned to the client. Now, let's make it scrape stuff.

## Scraping Yahoo Finance

To scrape Yahoo Finance we'll generate the URL that leads to the stock data page, download the HTML data and parse it using a few clever XPath selectors.

For example, let's take a look at Apple's stock data page [yahoo.com/quote/AAPL/](https://finance.yahoo.com/quote/AAPL/) - we can see that the URL pattern is `finance.yahoo.com/quote/<symbol>`. So, as long as we know the company's stock exchange symbol we can scrape any stock's data.
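That URL pattern can be captured in a tiny helper (`stock_url` is our own illustrative name, not part of any library):

```python
def stock_url(symbol: str) -> str:
    """Build the Yahoo Finance quote page URL for a stock exchange symbol."""
    return f"https://finance.yahoo.com/quote/{symbol.upper()}"

print(stock_url("aapl"))  # https://finance.yahoo.com/quote/AAPL
```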



We'll scrape the price and summary details. Let's add the scraper logic to our API:

```python
import httpx
from time import time
from fastapi import FastAPI
from loguru import logger as log
from parsel import Selector

app = FastAPI()
stock_client = httpx.AsyncClient(timeout=httpx.Timeout(10.0))

async def scrape_yahoo_finance(symbol):
    """scrapes stock data from yahoo finance"""
    log.info(f"{symbol}: scraping data")
    response = await stock_client.get(
        f"https://finance.yahoo.com/quote/{symbol}?p={symbol}"
    )
    sel = Selector(response.text)
    parsed = {}

    # parse summary data tables:
    rows = sel.xpath('//div[re:test(@data-test,"(left|right)-summary-table")]//td[@data-test]')
    for row in rows:
        label = row.xpath("@data-test").get().split("-value")[0].lower()
        value = " ".join(row.xpath(".//text()").getall())
        parsed[label] = value
    # parse price:
    parsed["price"] = sel.css(
        f'fin-streamer[data-field="regularMarketPrice"][data-symbol="{symbol}"]::attr(value)'
    ).get()
    # add meta field for when this was scraped
    parsed["_scraped_on"] = time()
    return parsed

@app.get("/scrape/stock/{symbol}")
async def scrape_stock(symbol: str):
    symbol = symbol.upper()
    return await scrape_yahoo_finance(symbol)

# on API start - open up our scraper's http client connections
@app.on_event("startup")
async def app_startup():
    await stock_client.__aenter__()


# on API close - close our scraper's http client connections
@app.on_event("shutdown")
async def app_shutdown():
    await stock_client.__aexit__()

```



We've updated our API with a scraper function and attached it to our API endpoint. Let's take a look at the results now:

```shell
$ curl http://127.0.0.1:8000/scrape/stock/aapl
{
  "prev_close": "156.90",
  "open": "157.34",
  "bid": "0.00 x 1000",
  "ask": "0.00 x 1000",
  "days_range": "153.67 - 158.74",
  "fifty_two_wk_range": "129.04 - 182.94",
  "td_volume": "101,696,790",
  "average_volume_3month": "75,324,652",
  "market_cap": "2.47T",
  "beta_5y": "1.23",
  "pe_ratio": "25.41",
  "eps_ratio": "6.05",
  "earnings_date": "Oct 26, 2022  -  Oct 31, 2022",
  "dividend_and_yield": "0.92 (0.59%)",
  "ex_dividend_date": "Aug 05, 2022",
  "one_year_target_price": "182.01",
  "price": "153.72",
  "_scraped_on": 1663838493.6148243
}
```



We've turned our scraper into a data API in just a few lines of code, though we do have one glaring problem: API load.
What if multiple clients request the same stock within a short amount of time? We'd be wasting resources on redundant scrapes.
To fix this, let's take a look at how we can add simple caching.

## Caching Scrapers

There are many ways to cache API contents, from enabling caching in our web server to using dedicated caching tools like Redis or Memcached.

However, for small API projects we can easily handle this within Python ourselves with very few modifications:

```python
import asyncio
import httpx
from time import time
from fastapi import FastAPI
from loguru import logger as log
from parsel import Selector

app = FastAPI()
stock_client = httpx.AsyncClient(timeout=httpx.Timeout(10.0))
STOCK_CACHE = {}  # NEW: establish global cache storage
CACHE_TIME = 60  # NEW: how long to keep cached entries, in seconds

async def scrape_yahoo_finance(symbol):
    """scrapes stock data from yahoo finance"""
    # NEW: check cache before we commit to scraping
    cache = STOCK_CACHE.get(symbol)
    if cache and time() - CACHE_TIME < cache["_scraped_on"]:
        log.debug(f"{symbol}: returning cached item")
        return cache

    log.info(f"{symbol}: scraping data")
    response = await stock_client.get(
        f"https://finance.yahoo.com/quote/{symbol}?p={symbol}"
    )
    sel = Selector(response.text)
    parsed = {}

    rows = sel.xpath('//div[re:test(@data-test,"(left|right)-summary-table")]//td[@data-test]')
    for row in rows:
        label = row.xpath("@data-test").get().split("-value")[0].lower()
        value = " ".join(row.xpath(".//text()").getall())
        parsed[label] = value
    parsed["price"] = sel.css(
        f'fin-streamer[data-field="regularMarketPrice"][data-symbol="{symbol}"]::attr(value)'
    ).get()
    parsed["_scraped_on"] = time()
    # NEW: store successful results to cache
    STOCK_CACHE[symbol] = parsed
    return parsed

@app.get("/scrape/stock/{symbol}")
async def scrape_stock(symbol: str):
    symbol = symbol.upper()
    return await scrape_yahoo_finance(symbol)

@app.on_event("startup")
async def app_startup():
    await stock_client.__aenter__()
    # NEW: optionally we can clear expired cache every minute to prevent
    # memory build up. 
    async def clear_expired_cache(period=60.0):
        while True:
            global STOCK_CACHE
            log.debug(f"clearing expired cache")
            STOCK_CACHE = {
                k: v for k, v in STOCK_CACHE.items() if time() - CACHE_TIME < v["_scraped_on"]
            }
            await asyncio.sleep(period)
    clear_cache_task = asyncio.create_task(clear_expired_cache())


@app.on_event("shutdown")
async def app_shutdown():
    await stock_client.__aexit__()
```



Above, we've updated our scrape function to first check the global cache before scraping the live target. For this, we use a simple Python dictionary that stores scraped data keyed by stock symbol.



Instead of scraping on demand, the API can serve results that were generated recently. Python dictionaries are extremely fast and efficient, so we can cache thousands or even millions of results in the memory of our API.
We also added a repeating asyncio task, `clear_expired_cache()`, which deletes expired cache values every minute. We did this using the `asyncio.create_task()` function, which turns any asynchronous function (coroutine) into a background task object.
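The background-task mechanics can be seen in isolation in this small sketch (the `tick` coroutine is purely illustrative): `create_task()` schedules the loop without blocking, and `cancel()` stops it, much like on app shutdown:

```python
import asyncio

async def tick(period, ticks):
    # repeating background job: record a tick, then sleep until the next cycle
    while True:
        ticks.append("tick")
        await asyncio.sleep(period)

async def main():
    ticks = []
    # create_task() schedules the coroutine to run in the background
    # without blocking the current coroutine
    task = asyncio.create_task(tick(0.05, ticks))
    await asyncio.sleep(0.18)  # meanwhile, the main coroutine keeps working
    task.cancel()  # stop the background loop
    return ticks

ticks = asyncio.run(main())
print(len(ticks))
```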

Now, if our API gets requests for the same stock data multiple times it'll only scrape the data once and serve cache to everyone else:

```python
import asyncio
import httpx
from time import time


async def many_concurrent_api_calls(n):
    _start = time()
    async with httpx.AsyncClient(timeout=httpx.Timeout(10.0)) as client:
        _start_one = time()
        await client.get("http://127.0.0.1:8000/scrape/stock/aapl")
        print(f"completed first API scrape in {time() - _start_one:.2f} seconds")
        results = await asyncio.gather(*[
            client.get("http://127.0.0.1:8000/scrape/stock/aapl")
            for i in range(n)
        ])
    print(f"completed {n} API scrapes in {time() - _start:.2f} seconds")

# run 1000 API calls
asyncio.run(many_concurrent_api_calls(1_000))
# will print something like:
# completed first API scrape in 1.23 seconds
# completed 1000 API scrapes in 2.59 seconds
```



Caching techniques will vary by application but for long and expensive tasks like web scraping, we can see great benefits even from simple designs like this one.
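The expiry rule itself boils down to a single comparison; here it is extracted into a standalone helper (`is_fresh` is our own name for illustration, mirroring the check inside `scrape_yahoo_finance`):

```python
from time import time

def is_fresh(entry: dict, ttl: float = 60.0) -> bool:
    """Return True if the cache entry was scraped within the last `ttl` seconds."""
    return time() - ttl < entry["_scraped_on"]

print(is_fresh({"price": "153.72", "_scraped_on": time()}))        # True
print(is_fresh({"price": "153.72", "_scraped_on": time() - 120}))  # False
```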

For more advanced caching in FastAPI see [fastapi-cache](https://github.com/long2ice/fastapi-cache) extension.

## Webhooks for Long Scrapes

Some scrape tasks can take many seconds or even minutes to complete, which would time out or block the connecting client. One way to handle this is to provide webhook support.

In other words, a webhook is essentially a promise that the API will complete the task and call back with the results later.



Instead of returning data directly, the API sends it to the webhook. We can extend our API to include this feature by replacing our `scrape_stock` route:

```python
...
from typing import Optional

async def with_webhook(cor, webhook, retries=3):
    """execute coroutine and send its result to a webhook"""
    result = await cor
    async with httpx.AsyncClient(
        headers={"User-Agent": "scraper webhook"},
        timeout=httpx.Timeout(timeout=15.0),
    ) as client:
        for i in range(retries):
            try:
                response = await client.post(webhook, json=result)
                return
            except Exception as e:
                log.exception(f"Failed to send a webhook {i}/{retries}")
            await asyncio.sleep(5)  # wait between retries
        log.error(f"Failed to reach webhook in {retries} retries")



@app.get("/scrape/stock/{symbol}")
async def scrape_stock(symbol: str, webhook: Optional[str] = None):
    symbol = symbol.upper()
    scrape_cor = scrape_yahoo_finance(symbol)
    if webhook:
        # run scrape coroutine in the background
        task = asyncio.create_task(with_webhook(scrape_cor, webhook))
        return {"success": True, "webhook": webhook}
    else:
        return await scrape_cor
```



In the update above, if the `?webhook` parameter is supplied to our API it'll instantly return `{"success": True, "webhook": webhook}` and schedule a background task that'll scrape the results and post them to the client's webhook address.

To test this out we can use a webhook testing service such as [webhook.site](https://webhook.site/):

```python
from urllib.parse import urlencode
import httpx

# note the "?" before the encoded query string
url = "http://127.0.0.1:8000/scrape/stock/aapl?" + urlencode({
    "webhook": "https://webhook.site/da0a5cbc-0a62-4bd5-90f4-c8af73e5df5c"  # change this to yours
})
response = httpx.get(url)
print(response.json())
# {'success': True, 'webhook': 'https://webhook.site/da0a5cbc-0a62-4bd5-90f4-c8af73e5df5c'}
```



When running this, we'll see the API return instantly, and if we check our webhook.site session we'll see the data delivered there:



We can see the incoming webhook with the scraped data. Webhooks are a great way not only to deal with long scraping tasks but also to ensure that our API will scale in the future, as we can simply start more processes to handle webhook tasks if needed.

## Final Scraper Code

To wrap this article up, let's take a look at the entire code of our scraper API:

```python
import asyncio
from time import time
from typing import Optional

import httpx
from fastapi import FastAPI
from loguru import logger as log
from parsel import Selector

app = FastAPI()

stock_client = httpx.AsyncClient(timeout=httpx.Timeout(10.0))
STOCK_CACHE = {}
CACHE_TIME = 10


async def scrape_yahoo_finance(symbol):
    """scrapes stock data from yahoo finance"""
    cache = STOCK_CACHE.get(symbol)
    if cache and time() - CACHE_TIME < cache["_scraped_on"]:
        log.debug(f"{symbol}: returning cached item")
        return cache

    log.info(f"{symbol}: scraping data")
    response = await stock_client.get(
        f"https://finance.yahoo.com/quote/{symbol}?p={symbol}"
    )
    sel = Selector(response.text)
    parsed = {}
    for row in sel.xpath(
        '//div[re:test(@data-test,"(left|right)-summary-table")]//td[@data-test]'
    ):
        label = row.xpath("@data-test").get().split("-value")[0].lower()
        value = " ".join(row.xpath(".//text()").getall())
        parsed[label] = value
    parsed["price"] = sel.css(
        f'fin-streamer[data-field="regularMarketPrice"][data-symbol="{symbol}"]::attr(value)'
    ).get()
    parsed["_scraped_on"] = time()
    STOCK_CACHE[symbol] = parsed
    return parsed


async def with_webhook(cor, webhook, retries=3):
    result = await cor
    async with httpx.AsyncClient(
        headers={"User-Agent": "scraper webhook"},
        timeout=httpx.Timeout(timeout=15.0),
    ) as client:
        for i in range(retries):
            try:
                response = await client.post(webhook, json=result)
                return
            except Exception as e:
                log.exception(f"Failed to send a webhook {i}/{retries}")
            await asyncio.sleep(5)  # wait between retries


@app.get("/scrape/stock/{symbol}")
async def scrape_stock(symbol: str, webhook: Optional[str] = None):
    symbol = symbol.upper()
    scrape_cor = scrape_yahoo_finance(symbol)
    if webhook:
        # run scrape coroutine in the background
        task = asyncio.create_task(with_webhook(scrape_cor, webhook))
        return {"success": True, "webhook": webhook}
    else:
        return await scrape_cor


# we can clear cache every once in a while to prevent memory build up
async def clear_expired_cache(period=5.0):
    while True:
        global STOCK_CACHE
        _initial_len = len(STOCK_CACHE)
        log.debug(f"clearing expired cache, current len {_initial_len}")
        STOCK_CACHE = {
            k: v
            for k, v in STOCK_CACHE.items()
            if time() - CACHE_TIME < v["_scraped_on"]
        }
        log.debug(f"cleared {_initial_len - len(STOCK_CACHE)}")
        await asyncio.sleep(period)


@app.on_event("startup")
async def app_startup():
    await stock_client.__aenter__()
    clear_cache_task = asyncio.create_task(clear_expired_cache())


@app.on_event("shutdown")
async def app_shutdown():
    await stock_client.__aexit__()
```



## ScrapFly

Web scraper logic can quickly outgrow our API. If we're scraping difficult targets that ban scrapers and require complex connections and proxy usage, soon our data API will be doing more scraping than serving. We can abstract all of these complexities away with ScrapFly!



For example, we could replace our scraper code to use ScrapFly through [python sdk](https://scrapfly.io/docs/sdk/python):

```python
...

from scrapfly import ScrapflyClient, ScrapeConfig
stock_client = ScrapflyClient(key="YOUR SCRAPFLY KEY")

async def scrape_yahoo_finance(symbol):
    """scrapes stock data from yahoo finance"""
    cache = STOCK_CACHE.get(symbol)
    if cache and time() - CACHE_TIME < cache["_scraped_on"]:
        log.debug(f"{symbol}: returning cached item")
        return cache

    log.info(f"{symbol}: scraping data")
    result = await stock_client.async_scrape(ScrapeConfig(
        url=f"https://finance.yahoo.com/quote/{symbol}?p={symbol}",
        # we can select proxy country
        country="US",
        # or proxy type
        proxy_pool="public_residential_pool",
        # enable anti scraping protection bypass
        asp=True,
        # or real headless browser javascript rendering
        render_js=True,
        # we can enable caching
        cache=True,
    ))
    parsed = {}
    for row in result.selector.xpath(
        '//div[re:test(@data-test,"(left|right)-summary-table")]//td[@data-test]'
    ):
        label = row.xpath("@data-test").get().split("-value")[0].lower()
        value = " ".join(row.xpath(".//text()").getall())
        parsed[label] = value
    parsed["price"] = result.selector.css(
        f'fin-streamer[data-field="regularMarketPrice"][data-symbol="{symbol}"]::attr(value)'
    ).get()
    parsed["_scraped_on"] = time()
    STOCK_CACHE[symbol] = parsed
    return parsed

...
```



Now our data API can scrape content in real-time without being blocked or throttled.

ScrapFly also offers built-in [cache](https://scrapfly.io/docs/scrape-api/cache) and [webhook](https://scrapfly.io/docs/scrape-api/webhook) functionalities - just like we've covered in this tutorial which makes it an easy integration to real time scraping APIs.

## Summary

In this practical tutorial, we've taken a quick dive into data APIs that scrape data in real time. We covered how to set up FastAPI, wrote a simple Yahoo Finance stock data scraper, and wrapped everything together into a single, cohesive data API with caching and webhook support.

We've done all that in just a few lines of actual code by taking advantage of the existing, powerful web API ecosystem in Python!



 
