# How to Build a Web Scraping Agent with Gemini

 by [Hisham Medhat](https://scrapfly.io/blog/author/hisham) Jun 16, 2026 21 min read [\#ai](https://scrapfly.io/blog/tag/ai) [\#api](https://scrapfly.io/blog/tag/api) [\#python](https://scrapfly.io/blog/tag/python) [\#scrapeguide](https://scrapfly.io/blog/tag/scrapeguide) 


Gemini can read a page and return clean structured data, but it cannot reliably fetch one. Point the URL Context tool at a React app or a page behind Cloudflare, and you get a 403 or an empty JSON shell.

The fix is short. Install the Scrapfly CLI and drop a SKILL.md file in a Gemini skill folder. Gemini then calls it where URL Context fails. We cover the CLI, API, and MCP surfaces, a Python pipeline, and pagination. Let's get started.

[Guide to Understanding and Developing LLM Agents](https://scrapfly.io/blog/posts/practical-guide-to-llm-agents): Explore how LLM agents transform AI, from text generators into dynamic decision-makers with tools like LangChain for automation, analysis & more!



## Key Takeaways

- **Gemini extracts, but it cannot reliably fetch.** URL Context caps at 20 URLs per request and 34MB per page.
- **Pick the surface by job.** CLI to prototype, API for backends, MCP for multi-tool.
- **A one-file skill is the fastest setup.** Drop the Scrapfly SKILL.md in a skills folder.
- **One command runs an autonomous agent.** `scrapfly agent --provider gemini` returns JSON.
- **The Python pipeline is short.** Shell out to `scrapfly`, then extract via `response_schema`.
- **Scrapfly is the fetch layer Gemini lacks.** Its Web Scraping API clears 90+ bot systems.








## Can Gemini Do Web Scraping?

Gemini helps with web scraping when you use it for reasoning and structured extraction. It still needs a fetch layer for JavaScript rendering, anti-bot challenges, and proxy routing.

The model interprets page structure, infers fields, and returns typed JSON when you set `response_mime_type="application/json"` with a `response_schema`. You describe the data you want, and Gemini produces it from markdown or HTML you've already retrieved.

A production pipeline splits the work into two stages. Scrapfly handles access, rendering, anti-bot protection, and proxy routing, then returns clean markdown. Gemini reads that markdown and turns it into structured JSON.

The Anti-Scraping Protection (ASP) flag opens the targets Gemini's native fetcher fails on. One parameter switches on bypass for Cloudflare, DataDome, Akamai, PerimeterX, and Kasada.

Developers reach for this pattern for product listings, competitor research, news monitoring, documentation indexing, and turning unstructured pages into typed datasets. What Gemini brings to the extract side:

- HTML or markdown to JSON extraction with natural-language schemas via `response_schema`
- Multi-step agentic browsing in Gemini CLI, where a 1M-token context window keeps long sessions cheap on the free tier
- Multimodal input, so you can feed a screenshot and the rendered DOM together when text alone misses visual cues

Compared with ChatGPT or Claude, Gemini's free tier and long context make it a cheap prototyping choice. The right Gemini surface, however, depends on your stack.



## Gemini CLI vs Gemini API vs AI Studio for Web Scraping

Use the Gemini CLI for interactive agent work, and the Gemini API plus URL Context for programmatic Python pipelines on simple pages. Google AI Studio handles no-code experiments.

| Gemini interface | Best for | Setup complexity |
|---|---|---|
| Gemini CLI | Terminal agent workflows, ad-hoc scraping, skill-based pipelines | Low |
| Gemini API + URL Context | Backend Python services and scheduled jobs on simple pages | Medium |
| Google AI Studio | Non-developer use; out of scope for this guide | None |

### Gemini CLI (terminal agent, recommended path)

[Gemini CLI](https://github.com/google-gemini/gemini-cli) is Google's open-source terminal agent for shell tasks and tool use. It loads skills by semantic match. Drop a `SKILL.md` file in a discovery folder with a clear `description`, and Gemini routes matching prompts to it.

The fastest scraping path is installing the Scrapfly CLI and writing a one-file skill so Gemini knows when to call it. Install Gemini CLI with `npm install -g @google/gemini-cli`, then run `gemini` and pick OAuth on first launch for the free tier.

We'll set up the skill in the next section.

### Gemini API + URL Context tool (programmatic, for Python pipelines)

For backend services and scheduled jobs, call the Gemini API through the `google-genai` SDK. The URL Context tool is Gemini's built-in fetcher: pass URLs as a tool declaration and the model retrieves and analyzes them server-side.

URL Context works on plain documentation and undefended pages. The next section covers when it doesn't, and the structured-JSON extraction section below covers the Python pipeline path that wraps the Scrapfly CLI.

### Google AI Studio (no-code, out of scope)

[Google AI Studio](https://aistudio.google.com/) is the browser playground where you paste a URL into a chat and ask for structured data. The use case fits researchers and one-off lookups. For developer scraping, the rest of this guide assumes a CLI or SDK surface.



## Why Does Gemini's URL Context Tool Fail on Real Sites?

Gemini's URL Context tool retrieves pages through Google's own infrastructure. It only works on publicly accessible URLs and has hard request and page-size caps. It does not handle JavaScript rendering or anti-bot protection well.

The tool can also return well-formed JSON over partial content with no error. Gemini CLI's built-in `web_fetch` has the same class of fetch limitations. On JavaScript-heavy pages, you're rolling the dice on whether the content you need is in the HTML at all.

In code, the Gemini API accepts a `url_context` tool. The model fetches the page, and your prompt analyzes whatever comes back:

```python
from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="List the product names and prices on https://www.nike.com/w/mens-shoes",
    config=types.GenerateContentConfig(
        tools=[{"url_context": {}}],
    ),
)
print(response.text)
```



On a public docs page, this works. On a protected retailer, you get one-line summaries, refusals that mask a 403, or data the model guessed from the URL slug. Four failure modes stand out:

- **Hard documented limits.** URL Context caps at 20 URLs per request and 34MB per page per the [URL Context docs](https://ai.google.dev/gemini-api/docs/url-context). Past that, the tool truncates or refuses. Symptom: silent content drop on long pages, hard rejection on batch fetches.
- **Single-page apps return inconsistent content.** If your data loads after a `useEffect`, sometimes the rendered DOM appears in the response, sometimes only the HTML shell. Same request, different result, no diagnostic either way.
- **Protected sites distort the fetch.** Google's docs require public accessibility and exclude paywalled content. In practice, Cloudflare is the recognizable example, but DataDome, Akamai, PerimeterX, Kasada, and similar systems can also return a challenge page instead of usable content.
- **Throughput follows model quota.** URL Context counts fetched page content as input tokens, so a long product listing burns the same budget as feeding the page in directly. It is fine for light grounding, not for high-volume crawling.
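
The 20-URL cap is the one limit in this list you can enforce on the client before sending anything. A minimal sketch, assuming you assemble the URL list yourself; `chunk_urls` is a hypothetical helper and not part of `google-genai`, and the `example.com` URLs are placeholders:

```python
from typing import Iterator

URL_CONTEXT_MAX_URLS = 20  # per-request cap quoted from the URL Context docs


def chunk_urls(urls: list[str], size: int = URL_CONTEXT_MAX_URLS) -> Iterator[list[str]]:
    """Yield successive batches so no single request exceeds the URL cap."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


if __name__ == "__main__":
    # Placeholder URLs; swap in the pages you actually want analyzed.
    pages = [f"https://example.com/docs/page-{i}" for i in range(1, 46)]
    for batch in chunk_urls(pages):
        print(len(batch), batch[0])
```

Batching only avoids the hard rejection. It does nothing for rendering, challenge pages, or token cost.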

Better prompts can't recover content that never reached the model. The next section wires in a production fetch layer that clears protected pages before Gemini sees them.



## How to Set Up Scrapfly as a Gemini CLI Skill

Install the Scrapfly CLI, drop a `SKILL.md` file in a Gemini skill folder, and prompt naturally. Gemini matches your request against the skill's `description`, calls the CLI, and reads the JSON envelope back. The whole setup takes about two minutes.

### Install the Scrapfly CLI in one command

The [Scrapfly CLI](https://scrapfly.io/products/scrapfly-cli) is a single static binary. On macOS or Linux, the install script downloads the right release for your platform:

```bash
curl -fsSL https://scrapfly.io/scrapfly-cli/install | sh
scrapfly config set-key scp-live-...
```



On Windows, grab `scrapfly-windows-amd64.zip` from the [Releases page](https://github.com/scrapfly/scrapfly-cli/releases) and add the unzipped folder to your `PATH`. An `npm install -D scrapfly-cli` package and platform-specific notes live in the [Scrapfly CLI docs](https://scrapfly.io/docs/cli).

You can also export `SCRAPFLY_API_KEY` instead of persisting it. Get a free key with API credits at [scrapfly.io/register](https://scrapfly.io/register).

Every Scrapfly CLI command returns the same JSON envelope by default. Here is a real scrape against the public test site:

```bash
scrapfly scrape https://web-scraping.dev/products --format markdown
```



```json
{
  "success": true,
  "product": "scrape",
  "data": {
    "config": { "url": "https://web-scraping.dev/products", "render_js": false, "asp": false, ... },
    "context": { "cost": { "total": 1 }, "cache": { "state": "MISS" }, "proxy": { "country": "us", "network": "datacenter" }, ... },
    "result": { "status_code": 200, "duration": 2.76, "content_type": "application/markdown", "content": "web-scraping.dev product page 1 ...", ... },
    "uuid": "01KS4Z592P0VMJZY8WTC6YTTHX"
  }
}
```



The envelope shape is `{success, product, data | error}` on every command. Flags like `--content-only`, `--data-only`, and `--proxified` strip the envelope for piping into another tool, and `--pretty` swaps the JSON for a human summary.
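
If you keep the envelope instead of stripping it, a small wrapper can check `success` before touching the body. This is a sketch under assumptions: the field path `data.result.content` is taken from the sample output in this guide, the layout of the `error` branch is not shown here so it is passed through unparsed, and `unwrap_content` and `ScrapflyError` are illustrative names.

```python
import json


class ScrapflyError(RuntimeError):
    """Raised when the envelope reports success=false."""


def unwrap_content(raw: str) -> str:
    """Return data.result.content from a Scrapfly CLI JSON envelope."""
    envelope = json.loads(raw)
    if not envelope.get("success"):
        # The inner structure of "error" is an unknown here, so surface it as-is.
        raise ScrapflyError(f"{envelope.get('product')}: {envelope.get('error')}")
    return envelope["data"]["result"]["content"]


if __name__ == "__main__":
    sample = json.dumps({
        "success": True,
        "product": "scrape",
        "data": {"result": {"status_code": 200, "content": "# Products"}},
    })
    print(unwrap_content(sample))
```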

### Scaffold the skill for Gemini CLI

Gemini CLI loads skills from four tiers. These are built-in, extension, user (`~/.gemini/skills/` or `~/.agents/skills/`), and workspace (`.gemini/skills/` or `.agents/skills/`). The `.agents/skills/` alias takes precedence within each tier.

Drop the official Scrapfly `scrapfly-cli/SKILL.md` into one of those folders and Gemini semantic-matches it like any other skill:

```bash
mkdir -p ~/.agents/skills/scrapfly
curl -fsSL https://raw.githubusercontent.com/scrapfly/skills/main/scrapfly-cli/SKILL.md \
  -o ~/.agents/skills/scrapfly/SKILL.md
```



The skill body covers the six usage paths, authentication, the JSON envelope contract, and every CLI verb grouped by intent. A minimal skill file looks like this:

```markdown
---
name: scrapfly
description: |
  Scrape a URL, take a screenshot, extract structured data, or run an autonomous
  scraping agent using the Scrapfly CLI. Use whenever the user asks to fetch,
  parse, screenshot, crawl, or extract data from any public web page.
---

# Scrapfly skill body

Reads the official scrapfly-cli SKILL.md so Gemini knows every CLI verb
and when to use it. Replace this body with the file linked above.
```
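
Because routing depends on the `description`, it can help to check the front matter before dropping a skill into place. The sketch below is a plain-text check, not a YAML parser, and `check_skill` is a hypothetical helper. It assumes a skill needs a `---` fenced block with top-level `name` and `description` keys, as in the example above.

```python
def check_skill(text: str) -> list[str]:
    """Return a list of problems found in a SKILL.md front matter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["file must start with a '---' front matter fence"]
    # Find the closing fence, skipping the opening one on line 0.
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        return ["front matter is never closed with '---'"]
    # Only unindented "key: value" lines count; indented lines are
    # continuation text of a block scalar such as "description: |".
    keys = {
        line.split(":", 1)[0].strip()
        for line in lines[1:end]
        if ":" in line and not line.startswith((" ", "\t"))
    }
    return [f"missing required key: {key}" for key in ("name", "description") if key not in keys]


if __name__ == "__main__":
    sample = "---\nname: scrapfly\ndescription: |\n  Scrape a URL.\n---\n\n# body\n"
    print(check_skill(sample) or "front matter looks fine")
```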



For full reference, see the [Scrapfly CLI repository](https://github.com/scrapfly/scrapfly-cli) and the raw [Scrapfly CLI SKILL.md](https://raw.githubusercontent.com/scrapfly/skills/main/scrapfly-cli/SKILL.md).

### Trigger the skill from a natural-language prompt

Open Gemini CLI and prompt in plain English. A request like *"scrape web-scraping.dev/products and return the top 5 names and prices as JSON"* triggers the skill.

Gemini matches the skill's `description` and calls the Scrapfly CLI with `scrapfly scrape ... --format markdown --content-only`. Because `--content-only` strips the JSON envelope, Gemini reads the returned markdown directly and turns it into structured data. No tool wiring code, no per-call config, no MCP transport.

With the skill installed, the next four sections cover what it opens up: autonomous one-liners, protected targets, Python pipelines, and paginated flows.



## How to Run an Autonomous Gemini Scraping Agent with the Scrapfly CLI

The Scrapfly CLI ships with an autonomous browser-agent mode. Pass `--provider gemini` and a task, and Gemini plans the scrape.

The agent drives a real Scrapfly cloud browser through tools like `open`, `snapshot`, `click`, `type`, `scroll`, and `done`, then returns the answer as JSON. You don't need a skill installed to try it.

Write a JSON Schema that pins the answer's shape. Save it as `schema.json` next to where you'll run the command:

```json
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "price": {"type": "string"}
  },
  "required": ["name", "price"]
}
```



Now run the agent with `--schema-file` pointing at the file:

```bash
export GEMINI_API_KEY=AIza...
scrapfly agent "Name and price of the first product on the page" \
  --url https://web-scraping.dev/products \
  --provider gemini \
  --model gemini-2.5-flash \
  --country us \
  --schema-file schema.json \
  --max-steps 10
```



A real run against the test site returns:

```json
{
  "success": true,
  "product": "agent",
  "data": {
    "answer": {
      "name": "Box of Chocolate Candy",
      "price": "$24.99"
    },
    "provider": "gemini",
    "model": "gemini-2.5-flash",
    "steps": 3,
    "usage": { "input_tokens": 18147, "output_tokens": 58 },
    "stop_reason": "done"
  }
}
```
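
When a script consumes this output, it is worth confirming the run finished and that the answer carries every field the schema requires. A minimal sketch, assuming the field names in the sample above and treating any `stop_reason` other than `"done"` as incomplete (an assumption, since other values are not listed here). `read_agent_answer` is a hypothetical helper, and it only checks `required` keys rather than doing full JSON Schema validation.

```python
import json


def read_agent_answer(raw: str, schema: dict) -> dict:
    """Return data.answer if the run finished and has every required field."""
    envelope = json.loads(raw)
    if not envelope.get("success"):
        raise RuntimeError("agent run reported success=false")
    data = envelope["data"]
    if data.get("stop_reason") != "done":
        raise RuntimeError(f"agent stopped early: {data.get('stop_reason')}")
    answer = data["answer"]
    # Lightweight check: only the schema's "required" list is enforced.
    missing = [key for key in schema.get("required", []) if key not in answer]
    if missing:
        raise ValueError(f"answer is missing fields: {missing}")
    return answer


if __name__ == "__main__":
    schema = {"type": "object", "required": ["name", "price"]}
    raw = json.dumps({
        "success": True,
        "product": "agent",
        "data": {
            "answer": {"name": "Box of Chocolate Candy", "price": "$24.99"},
            "steps": 3,
            "stop_reason": "done",
        },
    })
    print(read_agent_answer(raw, schema))
```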



Provider auto-detects from the environment in order: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GEMINI_API_KEY`, then `OLLAMA_HOST`. Override with `--provider gemini` to pin Gemini, and pass a model with `--model gemini-2.5-flash`.
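
The detection order matters if more than one key is exported in your shell. The sketch below mirrors the order described above so you can predict the outcome before a run. It is not the CLI's own code: `predict_provider` is a hypothetical helper, the provider labels other than `gemini` are assumed names, and treating an empty variable as unset is also an assumption.

```python
import os
from typing import Mapping, Optional

# Order taken from the paragraph above. Labels other than "gemini" are
# illustrative and not confirmed --provider values.
DETECTION_ORDER = [
    ("ANTHROPIC_API_KEY", "anthropic"),
    ("OPENAI_API_KEY", "openai"),
    ("GEMINI_API_KEY", "gemini"),
    ("OLLAMA_HOST", "ollama"),
]


def predict_provider(env: Mapping[str, str]) -> Optional[str]:
    """Return the first provider whose environment variable is set."""
    for variable, provider in DETECTION_ORDER:
        if env.get(variable):
            return provider
    return None


if __name__ == "__main__":
    print(predict_provider(os.environ))
```

With both `OPENAI_API_KEY` and `GEMINI_API_KEY` set, this returns `openai`, which is the case where pinning `--provider gemini` is needed.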

The CLI feeds the model an AXTree-style accessibility snapshot rather than raw HTML, which keeps the token budget tight across multi-step planning.

Each step is a model call, so a 3-step agent run on the free tier eats into the daily request quota.

For repeatable workflows, prefer the skill recipe from the previous section. It calls the lighter `scrapfly scrape` path and lets you control schema and prompt outside the agent loop.



## How to Scrape Anti-Bot Protected Sites with Gemini and Scrapfly ASP

Protected pages return challenge HTML, empty bodies, or 403s to Gemini's native fetcher, so the model never sees real content. This is the main reason to pair Gemini with Scrapfly.

The `--asp` flag routes the fetch through Anti-Scraping Protection and returns clean markdown that Gemini can analyze.

Scrapfly publishes a 99.99% success rate for its [Web Scraping API](https://scrapfly.io/web-scraping-api) across 90+ anti-bot systems. The proxy pool spans 130M+ IPs in 120+ countries.

The Gemini CLI skill calls this command on protected targets:

```bash
scrapfly scrape https://web-scraping.dev/products \
  --render-js --asp --country us \
  --format markdown --content-only
```



```text
DONE 200 bytes=3956 format=text cost=6 took=9.481s
```



The summary line is the `--pretty` view of the same request. It shows a 200 OK, a 3,956-byte body, a cost of 6 credits, and a 9.5-second total including JavaScript render. The body contains product listings exactly as the page displays them.
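
From Python, the same command can be assembled as an argument list, with the heavier flags added only for protected targets. This is a sketch that uses only the flags shown in this guide. `build_scrape_cmd` and its `protected` switch are hypothetical, and when to turn it on is your call.

```python
from typing import Optional


def build_scrape_cmd(
    url: str,
    *,
    protected: bool = False,
    country: Optional[str] = None,
    fmt: str = "markdown",
) -> list[str]:
    """Build a `scrapfly scrape` argument list for subprocess.run."""
    cmd = ["scrapfly", "scrape", url]
    if protected:
        # Browser rendering plus Anti-Scraping Protection for defended pages.
        cmd += ["--render-js", "--asp"]
    if country:
        cmd += ["--country", country]
    # --content-only strips the JSON envelope so stdout is just the body.
    cmd += ["--format", fmt, "--content-only"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_scrape_cmd(
        "https://web-scraping.dev/products", protected=True, country="us",
    )))
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.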

ASP combines Curlium (byte-accurate Chrome HTTP and TLS) and Scrapium (an anti-detect Chromium with 30,000+ spoofed signals). It also handles residential proxies, fingerprint coherence, and server-side challenge solving.

Cloudflare is the recognizable example, but the same flag covers a long list of vendors:

- [Cloudflare bypass](https://scrapfly.io/blog/posts/how-to-bypass-cloudflare-anti-scraping) at a public 98% success rate
- [DataDome](https://scrapfly.io/blog/posts/how-to-bypass-datadome-anti-scraping) at 96%
- [Akamai](https://scrapfly.io/blog/posts/how-to-bypass-akamai-anti-scraping) at 97%
- [PerimeterX / HUMAN](https://scrapfly.io/blog/posts/how-to-bypass-perimeterx-human-anti-scraping) at 95%
- [Kasada](https://scrapfly.io/blog/posts/how-to-bypass-kasada-anti-scraping-waf) at around 94%

A broader walkthrough lives in our [anti-bot protection guide](https://scrapfly.io/blog/posts/how-to-bypass-anti-bot-protection). The CLI plus `--asp` is what gets Gemini the page. The next section moves to the Python pipeline path where the same fetch runs from a backend service.

[LangChain Web Scraping: Build AI Agents & RAG Applications](https://scrapfly.io/blog/posts/langchain-web-scraping-complete-guide-scrapfly): Learn to integrate LangChain with Scrapfly for web scraping. Build AI agents and RAG applications that extract, process, and understand web data at scale.



## How to Extract Structured JSON from a Page with Gemini and the Gemini API

For Python backends and scheduled jobs, call the Gemini API with `google-genai` and shell out to the Scrapfly CLI for the fetch. Pass the returned markdown to `generate_content` with a `response_schema` for typed extraction.

The pipeline is short, runs end-to-end against any site Scrapfly can reach, and produces validated objects on every call.

```python
import json
import os
import subprocess
from typing import List

from google import genai
from google.genai import types
from pydantic import BaseModel


class Product(BaseModel):
    name: str
    price: str
    description: str


class ProductList(BaseModel):
    products: List[Product]


def fetch_markdown(url: str) -> str:
    result = subprocess.run(
        [
            "scrapfly", "scrape", url,
            "--render-js", "--asp", "--country", "us",
            "--format", "markdown", "--content-only",
        ],
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout


def extract_products(markdown: str) -> ProductList:
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=(
            "Extract every product on this page with name, price, and description.\n\n"
            f"{markdown}"
        ),
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=ProductList,
        ),
    )
    return ProductList.model_validate_json(response.text)


if __name__ == "__main__":
    md = fetch_markdown("https://web-scraping.dev/products")
    print(json.dumps(extract_products(md).model_dump(), indent=2))
```



A real run returns clean JSON in one pass:

```json
{
  "products": [
    {"name": "Box of Chocolate Candy", "price": "24.99", "description": "Indulge your sweet tooth..."},
    {"name": "Dark Red Energy Potion", "price": "4.99", "description": "An energy drink as intense..."},
    {"name": "Teal Energy Potion", "price": "4.99", "description": "Experience a surge of vitality..."},
    {"name": "Red Energy Potion", "price": "4.99", "description": "An extraordinary energy drink..."},
    {"name": "Blue Energy Potion", "price": "4.99", "description": "Ignite your gaming sessions..."}
  ]
}
```



`--content-only` strips the JSON envelope so the markdown pipes cleanly into the next call.

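When you set `response_mime_type="application/json"`, Gemini constrains its output to the Pydantic `response_schema`. `ProductList.model_validate_json` then parses that output into typed objects and raises if a field is missing, so no hand-written post-processing is needed.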
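
The schema types `price` as `str`, so the exact format is whatever the model emits: the agent run earlier returned `"$24.99"` while this pipeline returned `"24.99"`. If you need numbers downstream, a small normalizer covers both. This is a sketch assuming US-style prices with an optional leading `$` and comma thousands separators; `parse_price` is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Turn '24.99' or '$24.99' into a Decimal; return None if unparseable."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


if __name__ == "__main__":
    for raw in ("$24.99", "4.99", "1,299.00", "n/a"):
        print(repr(raw), "->", parse_price(raw))
```

`Decimal` is used instead of `float` so prices keep their exact cents when summed or compared.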

This pipeline uses gemini-2.5-flash-lite for cheap, high-volume extraction; the rest of the guide uses gemini-2.5-flash. Reserve Pro for reasoning-heavy passes.

For pages where text misses visual cues, capture a [screenshot with Scrapfly's Screenshot API](https://scrapfly.io/screenshot-api) using `scrapfly screenshot <url> --asp`. Feed the PNG to Gemini alongside the markdown; Gemini 2.5 Flash is multimodal by default.

The persistent-session section below covers paginated flows where one URL isn't enough. For broader extraction patterns, see the [Scrapfly extraction docs](https://scrapfly.io/docs/extraction-api/getting-started).



## How to Scrape Paginated or Filtered Sites with a Persistent Browser Session

Some scraping jobs need multi-step browser flows: submitting filters, paginating through results, or waiting for client-side state to update. The Scrapfly CLI's `--session` flag keeps cookies, fingerprints, and connection state across calls.

The agent can then walk a public flow as one coherent visit. This section covers public multi-step flows; logged-in scraping is a terms-of-service question for the target site and out of scope here.

The example below scrapes the first two pages of an apparel category, sharing one session across both calls:

```bash
scrapfly scrape "https://web-scraping.dev/products?category=apparel" \
  --render-js --session demo-session-1 --country us \
  --format markdown --content-only --pretty

scrapfly scrape "https://web-scraping.dev/products?category=apparel&page=2" \
  --render-js --session demo-session-1 --country us \
  --format markdown --content-only --pretty
```



Both calls return the rendered listing with consistent cookies and a stable proxy IP. Page 1 contains Hiking Boots, High Heel Sandals, Running Shoes, and Light-Up Sneakers; page 2 picks up at the Cat-Ear Beanie and continues.

A typical Python loop drives the pages and hands each body to Gemini:

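```python
# fetch_markdown_session is assumed to be fetch_markdown from the earlier
# pipeline with "--session", <name> added to the scrapfly command.
def scrape_pages(category: str, pages: int) -> list[str]:
    """Fetch each results page through one shared Scrapfly session."""
    bodies = []
    for page in range(1, pages + 1):
        url = f"https://web-scraping.dev/products?category={category}&page={page}"
        bodies.append(fetch_markdown_session(url, session="apparel-search"))
    return bodies
```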
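
The loop calls a `fetch_markdown_session` helper that is not defined in this guide. One way to write it, as a sketch: reuse the flags from the shell example and drop `--pretty`, on the assumption that stdout is then the markdown body, as with `fetch_markdown` earlier. `session_cmd` is a hypothetical helper, split out so the command can be inspected without running it.

```python
import subprocess


def session_cmd(url: str, session: str, country: str = "us") -> list[str]:
    """Build the scrapfly command for one fetch inside a named session."""
    return [
        "scrapfly", "scrape", url,
        "--render-js", "--session", session, "--country", country,
        "--format", "markdown", "--content-only",
    ]


def fetch_markdown_session(url: str, session: str) -> str:
    """Run the scrape and return stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        session_cmd(url, session),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command only; running it needs the Scrapfly CLI and an API key.
    print(session_cmd("https://web-scraping.dev/products?category=apparel", "demo-session-1"))
```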



Gemini CLI conversation memory is not the same as a browser session. Scrapfly's `--session` keeps cookies and the cloud browser warm across fetches. The model still needs the relevant task context in the prompt or skill instructions.

Sessions back onto the [Scrapfly Cloud Browser](https://scrapfly.io/browser-api), so pinning a `session` value reuses the same Chromium instance across requests. The [session resume docs](https://scrapfly.io/docs/cloud-browser-api/session-resume) cover longer-lived patterns.

Paginated flows are the last advanced fetch pattern; the next section shows the MCP alternative for clients that prefer that transport over a skill.



## How to Connect Gemini CLI to Scrapfly with MCP

For Gemini CLI's MCP mode, configure a Scrapfly MCP server in `~/.gemini/settings.json` and start the server locally with `scrapfly mcp serve`. The CLI exposes `scrape`, `screenshot`, `extract`, `crawl`, and `selector` over the [Model Context Protocol](https://scrapfly.io/blog/posts/what-is-mcp-understanding-the-model-context-protocol).

The skill recipe from earlier is faster for single-source scraping; MCP fits when you compose Scrapfly with other MCP servers in one client.

The minimal Gemini CLI config and serve command:

```json
{
  "mcpServers": {
    "scrapfly": {
      "command": "scrapfly",
      "args": ["mcp", "serve"],
      "env": { "SCRAPFLY_API_KEY": "scp-live-..." }
    }
  }
}
```



```bash
scrapfly mcp serve
```
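
If `~/.gemini/settings.json` already holds other settings or MCP servers, merging the entry is safer than pasting over the file. A minimal sketch, assuming the file is strict JSON without comments; `add_scrapfly_server` is a hypothetical helper and the key shown is a placeholder.

```python
import json
from pathlib import Path


def add_scrapfly_server(settings_path: Path, api_key: str) -> dict:
    """Add or update the scrapfly MCP entry while keeping other settings."""
    if settings_path.exists():
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    else:
        settings = {}
    # setdefault keeps any servers that are already configured.
    servers = settings.setdefault("mcpServers", {})
    servers["scrapfly"] = {
        "command": "scrapfly",
        "args": ["mcp", "serve"],
        "env": {"SCRAPFLY_API_KEY": api_key},
    }
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Call it with `Path.home() / ".gemini" / "settings.json"` and your own key, then restart Gemini CLI so it rereads the file.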



Gemini CLI reads `mcpServers` on startup and lists the Scrapfly tools alongside its built-ins. For a hosted server with no local process, point the config at [Scrapfly MCP Cloud](https://scrapfly.io/products/mcp-cloud) instead. To build your own MCP server, see our [MCP server guide](https://scrapfly.io/blog/posts/how-to-build-an-mcp-server-in-python-a-complete-guide).



## Which Gemini Scraping Method Should You Choose?

For most developer scraping jobs, use Gemini CLI plus the Scrapfly CLI skill. It has the fastest setup, is reusable across prompts, and runs on the Gemini free tier.

The Gemini API plus URL Context fits only simple, undefended pages. The API plus a Scrapfly CLI shell-out covers Python backends, and MCP fits multi-tool composition.

| Method | JavaScript rendering | Protected-site access | Structured output | Setup |
|---|---|---|---|---|
| Gemini API + URL Context only | Inconsistent | None | Schema valid over partial content | Lowest |
| Gemini CLI + Scrapfly CLI skill | Yes via `--render-js` | Yes via `--asp` (98% Cloudflare) | Yes via prompt or `--schema` | Low |
| Gemini API + Scrapfly CLI shell-out | Yes | Yes, same ASP stack | Yes via `response_schema` | Medium |
| Gemini MCP client + Scrapfly MCP server | Yes | Yes, same ASP stack | Yes | Medium |

The URL Context-only row is honest about its scope: it works for public documentation, blog posts, and undefended marketing pages. Past that, escalate.

Several Python tutorials that currently rank in search results demo against `toscrape.com` (a scrape-friendly sandbox) and stop there; that isn't a real-world test. Sites with even minimal protection break the demo.

The other three rows all share the same Scrapfly fetch layer underneath; the difference is which Gemini surface you wire it into.

## Powering Gemini Web Scraping with Scrapfly



Scrapfly provides web scraping, screenshot, and extraction APIs for data collection at scale. For Gemini-driven scraping, it removes three operational burdens: cloud browser management, residential proxy sourcing, and anti-bot bypass maintenance.

Key features for Gemini scraping:

- [Anti-Scraping Protection bypass](https://scrapfly.io/docs/scrape-api/anti-scraping-protection): defeats Cloudflare, DataDome, PerimeterX, Akamai, and 90+ other bot systems with one flag. Cited public success rate is 99.99% across the stack.
- [JavaScript rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering): runs single-page apps through real cloud browsers so Gemini sees the same DOM a user does.
- [Proxy rotation](https://scrapfly.io/docs/scrape-api/proxy): residential and datacenter pools across 130M+ IPs in 120+ countries with country and ASN level geo-targeting.
- [Format conversion](https://scrapfly.io/docs/scrape-api/getting-started#api_param_format): return pages as HTML, JSON, clean text, or LLM-ready Markdown, which is what `response_schema` extraction wants.
- [Session management](https://scrapfly.io/docs/scrape-api/session): keep cookies, headers, and IPs consistent across multi-step flows for paginated and filtered scraping.
- [Python SDK](https://scrapfly.io/docs/sdk/python) and the [integration library](https://scrapfly.io/docs/integration/getting-started) cover LangChain, LlamaIndex, n8n, and other agent platforms when you outgrow the CLI.






## FAQ

### What can I scrape with Gemini and Scrapfly?

Public web pages, including product listings, news articles, documentation, search results, marketing pages, public PDFs, paginated listings, and JavaScript-rendered single-page apps. The pattern is for public web data, not for logged-in or paywalled content unless you have explicit authorization.







### What is a Gemini CLI skill?

A skill is a `SKILL.md` file with a `name` and `description` placed in a discovery folder such as `~/.agents/skills/`, `~/.gemini/skills/`, or a workspace `.gemini/skills/` directory. Gemini matches the `description` against your prompts and triggers the right skill, which is how the Scrapfly `scrapfly-cli/SKILL.md` plugs into any agent client.







### When should I use Gemini CLI versus the Gemini API for scraping?

Use Gemini CLI for interactive prototyping, ad-hoc extraction, and skill-based workflows where a single terminal session does the job. Use the Gemini API for scheduled backend pipelines that need logging, retries, and typed `response_schema` output in Python.







### How does Gemini's URL Context tool compare to a real fetch layer?

URL Context retrieves pages through Google's own infrastructure with hard caps (20 URLs per request, 34MB per page) and patchy JavaScript handling. A real fetch layer like the Scrapfly CLI renders JavaScript, clears anti-bot challenges, and keeps session state. URL Context does none of that reliably.







### Can I scrape JavaScript-heavy sites with Gemini's native tools?

Not consistently. Both URL Context and Gemini CLI's `web_fetch` sometimes execute JavaScript and sometimes don't, with no exposed diagnostic. The Scrapfly CLI with `--render-js` is the way to get a stable rendered page.







### Is it legal to scrape with Gemini and Scrapfly?

Scraping publicly accessible data is generally legal in most jurisdictions. Legality depends on the site's terms of service, the type of data, and your location. Stick to public pages, respect `robots.txt`, avoid personal data, and check the target's TOS before running large jobs.









## Summary

Gemini is good at turning markdown into structured JSON. The hard problem is access. Most real targets sit behind anti-bot vendors that block Gemini's native fetcher. The workflow only ships to production when you pair the model with a real fetch layer.

Scrapfly's Anti-Scraping Protection handles rendering, proxy routing, fingerprint coherence, and protected-site access. Without that access layer, the rest of the pipeline doesn't matter.

The escalation path is short. Start with the Gemini CLI plus Scrapfly CLI skill recipe for fast prototyping. Use `--asp` as the default on protected targets. Move to Python with `subprocess` and `response_schema` for backends.

Use `--session` for paginated and filtered flows. Pivot to MCP only when you compose Scrapfly with other MCP servers in one client.

Install the Scrapfly CLI, start a free trial with API credits, and drop the official `scrapfly-cli/SKILL.md` into `~/.agents/skills/scrapfly/`. Then run your first `scrapfly agent --provider gemini` command.



**Legal Disclaimer and Precautions**

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect:

- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens protected by GDPR.
- Do not repurpose *entire* public datasets, which can be illegal in some countries.

Scrapfly does not offer legal advice but these are good general rules to follow. For more you should consult a lawyer.

 
