# How to Scrape ChatGPT Responses in 2026

 by [Ziad Shamndy](https://scrapfly.io/blog/author/ziad) Jun 16, 2026 20 min read [\#ai](https://scrapfly.io/blog/tag/ai) [\#python](https://scrapfly.io/blog/tag/python) 


 

 

         

The ChatGPT web UI returns structured citation data (source URLs, publication names, and snippets) that the standard OpenAI Chat Completions API does not reproduce. That single gap is the reason AI brand visibility, competitive intelligence, and answer-consistency projects scrape `chatgpt.com` instead of calling the API.

This guide covers when scraping ChatGPT is the right call, the project setup with the Scrapfly Python SDK, two scrapers (single prompt and multi-turn conversation), and the blocking factors Scrapfly handles inside a single configuration block.

## Key Takeaways

- The OpenAI Chat Completions API is the right tool for text generation. Scraping `chatgpt.com` is the right tool when the citation field, source cards, or geo-aware answers are the data the project consumes.
- The three real blockers on chatgpt.com are Cloudflare Turnstile, browser and TLS fingerprinting, and the token-by-token SSE streaming response that has no clean completion event.
- The Scrapfly Python SDK handles all three with `asp=True`, a residential proxy pool, and a JS scenario that submits the prompt and waits for the send button to reappear.
- Multi-turn conversations are captured by intercepting the `backend-anon/f/conversation` XHR SSE stream and replaying its POST body with the conversation ID for each follow-up prompt.

[**View Source Code**: github.com/scrapfly/scrapfly-scrapers/tree/main/chatgpt-scraper](https://github.com/scrapfly/scrapfly-scrapers/tree/main/chatgpt-scraper)








## Why Scrape ChatGPT

Scrape ChatGPT's web UI when the data is what the consumer product renders, not the text the API returns. The two main targets are citation cards and web-search-grounded answers, and the dominant use case is AI brand visibility tracking.

The consumer ChatGPT interface renders structured source cards next to its answers when the model invokes web search. Each card carries a title, a URL, a publication name, and a snippet. The [OpenAI Chat Completions API](https://platform.openai.com/docs/guides/chat-completions) does not reproduce those cards. The API response is a flat list of message choices with no citation array.

```json
// OpenAI Chat Completions response (simplified)
{
  "choices": [
    {"message": {"role": "assistant", "content": "Here is a summary..."}}
  ]
}

// chatgpt.com UI scrape result
{
  "answer": "Here is a summary...",
  "links": ["https://example-publisher.com/article-1"],
  "sources": [
    {"title": "Article 1", "url": "https://example-publisher.com/article-1",
     "source": "example-publisher.com", "snippet": "..."}
  ]
}
```



Beyond citations, the UI also exposes web-search-grounded answers when ChatGPT invokes search, geolocation-aware responses that vary by IP, and the formatting a real user sees in the consumer product. These matter when the project measures how the consumer product behaves, not when it generates text inside your own application.

For AI brand visibility tracking, scheduled prompts turn ChatGPT's citations into a share-of-voice metric against competitors. Adjacent pipelines include LLM benchmarking and research citation tracking.

Use the OpenAI API by default. Scrape the UI when the rendered citations, search-grounded answers, or geo-aware responses are the deliverable.
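
As a rough illustration of the share-of-voice idea, the sketch below counts how often each domain appears across the `sources` arrays of several scrape results. The result shape follows the example JSON above; the `share_of_voice` helper and the sample data are assumptions made for illustration, not part of the scraper.

```python
from collections import Counter
from typing import Dict, List
from urllib.parse import urlparse


def share_of_voice(results: List[dict]) -> Dict[str, float]:
    """Return each cited domain's fraction of all citations across results."""
    counts: Counter = Counter()
    for result in results:
        for source in result.get("sources", []):
            domain = urlparse(source.get("url", "")).netloc.lower()
            if domain.startswith("www."):
                domain = domain[4:]  # treat www.example.com and example.com alike
            if domain:
                counts[domain] += 1
    total = sum(counts.values())
    return {d: n / total for d, n in counts.items()} if total else {}


# Hypothetical scrape results from two scheduled runs of the same prompt.
runs = [
    {"sources": [{"url": "https://www.example-a.com/post"},
                 {"url": "https://example-b.com/x"}]},
    {"sources": [{"url": "https://example-a.com/other"},
                 {"url": "https://example-a.com/third"}]},
]
print(share_of_voice(runs))
```

Counting by domain rather than by full URL keeps the metric stable when ChatGPT cites different pages from the same publisher.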



## Project Setup

The scraper runs on Python 3.10+ and depends on the [scrapfly-sdk](https://pypi.org/project/scrapfly-sdk/) package, which handles Cloudflare bypass, residential exit IPs, JS rendering, and the XHR interception we need for multi-turn conversations.

Install the dependencies.

```shell
pip install "scrapfly-sdk[all]" loguru
```



Grab a key from the [Scrapfly dashboard](https://scrapfly.io/dashboard). The free tier covers 1,000 API credits per month, enough to exercise both scrapers end to end.

```shell
export SCRAPFLY_KEY="your key from https://scrapfly.io/dashboard"
```



Create `chatgpt.py` with the shared imports and Scrapfly client.

```python
import os
import json
import time
from urllib.parse import quote_plus
from uuid import uuid4
from typing import Dict, List, Optional, TypedDict

from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient

SCRAPFLY = ScrapflyClient(key=os.environ["SCRAPFLY_KEY"])

BASE_CONFIG = {
    "asp": True,
    "proxy_pool": "public_residential_pool",
    "country": "US",
    "debug": True,
}


class ChatgptMessage(TypedDict):
    role: str
    content: str


class ChatgptConversation(TypedDict):
    conversation_id: str
    messages: List[ChatgptMessage]
```



The `BASE_CONFIG` block is the single configuration both scrapers reuse.

- `asp=True` enables Scrapfly's [Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection), which handles Cloudflare Turnstile, TLS fingerprinting, and JavaScript runtime evasions.
- `proxy_pool="public_residential_pool"` routes the request through residential IPs that look like real users.
- `country="US"` pins exit geography, since chatgpt.com behavior varies by region.
- `debug=True` enables the per-request [Scrapfly monitoring logs](https://scrapfly.io/docs/monitoring), useful while tuning the JS scenario.

The two scrapers below import this configuration and add the bits each one needs.
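
Because answers can vary by exit country, one way to compare regions is to derive per-country variants of `BASE_CONFIG` without mutating the shared dictionary. This is a minimal sketch: `configs_by_country` is a hypothetical helper, and the country codes are assumed to be ones your proxy pool supports.

```python
from typing import Dict, List

# Copied from the setup above so this snippet runs on its own.
BASE_CONFIG = {
    "asp": True,
    "proxy_pool": "public_residential_pool",
    "country": "US",
    "debug": True,
}


def configs_by_country(base: dict, countries: List[str]) -> Dict[str, dict]:
    """Build one config per country, leaving the base config untouched."""
    return {code: {**base, "country": code} for code in countries}


configs = configs_by_country(BASE_CONFIG, ["US", "DE", "JP"])
print(configs["DE"])
```

Each variant can then be unpacked into `ScrapeConfig` the same way `**BASE_CONFIG` is used in the scrapers.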



## How to Scrape Single ChatGPT Prompts

The `scrape_conversation` function sends one prompt and returns the rendered answer as markdown. It's the simplest scraping flow on `chatgpt.com`, useful for one-off answers, citation snapshots, and prototyping.

ChatGPT exposes the prompt via a URL parameter (`?prompt=...`), so the entire interaction can be driven without typing into the textarea. The friction is the page itself, which sits behind Cloudflare and renders the answer as a streaming SSE response with no obvious done event.

We solve both with a JS scenario that does five things in order.

1. Dismisses the Google one-tap credentials picker if it appears.
2. Closes the "stay logged out" toast banner.
3. Waits for the send button to render.
4. Clicks send to submit the prompt.
5. Waits for the send button to reappear once streaming finishes.

```python
js_scenario = [
    {
        "click": {
            "ignore_if_not_visible": True,
            "selector": "#credentials-picker-container #close",
            "multiple": False,
            "ignore": True,
        }
    },
    {
        "click": {
            "ignore": True,
            "ignore_if_not_visible": True,
            "selector": "div[aria-live='polite'] button:first-of-type",
            "multiple": False,
        }
    },
    {
        "wait_for_selector": {
            "selector": "button[data-testid='send-button']",
            "timeout": 15000,
        }
    },
    {
        "condition": {
            "selector": "button[data-testid='send-button']",
            "selector_state": "not_existing",
            "action": "exit_failed",
        }
    },
    {
        "click": {
            "selector": "button[data-testid='send-button']",
            "ignore_if_not_visible": False,
            "multiple": False,
        }
    },
    {"wait": 10000},
    {
        "condition": {
            "selector": "button[data-testid='send-button']",
            "selector_state": "existing",
            "action": "exit_success",
        }
    },
    {"wait": 10000},
]
```



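The first `condition` block is a guard: if the send button never rendered, the scenario exits as failed instead of clicking a missing element. The second is the streaming-completion check. While the answer is being generated, the send button is replaced by a stop button, so the `selector_state: "existing"` check reads false. Once generation finishes, the send button comes back and the condition exits the scenario successfully. As written, the check runs once, 10 seconds after the click, and the final 10-second wait covers answers that are still streaming at that point, so very long answers may need larger waits. Everything runs server-side, with no DOM polling from the client.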
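
The scenario above checks for the send button once after a fixed wait. To check more often, the wait and condition pair can be repeated. The sketch below generates those pairs using only the step shapes already shown; `completion_checks` is a hypothetical helper, and it assumes steps run in order and that `exit_success` stops the scenario, which the original scenario also relies on. Any overall scenario time limit on the platform still applies.

```python
from typing import List

SEND_BUTTON = "button[data-testid='send-button']"


def completion_checks(max_wait_ms: int = 20000, interval_ms: int = 2500) -> List[dict]:
    """Repeat wait + condition pairs so the scenario can exit soon after streaming ends."""
    steps: List[dict] = []
    for _ in range(max(1, max_wait_ms // interval_ms)):
        steps.append({"wait": interval_ms})
        steps.append(
            {
                "condition": {
                    "selector": SEND_BUTTON,
                    "selector_state": "existing",
                    "action": "exit_success",
                }
            }
        )
    return steps


steps = completion_checks()
print(len(steps))
```

To try it, keep the first five steps of `js_scenario` (through the click on the send button) and append the generated list in place of the final wait, condition, wait sequence.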

Now the scraper function itself.

```python
async def scrape_conversation(prompt: str) -> str:
    url = f"https://chatgpt.com/?prompt={quote_plus(prompt)}"
    log.info("scraping conversation for prompt: {}", prompt)
    response = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url=url,
            format="markdown",
            format_options=["only_content"],
            render_js=True,
            js_scenario=js_scenario,
            rendering_wait=5000,
            **BASE_CONFIG,
        )
    )
    log.success("finished scraping ChatGPT for the prompt: {}", prompt)
    return response.content
```



A few notes on the parameters.

- `format="markdown"` with `format_options=["only_content"]` tells Scrapfly to strip the surrounding chrome and return the article body as clean markdown.
- `render_js=True` runs a headless browser server-side. Combined with `asp=True`, that's enough to clear Cloudflare without writing any anti-detect logic.
- `rendering_wait=5000` adds a five-second cushion after the JS scenario exits to let the DOM settle.

Run it with the same prompt the upstream `run.py` ships with.



```python
import asyncio

if __name__ == "__main__":
    result = asyncio.run(
        scrape_conversation("What's the capital of France? with Brief History of the city.")
    )
    print(result)
```







A trimmed version of the markdown output.

```markdown
The capital of France is **Paris**.

### Brief History of Paris

- **Origins (3rd century BC)** Paris began as a settlement of the Parisii, a Celtic tribe, on the Île de la Cité in the Seine River.
- **Roman Era (52 BC)** The Romans conquered the area and called it *Lutetia*. It became a regional center but remained relatively small.
- **Medieval Period** Paris grew into a major political and religious hub...
```







This is enough for one-off prompts and citation captures. For multi-turn conversations, where each follow-up depends on the previous answer, we need a different approach.
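
For citation captures, the links can be pulled out of the returned markdown with a small regular expression. This sketch assumes citations appear as standard inline markdown links in the output, which is an assumption about the rendered page rather than something the scraper guarantees; `extract_links` and the sample string are hypothetical.

```python
import re
from typing import List

# Matches [text](http://... or https://...) inline links.
LINK_RE = re.compile(r"\[([^\]]*)\]\((https?://[^)\s]+)\)")


def extract_links(markdown: str) -> List[str]:
    """Return unique link URLs in first-seen order."""
    return list(dict.fromkeys(url for _text, url in LINK_RE.findall(markdown)))


sample = (
    "The capital is Paris ([Wikipedia](https://en.wikipedia.org/wiki/Paris)). "
    "See also [Britannica](https://www.britannica.com/place/Paris) "
    "and [Wikipedia](https://en.wikipedia.org/wiki/Paris)."
)
print(extract_links(sample))
```

The pattern stops at the first closing parenthesis, so URLs that themselves contain parentheses would be truncated. A markdown parser is the safer choice if that matters for your data.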



## How to Scrape ChatGPT Conversations

Multi-turn conversations need three things the single-prompt scraper does not. They need a stable session across requests, a way to read the assistant's reply structurally (not as rendered markdown), and a way to send follow-up prompts that share the conversation ID.

The Scrapfly SDK covers all three. `session="..."` pins cookies and storage across requests, `browser_data["xhr_call"]` exposes the SSE stream with the assistant's full reply, and same-session POSTs reuse the captured headers to continue the thread.

The flow has four steps.

1. Load `chatgpt.com/?prompt=<first>` with the same JS scenario from the single-prompt scraper.
2. Pull the `/backend-anon/f/conversation` XHR call from the response and parse its SSE body.
3. Extract `conversation_id` and `parent_message_id` from the parsed messages.
4. For each follow-up prompt, POST to `/backend-anon/conversation` on the same session with the conversation ID and parent message ID set.

### Parsing the SSE Stream

ChatGPT streams its replies as Server-Sent Events with three event shapes. The first is a seed event containing the full message object. The second is a list of JSON-Patch-like operations that append tokens. The third is a bare value event with a sticky path and operation inherited from the previous event.

The parser collapses all three into a flat list of role and content messages.

```python
def parse_chatgpt_stream(raw_sse: str) -> Dict:
    """Parse a ChatGPT SSE stream body into a structured messages JSON object."""
    messages: Dict[str, dict] = {}
    conversation_id: Optional[str] = None
    current_id: Optional[str] = None
    last_o: Optional[str] = None
    last_p: Optional[str] = None

    def store(msg: dict) -> Optional[str]:
        msg_id = msg.get("id")
        if not msg_id:
            return None
        parts = msg.get("content", {}).get("parts") or [""]
        messages[msg_id] = {
            "role": msg.get("author", {}).get("role", ""),
            "content": parts[0] if isinstance(parts[0], str) else "",
        }
        return msg_id

    def append(path: Optional[str], op: Optional[str], val) -> None:
        if (
            op == "append"
            and isinstance(val, str)
            and path
            and "content/parts/0" in path
            and current_id in messages
        ):
            messages[current_id]["content"] += val

    for line in raw_sse.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        raw = line[len("data:"):].strip()
        if raw == "[DONE]":
            break
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(data, dict):
            continue

        if data.get("type") == "input_message":
            current_id = store(data.get("input_message", {})) or current_id
            conversation_id = conversation_id or data.get("conversation_id")
            continue

        # Inherit sticky path/op when the event omits them.
        last_o = data.get("o", last_o)
        last_p = data.get("p", last_p)
        v = data.get("v")

        if isinstance(v, dict) and "message" in v:
            current_id = store(v["message"]) or current_id
            conversation_id = (
                conversation_id
                or v.get("conversation_id")
                or v["message"].get("metadata", {}).get("conversation_id")
            )
        elif isinstance(v, list):
            for patch in v:
                append(patch.get("p"), patch.get("o"), patch.get("v"))
        else:
            append(last_p, last_o, v)

    parent_message_id = next(
        (mid for mid, m in reversed(messages.items()) if m["role"] == "assistant"),
        None,
    )
    result_messages: List[ChatgptMessage] = [
        {"role": m["role"], "content": m["content"]}
        for m in messages.values()
        if m["role"] and m["content"]
    ]

    return {
        "conversation_id": conversation_id,
        "parent_message_id": parent_message_id,
        "messages": result_messages,
    }
```



The `parent_message_id` is critical. ChatGPT uses it to thread follow-up prompts onto the existing conversation tree, so we pick the most recent assistant message and pass its id as the parent of the next user prompt.
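
To make the three event shapes concrete, here is a hand-written sample stream and a small classifier that uses the same branching rules as the parser. The sample is an illustration, not captured traffic: the field names follow the parser above, while the ids and text are made up.

```python
import json
from typing import List

# Hand-written sample (assumption), shaped after the events the parser handles.
SAMPLE_SSE = "\n".join([
    'data: {"v": {"message": {"id": "m1", "author": {"role": "assistant"}, '
    '"content": {"parts": [""]}}, "conversation_id": "c1"}}',
    'data: {"v": [{"p": "/message/content/parts/0", "o": "append", "v": "Hel"}]}',
    'data: {"p": "/message/content/parts/0", "o": "append", "v": "lo"}',
    'data: {"v": "!"}',
    "data: [DONE]",
])


def classify_events(raw_sse: str) -> List[str]:
    """Label each data event as seed, patch_list, or bare_value."""
    kinds: List[str] = []
    for line in raw_sse.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        raw = line[len("data:"):].strip()
        if raw == "[DONE]":
            break
        v = json.loads(raw).get("v")
        if isinstance(v, dict) and "message" in v:
            kinds.append("seed")
        elif isinstance(v, list):
            kinds.append("patch_list")
        else:
            kinds.append("bare_value")
    return kinds


print(classify_events(SAMPLE_SSE))
```

The last event carries only `v`, so the parser reuses the path and operation from the event before it. Traced by hand through `parse_chatgpt_stream`, this sample should yield a single assistant message whose content is `Hello!`.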

### Building Follow-up POST Requests

For each prompt after the first, we POST to `/backend-anon/conversation` with a fresh `messages` array (one user message with a new UUID) and the captured `conversation_id` and `parent_message_id`.

```python
def _build_post_request(
    prompt: str,
    conversation_id: str,
    parent_message_id: str,
    original_body: dict,
    headers: dict,
) -> dict:
    """Build the JSON body and headers for a ChatGPT /backend-anon/conversation POST."""
    new_body = original_body.copy()
    new_body["conversation_id"] = conversation_id
    new_body["parent_message_id"] = parent_message_id
    new_body["messages"] = [
        {
            "id": str(uuid4()),
            "author": {"role": "user"},
            "create_time": time.time(),
            "content": {"content_type": "text", "parts": [prompt]},
        }
    ]
    return {"headers": headers, "body": new_body}
```



The `original_body` is the body of the first conversation XHR, which carries fields like `model`, `timezone_offset_min`, and feature flags ChatGPT requires. Copying it preserves whatever variant the page used and avoids drift if OpenAI ships new fields.
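
Note that `dict.copy()` is shallow. That is fine here because only top-level keys are replaced, but if you later change nested fields per request, a deep copy keeps those edits from leaking back into the captured body. A minimal sketch of that variant, where `build_followup_body` and the sample field values are hypothetical:

```python
import copy
import time
from uuid import uuid4


def build_followup_body(
    original_body: dict, prompt: str, conversation_id: str, parent_message_id: str
) -> dict:
    """Deep-copy the captured body, then set the per-turn fields."""
    body = copy.deepcopy(original_body)
    body["conversation_id"] = conversation_id
    body["parent_message_id"] = parent_message_id
    body["messages"] = [
        {
            "id": str(uuid4()),
            "author": {"role": "user"},
            "create_time": time.time(),
            "content": {"content_type": "text", "parts": [prompt]},
        }
    ]
    return body


# Hypothetical captured body; real field values come from the first XHR.
captured = {"model": "auto", "flags": {"example_flag": True}}
followup = build_followup_body(captured, "hi", "c1", "m1")
followup["flags"]["example_flag"] = False  # does not touch `captured`
print(captured["flags"]["example_flag"])
```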

### Driving the Full Conversation

`scrape_conversations` takes a list of prompts and walks them as a single multi-turn session.

```python
async def scrape_conversations(prompt: List[str]) -> List[ChatgptConversation]:
    prompt_index = 0
    url = f"https://chatgpt.com/?prompt={quote_plus(prompt[prompt_index])}"
    session = "chatgpt-" + str(uuid4()).replace("-", "")
    conversations: List[ChatgptConversation] = []
    response = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url=url,
            session=session,
            render_js=True,
            js_scenario=js_scenario,
            rendering_wait=5000,
            **BASE_CONFIG,
        )
    )

    _xhr_calls = response.scrape_result["browser_data"]["xhr_call"]
    conversation_calls = [
        x
        for x in _xhr_calls
        if "backend-anon/f/conversation" in x["url"]
        and x.get("response", {})
        .get("content_type", "")
        .startswith("text/event-stream")
    ]
    for xhr in conversation_calls:
        if not xhr.get("response"):
            continue

        # Parse the SSE stream captured during the initial page load
        parsed = parse_chatgpt_stream(xhr["response"]["body"])
        conversation_id = parsed.get("conversation_id")
        parent_message_id = parsed.get("parent_message_id")

        if conversation_id:
            conversations.append(
                {
                    "conversation_id": conversation_id,
                    "messages": parsed.get("messages", []),
                }
            )

        original_body = json.loads(xhr["body"])
        headers = xhr.get("headers", {}).copy()

        while prompt_index < len(prompt) - 1:
            prompt_index += 1
            post_request = _build_post_request(
                prompt[prompt_index],
                conversation_id,
                parent_message_id,
                original_body,
                headers,
            )
            post_response = await SCRAPFLY.async_scrape(
                ScrapeConfig(
                    url="https://chatgpt.com/backend-anon/conversation",
                    session=session,
                    method="POST",
                    body=json.dumps(post_request["body"]),
                    headers=post_request["headers"],
                    **BASE_CONFIG,
                )
            )

            # Parse POST SSE response and update state for next iteration
            post_parsed = parse_chatgpt_stream(post_response.content)
            if post_parsed.get("parent_message_id"):
                parent_message_id = post_parsed["parent_message_id"]

            if conversations and post_parsed.get("messages"):
                conversations[-1]["messages"].extend(post_parsed["messages"])

    return conversations
```



The function does three things in order. It scrapes the first prompt through the browser to capture the conversation seed, it filters `browser_data["xhr_call"]` for the SSE conversation stream, and then it loops through the remaining prompts as direct POSTs on the same `session` so cookies and the conversation thread stay consistent.



Run it with a chain of prompts.

```python
import asyncio

if __name__ == "__main__":
    prompts = [
        "what is the best web scraping service in 2026?",
        "Based on the previous answer, what is the best web scraping service you expect in 2027",
        "summarize the previous answer in 200 words",
    ]
    conversations = asyncio.run(scrape_conversations(prompts))
    print(json.dumps(conversations, indent=2)[:500])
```

A trimmed sample of the output.

```json
[
  {
    "conversation_id": "abc123...",
    "messages": [
      {"role": "user", "content": "what is the best web scraping service in 2026?"},
      {"role": "assistant", "content": "There isn't one single 'best' service..."},
      {"role": "user", "content": "Based on the previous answer..."},
      {"role": "assistant", "content": "Looking ahead to 2027..."},
      {"role": "user", "content": "summarize the previous answer in 200 words"},
      {"role": "assistant", "content": "In 2027, the web scraping market..."}
    ]
  }
]
```







The structure is symmetric across turns, which makes it directly consumable for AI brand visibility tracking, LLM benchmarking, and citation diffing pipelines.
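
For diffing or benchmarking, it often helps to regroup the flat message list into prompt and answer pairs. The sketch below assumes the `role` and `content` keys shown in the sample output; `pair_turns` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def pair_turns(messages: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Pair each user prompt with the next assistant reply."""
    pairs: List[Tuple[str, str]] = []
    pending = None
    for message in messages:
        if message["role"] == "user":
            pending = message["content"]
        elif message["role"] == "assistant" and pending is not None:
            pairs.append((pending, message["content"]))
            pending = None
    return pairs


messages = [
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
    {"role": "assistant", "content": "second answer"},
]
print(pair_turns(messages))
```

A user prompt with no reply, such as a turn that failed mid-stream, is simply dropped, so downstream code only ever sees complete pairs.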



## How to Extract Structured Data with OpenAI and the Scrapfly CLI

The two scrapers above target `chatgpt.com` itself. The same Scrapfly fetch layer also slots into any pipeline that already calls `client.chat.completions.create()`, so the OpenAI API can do extraction on arbitrary pages while Scrapfly handles the Cloudflare bypass and JS rendering.

The pattern is short. Shell out to the [Scrapfly CLI](https://github.com/scrapfly/scrapfly-cli) for the fetch, pipe the returned markdown into the OpenAI Chat Completions API with a function-calling schema, and let the model emit the structured fields.

Install the CLI with one curl command and point it at the same key the SDK already uses.

```shell
curl -fsSL https://scrapfly.io/scrapfly-cli/install | sh
scrapfly config set-key $SCRAPFLY_KEY
```



The installer works on macOS and Linux. Windows users can install through `npm install -D scrapfly-cli` or download release artifacts from the GitHub repository. Every command returns a stable JSON envelope shaped like `{success, product, data | error}`, and the `--content-only` flag strips the envelope so the output drops straight into a Python string.
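
When `--content-only` is not used, the envelope has to be unwrapped before the payload is usable. The sketch below relies only on the key names mentioned above (`success`, `data`, `error`); the shape of `data` and `error`, the sample values, and the `unwrap_envelope` helper are assumptions.

```python
import json
from typing import Any


class ScrapflyCliError(RuntimeError):
    """Raised when the CLI envelope reports a failure."""


def unwrap_envelope(stdout: str) -> Any:
    """Return the `data` payload, or raise with the reported `error`."""
    envelope = json.loads(stdout)
    if not envelope.get("success"):
        raise ScrapflyCliError(str(envelope.get("error")))
    return envelope.get("data")


# Hypothetical envelopes; real payloads will carry more fields.
ok = '{"success": true, "product": "scrape", "data": {"content": "# Hi"}}'
failed = '{"success": false, "product": "scrape", "error": "blocked"}'
print(unwrap_envelope(ok))
```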

With the CLI installed, the full extraction pipeline fits in a single script. Scrapfly handles the fetch with `--asp` and `--render-js`. OpenAI handles extraction through function calling. The two stages are independent, so swapping the model or schema is a one-line change.

```python
import subprocess
import json
from openai import OpenAI

# Step 1: Fetch the page as clean markdown through the Scrapfly CLI
result = subprocess.run(
    [
        "scrapfly", "scrape",
        "https://web-scraping.dev/product/1",
        "--asp", "--render-js",
        "--format", "markdown",
        "--content-only",
    ],
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError if the CLI exits with a non-zero status
)
markdown = result.stdout

# Step 2: Extract structured fields with OpenAI function calling
client = OpenAI(api_key="YOUR_OPENAI_KEY")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Extract the product details from this page:\n\n{markdown}",
    }],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_product",
            "description": "Extract product data from the page markdown",
            "parameters": {
                "type": "object",
                "properties": {
                    "name":     {"type": "string",  "description": "Product name"},
                    "price":    {"type": "string",  "description": "Product price with currency symbol"},
                    "in_stock": {"type": "boolean", "description": "Whether the product is in stock"},
                },
                "required": ["name", "price", "in_stock"],
            },
        },
    }],
    tool_choice={"type": "function", "function": {"name": "extract_product"}},
)

args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(json.dumps(args, indent=2))
# Output:
# {
#   "name": "Box of Chocolate Candy",
#   "price": "$9.99",
#   "in_stock": true
# }
```



Scrapfly returns LLM-ready markdown, not raw HTML, so no BeautifulSoup step sits between the fetch and the OpenAI call. The `--asp` flag carries the same Cloudflare bypass the `BASE_CONFIG` block uses for `chatgpt.com`, and `--render-js` covers any client-rendered SPA the prompt happens to target.

For one-off jobs where writing a script is overkill, the CLI ships a `scrapfly agent` command that runs a fully autonomous scraping loop and auto-detects the provider from `OPENAI_API_KEY`. The same extraction collapses into one shell command with no Python at all.

```shell
OPENAI_API_KEY=your-key scrapfly agent \
  "Name and price of the first product on the page" \
  --url https://web-scraping.dev/products \
  --schema '{"name": "string", "price": "string"}' \
  --provider openai
```



The CLI also runs as an MCP server through `scrapfly mcp serve` for teams already on Cursor, Cline, Windsurf, or Codex. See the [CLI documentation](https://scrapfly.io/docs/cli) for the full command reference.



## Bypass ChatGPT Blocking with Scrapfly

Three things break naive ChatGPT scrapers. Each one is solvable in a DIY stack, but each one adds operational cost when built from scratch. Scrapfly handles all of them inside the single `BASE_CONFIG` block both scrapers above already use.

Scrapfly's [Web Scraping API](https://scrapfly.io/web-scraping-api) is a single HTTP endpoint for collecting web data at scale, with a **99.99% success rate** across **130M+ proxies in 120+ countries**.

- [Anti-Scraping Protection bypass](https://scrapfly.io/docs/scrape-api/anti-scraping-protection) - automatically defeats Cloudflare, DataDome, PerimeterX, Akamai, and 90+ other bot systems.
- [Smart proxy rotation](https://scrapfly.io/docs/scrape-api/proxy) - residential and datacenter pools with country and ASN level geo-targeting.
- [JavaScript rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering) - render SPAs and dynamic pages through real cloud browsers.
- [Browser automation scenarios](https://scrapfly.io/docs/scrape-api/javascript-scenario) - scroll, click, fill forms, and wait for elements without managing a browser fleet.
- [Format conversion](https://scrapfly.io/docs/scrape-api/getting-started#api_param_format) - return pages as HTML, JSON, clean text, or LLM ready Markdown.
- [Session management](https://scrapfly.io/docs/scrape-api/session) - keep cookies, headers, and IPs consistent across multi step flows.
- [Smart caching](https://scrapfly.io/docs/scrape-api/getting-started#api_param_cache) - cache successful responses to cut cost on repeat scraping jobs.
- [Python](https://scrapfly.io/docs/sdk/python), [TypeScript](https://scrapfly.io/docs/sdk/typescript), [Scrapy](https://scrapfly.io/docs/sdk/scrapy), and [no-code integrations](https://scrapfly.io/docs/integration/getting-started) including Make, n8n, Zapier, LangChain, and LlamaIndex.

### Cloudflare Turnstile and Anti-Bot Detection

`chatgpt.com` sits behind [Cloudflare Turnstile](https://www.cloudflare.com/products/turnstile/) and can present interactive verification depending on the request profile. Vanilla [Playwright](https://scrapfly.io/blog/posts/web-scraping-with-playwright-and-python) or [Selenium](https://scrapfly.io/blog/posts/web-scraping-with-selenium-and-python) scripts trigger detection quickly, because the trust score reads TLS handshake fingerprints, JavaScript runtime properties, and behavioral signals, not just the User-Agent.

The `asp=True` flag activates Scrapfly's [Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection), which runs a hardened browser, rotates TLS profiles to match real Chrome, and patches the JS runtime so `navigator.webdriver` and similar flags don't leak. The dedicated guide below covers the detection theory we are bypassing.

[How to Bypass Cloudflare When Web Scraping in 2026](https://scrapfly.io/blog/posts/how-to-bypass-cloudflare-anti-scraping): Cloudflare offers one of the most popular anti-scraping services, so this article looks at how it works and how to bypass it.

### Login Walls and Geographic Restrictions

No-login availability on `chatgpt.com` varies by geography, browser state, and which features the prompt would invoke. Some requests land on the usable logged-out chat, others redirect straight to `/auth/login`.

The Scrapfly config covers both halves of the problem. The `country="US"` parameter pins exit geography to a region where logged-out access is typically available, and `proxy_pool="public_residential_pool"` routes through residential IPs that pass the trust checks.
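
Since the same script can land on either outcome, it helps to detect the login redirect explicitly and treat it as a retryable failure. A minimal sketch: `is_login_redirect` is a hypothetical helper, and where the final URL comes from depends on how your client reports it.

```python
from urllib.parse import urlparse


def is_login_redirect(final_url: str) -> bool:
    """True when the final URL points at the /auth/login page."""
    path = urlparse(final_url).path.rstrip("/")
    return path == "/auth/login" or path.startswith("/auth/login/")


print(is_login_redirect("https://chatgpt.com/auth/login?next=%2F"))
print(is_login_redirect("https://chatgpt.com/?prompt=hello"))
```

Checking the parsed path rather than searching the whole URL avoids false positives when a prompt in the query string happens to contain the text `auth/login`.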

For workloads that require an actual logged-in account, [Scrapfly Cloud Browser](https://scrapfly.io/docs/cloud-browser-api/getting-started) supports Human-in-the-Loop, where an operator completes the OAuth flow once in a live preview, and [Session Resume](https://scrapfly.io/docs/cloud-browser-api/session-resume) reconnects to the same authenticated browser on every subsequent run without storing credentials in code.



## FAQ

**Is Scraping ChatGPT Against OpenAI's Terms of Service?**

OpenAI's consumer terms can restrict automated access to chatgpt.com, while the API is the sanctioned automation path. UI scraping carries account-blocking and compliance risk, so verify the current OpenAI terms before any production rollout.







**Can I Scrape ChatGPT Without Logging In?**

Sometimes. Logged-out access depends on geography, browser state, and the features your prompt would trigger, so the same script can hit a usable chat one run and redirect to `/auth/login` the next. Treat no-login mode as a prototype path and plan for authenticated sessions in any production workload.







**Does ChatGPT Use Cloudflare Turnstile?**

Yes. The domain sits behind Cloudflare and can present Turnstile or interactive verification depending on the request profile. Vanilla Playwright triggers detection quickly; a managed Cloud Browser or hardened anti-detect setup is what keeps a session viable. The [Cloudflare bypass guide](https://scrapfly.io/blog/posts/how-to-bypass-cloudflare-anti-scraping) covers the detection theory.







**Can the OpenAI API Return Source Citations Like the ChatGPT Web UI?**

Not with full UI parity. Standard Chat Completions responses do not reproduce the consumer ChatGPT interface's rendered citation cards and source metadata, and the response shape is a flat list of message choices.







**How Fast Can I Scrape ChatGPT Before Getting Rate-Limited?**

There is no stable public limit. Logged-out access can be throttled aggressively, and authenticated rates vary by account state, geography, and request patterns. Production systems need pacing, retries, session distribution, and clear failure handling rather than a fixed prompts-per-hour budget.
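
One common way to add pacing and retries is exponential backoff with jitter around each scrape call. The sketch below is generic and not tied to any Scrapfly API; `with_backoff`, its defaults, and the flaky demo function are assumptions for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_backoff(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 4,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
) -> T:
    """Retry an async call, doubling the delay (plus jitter) after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:  # narrow this to the errors your client raises
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")


calls = {"count": 0}


async def flaky() -> str:
    """Demo call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated throttle")
    return "ok"


result = asyncio.run(with_backoff(flaky, base_delay=0))
print(result, calls["count"])
```

In the scrapers above, a call could be wrapped as `await with_backoff(lambda: scrape_conversation(prompt))`. The jitter spreads retries out so parallel workers do not all retry at the same moment.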









## Conclusion

The OpenAI Chat Completions API and the `chatgpt.com` web UI return different data. Use the API when the assistant's answer text is the product. Scrape the UI when citation cards, web-search-grounded answers, or AI brand visibility are.

The two scrapers in this guide cover both shapes of the workload. `scrape_conversation` sends one prompt through a JS scenario and returns the answer as markdown, useful for one-off captures. `scrape_conversations` walks a multi-turn chain by intercepting the conversation SSE stream and replaying its POST body, useful for benchmarking and citation diffing pipelines.

The blocking factors that make ChatGPT hard to scrape (Cloudflare Turnstile, browser fingerprinting, streaming completion) all collapse into the single `BASE_CONFIG` block the [Scrapfly Python SDK](https://scrapfly.io/docs/sdk/python) accepts. The same `asp=True` plus residential proxy pattern transfers directly to Perplexity, Claude, Gemini, and Grok with selector and endpoint changes only.



**Legal Disclaimer and Precautions**

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect:

- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens protected by GDPR.
- Do not repurpose *entire* public datasets which can be illegal in some countries.

Scrapfly does not offer legal advice but these are good general rules to follow. For more you should consult a lawyer.

 
