How to Scrape X.com (Twitter) in 2025

X.com (formerly Twitter) killed its free API in 2023. Since then, the platform has been rolling out defensive changes every 2-4 weeks that break DIY scrapers. Guest tokens expire, doc_ids rotate, and rate limits shift. If you're building a scraper from scratch today, you're signing up for 10-15 hours of monthly maintenance just to keep it working.

This guide shows you exactly what breaks and why, then introduces ScrapFly's maintained scraper as the only practical solution if you need X.com data at scale.

Key Takeaways

  • X.com's free API is gone; the paid version costs $42,000/year minimum
  • Guest tokens, doc_ids, and IP blocks break scrapers every 2-4 weeks
  • Manual scraper maintenance demands 10-15 hours per month to stay functional
  • ScrapFly monitors changes automatically and updates its scraper within 24 hours
  • Get started by cloning the working scraper and adding your API key

Latest X.com Scraper Code

Clone and run: https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape X.com?

X.com remains valuable for:

  • Real-time News: Monitor breaking stories and trending topics before they spread elsewhere
  • Market Signals: Track sentiment and announcements in finance and crypto communities
  • Brand Monitoring: See what people say about your product or company in real time
  • Competitor Research: Watch what competitors post and how audiences engage with their content

The data is publicly available and valuable. The problem is just getting it consistently.

For similar challenges with other platforms, see Instagram scraping, Reddit scraping, and YouTube scraping.

The X.com Scraping Problem

API Is Dead

In February 2023, X.com shut down free API access. The official paid API now starts at $42,000 per year for meaningful read access.

For anyone scraping at scale, the paid API is a non-starter. This forced developers to find alternatives.

Breaking Changes Timeline (2023-2025)

X.com has released significant defensive changes roughly every 2-4 weeks:

  • February 2023: Free API access ends. Paid tiers introduced with strict cost barriers
  • March 2023: New API rate limits severely restricted free tier usage; many apps stopped working
  • June 2023: Guest token acquisition methods changed; existing scrapers broke
  • August 2023: Rate limits reduced from 450 to 300 requests per hour; datacenter IP blocking increased
  • November 2023: GraphQL endpoint changes required doc_id updates across all query types
  • January 2024: Guest token format and expiration timing changed; TLS fingerprinting detection tightened
  • April 2024: doc_ids rotated again; anti-scraping headers added to responses
  • July 2024: Cookie validation requirements changed; session handling became more strict
  • October 2024: IP reputation scoring tightened; rotating proxies now flagged earlier
  • January 2025: Guest token binding to browser fingerprints implemented; datacenter IPs permanently banned

This isn't a stable target.

Three Things That Break Constantly

1. Guest Tokens

Every API call to X.com's GraphQL backend requires a guest token. These tokens:

  • Expire every 2-4 hours
  • Are tied to your IP address
  • Have acquisition methods that change every few weeks
  • Require new reverse engineering each time X.com shifts its approach

When a guest token expires, your scraper stops cold. Acquiring a new one becomes a game of catching up with X.com's latest obfuscation. See handling cookies and session management for related concepts.
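
For context, here's a minimal sketch of what the acquisition step has historically looked like: the frontend POSTs to a guest activation endpoint using the public bearer token embedded in X.com's JavaScript bundle. The bearer value below is a placeholder and the flow changes often, so treat this as illustrative rather than current:

import requests

# Placeholder: the real value is the public web-app bearer token embedded
# in X.com's JavaScript bundle; it changes whenever the bundle does.
PUBLIC_BEARER = "AAAA...placeholder..."

# Historically, the frontend activated guest sessions via this endpoint
resp = requests.post(
    "https://api.twitter.com/1.1/guest/activate.json",
    headers={"Authorization": f"Bearer {PUBLIC_BEARER}"},
    timeout=10,
)
resp.raise_for_status()
guest_token = resp.json()["guest_token"]  # expires after roughly 2-4 hours
print(guest_token)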

2. doc_ids

X.com's GraphQL queries use doc_ids, which are identifiers that tell the backend which operation to execute. These:

  • Rotate every 2-4 weeks
  • Require tracking 8-12 different IDs simultaneously
  • Need reverse engineering from X.com's frontend JavaScript
  • Have no public documentation

Without current doc_ids, your queries fail silently or return empty results. Similar blocking techniques are documented in detecting anti-bot systems.
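
To make that concrete, here's a sketch of the request shape. The doc_id below is a placeholder; real values must be pulled from the current frontend bundle:

import json
import urllib.parse

# Placeholder doc_id: real ones rotate every 2-4 weeks
DOC_ID = "AbCdEfGh123_placeholder"
variables = {"screen_name": "Scrapfly_dev"}

# GraphQL operations are addressed as /i/api/graphql/<doc_id>/<OperationName>
url = (
    f"https://x.com/i/api/graphql/{DOC_ID}/UserByScreenName"
    f"?variables={urllib.parse.quote(json.dumps(variables))}"
)
print(url)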

3. Rate Limits & Blocking

X.com enforces:

  • 300 requests per hour per IP address
  • Instant blocking of datacenter IPs (detected within 1-2 requests)
  • Increasing TLS fingerprinting checks to detect browser automation
  • Cookie validation that flags rotating proxy behavior

Even with all the tokens and doc_ids correct, you'll hit blocks faster than you think. For handling rate limits, see rate limiting async requests and avoiding IP blocking.
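
If you do roll your own, the minimum viable approach is pacing requests under the per-IP budget and backing off on HTTP 429 responses. A minimal sketch, with the function name and pacing values as assumptions:

import time
import requests

def fetch_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    base_delay = 12.0  # 3600 s / 300 requests: one request every ~12 s per IP
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=10)
        if resp.status_code == 429:
            # Rate limited: back off exponentially before retrying
            time.sleep(base_delay * (2 ** attempt))
            continue
        time.sleep(base_delay)  # pace successful calls under the hourly budget
        return resp
    raise RuntimeError(f"still rate-limited after {max_retries} attempts: {url}")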

The Maintenance Reality

Running a DIY X.com scraper demands:

  • Monitoring: Watch X.com's API responses for failures
  • Reverse Engineering: Extract new doc_ids when they rotate
  • Guest Token Fixes: Update acquisition logic when X.com changes endpoints
  • Rate Limit Handling: Implement backoff strategies that actually work
  • Proxy Rotation: Maintain residential proxy pools and rotation logic

This typically requires 10-15 hours per month just to keep a scraper running. For teams with only one engineer, it becomes a permanent side job.

Public View Limitations

Scraping without authentication gives you limited data. You can get:

  • Public tweets and profiles
  • Follower counts and basic engagement metrics

You cannot get:

  • Private/protected account data
  • Detailed search results (limited without login)
  • User timelines (incomplete without authentication)
  • Bookmarks, lists, or other account-specific data

This is why many scrapers require login, which introduces its own set of problems (account suspension risk, session management complexity).

ScrapFly's Solution

Instead of maintaining a scraper, use one that's maintained for you.

Clone the working scraper:

$ git clone https://github.com/scrapfly/scrapfly-scrapers.git
$ cd scrapfly-scrapers/twitter-scraper
$ poetry install
$ poetry run python run.py

ScrapFly's scraper:

  • Updates within 24 hours of X.com changes
  • Handles guest tokens automatically (acquires and rotates without your code)
  • Includes residential proxies (built-in IP rotation)
  • Tracks doc_ids automatically (reverse-engineers and updates them behind the scenes)
  • Bypasses anti-bot measures (realistic browser fingerprints and request patterns)
  • Supports all public clients (profiles, tweets, search, threads, engagement)

You clone it, add your API key, and it works. When X.com changes, ScrapFly updates the code. You don't maintain anything.

What Data You Can Scrape

Profiles: Username, bio, follower count, verification status, profile picture URL.

Tweets: Text content, timestamps, media URLs, engagement counts (likes, retweets, replies).

Search Results: Tweets matching your query, ranked by relevance and recency.

Threads: Replies, quote tweets, and conversation chains connected to a parent tweet.

Engagement Metrics: Likes, retweets, replies, quotes, and bookmark counts.

Followers: User lists following a given account (where publicly available).

How X.com's API Works

X.com is a React application that loads minimal HTML, then uses JavaScript to fetch data via GraphQL queries.

Here's the flow:

  1. Load X.com page
  2. JavaScript initializes and requests a guest token
  3. Guest token is returned (valid for 2-4 hours)
  4. Page JavaScript makes GraphQL queries using the token
  5. Queries include doc_ids to identify which operation to execute
  6. Backend returns JSON data, frontend renders it

Without a guest token, you can't make queries. Without the right doc_id, your query doesn't match any backend operation. Without working residential proxies and rate limiting, X.com blocks your IP.

A scraper needs to replicate this flow: get token, craft queries with current doc_ids, handle rate limits, and rotate IPs intelligently.
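
Here's a hedged sketch of what replicating that flow means in DIY code. The bearer value is a placeholder, the header and URL shapes are as historically observed, and the two stubbed helpers stand in for the reverse-engineering work covered in the next two sections:

import time
import requests

PUBLIC_BEARER = "AAAA...placeholder..."  # public web-app bearer token (placeholder)

def acquire_guest_token() -> str:
    # Steps 2-3: request a guest token (see the next section); stubbed here
    raise NotImplementedError

def current_doc_id(operation: str) -> str:
    # Step 5: extract the current doc_id from the frontend bundle; stubbed here
    raise NotImplementedError

def fetch_profile(screen_name: str) -> dict:
    doc_id = current_doc_id("UserByScreenName")
    resp = requests.get(
        f"https://x.com/i/api/graphql/{doc_id}/UserByScreenName",
        params={"variables": f'{{"screen_name": "{screen_name}"}}'},
        headers={
            "Authorization": f"Bearer {PUBLIC_BEARER}",
            "x-guest-token": acquire_guest_token(),  # step 4: token on every query
        },
        timeout=10,
    )
    resp.raise_for_status()
    time.sleep(12)  # stay under ~300 requests/hour per IP
    return resp.json()  # step 6: raw JSON payload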

Guest Tokens Explained

What they are: Temporary credentials that mark your session as a legitimate visitor rather than a bot.

Why they expire: X.com limits token lifetime to prevent token reuse and API abuse.

How they're tied to IP: X.com validates that requests using a token come from the same IP that requested it. Rotating IPs breaks token validity.

How ScrapFly handles them: Automatically acquires new tokens, maintains them per IP session, and rotates them before expiration. Your code simply calls the scraper while token management happens behind the scenes.

Code example (ScrapFly handles this internally):

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR_API_KEY")
result = client.scrape(ScrapeConfig(
    "https://x.com/Scrapfly_dev",
    render_js=True,  # Enable JavaScript rendering
    asp=True,        # Enable anti-scraping bypass
))

ScrapFly's SDK manages guest tokens, IP sessions, and retries automatically. You provide the URL; it handles the rest.

doc_ids Explained

What they are: Unique identifiers for GraphQL operations. Each query type (fetch profile, get tweets, search) has its own doc_id.

Why they change: X.com rotates doc_ids every 2-4 weeks to break reverse-engineered scrapers. There's no pattern to predict: they're essentially random identifiers.

How many are active: You typically need to track 8-12 doc_ids simultaneously:

  • User info queries
  • Tweet detail queries
  • Search queries
  • Timeline queries
  • Engagement queries

How to find them: Reverse-engineer X.com's JavaScript bundle, intercept GraphQL calls with browser dev tools, or monitor X.com's API patterns. It's manual work, and it repeats every few weeks.
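
One way to do the dev-tools step programmatically is to load a page in a headless browser and record the GraphQL request URLs as they fire. A sketch using Playwright (an assumption here, not part of ScrapFly's scraper; headless visits may themselves be blocked):

import re
from playwright.sync_api import sync_playwright

doc_ids: dict[str, str] = {}

def record_doc_id(request):
    # GraphQL calls follow the pattern /i/api/graphql/<doc_id>/<OperationName>
    match = re.search(r"/i/api/graphql/([\w-]+)/(\w+)", request.url)
    if match:
        doc_ids[match.group(2)] = match.group(1)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("request", record_doc_id)
    page.goto("https://x.com/Scrapfly_dev", wait_until="networkidle")
    browser.close()

print(doc_ids)  # e.g. {"UserByScreenName": "<current doc_id>", ...}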

ScrapFly's approach: Monitors X.com automatically and updates doc_ids within 24 hours of rotation. When you clone the scraper, all current doc_ids are already in place.

Scraping Profiles

Extract user profile information using ScrapFly:

import asyncio
import json

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR_API_KEY")

async def main():
    result = await client.async_scrape(ScrapeConfig(
        "https://x.com/Scrapfly_dev",
        render_js=True,
        asp=True,
        wait_for_selector="[data-testid='primaryColumn']",
    ))

    # Capture the background XHR calls the page made and keep the
    # UserBy* GraphQL responses that carry the profile data
    xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
    user_calls = [f for f in xhr_calls if "UserBy" in f["url"]]

    for xhr in user_calls:
        data = json.loads(xhr["response"]["body"])
        # parse_profile is the field-flattening helper from the linked repo
        profile = parse_profile(data["data"]["user"]["result"])
        print(json.dumps(profile, indent=2))

asyncio.run(main())

What you get:

  • Display name and username
  • Bio and description
  • Follower/following counts
  • Verification status
  • Media count and listed count
  • Profile banner and creation date
  • Website URL and location

Example Output:
{
  "id": "VXNlcjoxMzEwNjIzMDgxMzAwNDAyMTc4",
  "rest_id": "1310623081300402178",
  "verified": true,
  "default_profile": true,
  "default_profile_image": false,
  "description": "API products for developers:\n- Web Scraping API: scrape any page\n- Screenshot API: screenshot any website\n- Extraction API: parse data using AI",
  "entities": {
    "description": {
      "urls": []
    },
    "url": {
      "urls": [
        {
          "display_url": "scrapfly.io",
          "expanded_url": "https://scrapfly.io",
          "url": "https://t.co/1Is3k6KzyM",
          "indices": [0, 23]
        }
      ]
    }
  },
  "fast_followers_count": 0,
  "favourites_count": 41,
  "followers_count": 281,
  "friends_count": 5,
  "has_custom_timelines": true,
  "is_translator": false,
  "listed_count": 3,
  "media_count": 38,
  "normal_followers_count": 281,
  "pinned_tweet_ids_str": ["1863616315174551787"],
  "possibly_sensitive": false,
  "profile_banner_url": "https://pbs.twimg.com/profile_banners/1310623081300402178/1601320645",
  "profile_interstitial_type": "",
  "statuses_count": 186,
  "translator_type": "none",
  "url": "https://t.co/1Is3k6KzyM",
  "withheld_in_countries": []
}

Link to full implementation: ScrapFly X.com Scraper GitHub

Scraping Tweets

Extract tweets and their metadata:

import asyncio
import json

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR_API_KEY")

async def main():
    result = await client.async_scrape(ScrapeConfig(
        "https://x.com/Scrapfly_dev/status/1664267318053179398",
        render_js=True,
        asp=True,
        wait_for_selector="[data-testid='tweet']",
    ))

    # Capture background XHR calls and keep the TweetResultByRestId
    # GraphQL responses that carry the tweet payload
    xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
    tweet_calls = [f for f in xhr_calls if "TweetResultByRestId" in f["url"]]

    for xhr in tweet_calls:
        data = json.loads(xhr["response"]["body"])
        # parse_tweet is the field-flattening helper from the linked repo
        tweet = parse_tweet(data["data"]["tweetResult"]["result"])
        print(json.dumps(tweet, indent=2))

asyncio.run(main())

What you get:

  • Tweet text and full content
  • Creation timestamp
  • Like, retweet, reply, and quote counts
  • View count
  • Media URLs (images, videos)
  • Attached URLs with expanded links
  • Tagged hashtags and mentioned users
  • Author information (nested user object)
  • Engagement data (bookmarks, conversation ID)

Example Output:
{
  "created_at": "Thu Jun 01 13:47:03 +0000 2023",
  "attached_urls": [
    "https://scrapfly.io/blog/top-10-web-scraping-libraries-in-python/"
  ],
  "attached_media": [
    "https://pbs.twimg.com/media/FxiqTffWIAALf7O.png"
  ],
  "tagged_users": [],
  "tagged_hashtags": [],
  "favorite_count": 8,
  "bookmark_count": 1,
  "quote_count": 0,
  "reply_count": 7,
  "retweet_count": 1,
  "text": "A new blog post has been published! \n\nTop 10 Web Scraping Packages for Python 🤖\n\nCheckout it out 👇\nhttps://t.co/d2iFdAV2LJ https://t.co/zLjDlxdKee",
  "is_quote": false,
  "is_retweet": false,
  "language": "en",
  "user_id": "1310623081300402178",
  "id": "1664267318053179398",
  "conversation_id": "1664267318053179398",
  "source": "Zapier.com",
  "views": "2296",
  "poll": {},
  "user": {
    "id": "VXNlcjoxMzEwNjIzMDgxMzAwNDAyMTc4",
    "rest_id": "1310623081300402178",
    "verified": true,
    "default_profile": true,
    "default_profile_image": false,
    "description": "API products for developers:\n- Web Scraping API: scrape any page\n- Screenshot API: screenshot any website\n- Extraction API: parse data using AI",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {
        "urls": [
          {
            "display_url": "scrapfly.io",
            "expanded_url": "https://scrapfly.io",
            "url": "https://t.co/1Is3k6KzyM",
            "indices": [0, 23]
          }
        ]
      }
    },
    "fast_followers_count": 0,
    "favourites_count": 41,
    "followers_count": 281,
    "friends_count": 5,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 3,
    "media_count": 38,
    "normal_followers_count": 281,
    "pinned_tweet_ids_str": ["1863616315174551787"],
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/1310623081300402178/1601320645",
    "profile_interstitial_type": "",
    "statuses_count": 186,
    "translator_type": "none",
    "url": "https://t.co/1Is3k6KzyM",
    "withheld_in_countries": []
  }
}

Link to full implementation: ScrapFly X.com Scraper GitHub

Proxies

Why You Need Them

Scraping X.com without proxies will fail in minutes:

  • Rate limit: 300 requests per hour per IP
  • IP blocking: Datacenter IPs are blocked instantly (within 1-2 requests)
  • Detection: X.com tracks request patterns and flags suspicious behavior

You cannot scrape X.com at any meaningful scale using a single datacenter IP. For more context, see avoiding IP address blocking and proxy introduction.

Best Type: Residential Proxies

Why residential: Requests come from real residential IP addresses, making them indistinguishable from regular users. X.com's detection systems accept them.

Cost: $1-3 per gigabyte of traffic.

Rotation strategy: Use sticky sessions that last 10-15 minutes. This keeps the guest token and IP session stable while giving you enough coverage to avoid triggering rate limits on a single IP. For more on rotation, see proxy rotation techniques.
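
With ScrapFly's SDK, stickiness is expressed through its session feature. A minimal sketch, where the session name is an arbitrary label you choose and rotate:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR_API_KEY")

# Reusing the same session name keeps the same exit IP (and with it the
# guest-token binding) across consecutive requests
for handle in ["Scrapfly_dev", "another_handle"]:
    result = client.scrape(ScrapeConfig(
        f"https://x.com/{handle}",
        render_js=True,
        asp=True,
        session="x-sticky-1",  # arbitrary label; rotate it every 10-15 minutes
    ))
    print(handle, "scraped")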

Maintenance: ScrapFly includes residential proxies in its service, so you don't manage pools or rotation logic. Check best proxy providers and advanced proxy optimization for alternatives.

Cost Example

Scraping 10,000 tweets:

  • Page loads: ~500-1,000 browser requests (timeline and search pages return tweets in batches, so you don't fetch each tweet individually)
  • Data transferred: roughly 2-3 GB with full browser rendering
  • Proxy cost: $5-8 at standard residential rates ($1-3/GB)

For price-tracking or sentiment-analysis projects, this is reasonable. For continuous monitoring, ScrapFly's monthly plan is more cost-effective than managing your own infrastructure. See top residential proxy providers for comparison.
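
As a back-of-envelope check on those numbers (every value below is an assumed approximation, not a quote):

# Rough cost arithmetic for the example above; all values are estimates
page_loads = 750      # midpoint of ~500-1,000 browser loads for 10,000 tweets
mb_per_load = 3.5     # full browser load: HTML, JS bundle, JSON payloads
cost_per_gb = 2.0     # mid-range residential proxy rate in $/GB

gb_total = page_loads * mb_per_load / 1024
print(f"~{gb_total:.1f} GB -> ~${gb_total * cost_per_gb:.2f}")  # ~2.6 GB -> ~$5.13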

FAQ

How often does X.com change its defenses?

Every 2-4 weeks, X.com rolls out changes to guest tokens, doc_ids, rate limits, or detection patterns. There's no predictable schedule; you typically find out when your scraper breaks.

Why can't I just use a datacenter proxy?

X.com blocks datacenter IPs on sight. They detect datacenter IP ranges and reject requests immediately. Even with perfect tokens and doc_ids, a datacenter proxy will fail within seconds.

How does ScrapFly know when X.com changes?

ScrapFly monitors X.com's API continuously, tests scraper functionality hourly, and detects failures before customers do. When failures are detected, engineers reverse-engineer the new changes and push updates within 24 hours.

Can I scrape X.com without authentication?

Yes, public tweets and profiles are scrapable without login. However, you'll have limited access to search, timelines, and engagement data. Full functionality requires authenticated sessions and cookies, which carries account suspension risk. Related alternatives: Playwright automation or Selenium automation.

Is scraping X.com legal?

Scraping public data is generally legal, but X.com's terms of service prohibit automated access without permission, so it remains a legal gray area. Use scraped data responsibly and review compliance requirements for your use case. For other platforms, see social media scraping in 2025.

How much does ScrapFly cost?

Pricing depends on data volume and features. Visit ScrapFly's pricing page for current rates. Most projects start at $20-100/month depending on request volume.

X.com Scraping Summary

The free API is gone. Manual scraping breaks every 2-4 weeks due to guest tokens expiring, doc_ids rotating, and rate limits shifting. Maintaining a DIY scraper costs 10-15 hours per month.

ScrapFly solves this by maintaining one scraper for you. Guest tokens are handled automatically. doc_ids are tracked and updated within 24 hours of rotation. Residential proxies are included. Rate limiting and anti-bot bypass are built in.

Clone the repo, add your API key, start scraping. When X.com changes, ScrapFly updates the code. You don't maintain anything.

Latest X.com Scraper Code
Ready to use: https://github.com/scrapfly/scrapfly-scrapers/

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
