How to Scrape X.com (Twitter) using Python (2024 Update)


When Twitter.com became X.com, it shut down its public API, but web scraping is here to the rescue!

In this X.com web scraping tutorial, we'll take a look at scraping X.com posts and profiles using Python and Playwright.

We'll be using Python to retrieve X.com data such as:

  • X.com post (tweet) information.
  • X.com user profile information.

Unfortunately, the remaining data points can't be scraped without logging in, though we'll mention some potential workarounds and suggestions.

So, we'll be scraping X.com without logging in or any complex tricks: using a headless browser and capturing background requests makes this a simple yet powerful scraper.

For our headless browser environment, we'll be using the Scrapfly SDK with its JavaScript rendering feature. Alternatively, for non-Scrapfly users, we'll also show how to achieve similar results using Playwright.

Latest X.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape X.com?

X.com (formerly Twitter.com) is a major announcement hub where people and companies publish their news, which makes it a great place to follow and understand industry trends. For example, stock market or crypto market accounts could be scraped to help predict the future price of a stock or cryptocurrency.

X is also a great source of data for sentiment analysis. You can use X.com to find out what people think about a certain topic or brand. This is useful for market research, product development, and brand awareness.

So, if we can scrape X.com data with Python we can have access to this valuable public information for free!

Project Setup

In this tutorial, we'll cover X/Twitter scraping using Python and scrapfly-sdk or Playwright.

To parse the scraped X.com datasets we'll be using the JMESPath JSON parsing library, which lets us query and reshape JSON data.

All of these libraries are available for free and can be installed via pip install terminal command:

$ pip install playwright jmespath scrapfly-sdk
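
As a quick preview of what jmespath reshaping looks like, here's a minimal standalone sketch (not tied to X.com data) that renames and flattens nested fields:

import jmespath

data = {"legacy": {"full_text": "hello world", "favorite_count": 3}}
# select nested fields and rename them in a single expression
flat = jmespath.search("{text: legacy.full_text, likes: legacy.favorite_count}", data)
print(flat)  # {'text': 'hello world', 'likes': 3}
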
Web Scraping with Playwright and Python

For an introduction to web scraping with Playwright see this beginner's guide which covers common functionality and an example project.


How Do X.com Pages Work?

Before we start scraping, let's take a quick look at how the X.com website works through a bit of basic reverse engineering. This will help us develop our Twitter scraper.

To start, X.com is a JavaScript web application that uses a lot of background requests (XHR) to display page data. In short, it loads the initial HTML, starts the JS app, and then loads the tweet data through XHR requests:

Twitter page load process

So, scraping it without a headless browser such as Playwright or Scrapfly SDK would be very difficult as we'd have to reverse engineer the entire X.com API and application process.

On top of that, X.com's page HTML is dynamic and complex, making it very difficult to parse scraped content. So, the best approach to scrape Twitter is to use a headless browser and capture the background requests that download the tweet and user data.

To summarize, our best bet is to:

  1. Start a headless web browser.
  2. Enable background request capture.
  3. Load X.com page.
  4. Select captured background requests that contain post or profile data.

For example, if we take a look at a Twitter profile page in Browser Developer Tools we can see the requests Twitter performs in the background to load the page data:

X.com (Twitter) backend making a background request to retrieve data

Next, let's start by scraping X.com posts (tweets).

Scraping Tweets (Posts)

To scrape individual X.com post pages we'll be loading the page using a headless browser and capturing the background requests that retrieve tweet details. This request can be identified by TweetResultByRestId in the URL.

This background request returns a JSON response that contains post and author information.

So, to scrape this using Python we can either use Playwright or Scrapfly SDK:

Playwright:
from playwright.sync_api import sync_playwright


def scrape_tweet(url: str) -> dict:
    """
    Scrape a single tweet page for Tweet thread e.g.:
    https://twitter.com/Scrapfly_dev/status/1667013143904567296
    Return parent tweet, reply tweets and recommended tweets
    """
    _xhr_calls = []

    def intercept_response(response):
        """capture all background requests and save them"""
        # we can extract details from background requests
        if response.request.resource_type == "xhr":
            _xhr_calls.append(response)
        return response

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()

        # enable background request intercepting:
        page.on("response", intercept_response)
        # go to url and wait for the page to load
        page.goto(url)
        page.wait_for_selector("[data-testid='tweet']")

        # find all tweet background requests:
        tweet_calls = [f for f in _xhr_calls if "TweetResultByRestId" in f.url]
        for xhr in tweet_calls:
            data = xhr.json()
            return data['data']['tweetResult']['result']



if __name__ == "__main__":
    print(scrape_tweet("https://twitter.com/Scrapfly_dev/status/1664267318053179398"))
ScrapFly SDK:

import asyncio
import json

from scrapfly import ScrapeConfig, ScrapflyClient

SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")

async def scrape_tweet(url: str) -> dict:
    """
    Scrape a single tweet page for Tweet thread e.g.:
    https://twitter.com/Scrapfly_dev/status/1667013143904567296
    Return parent tweet, reply tweets and recommended tweets
    """
    result = await SCRAPFLY.async_scrape(ScrapeConfig(
        url, 
        render_js=True,  # enable headless browser
        wait_for_selector="[data-testid='tweet']"  # wait for page to finish loading 
    ))
    # capture background requests and extract ones that request Tweet data
    _xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
    tweet_call = [f for f in _xhr_calls if "TweetResultByRestId" in f["url"]]
    for xhr in tweet_call:
        if not xhr["response"]:
            continue
        data = json.loads(xhr["response"]["body"])
        return data['data']['tweetResult']['result']

if __name__ == "__main__":
    print(asyncio.run(scrape_tweet("https://twitter.com/Scrapfly_dev/status/1664267318053179398")))
Example Output
{
  "tweet": {
    "__typename": "Tweet",
    "rest_id": "1664267318053179398",
    "core": {
      "user_results": {
        "result": {
          "__typename": "User",
          "id": "VXNlcjoxMzEwNjIzMDgxMzAwNDAyMTc4",
          "rest_id": "1310623081300402178",
          "affiliates_highlighted_label": {},
          "is_blue_verified": true,
          "profile_image_shape": "Circle",
          "legacy": {
            "created_at": "Mon Sep 28 16:51:22 +0000 2020",
            "default_profile": true,
            "default_profile_image": false,
            "description": "Web Scraping API - turn any website into a database!\n\nScrapFly allows you to quickly achieve your data goals without web scraping challenges and errors.",
            "entities": {
              "description": {
                "urls": []
              },
              "url": {
                "urls": [
                  {
                    "display_url": "scrapfly.io",
                    "expanded_url": "https://scrapfly.io",
                    "url": "https://t.co/1Is3k6KzyM",
                    "indices": [
                      0,
                      23
                    ]
                  }
                ]
              }
            },
            "fast_followers_count": 0,
            "favourites_count": 26,
            "followers_count": 163,
            "friends_count": 993,
            "has_custom_timelines": true,
            "is_translator": false,
            "listed_count": 2,
            "location": "Paris",
            "media_count": 11,
            "name": "Scrapfly",
            "normal_followers_count": 163,
            "pinned_tweet_ids_str": [],
            "possibly_sensitive": false,
            "profile_banner_url": "https://pbs.twimg.com/profile_banners/1310623081300402178/1601320645",
            "profile_image_url_https": "https://pbs.twimg.com/profile_images/1310658795715076098/XedZDwC7_normal.jpg",
            "profile_interstitial_type": "",
            "screen_name": "Scrapfly_dev",
            "statuses_count": 56,
            "translator_type": "none",
            "url": "https://t.co/1Is3k6KzyM",
            "verified": false,
            "withheld_in_countries": []
          }
        }
      }
    },
    "edit_control": {
      "edit_tweet_ids": [
        "1664267318053179398"
      ],
      "editable_until_msecs": "1685629023000",
      "is_edit_eligible": true,
      "edits_remaining": "5"
    },
    "is_translatable": false,
    "views": {
      "count": "43",
      "state": "EnabledWithCount"
    },
    "source": "<a href=\"https://zapier.com/\" rel=\"nofollow\">Zapier.com</a>",
    "legacy": {
      "bookmark_count": 0,
      "bookmarked": false,
      "created_at": "Thu Jun 01 13:47:03 +0000 2023",
      "conversation_id_str": "1664267318053179398",
      "display_text_range": [
        0,
        122
      ],
      "entities": {
        "media": [
          {
            "display_url": "pic.twitter.com/zLjDlxdKee",
            "expanded_url": "https://twitter.com/Scrapfly_dev/status/1664267318053179398/photo/1",
            "id_str": "1664267314160607232",
            "indices": [
              123,
              146
            ],
            "media_url_https": "https://pbs.twimg.com/media/FxiqTffWIAALf7O.png",
            "type": "photo",
            "url": "https://t.co/zLjDlxdKee",
            "features": {
              "large": {
                "faces": []
              },
              "medium": {
                "faces": []
              },
              "small": {
                "faces": []
              },
              "orig": {
                "faces": []
              }
            },
            "sizes": {
              "large": {
                "h": 416,
                "w": 796,
                "resize": "fit"
              },
              "medium": {
                "h": 416,
                "w": 796,
                "resize": "fit"
              },
              "small": {
                "h": 355,
                "w": 680,
                "resize": "fit"
              },
              "thumb": {
                "h": 150,
                "w": 150,
                "resize": "crop"
              }
            },
            "original_info": {
              "height": 416,
              "width": 796,
              "focus_rects": [
                {
                  "x": 27,
                  "y": 0,
                  "w": 743,
                  "h": 416
                },
                {
                  "x": 190,
                  "y": 0,
                  "w": 416,
                  "h": 416
                },
                {
                  "x": 216,
                  "y": 0,
                  "w": 365,
                  "h": 416
                },
                {
                  "x": 294,
                  "y": 0,
                  "w": 208,
                  "h": 416
                },
                {
                  "x": 0,
                  "y": 0,
                  "w": 796,
                  "h": 416
                }
              ]
            }
          }
        ],
        "user_mentions": [],
        "urls": [
          {
            "display_url": "scrapfly.io/blog/top-10-we\u2026",
            "expanded_url": "https://scrapfly.io/blog/top-10-web-scraping-libraries-in-python/",
            "url": "https://t.co/d2iFdAV2LJ",
            "indices": [
              99,
              122
            ]
          }
        ],
        "hashtags": [],
        "symbols": []
      },
      "extended_entities": {
        "media": [
          {
            "display_url": "pic.twitter.com/zLjDlxdKee",
            "expanded_url": "https://twitter.com/Scrapfly_dev/status/1664267318053179398/photo/1",
            "id_str": "1664267314160607232",
            "indices": [
              123,
              146
            ],
            "media_key": "3_1664267314160607232",
            "media_url_https": "https://pbs.twimg.com/media/FxiqTffWIAALf7O.png",
            "type": "photo",
            "url": "https://t.co/zLjDlxdKee",
            "ext_media_availability": {
              "status": "Available"
            },
            "features": {
              "large": {
                "faces": []
              },
              "medium": {
                "faces": []
              },
              "small": {
                "faces": []
              },
              "orig": {
                "faces": []
              }
            },
            "sizes": {
              "large": {
                "h": 416,
                "w": 796,
                "resize": "fit"
              },
              "medium": {
                "h": 416,
                "w": 796,
                "resize": "fit"
              },
              "small": {
                "h": 355,
                "w": 680,
                "resize": "fit"
              },
              "thumb": {
                "h": 150,
                "w": 150,
                "resize": "crop"
              }
            },
            "original_info": {
              "height": 416,
              "width": 796,
              "focus_rects": [
                {
                  "x": 27,
                  "y": 0,
                  "w": 743,
                  "h": 416
                },
                {
                  "x": 190,
                  "y": 0,
                  "w": 416,
                  "h": 416
                },
                {
                  "x": 216,
                  "y": 0,
                  "w": 365,
                  "h": 416
                },
                {
                  "x": 294,
                  "y": 0,
                  "w": 208,
                  "h": 416
                },
                {
                  "x": 0,
                  "y": 0,
                  "w": 796,
                  "h": 416
                }
              ]
            }
          }
        ]
      },
      "favorite_count": 0,
      "favorited": false,
      "full_text": "A new blog post has been published! \n\nTop 10 Web Scraping Packages for Python \ud83e\udd16\n\nCheckout it out \ud83d\udc47\nhttps://t.co/d2iFdAV2LJ https://t.co/zLjDlxdKee",
      "is_quote_status": false,
      "lang": "en",
      "possibly_sensitive": false,
      "possibly_sensitive_editable": true,
      "quote_count": 0,
      "reply_count": 0,
      "retweet_count": 0,
      "retweeted": false,
      "user_id_str": "1310623081300402178",
      "id_str": "1664267318053179398"
    },
    "quick_promote_eligibility": {
      "eligibility": "IneligibleUserUnauthorized"
    }
  },
  "replies": [],
  "other": []
}

Here, we loaded the tweet page using a headless browser and captured all of the background requests. Then, we selected the ones that contain the tweet data.

One important note here: we need to wait for the page to load, which is indicated by tweets appearing in the page HTML; otherwise, we'd return our result before the background requests have finished.
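
If the selector never appears (for example, when the post was deleted or the page got blocked), wait_for_selector raises a timeout error. As a rough sketch, the wait line inside scrape_tweet could be wrapped like this:

from playwright.sync_api import TimeoutError as PlaywrightTimeout

try:
    # wait up to 15 seconds for tweets to render before reading XHR calls
    page.wait_for_selector("[data-testid='tweet']", timeout=15_000)
except PlaywrightTimeout:
    # tweets never rendered - likely blocked, a deleted post or a slow network
    raise ValueError(f"tweet page did not load: {url}")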

This resulted in a massive JSON dataset that can be difficult to work with. So, let's take a look at how to reduce it with a bit of JSON parsing next.

Parsing Tweet Dataset

The tweet dataset we scraped contains a lot of complex data, so let's reduce it to something cleaner and simpler using the JMESPath JSON parsing library.

For this, we'll be using jmespath's JSON reshaping feature which allows us to rename keys and flatten nested objects:

import jmespath
from typing import Dict

def parse_tweet(data: Dict) -> Dict:
    """Parse Twitter tweet JSON dataset for the most important fields"""
    result = jmespath.search(
        """{
        created_at: legacy.created_at,
        attached_urls: legacy.entities.urls[].expanded_url,
        attached_urls2: legacy.entities.url.urls[].expanded_url,
        attached_media: legacy.entities.media[].media_url_https,
        tagged_users: legacy.entities.user_mentions[].screen_name,
        tagged_hashtags: legacy.entities.hashtags[].text,
        favorite_count: legacy.favorite_count,
        bookmark_count: legacy.bookmark_count,
        quote_count: legacy.quote_count,
        reply_count: legacy.reply_count,
        retweet_count: legacy.retweet_count,
        text: legacy.full_text,
        is_quote: legacy.is_quote_status,
        is_retweet: legacy.retweeted,
        language: legacy.lang,
        user_id: legacy.user_id_str,
        id: legacy.id_str,
        conversation_id: legacy.conversation_id_str,
        source: source,
        views: views.count
    }""",
        data,
    )
    result["poll"] = {}
    poll_data = jmespath.search("card.legacy.binding_values", data) or []
    for poll_entry in poll_data:
        key, value = poll_entry["key"], poll_entry["value"]
        if "choice" in key:
            result["poll"][key] = value["string_value"]
        elif "end_datetime" in key:
            result["poll"]["end"] = value["string_value"]
        elif "last_updated_datetime" in key:
            result["poll"]["updated"] = value["string_value"]
        elif "counts_are_final" in key:
            result["poll"]["ended"] = value["boolean_value"]
        elif "duration_minutes" in key:
            result["poll"]["duration"] = value["string_value"]
    user_data = jmespath.search("core.user_results.result", data)
    if user_data:
        result["user"] = parse_user(user_data)
    return result

Above, we're using jmespath to reshape the giant nested dataset scraped from X.com's GraphQL backend into a flat dictionary containing only the most important fields.
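
For example, chaining the scraper and the parser together looks like this (a minimal usage sketch, assuming scrape_tweet from above and the parse_user helper sketched in the profile section below are defined in the same module):

if __name__ == "__main__":
    raw = scrape_tweet("https://twitter.com/Scrapfly_dev/status/1664267318053179398")
    tweet = parse_tweet(raw)
    print(tweet["text"])                 # full tweet text
    print(tweet["favorite_count"])       # like count
    print(tweet["user"]["screen_name"])  # parsed author details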

Scraping X.com User Profiles

To scrape X.com profile pages we'll be using the same background request capturing approach though this time we'll be capturing UserBy endpoints.

We'll be using the same technique we used to scrape X posts - launch a headless browser, enable background request capture, load the page and get the data requests:

Playwright:
from playwright.sync_api import sync_playwright


def scrape_profile(url: str) -> dict:
    """
    Scrape a X.com profile details e.g.: https://x.com/Scrapfly_dev
    """
    _xhr_calls = []

    def intercept_response(response):
        """capture all background requests and save them"""
        # we can extract details from background requests
        if response.request.resource_type == "xhr":
            _xhr_calls.append(response)
        return response

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()

        # enable background request intercepting:
        page.on("response", intercept_response)
        # go to url and wait for the page to load
        page.goto(url)
        page.wait_for_selector("[data-testid='primaryColumn']")

        # find all user data background requests:
        user_calls = [f for f in _xhr_calls if "UserBy" in f.url]
        for xhr in user_calls:
            data = xhr.json()
            return data['data']['user']['result']



if __name__ == "__main__":
    print(scrape_profile("https://x.com/Scrapfly_dev"))
ScrapFly SDK:

import asyncio
import json

from scrapfly import ScrapeConfig, ScrapflyClient

SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")

async def scrape_profile(url: str) -> dict:
    """
    Scrape a X.com profile details e.g.: https://x.com/Scrapfly_dev
    """
    result = await SCRAPFLY.async_scrape(ScrapeConfig(
        url, 
        render_js=True,  # enable headless browser
        wait_for_selector="[data-testid='primaryColumn']"  # wait for page to finish loading 
    ))
    # capture background requests and extract ones that request user profile data
    _xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
    user_calls = [f for f in _xhr_calls if "UserBy" in f["url"]]
    for xhr in user_calls:
        if not xhr["response"]:
            continue
        data = json.loads(xhr["response"]["body"])
        return data['data']['user']['result']

if __name__ == "__main__":
    print(asyncio.run(scrape_profile("https://x.com/Scrapfly_dev")))
Example Output
{
  "__typename": "User",
  "id": "VXNlcjoxMzEwNjIzMDgxMzAwNDAyMTc4",
  "rest_id": "1310623081300402178",
  "affiliates_highlighted_label": {},
  "is_blue_verified": true,
  "profile_image_shape": "Circle",
  "legacy": {
    "created_at": "Mon Sep 28 16:51:22 +0000 2020",
    "default_profile": true,
    "default_profile_image": false,
    "description": "Web Scraping API - turn any website into a database!\n\nScrapFly allows you to quickly achieve your data goals without web scraping challenges and errors.",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {
        "urls": [
          {
            "display_url": "scrapfly.io",
            "expanded_url": "https://scrapfly.io",
            "url": "https://t.co/1Is3k6KzyM",
            "indices": [
              0,
              23
            ]
          }
        ]
      }
    },
    "fast_followers_count": 0,
    "favourites_count": 26,
    "followers_count": 163,
    "friends_count": 993,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 2,
    "location": "Paris",
    "media_count": 11,
    "name": "Scrapfly",
    "normal_followers_count": 163,
    "pinned_tweet_ids_str": [],
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/1310623081300402178/1601320645",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1310658795715076098/XedZDwC7_normal.jpg",
    "profile_interstitial_type": "",
    "screen_name": "Scrapfly_dev",
    "statuses_count": 56,
    "translator_type": "none",
    "url": "https://t.co/1Is3k6KzyM",
    "verified": false,
    "withheld_in_countries": []
  },
  "business_account": {},
  "highlights_info": {
    "can_highlight_tweets": true,
    "highlighted_tweets": "0"
  },
  "creator_subscriptions_count": 0
}
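
The profile dataset is also verbose, so it can be reduced with the same jmespath reshaping approach we used for tweets. Below is a minimal sketch of the parse_user helper referenced earlier in parse_tweet; the field selection is illustrative and can be adjusted to your needs:

import jmespath
from typing import Dict

def parse_user(data: Dict) -> Dict:
    """Parse X.com user profile JSON dataset for the most important fields"""
    return jmespath.search(
        """{
        id: rest_id,
        name: legacy.name,
        screen_name: legacy.screen_name,
        description: legacy.description,
        location: legacy.location,
        created_at: legacy.created_at,
        followers_count: legacy.followers_count,
        friends_count: legacy.friends_count,
        favourites_count: legacy.favourites_count,
        statuses_count: legacy.statuses_count,
        media_count: legacy.media_count,
        listed_count: legacy.listed_count,
        profile_image_url: legacy.profile_image_url_https,
        profile_banner_url: legacy.profile_banner_url,
        verified: legacy.verified,
        is_blue_verified: is_blue_verified
    }""",
        data,
    )

For example, parse_user(scrape_profile("https://x.com/Scrapfly_dev")) returns a flat dictionary with the follower counts, description and profile URLs.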

Bypass X.com Blocking with ScrapFly

If we start scraping X.com at scale, we quickly run into blocking, as X.com doesn't allow automated requests and will block scraper IP addresses after just a few requests.

To get around this Scrapfly can lend a hand!


ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

For example, to use ScrapFly with Python we can take advantage of the Python SDK:

from scrapfly import ScrapflyClient, ScrapeConfig
scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")

result = scrapfly.scrape(ScrapeConfig(
    "https://twitter.com/Scrapfly_dev",
    # we can enable features like:
    # cloud headless browser use
    render_js=True,  
    # anti scraping protection bypass
    asp=True, 
    # screenshot taking
    screenshots={"all": "fullpage"},
    # proxy country selection
    country="US",
))

For more on using ScrapFly to scrape Twitter see the Full Scraper Code section.

Scraping X.com Search, Replies and Timelines

In this tutorial, we've covered how to scrape X.com posts and profiles that are publicly available for everyone.

Other areas, like search and timeline pages, are not publicly available and require a login to access, which can lead to account suspension.

X.com does offer a public guest preview access for timelines and tweet search but only for Android devices. This is the only way to scrape X.com timelines, tweet replies and search without login.

The most reliable and up-to-date source for this is Nitter.net, an alternative open source Twitter front-end. For more on that, we recommend following the Nitter guest account branch on GitHub.

FAQ

To wrap up this Python Twitter scraper let's take a look at some frequently asked questions regarding web scraping Twitter:

Is it legal to scrape X.com?

Yes, all of the data on X.com is publicly available, so it's perfectly legal to scrape. However, note that some tweets can contain copyrighted material like images or videos, and using that data commercially can be illegal.

How to scrape X.com without getting blocked?

X.com is a complex, JavaScript-heavy website that is hostile to web scraping, so it's easy to get blocked. To avoid this you can use ScrapFly, which provides anti-scraping technology bypass and proxy rotation. Alternatively, see our article on how to avoid web scraper blocking.

Is it legal to scrape X.com while logged in?

The legality of scraping X.com while logged in is a bit of a grey area. Generally, logging in binds the user to the website's terms of service, which in X's case forbid automated scraping. This allows X to suspend your account or even take legal action. It's best to avoid scraping X.com while logged in if possible.

How to reduce bandwidth use and speed up X scraping?

If you're using browser automation tools like Playwright (used in this article), you can block images and other unnecessary resources to save bandwidth and speed up scraping, as shown in the sketch below.
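
For example, here's a minimal sketch using Playwright's request routing to abort image, media and font requests (the block list is an assumption; tune it to your needs as some pages may need certain resources to render):

from playwright.sync_api import sync_playwright

BLOCK_RESOURCE_TYPES = ["image", "media", "font"]

def block_heavy_resources(route):
    """abort requests for heavy resources we don't need for data extraction"""
    if route.request.resource_type in BLOCK_RESOURCE_TYPES:
        return route.abort()
    return route.continue_()

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=False)
    page = browser.new_page()
    # intercept every request and drop the ones matching the block list
    page.route("**/*", block_heavy_resources)
    page.goto("https://x.com/Scrapfly_dev")
    page.wait_for_selector("[data-testid='primaryColumn']")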

Latest X.com Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

X.com Scraping Summary

In this tutorial, we made a short scraper for Twitter (now known as X.com) using Python headless browsers through Playwright or the Scrapfly SDK.

To start, we've taken a look at how X.com works and identified where the data is located. We found that X is using background requests to populate post and profile data.

To capture and scrape these background requests, we used the response interception feature of Playwright or the Scrapfly SDK and parsed the raw datasets into clean JSON results using jmespath.

Finally, to avoid blocking we've taken a look at the ScrapFly web scraping API, which provides a simple way to scrape Twitter at scale using proxies and anti-scraping technology bypass. Try out ScrapFly for free!
