How to Scrape Instagram in 2024

In this Python web scraping tutorial, we'll explain how to scrape Instagram - one of the most popular social media websites out there. We'll look at how to create an Instagram scraper to extract data from Instagram profiles and post pages.

We'll focus on utilizing the unofficial Instagram API for scraping Instagram data. We'll also cover Instagram's scraper blocking and how to extract data without logging in. Let's dive in!

Latest Instagram Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape Instagram?

The amount of public data on Instagram is significant, allowing for various insights. Businesses can scrape Instagram data for lead generation, reaching out to popular Instagram profiles with similar interests to attract new customers.

Moreover, scraped Instagram data is a viable resource for performing sentiment analysis research. This data is found in posts and comments, which can be used to gather public opinions on specific trends and news.

For further details on scraping Instagram data use cases, refer to our dedicated guide.

Project Setup

To create an Instagram scraping tool, we'll be using Python with a few community packages:

  • httpx: For sending HTTP requests to our target website and retrieving the responses.
  • JMESPath: To parse the scraped Instagram JSON data and reduce its size.

For Scrapfly users, we'll also be including a version of each code snippet using the ScrapFly Python SDK.

Note - Login Requirement

Many Instagram endpoints require login, though not all. Our Instagram scraper will only cover endpoints that don't require login and are publicly accessible to everyone.

Web scraping Instagram through login can have many unintended consequences, from your account being blocked to Instagram taking legal action for explicitly breaking its Terms of Service. As noted in this tutorial, login is often unnecessary, so let's look at how to scrape Instagram without having to log in and risk suspension.

How to Scrape Instagram Profile Pages?

Let's start with scraping Instagram user profiles. For this, we'll use Instagram's backend API endpoint, which gets triggered when browsing the profile URL. For example, here's Google's Instagram profile page:

Screenshot: a preview of Google's Instagram profile page

This endpoint is called on page load and returns a JSON dataset with the profile's data. We can use this endpoint to scrape Instagram profile data without having to log in to Instagram:

Python
import json
import httpx

client = httpx.Client(
    headers={
        # this is the internal ID of an Instagram backend app. It doesn't change often.
        "x-ig-app-id": "936619743392459",
        # use browser-like features
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "*/*",
    }
)


def scrape_user(username: str):
    """Scrape Instagram user's data"""
    result = client.get(
        f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
    )
    data = json.loads(result.content)
    return data["data"]["user"]

print(scrape_user("google"))

ScrapFly
import asyncio
import json
from typing import Dict
from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")
BASE_CONFIG = {
    # Instagram.com requires Anti Scraping Protection bypass feature:
    # for more: https://scrapfly.io/docs/scrape-api/anti-scraping-protection
    "asp": True,
    "country": "CA",
}
INSTAGRAM_APP_ID = "936619743392459"  # this is the public app id for instagram.com


async def scrape_user(username: str) -> Dict:
    """Scrape instagram user's data"""
    result = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url=f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
            headers={"x-ig-app-id": INSTAGRAM_APP_ID},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    # parse_user is defined in the "Parsing Instagram Profile Data" section below
    return parse_user(data["data"]["user"])


print(asyncio.run(scrape_user("google")))

The above snippet can scrape Instagram profiles. It starts by initiating an httpx client with browser-like headers to reduce the chances of getting blocked. Then, it uses the defined client to request the Instagram profile API and retrieve the data as JSON.

The above code can extract Instagram data such as the bio description, follower counts, profile pictures, etc.:

Example Output
{
  "biography": "Google unfiltered—sometimes with filters.",
  "external_url": "https://linkin.bio/google",
  "external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=ATOaH1Vrx_TkkMUhpCCh1_PM-C1k5t35gAtJ0eBjTPE84RItj-cCFdqRoRHwlbiCSrB5G_v6MgjePl1SQN4vTw&s=1",
  "edge_followed_by": {
    "count": 13015078
  },
  "fbid": "17841401778116675",
  "edge_follow": {
    "count": 33
  },
  "full_name": "Google",
  "highlight_reel_count": 5,
  "id": "1067259270",
  "is_business_account": true,
  "is_professional_account": true,
  "is_supervision_enabled": false,
  "is_guardian_of_viewer": false,
  "is_supervised_by_viewer": false,
  "is_embeds_disabled": false,
  "is_joined_recently": false,
  "guardian_id": null,
  "is_verified": true,
  "profile_pic_url": "https://instagram.furt1-1.fna.fbcdn.net/v/t51.2885-19/126151620_3420222801423283_6498777152086077438_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.furt1-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=bmDCZ2Q8wTkAX-Ilbqq&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9pRKzLtnysPjhclN6TprCd9FBWo2ABbn9cRICPhbQZcA&oe=62882D44&_nc_sid=7bff83",
  "username": "google",
  ...
}

Great! Our Instagram data scraper can extract profile data - it even includes details of the first 12 posts, including photos and videos!
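
For example, the embedded posts can be accessed under the timeline media edges of the returned dataset - a minimal sketch, assuming the scrape_user function defined above:

# the profile dataset embeds the latest posts under this edge
user = scrape_user("google")
for edge in user["edge_owner_to_timeline_media"]["edges"]:
    print(edge["node"]["shortcode"])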

Parsing Instagram Profile Data

The profile dataset we scraped is quite comprehensive and contains many fields we don't really need. To reduce it to the most important bits, we can use JMESPath:

import jmespath
from typing import Dict

def parse_user(data: Dict) -> Dict:
    """Parse instagram user's hidden web dataset for user's data"""
    print("parsing user data", data["username"])
    result = jmespath.search(
        """{
        name: full_name,
        username: username,
        id: id,
        category: category_name,
        business_category: business_category_name,
        phone: business_phone_number,
        email: business_email,
        bio: biography,
        bio_links: bio_links[].url,
        homepage: external_url,        
        followers: edge_followed_by.count,
        follows: edge_follow.count,
        facebook_id: fbid,
        is_private: is_private,
        is_verified: is_verified,
        profile_image: profile_pic_url_hd,
        video_count: edge_felix_video_timeline.count,
        videos: edge_felix_video_timeline.edges[].node.{
            id: id, 
            title: title,
            shortcode: shortcode,
            thumb: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            duration: video_duration
        },
        image_count: edge_owner_to_timeline_media.count,
        images: edge_owner_to_timeline_media.edges[].node.{
            id: id, 
            title: title,
            shortcode: shortcode,
            src: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            accessibility_caption: accessibility_caption,
            duration: video_duration
        },
        saved_count: edge_saved_media.count,
        collections_count: edge_saved_media.count,
        related_profiles: edge_related_profiles.edges[].node.username
    }""",
        data,
    )
    return result

The above Instagram parsing logic takes in the full dataset and reduces it to a flatter structure that contains only the important fields. We use JMESPath to reshape the raw scraped Instagram data into refined, structured data.
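
For example, the two steps can be chained together - here's a minimal sketch, assuming the scrape_user and parse_user functions from the snippets above are in scope:

import json

# scrape the raw profile dataset and reduce it with parse_user
profile = parse_user(scrape_user("google"))
print(json.dumps(profile, indent=2, ensure_ascii=False))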

For further details on JMESPath, refer to our dedicated guide.

Quick Intro to Parsing JSON with JMESPath in Python

Learn how to use JMESPath to filter and refine JSON datasets when scraping to exclude tons of unnecessary details.


How to Scrape Instagram Posts?

Let's extend our Instagram scraper's capabilities to collect data from post pages. For this, we'll utilize the Instagram post endpoint this time.

Instagram uses GraphQL to generate post views dynamically using a backend query. This endpoint returns post data, including comments and likes, as well as other details. Hence, we'll utilize this GraphQL endpoint to scrape Instagram post data.

Below is the GraphQL endpoint for retrieving post data:

https://www.instagram.com/graphql/query

Every GraphQL request requires an HTTP body. For scraping post pages, only the below values are required:

import json
from urllib.parse import quote

INSTAGRAM_DOCUMENT_ID = "8845758582119845"  # constant doc ID for instagram.com post documents
shortcode = "CJ9KxZ2l8jT"  # the post ID

variables = {
    'shortcode': shortcode, 'fetch_tagged_user_count': None,
    'hoisted_comment_id': None, 'hoisted_reply_id': None
}
variables = quote(json.dumps(variables, separators=(',', ':')))
body = f"variables={variables}&doc_id={INSTAGRAM_DOCUMENT_ID}"

From the above details, we can conclude that the post ID (shortcode) is the only variable required to scrape Instagram post pages through POST requests to the GraphQL endpoint. Let's add this functionality to our Instagram scraper:

Python
import httpx
import json
from typing import Dict
from urllib.parse import quote

INSTAGRAM_DOCUMENT_ID = "8845758582119845"  # constant doc ID for instagram.com post documents


def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")

    variables = quote(json.dumps({
        'shortcode':shortcode,'fetch_tagged_user_count':None,
        'hoisted_comment_id':None,'hoisted_reply_id':None
    }, separators=(',', ':')))
    body = f"variables={variables}&doc_id={INSTAGRAM_DOCUMENT_ID}"
    url = "https://www.instagram.com/graphql/query"

    result = httpx.post(
        url=url,
        headers={"content-type": "application/x-www-form-urlencoded"},
        data=body
    )
    data = json.loads(result.content)
    return data["data"]["xdt_shortcode_media"]

# Example usage:
posts = scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/")

# save a JSON file
with open("result.json", "w",encoding="utf-8") as f:
    json.dump(posts, f, indent=2, ensure_ascii=False)

ScrapFly
import json
from typing import Dict

from urllib.parse import quote
from scrapfly import ScrapeConfig, ScrapflyClient

SCRAPFLY = ScrapflyClient("Your ScrapFly API Key")

INSTAGRAM_DOCUMENT_ID = "8845758582119845"  # constant doc ID for instagram.com post documents
BASE_CONFIG = {
    "asp": True, # bypass anti-bots
}


def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")

    variables = quote(json.dumps({
        'shortcode':shortcode,'fetch_tagged_user_count':None,
        'hoisted_comment_id':None,'hoisted_reply_id':None
    }, separators=(',', ':')))
    body = f"variables={variables}&doc_id={INSTAGRAM_DOCUMENT_ID}"
    url = "https://www.instagram.com/graphql/query"

    result = SCRAPFLY.scrape(
        ScrapeConfig(
            url=url,
            method="POST",
            body=body,
            headers={"content-type": "application/x-www-form-urlencoded"},
            **BASE_CONFIG
        )
    )
    
    data = json.loads(result.content)
    return data["data"]["xdt_shortcode_media"]    


# Example usage:
posts = scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/")

# save a JSON file
with open("result.json", "w",encoding="utf-8") as f:
    json.dump(posts, f, indent=2, ensure_ascii=False)

The above Instagram scraper code returns the entire post dataset, including fields such as post captions, comments, and likes. However, it also includes many flags and unnecessary fields that aren't very useful.

To reduce the collected data points' size, we'll parse it with JMESPath.

Parsing Instagram Post Data

Instagram post data is even more complex than the user profile data. Therefore, we'll use JMESPath to create an even more comprehensive Instagram parser to reduce its size:

import jmespath
from typing import Dict

def parse_post(data: Dict) -> Dict:
    """Reduce post dataset to the most important fields"""
    print("parsing post data", data.get("shortcode"))
    result = jmespath.search("""{
        id: id,
        shortcode: shortcode,
        dimensions: dimensions,
        src: display_url,
        src_attached: edge_sidecar_to_children.edges[].node.display_url,
        has_audio: has_audio,
        video_url: video_url,
        views: video_view_count,
        plays: video_play_count,
        likes: edge_media_preview_like.count,
        location: location.name,
        taken_at: taken_at_timestamp,
        related: edge_web_media_to_related_media.edges[].node.shortcode,
        type: product_type,
        video_duration: video_duration,
        music: clips_music_attribution_info,
        is_video: is_video,
        tagged_users: edge_media_to_tagged_user.edges[].node.user.username,
        captions: edge_media_to_caption.edges[].node.text,
        related_profiles: edge_related_profiles.edges[].node.username,
        comments_count: edge_media_to_parent_comment.count,
        comments_disabled: comments_disabled,
        comments_next_page: edge_media_to_parent_comment.page_info.end_cursor,
        comments: edge_media_to_parent_comment.edges[].node.{
            id: id,
            text: text,
            created_at: created_at,
            owner: owner.username,
            owner_verified: owner.is_verified,
            viewer_has_liked: viewer_has_liked,
            likes: edge_liked_by.count
        }
    }""", data)
    return result

Similar to the previous parse_user function, we define the desired fields within our parsing logic to only extract the valuable data from our Instagram scraper. Note that different post types (reels, images, videos, etc.) have different fields available.
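
Here's how the scraping and parsing steps fit together - a minimal sketch, assuming the scrape_post and parse_post functions defined above:

import json

# scrape a post and reduce the dataset to the parsed fields
post = parse_post(scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/"))
print(json.dumps(post, indent=2, ensure_ascii=False))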

How to Scrape Instagram For All User Posts

In this section, we'll extract post and comment data from user profiles. For this, we'll utilize another GraphQL endpoint, which requires three variables: the user's ID (which we got from scraping the Instagram user profile earlier), the page size, and the page offset cursor:

{
  "id": "NUMERIC USER ID",
  "first": 12,
  "after": "CURSOR ID FOR PAGING"
}

As an example, let's scrape Instagram for all the post data created by the Google profile. First, we have to retrieve the user ID and then compile our GraphQL request.
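
To illustrate, here's a minimal sketch of how these variables get URL-encoded into the GraphQL query (using the same public query_hash shown below):

import json
from urllib.parse import quote

# variables for the profile posts GraphQL query
variables = {"id": "1067259270", "first": 12, "after": None}
url = (
    "https://www.instagram.com/graphql/query/"
    "?query_hash=e769aa130647d2354c40ea6a439bfc08"
    f"&variables={quote(json.dumps(variables))}"
)
print(url)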

Screenshot: Google's Instagram page - we can access all of this post data in JSON format

In this example context, the GraphQL query is the following:

https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={"id":"1067259270","first":12}

Opening the above URL in the browser will return the JSON data from Instagram. It includes the first 12 posts' data, with the following details:

  • Post photos and videos
  • The first page of the post's comments
  • Post metadata such as view and comment counts

However, to extract all the posts, our Instagram scraper needs pagination logic to request multiple pages. Here's how to create such logic:

Python
import json
import httpx
from urllib.parse import quote

def scrape_user_posts(user_id: str, session: httpx.Client, page_size=12, max_pages: int = None):
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        resp = session.get(base_url + quote(json.dumps(variables)))
        data = resp.json()
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])  # note: we're using parse_post function from previous chapter
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        variables["after"] = page_info["end_cursor"]
        _page_number += 1     
        if max_pages and _page_number > max_pages:
            break


# Example run:
if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        posts = list(scrape_user_posts("1067259270", session, max_pages=3))
        print(json.dumps(posts, indent=2, ensure_ascii=False))

ScrapFly
import json
from typing import Optional
from urllib.parse import quote

from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient("YOUR SCRAPFLY KEY")
BASE_CONFIG = {"asp": True}  # bypass Instagram's anti-bot blocking

async def scrape_user_posts(user_id: str, page_size=50, max_pages: Optional[int] = None):
    """Scrape all posts of an instagram user of given numeric user id"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        url = base_url + quote(json.dumps(variables))
        result = await SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
        data = json.loads(result.content)
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping posts page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        variables["after"] = page_info["end_cursor"]
        _page_number += 1
        if max_pages and _page_number > max_pages:
            break

Example Output
[
  {
  "__typename": "GraphImage",
  "id": "2890253001563912589",
  "dimensions": {
    "height": 1080,
    "width": 1080
  },
  "display_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-C93CjLzMapgPHOinoltBXypU_wi7s6zzLj1th-s9p-Q&oe=62E80627&_nc_sid=86f79a",
  "display_resources": [
    {
      "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
      "config_width": 640,
      "config_height": 640
    },
    "..."
  ],
  "is_video": false,
  "tracking_token": "eyJ2ZXJzaW9uIjo1LCJwYXlsb2FkIjp7ImlzX2FuYWx5dGljc190cmFja2VkIjp0cnVlLCJ1dWlkIjoiOWJiNzUyMjljMjU2NDExMTliOGI4NzM5MTE2Mjk4MTYyODkwMjUzMDAxNTYzOTEyNTg5In0sInNpZ25hdHVyZSI6IiJ9",
  "edge_media_to_tagged_user": {
    "edges": [
      {
        "node": {
          "user": {
            "full_name": "Jahmar Gale | Data Analyst",
            "id": "51661809026",
            "is_verified": false,
            "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/284007837_5070066053047326_6283083692098566083_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=KXI8oOdZRb4AX8w28nr&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-4iYsawdTCHI5a2zD_PF9F-WCyKnTIPuvYwVAQo82l_w&oe=62E7609B&_nc_sid=86f79a",
            "username": "datajayintech"
          },
          "x": 0.68611115,
          "y": 0.32222223
        }
      },
      "..."
    ]
  },
  "accessibility_caption": "A screenshot of a tweet from @DataJayInTech, which says: \"A recruiter just called me and said The Google Data Analytics Certificate is a good look. This post is to encourage YOU to finish the course.\" The background of the image is red with white, yellow, and blue geometric shapes.",
  "edge_media_to_caption": {
    "edges": [
      {
        "node": {
          "text": "Ring, ring — opportunity is calling📱\nStart your Google Career Certificate journey at the link in bio. #GrowWithGoogle"
        }
      },
      "..."
    ]
  },
  "shortcode": "CgcPcqtOTmN",
  "edge_media_to_comment": {
    "count": 139,
    "page_info": {
      "has_next_page": true,
      "end_cursor": "QVFCaU1FNGZiNktBOWFiTERJdU80dDVwMlNjTE5DWTkwZ0E5NENLU2xLZnFLemw3eTJtcU54ZkVVS2dzYTBKVEppeVpZbkd4dWhQdktubW1QVzJrZXNHbg=="
    },
    "edges": [
      {
        "node": {
          "id": "18209382946080093",
          "text": "@google your company is garbage for meddling with supposedly fair elections...you have been exposed",
          "created_at": 1658867672,
          "did_report_as_spam": false,
          "owner": {
            "id": "39246725285",
            "is_verified": false,
            "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/115823005_750712482350308_4191423925707982372_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=104&_nc_ohc=4iOCWDHJLFAAX-JFPh7&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9sH7npBTmHN01BndUhYVreHOk63OqZ5ISJlzNou3QD8A&oe=62E87360&_nc_sid=86f79a",
            "username": "bud_mcgrowin"
          },
          "viewer_has_liked": false
        }
      },
      "..."
    ]
  },
  "edge_media_to_sponsor_user": {
    "edges": []
  },
  "comments_disabled": false,
  "taken_at_timestamp": 1658765028,
  "edge_media_preview_like": {
    "count": 9251,
    "edges": []
  },
  "gating_info": null,
  "fact_check_overall_rating": null,
  "fact_check_information": null,
  "media_preview": "ACoqbj8KkijDnBOfpU1tAkis8mcL2H0zU8EMEqh1Dc56H0/KublclpoejKoo3WtylMgQ4HeohW0LKJ+u7PueaX+z4v8Aa/OmoNJJ6kqtG3UxT0pta9xZRxxswzkDjJrIoatuawkpq6NXTvuN9f6VdDFeAMAdsf8A16oWDKFYMQMnuR6e9Xd8f94fmtax2OGqnzsk3n/I/wDsqN7f5H/2VR74/wC8PzWlEkY7g/iv+NVcys+wy5JML59P89zWDW3dSx+UwGMnjjH9KxKynud1BWi79wpQM+g+tJRUHQO2+4pCuO4pKKAFFHP+RSUUgP/Z",
  "owner": {
    "id": "1067259270",
    "username": "google"
  },
  "location": null,
  "viewer_has_liked": false,
  "viewer_has_saved": false,
  "viewer_has_saved_to_collection": false,
  "viewer_in_photo_of_you": false,
  "viewer_can_reshare": true,
  "thumbnail_src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
  "thumbnail_resources": [
    {
      "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9nmASHsbmNWUQnwOdkGE4PvE8b27MqK-gbj5z0YLu8qg&oe=62E80627&_nc_sid=86f79a",
      "config_width": 150,
      "config_height": 150
    },
    "..."
  ]
},
...
]

Instagram Scraping Exercise: Gathering Mentions

Our Instagram scraper can successfully retrieve all the profile posts. Let's use it to extract hashtag mentions of a profile.

For this, we'll scrape Instagram post data, extract each mentioned hashtag, and then group the results:

import re
from collections import Counter

import httpx

def scrape_hashtag_mentions(user_id, session: httpx.Client, max_pages: int = None):
    """find all hashtags a user mentioned in their posts"""
    hashtags = Counter()
    hashtag_pattern = re.compile(r"#(\w+)")
    for post in scrape_user_posts(user_id, session=session, max_pages=max_pages):
        # captions can be missing on some posts
        desc = '\n'.join(post['captions'] or [])
        found = hashtag_pattern.findall(desc)
        for tag in found:
            hashtags[tag] += 1
    return hashtags
Run Code & Example Output
import json
import httpx

if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        # if we only know the username but not user id we can scrape
        # the user profile to find the id:
        user_id = scrape_user("google")["id"]  # will result in: 1067259270
        # then we can scrape the hashtag profile
        hashtags = scrape_hashtag_mentions(user_id, session, max_pages=5)
        # order results and print them as JSON:
        print(json.dumps(dict(hashtags.most_common()), indent=2, ensure_ascii=False))
{
    "MadeByGoogle": 10,
    "TeamPixel": 5,
    "GrowWithGoogle": 4,
    "Pixel7": 3,
    "LifeAtGoogle": 3,
    "SaferWithGoogle": 3,
    "Pixel6a": 3,
    "DoodleForGoogle": 2,
    "MySuperG": 2,
    "ShotOnPixel": 1,
    "DayInTheLife": 1,
    "DITL": 1,
    "GoogleAustin": 1,
    "Austin": 1,
    "NestWifi": 1,
    "NestDoorbell": 1,
    "GoogleATAPAmbientExperiments": 1,
    "GoogleATAPxKOCHE": 1,
    "SoliATAP": 1,
    "GooglePixelWatch": 1,
    "Chromecast": 1,
    "DooglersAroundTheWorld": 1,
    "GoogleSearch": 1,
    "GoogleSingapore": 1,
    "InternationalDogDay": 1,
    "Doogler": 1,
    "BlackBusinessMonth": 1,
    "PixelBuds": 1,
    "HowTo": 1,
    "Privacy": 1,
    "Settings": 1,
    "GoogleDoodle": 1,
    "NationalInternDay": 1,
    "GoogleInterns": 1,
    "Sushi": 1,
    "StopMotion": 1,
    "LetsInternetBetter": 1
}

With this simple analytics script, we've scraped profile hashtags, which can be used to determine the interests of a public Instagram account.
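
To keep these results around for further analysis, here's a minimal sketch that writes the counter to a CSV file (the hashtags.csv filename is just an example; the hashtags variable comes from the previous snippet):

import csv

# write hashtag counts to a CSV file, most common first
with open("hashtags.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["hashtag", "mentions"])
    writer.writerows(hashtags.most_common())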

Avoid Instagram Scraping Blocking with ScrapFly

Web scraping Instagram can be straightforward. However, Instagram restricts access to its publicly available data: it only allows a few requests per day for non-logged-in users, and exceeding this limit redirects requests to a login page:

Screenshot: Instagram scraping blocked with a login redirect

To avoid Instagram scraper blocking, we'll take advantage of the ScrapFly API, which handles the blocking bypass for us!


ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

Here is how we can empower our Instagram scraper with the ScrapFly API. All we have to do is replace our HTTP client with the ScrapFly client, enable the asp parameter, and select a proxy country:

# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some instagram.com URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="website URL",
    asp=True, # enable the anti scraping protection to bypass blocking
    country="US", # set the proxy location to a specific country
    render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))

# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping instagram.com:

Are there public APIs for Instagram?

At the time of writing, Instagram doesn't provide APIs for public use. However, as we've seen, we can utilize the hidden Instagram APIs to build a fast and efficient Instagram scraper.

Is it legal to scrape Instagram?

Yes. Instagram's data is publicly available, so scraping instagram.com at slow, respectful rates falls under the ethical scraping definition. However, when working with personal data, we need to be aware of local copyright and user data laws, like the GDPR in the EU. For more details, refer to our guide on web scraping legality.

How to get Instagram user ID from username?

To get the private user ID from the public username, we can scrape the user profile using our scrape_user function; the private ID will be located in the id field:

with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
    user_id = scrape_user('google')['id']
    print(user_id)

How to get Instagram username from user ID?

To get the public username from Instagram's private user ID, we can take advantage of the public iPhone API https://i.instagram.com/api/v1/users/<USER_ID>/info/:

import httpx
iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90 (iPhone9,4; iOS 10_3_3; en_US; en-US; scale=2.61; gamut=wide; 1080x1920"
resp = httpx.get(iphone_api.format("1067259270"), headers={"User-Agent": iphone_user_agent})
print(resp.json()['user']['username'])

Is the magic parameter __a=1 no longer working?

Instagram has been rolling out new changes and slowly retiring this feature. However, in this article, we've covered two alternatives to the ?__a=1 feature: the /v1/ API endpoints and the GraphQL endpoints, which perform even better!


Web Scraping Instagram - Summary

In this Instagram scraping tutorial, we've taken a look at how to easily scrape Instagram using Python and hidden API endpoints.

We've scraped user profile pages containing user details, posts, meta information, and each individual post data. To reduce scraped datasets, we used the JMESPath parsing library.

Finally, we have explored scaling our Instagram web scraper using the ScrapFly web scraping API, which bypasses Instagram scraping blocking.

Related Posts

How to Scrape Reddit Posts, Subreddits and Profiles

In this article, we'll explore how to scrape Reddit. We'll extract various social data types from subreddits, posts, and user pages. All of which through plain HTTP requests without headless browser usage.

How to Scrape LinkedIn in 2024

In this scrape guide we'll be taking a look at one of the most popular web scraping targets - LinkedIn.com. We'll be scraping people profiles, company profiles as well as job listings and search.

How to Scrape SimilarWeb Website Traffic Analytics

In this guide, we'll explain how to scrape SimilarWeb through a step-by-step guide. We'll scrape comprehensive website traffic insights, websites comparing data, sitemaps, and trending industry domains.