What if you could extract insights from millions of Instagram profiles, posts, and comments to understand market trends, analyze competitors, or generate leads? Instagram holds a goldmine of public data, but accessing it programmatically is deliberately difficult. Instagram employs sophisticated anti-bot defenses (TLS fingerprinting, IP quality detection, and behavioral analysis) designed to stop scrapers in their tracks. In this guide, we'll reveal exactly how Instagram blocks scrapers and why building a manual solution is a losing battle. More importantly, we'll show you the production-ready approach that bypasses every defense: ScrapFly's maintained Instagram scraper with built-in anti-bot infrastructure. You'll learn what data you can extract, how Instagram's hidden APIs work, and why the smart approach is starting with battle-tested code rather than building from scratch.
Key Takeaways
Master Instagram scraping in 2025 with production-ready solutions that bypass anti-bot defenses, access hidden GraphQL APIs, and scale reliably for business intelligence.
- Understand Instagram's multi-layered anti-bot system: IP quality checks, TLS fingerprinting, rate limits, and behavioral detection
- Access Instagram's hidden REST and GraphQL APIs to extract profiles, posts, comments, and engagement metrics
- Use ScrapFly's open-source Instagram scraper with built-in anti-blocking to start scraping in 5 minutes
- Implement proper proxy rotation with residential IPs to avoid instant datacenter IP blocks
- Monitor and update doc_id parameters that Instagram changes every 2-4 weeks to break scrapers
- Extract business intelligence: competitor analysis, sentiment tracking, influencer metrics, and lead generation data
Latest Instagram Scraper Code
What Instagram Data Can You Scrape?
Instagram's public data offers powerful business intelligence when extracted systematically. Here's what you can scrape and why it matters:
Profiles - Extract bio, follower/following counts, verification status, business contact info, and post statistics. Use case: Build targeted lead lists by scraping verified business profiles in specific niches, then reach out using their public email addresses.
Posts - Capture captions, images, videos, likes, view counts, timestamps, location tags, and tagged users. Use case: Analyze your competitor's top-performing content to understand what resonates with your shared audience and replicate successful formats.
Reels - Access video URLs, play counts, music attribution, duration, and engagement metrics. Use case: Track trending audio clips and formats in your industry to inform your own content strategy before trends peak.
Comments - Scrape comment text, nested replies, timestamps, author profiles, and like counts. Use case: Perform sentiment analysis on competitor posts to identify customer pain points and service gaps you can address.
Hashtags - Aggregate posts by hashtag, trending scores, and usage patterns. Use case: Discover emerging micro-influencers by scraping posts from industry hashtags and ranking by engagement rate rather than follower count.
But getting this data is a two-part challenge: finding the right API endpoints AND not getting blocked by Instagram's anti-bot defenses. Let's tackle both.
How Instagram Blocks Scrapers (Anti-Bot Detection Explained)
Instagram employs a sophisticated, multi-layered anti-bot system designed to identify and block automated scraping. Understanding these defenses reveals why manual scraping solutions fail and require constant maintenance.
Rate Limiting & IP Blocking
Instagram enforces strict request quotas to prevent aggressive scraping:
- Request limits: ~200 requests per hour per IP address for non-authenticated users
- Throttling response: After exceeding limits, you'll receive HTTP 429 "Too Many Requests" errors
- Block duration: Your IP gets temporarily rate-limited for hours or days depending on violation severity
- Progressive penalties: Repeated violations lead to longer blocks and eventually permanent IP bans
Even if you perfectly implement delays and respect rate limits, you're still limited to scraping ~4,800 profiles per day per IP, which is insufficient for any serious data collection.
IP Quality Detection
Instagram analyzes your IP address quality before even processing your request:
- Datacenter IPs blocked instantly: Requests from AWS, DigitalOcean, Google Cloud, and other hosting providers are flagged immediately
- Residential IPs required: Instagram expects requests from genuine consumer ISPs (Comcast, AT&T, etc.)
- ASN reputation checking: Instagram maintains blocklists of ASNs (Autonomous System Numbers) associated with proxies and VPNs
- This runs BEFORE rate limits: A datacenter IP gets blocked on the first request, regardless of how slowly you scrape
This is why you can't just deploy your scraper to a cloud server and expect it to work: Instagram blocks it before you even hit the rate limit.
Browser Fingerprinting
Instagram analyzes dozens of browser characteristics to detect automation tools:
- TLS/SSL fingerprinting: Python's `requests` library has a unique TLS handshake signature that Instagram flags as a bot instantly
- HTTP/2 fingerprinting: The order and format of HTTP/2 frames reveals whether you're using a real browser or a scripting library
- Header order consistency: Real browsers send headers in a specific order; scrapers often randomize or alphabetize them
- Canvas/WebGL fingerprinting: When JavaScript is enabled, Instagram tests how your browser renders graphics; automation frameworks produce consistent, detectable signatures
Even if you copy all the correct headers from a real browser, the TLS handshake alone will expose you as a bot within seconds.
Request Pattern Detection
Instagram's behavioral analysis identifies non-human usage patterns:
- Timing patterns: Perfect 3-second delays between requests look robotic; humans vary their timing
- Request sequencing: Real users navigate naturally (view profile → scroll → click post); bots often access API endpoints directly without realistic browsing
- Session validation: Instagram expects correlated requests (CSS, images, analytics) alongside your API calls; scraping just the data endpoints is suspicious
- Cookie behavior: Missing, malformed, or inconsistent cookies signal automation
Instagram's machine learning models are trained on millions of real user sessions; any deviation from natural human behavior raises red flags.
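The timing-pattern point above can be sketched in a few lines: instead of a fixed interval, sample each delay from a range so the pacing varies the way a human's would. This is an illustrative sketch, not ScrapFly code; the base and jitter values are arbitrary assumptions.

```python
import random

def human_delay(base: float = 3.0, jitter: float = 2.0) -> float:
    """Return a randomized wait time instead of a fixed interval.

    A constant gap between requests is a classic bot signal; sampling
    the delay from a range looks closer to human pacing. The 3s/2s
    defaults here are illustrative, not tuned values.
    """
    delay = base + random.uniform(-jitter, jitter)
    return max(delay, 0.5)  # keep a small floor so delays never hit zero

# With base=3 and jitter=2, delays land anywhere in roughly 1-5 seconds;
# in a real loop you'd call time.sleep(human_delay()) between requests.
delays = [human_delay() for _ in range(5)]
```

Randomized pacing alone won't defeat Instagram's ML models, but it removes the most obvious robotic signature.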
The Bottom Line: You can build a perfect scraper, but without professional anti-detection infrastructure, it'll get blocked within hours. Instagram updates these defenses weekly, meaning even working scrapers break constantly.
How to Scrape Instagram with ScrapFly (The Easy Way)
ScrapFly provides the complete Instagram scraping solution: production-ready scraper code + anti-blocking infrastructure. Clone the repository, configure your API key, and start scraping in 5 minutes.
What You Get
- Production-ready scraper code: Open source, actively maintained, updated within hours when Instagram changes
- Built-in anti-bot infrastructure: TLS fingerprinting, header rotation, and behavioral mimicry handled automatically
- Residential proxy network included: 50M+ IPs from real consumer ISPs, with no separate proxy bills or configuration
- Automatic updates: When Instagram changes doc_ids or endpoints, we update the scraper immediately
- Cost optimization: Proxy Saver feature reduces residential proxy costs by 30-50% through intelligent caching
```shell
# Clone the scraper repository
git clone https://github.com/scrapfly/scrapfly-scrapers.git
cd scrapfly-scrapers/instagram-scraper

# Configure your ScrapFly API key
export SCRAPFLY_KEY="your_key_here"

# Install dependencies
poetry install

# Start scraping
poetry run python run.py
```
How ScrapFly Bypasses Every Defense
Anti-Bot Bypass: ScrapFly rotates TLS fingerprints to match real Chrome/Firefox browsers, orders HTTP headers correctly, and mimics genuine browser behavior. Instagram sees legitimate browser traffic, not a scraper.
Proxy Management: Our network of 50M+ residential IPs automatically rotates with each request. Instagram sees requests from real consumer devices across different ISPs and locations, exactly like genuine users.
Rate Limit Handling: Smart throttling and exponential backoff automatically slow down when Instagram pushes back. The scraper adjusts its speed dynamically to stay under the radar.
Proxy Saver: Reduces residential proxy costs by 30-50% by intelligently caching static content and only using premium residential IPs for the actual API calls. For a 10,000 profile scraping job, this saves several dollars in proxy bandwidth (see the cost breakdown later in this guide).
How Instagram's Scraping API Works
Instagram doesn't provide official public APIs, but their web and mobile apps communicate with backend APIs we can access directly. Instagram uses two API architectures:
- REST API: Simple endpoints for basic data (e.g., `/api/v1/users/web_profile_info/` for profiles)
- GraphQL API: Complex query system for posts, comments, and paginated data
Instagram uses REST APIs for straightforward requests where the data structure is simple, and GraphQL for complex queries involving nested data, filtering, or pagination.
Finding Instagram's Hidden Endpoints
When Instagram updates their platform, endpoints change. Here's how to discover current endpoints when they break:
Step 1: Open Instagram in Chrome/Firefox and open DevTools (F12)
Step 2: Go to Network tab and filter by "Fetch/XHR" to see API calls
Step 3: Navigate Instagram normally (visit a profile, view a post, scroll comments)
Step 4: Watch for API requests to domains like:
- `i.instagram.com/api/v1/` (REST endpoints)
- `www.instagram.com/graphql/query` (GraphQL endpoints)
Step 5: Click on an API request to inspect:
- Request headers (especially `x-ig-app-id`)
- Request payload (for GraphQL, look for `variables` and `doc_id`)
- Response structure (to understand data format)
REST Example: When viewing a profile, you'll see a request to:
`https://i.instagram.com/api/v1/users/web_profile_info/?username=google`
GraphQL Example: When viewing a post, you'll see a POST request to:
`https://www.instagram.com/graphql/query`
with a payload containing `doc_id` and `variables` parameters.
Understanding doc_id
The `doc_id` parameter is critical for GraphQL scraping but poorly understood. Here's what you need to know:
What is doc_id?
- Instagram's internal identifier for specific GraphQL query structures
- Maps to a predefined query on Instagram's backend (you can't define custom queries)
- Example: `doc_id=8845758582119845` retrieves post details
Why doc_ids exist:
- Performance: Pre-defined queries are optimized and cached on Instagram's servers
- Security: Prevents custom queries that could overload the database
- Anti-scraping: Changing doc_ids regularly breaks scrapers
Why doc_ids change:
- Instagram updates their GraphQL schema every 2-4 weeks
- Changes are a deliberate anti-scraping measure
- No public documentation of current values; you must discover them yourself
How to find current doc_ids:
- Open DevTools → Network tab → filter for "graphql"
- Trigger the action on Instagram (view post, load comments, etc.)
- Inspect the request payload for the `doc_id=` parameter
- Note the numeric value (e.g., `8845758582119845`)
Different operations require different doc_ids:
- Profile posts: `9310670392322965`
- Post details: `8845758582119845`
- Comments pagination: (changes frequently)
- User search: (changes frequently)
The DIY Pain: You must monitor doc_ids manually and update your scraper every time Instagram changes them (every 2-4 weeks). Miss an update and your scraper breaks silently.
ScrapFly Solution: Our open-source repository is updated within hours of Instagram changes. You pull the latest code and keep scraping, with no detective work required.
Required Headers
Headers aren't just formalities; Instagram validates them strictly. Here's what you need and why:
Critical Headers:
```python
{
    "x-ig-app-id": "936619743392459",  # Instagram web app identifier
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}
```
Why each header matters:
- x-ig-app-id: Identifies your request as coming from Instagram's web app (not mobile app or unauthorized client). Wrong value = instant 403 error.
- User-Agent: Must match a real browser signature. Python's default User-Agent screams "bot" and gets blocked immediately.
- Accept-Language: Instagram tracks inconsistent language preferences across requests, so keep it stable per session.
- Accept-Encoding: Real browsers always accept compression; omitting this is suspicious.
- Accept: Wildcard is fine, but must be present.
What happens with wrong headers:
- 403 Forbidden: TLS fingerprint or app-id mismatch detected
- 400 Bad Request: Malformed headers or missing required fields
- No response: Your IP was flagged and silently dropped
Header consistency requirement:
Instagram correlates requests within a session. If your User-Agent changes mid-session or headers conflict with your TLS fingerprint, you're flagged instantly.
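One way to satisfy this consistency requirement is to freeze the header set once and build every request in the session from it. A minimal stdlib sketch (the header values are the ones listed above; this is illustrative, not ScrapFly's implementation):

```python
from urllib.request import Request

# Fixed once per session; every request reuses the exact same values.
SESSION_HEADERS = {
    "x-ig-app-id": "936619743392459",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}

def make_request(url: str) -> Request:
    """Build a request that always carries the session's header set."""
    return Request(url, headers=dict(SESSION_HEADERS))

req = make_request(
    "https://i.instagram.com/api/v1/users/web_profile_info/?username=google"
)
```

Keeping the headers in one constant means a mid-session User-Agent drift is impossible by construction. (Note this only fixes the HTTP layer; the TLS fingerprint problem described earlier remains.)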
How to Scrape Instagram Profiles
Instagram profiles contain valuable business intelligence: follower counts, bio information, business contact details, and recent posts. We'll use Instagram's REST API endpoint that returns profile data as JSON.
What you can extract:
- Full name, username, user ID, verification status
- Bio text and external links
- Follower and following counts
- Profile picture URL
- Business category, phone number, email (for business accounts)
- First 12 posts with preview data
The approach:
We make a GET request to Instagram's profile API endpoint with the username as a parameter. The response includes comprehensive profile data in JSON format.
ScrapFly's scraper handles:
- Proper header formatting and x-ig-app-id rotation
- Residential proxy rotation to avoid IP blocks
- TLS fingerprint matching to bypass bot detection
- Automatic retry with exponential backoff on rate limits
Code snippet from the ScrapFly scraper:
```python
from scrapfly import ScrapeConfig, ScrapflyClient
import json

scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
INSTAGRAM_APP_ID = "936619743392459"
BASE_CONFIG = {
    "asp": True,      # Anti Scraping Protection bypass
    "country": "US",  # Use US residential proxies
}

async def scrape_profile(username: str):
    """Scrape Instagram profile data"""
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url=f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
            headers={"x-ig-app-id": INSTAGRAM_APP_ID},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return data["data"]["user"]

# Example usage (run inside an async function or asyncio REPL)
profile = await scrape_profile("google")
print(f"Followers: {profile['edge_followed_by']['count']}")
```
Key implementation details:
- The `asp=True` parameter activates ScrapFly's anti-bot bypass (TLS fingerprinting, header rotation)
- Residential proxies (`country="US"`) prevent datacenter IP blocks
- The endpoint returns up to 12 recent posts embedded in the profile response
- Business accounts expose email/phone in `business_email` and `business_phone_number` fields
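As a small illustration of that last point, a helper like the hypothetical one below could pull the public contact fields out of the returned user dict. The `business_email` and `business_phone_number` keys are the ones named above; `category_name` is an assumed field name and may differ in practice.

```python
def business_contacts(user: dict) -> dict:
    """Collect public contact details from a scraped profile payload.

    Returns None for fields a non-business account doesn't expose.
    """
    return {
        "username": user.get("username"),
        "email": user.get("business_email"),
        "phone": user.get("business_phone_number"),
        "category": user.get("category_name"),  # assumed field name
    }

# Example with a minimal, made-up payload:
sample_user = {"username": "acme_store", "business_email": "hello@acme.example"}
print(business_contacts(sample_user))
# {'username': 'acme_store', 'email': 'hello@acme.example', 'phone': None, 'category': None}
```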
Full profile scraper code in our repository →
How to Scrape Instagram Posts
Post data includes captions, media URLs, engagement metrics, comments, and tagged users. Instagram uses GraphQL for post queries, requiring proper doc_id values and request formatting.
What you can extract:
- Post shortcode, ID, and timestamp
- Image/video URLs (full resolution)
- Captions and hashtags
- Like counts, view counts (for videos), play counts (for reels)
- First page of comments (with pagination cursor for more)
- Tagged users and location data
- Related posts
The approach:
We send a POST request to Instagram's GraphQL endpoint with a payload containing the post shortcode and the correct doc_id. Instagram returns comprehensive post data including engagement metrics and comments.
ScrapFly's scraper handles:
- Current doc_id values (updated within hours when Instagram changes them)
- GraphQL payload formatting and URL encoding
- Comment pagination for posts with 100+ comments
- Different post types: photos, videos, reels, carousels
Code snippet from the ScrapFly scraper:
```python
from scrapfly import ScrapeConfig, ScrapflyClient
import json
from urllib.parse import quote

scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
INSTAGRAM_POST_DOC_ID = "8845758582119845"  # Updated regularly
BASE_CONFIG = {"asp": True, "country": "US"}

async def scrape_post(url_or_shortcode: str):
    """Scrape single Instagram post data"""
    # Extract shortcode from URL or use directly
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode

    # Build GraphQL request payload
    variables = quote(json.dumps({
        'shortcode': shortcode,
        'fetch_tagged_user_count': None,
        'hoisted_comment_id': None,
        'hoisted_reply_id': None
    }, separators=(',', ':')))
    body = f"variables={variables}&doc_id={INSTAGRAM_POST_DOC_ID}"

    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url="https://www.instagram.com/graphql/query",
            method="POST",
            body=body,
            headers={"content-type": "application/x-www-form-urlencoded"},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return data["data"]["xdt_shortcode_media"]

# Example usage (run inside an async function or asyncio REPL)
post = await scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/")
print(f"Likes: {post['edge_media_preview_like']['count']}")
```
Key implementation details:
- The shortcode is the unique post identifier (e.g., `CuE2WNQs6vH` from the URL `/p/CuE2WNQs6vH/`)
- GraphQL requires URL-encoded JSON in the request body
- The response includes nested structures for comments (`edge_media_to_parent_comment`)
- Carousel posts have multiple images in `edge_sidecar_to_children`
Full post scraper code with pagination →
How to Scrape Instagram Comments
Comments provide sentiment data, user engagement patterns, and conversation threads. Instagram paginates comments, requiring multiple requests to extract full comment sections.
What you can extract:
- Comment text and timestamp
- Commenter username, profile, verification status
- Like counts per comment
- Nested replies (threaded conversations)
- Pagination cursors for loading more comments
The approach:
Comments are included in the initial post data (first ~12 comments), but posts with hundreds of comments require pagination. We use the `end_cursor` value from `page_info` to load subsequent pages through additional GraphQL requests.
ScrapFly's scraper handles:
- Nested pagination (comments and their replies have separate cursors)
- Rate limit respect (Instagram throttles aggressive comment scraping)
- Proper doc_id for comment pagination queries
- Reply threading and parent-child comment relationships
Code snippet for comment pagination:
```python
# Reuses scrapfly, quote, ScrapeConfig, and BASE_CONFIG from the snippets
# above; INSTAGRAM_COMMENTS_DOC_ID must hold a current comments doc_id.
async def scrape_post_comments(shortcode: str, max_comments: int = 100):
    """Scrape comments from Instagram post with pagination"""
    comments = []
    cursor = None

    while len(comments) < max_comments:
        variables = quote(json.dumps({
            'shortcode': shortcode,
            'first': 50,      # Comments per page
            'after': cursor,  # Pagination cursor
        }, separators=(',', ':')))
        body = f"variables={variables}&doc_id={INSTAGRAM_COMMENTS_DOC_ID}"

        result = await scrapfly.async_scrape(
            ScrapeConfig(
                url="https://www.instagram.com/graphql/query",
                method="POST",
                body=body,
                headers={"content-type": "application/x-www-form-urlencoded"},
                **BASE_CONFIG,
            )
        )
        data = json.loads(result.content)
        comment_data = data["data"]["xdt_shortcode_media"]["edge_media_to_parent_comment"]

        # Extract comments from this page
        for edge in comment_data["edges"]:
            comments.append(edge["node"])

        # Check for more comments
        page_info = comment_data["page_info"]
        if not page_info["has_next_page"]:
            break
        cursor = page_info["end_cursor"]

    return comments[:max_comments]
```
Key implementation details:
- The `first` parameter controls comments per page (max ~50)
- Each comment includes `edge_threaded_comments` for nested replies
- Replies have their own pagination system requiring separate requests
- The scraper respects Instagram's rate limits by adding delays between pagination requests
Full comment scraper with reply handling →
How to Scrape Instagram with Proxies
Proxies are mandatory for Instagram scraping at any scale. Instagram's IP quality detection blocks datacenter IPs instantly, and rate limits force you to rotate residential IPs to maintain scraping speed.
Best Proxies for Instagram Scraping (Residential vs Datacenter)
Datacenter Proxies: Don't Bother
- ✗ Blocked instantly by Instagram's IP quality checks
- ✗ No request volume possible; banned on first request
- ✗ Cheaper per GB, but a 100% failure rate makes cost irrelevant
Residential Proxies: Required
- ✓ IPs from real consumer ISPs (Comcast, Verizon, AT&T, etc.)
- ✓ Pass Instagram's IP quality detection
- ✓ Each IP allows ~200 requests/hour before rate limiting
- ✓ Geographic targeting (e.g., US-only IPs for US-focused scraping)
Mobile Proxies: Premium Option
- ✓ IPs from mobile carriers (4G/5G networks)
- ✓ Highest trust score; Instagram rarely blocks mobile IPs
- ✓ Better rate limits (~300 requests/hour per IP)
- ✗ More expensive ($60-120/month per IP vs $1-3 for residential)
Recommendation: Residential proxies are the sweet spot for Instagram scraping. Mobile proxies offer marginal improvement at 10-20x the cost; not worth it unless you're scraping millions of profiles daily.
How to Rotate Proxies for Instagram Scraping
Proxy rotation strategies determine your scraping speed and block rate:
Sticky Sessions (Recommended):
- Use the same IP for 5-10 minutes, then rotate
- Mimics real user behavior (one person doesn't change IPs every 10 seconds)
- Allows ~15-30 requests per IP before rotation
- Instagram's behavioral analysis flags instant IP changes as suspicious
Request-Level Rotation (Aggressive):
- New IP for every single request
- Maximizes speed but looks unnatural to Instagram
- Higher block rateâuse only with anti-bot bypass (like ScrapFly)
- Necessary when scraping 10,000+ profiles/hour
Smart Rotation Based on Response:
- Rotate immediately on 429 (rate limit) or 403 (block)
- Continue using same IP while responses are 200 OK
- Implements exponential backoff: 2s delay → 4s → 8s → 16s before rotating
- Reduces wasted proxy bandwidth
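The response-driven strategy above boils down to two small decisions, sketched here as plain functions (illustrative pseudologic, not ScrapFly's internals):

```python
def should_rotate(status_code: int) -> bool:
    """Rotate to a fresh IP on rate limits (429) or blocks (403);
    keep using the current IP while responses are 200 OK."""
    return status_code in (429, 403)

def backoff_delays(max_retries: int = 4, base: float = 2.0) -> list:
    """Exponential backoff schedule before rotating: 2s, 4s, 8s, 16s."""
    return [base * (2 ** attempt) for attempt in range(max_retries)]

print(backoff_delays())  # [2.0, 4.0, 8.0, 16.0]
```

The key design choice is reacting to what the server actually returns rather than rotating blindly, which wastes the remaining quota on healthy IPs.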
ScrapFly's automatic proxy management:
- Intelligent rotation using sticky sessions by default
- Instant rotation on rate limits or blocks
- Geographic pinning (keep requests in same country/region)
- Proxy pool health monitoring (removes dead IPs automatically)
Instagram Proxy Costs
Residential proxies are billed per GB of bandwidth consumed. Here's what Instagram scraping costs:
Data usage per request type:
- Profile scrape: ~50-100 KB per profile
- Post scrape: ~30-80 KB per post
- Comment scrape: ~20-50 KB per comment page
Example scraping job: 10,000 Instagram profiles
- 10,000 profiles × 75 KB average = 750 MB
- Standard residential proxy cost: $10-15 per GB
- Total cost: $7.50-11.25 in proxy bandwidth
ScrapFly's Proxy Saver feature:
- Caches static content (profile images, CSS, JavaScript)
- Only uses residential bandwidth for actual API calls
- Reduces bandwidth consumption by 30-50%
- Same 10,000 profile job: $5.25-7.88 with Proxy Saver (at a 30% reduction)
- Savings: $2.25-3.37 per 10K profiles
For serious Instagram scraping (100K+ profiles/month), Proxy Saver saves $50-100+ monthly in proxy costs alone.
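The arithmetic above is easy to reproduce for your own job sizes. A small helper sketch (the numbers plugged in below are this guide's own estimates):

```python
def proxy_cost_usd(n_requests: int, kb_per_request: float,
                   usd_per_gb: float, saver_discount: float = 0.0) -> float:
    """Estimate residential proxy spend for a scraping job.

    saver_discount is the fraction of bandwidth avoided by caching,
    e.g. 0.3-0.5 for a 30-50% reduction.
    """
    gb = n_requests * kb_per_request / 1_000_000  # KB -> GB (decimal units)
    return round(gb * usd_per_gb * (1 - saver_discount), 2)

# 10,000 profiles at ~75 KB each, $10/GB residential bandwidth:
print(proxy_cost_usd(10_000, 75, 10))       # 7.5
print(proxy_cost_usd(10_000, 75, 10, 0.3))  # 5.25
```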
FAQ
Are there public APIs for Instagram?
No. Instagram discontinued public API access in 2020 and now only offers limited APIs for verified business partners through the Instagram Graph API. However, Instagram's web and mobile apps communicate with internal REST and GraphQL APIs that we can access directly through reverse engineering. These "hidden" APIs provide far more data than the official API ever did.
Is web scraping Instagram legal?
Yes, scraping publicly accessible Instagram data is generally legal under current precedent (hiQ Labs v. LinkedIn, 2022). Courts have ruled that accessing public data doesn't violate the Computer Fraud and Abuse Act. However, you must:
- Only scrape public data (no accessing private accounts)
- Respect rate limits and don't overload Instagram's servers
- Comply with GDPR, CCPA, and other data privacy laws when handling personal information
- Be aware that Instagram's Terms of Service prohibit scraping; violating terms of service is typically a contractual matter rather than a criminal one
For more details, see our web scraping legality guide.
How to get Instagram user ID from username?
Scrape the user profile using the `/api/v1/users/web_profile_info/` endpoint and extract the `id` field:
```python
import httpx
import json

response = httpx.get(
    "https://i.instagram.com/api/v1/users/web_profile_info/?username=google",
    headers={"x-ig-app-id": "936619743392459"}
)
data = json.loads(response.content)
user_id = data["data"]["user"]["id"]
print(user_id)  # Output: 1067259270
```
How to get Instagram username from user ID?
Use Instagram's public mobile API endpoint:
```python
import httpx

iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90"

response = httpx.get(
    iphone_api.format("1067259270"),
    headers={"User-Agent": iphone_user_agent}
)
username = response.json()['user']['username']
print(username)  # Output: google
```
How do I handle Instagram's rate limiting when scraping at scale?
Instagram rate limiting requires a three-part strategy:
Residential proxy rotation: Use 50-100+ residential IPs and rotate them in sticky sessions (5-10 minutes per IP). Each IP allows ~200 requests/hour.
Realistic delays: Space requests 2-5 seconds apart with random variance. Perfect timing intervals look robotic.
Exponential backoff: When you receive a 429 error, back off exponentially (wait 2s, then 4s, then 8s, etc.) before retrying.
ScrapFly handles all three automatically: you specify your desired scraping speed and we manage rate limits, retries, and proxy rotation.
What's the difference between scraping Instagram profiles vs posts?
Profiles use a simple REST API endpoint (`/api/v1/users/web_profile_info/`) that returns JSON with a single GET request. The response includes profile data plus the first 12 posts.
Posts use Instagram's GraphQL API (`/graphql/query`) requiring POST requests with specific doc_id values and URL-encoded JSON payloads. The response structure is more complex with nested data for comments, likes, and tagged users.
Pagination: Profile data is single-page, while post comment scraping requires pagination through multiple requests using cursor values.
Can I scrape Instagram stories or reels data?
Yes, stories and reels use dedicated GraphQL endpoints with their own doc_id values:
Stories: Ephemeral (24-hour) content requires the user's ID and a stories-specific doc_id. Stories include view counts, replies, and media URLs.
Reels: Similar to post scraping but with video-specific fields (play counts, audio attribution, video duration). Reels use doc_id `25981206651899035` (subject to change).
Both require authentication for some accounts, but public accounts expose this data without login.
How do I extract Instagram comments and engagement metrics?
Comments are nested in post data under `edge_media_to_parent_comment`. The initial post request returns the first ~12 comments plus a pagination cursor (`end_cursor`).
To extract all comments:
- Scrape the post to get initial comments and the `end_cursor`
- Make paginated GraphQL requests with the cursor to load more
- Extract nested replies from each comment's `edge_threaded_comments`
Engagement metrics are in the post data:
- Likes: `edge_media_preview_like.count`
- Comments: `edge_media_to_parent_comment.count`
- Views (videos): `video_view_count`
- Plays (reels): `video_play_count`
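The four paths above can be read out of a post payload with a tolerant helper like this sketch (field names as documented in this guide; anything missing simply becomes None):

```python
def extract_engagement(post: dict) -> dict:
    """Pull engagement counters out of a GraphQL post payload."""
    return {
        "likes": post.get("edge_media_preview_like", {}).get("count"),
        "comments": post.get("edge_media_to_parent_comment", {}).get("count"),
        "views": post.get("video_view_count"),  # videos only
        "plays": post.get("video_play_count"),  # reels only
    }

# Minimal made-up payload for illustration:
sample_post = {
    "edge_media_preview_like": {"count": 1200},
    "edge_media_to_parent_comment": {"count": 45},
    "video_view_count": 9000,
}
print(extract_engagement(sample_post))
# {'likes': 1200, 'comments': 45, 'views': 9000, 'plays': None}
```

Using `.get()` with defaults keeps the helper from crashing on photo posts, which lack the video-only counters.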
What are the most common Instagram scraping challenges?
1. Doc_id changes: Instagram updates GraphQL doc_id values every 2-4 weeks, breaking scrapers. Solution: Monitor our open-source scraper for updates.
2. IP blocks: Datacenter IPs are banned instantly. Solution: Use residential proxies with rotation.
3. TLS fingerprinting: Python libraries have detectable signatures. Solution: Use anti-bot bypass tools like ScrapFly that rotate fingerprints.
4. Rate limits: 200 requests/hour per IP. Solution: Rotate through 50+ residential IPs with sticky sessions.
5. Behavioral detection: Unnatural request patterns get flagged. Solution: Add random delays, mimic real browsing sequences.
Why does my Python Instagram scraper get 403 errors immediately?
This is TLS fingerprinting blocking. Python's `requests` and `httpx` libraries have unique TLS handshake signatures that Instagram detects as bots within the first request.
Solutions:
- Use browser automation (Selenium, Playwright), which has real browser fingerprints but is roughly 10x slower
- Use the curl_cffi library, which mimics Chrome's TLS fingerprint
- Use ScrapFly which rotates TLS fingerprints automatically
Don't waste time trying to fix headers; the TLS handshake happens before HTTP headers are even sent. You need a tool that controls the TLS layer.
Summary
Instagram scraping in 2025 requires navigating sophisticated anti-bot defenses: IP quality detection, TLS fingerprinting, rate limiting, and behavioral analysis. Building a scraper from scratch means constant maintenance as Instagram updates doc_ids every 2-4 weeks and evolves their blocking systems weekly.
We covered:
- Instagram's multi-layered anti-bot system and why manual scrapers fail within hours
- How to access hidden REST and GraphQL APIs for profiles, posts, and comments
- Why residential proxies are mandatory (datacenter IPs are blocked instantly)
- How doc_id parameters work and why they change every 2-4 weeks
- Why Python HTTP libraries get blocked immediately due to TLS fingerprinting
The smart approach: Start with ScrapFly's production-ready Instagram scraper that includes anti-bot bypass, residential proxies, and automatic updates when Instagram changes, saving you hundreds of hours in maintenance and debugging.
Legal Disclaimer and Precautions
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect. Here's a good summary of what not to do:
- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens who are protected by GDPR.
- Do not repurpose entire public datasets, which can be illegal in some countries.