Instagram holds valuable data for businesses. You can extract competitor insights, customer sentiment, and market trends from profiles, posts, and comments. However, Instagram makes its data hard to scrape, using multiple layers of protection that block most scrapers within hours. In this guide, you'll learn how Instagram blocks scrapers, what data you can extract, and why building your own solution usually fails. We'll also show you a better approach using ScrapFly's maintained Instagram scraper, which handles all the blocking challenges automatically.
Key Takeaways
Learn Instagram scraping in 2025 with working solutions that bypass blocking systems, access hidden GraphQL APIs, and scale for business intelligence.
- Understand Instagram's blocking system: IP quality checks, TLS fingerprinting, rate limits, and behavioral detection
- Access Instagram's hidden REST and GraphQL APIs to extract profiles, posts, comments, and engagement metrics
- Use ScrapFly's open-source Instagram scraper with built-in anti-blocking to start scraping in 5 minutes
- Set up proxy rotation with residential IPs to avoid datacenter IP blocks
- Monitor and update doc_id parameters that Instagram changes every 2-4 weeks
- Extract business intelligence: competitor analysis, sentiment tracking, influencer metrics, and lead generation data
What Instagram Data Can You Scrape?
Instagram's public data provides business intelligence when extracted systematically. Here's what you can scrape and why it matters:
Profiles - Extract bio, follower/following counts, verification status, business contact info, and post statistics. Use case: Build lead lists by scraping verified business profiles in specific niches, then reach out using their public email addresses.
Posts - Capture captions, images, videos, likes, view counts, timestamps, location tags, and tagged users. Use case: Analyze your competitor's top-performing content to understand what resonates with your shared audience and replicate successful formats.
Reels - Access video URLs, play counts, music attribution, duration, and engagement metrics. Use case: Track trending audio clips and formats in your industry to inform your own content strategy before trends peak.
Comments - Scrape comment text, nested replies, timestamps, author profiles, and like counts. Use case: Perform sentiment analysis on competitor posts to identify customer pain points and service gaps you can address.
Hashtags - Aggregate posts by hashtag, trending scores, and usage patterns. Use case: Discover emerging micro-influencers by scraping posts from industry hashtags and ranking by engagement rate rather than follower count.
Getting this data requires two things: finding the right API endpoints and avoiding Instagram's blocking systems. Let's cover both.
How Instagram Blocks Scrapers (Anti-Bot Detection Explained)
Instagram uses a multi-layered blocking system designed to identify and block automated scraping. Understanding these systems shows why manual scraping solutions fail and require constant maintenance.
Rate Limiting & IP Blocking
Instagram enforces strict request quotas to prevent aggressive scraping:
- Request limits: ~200 requests per hour per IP address for non-authenticated users
- Throttling response: After exceeding limits, you receive HTTP 429 "Too Many Requests" errors
- Block duration: Your IP gets temporarily rate-limited for hours or days depending on violation severity
- Progressive penalties: Repeated violations lead to longer blocks and eventually permanent IP bans
Even if you implement delays and respect rate limits, you're still limited to scraping ~4,800 profiles per day per IP. This is insufficient for serious data collection.
IP Quality Detection
Instagram analyzes your IP address quality before even processing your request:
- Datacenter IPs blocked instantly: Requests from AWS, DigitalOcean, Google Cloud, and other hosting providers are flagged immediately
- Residential IPs required: Instagram expects requests from genuine consumer ISPs (Comcast, AT&T, etc.)
- ASN reputation checking: Instagram maintains blocklists of ASNs (Autonomous System Numbers) associated with proxies and VPNs
- This runs BEFORE rate limits: A datacenter IP gets blocked on the first request, regardless of how slowly you scrape
This is why you cannot deploy your scraper to a cloud server and expect it to work. Instagram blocks it before you even hit the rate limit.
Browser Fingerprinting
Instagram analyzes dozens of browser characteristics to detect automation tools:
- TLS/SSL fingerprinting: Python's `requests` library has a unique TLS handshake signature that Instagram flags as a bot instantly
- HTTP/2 fingerprinting: The order and format of HTTP/2 frames reveal whether you're using a real browser or a scripting library
- Header order consistency: Real browsers send headers in a specific order; scrapers often randomize or alphabetize them
- Canvas/WebGL fingerprinting: When JavaScript is enabled, Instagram tests how your browser renders graphics. Automation frameworks produce consistent, detectable signatures
Even if you copy all the correct headers from a real browser, the TLS handshake alone will expose you as a bot within seconds.
Request Pattern Detection
Instagram's behavioral analysis identifies non-human usage patterns:
- Timing patterns: Perfect 3-second delays between requests look robotic; humans vary their timing
- Request sequencing: Real users navigate naturally (view profile → scroll → click post); bots often access API endpoints directly without realistic browsing
- Session validation: Instagram expects correlated requests (CSS, images, analytics) alongside your API calls; scraping just the data endpoints is suspicious
- Cookie behavior: Missing, malformed, or inconsistent cookies signal automation
Instagram's machine learning models are trained on millions of real user sessions. Any deviation from natural human behavior raises red flags.
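One cheap countermeasure on the timing front is randomized, human-like pacing instead of fixed delays. A minimal sketch (the delay bounds here are illustrative assumptions, not documented Instagram thresholds):
import asyncio
import random

async def human_pause(base: float = 3.0, jitter: float = 2.0):
    """Sleep a randomized interval so request timing isn't robotically uniform"""
    await asyncio.sleep(base + random.uniform(0, jitter))

async def maybe_long_break(probability: float = 0.05):
    """Occasionally idle for longer, like a user switching tabs"""
    if random.random() < probability:
        await asyncio.sleep(random.uniform(20, 60))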
The Bottom Line: You can build a perfect scraper, but without professional anti-detection infrastructure, it will get blocked within hours. Instagram updates these blocking systems weekly, meaning even working scrapers break constantly.
How to Scrape Instagram with ScrapFly (The Easy Way)
ScrapFly provides the complete Instagram scraping solution: working scraper code + anti-blocking infrastructure. Clone the repository, configure your API key, and start scraping in 5 minutes.
What You Get
- Working scraper code: Open source, actively maintained, updated within hours when Instagram changes
- Built-in anti-blocking infrastructure: TLS fingerprinting, header rotation, and behavioral mimicry handled automatically
- Residential proxy network included: 50M+ IPs from real consumer ISPs with no separate proxy bills or configuration
- Automatic updates: When Instagram changes doc_ids or endpoints, we update the scraper immediately
- Cost optimization: Proxy Saver feature reduces residential proxy costs by 30-50% through intelligent caching
# Clone the scraper repository
git clone https://github.com/scrapfly/scrapfly-scrapers.git
cd scrapfly-scrapers/instagram-scraper
# Configure your ScrapFly API key
export SCRAPFLY_KEY="your_key_here"
# Install dependencies
poetry install
# Start scraping
poetry run python run.py
How ScrapFly Bypasses Every Defense
Anti-Blocking Bypass: ScrapFly rotates TLS fingerprints to match real Chrome/Firefox browsers, orders HTTP headers correctly, and mimics genuine browser behavior. Instagram sees legitimate browser traffic, not a scraper.
Proxy Management: Our network of 50M+ residential IPs automatically rotates with each request. Instagram sees requests from real consumer devices across different ISPs and locations, exactly like genuine users.
Rate Limit Handling: Smart throttling and exponential backoff automatically slow down when Instagram pushes back. The scraper adjusts its speed dynamically to stay under the radar.
Proxy Saver: Reduces residential proxy costs by 30-50% by intelligently caching static content and only using premium residential IPs for the actual API calls. For a 10,000-profile scraping job, that shaves a few dollars off proxy spend (see the cost breakdown later in this guide).
How Instagram's Scraping API Works
Instagram does not provide official public APIs, but their web and mobile apps communicate with backend APIs we can access directly. Instagram uses two API architectures:
- REST API: Simple endpoints for basic data (e.g., `/api/v1/users/web_profile_info/` for profiles)
- GraphQL API: Complex query system for posts, comments, and paginated data
Instagram uses REST APIs for straightforward requests where the data structure is simple, and GraphQL for complex queries involving nested data, filtering, or pagination.
Finding Instagram's Hidden Endpoints
When Instagram updates their platform, endpoints change. Here's how to discover current endpoints when they break:
Step 1: Open Instagram in Chrome/Firefox and open DevTools (F12)
Step 2: Go to Network tab and filter by "Fetch/XHR" to see API calls
Step 3: Navigate Instagram normally (visit a profile, view a post, scroll comments)
Step 4: Watch for API requests to domains like:
- `i.instagram.com/api/v1/` (REST endpoints)
- `www.instagram.com/graphql/query` (GraphQL endpoints)
Step 5: Click on an API request to inspect:
- Request headers (especially `x-ig-app-id`)
- Request payload (for GraphQL, look for `variables` and `doc_id`)
- Response structure (to understand data format)
REST Example: When viewing a profile, you will see a request to:
https://i.instagram.com/api/v1/users/web_profile_info/?username=google
GraphQL Example: When viewing a post, you will see a POST request to:
https://www.instagram.com/graphql/query
With a payload containing doc_id and variables parameters.
Understanding doc_id
The doc_id parameter is critical for GraphQL scraping but poorly understood. Here's what you need to know:
What is doc_id?
- Instagram's internal identifier for specific GraphQL query structures
- Maps to a predefined query on Instagram's backend (you cannot define custom queries)
- Example: `doc_id=8845758582119845` retrieves post details
Why doc_ids exist:
- Performance: Pre-defined queries are optimized and cached on Instagram's servers
- Security: Prevents custom queries that could overload the database
- Anti-scraping: Changing doc_ids regularly breaks scrapers
Why doc_ids change:
- Instagram updates their GraphQL schema every 2-4 weeks
- Changes are a deliberate anti-scraping measure
- No public documentation of current values. You must discover them yourself
How to find current doc_ids:
- Open DevTools → Network tab → filter for "graphql"
- Trigger the action on Instagram (view post, load comments, etc.)
- Inspect the request payload for the `doc_id=` parameter
- Note the numeric value (e.g., `8845758582119845`)
Different operations require different doc_ids:
- Profile posts: `9310670392322965`
- Post details: `8845758582119845`
- Comments pagination: (changes frequently)
- User search: (changes frequently)
The DIY Pain: You must monitor doc_ids manually and update your scraper every time Instagram changes them (every 2-4 weeks). Miss an update and your scraper breaks silently.
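If you do maintain doc_ids yourself, a scheduled health check against a known-good public post catches breakage early. A hypothetical monitoring sketch (the schema key matches the post responses shown later in this guide):
import json

def docid_is_healthy(response_text: str) -> bool:
    """Return True if a GraphQL response still matches the expected post schema"""
    try:
        data = json.loads(response_text)
        # A stale doc_id typically yields an error payload or a missing data key
        return "xdt_shortcode_media" in data.get("data", {})
    except (json.JSONDecodeError, AttributeError):
        return False

# Run this daily against a known-good public post and alert when it returns
# False, so a doc_id rotation doesn't break downstream jobs silently.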
ScrapFly Solution: Our open-source repository is updated within hours of Instagram changes. You pull the latest code and keep scraping with no detective work required.
Required Headers
Headers are not just formalities. Instagram validates them strictly. Here is what you need and why:
Critical Headers:
{
    "x-ig-app-id": "936619743392459",  # Instagram web app identifier
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}
Why each header matters:
- x-ig-app-id: Identifies your request as coming from Instagram's web app (not mobile app or unauthorized client). Wrong value equals instant 403 error.
- User-Agent: Must match a real browser signature. Python's default User-Agent screams "bot" and gets blocked immediately.
- Accept-Language: Instagram tracks inconsistent language preferences across requests. Keep it stable per session.
- Accept-Encoding: Real browsers always accept compression. Omitting this is suspicious.
- Accept: Wildcard is fine, but must be present.
What happens with wrong headers:
- 403 Forbidden: TLS fingerprint or app-id mismatch detected
- 400 Bad Request: Malformed headers or missing required fields
- No response: Your IP was flagged and silently dropped
Header consistency requirement:
Instagram correlates requests within a session. If your User-Agent changes mid-session or headers conflict with your TLS fingerprint, you are flagged instantly.
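A minimal sketch of pinning one header set per session with httpx (values mirror the set above; the truncated User-Agent is a placeholder for a full real-browser string):
import httpx

SESSION_HEADERS = {
    "x-ig-app-id": "936619743392459",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",  # use a full real UA
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}

# Every request made through this client reuses the exact same header set,
# so the session never contradicts itself mid-scrape.
client = httpx.Client(headers=SESSION_HEADERS)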
How to Scrape Instagram Profiles
Instagram profiles contain valuable business intelligence: follower counts, bio information, business contact details, and recent posts. We will use Instagram's REST API endpoint that returns profile data as JSON.
What you can extract:
- Full name, username, user ID, verification status
- Bio text and external links
- Follower and following counts
- Profile picture URL
- Business category, phone number, email (for business accounts)
- First 12 posts with preview data
The approach:
We make a GET request to Instagram's profile API endpoint with the username as a parameter. The response includes complete profile data in JSON format.
ScrapFly's scraper handles:
- Proper header formatting and x-ig-app-id rotation
- Residential proxy rotation to avoid IP blocks
- TLS fingerprint matching to bypass blocking detection
- Automatic retry with exponential backoff on rate limits
Code snippet from the ScrapFly scraper:
from scrapfly import ScrapeConfig, ScrapflyClient
import asyncio
import json

scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
INSTAGRAM_APP_ID = "936619743392459"
BASE_CONFIG = {
    "asp": True,  # Anti Scraping Protection bypass
    "country": "US",  # Use US residential proxies
}

async def scrape_profile(username: str):
    """Scrape Instagram profile data"""
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url=f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
            headers={"x-ig-app-id": INSTAGRAM_APP_ID},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return data["data"]["user"]

# Example usage (await requires an async context)
async def main():
    profile = await scrape_profile("google")
    print(f"Followers: {profile['edge_followed_by']['count']}")

asyncio.run(main())
Key implementation details:
- The `asp=True` parameter activates ScrapFly's anti-blocking bypass (TLS fingerprinting, header rotation)
- Residential proxies (`country="US"`) prevent datacenter IP blocks
- The endpoint returns up to 12 recent posts embedded in the profile response
- Business accounts expose email/phone in `business_email` and `business_phone_number` fields (see the sketch below)
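Building on the profile dict returned above, lead-generation fields can be pulled out like so (a sketch; `business_category_name` is an assumed field name, and the contact fields are absent on non-business accounts):
def extract_business_contact(profile: dict) -> dict:
    """Pull public business contact info from a scraped profile dict"""
    return {
        "username": profile.get("username"),
        "full_name": profile.get("full_name"),
        "email": profile.get("business_email"),  # business accounts only
        "phone": profile.get("business_phone_number"),  # business accounts only
        "category": profile.get("business_category_name"),  # assumed field name
    }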
Full profile scraper code in our repository →
How to Scrape Instagram Posts
Post data includes captions, media URLs, engagement metrics, comments, and tagged users. Instagram uses GraphQL for post queries, requiring proper doc_id values and request formatting.
What you can extract:
- Post shortcode, ID, and timestamp
- Image/video URLs (full resolution)
- Captions and hashtags
- Like counts, view counts (for videos), play counts (for reels)
- First page of comments (with pagination cursor for more)
- Tagged users and location data
- Related posts
The approach:
We send a POST request to Instagram's GraphQL endpoint with a payload containing the post shortcode and the correct doc_id. Instagram returns complete post data including engagement metrics and comments.
ScrapFly's scraper handles:
- Current doc_id values (updated within hours when Instagram changes them)
- GraphQL payload formatting and URL encoding
- Comment pagination for posts with 100+ comments
- Different post types: photos, videos, reels, carousels
Code snippet from the ScrapFly scraper:
from scrapfly import ScrapeConfig, ScrapflyClient
import asyncio
import json
from urllib.parse import quote

scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
INSTAGRAM_POST_DOC_ID = "8845758582119845"  # Updated regularly
BASE_CONFIG = {"asp": True, "country": "US"}

async def scrape_post(url_or_shortcode: str):
    """Scrape single Instagram post data"""
    # Extract shortcode from URL or use directly
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    # Build GraphQL request payload
    variables = quote(json.dumps({
        'shortcode': shortcode,
        'fetch_tagged_user_count': None,
        'hoisted_comment_id': None,
        'hoisted_reply_id': None
    }, separators=(',', ':')))
    body = f"variables={variables}&doc_id={INSTAGRAM_POST_DOC_ID}"
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url="https://www.instagram.com/graphql/query",
            method="POST",
            body=body,
            headers={"content-type": "application/x-www-form-urlencoded"},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return data["data"]["xdt_shortcode_media"]

# Example usage (await requires an async context)
async def main():
    post = await scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/")
    print(f"Likes: {post['edge_media_preview_like']['count']}")

asyncio.run(main())
Key implementation details:
- The shortcode is the unique post identifier (e.g., `CuE2WNQs6vH` from URL `/p/CuE2WNQs6vH/`)
- GraphQL requires URL-encoded JSON in the request body
- The response includes nested structures for comments (`edge_media_to_parent_comment`)
- Carousel posts have multiple images in `edge_sidecar_to_children` (see the sketch below)
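Collecting every media URL from a carousel falls out of the fields named above. A sketch (assumes the post dict returned by scrape_post; `display_url` and `video_url` are the usual media fields but may vary by post type):
def carousel_media_urls(post: dict) -> list:
    """Collect media URLs from a carousel post, falling back to single media"""
    children = post.get("edge_sidecar_to_children", {}).get("edges", [])
    if not children:
        # Not a carousel: single image or video post
        return [post.get("video_url") or post.get("display_url")]
    return [
        edge["node"].get("video_url") or edge["node"].get("display_url")
        for edge in children
    ]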
Full post scraper code with pagination →
How to Scrape Instagram Comments
Comments provide sentiment data, user engagement patterns, and conversation threads. Instagram paginates comments, requiring multiple requests to extract full comment sections.
What you can extract:
- Comment text and timestamp
- Commenter username, profile, verification status
- Like counts per comment
- Nested replies (threaded conversations)
- Pagination cursors for loading more comments
The approach:
Comments are included in the initial post data (first ~12 comments), but posts with hundreds of comments require pagination. We use the end_cursor value from page_info to load subsequent pages through additional GraphQL requests.
ScrapFly's scraper handles:
- Nested pagination (comments and their replies have separate cursors)
- Rate limit respect (Instagram throttles aggressive comment scraping)
- Proper doc_id for comment pagination queries
- Reply threading and parent-child comment relationships
Code snippet for comment pagination:
# Assumes the imports, scrapfly client, and BASE_CONFIG from the post scraper
# above; INSTAGRAM_COMMENTS_DOC_ID must be set to the current comments doc_id
async def scrape_post_comments(shortcode: str, max_comments: int = 100):
    """Scrape comments from Instagram post with pagination"""
    comments = []
    cursor = None
    while len(comments) < max_comments:
        variables = quote(json.dumps({
            'shortcode': shortcode,
            'first': 50,  # Comments per page
            'after': cursor,  # Pagination cursor
        }, separators=(',', ':')))
        body = f"variables={variables}&doc_id={INSTAGRAM_COMMENTS_DOC_ID}"
        result = await scrapfly.async_scrape(
            ScrapeConfig(
                url="https://www.instagram.com/graphql/query",
                method="POST",
                body=body,
                headers={"content-type": "application/x-www-form-urlencoded"},
                **BASE_CONFIG,
            )
        )
        data = json.loads(result.content)
        comment_data = data["data"]["xdt_shortcode_media"]["edge_media_to_parent_comment"]
        # Extract comments from this page
        for edge in comment_data["edges"]:
            comments.append(edge["node"])
        # Check for more comments
        page_info = comment_data["page_info"]
        if not page_info["has_next_page"]:
            break
        cursor = page_info["end_cursor"]
    return comments[:max_comments]
Key implementation details:
- The `first` parameter controls comments per page (max ~50)
- Each comment includes `edge_threaded_comments` for nested replies
- Replies have their own pagination system requiring separate requests (see the sketch after this list)
- The scraper respects Instagram's rate limits by adding delays between pagination requests
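To illustrate the reply threading, a small sketch that flattens the replies already embedded in each comment node (deeper reply pages need their own paginated requests, as noted above):
def first_page_replies(comment_node: dict) -> list:
    """Extract the replies embedded in a comment node's first page"""
    threaded = comment_node.get("edge_threaded_comments", {})
    return [edge["node"] for edge in threaded.get("edges", [])]

# Replies beyond the first page require separate GraphQL requests using the
# cursor in comment_node["edge_threaded_comments"]["page_info"].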
Full comment scraper with reply handling →
How to Scrape Instagram with Proxies
Proxies are mandatory for Instagram scraping at any scale. Instagram's IP quality detection blocks datacenter IPs instantly, and rate limits force you to rotate residential IPs to maintain scraping speed.
Best Proxies for Instagram Scraping (Residential vs Datacenter)
Datacenter Proxies: Do not use
- Blocked instantly by Instagram's IP quality checks
- No request volume is possible; IPs are banned on the first request
- Cheaper per GB, but 100% failure rate makes cost irrelevant
Residential Proxies: Required
- IPs from real consumer ISPs (Comcast, Verizon, AT&T, etc.)
- Pass Instagram's IP quality detection
- Each IP allows ~200 requests/hour before rate limiting
- Geographic targeting (e.g., US-only IPs for US-focused scraping)
Mobile Proxies: Premium Option
- IPs from mobile carriers (4G/5G networks)
- Highest trust score. Instagram rarely blocks mobile IPs
- Better rate limits (~300 requests/hour per IP)
- More expensive ($60-120/month per IP vs $1-3 for residential)
Recommendation: Residential proxies are the sweet spot for Instagram scraping. Mobile proxies offer marginal improvement at 10-20x the cost. Not worth it unless you are scraping millions of profiles daily.
How to Rotate Proxies for Instagram Scraping
Proxy rotation strategies determine your scraping speed and block rate:
Sticky Sessions (Recommended):
- Use the same IP for 5-10 minutes, then rotate
- Mimics real user behavior (one person does not change IPs every 10 seconds)
- Allows ~15-30 requests per IP before rotation
- Instagram's behavioral analysis flags instant IP changes as suspicious
Request-Level Rotation (Aggressive):
- New IP for every single request
- Maximizes speed but looks unnatural to Instagram
- Higher block rate. Use only with anti-bot bypass (like ScrapFly)
- Necessary when scraping 10,000+ profiles/hour
Smart Rotation Based on Response:
- Rotate immediately on 429 (rate limit) or 403 (block)
- Continue using same IP while responses are 200 OK
- Implements exponential backoff: 2s delay → 4s → 8s → 16s before rotating
- Reduces wasted proxy bandwidth
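A minimal sketch of response-driven rotation (the proxy list and fetch helper are placeholders; a production pool also needs health tracking):
import time
import itertools

def smart_rotating_fetch(url, proxies, fetch, max_attempts=5):
    """Keep a proxy while it returns 200; back off and rotate on 429/403"""
    pool = itertools.cycle(proxies)
    proxy = next(pool)
    delay = 2  # seconds; doubles on every block
    for _ in range(max_attempts):
        response = fetch(url, proxy=proxy)  # placeholder HTTP helper
        if response.status_code == 200:
            return response
        if response.status_code in (429, 403):
            time.sleep(delay)
            delay *= 2  # exponential backoff: 2s -> 4s -> 8s -> 16s
            proxy = next(pool)  # rotate to a fresh IP
    raise RuntimeError(f"all {max_attempts} attempts blocked for {url}")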
ScrapFly's automatic proxy management:
- Intelligent rotation using sticky sessions by default
- Instant rotation on rate limits or blocks
- Geographic pinning (keep requests in same country/region)
- Proxy pool health monitoring (removes dead IPs automatically)
Instagram Proxy Costs
Residential proxies are billed per GB of bandwidth consumed. Here's what Instagram scraping costs:
Data usage per request type:
- Profile scrape: ~50-100 KB per profile
- Post scrape: ~30-80 KB per post
- Comment scrape: ~20-50 KB per comment page
Example scraping job: 10,000 Instagram profiles
- 10,000 profiles × 75 KB average = 750 MB
- Standard residential proxy cost: $10-15 per GB
- Total cost: $7.50-11.25 in proxy bandwidth
ScrapFly's Proxy Saver feature:
- Caches static content (profile images, CSS, JavaScript)
- Only uses residential bandwidth for actual API calls
- Reduces bandwidth consumption by 30-50%
- Same 10,000 profile job: $5.25-7.88 with Proxy Saver
- Savings: $2.25-3.37 per 10K profiles (a ~30% reduction in this example)
For serious Instagram scraping (100K+ profiles/month), Proxy Saver saves $50-100+ monthly in proxy costs alone.
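The arithmetic above generalizes into a small estimator you can adapt (a sketch; plug in your own per-GB rate and measured payload sizes):
def proxy_cost_usd(num_requests: int, kb_per_request: float,
                   usd_per_gb: float, saver_discount: float = 0.0) -> float:
    """Estimate residential proxy spend for a scraping job"""
    gb = num_requests * kb_per_request / 1_000_000  # KB -> GB (decimal units)
    return gb * usd_per_gb * (1 - saver_discount)

# 10,000 profiles at ~75 KB each, $10/GB, with a 30% Proxy Saver discount
print(round(proxy_cost_usd(10_000, 75, 10, saver_discount=0.30), 2))  # 5.25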
FAQ
Are there public APIs for Instagram?
No. Instagram discontinued public API access in 2020 and now only offers limited APIs for verified business partners through the Instagram Graph API. However, Instagram's web and mobile apps communicate with internal REST and GraphQL APIs that we can access directly through reverse engineering. These "hidden" APIs provide far more data than the official API ever did.
How to get Instagram user ID from username?
Scrape the user profile using the /api/v1/users/web_profile_info/ endpoint and extract the id field:
import httpx
import json

# Note: a plain httpx request like this is often blocked by TLS fingerprinting
# (see the FAQ below); route it through an anti-bot service for reliability
response = httpx.get(
    "https://i.instagram.com/api/v1/users/web_profile_info/?username=google",
    headers={"x-ig-app-id": "936619743392459"}
)
data = json.loads(response.content)
user_id = data["data"]["user"]["id"]
print(user_id)  # Output: 1067259270
How to get Instagram username from user ID?
Use Instagram's public mobile API endpoint:
import httpx

iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90"

response = httpx.get(
    iphone_api.format("1067259270"),
    headers={"User-Agent": iphone_user_agent}
)
username = response.json()['user']['username']
print(username)  # Output: google
How do I handle Instagram's rate limiting when scraping at scale?
Instagram rate limiting requires a three-part strategy:
Residential proxy rotation: Use 50-100+ residential IPs and rotate them in sticky sessions (5-10 minutes per IP). Each IP allows ~200 requests/hour.
Realistic delays: Space requests 2-5 seconds apart with random variance. Perfect timing intervals look robotic.
Exponential backoff: When you receive a 429 error, back off exponentially (wait 2s, then 4s, then 8s, etc.) before retrying.
ScrapFly handles all three automatically—you specify your desired scraping speed and we manage rate limits, retries, and proxy rotation.
Can I scrape Instagram stories or reels data?
Yes, stories and reels use dedicated GraphQL endpoints with their own doc_id values:
Stories: Ephemeral (24-hour) content requires the user's ID and a stories-specific doc_id. Stories include view counts, replies, and media URLs.
Reels: Similar to post scraping but with video-specific fields (play counts, audio attribution, video duration). Reels use doc_id 25981206651899035 (subject to change).
Both require authentication for some accounts, but public accounts expose this data without login.
How do I extract Instagram comments and engagement metrics?
Comments are nested in post data under edge_media_to_parent_comment. The initial post request returns the first ~12 comments plus a pagination cursor (end_cursor).
To extract all comments:
- Scrape the post to get initial comments and `end_cursor`
- Make paginated GraphQL requests with the cursor to load more
- Extract nested replies from each comment's `edge_threaded_comments`
Engagement metrics are in the post data:
- Likes: `edge_media_preview_like.count`
- Comments: `edge_media_to_parent_comment.count`
- Views (videos): `video_view_count`
- Plays (reels): `video_play_count`
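For instance, a tiny helper that flattens these counters into one dict (a sketch over the post payload returned by the post scraper above):
def engagement_summary(post: dict) -> dict:
    """Flatten the engagement counters from a scraped post payload"""
    return {
        "likes": post.get("edge_media_preview_like", {}).get("count"),
        "comments": post.get("edge_media_to_parent_comment", {}).get("count"),
        "video_views": post.get("video_view_count"),  # videos only
        "reel_plays": post.get("video_play_count"),  # reels only
    }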
What are the most common Instagram scraping challenges?
1. Doc_id changes: Instagram updates GraphQL doc_id values every 2-4 weeks, breaking scrapers. Solution: Monitor our open-source scraper for updates.
2. IP blocks: Datacenter IPs are banned instantly. Solution: Use residential proxies with rotation.
3. TLS fingerprinting: Python libraries have detectable signatures. Solution: Use anti-bot bypass tools like ScrapFly that rotate fingerprints.
4. Rate limits: 200 requests/hour per IP. Solution: Rotate through 50+ residential IPs with sticky sessions.
5. Behavioral detection: Unnatural request patterns get flagged. Solution: Add random delays, mimic real browsing sequences.
Why does my Python Instagram scraper get 403 errors immediately?
This is TLS fingerprinting blocking. Python's requests and httpx libraries have unique TLS handshake signatures that Instagram detects as bots within the first request.
Solutions:
- Use browser automation (Selenium, Playwright), which has real browser fingerprints but is roughly 10x slower
- Use curl_cffi library which mimics Chrome's TLS fingerprint
- Use ScrapFly which rotates TLS fingerprints automatically
Do not waste time trying to fix headers. The TLS handshake happens before HTTP headers are even sent. You need a tool that controls the TLS layer.
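For the curl_cffi route, a minimal sketch (the "chrome" impersonation target is an alias for a recent Chrome build; check the library's docs for currently supported versions):
from curl_cffi import requests

# curl_cffi performs the TLS handshake with Chrome's signature, so the
# request is indistinguishable from a real browser at the TLS layer
response = requests.get(
    "https://i.instagram.com/api/v1/users/web_profile_info/?username=google",
    headers={"x-ig-app-id": "936619743392459"},
    impersonate="chrome",
)
print(response.status_code)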
Summary
Instagram scraping in 2025 requires navigating complex blocking systems: IP quality detection, TLS fingerprinting, rate limiting, and behavioral analysis. Building a scraper from scratch means constant maintenance as Instagram updates doc_ids every 2-4 weeks and evolves their blocking systems weekly.
We covered Instagram's multi-layered blocking system and why manual scrapers fail within hours, how to access hidden REST and GraphQL APIs for profiles, posts, and comments, why residential proxies are mandatory (datacenter IPs blocked instantly), how doc_id parameters work and change every 2-4 weeks, and why Python libraries get blocked immediately due to TLS fingerprinting.
The smart approach: Start with ScrapFly's working Instagram scraper that includes anti-blocking bypass, residential proxies, and automatic updates when Instagram changes. This saves you hundreds of hours in maintenance and debugging.
Legal Disclaimer and Precautions
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect; here's a good summary of what not to do:
- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens who are protected by GDPR.
- Do not repurpose entire public datasets, which can be illegal in some countries.