In this Python web scraping tutorial we'll explore Instagram - one of the biggest social media websites out there. We'll take a look at how to scrape Instagram's user and post data.
We'll also cover some tips and tricks for reaching these endpoints efficiently, avoiding web scraper blocking, and accessing all of this information without having to log in to Instagram. So, let's dive in!
In this web scraping Instagram tutorial, we'll be using Python with httpx - an HTTP client library that will power all of our interactions with Instagram's server.
We'll also be using JMESPath - a JSON parsing library that will help us reduce the giant datasets we get from Instagram to only the most important bits like photo URLs, comments and like counts.
All of these packages can be installed via the pip console command:
$ pip install httpx jmespath
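As a quick primer, here's a minimal JMESPath example (with a made-up sample dictionary for illustration) showing how a search expression reshapes nested JSON - the same technique we'll apply to Instagram's responses later:
import jmespath

# a made-up sample dictionary mimicking an Instagram-style response
data = {"user": {"full_name": "Google", "edge_followed_by": {"count": 13015078}}}
# reshape the nested structure into a flat one:
print(jmespath.search("{name: user.full_name, followers: user.edge_followed_by.count}", data))
# {'name': 'Google', 'followers': 13015078}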
For ScrapFly users, we'll also be including a version of each code snippet using the ScrapFly Python SDK.
Many Instagram endpoints require login, though not all. In this tutorial, we'll only be covering the endpoints that don't require login and are publicly accessible to everyone.
Scraping Instagram through login can have many unintended consequences, from your account being blocked to Instagram taking legal action for explicitly breaking their Terms of Service. As noted in this tutorial, login is often not necessary, so let's take a look at how to scrape Instagram without having to log in and risk suspension.
Let's start with scraping user profiles. For this, we'll be using Instagram's backend API endpoint, which is fired when the browser loads a profile URL. For example, here's Google's Instagram profile page: https://www.instagram.com/google/
This endpoint is called on page load and returns a JSON dataset with all of the user's data. We can use it to scrape Instagram user data without having to log in:
import json
import httpx

client = httpx.Client(
    headers={
        # this is the internal ID of an Instagram backend app. It doesn't change often.
        "x-ig-app-id": "936619743392459",
        # use browser-like features
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "*/*",
    }
)

def scrape_user(username: str):
    """Scrape Instagram user's data"""
    result = client.get(
        f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
    )
    data = json.loads(result.content)
    return data["data"]["user"]

print(scrape_user("google"))
The same scraper using the ScrapFly SDK:
import asyncio
import json
from typing import Dict

from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")

BASE_CONFIG = {
    # Instagram.com requires Anti Scraping Protection bypass feature:
    # for more: https://scrapfly.io/docs/scrape-api/anti-scraping-protection
    "asp": True,
    "country": "CA",
}

INSTAGRAM_APP_ID = "936619743392459"  # this is the public app id for instagram.com

async def scrape_user(username: str) -> Dict:
    """Scrape Instagram user's data"""
    result = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url=f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
            headers={"x-ig-app-id": INSTAGRAM_APP_ID},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return parse_user(data["data"]["user"])  # parse_user is defined in the next section

print(asyncio.run(scrape_user("google")))
This approach will return Instagram user data such as bio description, follower counts, profile pictures etc:
{
"biography": "Google unfiltered—sometimes with filters.",
"external_url": "https://linkin.bio/google",
"external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=ATOaH1Vrx_TkkMUhpCCh1_PM-C1k5t35gAtJ0eBjTPE84RItj-cCFdqRoRHwlbiCSrB5G_v6MgjePl1SQN4vTw&s=1",
"edge_followed_by": {
"count": 13015078
},
"fbid": "17841401778116675",
"edge_follow": {
"count": 33
},
"full_name": "Google",
"highlight_reel_count": 5,
"id": "1067259270",
"is_business_account": true,
"is_professional_account": true,
"is_supervision_enabled": false,
"is_guardian_of_viewer": false,
"is_supervised_by_viewer": false,
"is_embeds_disabled": false,
"is_joined_recently": false,
"guardian_id": null,
"is_verified": true,
"profile_pic_url": "https://instagram.furt1-1.fna.fbcdn.net/v/t51.2885-19/126151620_3420222801423283_6498777152086077438_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.furt1-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=bmDCZ2Q8wTkAX-Ilbqq&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9pRKzLtnysPjhclN6TprCd9FBWo2ABbn9cRICPhbQZcA&oe=62882D44&_nc_sid=7bff83",
"username": "google",
...
}
This is a great, easy method to scrape Instagram profiles - it even includes the details of the user's first 12 posts, including photos and videos!
The user dataset we scraped can be a bit daunting, as it contains a lot of data. To reduce it to the most important bits, we can use JMESPath.
from typing import Dict

import jmespath

def parse_user(data: Dict) -> Dict:
    """Parse Instagram user's hidden web dataset for user's data"""
    result = jmespath.search(
        """{
        name: full_name,
        username: username,
        id: id,
        category: category_name,
        business_category: business_category_name,
        phone: business_phone_number,
        email: business_email,
        bio: biography,
        bio_links: bio_links[].url,
        homepage: external_url,
        followers: edge_followed_by.count,
        follows: edge_follow.count,
        facebook_id: fbid,
        is_private: is_private,
        is_verified: is_verified,
        profile_image: profile_pic_url_hd,
        video_count: edge_felix_video_timeline.count,
        videos: edge_felix_video_timeline.edges[].node.{
            id: id,
            title: title,
            shortcode: shortcode,
            thumb: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            duration: video_duration
        },
        image_count: edge_owner_to_timeline_media.count,
        images: edge_owner_to_timeline_media.edges[].node.{
            id: id,
            title: title,
            shortcode: shortcode,
            src: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            accessibility_caption: accessibility_caption,
            duration: video_duration
        },
        saved_count: edge_saved_media.count,
        collections_count: edge_saved_media.count,
        related_profiles: edge_related_profiles.edges[].node.username
    }""",
        data,
    )
    return result
This function will take in the full dataset and reduce it to a more flat structure that contains only the important fields. We're using JMespath's reshaping feature which allows us to distil the dataset into a new structure.
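For example, we can feed the output of our scrape_user function from the first section straight into this parser (assuming json is imported):
parsed = parse_user(scrape_user("google"))
print(json.dumps(parsed, indent=2, ensure_ascii=False))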
To scrape Instagram post data, we'll be using the same method as before, but this time we'll be using the post endpoint.
To generate post views dynamically Instagram uses a GraphQL backend query which returns post data, comments, likes, and other information. We can use this endpoint to scrape post data.
All Instagram GraphQL endpoints are accessed through:
https://www.instagram.com/graphql/query/?query_hash=<>&variables=<>
Where the query hash and variables define the query functionality. For example, to scrape post data we'll be using the following query hash and variables:
{
"query_hash": "b3055c01b4b222b8a47dc12b090e4e64", # this post query hash which doesn't change
"variables": {
"shortcode": "CQYQ1Y1nZ1Y", # post shortcode (from URL)
# how many and what comments to include
"child_comment_count": 20,
"fetch_comment_count": 100,
"parent_comment_count": 24,
"has_threaded_comments": true
}
}
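To turn these parameters into a request URL, the variables are JSON-encoded and URL-quoted into the query string. A minimal sketch of that assembly:
import json
from urllib.parse import quote

variables = {
    "shortcode": "CQYQ1Y1nZ1Y",
    "child_comment_count": 20,
    "fetch_comment_count": 100,
    "parent_comment_count": 24,
    "has_threaded_comments": True,
}
url = (
    "https://www.instagram.com/graphql/query/"
    "?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables=" + quote(json.dumps(variables))
)
print(url)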
So, to scrape it using Python we'll be using the following code:
import json
from typing import Dict
from urllib.parse import quote

import httpx

INSTAGRAM_APP_ID = "936619743392459"  # this is the public app id for instagram.com

def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")
    variables = {
        "shortcode": shortcode,
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    }
    url = "https://www.instagram.com/graphql/query/?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables="
    result = httpx.get(
        url=url + quote(json.dumps(variables)),
        headers={"x-ig-app-id": INSTAGRAM_APP_ID},
    )
    data = json.loads(result.content)
    return data["data"]["shortcode_media"]

# Example usage:
post = scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/")
print(json.dumps(post, indent=2, ensure_ascii=False))
The same scraper using the ScrapFly SDK:
import json
from typing import Dict
from urllib.parse import quote

from scrapfly import ScrapeConfig, ScrapflyClient

SCRAPFLY = ScrapflyClient("YOUR SCRAPFLY KEY")

async def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")
    variables = {
        "shortcode": shortcode,
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    }
    url = "https://www.instagram.com/graphql/query/?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables="
    result = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url=url + quote(json.dumps(variables)),
            headers={"x-ig-app-id": INSTAGRAM_APP_ID},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return parse_post(data["data"]["shortcode_media"])  # parse_post is defined in the next section
This scraping approach will return the entire post dataset, which includes many useful fields like post captions, comments, likes and other information. However, it also includes many flags and unnecessary fields which are often not very useful. To trim it down, let's take a look at JSON parsing with JMESPath once again.
Instagram post data is even more complex than the user profile data, so we'll reduce it with a new JMESPath query:
from typing import Dict

import jmespath

def parse_post(data: Dict) -> Dict:
    """Reduce Instagram post dataset to the most important fields"""
    result = jmespath.search("""{
        id: id,
        shortcode: shortcode,
        dimensions: dimensions,
        src: display_url,
        src_attached: edge_sidecar_to_children.edges[].node.display_url,
        has_audio: has_audio,
        video_url: video_url,
        views: video_view_count,
        plays: video_play_count,
        likes: edge_media_preview_like.count,
        location: location.name,
        taken_at: taken_at_timestamp,
        related: edge_web_media_to_related_media.edges[].node.shortcode,
        type: product_type,
        video_duration: video_duration,
        music: clips_music_attribution_info,
        is_video: is_video,
        tagged_users: edge_media_to_tagged_user.edges[].node.user.username,
        captions: edge_media_to_caption.edges[].node.text,
        related_profiles: edge_related_profiles.edges[].node.username,
        comments_count: edge_media_to_parent_comment.count,
        comments_disabled: comments_disabled,
        comments_next_page: edge_media_to_parent_comment.page_info.end_cursor,
        comments: edge_media_to_parent_comment.edges[].node.{
            id: id,
            text: text,
            created_at: created_at,
            owner: owner.username,
            owner_verified: owner.is_verified,
            viewer_has_liked: viewer_has_liked,
            likes: edge_liked_by.count
        }
    }""", data)
    return result
Here, just like before, we used JMESPath to extract the most useful data fields from the massive JSON response we've received from our scraper. Note that different post types (reels, images, videos, etc.) have different fields available.
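For example, chaining the httpx-based scrape_post from above with this parser (assuming json is imported):
post = scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/")  # returns the raw shortcode_media dataset
print(json.dumps(parse_post(post), indent=2, ensure_ascii=False))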
To retrieve the user's posts and post comments, we'll be using yet another GraphQL endpoint, which requires three variables: the user's ID (which we got from scraping the user's profile previously), the page size and the page offset cursor:
{
"id": "NUMERIC USER ID",
"first": 12,
"after": "CURSOR ID FOR PAGING"
}
For example, if we would like to retrieve all of the Instagram posts created by Google, we first have to retrieve this user's ID and then compile our GraphQL request.
In Google's example, the GraphQL URL would be:
https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={"id":"1067259270","first":12}
We can try this URL in our browser, and we should see a JSON response with the data of the first 12 posts, including details like post media, captions, likes and comments.
However, to retrieve all posts we need to implement pagination logic, as all of the information is scattered through multiple pages.
import json
from urllib.parse import quote

import httpx

def scrape_user_posts(user_id: str, session: httpx.Client, page_size=12, page_limit: int = None):
    """Scrape all posts of an Instagram user of the given numeric user id"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        resp = session.get(base_url + quote(json.dumps(variables)))
        data = resp.json()
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])  # note: we're using the parse_post function from the previous chapter
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        variables["after"] = page_info["end_cursor"]
        _page_number += 1
        if page_limit and _page_number > page_limit:
            break

# Example run:
if __name__ == "__main__":
    with httpx.Client(
        headers={"x-ig-app-id": "936619743392459"},  # same app id header as before
        timeout=httpx.Timeout(20.0),
    ) as session:
        posts = list(scrape_user_posts("1067259270", session, page_limit=3))
        print(json.dumps(posts, indent=2, ensure_ascii=False))
The same scraper using the ScrapFly SDK:
import json
from typing import Optional
from urllib.parse import quote

from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient("YOUR SCRAPFLY KEY")

async def scrape_user_posts(user_id: str, page_size=50, max_pages: Optional[int] = None):
    """Scrape all posts of an Instagram user of given numeric user id"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        url = base_url + quote(json.dumps(variables))
        result = await SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
        data = json.loads(result.content)
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping posts page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        variables["after"] = page_info["end_cursor"]
        _page_number += 1
        if max_pages and _page_number > max_pages:
            break
Example output:
[
{
"__typename": "GraphImage",
"id": "2890253001563912589",
"dimensions": {
"height": 1080,
"width": 1080
},
"display_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-C93CjLzMapgPHOinoltBXypU_wi7s6zzLj1th-s9p-Q&oe=62E80627&_nc_sid=86f79a",
"display_resources": [
{
"src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
"config_width": 640,
"config_height": 640
},
"..."
],
"is_video": false,
"tracking_token": "eyJ2ZXJzaW9uIjo1LCJwYXlsb2FkIjp7ImlzX2FuYWx5dGljc190cmFja2VkIjp0cnVlLCJ1dWlkIjoiOWJiNzUyMjljMjU2NDExMTliOGI4NzM5MTE2Mjk4MTYyODkwMjUzMDAxNTYzOTEyNTg5In0sInNpZ25hdHVyZSI6IiJ9",
"edge_media_to_tagged_user": {
"edges": [
{
"node": {
"user": {
"full_name": "Jahmar Gale | Data Analyst",
"id": "51661809026",
"is_verified": false,
"profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/284007837_5070066053047326_6283083692098566083_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=KXI8oOdZRb4AX8w28nr&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-4iYsawdTCHI5a2zD_PF9F-WCyKnTIPuvYwVAQo82l_w&oe=62E7609B&_nc_sid=86f79a",
"username": "datajayintech"
},
"x": 0.68611115,
"y": 0.32222223
}
},
"..."
]
},
"accessibility_caption": "A screenshot of a tweet from @DataJayInTech, which says: \"A recruiter just called me and said The Google Data Analytics Certificate is a good look. This post is to encourage YOU to finish the course.\" The background of the image is red with white, yellow, and blue geometric shapes.",
"edge_media_to_caption": {
"edges": [
{
"node": {
"text": "Ring, ring — opportunity is calling📱\nStart your Google Career Certificate journey at the link in bio. #GrowWithGoogle"
}
},
"..."
]
},
"shortcode": "CgcPcqtOTmN",
"edge_media_to_comment": {
"count": 139,
"page_info": {
"has_next_page": true,
"end_cursor": "QVFCaU1FNGZiNktBOWFiTERJdU80dDVwMlNjTE5DWTkwZ0E5NENLU2xLZnFLemw3eTJtcU54ZkVVS2dzYTBKVEppeVpZbkd4dWhQdktubW1QVzJrZXNHbg=="
},
"edges": [
{
"node": {
"id": "18209382946080093",
"text": "@google your company is garbage for meddling with supposedly fair elections...you have been exposed",
"created_at": 1658867672,
"did_report_as_spam": false,
"owner": {
"id": "39246725285",
"is_verified": false,
"profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/115823005_750712482350308_4191423925707982372_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=104&_nc_ohc=4iOCWDHJLFAAX-JFPh7&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9sH7npBTmHN01BndUhYVreHOk63OqZ5ISJlzNou3QD8A&oe=62E87360&_nc_sid=86f79a",
"username": "bud_mcgrowin"
},
"viewer_has_liked": false
}
},
"..."
]
},
"edge_media_to_sponsor_user": {
"edges": []
},
"comments_disabled": false,
"taken_at_timestamp": 1658765028,
"edge_media_preview_like": {
"count": 9251,
"edges": []
},
"gating_info": null,
"fact_check_overall_rating": null,
"fact_check_information": null,
"media_preview": "ACoqbj8KkijDnBOfpU1tAkis8mcL2H0zU8EMEqh1Dc56H0/KublclpoejKoo3WtylMgQ4HeohW0LKJ+u7PueaX+z4v8Aa/OmoNJJ6kqtG3UxT0pta9xZRxxswzkDjJrIoatuawkpq6NXTvuN9f6VdDFeAMAdsf8A16oWDKFYMQMnuR6e9Xd8f94fmtax2OGqnzsk3n/I/wDsqN7f5H/2VR74/wC8PzWlEkY7g/iv+NVcys+wy5JML59P89zWDW3dSx+UwGMnjjH9KxKynud1BWi79wpQM+g+tJRUHQO2+4pCuO4pKKAFFHP+RSUUgP/Z",
"owner": {
"id": "1067259270",
"username": "google"
},
"location": null,
"viewer_has_liked": false,
"viewer_has_saved": false,
"viewer_has_saved_to_collection": false,
"viewer_in_photo_of_you": false,
"viewer_can_reshare": true,
"thumbnail_src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
"thumbnail_resources": [
{
"src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9nmASHsbmNWUQnwOdkGE4PvE8b27MqK-gbj5z0YLu8qg&oe=62E80627&_nc_sid=86f79a",
"config_width": 150,
"config_height": 150
},
"..."
]
},
...
]
Now that we can scrape all user posts, we can try out a popular analytics exercise: scrape all posts and extract hashtag mentions.
For this, let's scrape all posts, extract mentioned hashtags from the post description and count everything up:
import re
from collections import Counter

import httpx

def scrape_hashtag_mentions(user_id, session: httpx.Client, page_limit: int = None):
    """find all hashtags a user mentioned in their posts"""
    hashtags = Counter()
    hashtag_pattern = re.compile(r"#(\w+)")
    for post in scrape_user_posts(user_id, session=session, page_limit=page_limit):
        desc = "\n".join(post["captions"])
        found = hashtag_pattern.findall(desc)
        for tag in found:
            hashtags[tag] += 1
    return hashtags
import json

import httpx

if __name__ == "__main__":
    with httpx.Client(
        headers={"x-ig-app-id": "936619743392459"},  # same app id header as before
        timeout=httpx.Timeout(20.0),
    ) as session:
        # if we only know the username but not the user id, we can scrape
        # the user profile to find the id:
        user_id = scrape_user("google")["id"]  # will result in: 1067259270
        # then we can count the hashtags mentioned in the user's posts:
        hashtags = scrape_hashtag_mentions(user_id, session, page_limit=5)
        # order results and print them as JSON:
        print(json.dumps(dict(hashtags.most_common()), indent=2, ensure_ascii=False))
Example output:
{
"MadeByGoogle": 10,
"TeamPixel": 5,
"GrowWithGoogle": 4,
"Pixel7": 3,
"LifeAtGoogle": 3,
"SaferWithGoogle": 3,
"Pixel6a": 3,
"DoodleForGoogle": 2,
"MySuperG": 2,
"ShotOnPixel": 1,
"DayInTheLife": 1,
"DITL": 1,
"GoogleAustin": 1,
"Austin": 1,
"NestWifi": 1,
"NestDoorbell": 1,
"GoogleATAPAmbientExperiments": 1,
"GoogleATAPxKOCHE": 1,
"SoliATAP": 1,
"GooglePixelWatch": 1,
"Chromecast": 1,
"DooglersAroundTheWorld": 1,
"GoogleSearch": 1,
"GoogleSingapore": 1,
"InternationalDogDay": 1,
"Doogler": 1,
"BlackBusinessMonth": 1,
"PixelBuds": 1,
"HowTo": 1,
"Privacy": 1,
"Settings": 1,
"GoogleDoodle": 1,
"NationalInternDay": 1,
"GoogleInterns": 1,
"Sushi": 1,
"StopMotion": 1,
"LetsInternetBetter": 1
}
With this simple analytics script, we've collected profile hashtags that we can use to determine the interests of any public Instagram account.
Scraping Instagram seems easy, though unfortunately Instagram has started restricting public access to its data - often allowing only a few requests per hour and requiring a login for anything more.
To get around this, let's take advantage of the ScrapFly API, which offers several powerful features that'll help us get around Instagram's blocking, such as the anti scraping protection bypass and proxy geotargeting we used in the ScrapFly snippets above.
For this, we'll be using the scrapfly-sdk Python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our Instagram web scraper, all we need to do is replace our httpx requests with scrapfly-sdk requests, as shown in the sketch below. For more, see our up-to-date and maintained implementation of the Instagram scraper.
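As an illustration, here's a minimal sketch of that swap, reusing the profile endpoint from earlier (this uses the synchronous scrape call; the async variant was demonstrated in the snippets above):
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = scrapfly.scrape(ScrapeConfig(
    # the same hidden API endpoint we called with httpx earlier:
    url="https://i.instagram.com/api/v1/users/web_profile_info/?username=google",
    headers={"x-ig-app-id": "936619743392459"},
    asp=True,      # Anti Scraping Protection bypass
    country="CA",  # proxy country
))
print(result.content)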
To wrap this guide up, let's take a look at some frequently asked questions about web scraping instagram.com:
Is it legal to scrape Instagram?
Yes. Instagram's data is publicly available, so scraping instagram.com at slow, respectful rates falls under the ethical scraping definition. However, when working with personal data we need to be aware of local copyright and user data laws, like GDPR in the EU. For more, see our Is Web Scraping Legal? article.
How to get the user ID from a username?
To get the private user ID from the public username, we can scrape the user profile using our scrape_user function; the private ID will be located in the id field:
user_id = scrape_user("google")["id"]
print(user_id)  # 1067259270
How to get the username from a user ID?
To get the public username from Instagram's private user ID, we can take advantage of the public iPhone API: https://i.instagram.com/api/v1/users/<USER_ID>/info/
import httpx
iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90 (iPhone9,4; iOS 10_3_3; en_US; en-US; scale=2.61; gamut=wide; 1080x1920)"
resp = httpx.get(iphone_api.format("1067259270"), headers={"User-Agent": iphone_user_agent})
print(resp.json()['user']['username'])
What happened to the ?__a=1 endpoint?
Instagram has been rolling out new changes and slowly retiring this feature. However, in this article we've covered two alternatives to the ?__a=1 feature: the /v1/ API endpoints and the GraphQL endpoints, which perform even better!
In this Instagram scraping tutorial, we've taken a look at how to easily scrape Instagram using Python and hidden API endpoints.
We've scraped the user profile page, which contains user details, posts and meta information, as well as each individual post's data. To reduce the scraped datasets, we used the JMESPath JSON parsing library.
Finally, to start scaling the scraper up, we took a look at how to avoid scraper blocking using ScrapFly's anti scraping protection bypass. For more on ScrapFly, see our documentation and try it out for free!