In this Python web scraping tutorial we'll explore Instagram - one of the biggest social media websites out there. We'll take a look at how to scrape Instagram's user and post data.
We'll also cover some tips and tricks for reaching these endpoints efficiently, avoiding web scraper blocking, and accessing all of this information without having to log in to Instagram. So, let's dive in!
In this web scraping Instagram tutorial, we'll be using Python with httpx - an HTTP client library that will power all of our interactions with Instagram's server.
We'll also be using JMESPath - a JSON parsing library that will help us reduce the giant datasets we get from Instagram to only the most important bits like photo URLs, comments and like counts.
All of these packages can be installed via the pip console command:
$ pip install httpx jmespath
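As a quick primer, here's a minimal JMESPath example (with a made-up sample dictionary for illustration) showing how a search expression reshapes nested JSON - the same technique we'll apply to Instagram's responses later:
import jmespath

# a made-up sample dictionary mimicking an Instagram-style response
data = {"user": {"full_name": "Google", "edge_followed_by": {"count": 13015078}}}
# reshape the nested structure into a flat one:
print(jmespath.search("{name: user.full_name, followers: user.edge_followed_by.count}", data))
# {'name': 'Google', 'followers': 13015078}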
For ScrapFly users, we'll also be including a version of each code snippet using the ScrapFly Python SDK.
Many Instagram endpoints require login, though not all. In this tutorial, we'll only be covering the endpoints that don't require login and are publicly accessible to everyone.
Scraping Instagram through login can have many unintended consequences, from your account being blocked to Instagram taking legal action for explicitly breaking their Terms of Service. As noted in this tutorial, login is often not necessary, so let's take a look at how to scrape Instagram without having to log in and risk suspension.
Let's start with scraping user profiles. For this, we'll be using Instagram's backend API endpoint, which is fired when the browser loads a profile URL. For example, here's Google's Instagram profile page: https://www.instagram.com/google/
This endpoint is called on page load and returns a JSON dataset with all of the user's data. We can use it to scrape Instagram user data without having to log in:
import json
import httpx

client = httpx.Client(
    headers={
        # this is the internal ID of an Instagram backend app. It doesn't change often.
        "x-ig-app-id": "936619743392459",
        # use browser-like features
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "*/*",
    }
)

def scrape_user(username: str):
    """Scrape Instagram user's data"""
    result = client.get(
        f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
    )
    data = json.loads(result.content)
    return data["data"]["user"]

print(scrape_user("google"))
The same scraper using the ScrapFly SDK:
import asyncio
import json
from typing import Dict

from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")

BASE_CONFIG = {
    # Instagram.com requires Anti Scraping Protection bypass feature:
    # for more: https://scrapfly.io/docs/scrape-api/anti-scraping-protection
    "asp": True,
    "country": "CA",
}

INSTAGRAM_APP_ID = "936619743392459"  # this is the public app id for instagram.com

async def scrape_user(username: str) -> Dict:
    """Scrape Instagram user's data"""
    result = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url=f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
            headers={"x-ig-app-id": INSTAGRAM_APP_ID},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return parse_user(data["data"]["user"])  # parse_user is defined in the next section

print(asyncio.run(scrape_user("google")))
This approach will return Instagram user data such as bio description, follower counts, profile pictures etc:
{
"biography": "Google unfiltered—sometimes with filters.",
"external_url": "https://linkin.bio/google",
"external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=ATOaH1Vrx_TkkMUhpCCh1_PM-C1k5t35gAtJ0eBjTPE84RItj-cCFdqRoRHwlbiCSrB5G_v6MgjePl1SQN4vTw&s=1",
"edge_followed_by": {
"count": 13015078
},
"fbid": "17841401778116675",
"edge_follow": {
"count": 33
},
"full_name": "Google",
"highlight_reel_count": 5,
"id": "1067259270",
"is_business_account": true,
"is_professional_account": true,
"is_supervision_enabled": false,
"is_guardian_of_viewer": false,
"is_supervised_by_viewer": false,
"is_embeds_disabled": false,
"is_joined_recently": false,
"guardian_id": null,
"is_verified": true,
"profile_pic_url": "https://instagram.furt1-1.fna.fbcdn.net/v/t51.2885-19/126151620_3420222801423283_6498777152086077438_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.furt1-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=bmDCZ2Q8wTkAX-Ilbqq&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9pRKzLtnysPjhclN6TprCd9FBWo2ABbn9cRICPhbQZcA&oe=62882D44&_nc_sid=7bff83",
"username": "google",
...
}
This is a great, easy method to scrape Instagram profiles - it even includes the details of the user's first 12 posts, including photos and videos!
The user dataset we scraped can be a bit daunting, as it contains a lot of data. To reduce it to the most important bits, we can use JMESPath.
from typing import Dict

import jmespath

def parse_user(data: Dict) -> Dict:
    """Parse Instagram user's hidden web dataset for user's data"""
    result = jmespath.search(
        """{
        name: full_name,
        username: username,
        id: id,
        category: category_name,
        business_category: business_category_name,
        phone: business_phone_number,
        email: business_email,
        bio: biography,
        bio_links: bio_links[].url,
        homepage: external_url,
        followers: edge_followed_by.count,
        follows: edge_follow.count,
        facebook_id: fbid,
        is_private: is_private,
        is_verified: is_verified,
        profile_image: profile_pic_url_hd,
        video_count: edge_felix_video_timeline.count,
        videos: edge_felix_video_timeline.edges[].node.{
            id: id,
            title: title,
            shortcode: shortcode,
            thumb: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            duration: video_duration
        },
        image_count: edge_owner_to_timeline_media.count,
        images: edge_owner_to_timeline_media.edges[].node.{
            id: id,
            title: title,
            shortcode: shortcode,
            src: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            accessibility_caption: accessibility_caption,
            duration: video_duration
        },
        saved_count: edge_saved_media.count,
        collections_count: edge_saved_media.count,
        related_profiles: edge_related_profiles.edges[].node.username
    }""",
        data,
    )
    return result
This function will take in the full dataset and reduce it to a more flat structure that contains only the important fields. We're using JMespath's reshaping feature which allows us to distil the dataset into a new structure.
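For example, we can feed the output of our scrape_user function from the first section straight into this parser (assuming json is imported):
parsed = parse_user(scrape_user("google"))
print(json.dumps(parsed, indent=2, ensure_ascii=False))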
To scrape Instagram post data, we'll be using the same method as before, but this time we'll be using the post endpoint.
To generate post views dynamically Instagram uses a GraphQL backend query which returns post data, comments, likes, and other information. We can use this endpoint to scrape post data.
All Instagram GraphQL endpoints are accessed through:
https://www.instagram.com/graphql/query/?query_hash=<>&variables=<>
Where the query hash and variables define the query functionality. For example, to scrape post data we'll be using the following query hash and variables:
{
"query_hash": "b3055c01b4b222b8a47dc12b090e4e64", # this post query hash which doesn't change
"variables": {
"shortcode": "CQYQ1Y1nZ1Y", # post shortcode (from URL)
# how many and what comments to include
"child_comment_count": 20,
"fetch_comment_count": 100,
"parent_comment_count": 24,
"has_threaded_comments": true
}
}
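To turn these parameters into a request URL, the variables are JSON-encoded and URL-quoted into the query string. A minimal sketch of that assembly:
import json
from urllib.parse import quote

variables = {
    "shortcode": "CQYQ1Y1nZ1Y",
    "child_comment_count": 20,
    "fetch_comment_count": 100,
    "parent_comment_count": 24,
    "has_threaded_comments": True,
}
url = (
    "https://www.instagram.com/graphql/query/"
    "?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables=" + quote(json.dumps(variables))
)
print(url)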
So, to scrape it using Python we'll be using the following code:
import json
from typing import Dict
from urllib.parse import quote

import httpx

INSTAGRAM_APP_ID = "936619743392459"  # this is the public app id for instagram.com

def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")
    variables = {
        "shortcode": shortcode,
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    }
    url = "https://www.instagram.com/graphql/query/?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables="
    result = httpx.get(
        url=url + quote(json.dumps(variables)),
        headers={"x-ig-app-id": INSTAGRAM_APP_ID},
    )
    data = json.loads(result.content)
    return data["data"]["shortcode_media"]

# Example usage:
post = scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/")
print(json.dumps(post, indent=2, ensure_ascii=False))
The same scraper using the ScrapFly SDK:
import json
from typing import Dict
from urllib.parse import quote

from scrapfly import ScrapeConfig, ScrapflyClient

SCRAPFLY = ScrapflyClient("YOUR SCRAPFLY KEY")

async def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")
    variables = {
        "shortcode": shortcode,
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    }
    url = "https://www.instagram.com/graphql/query/?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables="
    result = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url=url + quote(json.dumps(variables)),
            headers={"x-ig-app-id": INSTAGRAM_APP_ID},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return parse_post(data["data"]["shortcode_media"])  # parse_post is defined in the next section
This scraping approach will return the entire post dataset, which includes many useful fields like post captions, comments, likes and other information. However, it also includes many flags and unnecessary fields which are often not very useful. To trim it down, let's take a look at JSON parsing with JMESPath once again.
Instagram post data is even more complex than the user profile data, so we'll reduce it with a new JMESPath query:
from typing import Dict

import jmespath

def parse_post(data: Dict) -> Dict:
    """Reduce Instagram post dataset to the most important fields"""
    result = jmespath.search("""{
        id: id,
        shortcode: shortcode,
        dimensions: dimensions,
        src: display_url,
        src_attached: edge_sidecar_to_children.edges[].node.display_url,
        has_audio: has_audio,
        video_url: video_url,
        views: video_view_count,
        plays: video_play_count,
        likes: edge_media_preview_like.count,
        location: location.name,
        taken_at: taken_at_timestamp,
        related: edge_web_media_to_related_media.edges[].node.shortcode,
        type: product_type,
        video_duration: video_duration,
        music: clips_music_attribution_info,
        is_video: is_video,
        tagged_users: edge_media_to_tagged_user.edges[].node.user.username,
        captions: edge_media_to_caption.edges[].node.text,
        related_profiles: edge_related_profiles.edges[].node.username,
        comments_count: edge_media_to_parent_comment.count,
        comments_disabled: comments_disabled,
        comments_next_page: edge_media_to_parent_comment.page_info.end_cursor,
        comments: edge_media_to_parent_comment.edges[].node.{
            id: id,
            text: text,
            created_at: created_at,
            owner: owner.username,
            owner_verified: owner.is_verified,
            viewer_has_liked: viewer_has_liked,
            likes: edge_liked_by.count
        }
    }""", data)
    return result
Here, just like before, we used JMESPath to extract the most useful data fields from the massive JSON response we've received from our scraper. Note that different post types (reels, images, videos, etc.) have different fields available.
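For example, chaining the httpx-based scrape_post from above with this parser (assuming json is imported):
post = scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/")  # returns the raw shortcode_media dataset
print(json.dumps(parse_post(post), indent=2, ensure_ascii=False))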
To retrieve the user's posts and post comments, we'll be using yet another GraphQL endpoint, which requires three variables: the user's ID (which we got from scraping the user's profile previously), the page size and the page offset cursor:
{
"id": "NUMERIC USER ID",
"first": 12,
"after": "CURSOR ID FOR PAGING"
}
For example, if we would like to retrieve all of the Instagram posts created by Google, we first have to retrieve this user's ID and then compile our GraphQL request.
In Google's example, the GraphQL URL would be:
https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={"id":"1067259270","first":12}
We can try this URL in our browser, and we should see a JSON response with the data of the first 12 posts, including details like post media, captions, likes and comments.
However, to retrieve all posts we need to implement pagination logic, as all of the information is scattered through multiple pages.
import json
from urllib.parse import quote

import httpx

def scrape_user_posts(user_id: str, session: httpx.Client, page_size=12, page_limit: int = None):
    """Scrape all posts of an Instagram user of the given numeric user id"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        resp = session.get(base_url + quote(json.dumps(variables)))
        data = resp.json()
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])  # note: we're using the parse_post function from the previous chapter
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        variables["after"] = page_info["end_cursor"]
        _page_number += 1
        if page_limit and _page_number > page_limit:
            break

# Example run:
if __name__ == "__main__":
    with httpx.Client(
        headers={"x-ig-app-id": "936619743392459"},  # same app id header as before
        timeout=httpx.Timeout(20.0),
    ) as session:
        posts = list(scrape_user_posts("1067259270", session, page_limit=3))
        print(json.dumps(posts, indent=2, ensure_ascii=False))
The same scraper using the ScrapFly SDK:
import json
from typing import Optional
from urllib.parse import quote

from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient("YOUR SCRAPFLY KEY")

async def scrape_user_posts(user_id: str, page_size=50, max_pages: Optional[int] = None):
    """Scrape all posts of an Instagram user of given numeric user id"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        url = base_url + quote(json.dumps(variables))
        result = await SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
        data = json.loads(result.content)
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping posts page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        variables["after"] = page_info["end_cursor"]
        _page_number += 1
        if max_pages and _page_number > max_pages:
            break
Example output:
[
{
"__typename": "GraphImage",
"id": "2890253001563912589",
"dimensions": {
"height": 1080,
"width": 1080
},
"display_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-C93CjLzMapgPHOinoltBXypU_wi7s6zzLj1th-s9p-Q&oe=62E80627&_nc_sid=86f79a",
"display_resources": [
{
"src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
"config_width": 640,
"config_height": 640
},
"..."
],
"is_video": false,
"tracking_token": "eyJ2ZXJzaW9uIjo1LCJwYXlsb2FkIjp7ImlzX2FuYWx5dGljc190cmFja2VkIjp0cnVlLCJ1dWlkIjoiOWJiNzUyMjljMjU2NDExMTliOGI4NzM5MTE2Mjk4MTYyODkwMjUzMDAxNTYzOTEyNTg5In0sInNpZ25hdHVyZSI6IiJ9",
"edge_media_to_tagged_user": {
"edges": [
{
"node": {
"user": {
"full_name": "Jahmar Gale | Data Analyst",
"id": "51661809026",
"is_verified": false,
"profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/284007837_5070066053047326_6283083692098566083_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=KXI8oOdZRb4AX8w28nr&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-4iYsawdTCHI5a2zD_PF9F-WCyKnTIPuvYwVAQo82l_w&oe=62E7609B&_nc_sid=86f79a",
"username": "datajayintech"
},
"x": 0.68611115,
"y": 0.32222223
}
},
"..."
]
},
"accessibility_caption": "A screenshot of a tweet from @DataJayInTech, which says: \"A recruiter just called me and said The Google Data Analytics Certificate is a good look. This post is to encourage YOU to finish the course.\" The background of the image is red with white, yellow, and blue geometric shapes.",
"edge_media_to_caption": {
"edges": [
{
"node": {
"text": "Ring, ring — opportunity is calling📱\nStart your Google Career Certificate journey at the link in bio. #GrowWithGoogle"
}
},
"..."
]
},
"shortcode": "CgcPcqtOTmN",
"edge_media_to_comment": {
"count": 139,
"page_info": {
"has_next_page": true,
"end_cursor": "QVFCaU1FNGZiNktBOWFiTERJdU80dDVwMlNjTE5DWTkwZ0E5NENLU2xLZnFLemw3eTJtcU54ZkVVS2dzYTBKVEppeVpZbkd4dWhQdktubW1QVzJrZXNHbg=="
},
"edges": [
{
"node": {
"id": "18209382946080093",
"text": "@google your company is garbage for meddling with supposedly fair elections...you have been exposed",
"created_at": 1658867672,
"did_report_as_spam": false,
"owner": {
"id": "39246725285",
"is_verified": false,
"profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/115823005_750712482350308_4191423925707982372_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=104&_nc_ohc=4iOCWDHJLFAAX-JFPh7&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9sH7npBTmHN01BndUhYVreHOk63OqZ5ISJlzNou3QD8A&oe=62E87360&_nc_sid=86f79a",
"username": "bud_mcgrowin"
},
"viewer_has_liked": false
}
},
"..."
]
},
"edge_media_to_sponsor_user": {
"edges": []
},
"comments_disabled": false,
"taken_at_timestamp": 1658765028,
"edge_media_preview_like": {
"count": 9251,
"edges": []
},
"gating_info": null,
"fact_check_overall_rating": null,
"fact_check_information": null,
"media_preview": "ACoqbj8KkijDnBOfpU1tAkis8mcL2H0zU8EMEqh1Dc56H0/KublclpoejKoo3WtylMgQ4HeohW0LKJ+u7PueaX+z4v8Aa/OmoNJJ6kqtG3UxT0pta9xZRxxswzkDjJrIoatuawkpq6NXTvuN9f6VdDFeAMAdsf8A16oWDKFYMQMnuR6e9Xd8f94fmtax2OGqnzsk3n/I/wDsqN7f5H/2VR74/wC8PzWlEkY7g/iv+NVcys+wy5JML59P89zWDW3dSx+UwGMnjjH9KxKynud1BWi79wpQM+g+tJRUHQO2+4pCuO4pKKAFFHP+RSUUgP/Z",
"owner": {
"id": "1067259270",
"username": "google"
},
"location": null,
"viewer_has_liked": false,
"viewer_has_saved": false,
"viewer_has_saved_to_collection": false,
"viewer_in_photo_of_you": false,
"viewer_can_reshare": true,
"thumbnail_src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
"thumbnail_resources": [
{
"src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9nmASHsbmNWUQnwOdkGE4PvE8b27MqK-gbj5z0YLu8qg&oe=62E80627&_nc_sid=86f79a",
"config_width": 150,
"config_height": 150
},
"..."
]
},
...
]
Now that we can scrape all user posts, we can try out a popular analytics exercise: scrape all posts and extract hashtag mentions.
For this, let's scrape all posts, extract mentioned hashtags from the post description and count everything up:
import re
from collections import Counter

import httpx

def scrape_hashtag_mentions(user_id, session: httpx.Client, page_limit: int = None):
    """find all hashtags a user mentioned in their posts"""
    hashtags = Counter()
    hashtag_pattern = re.compile(r"#(\w+)")
    for post in scrape_user_posts(user_id, session=session, page_limit=page_limit):
        desc = "\n".join(post["captions"])
        found = hashtag_pattern.findall(desc)
        for tag in found:
            hashtags[tag] += 1
    return hashtags
import json

import httpx

if __name__ == "__main__":
    with httpx.Client(
        headers={"x-ig-app-id": "936619743392459"},  # same app id header as before
        timeout=httpx.Timeout(20.0),
    ) as session:
        # if we only know the username but not the user id, we can scrape
        # the user profile to find the id:
        user_id = scrape_user("google")["id"]  # will result in: 1067259270
        # then we can count the hashtags mentioned in the user's posts:
        hashtags = scrape_hashtag_mentions(user_id, session, page_limit=5)
        # order results and print them as JSON:
        print(json.dumps(dict(hashtags.most_common()), indent=2, ensure_ascii=False))
Example output:
{
"MadeByGoogle": 10,
"TeamPixel": 5,
"GrowWithGoogle": 4,
"Pixel7": 3,
"LifeAtGoogle": 3,
"SaferWithGoogle": 3,
"Pixel6a": 3,
"DoodleForGoogle": 2,
"MySuperG": 2,
"ShotOnPixel": 1,
"DayInTheLife": 1,
"DITL": 1,
"GoogleAustin": 1,
"Austin": 1,
"NestWifi": 1,
"NestDoorbell": 1,
"GoogleATAPAmbientExperiments": 1,
"GoogleATAPxKOCHE": 1,
"SoliATAP": 1,
"GooglePixelWatch": 1,
"Chromecast": 1,
"DooglersAroundTheWorld": 1,
"GoogleSearch": 1,
"GoogleSingapore": 1,
"InternationalDogDay": 1,
"Doogler": 1,
"BlackBusinessMonth": 1,
"PixelBuds": 1,
"HowTo": 1,
"Privacy": 1,
"Settings": 1,
"GoogleDoodle": 1,
"NationalInternDay": 1,
"GoogleInterns": 1,
"Sushi": 1,
"StopMotion": 1,
"LetsInternetBetter": 1
}
With this simple analytics script, we've collected profile hashtags that we can use to determine the interests of any public Instagram account.
Scraping Instagram seems easy, though unfortunately Instagram has started restricting public access to its data - often allowing only a few requests per hour and requiring a login for anything more.
To get around this, let's take advantage of the ScrapFly API, which offers several powerful features that'll help us get around Instagram's blocking, such as the anti scraping protection bypass and proxy geotargeting we used in the ScrapFly snippets above.
For this, we'll be using the scrapfly-sdk Python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our Instagram web scraper, all we need to do is replace our httpx requests with scrapfly-sdk requests, as shown in the sketch below. For more, see our up-to-date and maintained implementation of the Instagram scraper.
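As an illustration, here's a minimal sketch of that swap, reusing the profile endpoint from earlier (this uses the synchronous scrape call; the async variant was demonstrated in the snippets above):
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = scrapfly.scrape(ScrapeConfig(
    # the same hidden API endpoint we called with httpx earlier:
    url="https://i.instagram.com/api/v1/users/web_profile_info/?username=google",
    headers={"x-ig-app-id": "936619743392459"},
    asp=True,      # Anti Scraping Protection bypass
    country="CA",  # proxy country
))
print(result.content)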
To wrap this guide up, let's take a look at some frequently asked questions about web scraping instagram.com:
Is it legal to scrape Instagram?
Yes. Instagram's data is publicly available, so scraping instagram.com at slow, respectful rates falls under the ethical scraping definition. However, when working with personal data we need to be aware of local copyright and user data laws, like GDPR in the EU. For more, see our Is Web Scraping Legal? article.
How to get the user ID from a username?
To get the private user ID from the public username, we can scrape the user profile using our scrape_user function; the private ID will be located in the id field:
user_id = scrape_user("google")["id"]
print(user_id)  # 1067259270
How to get the username from a user ID?
To get the public username from Instagram's private user ID, we can take advantage of the public iPhone API: https://i.instagram.com/api/v1/users/<USER_ID>/info/
import httpx
iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90 (iPhone9,4; iOS 10_3_3; en_US; en-US; scale=2.61; gamut=wide; 1080x1920)"
resp = httpx.get(iphone_api.format("1067259270"), headers={"User-Agent": iphone_user_agent})
print(resp.json()['user']['username'])
What happened to the ?__a=1 endpoint?
Instagram has been rolling out new changes and slowly retiring this feature. However, in this article we've covered two alternatives to the ?__a=1 feature: the /v1/ API endpoints and the GraphQL endpoints, which perform even better!
In this Instagram scraping tutorial, we've taken a look at how to easily scrape Instagram using Python and hidden API endpoints.
We've scraped the user profile page, which contains user details, posts and meta information, as well as each individual post's data. To reduce the scraped datasets, we used the JMESPath JSON parsing library.
Finally, to start scaling the scraper up, we took a look at how to avoid scraper blocking using ScrapFly's anti scraping protection bypass. For more on ScrapFly, see our documentation and try it out for free!