How to Scrape Instagram


In this Python web scraping tutorial we'll explore how to scrape Instagram - one of the biggest social media websites out there. We'll take a look at how to create an Instagram data scraper for user and post data.

We'll also cover some tips and tricks for reaching these endpoints efficiently and for avoiding Instagram's scraper blocking, all while accessing this information without having to log in to Instagram. So, let's dive in!

Latest Instagram Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Project Setup

In this web scraping Instagram tutorial, we'll be using Python with the HTTP client library httpx, which will power all of our interactions with Instagram's server.

We'll also be using the JMESPath JSON parsing library, which will help us reduce the giant datasets we get from Instagram to only the most important bits like image URLs, comments and like counts.

We can install all of these packages via the pip console command:

$ pip install httpx jmespath

For ScrapFly users, we'll also be including a version of each code snippet using the ScrapFly Python SDK.

Note - Login Requirement

Many Instagram endpoints require login, though not all of them. Our Instagram data scraper will only cover endpoints that don't require login and are publicly accessible to everyone.

Scraping Instagram through a logged-in account can have many unintended consequences, from your account being blocked to Instagram taking legal action for explicitly breaking its Terms of Service. Login is often unnecessary, so let's take a look at how to scrape Instagram without logging in and risking suspension.

How to Scrape Instagram User Data

Let's start with scraping user profiles. For this, we'll be using Instagram's backend API endpoint, which fires when the browser loads a profile URL. For example, here's Google's Instagram profile page:

preview screenshot of Google's instagram profile summary
Google's Instagram page

This endpoint is called on page load and returns a JSON dataset with all of the user's data. We can use it to scrape Instagram user data without having to log in:

Python
ScrapFly
import json
import httpx

client = httpx.Client(
    headers={
        # this is the internal ID of an Instagram backend app. It doesn't change often.
        "x-ig-app-id": "936619743392459",
        # use browser-like features
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "*/*",
    }
)


def scrape_user(username: str):
    """Scrape Instagram user's data"""
    result = client.get(
        f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
    )
    data = json.loads(result.content)
    return data["data"]["user"]

print(scrape_user("google"))
import asyncio
import json
from typing import Dict
from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")
BASE_CONFIG = {
    # Instagram.com requires Anti Scraping Protection bypass feature:
    # for more: https://scrapfly.io/docs/scrape-api/anti-scraping-protection
    "asp": True,
    "country": "CA",
}
INSTAGRAM_APP_ID = "936619743392459"  # this is the public app id for instagram.com


async def scrape_user(username: str) -> Dict:
    """Scrape instagram user's data"""
    result = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url=f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
            headers={"x-ig-app-id": INSTAGRAM_APP_ID},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return parse_user(data["data"]["user"])  # parse_user is defined in the parsing section below


print(asyncio.run(scrape_user("google")))
Example Output

This approach will return Instagram user data such as bio description, follower counts, profile pictures etc:

{
  "biography": "Google unfiltered—sometimes with filters.",
  "external_url": "https://linkin.bio/google",
  "external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=ATOaH1Vrx_TkkMUhpCCh1_PM-C1k5t35gAtJ0eBjTPE84RItj-cCFdqRoRHwlbiCSrB5G_v6MgjePl1SQN4vTw&s=1",
  "edge_followed_by": {
    "count": 13015078
  },
  "fbid": "17841401778116675",
  "edge_follow": {
    "count": 33
  },
  "full_name": "Google",
  "highlight_reel_count": 5,
  "id": "1067259270",
  "is_business_account": true,
  "is_professional_account": true,
  "is_supervision_enabled": false,
  "is_guardian_of_viewer": false,
  "is_supervised_by_viewer": false,
  "is_embeds_disabled": false,
  "is_joined_recently": false,
  "guardian_id": null,
  "is_verified": true,
  "profile_pic_url": "https://instagram.furt1-1.fna.fbcdn.net/v/t51.2885-19/126151620_3420222801423283_6498777152086077438_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.furt1-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=bmDCZ2Q8wTkAX-Ilbqq&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9pRKzLtnysPjhclN6TprCd9FBWo2ABbn9cRICPhbQZcA&oe=62882D44&_nc_sid=7bff83",
  "username": "google",
  ...
}

This is great: our Instagram data scraper can scrape profiles, and the result even includes the details of the first 12 posts, including photos and videos!

Parsing Instagram User Data

The user dataset we scraped can be a bit daunting as it contains a lot of data. To reduce it to the most important bits, we can use JMESPath.

import jmespath
from typing import Dict

def parse_user(data: Dict) -> Dict:
    """Parse instagram user's hidden web dataset for user's data"""
    print(f"parsing user data {data['username']}")
    result = jmespath.search(
        """{
        name: full_name,
        username: username,
        id: id,
        category: category_name,
        business_category: business_category_name,
        phone: business_phone_number,
        email: business_email,
        bio: biography,
        bio_links: bio_links[].url,
        homepage: external_url,        
        followers: edge_followed_by.count,
        follows: edge_follow.count,
        facebook_id: fbid,
        is_private: is_private,
        is_verified: is_verified,
        profile_image: profile_pic_url_hd,
        video_count: edge_felix_video_timeline.count,
        videos: edge_felix_video_timeline.edges[].node.{
            id: id, 
            title: title,
            shortcode: shortcode,
            thumb: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            duration: video_duration
        },
        image_count: edge_owner_to_timeline_media.count,
        images: edge_owner_to_timeline_media.edges[].node.{
            id: id, 
            title: title,
            shortcode: shortcode,
            src: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            accessibility_caption: accessibility_caption,
            duration: video_duration
        },
        saved_count: edge_saved_media.count,
        collections_count: edge_saved_media.count,
        related_profiles: edge_related_profiles.edges[].node.username
    }""",
        data,
    )
    return result

This function takes the full dataset and reduces it to a flatter structure containing only the important fields. We're using JMESPath's reshaping feature, which allows us to distill the dataset into a new structure.

Scraping Instagram Post Data

To scrape Instagram post data, we'll use the same method as before, but this time with the post endpoint.

To generate post views dynamically Instagram uses a GraphQL backend query which returns post data, comments, likes, and other information. We can use this endpoint to scrape post data.

All Instagram GraphQL endpoints are accessed through:

https://www.instagram.com/graphql/query/?query_hash=<>&variables=<>

where the query hash and variables define the query functionality. For example, to scrape post data we'll be using the following query hash and variables:

{
  "query_hash": "b3055c01b4b222b8a47dc12b090e4e64",  # the post query hash, which doesn't change
  "variables": {
    "shortcode": "CQYQ1Y1nZ1Y",  # post shortcode (from URL)
    # how many and what comments to include
    "child_comment_count": 20,  
    "fetch_comment_count": 100,
    "parent_comment_count": 24,
    "has_threaded_comments": true
  }
}
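Note that the variables have to be JSON-serialized and URL-encoded before being appended to the query string. A minimal sketch of how the final request URL is assembled:

```python
import json
from urllib.parse import quote

variables = {
    "shortcode": "CQYQ1Y1nZ1Y",  # illustrative shortcode from the example above
    "child_comment_count": 20,
    "fetch_comment_count": 100,
    "parent_comment_count": 24,
    "has_threaded_comments": True,
}
base = "https://www.instagram.com/graphql/query/?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables="
# json.dumps serializes the dict and quote percent-encodes it for the URL
url = base + quote(json.dumps(variables))
print(url)
```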

Let's make this happen within our Instagram data scraper code:

Python
ScrapFly
import json
from typing import Dict
from urllib.parse import quote

import httpx

INSTAGRAM_APP_ID = "936619743392459"  # this is the public app id for instagram.com


def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")

    variables = {
        "shortcode": shortcode,
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    }
    url = "https://www.instagram.com/graphql/query/?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables="
    result = httpx.get(
        url=url + quote(json.dumps(variables)),
        headers={"x-ig-app-id": INSTAGRAM_APP_ID},
    )
    data = json.loads(result.content)
    return data["data"]["shortcode_media"]

# Example usage:
posts = scrape_post("https://www.instagram.com/p/CuE2WNQs6vH/")
print(json.dumps(posts, indent=2, ensure_ascii=False))
import json
from typing import Dict
from urllib.parse import quote

from scrapfly import ScrapeConfig, ScrapflyClient

SCRAPFLY = ScrapflyClient("YOUR SCRAPFLY KEY")
BASE_CONFIG = {"asp": True, "country": "CA"}  # bypass Instagram's anti-scraping protection

async def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")

    variables = {
        "shortcode": shortcode,
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    }
    url = "https://www.instagram.com/graphql/query/?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables="
    result = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url=url + quote(json.dumps(variables)),
            headers={"x-ig-app-id": INSTAGRAM_APP_ID},
            **BASE_CONFIG,
        )
    )
    data = json.loads(result.content)
    return parse_post(data["data"]["shortcode_media"])  # parse_post is defined in the next section

This web scraping approach will return the entire post dataset, which includes many useful fields like post captions, comments, likes and other information. However, it also includes many flags and internal fields that aren't very useful.

To reduce the scraped dataset, let's take a look at JSON parsing using JMESPath next.

Parsing Instagram Post Data

Instagram post data is even more complex than the user profile data, so we'll be using JMESPath to reduce it once again.

import jmespath
from typing import Dict

def parse_post(data: Dict) -> Dict:
    """Reduce post dataset to the most important fields"""
    print(f"parsing post data {data['shortcode']}")
    result = jmespath.search("""{
        id: id,
        shortcode: shortcode,
        dimensions: dimensions,
        src: display_url,
        src_attached: edge_sidecar_to_children.edges[].node.display_url,
        has_audio: has_audio,
        video_url: video_url,
        views: video_view_count,
        plays: video_play_count,
        likes: edge_media_preview_like.count,
        location: location.name,
        taken_at: taken_at_timestamp,
        related: edge_web_media_to_related_media.edges[].node.shortcode,
        type: product_type,
        video_duration: video_duration,
        music: clips_music_attribution_info,
        is_video: is_video,
        tagged_users: edge_media_to_tagged_user.edges[].node.user.username,
        captions: edge_media_to_caption.edges[].node.text,
        related_profiles: edge_related_profiles.edges[].node.username,
        comments_count: edge_media_to_parent_comment.count,
        comments_disabled: comments_disabled,
        comments_next_page: edge_media_to_parent_comment.page_info.end_cursor,
        comments: edge_media_to_parent_comment.edges[].node.{
            id: id,
            text: text,
            created_at: created_at,
            owner: owner.username,
            owner_verified: owner.is_verified,
            viewer_has_liked: viewer_has_liked,
            likes: edge_liked_by.count
        }
    }""", data)
    return result

Here, just like before, we used JMESPath to extract the most useful data fields from the massive JSON response our Instagram scraper received. Note that different post types (reels, images, videos, etc.) have different fields available.

How to Scrape Instagram For All User Posts

To retrieve the user's posts and post comments, we'll be using yet another GraphQL endpoint that requires three variables: the user's ID (which we got from scraping the user's profile earlier), a page size and a page offset cursor:

{
  "id": "NUMERIC USER ID",
  "first": 12,
  "after": "CURSOR ID FOR PAGING"
}

For example, if we would like to retrieve all of the Instagram posts created by Google, we first have to retrieve this user's ID and then compile our GraphQL request.

screenshot of google instagram page
Google's Instagram page - we can access all of this post data in JSON format

In Google's example, the GraphQL URL would be:

https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={"id":"1067259270","first":12}

We can try this URL in our browser, and we should see a JSON response with the data of the first 12 posts, which includes details like:

  • Post photos and videos
  • The first page of post's comments
  • Post metadata such as view and comment counts

However, to retrieve all posts we need to implement pagination logic as all of the information is scattered through multiple pages.

Python
ScrapFly
import json
import httpx
from urllib.parse import quote


def scrape_user_posts(user_id: str, session: httpx.Client, page_size=12, page_limit: int = None):
    """Scrape all posts of an instagram user of given numeric user id"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        resp = session.get(base_url + quote(json.dumps(variables)))
        data = resp.json()
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])  # note: we're using the parse_post function from the previous chapter
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        variables["after"] = page_info["end_cursor"]
        _page_number += 1
        if page_limit and _page_number > page_limit:
            break

# Example run:
if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        posts = list(scrape_user_posts("1067259270", session, page_limit=3))
        print(json.dumps(posts, indent=2, ensure_ascii=False))
import json
from typing import Optional
from urllib.parse import quote

from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient("YOUR SCRAPFLY KEY")
BASE_CONFIG = {"asp": True, "country": "CA"}  # bypass Instagram's anti-scraping protection

async def scrape_user_posts(user_id: str, page_size=50, max_pages: Optional[int] = None):
    """Scrape all posts of an instagram user of given numeric user id"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        url = base_url + quote(json.dumps(variables))
        result = await SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
        data = json.loads(result.content)
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping posts page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        variables["after"] = page_info["end_cursor"]
        _page_number += 1
        if max_pages and _page_number > max_pages:
            break
Example Output
[
  {
  "__typename": "GraphImage",
  "id": "2890253001563912589",
  "dimensions": {
    "height": 1080,
    "width": 1080
  },
  "display_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-C93CjLzMapgPHOinoltBXypU_wi7s6zzLj1th-s9p-Q&oe=62E80627&_nc_sid=86f79a",
  "display_resources": [
    {
      "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
      "config_width": 640,
      "config_height": 640
    },
    "..."
  ],
  "is_video": false,
  "tracking_token": "eyJ2ZXJzaW9uIjo1LCJwYXlsb2FkIjp7ImlzX2FuYWx5dGljc190cmFja2VkIjp0cnVlLCJ1dWlkIjoiOWJiNzUyMjljMjU2NDExMTliOGI4NzM5MTE2Mjk4MTYyODkwMjUzMDAxNTYzOTEyNTg5In0sInNpZ25hdHVyZSI6IiJ9",
  "edge_media_to_tagged_user": {
    "edges": [
      {
        "node": {
          "user": {
            "full_name": "Jahmar Gale | Data Analyst",
            "id": "51661809026",
            "is_verified": false,
            "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/284007837_5070066053047326_6283083692098566083_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=KXI8oOdZRb4AX8w28nr&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-4iYsawdTCHI5a2zD_PF9F-WCyKnTIPuvYwVAQo82l_w&oe=62E7609B&_nc_sid=86f79a",
            "username": "datajayintech"
          },
          "x": 0.68611115,
          "y": 0.32222223
        }
      },
      "..."
    ]
  },
  "accessibility_caption": "A screenshot of a tweet from @DataJayInTech, which says: \"A recruiter just called me and said The Google Data Analytics Certificate is a good look. This post is to encourage YOU to finish the course.\" The background of the image is red with white, yellow, and blue geometric shapes.",
  "edge_media_to_caption": {
    "edges": [
      {
        "node": {
          "text": "Ring, ring — opportunity is calling📱\nStart your Google Career Certificate journey at the link in bio. #GrowWithGoogle"
        }
      },
      "..."
    ]
  },
  "shortcode": "CgcPcqtOTmN",
  "edge_media_to_comment": {
    "count": 139,
    "page_info": {
      "has_next_page": true,
      "end_cursor": "QVFCaU1FNGZiNktBOWFiTERJdU80dDVwMlNjTE5DWTkwZ0E5NENLU2xLZnFLemw3eTJtcU54ZkVVS2dzYTBKVEppeVpZbkd4dWhQdktubW1QVzJrZXNHbg=="
    },
    "edges": [
      {
        "node": {
          "id": "18209382946080093",
          "text": "@google your company is garbage for meddling with supposedly fair elections...you have been exposed",
          "created_at": 1658867672,
          "did_report_as_spam": false,
          "owner": {
            "id": "39246725285",
            "is_verified": false,
            "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/115823005_750712482350308_4191423925707982372_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=104&_nc_ohc=4iOCWDHJLFAAX-JFPh7&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9sH7npBTmHN01BndUhYVreHOk63OqZ5ISJlzNou3QD8A&oe=62E87360&_nc_sid=86f79a",
            "username": "bud_mcgrowin"
          },
          "viewer_has_liked": false
        }
      },
      "..."
    ]
  },
  "edge_media_to_sponsor_user": {
    "edges": []
  },
  "comments_disabled": false,
  "taken_at_timestamp": 1658765028,
  "edge_media_preview_like": {
    "count": 9251,
    "edges": []
  },
  "gating_info": null,
  "fact_check_overall_rating": null,
  "fact_check_information": null,
  "media_preview": "ACoqbj8KkijDnBOfpU1tAkis8mcL2H0zU8EMEqh1Dc56H0/KublclpoejKoo3WtylMgQ4HeohW0LKJ+u7PueaX+z4v8Aa/OmoNJJ6kqtG3UxT0pta9xZRxxswzkDjJrIoatuawkpq6NXTvuN9f6VdDFeAMAdsf8A16oWDKFYMQMnuR6e9Xd8f94fmtax2OGqnzsk3n/I/wDsqN7f5H/2VR74/wC8PzWlEkY7g/iv+NVcys+wy5JML59P89zWDW3dSx+UwGMnjjH9KxKynud1BWi79wpQM+g+tJRUHQO2+4pCuO4pKKAFFHP+RSUUgP/Z",
  "owner": {
    "id": "1067259270",
    "username": "google"
  },
  "location": null,
  "viewer_has_liked": false,
  "viewer_has_saved": false,
  "viewer_has_saved_to_collection": false,
  "viewer_in_photo_of_you": false,
  "viewer_can_reshare": true,
  "thumbnail_src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
  "thumbnail_resources": [
    {
      "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9nmASHsbmNWUQnwOdkGE4PvE8b27MqK-gbj5z0YLu8qg&oe=62E80627&_nc_sid=86f79a",
      "config_width": 150,
      "config_height": 150
    },
    "..."
  ]
},
...
]

Building a Profile - Hashtag Mentions

Now that we can scrape all user posts, we can try out a popular analytics exercise: scrape all posts and extract hashtag mentions.

For this, let's scrape all posts, extract mentioned hashtags from the post description and count everything up:

import re
from collections import Counter

import httpx


def scrape_hashtag_mentions(user_id, session: httpx.Client, page_limit: int = None):
    """find all hashtags user mentioned in their posts"""
    hashtags = Counter()
    hashtag_pattern = re.compile(r"#(\w+)")
    for post in scrape_user_posts(user_id, session=session, page_limit=page_limit):
        desc = '\n'.join(post['captions'])
        found = hashtag_pattern.findall(desc)
        for tag in found:
            hashtags[tag] += 1
    return hashtags
Run Code & Example Output
import json
import httpx

if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        # if we only know the username but not user id we can scrape
        # the user profile to find the id:
        user_id = scrape_user("google")["id"]  # will result in: 1067259270
        # then we can scrape the hashtag profile
        hashtags = scrape_hashtag_mentions(user_id, session, page_limit=5)
        # order results and print them as JSON:
        print(json.dumps(dict(hashtags.most_common()), indent=2, ensure_ascii=False))
{
    "MadeByGoogle": 10,
    "TeamPixel": 5,
    "GrowWithGoogle": 4,
    "Pixel7": 3,
    "LifeAtGoogle": 3,
    "SaferWithGoogle": 3,
    "Pixel6a": 3,
    "DoodleForGoogle": 2,
    "MySuperG": 2,
    "ShotOnPixel": 1,
    "DayInTheLife": 1,
    "DITL": 1,
    "GoogleAustin": 1,
    "Austin": 1,
    "NestWifi": 1,
    "NestDoorbell": 1,
    "GoogleATAPAmbientExperiments": 1,
    "GoogleATAPxKOCHE": 1,
    "SoliATAP": 1,
    "GooglePixelWatch": 1,
    "Chromecast": 1,
    "DooglersAroundTheWorld": 1,
    "GoogleSearch": 1,
    "GoogleSingapore": 1,
    "InternationalDogDay": 1,
    "Doogler": 1,
    "BlackBusinessMonth": 1,
    "PixelBuds": 1,
    "HowTo": 1,
    "Privacy": 1,
    "Settings": 1,
    "GoogleDoodle": 1,
    "NationalInternDay": 1,
    "GoogleInterns": 1,
    "Sushi": 1,
    "StopMotion": 1,
    "LetsInternetBetter": 1
}

With this simple analytics script, we've collected profile hashtags that we can use to determine the interests of any public Instagram account.

Avoid Instagram Scraping Blocking with Scrapfly

Scraping Instagram might seem easy but, unfortunately, Instagram has been restricting public access to its data, often allowing only a few requests per hour and requiring a login for anything more.

instagram login request screenshot
Instagram redirects to login page if scraping is detected

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around Instagram's blocking.

For this, we'll be using scrapfly-sdk python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Instagram data scraper, all we need to do is replace httpx requests with scrapfly-sdk requests. For more, see our up-to-date and maintained implementation of the Instagram scraper here:

Latest Instagram.com Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping instagram.com:

Is it legal to scrape Instagram?

Yes. Instagram's data is publicly available, so scraping instagram.com at slow, respectful rates falls under the ethical scraping definition. However, when working with personal data, we need to be aware of local copyright and user data laws like the GDPR in the EU. For more, see our Is Web Scraping Legal? article.

How to get Instagram user ID from username?

To get the private user ID from the public username, we can scrape the user profile using our scrape_user function; the private ID will be located in the id field:

with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
    user_id = scrape_user('google')['id']
    print(user_id)

How to get Instagram username from user ID?

To get the public username from Instagram's private user ID we can take advantage of public iPhone API https://i.instagram.com/api/v1/users/<USER_ID>/info/:

import httpx
iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90 (iPhone9,4; iOS 10_3_3; en_US; en-US; scale=2.61; gamut=wide; 1080x1920)"
resp = httpx.get(iphone_api.format("1067259270"), headers={"User-Agent": iphone_user_agent})
print(resp.json()['user']['username'])

Magic parameter __a=1 is no longer working?

Instagram has been rolling out new changes and slowly retiring this feature. However, in this article we've covered two alternatives to the ?__a=1 feature: the /v1/ API endpoints and the GraphQL endpoints, which perform even better!

Web Scraping Instagram - Summary

In this Instagram scraping tutorial, we've taken a look at how to easily scrape Instagram using Python and hidden API endpoints.

We've scraped the user profile page, which contains user details, posts and meta information, as well as each individual post's data. To reduce the scraped datasets we used the JMESPath JSON parsing library.

Finally, to start scaling the instagram data scraper we took a look at how to take advantage of ScrapFly's smart scraper blocking bypass systems. For more on ScrapFly see our documentation and try it out for free!

Related Posts

How to Scrape Bing Search with Python

In this scrape guide we'll be taking a look at scraping Bing search results. It's the second biggest search engine in the world and it contains a lot of data - all retrievable with a bit of Python.

How to Scrape G2 Company Data and Reviews

In this scrape guide we're taking a look at G2.com - one of the biggest digital product meta-websites out there. We'll be scraping product data, reviews and company profiles.

How to Scrape Etsy.com Product, Shop and Search Data

In this scrape guide we're taking a look at Etsy.com - a popular e-commerce market for handcrafted and vintage items. We'll be using Python and HTML parsing to scrape search and product data.