How to Scrape Instagram


In this Python web scraping tutorial we'll explore Instagram - one of the biggest social media websites out there. We'll take a look at how to scrape Instagram's search and explore endpoints to find user profile data and post information.

We'll also cover some tips and tricks for reaching these endpoints efficiently, avoiding web scraper blocking, and accessing all of this information without having to log in to Instagram. So, let's dive in!

Setup

In this Instagram web scraping tutorial, we'll be using Python with the HTTP client library httpx, which will power all of our interactions with Instagram's servers. We can install it via pip:

$ pip install httpx

That's all we need for this tutorial. We'll mostly be working with JSON objects which we can parse in native Python without any extra packages.

The Magic "__a=1" Parameter

The easiest way to scrape Instagram is to take advantage of its magic __a=1 URL parameter, which turns any page into JSON output. For example, let's take Google's Instagram page https://www.instagram.com/google/ - by simply attaching ?__a=1 we can retrieve the data as JSON from https://www.instagram.com/google/?__a=1:

import httpx

print(httpx.get("https://www.instagram.com/google/?__a=1").json())
<...>
"graphql": {
"user": {
  "biography": "Google unfiltered—sometimes with filters.",
  "external_url": "https://linkin.bio/google",
  "external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=ATOaH1Vrx_TkkMUhpCCh1_PM-C1k5t35gAtJ0eBjTPE84RItj-cCFdqRoRHwlbiCSrB5G_v6MgjePl1SQN4vTw&s=1",
  "edge_followed_by": {
    "count": 13015078
  },
  "fbid": "17841401778116675",
  "edge_follow": {
    "count": 33
  },
  "full_name": "Google",
  "highlight_reel_count": 5,
  "id": "1067259270",
  "is_business_account": true,
  "is_professional_account": true,
  "is_supervision_enabled": false,
  "is_guardian_of_viewer": false,
  "is_supervised_by_viewer": false,
  "is_embeds_disabled": false,
  "is_joined_recently": false,
  "guardian_id": null,
  "is_verified": true,
  "profile_pic_url": "https://instagram.furt1-1.fna.fbcdn.net/v/t51.2885-19/126151620_3420222801423283_6498777152086077438_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.furt1-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=bmDCZ2Q8wTkAX-Ilbqq&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9pRKzLtnysPjhclN6TprCd9FBWo2ABbn9cRICPhbQZcA&oe=62882D44&_nc_sid=7bff83",
  "username": "google",
<...>

It contains all of the user information we could find on the HTML version of the page and much more!

This parameter can be applied to most Instagram endpoints, meaning we can build a web scraper really easily. Let's take a look at some of these endpoints and how to use them to put together a fully functional scraper.

Finding Posts and Users

By Hashtag


To find users, we can query many of Instagram's exploration pages with the same __a=1 parameter. For example, the most common approach is to use the /explore/tags endpoint to find posts by hashtag:

import httpx


def scrape_users_by_hashtag(hashtag: str, session: httpx.Client, page_limit=None):
    url = f"https://www.instagram.com/explore/tags/{hashtag}/?__a=1"
    page = 1
    next_id = ""
    while True:
        resp = session.get(url + (f"&max_id={next_id}" if next_id else ""))
        data = resp.json()['data']
        print(f"scraped hashtag {hashtag} page {page}")
        for section in data["recent"]["sections"]:
            for media in section["layout_content"]["medias"]:
                yield media["media"]["user"]["username"]
        next_id = data["recent"]["next_max_id"]
        if not next_id:
            print(f"no more results after page {page}")
            break
        if page_limit and page >= page_limit:
            print(f"reached page limit {page}")
            break
        page += 1

# Example usage:
if __name__ == "__main__":
    with httpx.Client(
        timeout=httpx.Timeout(20.0),
    ) as session:
        for user in scrape_users_by_hashtag("cats", session):
            print(user)

In the example above, we see that this approach even supports paging through another URL parameter called max_id. Using this parameter, we can retrieve multiple pages of data with a simple loop.

By Location


Alternatively, we can also find posts by location using the /explore/locations endpoint. For example, we could find all posts tagged with the London location by scraping explore/locations/213385402/london-united-kingdom/?__a=1.

Though for this we need to know the location's numeric ID. For London we can see it's 213385402, but how do we find it for any other location?

For this we need another endpoint - /web/search/topsearch/, which returns the top results for a given query. To find the ID of London we'd use the URL web/search/topsearch/?query=london, which returns the top user, hashtag and location results matching this query:

"places": [
    {
      "place": {
        "location": {
          "pk": "213385402",
          "short_name": "London",
          "facebook_places_id": 106078429431815,
          "external_source": "facebook_places",
          "name": "London, United Kingdom",
          "address": "",
          "city": "",
          "has_viewer_saved": false,
          "lng": -0.1094,
          "lat": 51.5141
        },
        "title": "London, United Kingdom",
        "subtitle": "",
        "media_bundles": [],
        "slug": "london-united-kingdom"
      },
      "position": 51
    }
  ],

We can see the location ID we need under the pk field, which is what the /explore/locations endpoint expects.
Let's put this together in Python:

import httpx


def find_location_id(query: str, session: httpx.Client):
    """finds most likely location ID from given location name"""
    resp = session.get(f"https://www.instagram.com/web/search/topsearch/?query={query}")
    data = resp.json()
    try:
        first_result = sorted(data["places"], key=lambda place: place["position"])[0]
        return first_result["place"]["location"]["pk"]
    except IndexError:
        print(f'no locations matching query "{query}" were found')
        return


def scrape_users_by_location(location_id: str, session: httpx.Client, page_limit=None):
    url = f"https://www.instagram.com/explore/locations/{location_id}/?__a=1"
    page = 1
    next_id = ""
    while True:
        resp = session.get(url + (f"&max_id={next_id}" if next_id else ""))
        data = resp.json()["native_location_data"]
        print(f"scraped location {location_id} page {page}")
        for section in data["recent"]["sections"]:
            for media in section["layout_content"]["medias"]:
                yield media["media"]["user"]["username"]
        next_id = data["recent"]["next_max_id"]
        if not next_id:
            print(f"no more results after page {page}")
            break
        if page_limit and page >= page_limit:
            print(f"reached page limit {page}")
            break
        page += 1

# Example usage:
if __name__ == "__main__":
    with httpx.Client(
        timeout=httpx.Timeout(20.0)
    ) as session:
        location_name = "London"
        location_id = find_location_id(location_name, session=session)
        print(f'resolved location id from {location_name} to {location_id}')
        for username in scrape_users_by_location(location_id, session=session):
            print(username)

In the example above, we created two functions implementing the logic we've described: one to resolve a location name into its numeric ID and another to retrieve the usernames of recent posts tagged with that location.

Note: there's a lot more information in the recent post data than just usernames - we kept it brief for example purposes, but post images, captions and even comment information can be found there.
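
For instance, here's a rough sketch of a helper that pulls a few of these extra fields out of each media entry. The field names used here (code, caption, like_count, comment_count, image_versions2) are assumptions based on the response layout at the time of writing and may change, so everything is accessed defensively:

def parse_media(media: dict) -> dict:
    """flatten a single media entry from the hashtag/location feed (best-effort field names)"""
    caption = (media.get("caption") or {}).get("text", "")
    images = (media.get("image_versions2") or {}).get("candidates", [])
    return {
        "shortcode": media.get("code"),
        "username": media.get("user", {}).get("username"),
        "caption": caption,
        "likes": media.get("like_count"),
        "comments": media.get("comment_count"),
        "image_url": images[0]["url"] if images else None,
    }

# e.g. inside the scraper loops above, instead of yielding just the username:
# yield parse_media(media["media"])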

Scraping User Data


To retrieve an Instagram user's profile page data we can use the same __a=1 parameter:

def scrape_user(username: str, session: httpx.Client):
    resp = session.get(f"https://www.instagram.com/{username}/?__a=1")
    data = resp.json()
    return {
        k: v
        for k, v in data["graphql"]["user"].items()
        if not k.startswith("edge_")  # skip post data
    }

# Example use:
if __name__ == "__main__":
    with httpx.Client(
        timeout=httpx.Timeout(20.0),
    ) as session:
        user = scrape_user("google", session)
        print(user)

This approach will return Instagram user data such as bio description, follower counts, profile pictures, etc.:

{
  "biography": "Google unfiltered—sometimes with filters.",
  "external_url": "https://linkin.bio/google",
  "external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=ATOaH1Vrx_TkkMUhpCCh1_PM-C1k5t35gAtJ0eBjTPE84RItj-cCFdqRoRHwlbiCSrB5G_v6MgjePl1SQN4vTw&s=1",
  "edge_followed_by": {
    "count": 13015078
  },
  "fbid": "17841401778116675",
  "edge_follow": {
    "count": 33
  },
  "full_name": "Google",
  "highlight_reel_count": 5,
  "id": "1067259270",
  "is_business_account": true,
  "is_professional_account": true,
  "is_supervision_enabled": false,
  "is_guardian_of_viewer": false,
  "is_supervised_by_viewer": false,
  "is_embeds_disabled": false,
  "is_joined_recently": false,
  "guardian_id": null,
  "is_verified": true,
  "profile_pic_url": "https://instagram.furt1-1.fna.fbcdn.net/v/t51.2885-19/126151620_3420222801423283_6498777152086077438_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.furt1-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=bmDCZ2Q8wTkAX-Ilbqq&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9pRKzLtnysPjhclN6TprCd9FBWo2ABbn9cRICPhbQZcA&oe=62882D44&_nc_sid=7bff83",
  "username": "google",
  ...
}

While this approach also includes the details of the first 12 posts, we won't be able to retrieve more than that. To scrape all of a user's posts we'll have to take advantage of another endpoint.

Scraping User Posts using GraphQL

To retrieve a user's posts, instead of the magic __a=1 parameter we'll be using the GraphQL endpoint /graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables=

The query_hash parameter indicates which GraphQL query we want executed; for example, the query for user posts has the hash ID e769aa130647d2354c40ea6a439bfc08. Some GraphQL queries accept variables, and this one in particular takes 3 variables:

{
  "id": "NUMERIC USER DI",
  "first": 12,
  "after": "CURSOR ID FOR PAGING"
}

For example, if we would like to retrieve Instagram posts created by Google, we first have to retrieve this user's ID and then compile our GraphQL request.


In Google's example, the GraphQL URL would be:

https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={"id":"1067259270","first": 12}

We can try this URL in our browser, and we should see a JSON response with the data of the most recent 12 posts.
However, to retrieve all posts we need to implement a bit of parsing logic:

import json
from urllib.parse import quote

import httpx


def scrape_user_posts(user_id: str, session: httpx.Client, page_size=12):
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    while True:
        resp = session.get(base_url + quote(json.dumps(variables)))
        posts = resp.json()["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield post["node"]
        page_info = posts["page_info"]
        if not page_info["has_next_page"]:
            break
        variables["after"] = page_info["end_cursor"]

# Example use:
if __name__ == "__main__":
    with httpx.Client(
        timeout=httpx.Timeout(20.0),
    ) as session:
        for post in scrape_user_posts("1067259270", session):
            print(post)
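
Each yielded post is a raw GraphQL node containing many fields. As a rough sketch, here's how a few commonly used ones could be extracted - the field names (shortcode, display_url, edge_media_to_caption, edge_media_preview_like, edge_media_to_comment, is_video) are assumptions based on the response shape at the time of writing and may change:

def parse_post(node: dict) -> dict:
    """flatten a GraphQL post node into a few key fields (best-effort field names)"""
    captions = node.get("edge_media_to_caption", {}).get("edges", [])
    return {
        "shortcode": node.get("shortcode"),
        "url": f"https://www.instagram.com/p/{node.get('shortcode')}/",
        "caption": captions[0]["node"]["text"] if captions else "",
        "likes": node.get("edge_media_preview_like", {}).get("count"),
        "comments": node.get("edge_media_to_comment", {}).get("count"),
        "is_video": node.get("is_video"),
        "image_url": node.get("display_url"),
    }

# e.g. print(parse_post(post)) inside the example loop above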

Blocking / Login Requirement

Scraping Instagram seems easy, though unfortunately Instagram has started restricting public access to its data - often allowing only a few requests per hour and requiring a login for anything more. When scraping is detected, requests are redirected to the login page.

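A quick way to check whether a request was blocked is to look for that redirect. Here's a minimal sketch (the exact redirect target is an assumption and may differ):

import httpx

resp = httpx.get("https://www.instagram.com/google/?__a=1", follow_redirects=False)
# a 3xx response pointing at the login page usually means the scraper was blocked
if resp.is_redirect and "/accounts/login/" in resp.headers.get("location", ""):
    print("blocked: redirected to the login page")
else:
    print("ok:", resp.status_code)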

To get around this, let's take advantage of the ScrapFly API, which can avoid all of these blocks for us!


ScrapFly offers several powerful features that'll help us get around Instagram's blocking - most importantly its anti scraping protection bypass.

For this we'll be using the scrapfly-sdk Python package. First, let's install it using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Instagram web scraper, all we need to do is replace the httpx requests with scrapfly-sdk requests:


import json
from urllib.parse import quote

from scrapfly import ScrapeConfig, ScrapflyClient


def find_location_id(query: str, session: ScrapflyClient):
    """finds most likely location ID from given location name"""
    resp = session.scrape(
        ScrapeConfig(
            f"https://www.instagram.com/web/search/topsearch/?query={query}", asp=True
        )
    ).upstream_result_into_response()
    data = resp.json()
    try:
        first_result = sorted(data["places"], key=lambda place: place["position"])[0]
        return first_result["place"]["location"]["pk"]
    except IndexError:
        print(f'no locations matching query "{query}" were found')
        return


def scrape_users_by_hashtag(hashtag: str, session: ScrapflyClient, page_limit=None):
    url = f"https://www.instagram.com/explore/tags/{hashtag}/?__a=1"
    page = 1
    next_id = ""
    while True:
        resp = session.scrape(
            ScrapeConfig(url + (f"&max_id={next_id}" if next_id else ""), asp=True)
        ).upstream_result_into_response()
        data = resp.json()["data"]
        print(f"scraped hashtag {hashtag} page {page}")
        for section in data["recent"]["sections"]:
            for media in section["layout_content"]["medias"]:
                yield media["media"]["user"]["username"]
        next_id = data["recent"]["next_max_id"]
        if not next_id:
            print(f"no more results after page {page}")
            break
        if page_limit and page >= page_limit:
            print(f"reached page limit {page}")
            break
        page += 1


def scrape_users_by_location(
    location_id: str, session: ScrapflyClient, page_limit=None
):
    url = f"https://www.instagram.com/explore/locations/{location_id}/?__a=1"
    page = 1
    next_id = ""
    while True:
        resp = session.scrape(
            ScrapeConfig(url + (f"&max_id={next_id}" if next_id else ""), asp=True)
        ).upstream_result_into_response()
        data = resp.json()["native_location_data"]
        print(f"scraped location {location_id} page {page}")
        for section in data["recent"]["sections"]:
            for media in section["layout_content"]["medias"]:
                yield media["media"]["user"]["username"]
        next_id = data["recent"]["next_max_id"]
        if not next_id:
            print(f"no more results after page {page}")
            break
        if page_limit and page >= page_limit:
            print(f"reached page limit {page}")
            break
        page += 1


def scrape_user(username: str, session: ScrapflyClient):
    resp = session.scrape(
        ScrapeConfig(f"https://www.instagram.com/{username}/?__a=1", asp=True)
    ).upstream_result_into_response()
    data = resp.json()
    return {
        k: v
        for k, v in data["graphql"]["user"].items()
        if not k.startswith("edge_")  # skip post data
    }


def scrape_user_posts(user_id: str, session: ScrapflyClient, page_size=12):
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    while True:
        resp = session.scrape(
            ScrapeConfig(base_url + quote(json.dumps(variables)), asp=True)
        ).upstream_result_into_response()
        posts = resp.json()["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield post["node"]
        page_info = posts["page_info"]
        if not page_info["has_next_page"]:
            break
        variables["after"] = page_info["end_cursor"]

# Example use:
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY") as session:
        user = scrape_user("google", session)
        for post in scrape_user_posts(user["id"], session):
            print(post)

In the example above we're using ScrapFly's anti scraping protection bypass feature to get around Instagram's login requirement. To enable it, all we had to do was replace a few lines of code - and every Instagram page can be accessed without logging in!

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping instagram.com:

How to get Instagram user ID from username?

To get the private user ID from a public username, we can take advantage of the ?__a=1 URL parameter and scrape https://www.instagram.com/<USERNAME>?__a=1, which will have the user ID in its content. Note that this endpoint might require a login, but using the ScrapFly API we can scrape it without having to log in.
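
As a minimal sketch, reusing the ?__a=1 endpoint and the google example from earlier (a plain httpx request may get blocked, in which case the ScrapFly version shown above applies):

import httpx

resp = httpx.get("https://www.instagram.com/google/?__a=1")
data = resp.json()
print(data["graphql"]["user"]["id"])  # e.g. "1067259270" for the google account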

How to get Instagram username from user ID?

To get public username from Instagram's private user ID we can take advantage of public iPhone API https://i.instagram.com/api/v1/users/<USER_ID>/info/:

import httpx
iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90 (iPhone9,4; iOS 10_3_3; en_US; en-US; scale=2.61; gamut=wide; 1080x1920)"
resp = httpx.get(iphone_api.format("1067259270"), headers={"User-Agent": iphone_user_agent})
print(resp.json()['user']['username'])

Summary

In this Instagram scraping tutorial we've taken a look at how to find Instagram posts and users using hashtag or location lookups. For this we used the magic ?__a=1 URL parameter. We also used it to retrieve detailed user information such as follower/following counts, bio description and account type flags.

We've also taken a look at an alternative approach using GraphQL endpoints, which we used to retrieve all of a user's Instagram posts, be they videos or pictures.
Finally, to start scaling our scraper, we took a look at how to scrape Instagram without logging in by taking advantage of ScrapFly's smart blocking bypass systems. For more on ScrapFly, see our documentation and try it out for free!
