How to Scrape Instagram


In this Python web scraping tutorial we'll explore Instagram - one of the biggest social media websites out there. We'll take a look at how to scrape Instagram's search and explore endpoints to find user profile data and post information.

We'll also cover some tips and tricks for reaching these endpoints efficiently, avoiding web scraper blocking and accessing all of this information without having to log in to Instagram. So, let's dive in!

Setup

In this web scraping Instagram tutorial, we'll be using Python with the HTTP client library httpx, which will power all of our interactions with Instagram's server. We can install it via the pip command:

$ pip install httpx

That's all we need for this tutorial. We'll mostly be working with JSON objects which we can parse in native Python without any extra packages.

Finding Posts and Users

By Hashtag

illustration of Instagram's hashtag search

To find users, we can use several of Instagram's exploration pages. The most common approach is the /explore/tags endpoint, which finds posts by hashtag. Instead of scraping the HTML endpoint, we can use Instagram's GraphQL service:

import json
from typing import Optional
from urllib.parse import quote

import httpx


def scrape_hashtag(hashtag: str, session: httpx.Client, page_size=12, page_limit: Optional[int] = None):
    """scrape posts tagged with a given hashtag"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=174a5243287c5f3a7de741089750ab3b&variables="
    variables = {
        "tag_name": hashtag,
        "first": page_size,
        "after": None,
    }
    page = 1
    while True:
        result = session.get(base_url + quote(json.dumps(variables)))
        posts = result.json()["data"]["hashtag"]["edge_hashtag_to_media"]
        for post in posts["edges"]:
            yield post["node"]
        page_info = posts["page_info"]
        if not page_info["has_next_page"]:
            break
        variables["after"] = page_info["end_cursor"]
        page += 1
        if page_limit and page > page_limit:
            break
Run Code & Example Output
# Example usage:
if __name__ == "__main__":
    with httpx.Client(
        timeout=httpx.Timeout(20.0),
    ) as session:
        for post in scrape_hashtag("cats", session):
            print(post)
[
  {
    "comments_disabled": false,
    "__typename": "GraphImage",
    "id": "2891447792099336443",
    "edge_media_to_caption": {
      "edges": [
        {
          "node": {
            "text": "🥰\n.\sofinstagram #cats #beautyfullcat #beautifulcatsoftheworld #mycat #prettycat #cats #catsofinstagram #beautifulcatsofinstagram  #catoftheday #catstagram #catlife #catlovers #bestmeow #katzen #ilovemycats #ilovemycat #katzenliebe #katzenleben #katzenaufinstagram #katzenfotografie  #instacat # katze #katzenwelt #catlove #catfluencer#rescuecat #adoptedcat #adoptedcatsofinstagram #adoptedcatsarethebest"
          }
        }
      ]
    },
    "shortcode": "CggfHKGqyD7",
    "edge_media_to_comment": { "count": 0 },
    "taken_at_timestamp": 1658907458,
    "dimensions": { "height": 1350, "width": 1080 },
    "display_url": "https://scontent-vie1-1.cdninstagram.com/v/t51.2885-15/295609100_475025094450455_8311596005796267513_n.webp?stp=dst-jpg_e35_p1080x1080&_nc_ht=scontent-vie1-1.cdninstagram.com&_nc_cat=111&_nc_ohc=Y-hZeZUhkzYAX_mIOop&edm=AA0rjkIBAAAA&ccb=7-5&oh=00_AT-EeW536WMUxlQ3iG6S-LzW2HoLtmSI0Ss_VIxzZJ4Y-A&oe=62E87315&_nc_sid=d997c6",
    "edge_liked_by": {"count": 0},
    "edge_media_preview_like": {"count": 0},
    "owner": {"id": "51742215330"},
    "thumbnail_src": "https://scontent-vie1-1.cdninstagram.com/v/t51.2885-15/295609100_475025094450455_8311596005796267513_n.webp?stp=c0.180.1440.1440a_dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-vie1-1.cdninstagram.com&_nc_cat=111&_nc_ohc=Y-hZeZUhkzYAX_mIOop&edm=AA0rjkIBAAAA&ccb=7-5&oh=00_AT8rtjj_08vk70Qk4AOEgatMsuAVOOJuk8-FFyKHH0uEKQ&oe=62E87315&_nc_sid=d997c6",
    "thumbnail_resources": [
        {
        "src": "https://scontent-vie1-1.cdninstagram.com/v/t51.2885-15/295609100_475025094450455_8311596005796267513_n.webp?stp=c0.180.1440.1440a_dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-vie1-1.cdninstagram.com&_nc_cat=111&_nc_ohc=Y-hZeZUhkzYAX_mIOop&edm=AA0rjkIBAAAA&ccb=7-5&oh=00_AT8rtjj_08vk70Qk4AOEgatMsuAVOOJuk8-FFyKHH0uEKQ&oe=62E87315&_nc_sid=d997c6",
        "config_width": 640,
        "config_height": 640
      },
      "..."
    ],
    "is_video": false,
    "accessibility_caption": null
  },
]

Above, we are using the GraphQL endpoint, which takes a few variables: the tag name, page size and pagination cursor. Using these few parameters we can paginate through Instagram's hashtag-marked posts and find users (see the owner.id field) or just collect the posts themselves!
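The pagination pattern boils down to serializing the variables object to JSON and URL-encoding it into the query string. Here's a minimal standalone sketch of that step (the query_hash is the one used in this tutorial; build_hashtag_url is a hypothetical helper, not an Instagram API):

```python
import json
from urllib.parse import quote

# hypothetical helper illustrating how the GraphQL URL is assembled:
# the variables dict becomes URL-encoded JSON appended to the base URL
def build_hashtag_url(hashtag: str, page_size: int = 12, after: str = None) -> str:
    base = "https://www.instagram.com/graphql/query/?query_hash=174a5243287c5f3a7de741089750ab3b&variables="
    variables = {"tag_name": hashtag, "first": page_size, "after": after}
    return base + quote(json.dumps(variables))

print(build_hashtag_url("cats"))
```

To request the next page, pass the end_cursor from the previous response as the after argument.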

By Location

illustration of Instagram's location search

Alternatively, we can also find posts by location using the /explore/locations REST endpoint. For example, we could find all posts tagged with the London location by scraping explore/locations/213385402/london-united-kingdom/?__a=1

Though, for this, we need to know the location's numeric ID. For London, we can see it's 213385402, but how do we find it for any other location?

For this, we need another endpoint - /web/search/topsearch/ - which searches for top results from a given query. To find the ID of London we'd use the URL web/search/topsearch/?query=london, which returns the top user, hashtag and location results matching the query:

"places": [
    {
      "place": {
        "location": {
          "pk": "213385402",
          "short_name": "London",
          "facebook_places_id": 106078429431815,
          "external_source": "facebook_places",
          "name": "London, United Kingdom",
          "address": "",
          "city": "",
          "has_viewer_saved": false,
          "lng": -0.1094,
          "lat": 51.5141
        },
        "title": "London, United Kingdom",
        "subtitle": "",
        "media_bundles": [],
        "slug": "london-united-kingdom"
      },
      "position": 51
    }
  ],

We can see the location ID is under pk or facebook_places_id fields (which are interchangeable in this scenario).
Let's put this together in Python:

import httpx


def find_location_id(query: str, session: httpx.Client):
    """finds most likely location ID from given location name"""
    resp = session.get(f"https://www.instagram.com/web/search/topsearch/?query={query}")
    data = resp.json()
    try:
        first_result = sorted(data["places"], key=lambda place: place["position"])[0]
        return first_result["place"]["location"]["pk"]
    except IndexError:
        print(f'no locations matching query "{query}" were found')
        return


def scrape_users_by_location(location_id: str, session: httpx.Client, page_limit=None):
    url = f"https://www.instagram.com/explore/locations/{location_id}/?__a=1"
    page = 1
    next_id = ""
    while True:
        resp = session.get(url + (f"&max_id={next_id}" if next_id else ""))
        data = resp.json()["native_location_data"]
        print(f"scraped location {location_id} page {page}")
        for section in data["recent"]["sections"]:
            for media in section["layout_content"]["medias"]:
                yield media["media"]["user"]["username"]
        next_id = data["recent"]["next_max_id"]
        if not next_id:
            print(f"no more results after page {page}")
            break
        if page_limit and page_limit < page:
            print(f"reached page limit {page}")
            break
        page += 1
Run Code & Example Output
if __name__ == "__main__":
    with httpx.Client(
        timeout=httpx.Timeout(20.0)
    ) as session:
        location_name = "London"
        location_id = find_location_id(location_name, session=session)
        print(f'resolved location id from {location_name} to {location_id}')
        for username in scrape_users_by_location(location_id, session=session):
            print(username)
[
  "username1",
  "username2",
  "username3",
  "..."
]

In the example above, we created two functions that defined the logic we've described earlier: one to retrieve location ID from location string and another to retrieve all usernames of recent posts tagged with this location.

Note: there's a lot more information in the recent post data than just the usernames; we kept it brief for example purposes, but post images, captions and even comment information can be found there.
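As an illustration, here's a minimal sketch of pulling a few extra fields out of a recent-post media entry. The user.username path is the one used above; the caption and like_count field names are assumptions for illustration and may differ in the live API response:

```python
# a minimal parsing sketch; "caption" and "like_count" field names
# are assumed for illustration and may differ in the live response
def parse_location_media(media: dict) -> dict:
    return {
        "username": media["user"]["username"],
        "caption": (media.get("caption") or {}).get("text"),
        "likes": media.get("like_count"),
    }

# illustrative sample, not real scraped data:
sample = {
    "user": {"username": "example_user"},
    "caption": {"text": "sunny day in #london"},
    "like_count": 42,
}
print(parse_location_media(sample))
```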

Scraping User Data

Google's Instagram page

To retrieve an Instagram user's profile page data, we can use an internal API endpoint:

import json

import httpx


def scrape_user(username: str, session: httpx.Client):
    """scrape user's profile data"""
    resp = session.get(
        f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
        headers={"x-ig-app-id": "936619743392459"},
    )
    data = resp.json()
    return data["data"]["user"]
Run Code & Example Output
if __name__ == "__main__":
    with httpx.Client(
        timeout=httpx.Timeout(20.0),
    ) as session:
        user = scrape_user("google", session)

This approach will return Instagram user data such as bio description, follower counts, profile pictures etc:

{
  "biography": "Google unfiltered—sometimes with filters.",
  "external_url": "https://linkin.bio/google",
  "external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=ATOaH1Vrx_TkkMUhpCCh1_PM-C1k5t35gAtJ0eBjTPE84RItj-cCFdqRoRHwlbiCSrB5G_v6MgjePl1SQN4vTw&s=1",
  "edge_followed_by": {
    "count": 13015078
  },
  "fbid": "17841401778116675",
  "edge_follow": {
    "count": 33
  },
  "full_name": "Google",
  "highlight_reel_count": 5,
  "id": "1067259270",
  "is_business_account": true,
  "is_professional_account": true,
  "is_supervision_enabled": false,
  "is_guardian_of_viewer": false,
  "is_supervised_by_viewer": false,
  "is_embeds_disabled": false,
  "is_joined_recently": false,
  "guardian_id": null,
  "is_verified": true,
  "profile_pic_url": "https://instagram.furt1-1.fna.fbcdn.net/v/t51.2885-19/126151620_3420222801423283_6498777152086077438_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.furt1-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=bmDCZ2Q8wTkAX-Ilbqq&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9pRKzLtnysPjhclN6TprCd9FBWo2ABbn9cRICPhbQZcA&oe=62882D44&_nc_sid=7bff83",
  "username": "google",
  ...
}

This is a great, easy method to scrape Instagram profiles - it even includes the details of the first 12 posts including photos and videos!
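Since the profile response bundles those first posts under the edge_owner_to_timeline_media edge (the same edge the posts endpoint uses later in this tutorial), we can sketch extracting their shortcodes from the scraped user dict:

```python
# sketch: pick post shortcodes out of a scraped profile dict,
# assuming the "edge_owner_to_timeline_media" edge structure
def parse_profile_posts(user: dict) -> list:
    edges = user.get("edge_owner_to_timeline_media", {}).get("edges", [])
    return [edge["node"]["shortcode"] for edge in edges]

# illustrative sample shaped like the API response:
sample_user = {
    "username": "google",
    "edge_owner_to_timeline_media": {
        "edges": [
            {"node": {"shortcode": "CgcPcqtOTmN"}},
            {"node": {"shortcode": "CggfHKGqyD7"}},
        ]
    },
}
print(parse_profile_posts(sample_user))
```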

That being said, to retrieve the rest of the post details and post comments we need to take a look at another endpoint that allows access to the whole post history.

Scraping User Posts

To retrieve the user's posts and post comments, we'll be using yet another GraphQL endpoint that requires three variables: the user's ID (which we got from scraping the user's profile previously), the page size and the page offset cursor:

{
  "id": "NUMERIC USER ID",
  "first": 12,
  "after": "CURSOR ID FOR PAGING"
}

For example, if we would like to retrieve Instagram posts created by Google, we first have to retrieve this user's ID and then compile our GraphQL request.

Google's Instagram page - we can access all of this post data in JSON format

In Google's example, the GraphQL URL would be:

https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={"id":"1067259270","first":12}

We can try this URL in our browser, and we should see JSON returned with the data of the first 12 posts, which includes details like:

  • Post photos and videos
  • The first page of post's comments
  • Post metadata such as view and comment counts

However, to retrieve all posts we need to implement pagination logic as all of the information is scattered through multiple pages.

import json
from typing import Optional
from urllib.parse import quote

import httpx


def scrape_user_posts(user_id: str, session: httpx.Client, page_size=12, page_limit: Optional[int] = None):
    """scrape all posts of an Instagram user given the numeric user ID"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    page = 1
    while True:
        resp = session.get(base_url + quote(json.dumps(variables)))
        posts = resp.json()["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield post["node"]
        page_info = posts["page_info"]
        if not page_info["has_next_page"]:
            break
        variables["after"] = page_info["end_cursor"]
        page += 1
        if page_limit and page > page_limit:
            break
Run Code & Example Output
import json
import httpx

if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        posts = list(scrape_user_posts("1067259270", session, page_limit=3))
        print(json.dumps(posts, indent=2, ensure_ascii=False))
[
  {
  "__typename": "GraphImage",
  "id": "2890253001563912589",
  "dimensions": {
    "height": 1080,
    "width": 1080
  },
  "display_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-C93CjLzMapgPHOinoltBXypU_wi7s6zzLj1th-s9p-Q&oe=62E80627&_nc_sid=86f79a",
  "display_resources": [
    {
      "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
      "config_width": 640,
      "config_height": 640
    },
    "..."
  ],
  "is_video": false,
  "tracking_token": "eyJ2ZXJzaW9uIjo1LCJwYXlsb2FkIjp7ImlzX2FuYWx5dGljc190cmFja2VkIjp0cnVlLCJ1dWlkIjoiOWJiNzUyMjljMjU2NDExMTliOGI4NzM5MTE2Mjk4MTYyODkwMjUzMDAxNTYzOTEyNTg5In0sInNpZ25hdHVyZSI6IiJ9",
  "edge_media_to_tagged_user": {
    "edges": [
      {
        "node": {
          "user": {
            "full_name": "Jahmar Gale | Data Analyst",
            "id": "51661809026",
            "is_verified": false,
            "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/284007837_5070066053047326_6283083692098566083_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=KXI8oOdZRb4AX8w28nr&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-4iYsawdTCHI5a2zD_PF9F-WCyKnTIPuvYwVAQo82l_w&oe=62E7609B&_nc_sid=86f79a",
            "username": "datajayintech"
          },
          "x": 0.68611115,
          "y": 0.32222223
        }
      },
      "..."
    ]
  },
  "accessibility_caption": "A screenshot of a tweet from @DataJayInTech, which says: \"A recruiter just called me and said The Google Data Analytics Certificate is a good look. This post is to encourage YOU to finish the course.\" The background of the image is red with white, yellow, and blue geometric shapes.",
  "edge_media_to_caption": {
    "edges": [
      {
        "node": {
          "text": "Ring, ring — opportunity is calling📱\nStart your Google Career Certificate journey at the link in bio. #GrowWithGoogle"
        }
      },
      "..."
    ]
  },
  "shortcode": "CgcPcqtOTmN",
  "edge_media_to_comment": {
    "count": 139,
    "page_info": {
      "has_next_page": true,
      "end_cursor": "QVFCaU1FNGZiNktBOWFiTERJdU80dDVwMlNjTE5DWTkwZ0E5NENLU2xLZnFLemw3eTJtcU54ZkVVS2dzYTBKVEppeVpZbkd4dWhQdktubW1QVzJrZXNHbg=="
    },
    "edges": [
      {
        "node": {
          "id": "18209382946080093",
          "text": "@google your company is garbage for meddling with supposedly fair elections...you have been exposed",
          "created_at": 1658867672,
          "did_report_as_spam": false,
          "owner": {
            "id": "39246725285",
            "is_verified": false,
            "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/115823005_750712482350308_4191423925707982372_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=104&_nc_ohc=4iOCWDHJLFAAX-JFPh7&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9sH7npBTmHN01BndUhYVreHOk63OqZ5ISJlzNou3QD8A&oe=62E87360&_nc_sid=86f79a",
            "username": "bud_mcgrowin"
          },
          "viewer_has_liked": false
        }
      },
      "..."
    ]
  },
  "edge_media_to_sponsor_user": {
    "edges": []
  },
  "comments_disabled": false,
  "taken_at_timestamp": 1658765028,
  "edge_media_preview_like": {
    "count": 9251,
    "edges": []
  },
  "gating_info": null,
  "fact_check_overall_rating": null,
  "fact_check_information": null,
  "media_preview": "ACoqbj8KkijDnBOfpU1tAkis8mcL2H0zU8EMEqh1Dc56H0/KublclpoejKoo3WtylMgQ4HeohW0LKJ+u7PueaX+z4v8Aa/OmoNJJ6kqtG3UxT0pta9xZRxxswzkDjJrIoatuawkpq6NXTvuN9f6VdDFeAMAdsf8A16oWDKFYMQMnuR6e9Xd8f94fmtax2OGqnzsk3n/I/wDsqN7f5H/2VR74/wC8PzWlEkY7g/iv+NVcys+wy5JML59P89zWDW3dSx+UwGMnjjH9KxKynud1BWi79wpQM+g+tJRUHQO2+4pCuO4pKKAFFHP+RSUUgP/Z",
  "owner": {
    "id": "1067259270",
    "username": "google"
  },
  "location": null,
  "viewer_has_liked": false,
  "viewer_has_saved": false,
  "viewer_has_saved_to_collection": false,
  "viewer_in_photo_of_you": false,
  "viewer_can_reshare": true,
  "thumbnail_src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
  "thumbnail_resources": [
    {
      "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9nmASHsbmNWUQnwOdkGE4PvE8b27MqK-gbj5z0YLu8qg&oe=62E80627&_nc_sid=86f79a",
      "config_width": 150,
      "config_height": 150
    },
    "..."
  ]
},
...
]

Building a Profile - Hashtag Mentions

Now that we can scrape all user posts, we can do a common analytics exercise: scrape all posts and extract hashtag mentions.

For this, let's scrape all posts, extract mentioned hashtags from the post description and count everything up:

import re
from collections import Counter

import httpx


def scrape_hashtag_mentions(user_id, session: httpx.Client, page_limit: int = None):
    """find all hashtags the user mentioned in their posts"""
    hashtags = Counter()
    hashtag_pattern = re.compile(r"#(\w+)")
    for post in scrape_user_posts(user_id, session=session, page_limit=page_limit):
        caption_edges = post['edge_media_to_caption']['edges']
        if not caption_edges:  # some posts have no caption
            continue
        found = hashtag_pattern.findall(caption_edges[0]['node']['text'])
        for tag in found:
            hashtags[tag] += 1
    return hashtags
Run Code & Example Output
import json
import httpx

if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        # if we only know the username but not user id we can scrape
        # the user profile to find the id:
        user_id = scrape_user("google", session)["id"]  # will result in: 1067259270
        # then we can scrape the hashtag profile
        hashtags = scrape_hashtag_mentions(user_id, session, page_limit=5)
        # order results and print them as JSON:
        print(json.dumps(dict(hashtags.most_common()), indent=2, ensure_ascii=False))
{
    "MadeByGoogle": 10,
    "TeamPixel": 5,
    "GrowWithGoogle": 4,
    "Pixel7": 3,
    "LifeAtGoogle": 3,
    "SaferWithGoogle": 3,
    "Pixel6a": 3,
    "DoodleForGoogle": 2,
    "MySuperG": 2,
    "ShotOnPixel": 1,
    "DayInTheLife": 1,
    "DITL": 1,
    "GoogleAustin": 1,
    "Austin": 1,
    "NestWifi": 1,
    "NestDoorbell": 1,
    "GoogleATAPAmbientExperiments": 1,
    "GoogleATAPxKOCHE": 1,
    "SoliATAP": 1,
    "GooglePixelWatch": 1,
    "Chromecast": 1,
    "DooglersAroundTheWorld": 1,
    "GoogleSearch": 1,
    "GoogleSingapore": 1,
    "InternationalDogDay": 1,
    "Doogler": 1,
    "BlackBusinessMonth": 1,
    "PixelBuds": 1,
    "HowTo": 1,
    "Privacy": 1,
    "Settings": 1,
    "GoogleDoodle": 1,
    "NationalInternDay": 1,
    "GoogleInterns": 1,
    "Sushi": 1,
    "StopMotion": 1,
    "LetsInternetBetter": 1
}

With this simple analytics script, we've collected profile hashtags that we can use to determine the interests of any public Instagram account.
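To make the raw counts easier to compare, we can convert them into each hashtag's share of all mentions. A small sketch building on the Counter we already have (hashtag_shares is a hypothetical helper):

```python
from collections import Counter

# hypothetical helper: convert raw hashtag counts into percentage shares
def hashtag_shares(hashtags: Counter, top: int = 5) -> list:
    total = sum(hashtags.values())
    return [(tag, round(count / total * 100, 1)) for tag, count in hashtags.most_common(top)]

# illustrative subset of the counts scraped above:
sample = Counter({"MadeByGoogle": 10, "TeamPixel": 5, "GrowWithGoogle": 4, "Pixel7": 3})
print(hashtag_shares(sample, top=2))  # [('MadeByGoogle', 45.5), ('TeamPixel', 22.7)]
```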


With this last piece of code, we're able to find users through location or hashtag lookups and scrape their profile data as well as all of their posts. To scale this scraper up, let's take a look at how to avoid blocking with ScrapFly next.

Blocking / Login Requirement

Scraping Instagram seems easy, though unfortunately Instagram has started restricting public access to its data. It often allows only a few requests per hour and requires a login for anything more.

Instagram redirects to login page if scraping is detected

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around Instagram's blocking.

For this, we'll be using scrapfly-sdk python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Instagram web scraper, all we need to do is replace httpx requests with scrapfly-sdk requests. Let's take a look at the full scraper code with ScrapFly integration.

Full Scraper Code

Here's the final Instagram scraper code we covered in this tutorial. Our scraper covers how to extract data from Instagram profiles and posts, as well as how to find Instagram users and posts:

Full Scraper Code with ScrapFly
import json
import re
from collections import Counter
from typing import Optional
from urllib.parse import quote

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse


def find_location_id(query: str, session: ScrapflyClient):
    """finds most likely location ID from given location name"""
    result = session.scrape(
        ScrapeConfig(
            f"https://www.instagram.com/web/search/topsearch/?query={query}",
            asp=True,
            proxy_pool="public_residential_pool",
            country="US",
        )
    )
    data = json.loads(result.content)
    try:
        first_result = sorted(data["places"], key=lambda place: place["position"])[0]
        return first_result["place"]["location"]["pk"]
    except IndexError:
        print(f'no locations matching query "{query}" were found')
        return


def scrape_users_by_location(location_id: str, session: ScrapflyClient, page_limit: Optional[int] = None):
    url = f"https://www.instagram.com/explore/locations/{location_id}/?__a=1"
    page = 1
    next_id = ""
    while True:
        resp = session.scrape(
            ScrapeConfig(url + (f"&max_id={next_id}" if next_id else ""), asp=True)
        ).upstream_result_into_response()
        data = resp.json()["native_location_data"]
        print(f"scraped location {location_id} page {page}")
        for section in data["recent"]["sections"]:
            for media in section["layout_content"]["medias"]:
                yield media["media"]["user"]["username"]
        next_id = data["recent"]["next_max_id"]
        if not next_id:
            print(f"no more results after page {page}")
            break
        if page_limit and page_limit < page:
            print(f"reached page limit {page}")
            break
        page += 1


def scrape_user(username: str, session: ScrapflyClient):
    """scrape user's data"""
    result = session.scrape(
        ScrapeConfig(
            url=f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
            headers={"x-ig-app-id": "936619743392459"},
            asp=True,
        )
    )
    data = json.loads(result.content)
    return data["data"]["user"]


def scrape_user_posts(user_id: str, session: ScrapflyClient, page_size=12, page_limit: Optional[int] = None):
    """scrape user's post data"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    page = 1
    while True:
        result = session.scrape(ScrapeConfig(base_url + quote(json.dumps(variables)), asp=True))
        posts = json.loads(result.content)["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield post["node"]
        page_info = posts["page_info"]
        if not page_info["has_next_page"]:
            break
        variables["after"] = page_info["end_cursor"]
        page += 1
        if page_limit and page > page_limit:
            break


def scrape_hashtag_mentions(user_id, session: ScrapflyClient, page_limit:Optional[int]=None):
    """find all hashtags user mentioned in their posts"""
    hashtags = Counter()
    hashtag_pattern = re.compile(r"#(\w+)")
    for post in scrape_user_posts(user_id, session=session, page_limit=page_limit):
        caption_edges = post['edge_media_to_caption']['edges']
        if not caption_edges:  # some posts have no caption
            continue
        found = hashtag_pattern.findall(caption_edges[0]['node']['text'])
        for tag in found:
            hashtags[tag] += 1
    return hashtags


def scrape_hashtag(hashtag: str, session: ScrapflyClient, page_size=12, page_limit: Optional[int] = None):
    """scrape user's post data"""
    base_url = "https://www.instagram.com/graphql/query/?query_hash=174a5243287c5f3a7de741089750ab3b&variables="
    variables = {
        "tag_name": hashtag,
        "first": page_size,
        "after": None,
    }
    page = 1
    while True:
        result = session.scrape(ScrapeConfig(base_url + quote(json.dumps(variables)), asp=True))
        posts = json.loads(result.content)["data"]["hashtag"]["edge_hashtag_to_media"]
        for post in posts["edges"]:
            yield post["node"]
        page_info = posts["page_info"]
        if not page_info["has_next_page"]:
            break
        variables["after"] = page_info["end_cursor"]
        page += 1
        if page_limit and page > page_limit:
            break


if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=20) as session:
        result_location = find_location_id("London, United Kingdom", session)
        result_location_users = list(scrape_users_by_location(result_location, session, page_limit=3))
        result_hashtag_users = list(scrape_hashtag("webscraping", session, page_limit=3))
        result_user = scrape_user("google", session)
        result_user_posts = list(scrape_user_posts(result_user["id"], session, page_limit=3))
        print("done")

In the example above we're using ScrapFly's Anti Bot Protection Bypass feature to get around Instagram's login requirement. To enable this all we had to do is replace a few lines of code and every Instagram page could be accessed without logging in!

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping instagram.com:

Is it legal to scrape Instagram?

Yes. Instagram's data is publicly available, so scraping instagram.com at slow, respectful rates falls under the ethical scraping definition. However, when working with personal data we need to be aware of local copyright and user data laws, like the GDPR in the EU. For more, see our Is Web Scraping Legal? article.

How to get Instagram user ID from username?

To get the private user ID from the public username, we can scrape the user profile using our scrape_user function; the private ID will be located in the id field:

with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
    user_id = scrape_user('google')['id']
    print(user_id)

How to get Instagram username from user ID?

To get the public username from Instagram's private user ID, we can take advantage of the public iPhone API https://i.instagram.com/api/v1/users/<USER_ID>/info/:

import httpx
iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90 (iPhone9,4; iOS 10_3_3; en_US; en-US; scale=2.61; gamut=wide; 1080x1920)"
resp = httpx.get(iphone_api.format("1067259270"), headers={"User-Agent": iphone_user_agent})
print(resp.json()['user']['username'])

Magic parameter __a=1 is no longer working?

Instagram has been rolling out changes and slowly retiring this feature. However, in this article we've covered two alternatives to ?__a=1: the /v1/ API endpoints and the GraphQL endpoints, which perform even better!

Summary

In this Instagram scraping tutorial, we've taken a look at how to find Instagram posts and users using hashtag or location lookups and how to scrape users' profile and post data. For this, we used multiple public API and GraphQL endpoints that yield even more data than we can see on the page itself!

Finally, to start scaling the scraper we took a look at how to scrape Instagram without login by taking advantage of ScrapFly's smart scraper blocking bypass systems. For more on ScrapFly see our documentation and try it out for free!
