Meta has just launched a Twitter alternative called Threads. Today, we'll take a look at how to scrape it using Python.
Threads, just like Twitter, is a microblogging platform that contains valuable public data used in sentiment analysis, market research and brand awareness.
To scrape it, we'll be using Python with the hidden web data scraping technique and a few popular community packages. So, let's dive in and see how to write a Threads scraper in Python from the ground up!
🌍 Note that Threads is currently not available in Europe. However, it can still be scraped using proxies or the Scrapfly web scraping API.
Threads contains a lot of public data that can be used in a variety of ways, from sentiment analysis for market research or brand awareness to finding new leads or keeping track of public figures.
Since all of this data is publicly available it can be legally scraped and used in any research or analysis.
Threads is a JavaScript application. In fact, threads.net doesn't even load without JavaScript enabled, so we'll be using a headless browser to scrape this complex JavaScript page. Running a real browser can also help us avoid some of the scraper blocking techniques that Threads might be using.
For this, we'll be using:
playwright to control the headless browser (or scrapfly-sdk as an alternative).
jmespath and nested_lookup to parse JSON datasets.
All of which can be installed using the pip install command:
$ pip install playwright nested_lookup jmespath "scrapfly-sdk[all]"
Playwright also needs a browser binary to drive. If one isn't installed yet, it can be downloaded with:
$ playwright install chromium
To start, let's take a look at how to scrape a Thread - that's what Threads calls a post.
Threads is using hidden web data to load post information. In other words, it hides the data in a <script> element as JSON and, when the page loads, it expands it to the visible HTML part of the page.
To reverse engineer this, we can use the browser developer tools, which allow us to inspect the whole page using the Elements explorer.
So, let's take a thread URL like threads.net/t/CuVdfsNtmvh/ and take a look at the page:
Above, we took a piece of text from the visible part of the page and searched for it in the Elements explorer of Chrome devtools. We can see that the data is hidden in a <script> element:
<script type="application/json" data-content-len="71122" data-sjs>
{"require":[["ScheduledServerJS","handle", ...
</script>
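As a rough, dependency-free illustration of how such a <script> element can be located and decoded, the sketch below uses Python's built-in html.parser module instead of the parsel library this tutorial relies on. The HiddenDataParser class and the SAMPLE_HTML page are made up for this example, and the sample only mimics the shape of the real dataset.

```python
import json
from html.parser import HTMLParser


class HiddenDataParser(HTMLParser):
    """Collect the text of <script type="application/json" data-sjs> elements."""

    def __init__(self):
        super().__init__()
        self.datasets = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_json = attrs.get("type") == "application/json"
        if tag == "script" and is_json and "data-sjs" in attrs:
            self._capturing = True
            self.datasets.append("")

    def handle_data(self, data):
        # script bodies may arrive in several chunks, so append to the last one
        if self._capturing:
            self.datasets[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False


# made-up page for illustration; real Threads pages are far larger
SAMPLE_HTML = """
<html><body>
<script type="application/json" data-sjs>{"require": [["ScheduledServerJS", "handle", {"thread_items": []}]]}</script>
<script type="text/javascript">console.log("not data")</script>
</body></html>
"""

parser = HiddenDataParser()
parser.feed(SAMPLE_HTML)
# each captured script body is a JSON document
hidden = [json.loads(text) for text in parser.datasets]
print(hidden[0]["require"][0][0])
```

Only the first script element is captured here because the second one has neither the JSON type nor the data-sjs attribute.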
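To scrape this, we'll have to:
1. Load the Threads page using a headless browser (Playwright).
2. Load the page HTML using the parsel html parser.
3. Find the correct <script> element with hidden data.
4. Load the hidden JSON dataset and parse it using nested_lookup and jmespath.
In Python and Playwright or Scrapfly-SDK this is as simple as this short snippet: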
import json
from typing import Dict

import jmespath
from parsel import Selector
from nested_lookup import nested_lookup
from playwright.sync_api import sync_playwright


def parse_thread(data: Dict) -> Dict:
    """Parse Threads post JSON dataset for the most important fields"""
    result = jmespath.search(
        """{
        text: post.caption.text,
        published_on: post.taken_at,
        id: post.id,
        pk: post.pk,
        code: post.code,
        username: post.user.username,
        user_pic: post.user.profile_pic_url,
        user_verified: post.user.is_verified,
        user_pk: post.user.pk,
        user_id: post.user.id,
        has_audio: post.has_audio,
        reply_count: view_replies_cta_string,
        like_count: post.like_count,
        images: post.carousel_media[].image_versions2.candidates[1].url,
        image_count: post.carousel_media_count,
        videos: post.video_versions[].url
    }""",
        data,
    )
    # deduplicate video URLs (a missing value becomes an empty list)
    result["videos"] = list(set(result["videos"] or []))
    # the reply count arrives as text; keep only the leading number
    if result["reply_count"]:
        result["reply_count"] = int(result["reply_count"].split(" ")[0])
    result["url"] = f"https://www.threads.net/@{result['username']}/post/{result['code']}"
    return result


def scrape_thread(url: str) -> dict:
    """Scrape Threads post and replies from a given URL"""
    with sync_playwright() as pw:
        # start Playwright browser
        browser = pw.chromium.launch()
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()

        # go to url and wait for the page to load
        page.goto(url)
        # wait for page to finish loading
        page.wait_for_selector("[data-pressable-container=true]")
        # find all hidden datasets
        selector = Selector(page.content())
        hidden_datasets = selector.css('script[type="application/json"][data-sjs]::text').getall()
        # find datasets that contain threads data
        for hidden_dataset in hidden_datasets:
            # skip loading datasets that clearly don't contain threads data
            if '"ScheduledServerJS"' not in hidden_dataset:
                continue
            if "thread_items" not in hidden_dataset:
                continue
            data = json.loads(hidden_dataset)
            # datasets are heavily nested, use nested_lookup to find
            # the thread_items key for thread data
            thread_items = nested_lookup("thread_items", data)
            if not thread_items:
                continue
            # use our jmespath parser to reduce the dataset to the most important fields
            threads = [parse_thread(t) for thread in thread_items for t in thread]
            return {
                # the first parsed thread is the main post:
                "thread": threads[0],
                # other threads are replies:
                "replies": threads[1:],
            }
        raise ValueError("could not find thread data in page")


if __name__ == "__main__":
    print(scrape_thread("https://www.threads.net/t/CuVdfsNtmvh/"))

# The same scraper using the Scrapfly SDK instead of Playwright:
import json
from typing import Dict

import jmespath
from nested_lookup import nested_lookup
from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def parse_thread(data: Dict) -> Dict:
    """Parse Threads post JSON dataset for the most important fields"""
    result = jmespath.search(
        """{
        text: post.caption.text,
        published_on: post.taken_at,
        id: post.id,
        pk: post.pk,
        code: post.code,
        username: post.user.username,
        user_pic: post.user.profile_pic_url,
        user_verified: post.user.is_verified,
        user_pk: post.user.pk,
        user_id: post.user.id,
        has_audio: post.has_audio,
        reply_count: view_replies_cta_string,
        like_count: post.like_count,
        images: post.carousel_media[].image_versions2.candidates[1].url,
        image_count: post.carousel_media_count,
        videos: post.video_versions[].url
    }""",
        data,
    )
    # deduplicate video URLs (a missing value becomes an empty list)
    result["videos"] = list(set(result["videos"] or []))
    # the reply count arrives as text; keep only the leading number
    if result["reply_count"]:
        result["reply_count"] = int(result["reply_count"].split(" ")[0])
    result["url"] = f"https://www.threads.net/@{result['username']}/post/{result['code']}"
    return result


async def scrape_thread(url: str) -> dict:
    """Scrape Threads post and replies from a given URL"""
    result = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url,
            asp=True,  # enables scraper blocking bypass if any
            country="US",  # use US IP address as threads is only available in select countries
        )
    )
    hidden_datasets = result.selector.css(
        'script[type="application/json"][data-sjs]::text'
    ).getall()
    # find datasets that contain threads data
    for hidden_dataset in hidden_datasets:
        # skip loading datasets that clearly don't contain threads data
        if '"ScheduledServerJS"' not in hidden_dataset:
            continue
        if "thread_items" not in hidden_dataset:
            continue
        data = json.loads(hidden_dataset)
        # datasets are heavily nested, use nested_lookup to find
        # the thread_items key for thread data
        thread_items = nested_lookup("thread_items", data)
        if not thread_items:
            continue
        # use our jmespath parser to reduce the dataset to the most important fields
        threads = [parse_thread(t) for thread in thread_items for t in thread]
        return {
            # the first parsed thread is the main post:
            "thread": threads[0],
            # other threads are replies:
            "replies": threads[1:],
        }
    raise ValueError("could not find thread data in page")


# Example use:
if __name__ == "__main__":
    import asyncio

    print(asyncio.run(scrape_thread("https://www.threads.net/t/CuVdfsNtmvh")))

{
"thread": {
"text": "Which photo is your favourite? 1-7? \ud83d\udc22\u2063\n\ud83d\udcf8 benjhicks\n#discoverocean #sea #turtle",
"published_on": 1688602283,
"id": "3140546036288089057_1774281006",
"pk": "3140546036288089057",
"code": "CuVdfsNtmvh",
"username": "discoverocean",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/357905677_687132593226696_1257316420350467802_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=100&_nc_ohc=by6HPomtlEIAX_TU79A&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfCfprFQLeN-e8lZ43JhekQD-OYc4V_DoJSkXk-3ltkGbg&oe=64AB8DD1&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "1774281006",
"user_id": null,
"has_audio": null,
"reply_count": 16,
"like_count": 144,
"images": [
"https://scontent.cdninstagram.com/v/t51.2885-15/357772782_912233233209916_8851335975797563318_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=104&_nc_ohc=Q73aTaIITm8AX_UzzWO&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUzMzUzNDE2OQ%3D%3D.2-ccb7-5&oh=00_AfBpx6VE73c1Gu3ylPBM8pDcn7DEDCAVWHy9K5U-W6mleA&oe=64AC2328&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358022591_941966633571788_5547079452309312243_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=108&_nc_ohc=r_1ZLy3Td78AX-qSkIs&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDU1MDIzNTM4NA%3D%3D.2-ccb7-5&oh=00_AfD4RyAnSa7ePpJmnM4xAv48amHT3UWdS25bujVLbRffYA&oe=64ABE0A6&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358011677_1054586392177359_6122190932692744913_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=103&_nc_ohc=6nv5cgppJpEAX_CmOtz&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUzMzUzMDM2Mw%3D%3D.2-ccb7-5&oh=00_AfDaTLEOiFiPBH5y_W7lBvqp3eyRp77O4cz8YmnC68vwuw&oe=64AB5295&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357874691_794366742382971_4750937386069249406_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=103&_nc_ohc=8AEPbAEdQJgAX9vSNny&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUzMzM4MjA4NQ%3D%3D.2-ccb7-5&oh=00_AfD_ypTxWKR-_m540i6U2O26refGTeGW81NJDSSFMu5Naw&oe=64AC05B6&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357754045_1930616370638926_4456682519945030375_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=110&_nc_ohc=VN-gy2aFec8AX8spJW1&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUyNTA0NzEzNg%3D%3D.2-ccb7-5&oh=00_AfCSlz4NO0VK0Ywx5Yfp3mdzt7s0M_8ENOww88tlDMyA7g&oe=64AC0932&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358194070_1647893849030205_4282736720026792649_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=109&_nc_ohc=ifvwRADgo2UAX-wQ6Qk&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDY4NDQ3NTMzOA%3D%3D.2-ccb7-5&oh=00_AfBZo9CtsRFKFHAZES5ZasLtSBs34aNet-3AjiDCPWzIWg&oe=64AAE25A&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357789529_1308879830047253_4895055335614315234_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=105&_nc_ohc=NyC9t4Eo4TMAX_YPdo7&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUzMzUxNTc5OA%3D%3D.2-ccb7-5&oh=00_AfAP8fbmp-_jZO0PrE5xkMkKwB97ByzAed2gE2KTjDBPxA&oe=64ABEE08&_nc_sid=10d13b"
],
"image_count": 7
},
"replies": [
{
"text": "Love this! We support marine life! \u270c\ufe0f\ud83d\udc20",
"published_on": 1688603130,
"id": "3140553139307363960_45245856548",
"pk": "3140553139307363960",
"code": "CuVfHDapz54",
"username": "oceanworksco",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358008903_944340296646263_8516557699473359063_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=111&_nc_ohc=YWiOT-bjT_cAX8fsA2t&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfBqrHOp3r71dukqsXXEpDCq3ddKa_NpOPdIxR5Fg9OByw&oe=64AC2E19&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "45245856548",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 1,
"images": null,
"image_count": null
},
{
"text": "Todas",
"published_on": 1688603356,
"id": "3140555036030885778_2090923955",
"pk": "3140555036030885778",
"code": "CuVfip4L1-S",
"username": "pablofalbor",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358186313_651064206955489_8262161722122762505_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=101&_nc_ohc=Dujmzu5txOgAX_EOXE2&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfDOHi0VzyaiChzr0H8jrR3kDAHF4xGPhCZmHKqtiKzbYA&oe=64ABD6F3&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "2090923955",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 1,
"images": null,
"image_count": null
},
{
"text": "Love this \ud83d\udc22",
"published_on": 1688604045,
"id": "3140560817265724205_1790824391",
"pk": "3140560817265724205",
"code": "CuVg2yEokst",
"username": "discoverwildlife",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358119544_1467765200719254_4291393964835766691_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=106&_nc_ohc=gSqO4D5dMRgAX9Xc892&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfCj0F23-hG45FreKDSsyS4I8rMYr4RE4KHjxaPOj3veiA&oe=64AB778E&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "1790824391",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 7,
"images": null,
"image_count": null
},
{
"text": "\ud83d\udc22",
"published_on": 1688604213,
"id": "3140562225906498931_1640691155",
"pk": "3140562225906498931",
"code": "CuVhLR-Kr1z",
"username": "wild.awful",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358104111_1471081573662795_5830252528830479175_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=103&_nc_ohc=hjLGwe4OjXoAX9sPGiL&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfCMuUUk0wX4Gp9pm1CVVZZBOBuWYYitmct5KtR4OVajiA&oe=64AAD131&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "1640691155",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 1,
"images": null,
"image_count": null
},
{
"text": "i love this turtle",
"published_on": 1688604579,
"id": "3140565298276203366_6566866818",
"pk": "3140565298276203366",
"code": "CuVh3_Vq_tm",
"username": "wild.torture",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358160245_236750515875120_1038302928232219193_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=100&_nc_ohc=HObBxpp7XHAAX9pBzSo&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfBYbrdIq1hIlr29hkX1-wvQEvn7w5v7Jg4V9_Im03LVQw&oe=64AAA9DE&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "6566866818",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 1,
"images": null,
"image_count": null
},
{
"text": "\ud83d\ude0a\ud83d\ude0a\ud83d\ude0a",
"published_on": 1688604767,
"id": "3140566870183739377_32528229568",
"pk": "3140566870183739377",
"code": "CuViO3SqX_x",
"username": "scaryunderwater",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358356507_287999083804586_2440358040078811348_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=102&_nc_ohc=U7GoLcqeqGIAX9hmPwq&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfDQQDUy-WbMXqgmddO5oDvaX2S-GYC_ZBMwhMEAMQCxPQ&oe=64AAE90F&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "32528229568",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 2,
"images": null,
"image_count": null
},
{
"text": "3",
"published_on": 1688607621,
"id": "3140590810350046199_45979990",
"pk": "3140590810350046199",
"code": "CuVnrPTxUv3",
"username": "floridanativ",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358174537_1728276100918892_1113501204611125734_n.jpg?_nc_ht=scontent.cdninstagram.com&_nc_cat=108&_nc_ohc=QRYXlu05eREAX8H6tom&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfAQDe3lZD2fNZx4eZ_NpCvp8ypBRg0f8zWWFJaXtoMC2g&oe=64AB17EA&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "45979990",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 0,
"images": null,
"image_count": null
},
{
"text": "Lil cutie just sunnin",
"published_on": 1688610280,
"id": "3140613121328550314_206856408",
"pk": "3140613121328550314",
"code": "CuVsv6BuBGq",
"username": "really_mcgina",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358331670_632687988812627_2548015977210590926_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=111&_nc_ohc=TQvG3JbWxyQAX8NojTV&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfBzIEDGNnKD4Mx5f_2pM_wzfqw2SyTQCVYgvgZy1DCRog&oe=64AC38C3&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "206856408",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 0,
"images": null,
"image_count": null
},
{
"text": "Ninja turtles in the making! \ud83d\ude03",
"published_on": 1688613654,
"id": "3140641426530743761_738469198",
"pk": "3140641426530743761",
"code": "CuVzLzTPbXR",
"username": "abhilash.kar",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358004368_937928847493631_8847944530990059088_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=111&_nc_ohc=O6SZma6gyecAX9Xm5HR&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfDzbfDQnvxEuioEJG9HU5wqiR6ZpgthCD0uPtQUG_YJow&oe=64AAD065&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "738469198",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 0,
"images": null,
"image_count": null
},
{
"text": "2",
"published_on": 1688614397,
"id": "3140647653629888574_24834629",
"pk": "3140647653629888574",
"code": "CuV0mavRgg-",
"username": "akhilvinayakmenon",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/357797457_6168653289926978_4292997697811583482_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=102&_nc_ohc=sBCCRk6TggMAX_Hsbsx&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfAEZQl-91SN8MdLVXW-MIXa4OlC7kLe2gCzkrFLJibk5Q&oe=64AB6D14&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "24834629",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 0,
"images": null,
"image_count": null
},
{
"text": "The last one",
"published_on": 1688614862,
"id": "3140651558912158984_3578553241",
"pk": "3140651558912158984",
"code": "CuV1fP0PG0I",
"username": "aby_naturography",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/357793444_810652800357565_5376031676551631054_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=101&_nc_ohc=pvUe25NQGAMAX8gXGUs&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfAGnCc-04mvNG9kvoioLEYRsuSmFuB4_dZBJgRmNcEXGg&oe=64AB3BC2&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "3578553241",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 0,
"images": null,
"image_count": null
},
{
"text": "I\u2019m going with 2",
"published_on": 1688622014,
"id": "3140711554815324362_8956317",
"pk": "3140711554815324362",
"code": "CuWDITWu3DK",
"username": "bradywillmott",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/358167091_815788336621139_691172760563286451_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=111&_nc_ohc=GI_55N8fd7EAX9dW8mq&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfAhlX-GmlCNBI_pJtVQfKEMkdnv6oZhcX2h7TLPuegoEQ&oe=64ABF67A&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "8956317",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 0,
"images": null,
"image_count": null
}
]
}
Let's quickly unpack what we're doing here.
We first define our parser function, parse_thread, which takes the Threads post object and uses a simple jmespath key remap to reduce the dataset to the most important fields.
To scrape the Threads post, we're using Playwright: we start a browser in headless mode, navigate to the post URL and wait for it to load. Then, we select all hidden web data elements, find the ones that contain post data and extract them.
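The published_on values in the scraped results look like Unix timestamps in seconds, which is an assumption based on the sample values rather than something Threads documents. Under that assumption, a small hypothetical helper (to_datetime is not part of the scraper above) can turn them into timezone-aware datetimes for easier analysis:

```python
from datetime import datetime, timezone


def to_datetime(published_on: int) -> datetime:
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    # assumption: the value is seconds since the epoch, not milliseconds
    return datetime.fromtimestamp(published_on, tz=timezone.utc)


# timestamp of the example post scraped above
posted = to_datetime(1688602283)
print(posted.isoformat())  # 2023-07-06T00:11:23+00:00
```

Keeping the conversion outside of parse_thread leaves the scraped dataset as close to the source as possible, so it can be re-interpreted later if the assumption turns out to be wrong.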
Next, let's take a look at how to scrape Threads user profiles.
To scrape Threads profiles we'll be using the same approach we used in scraping Threads posts - we'll scrape the hidden page data. The only difference here is that we'll be scraping a different dataset.
For example, let's use this Threads profile: threads.net/@discoverocean.
Just like with post pages, user data is located in a <script> element:
<script type="application/json" data-content-len="71122" data-sjs>
{"require":[["ScheduledServerJS","handle", ...
</script>
We can use the same Chrome developer tools approach to figure this out as well:
Except this time around, when parsing the hidden datasets, we'll be using nested_lookup to find the userData key, which contains user information. While we're scraping the user's page, we'll also grab their latest threads using the parser from the last section.
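To make the key lookup step less opaque, here is a dependency-free sketch of the idea: walk the parsed JSON recursively and collect every value stored under a given key. The find_key function is a hypothetical stand-in written for this example and is not the nested_lookup library's implementation, and the sample data only mimics the nesting of the real dataset.

```python
from typing import Any, List


def find_key(key: str, document: Any) -> List[Any]:
    """Return every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(document, dict):
        for k, v in document.items():
            if k == key:
                found.append(v)
            # keep descending, the key may also appear deeper down
            found.extend(find_key(key, v))
    elif isinstance(document, list):
        for item in document:
            found.extend(find_key(key, item))
    return found


# made-up structure for illustration only
data = {
    "require": [
        ["ScheduledServerJS", "handle", None, [
            {"__bbox": {"result": {"data": {
                "userData": {"user": {"username": "discoverocean"}},
            }}}},
        ]]
    ]
}

user_data = find_key("userData", data)
print(user_data[0]["user"]["username"])  # discoverocean
```

Because the result is always a list, an empty list signals that the key is absent, which is why the scraper checks the lookup result before using it.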
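So, our process will be:
1. Load the Threads page using a headless browser (Playwright).
2. Load the page HTML using the parsel html parser.
3. Find the correct <script> element with hidden data.
4. Load the hidden JSON dataset and parse the userData dataset using nested_lookup and jmespath.
Let's take a look at the code: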
import json
from typing import Dict

import jmespath
from parsel import Selector
from playwright.sync_api import sync_playwright
from nested_lookup import nested_lookup

# Note: we'll also be using parse_thread function we wrote earlier:
from scrapethread import parse_thread


def parse_profile(data: Dict) -> Dict:
    """Parse Threads profile JSON dataset for the most important fields"""
    result = jmespath.search(
        """{
        is_private: text_post_app_is_private,
        is_verified: is_verified,
        profile_pic: hd_profile_pic_versions[-1].url,
        username: username,
        full_name: full_name,
        bio: biography,
        bio_links: bio_links[].url,
        followers: follower_count
    }""",
        data,
    )
    result["url"] = f"https://www.threads.net/@{result['username']}"
    return result


def scrape_profile(url: str) -> dict:
    """Scrape Threads profile and their recent posts from a given URL"""
    with sync_playwright() as pw:
        # start Playwright browser
        browser = pw.chromium.launch()
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()

        page.goto(url)
        # wait for page to finish loading
        page.wait_for_selector("[data-pressable-container=true]")
        selector = Selector(page.content())
    parsed = {
        "user": {},
        "threads": [],
    }
    # find all hidden datasets
    hidden_datasets = selector.css('script[type="application/json"][data-sjs]::text').getall()
    for hidden_dataset in hidden_datasets:
        # skip loading datasets that clearly don't contain threads data
        if '"ScheduledServerJS"' not in hidden_dataset:
            continue
        if 'userData' not in hidden_dataset and 'thread_items' not in hidden_dataset:
            continue
        data = json.loads(hidden_dataset)
        user_data = nested_lookup('userData', data)
        thread_items = nested_lookup('thread_items', data)
        if user_data:
            parsed['user'] = parse_profile(user_data[0]['user'])
        if thread_items:
            threads = [
                parse_thread(t) for thread in thread_items for t in thread
            ]
            parsed['threads'].extend(threads)
    return parsed


if __name__ == "__main__":
    print(scrape_profile("https://www.threads.net/@discoverocean"))

# The same scraper using the Scrapfly SDK instead of Playwright:
import json
from typing import Dict

import jmespath
from nested_lookup import nested_lookup

# Note: we'll also be using parse_thread function we wrote earlier:
from scrapethread import parse_thread
from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def parse_profile(data: Dict) -> Dict:
    """Parse Threads profile JSON dataset for the most important fields"""
    result = jmespath.search(
        """{
        is_private: text_post_app_is_private,
        is_verified: is_verified,
        profile_pic: hd_profile_pic_versions[-1].url,
        username: username,
        full_name: full_name,
        bio: biography,
        bio_links: bio_links[].url,
        followers: follower_count
    }""",
        data,
    )
    result["url"] = f"https://www.threads.net/@{result['username']}"
    return result


async def scrape_profile(url: str) -> Dict:
    """Scrape Threads profile and their recent posts from a given URL"""
    result = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            url,
            asp=True,  # enables scraper blocking bypass if any
            country="US",  # Threads is available only in select countries
        )
    )
    parsed = {
        "user": {},
        "threads": [],
    }
    # find all hidden datasets
    hidden_datasets = result.selector.css('script[type="application/json"][data-sjs]::text').getall()
    for hidden_dataset in hidden_datasets:
        # skip loading datasets that clearly don't contain threads data
        if '"ScheduledServerJS"' not in hidden_dataset:
            continue
        if 'userData' not in hidden_dataset and 'thread_items' not in hidden_dataset:
            continue
        data = json.loads(hidden_dataset)
        user_data = nested_lookup('userData', data)
        thread_items = nested_lookup('thread_items', data)
        if user_data:
            parsed['user'] = parse_profile(user_data[0]['user'])
        if thread_items:
            threads = [
                parse_thread(t) for thread in thread_items for t in thread
            ]
            parsed['threads'].extend(threads)
    return parsed


# Example use:
if __name__ == "__main__":
    import asyncio

    result = asyncio.run(scrape_profile("https://www.threads.net/@discoverocean"))
    print(json.dumps(result, indent=2))

{
"user": {
"is_private": false,
"is_verified": false,
"profile_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/357905677_687132593226696_1257316420350467802_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=100&_nc_ohc=by6HPomtlEIAX8r1p-r&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfCJ-ImiDOq2Pv-GQ6YgYmY5vYJfZuY4LywkiMgnINmneg&oe=64AB8DD1&_nc_sid=10d13b",
"username": "discoverocean",
"full_name": "Discover Ocean",
"bio": "\ud83d\udc2c| Let's Discover the world underwater together.",
"bio_links": [
""
],
"followers": 12305
},
"threads": [
{
"text": "Bioluminescence \ud83d\udc99\n\ud83d\udccdK.Huraa, Maldives.\n\ud83d\udcf8 nihthu",
"published_on": 1688603704,
"id": "3140557958756538708_1774281006",
"pk": "3140557958756538708",
"code": "CuVgNL4NZlU",
"username": "discoverocean",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/357905677_687132593226696_1257316420350467802_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=100&_nc_ohc=by6HPomtlEIAX_0Z4PY&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfCpxvvZCFBTZtXBmPcFhZgYGRzlIvH3E2pjtLcN5YHSYw&oe=64AB8DD1&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "1774281006",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 126,
"images": [
"https://scontent.cdninstagram.com/v/t51.2885-15/357898561_1646480759110990_2282016183562530065_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=110&_nc_ohc=0dX579m7YCUAX9s_YW1&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1NDA1ODgyNjU2Ng%3D%3D.2-ccb7-5&oh=00_AfBUGmhtEz5IO9qnc6Y_XwRqsWrziJ62FHpj_N7kLqwCoQ&oe=64ABBE16&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358074380_259242933401353_5689864784363913178_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=101&_nc_ohc=BJ3maX3ARAwAX8bemxu&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1NDAwODQ5MDYzOQ%3D%3D.2-ccb7-5&oh=00_AfC2xFvQKw1koUXqFOya3C_I78ctHJKppUtNmoFKUek0Rw&oe=64AAB15A&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357739916_253470737413901_1810678062301795520_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=105&_nc_ohc=AMmWDlQwNPcAX9IMl5k&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1Mzg5OTUzMTY5OA%3D%3D.2-ccb7-5&oh=00_AfAitxWeMuAhfYeMYAaZaBvucmCW4OA43th8-WFRAFkYqA&oe=64ABB088&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357837851_233236956324391_3587300501224619440_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=107&_nc_ohc=56LCrPp8vOUAX_9A0F_&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1Mzg5OTU4MzE1Mg%3D%3D.2-ccb7-5&oh=00_AfBHEx2ub1N-pQoRq7R9IQI_jGAgwWeR5jaU0DRb4I8LeA&oe=64AAC57C&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358188384_241979255265596_1782092362721205032_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=108&_nc_ohc=Z9fQQejhoNYAX_qysEY&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1NDA1MDUwMDkwOA%3D%3D.2-ccb7-5&oh=00_AfB7yz8NFabOQT10ABvQPHidYwSuy9CysGi-CjDwYkboVA&oe=64ABFBBA&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357993013_1020262045642579_8924630979859356844_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=107&_nc_ohc=H3yB2DA_G6oAX8V6rdv&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1NDA1MDY2OTcxOQ%3D%3D.2-ccb7-5&oh=00_AfBv8A5bjqHHycc9w5zAUJtErCTimrVm7llD4TzEOQjgJw&oe=64AB31CF&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358014585_790966502579444_5094475961063879302_n.jpg?stp=dst-jpg_e35_p640x640_sh0.08&_nc_ht=scontent.cdninstagram.com&_nc_cat=108&_nc_ohc=CkUUkzu16lkAX8T3YkC&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1MzkwODAxNjYxNQ%3D%3D.2-ccb7-5&oh=00_AfCX2R6mRh3RW5Uq6U50y5iUUb2y3bqDyr9NlSL48kJ8GA&oe=64AC1C62&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357838724_275617845139635_6311299605731387430_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=103&_nc_ohc=SyLYR27zPioAX_MD5-S&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1Mzg5OTUxMTQ5Mg%3D%3D.2-ccb7-5&oh=00_AfBoNX7w-ipIuXs1XCv5x6m44bZqswROHefGby4jUR4IGQ&oe=64AAE09A&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358014588_233666376151325_6905696403552362541_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=105&_nc_ohc=EzISvIKuym8AX_HmoxG&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1Mzg5OTY1NTY0Mg%3D%3D.2-ccb7-5&oh=00_AfB6RldnZGkw7MZlCmgEGTgNljnvuj5KGf1zv_qjeUKzng&oe=64AB82F7&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358513991_670415124933494_7101597399234080270_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=102&_nc_ohc=2URcn43LK38AX_FlHXQ&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU1Nzk1NDAxNjk4ODQ2Mw%3D%3D.2-ccb7-5&oh=00_AfBo4siJyCOiLArC-5nXN3DTo_zha0aBuCJYtz8yaV2HPQ&oe=64AC0E01&_nc_sid=10d13b"
],
"image_count": 10
},
{
"text": "Which photo is your favourite? 1-7? \ud83d\udc22\u2063\n\ud83d\udcf8 benjhicks\n#discoverocean #sea #turtle",
"published_on": 1688602283,
"id": "3140546036288089057_1774281006",
"pk": "3140546036288089057",
"code": "CuVdfsNtmvh",
"username": "discoverocean",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/357905677_687132593226696_1257316420350467802_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=100&_nc_ohc=by6HPomtlEIAX_0Z4PY&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfCpxvvZCFBTZtXBmPcFhZgYGRzlIvH3E2pjtLcN5YHSYw&oe=64AB8DD1&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "1774281006",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 150,
"images": [
"https://scontent.cdninstagram.com/v/t51.2885-15/357772782_912233233209916_8851335975797563318_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=104&_nc_ohc=Q73aTaIITm8AX8QczHk&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUzMzUzNDE2OQ%3D%3D.2-ccb7-5&oh=00_AfAyJ43k3pBg4ZM07H2Umm0vyokyu2ritLRGMF0X8a7BoQ&oe=64AC2328&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358022591_941966633571788_5547079452309312243_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=108&_nc_ohc=r_1ZLy3Td78AX-XD_y0&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDU1MDIzNTM4NA%3D%3D.2-ccb7-5&oh=00_AfD-94nyz4RlnKn2ifxHgcfsveXpOZfueJQk8XfWcJoozw&oe=64ABE0A6&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358011677_1054586392177359_6122190932692744913_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=103&_nc_ohc=6nv5cgppJpEAX_8xPtH&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUzMzUzMDM2Mw%3D%3D.2-ccb7-5&oh=00_AfCzw4wnP7xZ8g7sFyxyFdNILrxZJgGnMsLkP225ZmyQFg&oe=64AB5295&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357874691_794366742382971_4750937386069249406_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=103&_nc_ohc=8AEPbAEdQJgAX_VTojW&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUzMzM4MjA4NQ%3D%3D.2-ccb7-5&oh=00_AfB0GoLTeEOlGSa-nVwnaNUolPKvLyfymA3ZOA6fOSEs9A&oe=64AC05B6&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357754045_1930616370638926_4456682519945030375_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=110&_nc_ohc=VN-gy2aFec8AX-2tG9c&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUyNTA0NzEzNg%3D%3D.2-ccb7-5&oh=00_AfC7JB-Qv9We_eaIjwSj7k7_xvE70MrYSbAppeTUcgew0Q&oe=64AC0932&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358194070_1647893849030205_4282736720026792649_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=109&_nc_ohc=ifvwRADgo2UAX_bQ8mc&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDY4NDQ3NTMzOA%3D%3D.2-ccb7-5&oh=00_AfAVfiw7tWOKV0CGMMiHtE1FsQtq1RI2CoAz5l0YCIZ-fg&oe=64AAE25A&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357789529_1308879830047253_4895055335614315234_n.jpg?stp=dst-jpg_e35_s480x480&_nc_ht=scontent.cdninstagram.com&_nc_cat=105&_nc_ohc=NyC9t4Eo4TMAX-DfQx9&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDU0NjAzMDUzMzUxNTc5OA%3D%3D.2-ccb7-5&oh=00_AfC2FqMFG5rJMi6CJUH4totseQRsPeyQrsHjxqpGFnUwBQ&oe=64ABEE08&_nc_sid=10d13b"
],
"image_count": 7
},
{
"text": "Paddle-boarding with a pair of humpbacks playing beneath the surface. Meeting humpbacks at their most curious is an unforgettable experience.\n\ud83d\udc0b \ud83d\udcf7 by jaxonark",
"published_on": 1688600963,
"id": "3140534959265324147_1774281006",
"pk": "3140534959265324147",
"code": "CuVa-f7tJxz",
"username": "discoverocean",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/357905677_687132593226696_1257316420350467802_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=100&_nc_ohc=by6HPomtlEIAX_0Z4PY&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfCpxvvZCFBTZtXBmPcFhZgYGRzlIvH3E2pjtLcN5YHSYw&oe=64AB8DD1&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "1774281006",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 62,
"images": [
"https://scontent.cdninstagram.com/v/t51.2885-15/357817778_595268329357119_4687415319333475083_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=103&_nc_ohc=nZZHnrG_jGUAX_A728n&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUzNDk1MzI5MjY2MzI5OA%3D%3D.2-ccb7-5&oh=00_AfATr1Ecik-0GIrdmWVz7wJGaj7il6Pgwy8LD3bJXR6ppA&oe=64AB6F72&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358058635_966221361375429_7417166722754460554_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=109&_nc_ohc=jB_sZ6VqGMkAX_3VMJL&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUzNDk1MzI4NDM0MTcwMw%3D%3D.2-ccb7-5&oh=00_AfBSpEG4PFmcR6CbnDv4tCxZXCjHxecHxKcJ8dK0gNS1AQ&oe=64AAD861&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358171575_2112410822424580_1851070397307909623_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=104&_nc_ohc=OGDiLURbXIoAX8y4ChV&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUzNDk1MzQ1MjE3NDA5MQ%3D%3D.2-ccb7-5&oh=00_AfB7DB-8hSC9LzJ9quwxa20N-iT0x7Kyv7ocZTlkEJjoNg&oe=64AB45F3&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357844817_830658022103747_1883555738784429125_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=111&_nc_ohc=bH8Kv_K2PRoAX-ZuYXW&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUzNDk1MzMxNzg5NjI5Ng%3D%3D.2-ccb7-5&oh=00_AfDepuWHZNN1lRDunBmrPX8EXB-Okg2y_6qvNyz6uyx_sg&oe=64AAB252&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358173765_805647647891727_7846009942883222084_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=103&_nc_ohc=fxMvoqF-ivAAX_DdODJ&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUzNDk1MzMwMTA1MjY5NQ%3D%3D.2-ccb7-5&oh=00_AfCQKM3auuqmhthxcPUM1gVFJhQSTHhU8H0EzGpe3ZT6Tg&oe=64ABF7D8&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358177882_974252977031561_2857216000798164470_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=105&_nc_ohc=KU0A0LwQOhkAX-SdAxK&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUzNDk1MzI4NDMzMzgwMw%3D%3D.2-ccb7-5&oh=00_AfCrH9a5-kjGVsxxUxVqozwJ11FF3jFdt1hwYlwFqbPwRg&oe=64AC5596&_nc_sid=10d13b"
],
"image_count": 6
},
{
"text": "1-7? Which photo did you like most?\u2063\nThanks to marysmark for these amazing photography!",
"published_on": 1688599861,
"id": "3140525722451660847_1774281006",
"pk": "3140525722451660847",
"code": "CuVY4FetVAv",
"username": "discoverocean",
"user_pic": "https://scontent.cdninstagram.com/v/t51.2885-19/357905677_687132593226696_1257316420350467802_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent.cdninstagram.com&_nc_cat=100&_nc_ohc=by6HPomtlEIAX_0Z4PY&edm=APs17CUBAAAA&ccb=7-5&oh=00_AfCpxvvZCFBTZtXBmPcFhZgYGRzlIvH3E2pjtLcN5YHSYw&oe=64AB8DD1&_nc_sid=10d13b",
"user_verified": false,
"user_pk": "1774281006",
"user_id": null,
"has_audio": null,
"reply_count": null,
"like_count": 73,
"images": [
"https://scontent.cdninstagram.com/v/t51.2885-15/357893956_819868559810166_372823879976524269_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=106&_nc_ohc=fsCDPZKxTaQAX88IEZV&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUyNTcxNjY3MTk5ODc5MA%3D%3D.2-ccb7-5&oh=00_AfBQaSx8vtLMrS6RUUu1_vxzd9wH8Bs1NWnJM_nbuEqBSw&oe=64ABD93B&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357882655_657623862914793_8081173513090815363_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=109&_nc_ohc=mawPQOyfBNcAX8F_fJd&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUyNTcxNjY2MzQ3NTE3Ng%3D%3D.2-ccb7-5&oh=00_AfAHfqNVuQfgsU43oEIBTwwY35Y7RhfFRH6IA7YwEKJPoA&oe=64AB6533&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357736514_816543509692231_6864635473252476343_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=107&_nc_ohc=vasQBbzmuGwAX8Mbgx8&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUyNTcxNjY3MjAzODE0Mw%3D%3D.2-ccb7-5&oh=00_AfCCmPTgNoOYQnn_xZs66q0EOYKLpiKmDmudodFSTFssww&oe=64AADF3A&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/357778348_194667256596800_2803618999918614763_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=108&_nc_ohc=uooM4JfGh3IAX8kYoj4&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUyNTcxNjgwNjI4NTE3Ng%3D%3D.2-ccb7-5&oh=00_AfC-1j-6IUoVB-4GQRMRhfD0jm7rnV34IP7Ss0f00J8jvw&oe=64ABD793&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358347056_286351437134614_7752000204454823325_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=110&_nc_ohc=RjBOTWid0lMAX8RnuaS&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUyNTcxNjY1NTI0NDg1OA%3D%3D.2-ccb7-5&oh=00_AfCpDhGAb6Sg0hV1tODz2_FM68HHujrqSJUuFqPVOWn4_Q&oe=64AB5B2B&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358392731_961334388530078_9051697274213155358_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=102&_nc_ohc=L9mX0Wornx4AX-xybRM&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUyNTcxNjY5NzE0NTA3Nw%3D%3D.2-ccb7-5&oh=00_AfD4FuslkLG26_f0j5vZxNkPI4gfvd4H6ud384bT6U0zHQ&oe=64ABB213&_nc_sid=10d13b",
"https://scontent.cdninstagram.com/v/t51.2885-15/358194387_513004614310704_6555154952329134811_n.jpg?stp=dst-jpg_e35_p720x720&_nc_ht=scontent.cdninstagram.com&_nc_cat=100&_nc_ohc=AMjV7D9XJJ0AX8go3Wu&edm=APs17CUBAAAA&ccb=7-5&ig_cache_key=MzE0MDUyNTcxNjgwNjIwMDYzMQ%3D%3D.2-ccb7-5&oh=00_AfBG_N_TRoQv86oS3ctLxDfKM5EBmaj27roB-mdtURF3Xw&oe=64AB3BBF&_nc_sid=10d13b"
],
"image_count": 7
}
]
}
Above, just as with scraping Threads posts, we started a Playwright browser, navigated to the page and scraped the hidden web data. This data contains all of the Threads user information and their recent posts, which we reduced to a single JSON object using jmespath.
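To make the hidden web data step more concrete, here is a minimal, standard-library-only sketch of the idea. It is not the scraper from this tutorial: the sample HTML, the `thread_items` and `post` key names, and the `find_key` and `extract_posts` helpers are all assumptions made for illustration. The recursive lookup stands in for what `nested_lookup` does, and the final dictionary reshaping stands in for jmespath.

```python
import json
import re

# Hypothetical sample page: real pages embed much larger JSON blobs,
# and the key names used here are assumptions for illustration only.
SAMPLE_HTML = """
<html><body>
<script type="application/json" data-sjs>{"require": [{"data": {"thread_items": [{"post": {"code": "CuVa-f7tJxz", "like_count": 62, "user": {"username": "discoverocean"}}}]}}]}</script>
<script type="application/json" data-sjs>{"unrelated": true}</script>
</body></html>
"""

# Capture the body of every JSON <script> element on the page.
SCRIPT_RE = re.compile(
    r'<script type="application/json"[^>]*>(.*?)</script>', re.DOTALL
)


def find_key(node, key):
    """Recursively yield every value stored under `key` in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def extract_posts(html):
    """Find hidden JSON datasets that mention posts and reduce them to flat dicts."""
    posts = []
    for raw in SCRIPT_RE.findall(html):
        if '"thread_items"' not in raw:
            continue  # skip hidden datasets that carry no post data
        data = json.loads(raw)
        for thread in find_key(data, "thread_items"):
            for item in thread:
                post = item["post"]
                posts.append({
                    "code": post["code"],
                    "username": post["user"]["username"],
                    "like_count": post["like_count"],
                })
    return posts


posts = extract_posts(SAMPLE_HTML)
print(posts)
```

In a real scraper the HTML would come from the rendered browser page rather than a string constant, and the reshaping step would likely cover many more fields than the three shown here.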
To wrap up our Threads scraper let's take a look at some frequently asked questions about web scraping Threads.
Yes. Threads data is publicly available, meaning it can be legally scraped almost everywhere around the world. However, this only applies if our scraping does not damage Threads' infrastructure, so it's important to keep the scraping speed reasonable.
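One simple way to keep the scraping speed reasonable is to enforce a minimum delay between page loads. The sketch below is only an illustration: the `Throttle` class, its parameters and the example URL are hypothetical and not part of this tutorial's scraper, and the right interval for your project is a judgment call.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Illustrative usage: the interval is tiny here only to keep the demo fast.
throttle = Throttle(min_interval=0.05)
for url in ["https://www.threads.net/@discoverocean"] * 3:
    throttle.wait()
    print("would scrape", url)  # a real scraper would load the page here
```

Calling `throttle.wait()` before every page load keeps requests evenly spaced no matter how long each individual scrape takes.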
Yes. Threads posts and profile pages are accessible through threads.net without having to log in. However, some details like search results and some metadata fields (like follower/following usernames) require logging in, which is not advisable when web scraping.
No. Currently, Threads.net doesn't offer public API access, though it's easy to scrape as described in this tutorial. That being said, it's likely that Threads will offer public API access in the future, as it has promised Federation support, which is a public data protocol.
Kinda. Threads is a very complex javascript application, so scraping it with traditional HTTP clients and parsers like Python requests and BeautifulSoup requires a lot of reverse engineering and is not recommended. However, using headless browsers as described in this blog is really easy!
Since search and user discovery are only available on Threads mobile apps and through login, we can't safely scrape them using Python. However, since Threads is integrated with Instagram, we can scrape Instagram to discover Threads content. For that, see our guide for scraping Instagram.
In this Threads web scraping tutorial we have taken a look at how to scrape Threads posts and user profiles using Python and Playwright.
We used a technique called hidden web data scraping, which is a prime fit for complex javascript applications such as Meta's Threads.
Finally, we processed the captured data using the jmespath JSON parsing library, which makes dataset reshaping a breeze. That concluded our Python Threads scraper.
As Threads is a new social network we'll continue to monitor the ways to scrape it and update this guide, so stay tuned!