
Allegro.pl is Poland's largest e-commerce platform, offering millions of products across various categories. Whether you're conducting market research, price monitoring, or competitive analysis, scraping Allegro can provide valuable insights into the Polish e-commerce market.
In this guide, we'll explore how to scrape Allegro.pl using Python with requests and BeautifulSoup4. We'll cover two main scenarios: scraping product listings from a category page and extracting detailed information from individual product pages.
Legal Disclaimer and Precautions
This tutorial covers popular web scraping techniques for educational purposes. Interacting with public servers requires diligence and respect, and here's a good summary of what not to do:
- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens who are protected by GDPR.
- Do not repurpose entire public datasets, which can be illegal in some countries.
Prerequisites
Before we start scraping, you'll need to install the required Python packages. These libraries will handle HTTP requests and HTML parsing.
pip install requests beautifulsoup4 lxml
The requests library will handle HTTP requests to fetch web pages, while BeautifulSoup4 will parse the HTML content and extract the data we need. The lxml parser provides better performance for HTML parsing.
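To confirm everything is installed correctly, here's a quick sketch that fetches a page and parses it with the lxml parser. Note that a bare request like this may well be blocked by Allegro's bot protection; the headers and session handling introduced in Example 1 address that.
import requests
from bs4 import BeautifulSoup
# Fetch a page; without realistic browser headers Allegro may respond with 403
response = requests.get("https://allegro.pl", timeout=30)
# Parse the HTML using the faster lxml parser installed above
soup = BeautifulSoup(response.text, "lxml")
# Print the status code and page title to confirm parsing works
print(response.status_code, soup.title.get_text(strip=True) if soup.title else "no title")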
Example 1: Scraping Smartphone Category Listings
Our first example will demonstrate how to scrape product listings from Allegro's smartphone category. This will show you how to extract basic product information from category pages.
Basic Setup
We'll start by setting up our scraping environment with proper headers and session management to avoid detection.
import requests
from bs4 import BeautifulSoup
import time
import random
import re
from typing import List, Dict, Optional
def create_session() -> requests.Session:
"""Create a session with proper headers to mimic a real browser"""
session = requests.Session()
# Rotate user agents to avoid detection
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0'
]
session.headers.update({
'User-Agent': random.choice(user_agents),
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'pl-PL,pl;q=0.9,en;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Cache-Control': 'max-age=0'
})
return session
def make_request(session: requests.Session, url: str, delay_range: tuple = (1, 3)) -> Optional[requests.Response]:
"""Make a request with random delay to avoid rate limiting"""
try:
# Add random delay between requests
time.sleep(random.uniform(*delay_range))
response = session.get(url, timeout=30)
response.raise_for_status()
print(f" ✅ Successfully accessed {url}")
return response
except requests.RequestException as e:
print(f" ❌ Error making request to {url}: {e}")
return None
def make_request_with_retry(session: requests.Session, url: str, max_retries: int = 3) -> Optional[requests.Response]:
"""Make request with retry logic for failed requests"""
for attempt in range(max_retries):
try:
response = session.get(url, timeout=30)
response.raise_for_status()
print(f" ✅ Successfully accessed {url}")
return response
except requests.RequestException as e:
if attempt == max_retries - 1:
print(f" ❌ Failed after {max_retries} attempts: {e}")
return None
print(f" ⚠️ Attempt {attempt + 1} failed, retrying...")
time.sleep(random.uniform(2, 5))
return None
The session setup includes rotating user agents and proper headers to mimic real browser behavior. The random delays help avoid rate limiting and detection.
Why this matters: Allegro.pl uses sophisticated bot detection, so we need to appear as human as possible. The Polish language headers and realistic user agents are crucial for avoiding 403 Forbidden errors.
Extracting Product Listings
Now we'll create functions to extract product information from the category page. This is where the real work begins - we need to parse the HTML structure to find and extract data from each product listing.
def extract_product_listings(soup: BeautifulSoup) -> List[Dict]:
"""
Extract product listings from the category page
Args:
soup: BeautifulSoup object of the page
Returns:
List of dictionaries containing product data
"""
listings = []
# Find all product containers - Allegro uses li elements with specific classes
product_items = soup.find_all('li', class_='mb54_5r')
print(f" Found {len(product_items)} product listings")
for item in product_items:
try:
# Extract product title from h2 element
title_elem = item.find('h2', class_='mgn2_14')
if not title_elem:
title_elem = item.find('h2')
if not title_elem:
title_elem = item.find('a', class_='mgn2_14')
title = title_elem.get_text().strip() if title_elem else "N/A"
# Extract price from the price span - look for the correct structure
price_elem = item.find('span', class_='mli8_k4')
if not price_elem:
# Look for price in the specific structure from the HTML
price_container = item.find('div', class_='mli8_k4')
if price_container:
price_spans = price_container.find_all('span')
for span in price_spans:
text = span.get_text().strip()
if re.match(r'\d+\.?\d*', text):
price_elem = span
break
if not price_elem:
price_elem = item.find('span', class_='mgn2_27')
if not price_elem:
price_elem = item.find('span', string=re.compile(r'\d+\.?\d*\s*PLN'))
price = price_elem.get_text().strip() if price_elem else "N/A"
# Extract product link
link_elem = item.find('a', href=True)
link = "N/A"
if link_elem and link_elem.get('href'):
href = link_elem['href']
if href.startswith('http'):
link = href
else:
link = "https://allegro.pl" + href
# Extract seller information (Business/Private)
seller_elem = item.find('span', string=re.compile(r'Business|Private'))
if not seller_elem:
seller_elem = item.find('span', class_='mgmw_3z')
seller = seller_elem.get_text().strip() if seller_elem else "N/A"
# Extract rating
rating_elem = item.find('span', class_='m9qz_yq')
if not rating_elem:
rating_elem = item.find('span', string=re.compile(r'\d+\.\d+'))
rating = rating_elem.get_text().strip() if rating_elem else "N/A"
# Extract number of ratings
ratings_count_elem = item.find('span', class_='mpof_uk')
if not ratings_count_elem:
ratings_count_elem = item.find('span', string=re.compile(r'\(\d+\)'))
ratings_count = ratings_count_elem.get_text().strip() if ratings_count_elem else "N/A"
# Extract product specifications
specs_elem = item.find('div', class_='_1e32a_BBBTh')
specifications = {}
if specs_elem:
spec_spans = specs_elem.find_all('span')
current_key = None
for span in spec_spans:
text = span.get_text().strip()
if text.endswith(':') and len(text) > 1:
current_key = text[:-1] # Remove the colon
elif current_key and text:
specifications[current_key] = text
current_key = None
# Extract image URL
img_elem = item.find('img')
image_url = img_elem.get('src') if img_elem else "N/A"
# Extract delivery info
delivery_elem = item.find('span', string=re.compile(r'delivery'))
if not delivery_elem:
delivery_elem = item.find('span', string=re.compile(r'dostawa'))
delivery = delivery_elem.get_text().strip() if delivery_elem else "N/A"
# Extract condition
condition_elem = item.find('span', string=re.compile(r'New|Used|Exhibition'))
if not condition_elem:
condition_elem = item.find('span', class_='mgmw_wo')
condition = condition_elem.get_text().strip() if condition_elem else "N/A"
# Extract installment information
installment_elem = item.find('span', string=re.compile(r'installments'))
if not installment_elem:
installment_elem = item.find('span', string=re.compile(r'x\s*\d+'))
installment_info = installment_elem.get_text().strip() if installment_elem else "N/A"
# Extract Allegro Smart badge
smart_badge = item.find('img', alt='Allegro Smart!')
has_smart = "Yes" if smart_badge else "No"
# Extract recent purchases
purchases_elem = item.find('span', string=re.compile(r'\d+\s*people\s*have\s*recently\s*purchased'))
recent_purchases = purchases_elem.get_text().strip() if purchases_elem else "N/A"
# Create product data dictionary
product_data = {
'title': title,
'price': price,
'seller': seller,
'rating': rating,
'ratings_count': ratings_count,
'specifications': specifications,
'link': link,
'image_url': image_url,
'delivery': delivery,
'condition': condition,
'installment_info': installment_info,
'allegro_smart': has_smart,
'recent_purchases': recent_purchases
}
listings.append(product_data)
except Exception as e:
print(f" ⚠️ Error extracting product: {e}")
continue
return listings
def scrape_listings(url: str) -> Optional[List[Dict]]:
"""
Scrape product listings from a category page
Args:
url: URL of the category page to scrape
Returns:
List of product dictionaries or None if failed
"""
session = create_session()
response = make_request_with_retry(session, url)
if not response:
return None
soup = BeautifulSoup(response.content, 'html.parser')
listings = extract_product_listings(soup)
return listings
The extraction function looks for product containers using li elements with the mb54_5r class and extracts comprehensive information including title, price, seller type (Business/Private), ratings, specifications, delivery information, and product condition. It includes fallback selectors to handle potential changes in Allegro's HTML structure.
Key Features of This Extraction:
- Robust Selectors: Multiple fallback options ensure we don't miss data if Allegro changes their HTML
- Comprehensive Data: We extract everything from basic info to detailed specifications
- Error Handling: Each product extraction is wrapped in try/except to prevent one bad listing from breaking the entire scrape
- Real-time Data: We capture dynamic elements like recent purchase counts and delivery promises
Main Execution
Let's run the smartphone category scraping example. This will demonstrate how all our functions work together to extract real data from Allegro's smartphone category.
def run_listings_example():
"""Run the smartphone category scraping example"""
print("📱 Starting Allegro Smartphone Category Scraper")
# Target URL for smartphone category
url = "https://allegro.pl/kategoria/smartfony-i-telefony-komorkowe-165"
# Scrape listings
listings = scrape_listings(url)
if listings:
print(f"\n✅ Successfully scraped {len(listings)} product listings!")
# Display first few results
print("\n📋 Sample Results:")
for i, product in enumerate(listings[:5], 1):
print(f" {i}. {product['title']}")
print(f" Price: {product['price']}")
print(f" Seller: {product['seller']}")
print(f" Rating: {product['rating']} ({product['ratings_count']})")
print(f" Condition: {product['condition']}")
print(f" Delivery: {product['delivery']}")
print(f" Installments: {product['installment_info']}")
print(f" Allegro Smart: {product['allegro_smart']}")
print(f" Recent Purchases: {product['recent_purchases']}")
if product.get('specifications'):
specs_str = ", ".join([f"{k}: {v}" for k, v in product['specifications'].items()])
print(f" Specs: {specs_str}")
print()
return listings
else:
print("❌ Failed to scrape listings")
return None
# Run the example
if __name__ == "__main__":
run_listings_example()
This example demonstrates how to scrape product listings from Allegro's smartphone category, extracting comprehensive information including ratings, specifications, delivery options, and seller details.
What We Just Built:
- A robust category scraper that handles Allegro's dynamic content
- Comprehensive data extraction with multiple fallback strategies
- Real-time processing of live marketplace data
- Error-resistant code that continues even if some products fail to parse
Example Output
📱 Starting Allegro Smartphone Category Scraper
✅ Successfully accessed https://allegro.pl/kategoria/smartfony-i-telefony-komorkowe-165
Found 90 product listings
✅ Successfully scraped 90 product listings!
📋 Sample Results:
1. Smartfon Motorola Edge 50 Neo 8 GB / 256 GB 5G szary
Price: Promowane
Seller: Kolor
Rating: 4,95 ((42))
Condition: 4,95
Delivery: dostawa jutro
Installments: x 15 rat
Allegro Smart: No
Recent Purchases: N/A
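Once the listings are collected, you will usually want to persist them. Here's a minimal sketch, assuming the scrape_listings function above, that writes the results to a CSV file with Python's standard csv module; the nested specifications dictionary is serialized as JSON text so it fits in a single column.
import csv
import json
def save_listings_to_csv(listings: List[Dict], filename: str = "allegro_listings.csv") -> None:
    """Save scraped listings to a CSV file"""
    if not listings:
        return
    fieldnames = list(listings[0].keys())
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for product in listings:
            row = dict(product)
            # Serialize the nested specifications dict into one CSV cell
            row["specifications"] = json.dumps(row.get("specifications", {}), ensure_ascii=False)
            writer.writerow(row)
# Example usage:
# listings = scrape_listings("https://allegro.pl/kategoria/smartfony-i-telefony-komorkowe-165")
# save_listings_to_csv(listings)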
Example 2: Scraping Individual Product Details
Our second example will show how to extract detailed information from individual product pages, including specifications, descriptions, and seller information. This is where we get the deep, comprehensive data that makes scraping valuable.
Why Individual Product Pages Matter:
- Rich Data: Product pages contain much more detailed information than category listings
- Structured Data: Many e-commerce sites include JSON-LD or microdata for better SEO
- Complete Information: From technical specs to seller reputation, everything is available
- Real-time Availability: Stock levels, pricing, and delivery options are always current
Extracting Basic Product Information
We'll start by extracting the basic product information from the detail page. This function is more sophisticated than the category scraper because product pages have richer, more structured data.
def extract_basic_info(soup: BeautifulSoup) -> Dict:
"""
Extract basic product information from the detail page
Args:
soup: BeautifulSoup object of the page
Returns:
Dictionary containing basic product information
"""
basic_info = {}
# Extract structured data from meta tags
meta_url = soup.find('meta', attrs={'itemprop': 'url'})
meta_sku = soup.find('meta', attrs={'itemprop': 'sku'})
meta_gtin = soup.find('meta', attrs={'itemprop': 'gtin'})
meta_brand = soup.find('meta', attrs={'itemprop': 'brand'})
# Extract offer information
offer_price = soup.find('meta', attrs={'itemprop': 'price'})
offer_currency = soup.find('meta', attrs={'itemprop': 'priceCurrency'})
offer_availability = soup.find('link', attrs={'itemprop': 'availability'})
offer_condition = soup.find('meta', attrs={'itemprop': 'itemCondition'})
# Extract product title from h1 element
title_elem = soup.find('h1', class_='mp4t_0')
if not title_elem:
title_elem = soup.find('h1')
if not title_elem:
title_elem = soup.find('title')
basic_info['title'] = title_elem.get_text().strip() if title_elem else "N/A"
# Extract price from structured data or fallback to HTML
if offer_price:
price_value = offer_price.get('content', '')
currency = offer_currency.get('content', 'PLN') if offer_currency else 'PLN'
basic_info['price'] = f"{price_value} {currency}"
else:
price_elem = soup.find('span', class_='mli8_k4')
if not price_elem:
price_elem = soup.find('span', class_='mgn2_27')
if not price_elem:
price_elem = soup.find('span', string=re.compile(r'\d+\.?\d*\s*PLN'))
basic_info['price'] = price_elem.get_text().strip() if price_elem else "N/A"
# Extract structured data
basic_info['sku'] = meta_sku.get('content', 'N/A') if meta_sku else "N/A"
basic_info['gtin'] = meta_gtin.get('content', 'N/A') if meta_gtin else "N/A"
basic_info['brand'] = meta_brand.get('content', 'N/A') if meta_brand else "N/A"
basic_info['product_url'] = meta_url.get('content', 'N/A') if meta_url else "N/A"
basic_info['availability'] = offer_availability.get('href', 'N/A') if offer_availability else "N/A"
basic_info['condition'] = offer_condition.get('content', 'N/A') if offer_condition else "N/A"
# Extract rating
rating_elem = soup.find('span', class_='mgmw_wo')
if not rating_elem:
rating_elem = soup.find('span', string=re.compile(r'\d+\.\d+'))
basic_info['rating'] = rating_elem.get_text().strip() if rating_elem else "N/A"
# Extract number of ratings
ratings_count_elem = soup.find('span', string=re.compile(r'\d+\s*ratings'))
basic_info['ratings_count'] = ratings_count_elem.get_text().strip() if ratings_count_elem else "N/A"
# Extract product images
image_elements = soup.find_all('img', class_='msub_k4')
if not image_elements:
image_elements = soup.find_all('img', class_='mupj_5k')
images = []
for img in image_elements:
src = img.get('src')
if src and not src.startswith('data:') and 'allegroimg.com' in src:
images.append(src)
basic_info['images'] = images
return basic_info
The basic information extraction focuses on the main product details like title, price, seller, and images. Notice how we prioritize structured data from meta tags - this is more reliable than parsing HTML and often contains cleaner, more standardized information.
Structured Data Advantage:
- Reliability: Meta tags are less likely to change than CSS classes
- Standardization: JSON-LD and microdata follow industry standards
- Completeness: Often includes additional fields like GTIN, brand, and availability
- Performance: Faster to extract than complex HTML parsing
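Beyond the itemprop meta tags used above, many product pages also embed JSON-LD in script tags. The following is a hedged sketch for collecting it when present; whether and what Allegro embeds can vary per page, so treat it as an additional fallback rather than a guaranteed source. It reuses the imports from Example 1 plus the standard json module.
import json
def extract_json_ld(soup: BeautifulSoup) -> List[Dict]:
    """Collect any JSON-LD blocks from script type="application/ld+json" tags"""
    data = []
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            payload = json.loads(script.string or "")
        except (json.JSONDecodeError, TypeError):
            continue
        # Some pages wrap several entities in a single list
        if isinstance(payload, list):
            data.extend(payload)
        else:
            data.append(payload)
    return data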
Extracting Product Specifications
Now we'll extract detailed product specifications and features. This is where we get the technical details that make product pages so valuable for market research and competitive analysis.
def extract_specifications(soup: BeautifulSoup) -> Dict:
"""
Extract product specifications from the detail page
Args:
soup: BeautifulSoup object of the page
Returns:
Dictionary containing product specifications
"""
specifications = {}
# Find specifications table
specs_table = soup.find('table', class_='myre_zn')
if not specs_table:
specs_table = soup.find('table')
if specs_table:
# Extract specification rows
rows = specs_table.find_all('tr', class_='q1fzo')
if not rows:
rows = specs_table.find_all('tr')
for row in rows:
try:
# Extract specification name and value from table cells
cells = row.find_all('td')
if len(cells) >= 2:
name = cells[0].get_text().strip()
value = cells[1].get_text().strip()
if name and value:
specifications[name] = value
except Exception as e:
continue
return specifications
def extract_features(soup: BeautifulSoup) -> List[str]:
"""
Extract product features and technical specifications from the description
Args:
soup: BeautifulSoup object of the page
Returns:
List of product features
"""
features = []
# Find description section with technical specifications
description_section = soup.find('div', class_='_0d3bd_K6Qpj')
if not description_section:
description_section = soup.find('div', class_='_0d3bd_am0a-')
if not description_section:
description_section = soup.find('div', string=re.compile(r'Technical specifications'))
if description_section:
# Extract feature items from unordered list
feature_items = description_section.find_all('li')
for item in feature_items:
feature_text = item.get_text().strip()
if feature_text and len(feature_text) > 10: # Filter out very short items
features.append(feature_text)
return features
The specifications extraction looks for detailed product information, while the features extraction focuses on product highlights and key selling points. This dual approach ensures we capture both the technical specifications and the marketing highlights that sellers use to attract buyers.
Why Both Matter:
- Technical Specs: Essential for product comparison and market analysis
- Marketing Features: Shows how sellers position their products
- Complete Picture: Technical + marketing data gives full product understanding
- Competitive Intelligence: See what features competitors emphasize
Extracting Seller Information
We'll also extract comprehensive seller information from the product page. This is crucial for understanding the marketplace dynamics and seller reputation.
def extract_seller_info(soup: BeautifulSoup) -> Dict:
"""
Extract seller information from the product page
Args:
soup: BeautifulSoup object of the page
Returns:
Dictionary containing seller information
"""
seller_info = {}
# Extract purchase count information
purchase_elem = soup.find('span', string=re.compile(r'\d+\s*people\s*have\s*recently\s*purchased'))
seller_info['recent_purchases'] = purchase_elem.get_text().strip() if purchase_elem else "N/A"
# Extract invoice information
invoice_elem = soup.find('td', string=re.compile(r'Invoice'))
if invoice_elem:
invoice_value = invoice_elem.find_next_sibling('td')
if invoice_value:
seller_info['invoice'] = invoice_value.get_text().strip()
else:
seller_info['invoice'] = "N/A"
else:
seller_info['invoice'] = "N/A"
# Extract manufacturer code
code_elem = soup.find('td', string=re.compile(r'Manufacturer code'))
if code_elem:
code_value = code_elem.find_next_sibling('td')
if code_value:
seller_info['manufacturer_code'] = code_value.get_text().strip()
else:
seller_info['manufacturer_code'] = "N/A"
else:
seller_info['manufacturer_code'] = "N/A"
# Extract EAN/GTIN
ean_elem = soup.find('td', string=re.compile(r'EAN'))
if ean_elem:
ean_value = ean_elem.find_next_sibling('td')
if ean_value:
seller_info['ean'] = ean_value.get_text().strip()
else:
seller_info['ean'] = "N/A"
else:
seller_info['ean'] = "N/A"
# Extract delivery information
delivery_elem = soup.find('span', string=re.compile(r'dostawa\s*jutro'))
seller_info['delivery_info'] = delivery_elem.get_text().strip() if delivery_elem else "N/A"
# Extract product variants
variants = []
variant_sections = soup.find_all('div', class_='_563f1_fqgxS')
for section in variant_sections:
variant_type = section.find_previous_sibling('span', class_='_563f1_Cpfka')
if variant_type:
variant_type_text = variant_type.get_text().strip()
variant_options = section.find_all('span', class_='_563f1_4z3uJ')
if variant_options:
options = [opt.get_text().strip() for opt in variant_options]
variants.append({
'type': variant_type_text,
'options': options
})
seller_info['variants'] = variants
# Extract installment information
installment_elem = soup.find('span', string=re.compile(r'x\s*\d+\s*rat'))
seller_info['installment_info'] = installment_elem.get_text().strip() if installment_elem else "N/A"
# Extract Allegro Smart badge
smart_badge = soup.find('img', alt='Allegro Smart!')
seller_info['allegro_smart'] = "Yes" if smart_badge else "No"
# Extract best price guarantee
bpg_elem = soup.find('span', string=re.compile(r'Gwarancja najniższej ceny'))
seller_info['best_price_guarantee'] = "Yes" if bpg_elem else "No"
return seller_info
def scrape_product_details(url: str) -> Optional[Dict]:
"""
Scrape detailed product information from a product page
Args:
url: URL of the product page to scrape
Returns:
Dictionary containing detailed product information or None if failed
"""
session = create_session()
response = make_request_with_retry(session, url)
if not response:
return None
soup = BeautifulSoup(response.content, 'html.parser')
# Extract different types of information
basic_info = extract_basic_info(soup)
specifications = extract_specifications(soup)
features = extract_features(soup)
seller_info = extract_seller_info(soup)
# Combine all data
result = {
'url': url,
**basic_info,
'specifications': specifications,
'features': features,
'seller': seller_info
}
return result
The seller information extraction provides insights into the offer's popularity, trust signals, and logistics, which is crucial for e-commerce analysis. We're capturing everything from recent purchase counts to product variants and installment options.
Advanced Data Points We Extract:
- Purchase Analytics: Recent purchase counts show product popularity
- Payment Options: Installment plans reveal pricing strategies
- Product Variants: Color, size, and memory options for complete inventory
- Trust Signals: Allegro Smart badges and best price guarantees
- Logistics: Delivery promises and seller response times
Main Execution
Let's run the individual product scraping example. This will show how we can extract comprehensive data from a single product page, demonstrating the full power of our scraping capabilities.
def run_product_details_example():
"""Run the individual product details scraping example"""
print("📱 Starting Allegro Individual Product Scraper")
# Target URL for individual product
url = "https://allegro.pl/oferta/smartfon-xiaomi-14t-pro-12-gb-512-gb-5g-niebieski-17386285003"
# Scrape product details
product_data = scrape_product_details(url)
if product_data:
print(f"\n✅ Successfully scraped product details!")
# Display the results
print("\n📋 Product Information:")
print(f" Title: {product_data.get('title', 'N/A')}")
print(f" Price: {product_data.get('price', 'N/A')}")
print(f" Brand: {product_data.get('brand', 'N/A')}")
print(f" SKU: {product_data.get('sku', 'N/A')}")
print(f" GTIN: {product_data.get('gtin', 'N/A')}")
print(f" Condition: {product_data.get('condition', 'N/A')}")
print(f" Availability: {product_data.get('availability', 'N/A')}")
print(f" Rating: {product_data.get('rating', 'N/A')} ({product_data.get('ratings_count', 'N/A')})")
if product_data.get('specifications'):
print(f"\n🔧 Specifications:")
for key, value in product_data['specifications'].items():
print(f" {key}: {value}")
if product_data.get('features'):
print(f"\n✨ Technical Features:")
for feature in product_data['features'][:10]: # Show first 10 features
print(f" • {feature}")
if len(product_data['features']) > 10:
print(f" ... and {len(product_data['features']) - 10} more features")
if product_data.get('seller'):
print(f"\n👤 Product Information:")
seller = product_data['seller']
print(f" Recent Purchases: {seller.get('recent_purchases', 'N/A')}")
print(f" Delivery Info: {seller.get('delivery_info', 'N/A')}")
print(f" Installment Info: {seller.get('installment_info', 'N/A')}")
print(f" Allegro Smart: {seller.get('allegro_smart', 'N/A')}")
print(f" Best Price Guarantee: {seller.get('best_price_guarantee', 'N/A')}")
print(f" Invoice: {seller.get('invoice', 'N/A')}")
print(f" Manufacturer Code: {seller.get('manufacturer_code', 'N/A')}")
print(f" EAN/GTIN: {seller.get('ean', 'N/A')}")
if seller.get('variants'):
print(f"\n🎨 Product Variants:")
for variant in seller['variants']:
print(f" {variant['type']}: {', '.join(variant['options'])}")
return product_data
else:
print("❌ Failed to scrape product details")
return None
# Run the example
if __name__ == "__main__":
run_product_details_example()
This example demonstrates how to extract comprehensive product information from individual Allegro product pages. We've built a sophisticated scraper that can handle the complexity of modern e-commerce pages.
What Makes This Advanced:
- Structured Data Extraction: Leverages JSON-LD and microdata for reliable data
- Multi-layered Fallbacks: Multiple extraction strategies ensure we get the data
- Rich Information: From technical specs to marketing features and seller analytics
- Real-world Ready: Handles the complexity of actual Allegro product pages
Example Output
📱 Starting Allegro Individual Product Scraper
✅ Successfully accessed https://allegro.pl/oferta/smartfon-xiaomi-14t-pro-12-gb-512-gb-5g-niebieski-17386285003
✅ Successfully scraped product details!
📋 Product Information:
Title: Smartfon Xiaomi 14T Pro 12 GB / 512 GB 5G niebieski
Price: 2135.82 PLN
Brand: Xiaomi
SKU: 17386285003
GTIN: 6941812789353
Condition: http://schema.org/NewCondition
Availability: http://schema.org/InStock
Rating: 14T Pro (N/A)
🔧 Specifications:
Stan: Nowy
Faktura: Wystawiam fakturę VAT
Kod producenta: 6941812789353
Marka telefonu: Xiaomi
Model telefonu: 14T Pro
Typ: Smartfon
EAN (GTIN): 6941812789353
Kolor: niebieski
✨ Technical Features:
• Typ urządzenia: Smartfon
• Seria: 14T Pro
• Procesor: MediaTek Dimensity 9300+
• Układ graficzny: Immortalis-G720 MC12
• Pamięć RAM: 12 GB
• Pamięć wbudowana: 512 GB
• Typ ekranu: Dotykowy, AMOLED
• Częstotliwość odświeżania ekranu: 144 Hz
• Przekątna ekranu: 6,67"
• Rozdzielczość ekranu: 2712 x 1220
... and 21 more features
👤 Product Information:
Recent Purchases: 185 osób kupiło ostatnio
Delivery Info: dostawa jutro
Installment Info: x 15 rat
Allegro Smart: Yes
Best Price Guarantee: Yes
Invoice: N/A
Manufacturer Code: N/A
EAN/GTIN: 6941812789353
🎨 Product Variants:
Wbudowana pamięć: 1 TB, 512 GB
Pamięć RAM: 12 GB
Kolor: czarny, niebieski, szary
Handling Anti-Bot Protection
Allegro.pl implements various anti-bot measures to prevent automated scraping. Here are the key strategies we use to avoid detection. Understanding these techniques is crucial for any serious web scraping project.
Why Anti-Bot Protection Matters:
- Modern Websites: Most e-commerce sites use sophisticated bot detection
- IP Blocking: Getting blocked can stop your scraping project entirely
- Rate Limiting: Too many requests can trigger temporary blocks
- Legal Compliance: Respectful scraping is more likely to be tolerated
For more advanced anti-blocking techniques, check out our comprehensive guide 5 Tools to Scrape Without Blocking and How it All Works, which covers JavaScript and TLS (JA3) fingerprinting, the role of request headers in blocking, IP rotation, and other detection methods.
User Agent Rotation
def rotate_user_agents():
"""Rotate user agents to avoid detection"""
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0'
]
return random.choice(user_agents)
Rotating user agents helps mimic different browsers and reduces the chance of being detected as a bot. This is one of the most basic but effective anti-detection techniques.
How It Works:
- Browser Diversity: Different user agents represent different browsers and operating systems
- Pattern Avoidance: Bots often use the same user agent repeatedly
- Realistic Behavior: Real users don't all use the same browser version
- Detection Evasion: Bot detection systems look for suspicious user agent patterns
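To apply the rotation per request rather than once per session, simply refresh the header before each call. A small sketch reusing rotate_user_agents above and make_request from Example 1:
def fetch_with_fresh_user_agent(session: requests.Session, url: str) -> Optional[requests.Response]:
    """Swap in a new user agent before every request to avoid a static fingerprint"""
    session.headers['User-Agent'] = rotate_user_agents()
    return make_request(session, url)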
Session Management
def create_advanced_session():
"""Create a session with advanced anti-detection measures"""
session = requests.Session()
# Set rotating user agent
session.headers.update({
'User-Agent': rotate_user_agents(),
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'pl-PL,pl;q=0.9,en;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Cache-Control': 'max-age=0'
})
return session
Advanced session management includes proper headers and Polish language preferences to appear more natural. This creates a more convincing browser fingerprint.
Session Management Benefits:
- Cookie Persistence: Maintains session state across requests
- Connection Reuse: More efficient than creating new connections
- Header Consistency: Real browsers send consistent headers
- Language Localization: Polish headers make requests appear local
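The cookie persistence point is easy to see in practice. A short sketch reusing create_advanced_session above (the requests may still be challenged without further measures, but the session behavior is the point here):
session = create_advanced_session()
# The first response may set cookies; the Session object stores them automatically
home = session.get("https://allegro.pl", timeout=30)
print("Cookies after first request:", session.cookies.get_dict())
# Later requests on the same session send those cookies back, like a real browser would
category = session.get("https://allegro.pl/kategoria/smartfony-i-telefony-komorkowe-165", timeout=30)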
Advanced Scraping Techniques
For more robust scraping, consider implementing these advanced techniques. These are the next level of sophistication for production scraping systems.
Proxy Rotation
def use_proxy_rotation():
"""Example of proxy rotation for large-scale scraping"""
proxies = [
'http://proxy1:port',
'http://proxy2:port',
'http://proxy3:port'
]
proxy = random.choice(proxies)
return {'http': proxy, 'https': proxy}
Proxy rotation helps distribute requests across different IP addresses to avoid rate limiting. This is essential for large-scale scraping operations.
Proxy Rotation Benefits:
- IP Distribution: Spreads requests across multiple IP addresses
- Geographic Diversity: Can use proxies from different locations
- Rate Limit Avoidance: Each IP has its own rate limit
- Block Recovery: If one IP gets blocked, others continue working
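To actually route traffic through the rotated proxy, pass the returned mapping to requests. A minimal sketch, with the proxy addresses above being placeholders you would replace with real endpoints:
def fetch_via_proxy(session: requests.Session, url: str) -> Optional[requests.Response]:
    """Send a request through a randomly chosen proxy from the pool"""
    try:
        response = session.get(url, proxies=use_proxy_rotation(), timeout=30)
        response.raise_for_status()
        return response
    except requests.RequestException as e:
        print(f"Proxy request failed: {e}")
        return None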
Retry Logic
def make_request_with_retry(session, url, max_retries=3):
"""Make request with retry logic for failed requests"""
for attempt in range(max_retries):
try:
response = session.get(url, timeout=30)
response.raise_for_status()
return response
except requests.RequestException as e:
if attempt == max_retries - 1:
print(f"Failed after {max_retries} attempts: {e}")
return None
print(f"Attempt {attempt + 1} failed, retrying...")
time.sleep(random.uniform(2, 5))
Retry logic ensures that temporary network issues don't cause scraping failures. This is crucial for maintaining reliability in production environments.
Why Retry Logic Matters:
- Network Resilience: Handles temporary connectivity issues
- Server Recovery: Gives servers time to recover from overload
- Data Completeness: Ensures no data is lost due to transient failures
- Production Reliability: Essential for automated scraping systems
For more advanced data processing and analysis techniques, see our guide How to Observe E-Commerce Trends using Web Scraping, an example project that monitors e-commerce trends using Python, web scraping, and data visualization tools.
Scraping with Scrapfly
For production scraping of Allegro.pl, consider using Scrapfly, which provides:
- Residential proxies that bypass IP-based blocking
- Automatic bot detection bypass using advanced browser fingerprinting
- JavaScript rendering to handle dynamic content
- Rate limiting and retry logic built-in
- Geolocation targeting for region-specific data
Scrapfly handles the complex anti-bot measures that Allegro implements, allowing you to focus on data extraction rather than infrastructure management.
When to Use Scrapfly:
- Large-scale Projects: When you need to scrape thousands of pages
- Production Systems: For reliable, automated data collection
- Complex Sites: When JavaScript rendering is required
- Geographic Requirements: When you need specific country IPs
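As a rough illustration, a request through the Scrapfly Python SDK looks something like the sketch below. It assumes the scrapfly-sdk package and your own API key; check the SDK documentation for the exact options available.
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY")
result = client.scrape(ScrapeConfig(
    url="https://allegro.pl/kategoria/smartfony-i-telefony-komorkowe-165",
    asp=True,        # bypass anti-scraping protection
    country="PL",    # use Polish proxies for local results
    render_js=True,  # render JavaScript-heavy content
))
# The returned HTML can be fed into the same BeautifulSoup extraction functions
html = result.content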
Best Practices and Tips
When scraping Allegro.pl, follow these best practices to ensure reliable data collection. These guidelines will help you build robust, ethical scraping systems.
Rate Limiting
- Implement delays between requests (1-3 seconds minimum)
- Use random delays to appear more human-like
- Avoid making too many requests from the same IP
Why Rate Limiting is Critical:
- Server Protection: Prevents overwhelming the target server
- Detection Avoidance: Too many requests trigger bot detection
- Ethical Scraping: Respectful scraping is more likely to be tolerated
- Legal Compliance: Many sites have terms of service about scraping
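In practice this means putting the pause between page fetches when crawling several category pages in a row. A minimal sketch reusing the functions from Example 1:
def scrape_multiple_pages(urls: List[str]) -> List[Dict]:
    """Scrape several category pages sequentially with randomized pauses between them"""
    session = create_session()
    all_listings = []
    for url in urls:
        response = make_request_with_retry(session, url)
        if response:
            soup = BeautifulSoup(response.content, 'html.parser')
            all_listings.extend(extract_product_listings(soup))
        # Random 1-3 second pause between pages to appear more human-like
        time.sleep(random.uniform(1, 3))
    return all_listings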
Error Handling
- Always check HTTP status codes
- Implement retry logic for failed requests
- Handle network timeouts gracefully
Robust Error Handling Benefits:
- Data Completeness: Ensures no data is lost due to transient errors
- System Reliability: Prevents crashes from unexpected responses
- Debugging: Better error messages help identify issues quickly
- Production Ready: Essential for automated systems
Data Validation
- Verify extracted data is not empty or malformed
- Implement fallback selectors for changing HTML structure
- Log extraction errors for debugging
Data Quality Assurance:
- Accuracy: Validated data is more reliable for analysis
- Consistency: Ensures uniform data format across all scrapes
- Change Adaptation: Fallback selectors handle site updates
- Monitoring: Logs help track scraping success rates
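A lightweight validation pass can run right after extraction and log anything suspicious. A minimal sketch using the standard logging module:
import logging
logging.basicConfig(level=logging.WARNING)
def validate_listing(product: Dict) -> bool:
    """Return True if the listing has the minimum fields needed for analysis"""
    required = ['title', 'price', 'link']
    for field in required:
        value = product.get(field, 'N/A')
        if not value or value == 'N/A':
            logging.warning("Listing missing %s: %s", field, product.get('link', 'unknown'))
            return False
    return True
# Example usage: keep only listings that pass validation
# clean_listings = [p for p in listings if validate_listing(p)]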
For more comprehensive web scraping best practices, see our guide Everything to Know to Start Web Scraping in Python Today, a complete introduction to web scraping in Python covering HTTP, parsing, AI, scaling, and deployment.
Related E-commerce Scraping Guides
If you're interested in scraping other e-commerce platforms, check out these guides:
- How to Scrape AutoScout24: car listings, prices, specifications, and detailed vehicle information from the automotive marketplace.
- How to Scrape Ticketmaster: event data including concerts, venues, dates, and ticket information.
- How to Scrape Mouser.com: electronic component data including prices, specifications, and inventory.
FAQ
Now let's answer some frequently asked questions.
What makes Allegro.pl scraping particularly challenging?
Allegro.pl uses sophisticated anti-bot protection including IP tracking, JavaScript-rendered content, dynamic HTML structures, and Polish language requirements. The main challenges include 403 Forbidden errors, IP-based blocking, changing CSS selectors, and the need for realistic Polish user behavior patterns. The site also implements rate limiting and bot detection systems that can block automated access.
What advanced data points can I extract from Allegro listings?
Beyond basic product information, you can extract comprehensive data including seller type (Business/Private), detailed ratings with review counts, product specifications (color, screen size, memory, RAM), delivery promises, product condition, installment payment options, Allegro Smart badge status, recent purchase counts, and dynamic pricing information. The listings also include trust signals and seller reputation indicators that are valuable for market analysis.
How do you handle Allegro's dynamic content and changing HTML structure?
Our code implements multiple fallback strategies and robust error handling. We use multiple CSS selectors for each data point, prioritize structured data from meta tags, and wrap each extraction in try/except blocks. The code also includes retry logic for failed requests and validates extracted data to ensure quality. This multi-layered approach ensures we can adapt to HTML changes while maintaining data extraction reliability.
Summary
This guide demonstrates how to scrape Allegro.pl for product listings and individual product details using Python. We've built a robust scraper that handles Allegro's dynamic content, structured data extraction, and anti-bot protection. The code includes multiple fallback strategies, robust error handling, and real-time data extraction capabilities.
The simple approach using requests and BeautifulSoup provides a good balance of reliability and ease of use, while the anti-blocking techniques help avoid detection. For production use, consider implementing additional features like rate limiting, proxy rotation, and data storage.