How to Scrape Allegro.pl

Allegro.pl is Poland's largest e-commerce platform, offering millions of products across various categories. Whether you're conducting market research, price monitoring, or competitive analysis, scraping Allegro can provide valuable insights into the Polish e-commerce market.

In this guide, we'll explore how to scrape Allegro.pl using Python with requests and BeautifulSoup4. We'll cover two main scenarios: scraping product listings from a category page and extracting detailed information from individual product pages.

Prerequisites

Before we start scraping, you'll need to install the required Python packages. These libraries will handle HTTP requests and HTML parsing.

pip install requests beautifulsoup4 lxml

The requests library will handle HTTP requests to fetch web pages, while BeautifulSoup4 will parse the HTML content and extract the data we need. The lxml parser provides better performance for HTML parsing.
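To confirm the installation works, you can parse a tiny snippet with the lxml parser (the HTML below is just a made-up example):

from bs4 import BeautifulSoup

html = '<div class="price">1 299,00 zł</div>'
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('.price').get_text(strip=True))  # 1 299,00 zł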

Example 1: Scraping Smartphone Category Listings

Our first example will demonstrate how to scrape product listings from Allegro's smartphone category. This will show you how to extract basic product information from category pages.

Basic Setup

We'll start by setting up our scraping environment with proper headers and session management to avoid detection.

import requests
from bs4 import BeautifulSoup
import time
import random
import re
from typing import List, Dict, Optional

def create_session() -> requests.Session:
    """Create a session with proper headers to mimic a real browser"""
    session = requests.Session()
    
    # Rotate user agents to avoid detection
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0'
    ]
    
    session.headers.update({
        'User-Agent': random.choice(user_agents),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'pl-PL,pl;q=0.9,en;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Cache-Control': 'max-age=0'
    })
    
    return session

def make_request(session: requests.Session, url: str, delay_range: tuple = (1, 3)) -> Optional[requests.Response]:
    """Make a request with random delay to avoid rate limiting"""
    try:
        # Add random delay between requests
        time.sleep(random.uniform(*delay_range))
        
        response = session.get(url, timeout=30)
        response.raise_for_status()
        
        print(f"  ✅ Successfully accessed {url}")
        return response
        
    except requests.RequestException as e:
        print(f"  ❌ Error making request to {url}: {e}")
        return None

def make_request_with_retry(session: requests.Session, url: str, max_retries: int = 3) -> Optional[requests.Response]:
    """Make a request with retry logic and a random courtesy delay before each attempt"""
    for attempt in range(max_retries):
        try:
            # Random delay keeps the request rate human-like, even across retries
            time.sleep(random.uniform(1, 3))
            response = session.get(url, timeout=30)
            response.raise_for_status()
            print(f"  ✅ Successfully accessed {url}")
            return response
        except requests.RequestException as e:
            if attempt == max_retries - 1:
                print(f"  ❌ Failed after {max_retries} attempts: {e}")
                return None
            print(f"  ⚠️ Attempt {attempt + 1} failed, retrying...")
            time.sleep(random.uniform(2, 5))
    
    return None

The session setup includes rotating user agents and proper headers to mimic real browser behavior. The random delays help avoid rate limiting and detection.

Why this matters: Allegro.pl uses sophisticated bot detection, so we need to appear as human as possible. The Polish language headers and realistic user agents are crucial for avoiding 403 Forbidden errors.

Extracting Product Listings

Now we'll create functions to extract product information from the category page. This is where the real work begins - we need to parse the HTML structure to find and extract data from each product listing.

def extract_product_listings(soup: BeautifulSoup) -> List[Dict]:
    """
    Extract product listings from the category page
    
    Args:
        soup: BeautifulSoup object of the page
        
    Returns:
        List of dictionaries containing product data
    """
    listings = []
    
    # Find all product containers - Allegro uses li elements with specific classes
    product_items = soup.find_all('li', class_='mb54_5r')
    
    print(f"  Found {len(product_items)} product listings")
    
    for item in product_items:
        try:
            # Extract product title from h2 element
            title_elem = item.find('h2', class_='mgn2_14')
            if not title_elem:
                title_elem = item.find('h2')
            if not title_elem:
                title_elem = item.find('a', class_='mgn2_14')
            
            title = title_elem.get_text().strip() if title_elem else "N/A"
            
            # Extract price from the price span - look for the correct structure
            price_elem = item.find('span', class_='mli8_k4')
            if not price_elem:
                # Look for price in the specific structure from the HTML
                price_container = item.find('div', class_='mli8_k4')
                if price_container:
                    price_spans = price_container.find_all('span')
                    for span in price_spans:
                        text = span.get_text().strip()
                        # Polish prices use comma decimals, e.g. "1 299,00"
                        if re.match(r'\d+[,.]?\d*', text):
                            price_elem = span
                            break
            if not price_elem:
                price_elem = item.find('span', class_='mgn2_27')
            if not price_elem:
                # Prices are usually suffixed with "zł" (sometimes "PLN")
                price_elem = item.find('span', string=re.compile(r'\d+[,.]?\d*\s*(?:zł|PLN)'))
            
            price = price_elem.get_text().strip() if price_elem else "N/A"
            
            # Extract product link
            link_elem = item.find('a', href=True)
            link = "N/A"
            if link_elem and link_elem.get('href'):
                href = link_elem['href']
                if href.startswith('http'):
                    link = href
                else:
                    link = "https://allegro.pl" + href
            
            # Extract seller information (Business/Private; the Polish site
            # typically labels these "Firma"/"Prywatnie")
            seller_elem = item.find('span', string=re.compile(r'Business|Private|Firma|Prywatn'))
            if not seller_elem:
                seller_elem = item.find('span', class_='mgmw_3z')
            
            seller = seller_elem.get_text().strip() if seller_elem else "N/A"
            
            # Extract rating (Polish pages use comma decimals, e.g. "4,95")
            rating_elem = item.find('span', class_='m9qz_yq')
            if not rating_elem:
                rating_elem = item.find('span', string=re.compile(r'\d+[,.]\d+'))
            rating = rating_elem.get_text().strip() if rating_elem else "N/A"
            
            # Extract number of ratings
            ratings_count_elem = item.find('span', class_='mpof_uk')
            if not ratings_count_elem:
                ratings_count_elem = item.find('span', string=re.compile(r'\(\d+\)'))
            ratings_count = ratings_count_elem.get_text().strip() if ratings_count_elem else "N/A"
            
            # Extract product specifications
            specs_elem = item.find('div', class_='_1e32a_BBBTh')
            specifications = {}
            if specs_elem:
                spec_spans = specs_elem.find_all('span')
                current_key = None
                for span in spec_spans:
                    text = span.get_text().strip()
                    if text.endswith(':') and len(text) > 1:
                        current_key = text[:-1]  # Remove the colon
                    elif current_key and text:
                        specifications[current_key] = text
                        current_key = None
            
            # Extract image URL
            img_elem = item.find('img')
            image_url = img_elem.get('src') if img_elem else "N/A"
            
            # Extract delivery info
            delivery_elem = item.find('span', string=re.compile(r'delivery'))
            if not delivery_elem:
                delivery_elem = item.find('span', string=re.compile(r'dostawa'))
            delivery = delivery_elem.get_text().strip() if delivery_elem else "N/A"
            
            # Extract condition (the Polish site shows "Nowy"/"Używany")
            condition_elem = item.find('span', string=re.compile(r'New|Used|Exhibition|Nowy|Używan'))
            if not condition_elem:
                condition_elem = item.find('span', class_='mgmw_wo')
            condition = condition_elem.get_text().strip() if condition_elem else "N/A"
            
            # Extract installment information
            installment_elem = item.find('span', string=re.compile(r'installments'))
            if not installment_elem:
                installment_elem = item.find('span', string=re.compile(r'x\s*\d+'))
            installment_info = installment_elem.get_text().strip() if installment_elem else "N/A"
            
            # Extract Allegro Smart badge
            smart_badge = item.find('img', alt='Allegro Smart!')
            has_smart = "Yes" if smart_badge else "No"
            
            # Extract recent purchases (Polish pages phrase this as "… osób kupiło ostatnio")
            purchases_elem = item.find('span', string=re.compile(r'\d+\s*people\s*have\s*recently\s*purchased'))
            if not purchases_elem:
                purchases_elem = item.find('span', string=re.compile(r'\d+\s*osób\s*kupiło'))
            recent_purchases = purchases_elem.get_text().strip() if purchases_elem else "N/A"
            
            # Create product data dictionary
            product_data = {
                'title': title,
                'price': price,
                'seller': seller,
                'rating': rating,
                'ratings_count': ratings_count,
                'specifications': specifications,
                'link': link,
                'image_url': image_url,
                'delivery': delivery,
                'condition': condition,
                'installment_info': installment_info,
                'allegro_smart': has_smart,
                'recent_purchases': recent_purchases
            }
            
            listings.append(product_data)
            
        except Exception as e:
            print(f"  ⚠️ Error extracting product: {e}")
            continue
    
    return listings

def scrape_listings(url: str) -> Optional[List[Dict]]:
    """
    Scrape product listings from a category page
    
    Args:
        url: URL of the category page to scrape
        
    Returns:
        List of product dictionaries or None if failed
    """
    session = create_session()
    response = make_request_with_retry(session, url)
    
    if not response:
        return None
    
    soup = BeautifulSoup(response.content, 'lxml')  # use the faster lxml parser we installed
    listings = extract_product_listings(soup)
    
    return listings

The extraction function looks for product containers using li elements with the mb54_5r class and extracts comprehensive information including title, price, seller type (Business/Private), ratings, specifications, delivery information, and product condition. It includes fallback selectors to handle potential changes in Allegro's HTML structure.

Key Features of This Extraction:

  • Robust Selectors: Multiple fallback options ensure we don't miss data if Allegro changes their HTML
  • Comprehensive Data: We extract everything from basic info to detailed specifications
  • Error Handling: Each product extraction is wrapped in try/except to prevent one bad listing from breaking the entire scrape
  • Real-time Data: We capture dynamic elements like recent purchase counts and delivery promises

Main Execution

Let's run the smartphone category scraping example. This will demonstrate how all our functions work together to extract real data from Allegro's smartphone category.

def run_listings_example():
    """Run the smartphone category scraping example"""
    print("📱 Starting Allegro Smartphone Category Scraper")
    
    # Target URL for smartphone category
    url = "https://allegro.pl/kategoria/smartfony-i-telefony-komorkowe-165"
    
    # Scrape listings
    listings = scrape_listings(url)
    
    if listings:
        print(f"\n✅ Successfully scraped {len(listings)} product listings!")
        
        # Display first few results
        print("\n📋 Sample Results:")
        for i, product in enumerate(listings[:5], 1):
            print(f"  {i}. {product['title']}")
            print(f"     Price: {product['price']}")
            print(f"     Seller: {product['seller']}")
            print(f"     Rating: {product['rating']} ({product['ratings_count']})")
            print(f"     Condition: {product['condition']}")
            print(f"     Delivery: {product['delivery']}")
            print(f"     Installments: {product['installment_info']}")
            print(f"     Allegro Smart: {product['allegro_smart']}")
            print(f"     Recent Purchases: {product['recent_purchases']}")
            if product.get('specifications'):
                specs_str = ", ".join([f"{k}: {v}" for k, v in product['specifications'].items()])
                print(f"     Specs: {specs_str}")
            print()
        
        return listings
    else:
        print("❌ Failed to scrape listings")
        return None

# Run the example
if __name__ == "__main__":
    run_listings_example()

This example demonstrates how to scrape product listings from Allegro's smartphone category, extracting comprehensive information including ratings, specifications, delivery options, and seller details.
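The example above covers a single page. To go deeper into a category, a small wrapper can walk through several pages. Allegro appears to paginate category pages with a p query parameter (e.g. ?p=2) - treat that as an assumption and verify it in your browser first:

def scrape_category_pages(base_url: str, max_pages: int = 3) -> List[Dict]:
    """Scrape several pages of a category, assuming a ?p=N pagination parameter"""
    all_listings = []
    for page in range(1, max_pages + 1):
        # NOTE: the 'p' parameter is an assumption - confirm it before relying on it
        page_url = base_url if page == 1 else f"{base_url}?p={page}"
        listings = scrape_listings(page_url)
        if not listings:
            break  # stop on a failed or empty page instead of hammering the server
        all_listings.extend(listings)
        time.sleep(random.uniform(2, 5))  # polite delay between pages
    return all_listings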

What We Just Built:

  • A robust category scraper that handles Allegro's dynamic content
  • Comprehensive data extraction with multiple fallback strategies
  • Real-time processing of live marketplace data
  • Error-resistant code that continues even if some products fail to parse

Example Output

📱 Starting Allegro Smartphone Category Scraper
  ✅ Successfully accessed https://allegro.pl/kategoria/smartfony-i-telefony-komorkowe-165
  Found 90 product listings

✅ Successfully scraped 90 product listings!

📋 Sample Results:
  1. Smartfon Motorola Edge 50 Neo 8 GB / 256 GB 5G szary
     Price: Promowane
     Seller: Kolor
     Rating: 4,95 ((42))
     Condition: 4,95
     Delivery: dostawa jutro
     Installments: x 15 rat
     Allegro Smart: No
     Recent Purchases: N/A
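Notice the rough edges in this raw output: the price slot caught a "Promowane" ("Promoted") ad label, the ratings count arrives already wrapped in parentheses, and the condition field picked up the rating value. A small post-processing pass, sketched here against exactly those artifacts, keeps downstream analysis clean:

def normalize_listing(product: Dict) -> Dict:
    """Best-effort cleanup of the artifacts visible in the raw output above"""
    cleaned = dict(product)

    # "Promowane" ("Promoted") is an ad label, not a price
    if cleaned.get('price') in ('Promowane', 'N/A'):
        cleaned['price'] = None
    elif cleaned.get('price'):
        # Polish prices use comma decimals and spaces as thousand separators
        match = re.search(r'\d[\d\s]*[,.]?\d*', cleaned['price'])
        if match:
            cleaned['price'] = float(re.sub(r'\s', '', match.group()).replace(',', '.'))

    # ratings_count is often captured as "(42)"; strip the parentheses
    if cleaned.get('ratings_count') not in (None, 'N/A'):
        cleaned['ratings_count'] = cleaned['ratings_count'].strip('()')

    return cleaned

# Usage: cleaned = [normalize_listing(p) for p in listings]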

Example 2: Scraping Individual Product Details

Our second example will show how to extract detailed information from individual product pages, including specifications, descriptions, and seller information. This is where we get the deep, comprehensive data that makes scraping valuable.

Why Individual Product Pages Matter:

  • Rich Data: Product pages contain much more detailed information than category listings
  • Structured Data: Many e-commerce sites include JSON-LD or microdata for better SEO
  • Complete Information: From technical specs to seller reputation, everything is available
  • Real-time Availability: Stock levels, pricing, and delivery options are always current

Extracting Basic Product Information

We'll start by extracting the basic product information from the detail page. This function is more sophisticated than the category scraper because product pages have richer, more structured data.

def extract_basic_info(soup: BeautifulSoup) -> Dict:
    """
    Extract basic product information from the detail page
    
    Args:
        soup: BeautifulSoup object of the page
        
    Returns:
        Dictionary containing basic product information
    """
    basic_info = {}
    
    # Extract structured data from meta tags
    meta_url = soup.find('meta', attrs={'itemprop': 'url'})
    meta_sku = soup.find('meta', attrs={'itemprop': 'sku'})
    meta_gtin = soup.find('meta', attrs={'itemprop': 'gtin'})
    meta_brand = soup.find('meta', attrs={'itemprop': 'brand'})
    
    # Extract offer information
    offer_price = soup.find('meta', attrs={'itemprop': 'price'})
    offer_currency = soup.find('meta', attrs={'itemprop': 'priceCurrency'})
    offer_availability = soup.find('link', attrs={'itemprop': 'availability'})
    offer_condition = soup.find('meta', attrs={'itemprop': 'itemCondition'})
    
    # Extract product title from h1 element
    title_elem = soup.find('h1', class_='mp4t_0')
    if not title_elem:
        title_elem = soup.find('h1')
    if not title_elem:
        title_elem = soup.find('title')
    
    basic_info['title'] = title_elem.get_text().strip() if title_elem else "N/A"
    
    # Extract price from structured data or fallback to HTML
    if offer_price:
        price_value = offer_price.get('content', '')
        currency = offer_currency.get('content', 'PLN') if offer_currency else 'PLN'
        basic_info['price'] = f"{price_value} {currency}"
    else:
        price_elem = soup.find('span', class_='mli8_k4')
        if not price_elem:
            price_elem = soup.find('span', class_='mgn2_27')
        if not price_elem:
            price_elem = soup.find('span', string=re.compile(r'\d+\.?\d*\s*PLN'))
        
        basic_info['price'] = price_elem.get_text().strip() if price_elem else "N/A"
    
    # Extract structured data
    basic_info['sku'] = meta_sku.get('content', 'N/A') if meta_sku else "N/A"
    basic_info['gtin'] = meta_gtin.get('content', 'N/A') if meta_gtin else "N/A"
    basic_info['brand'] = meta_brand.get('content', 'N/A') if meta_brand else "N/A"
    basic_info['product_url'] = meta_url.get('content', 'N/A') if meta_url else "N/A"
    basic_info['availability'] = offer_availability.get('href', 'N/A') if offer_availability else "N/A"
    basic_info['condition'] = offer_condition.get('content', 'N/A') if offer_condition else "N/A"
    
    # Extract rating
    rating_elem = soup.find('span', class_='mgmw_wo')
    if not rating_elem:
        rating_elem = soup.find('span', string=re.compile(r'\d+[,.]\d+'))  # comma or dot decimals
    basic_info['rating'] = rating_elem.get_text().strip() if rating_elem else "N/A"
    
    # Extract number of ratings (English "ratings" or Polish "ocen"/"opinii")
    ratings_count_elem = soup.find('span', string=re.compile(r'\d+\s*(?:ratings|ocen|opinii)'))
    basic_info['ratings_count'] = ratings_count_elem.get_text().strip() if ratings_count_elem else "N/A"
    
    # Extract product images
    image_elements = soup.find_all('img', class_='msub_k4')
    if not image_elements:
        image_elements = soup.find_all('img', class_='mupj_5k')
    
    images = []
    for img in image_elements:
        src = img.get('src')
        if src and not src.startswith('data:') and 'allegroimg.com' in src:
            images.append(src)
    
    basic_info['images'] = images
    
    return basic_info

The basic information extraction focuses on the main product details like title, price, identifiers (SKU, GTIN, brand), and images. Notice how we prioritize structured data from meta tags - this is more reliable than parsing HTML and often contains cleaner, more standardized information. A JSON-LD complement is sketched after the list below.

Structured Data Advantage:

  • Reliability: Meta tags are less likely to change than CSS classes
  • Standardization: JSON-LD and microdata follow industry standards
  • Completeness: Often includes additional fields like GTIN, brand, and availability
  • Performance: Faster to extract than complex HTML parsing
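The function above reads microdata attributes; many product pages also embed JSON-LD blocks in script tags, and when present these are the easiest structured data to consume. Here's a hedged sketch - Allegro's exact payload may vary, so inspect it before building on it:

import json

def extract_json_ld(soup: BeautifulSoup) -> List[Dict]:
    """Collect any JSON-LD blocks embedded in the page (not guaranteed to be present)"""
    data = []
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            payload = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue  # skip malformed or empty blocks
        # A page may embed a single object or a list of them
        data.extend(payload if isinstance(payload, list) else [payload])
    return data

# Usage: products = [d for d in extract_json_ld(soup) if isinstance(d, dict) and d.get('@type') == 'Product']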

Extracting Product Specifications

Now we'll extract detailed product specifications and features. This is where we get the technical details that make product pages so valuable for market research and competitive analysis.

def extract_specifications(soup: BeautifulSoup) -> Dict:
    """
    Extract product specifications from the detail page
    
    Args:
        soup: BeautifulSoup object of the page
        
    Returns:
        Dictionary containing product specifications
    """
    specifications = {}
    
    # Find specifications table
    specs_table = soup.find('table', class_='myre_zn')
    if not specs_table:
        specs_table = soup.find('table')
    
    if specs_table:
        # Extract specification rows
        rows = specs_table.find_all('tr', class_='q1fzo')
        if not rows:
            rows = specs_table.find_all('tr')
        
        for row in rows:
            try:
                # Extract specification name and value from table cells
                cells = row.find_all('td')
                if len(cells) >= 2:
                    name = cells[0].get_text().strip()
                    value = cells[1].get_text().strip()
                    if name and value:
                        specifications[name] = value
                    
            except Exception as e:
                continue
    
    return specifications

def extract_features(soup: BeautifulSoup) -> List[str]:
    """
    Extract product features and technical specifications from the description
    
    Args:
        soup: BeautifulSoup object of the page
        
    Returns:
        List of product features
    """
    features = []
    
    # Find description section with technical specifications
    description_section = soup.find('div', class_='_0d3bd_K6Qpj')
    if not description_section:
        description_section = soup.find('div', class_='_0d3bd_am0a-')
    if not description_section:
        description_section = soup.find('div', string=re.compile(r'Technical specifications'))
    
    if description_section:
        # Extract feature items from unordered list
        feature_items = description_section.find_all('li')
        
        for item in feature_items:
            feature_text = item.get_text().strip()
            if feature_text and len(feature_text) > 10:  # Filter out very short items
                features.append(feature_text)
    
    return features

The specifications extraction looks for detailed product information, while the features extraction focuses on product highlights and key selling points. This dual approach ensures we capture both the technical specifications and the marketing highlights that sellers use to attract buyers.

Why Both Matter:

  • Technical Specs: Essential for product comparison and market analysis
  • Marketing Features: Shows how sellers position their products
  • Complete Picture: Technical + marketing data gives full product understanding
  • Competitive Intelligence: See what features competitors emphasize

Extracting Seller Information

We'll also extract comprehensive seller information from the product page. This is crucial for understanding the marketplace dynamics and seller reputation.

def extract_seller_info(soup: BeautifulSoup) -> Dict:
    """
    Extract seller information from the product page
    
    Args:
        soup: BeautifulSoup object of the page
        
    Returns:
        Dictionary containing seller information
    """
    seller_info = {}
    
    # Extract purchase count information (English or Polish phrasing)
    purchase_elem = soup.find('span', string=re.compile(r'\d+\s*people\s*have\s*recently\s*purchased'))
    if not purchase_elem:
        purchase_elem = soup.find('span', string=re.compile(r'\d+\s*osób\s*kupiło'))
    seller_info['recent_purchases'] = purchase_elem.get_text().strip() if purchase_elem else "N/A"
    
    # Extract invoice information (labelled "Faktura" in the Polish spec table)
    invoice_elem = soup.find('td', string=re.compile(r'Invoice|Faktura'))
    if invoice_elem:
        invoice_value = invoice_elem.find_next_sibling('td')
        if invoice_value:
            seller_info['invoice'] = invoice_value.get_text().strip()
        else:
            seller_info['invoice'] = "N/A"
    else:
        seller_info['invoice'] = "N/A"
    
    # Extract manufacturer code ("Kod producenta" on the Polish site)
    code_elem = soup.find('td', string=re.compile(r'Manufacturer code|Kod producenta'))
    if code_elem:
        code_value = code_elem.find_next_sibling('td')
        if code_value:
            seller_info['manufacturer_code'] = code_value.get_text().strip()
        else:
            seller_info['manufacturer_code'] = "N/A"
    else:
        seller_info['manufacturer_code'] = "N/A"
    
    # Extract EAN/GTIN
    ean_elem = soup.find('td', string=re.compile(r'EAN'))
    if ean_elem:
        ean_value = ean_elem.find_next_sibling('td')
        if ean_value:
            seller_info['ean'] = ean_value.get_text().strip()
        else:
            seller_info['ean'] = "N/A"
    else:
        seller_info['ean'] = "N/A"
    
    # Extract delivery information
    delivery_elem = soup.find('span', string=re.compile(r'dostawa\s*jutro'))
    seller_info['delivery_info'] = delivery_elem.get_text().strip() if delivery_elem else "N/A"
    
    # Extract product variants
    variants = []
    variant_sections = soup.find_all('div', class_='_563f1_fqgxS')
    for section in variant_sections:
        variant_type = section.find_previous_sibling('span', class_='_563f1_Cpfka')
        if variant_type:
            variant_type_text = variant_type.get_text().strip()
            variant_options = section.find_all('span', class_='_563f1_4z3uJ')
            if variant_options:
                options = [opt.get_text().strip() for opt in variant_options]
                variants.append({
                    'type': variant_type_text,
                    'options': options
                })
    
    seller_info['variants'] = variants
    
    # Extract installment information
    installment_elem = soup.find('span', string=re.compile(r'x\s*\d+\s*rat'))
    seller_info['installment_info'] = installment_elem.get_text().strip() if installment_elem else "N/A"
    
    # Extract Allegro Smart badge
    smart_badge = soup.find('img', alt='Allegro Smart!')
    seller_info['allegro_smart'] = "Yes" if smart_badge else "No"
    
    # Extract best price guarantee
    bpg_elem = soup.find('span', string=re.compile(r'Gwarancja najniższej ceny'))
    seller_info['best_price_guarantee'] = "Yes" if bpg_elem else "No"
    
    return seller_info

def scrape_product_details(url: str) -> Optional[Dict]:
    """
    Scrape detailed product information from a product page
    
    Args:
        url: URL of the product page to scrape
        
    Returns:
        Dictionary containing detailed product information or None if failed
    """
    session = create_session()
    response = make_request_with_retry(session, url)
    
    if not response:
        return None
    
    soup = BeautifulSoup(response.content, 'lxml')
    
    # Extract different types of information
    basic_info = extract_basic_info(soup)
    specifications = extract_specifications(soup)
    features = extract_features(soup)
    seller_info = extract_seller_info(soup)
    
    # Combine all data
    result = {
        'url': url,
        **basic_info,
        'specifications': specifications,
        'features': features,
        'seller': seller_info
    }
    
    return result

The seller and offer extraction provides insight into how an offer is presented and fulfilled, which is crucial for e-commerce analysis. We're capturing everything from recent purchase counts to product variants and installment options.

Advanced Data Points We Extract:

  • Purchase Analytics: Recent purchase counts show product popularity
  • Payment Options: Installment plans reveal pricing strategies
  • Product Variants: Color, size, and memory options for complete inventory
  • Trust Signals: Allegro Smart badges and best price guarantees
  • Logistics: Delivery promises such as next-day shipping

Main Execution

Let's run the individual product scraping example. This will show how we can extract comprehensive data from a single product page, demonstrating the full power of our scraping capabilities.

def run_product_details_example():
    """Run the individual product details scraping example"""
    print("📱 Starting Allegro Individual Product Scraper")
    
    # Target URL for individual product
    url = "https://allegro.pl/oferta/smartfon-xiaomi-14t-pro-12-gb-512-gb-5g-niebieski-17386285003"
    
    # Scrape product details
    product_data = scrape_product_details(url)
    
    if product_data:
        print(f"\n✅ Successfully scraped product details!")
        
        # Display the results
        print("\n📋 Product Information:")
        print(f"  Title: {product_data.get('title', 'N/A')}")
        print(f"  Price: {product_data.get('price', 'N/A')}")
        print(f"  Brand: {product_data.get('brand', 'N/A')}")
        print(f"  SKU: {product_data.get('sku', 'N/A')}")
        print(f"  GTIN: {product_data.get('gtin', 'N/A')}")
        print(f"  Condition: {product_data.get('condition', 'N/A')}")
        print(f"  Availability: {product_data.get('availability', 'N/A')}")
        print(f"  Rating: {product_data.get('rating', 'N/A')} ({product_data.get('ratings_count', 'N/A')})")
        
        if product_data.get('specifications'):
            print(f"\n🔧 Specifications:")
            for key, value in product_data['specifications'].items():
                print(f"  {key}: {value}")
        
        if product_data.get('features'):
            print(f"\n✨ Technical Features:")
            for feature in product_data['features'][:10]:  # Show first 10 features
                print(f"  • {feature}")
            if len(product_data['features']) > 10:
                print(f"  ... and {len(product_data['features']) - 10} more features")
        
        if product_data.get('seller'):
            print(f"\n👤 Product Information:")
            seller = product_data['seller']
            print(f"  Recent Purchases: {seller.get('recent_purchases', 'N/A')}")
            print(f"  Delivery Info: {seller.get('delivery_info', 'N/A')}")
            print(f"  Installment Info: {seller.get('installment_info', 'N/A')}")
            print(f"  Allegro Smart: {seller.get('allegro_smart', 'N/A')}")
            print(f"  Best Price Guarantee: {seller.get('best_price_guarantee', 'N/A')}")
            print(f"  Invoice: {seller.get('invoice', 'N/A')}")
            print(f"  Manufacturer Code: {seller.get('manufacturer_code', 'N/A')}")
            print(f"  EAN/GTIN: {seller.get('ean', 'N/A')}")
            
            if seller.get('variants'):
                print(f"\n🎨 Product Variants:")
                for variant in seller['variants']:
                    print(f"  {variant['type']}: {', '.join(variant['options'])}")
        
        return product_data
    else:
        print("❌ Failed to scrape product details")
        return None

# Run the example
if __name__ == "__main__":
    run_product_details_example()

This example demonstrates how to extract comprehensive product information from individual Allegro product pages. We've built a sophisticated scraper that can handle the complexity of modern e-commerce pages.

What Makes This Advanced:

  • Structured Data Extraction: Leverages JSON-LD and microdata for reliable data
  • Multi-layered Fallbacks: Multiple extraction strategies ensure we get the data
  • Rich Information: From technical specs to marketing features and seller analytics
  • Real-world Ready: Handles the complexity of actual Allegro product pages

Example Output

📱 Starting Allegro Individual Product Scraper
  ✅ Successfully accessed https://allegro.pl/oferta/smartfon-xiaomi-14t-pro-12-gb-512-gb-5g-niebieski-17386285003

✅ Successfully scraped product details!

📋 Product Information:
  Title: Smartfon Xiaomi 14T Pro 12 GB / 512 GB 5G niebieski
  Price: 2135.82 PLN
  Brand: Xiaomi
  SKU: 17386285003
  GTIN: 6941812789353
  Condition: http://schema.org/NewCondition
  Availability: http://schema.org/InStock
  Rating: 14T Pro (N/A)

🔧 Specifications:
  Stan: Nowy
  Faktura: Wystawiam fakturę VAT
  Kod producenta: 6941812789353
  Marka telefonu: Xiaomi
  Model telefonu: 14T Pro
  Typ: Smartfon
  EAN (GTIN): 6941812789353
  Kolor: niebieski

✨ Technical Features:
  • Typ urządzenia: Smartfon
  • Seria: 14T Pro
  • Procesor: MediaTek Dimensity 9300+
  • Układ graficzny: Immortalis-G720 MC12
  • Pamięć RAM: 12 GB
  • Pamięć wbudowana: 512 GB
  • Typ ekranu: Dotykowy, AMOLED
  • Częstotliwość odświeżania ekranu: 144 Hz
  • Przekątna ekranu: 6,67"
  • Rozdzielczość ekranu: 2712 x 1220
  ... and 21 more features

👤 Seller & Offer Information:
  Recent Purchases: 185 osób kupiło ostatnio
  Delivery Info: dostawa jutro
  Installment Info: x 15 rat
  Allegro Smart: Yes
  Best Price Guarantee: Yes
  Invoice: N/A
  Manufacturer Code: N/A
  EAN/GTIN: 6941812789353

🎨 Product Variants:
  Wbudowana pamięć: 1 TB, 512 GB
  Pamięć RAM: 12 GB
  Kolor: czarny, niebieski, szary

Handling Anti-Bot Protection

Allegro.pl implements various anti-bot measures to prevent automated scraping. Here are the key strategies we use to avoid detection. Understanding these techniques is crucial for any serious web scraping project.

Why Anti-Bot Protection Matters:

  • Modern Websites: Most e-commerce sites use sophisticated bot detection
  • IP Blocking: Getting blocked can stop your scraping project entirely
  • Rate Limiting: Too many requests can trigger temporary blocks
  • Legal Compliance: Respectful scraping is more likely to be tolerated

For more advanced anti-blocking techniques, check out our comprehensive guide 5 Tools to Scrape Without Blocking and How it All Works, which explains JavaScript and TLS (JA3) fingerprinting, the role request headers play in blocking, IP rotation, and other detection methods.

User Agent Rotation

def rotate_user_agents():
    """Rotate user agents to avoid detection"""
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0'
    ]
    return random.choice(user_agents)

Rotating user agents helps mimic different browsers and reduces the chance of being detected as a bot. This is one of the most basic but effective anti-detection techniques.

How It Works:

  • Browser Diversity: Different user agents represent different browsers and operating systems
  • Pattern Avoidance: Bots often use the same user agent repeatedly
  • Realistic Behavior: Real users don't all use the same browser version
  • Detection Evasion: Bot detection systems look for suspicious user agent patterns

Session Management

def create_advanced_session():
    """Create a session with advanced anti-detection measures"""
    session = requests.Session()
    
    # Set rotating user agent
    session.headers.update({
        'User-Agent': rotate_user_agents(),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'pl-PL,pl;q=0.9,en;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Cache-Control': 'max-age=0'
    })
    
    return session

Advanced session management includes proper headers and Polish language preferences to appear more natural. This creates a more convincing browser fingerprint.

Session Management Benefits:

  • Cookie Persistence: Maintains session state across requests
  • Connection Reuse: More efficient than creating new connections
  • Header Consistency: Real browsers send consistent headers
  • Language Localization: Polish headers make requests appear local

Advanced Scraping Techniques

For more robust scraping, consider implementing these advanced techniques. These are the next level of sophistication for production scraping systems.

Proxy Rotation

def use_proxy_rotation():
    """Example of proxy rotation for large-scale scraping"""
    proxies = [
        'http://proxy1:port',
        'http://proxy2:port',
        'http://proxy3:port'
    ]
    
    proxy = random.choice(proxies)
    return {'http': proxy, 'https': proxy}

Proxy rotation helps distribute requests across different IP addresses to avoid rate limiting, which is essential for large-scale scraping operations. A usage sketch follows the list below.

Proxy Rotation Benefits:

  • IP Distribution: Spreads requests across multiple IP addresses
  • Geographic Diversity: Can use proxies from different locations
  • Rate Limit Avoidance: Each IP has its own rate limit
  • Block Recovery: If one IP gets blocked, others continue working
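Wiring the proxy dictionary into the session from earlier is a one-liner; the addresses returned by use_proxy_rotation() are placeholders, so substitute real proxy endpoints:

# Hypothetical usage - replace the placeholder proxies with real endpoints
session = create_session()
response = session.get(
    'https://allegro.pl/kategoria/smartfony-i-telefony-komorkowe-165',
    proxies=use_proxy_rotation(),  # route this request through a randomly chosen proxy
    timeout=30
)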

Retry Logic

def make_request_with_retry(session, url, max_retries=3):
    """Make request with retry logic for failed requests"""
    for attempt in range(max_retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {e}")
                return None
            print(f"Attempt {attempt + 1} failed, retrying...")
            time.sleep(random.uniform(2, 5))

Retry logic ensures that temporary network issues don't cause scraping failures. This is crucial for maintaining reliability in production environments; an exponential-backoff variant is sketched after the list below.

Why Retry Logic Matters:

  • Network Resilience: Handles temporary connectivity issues
  • Server Recovery: Gives servers time to recover from overload
  • Data Completeness: Ensures no data is lost due to transient failures
  • Production Reliability: Essential for automated scraping systems
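A common refinement is exponential backoff: instead of a fixed random sleep, each retry waits roughly twice as long as the previous one, giving an overloaded server progressively more breathing room. A minimal sketch:

def make_request_with_backoff(session, url, max_retries=3, base_delay=2.0):
    """Retry with exponentially growing delays (2s, 4s, 8s, ...) plus random jitter"""
    for attempt in range(max_retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            if attempt == max_retries - 1:
                print(f"Failed after {max_retries} attempts: {e}")
                return None
            # Jitter prevents many clients from retrying in lockstep
            wait = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed, retrying in {wait:.1f}s...")
            time.sleep(wait)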

For more advanced data processing and analysis techniques, see our guide How to Observe E-Commerce Trends using Web Scraping, an example project that monitors e-commerce trends using Python, web scraping, and data visualization tools.

Scraping with Scrapfly

For production scraping of Allegro.pl, consider using Scrapfly, which provides:

  • Residential proxies that bypass IP-based blocking
  • Automatic bot detection bypass using advanced browser fingerprinting
  • JavaScript rendering to handle dynamic content
  • Rate limiting and retry logic built-in
  • Geolocation targeting for region-specific data

Scrapfly handles the complex anti-bot measures that Allegro implements, allowing you to focus on data extraction rather than infrastructure management.
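As a minimal sketch, Scrapfly's Python SDK (pip install scrapfly-sdk) wraps all of this behind a single call; the parameters below reflect the SDK's documented options, but check the current documentation before relying on them:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR-SCRAPFLY-API-KEY")
result = client.scrape(ScrapeConfig(
    url="https://allegro.pl/kategoria/smartfony-i-telefony-komorkowe-165",
    asp=True,        # enable anti-scraping protection bypass
    country="PL",    # route requests through Polish proxies
    render_js=True,  # render JavaScript like a real browser
))
html = result.content  # rendered HTML, ready for BeautifulSoup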

When to Use Scrapfly:

  • Large-scale Projects: When you need to scrape thousands of pages
  • Production Systems: For reliable, automated data collection
  • Complex Sites: When JavaScript rendering is required
  • Geographic Requirements: When you need specific country IPs

Best Practices and Tips

When scraping Allegro.pl, follow these best practices to ensure reliable data collection. These guidelines will help you build robust, ethical scraping systems.

Rate Limiting

  • Implement delays between requests (1-3 seconds minimum; see the throttle sketch after this list)
  • Use random delays to appear more human-like
  • Avoid making too many requests from the same IP
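A minimal throttle that enforces these delays across a whole scraping run (reusing the time and random imports from earlier):

class Throttle:
    """Enforce a minimum randomized interval between consecutive requests"""
    def __init__(self, min_delay: float = 1.0, max_delay: float = 3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last_request
        delay = random.uniform(self.min_delay, self.max_delay)
        if elapsed < delay:
            time.sleep(delay - elapsed)  # only sleep for the remaining time
        self.last_request = time.monotonic()

# Usage: throttle = Throttle(); call throttle.wait() before every session.get(...)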

Why Rate Limiting is Critical:

  • Server Protection: Prevents overwhelming the target server
  • Detection Avoidance: Too many requests trigger bot detection
  • Ethical Scraping: Respectful scraping is more likely to be tolerated
  • Legal Compliance: Many sites have terms of service about scraping

Error Handling

  • Always check HTTP status codes
  • Implement retry logic for failed requests
  • Handle network timeouts gracefully

Robust Error Handling Benefits:

  • Data Completeness: Ensures no data is lost due to transient errors
  • System Reliability: Prevents crashes from unexpected responses
  • Debugging: Better error messages help identify issues quickly
  • Production Ready: Essential for automated systems

Data Validation

  • Verify extracted data is not empty or malformed (see the validator sketch after this list)
  • Implement fallback selectors for changing HTML structure
  • Log extraction errors for debugging
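Here's a sketch of what that validation can look like in practice; the required fields are illustrative, not prescriptive:

import logging

logger = logging.getLogger("allegro_scraper")

REQUIRED_FIELDS = ('title', 'price', 'link')  # adjust to your own needs

def is_valid_listing(product: Dict) -> bool:
    """Reject listings with empty or placeholder values in required fields"""
    for field in REQUIRED_FIELDS:
        value = product.get(field)
        if not value or value == 'N/A':
            logger.warning("Dropping listing: missing %r (title=%r)", field, product.get('title'))
            return False
    return True

# Usage: valid_listings = [p for p in listings if is_valid_listing(p)]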

Data Quality Assurance:

  • Accuracy: Validated data is more reliable for analysis
  • Consistency: Ensures uniform data format across all scrapes
  • Change Adaptation: Fallback selectors handle site updates
  • Monitoring: Logs help track scraping success rates

For more comprehensive web scraping best practices, see our guide Everything to Know to Start Web Scraping in Python Today, a complete introduction to web scraping with Python covering HTTP, parsing, AI, scaling, and deployment.

If you're interested in scraping other e-commerce platforms, check out these guides:

  • How to Scrape AutoScout24: scrape the automotive marketplace for car listings, prices, specifications, and detailed vehicle information
  • How to Scrape Ticketmaster: extract event data including concerts, venues, dates, and ticket information
  • How to Scrape Mouser.com: collect electronic component data including prices, specifications, and inventory

FAQ

Now let's answer some frequently asked questions.

What makes Allegro.pl scraping particularly challenging?

Allegro.pl uses sophisticated anti-bot protection including IP tracking, JavaScript-rendered content, dynamic HTML structures, and Polish language requirements. The main challenges include 403 Forbidden errors, IP-based blocking, changing CSS selectors, and the need for realistic Polish user behavior patterns. The site also implements rate limiting and bot detection systems that can block automated access.

What advanced data points can I extract from Allegro listings?

Beyond basic product information, you can extract comprehensive data including seller type (Business/Private), detailed ratings with review counts, product specifications (color, screen size, memory, RAM), delivery promises, product condition, installment payment options, Allegro Smart badge status, recent purchase counts, and dynamic pricing information. The listings also include trust signals and seller reputation indicators that are valuable for market analysis.

How do you handle Allegro's dynamic content and changing HTML structure?

Our code implements multiple fallback strategies and robust error handling. We use multiple CSS selectors for each data point, prioritize structured data from meta tags, and wrap each extraction in try/except blocks. The code also includes retry logic for failed requests and validates extracted data to ensure quality. This multi-layered approach ensures we can adapt to HTML changes while maintaining data extraction reliability.

Summary

This guide demonstrates how to scrape Allegro.pl for product listings and individual product details using Python. We've built a robust scraper that handles Allegro's dynamic content, structured data extraction, and anti-bot protection. The code includes multiple fallback strategies, robust error handling, and real-time data extraction capabilities.

The simple approach using requests and BeautifulSoup provides a good balance of reliability and ease of use, while the anti-blocking techniques help avoid detection. For production use, consider implementing additional features like rate limiting, proxy rotation, and data storage.
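As a starting point for the data storage piece, here is a sketch that writes scraped listings to a timestamped JSON file:

import json
from datetime import datetime, timezone
from typing import Dict, List

def save_results(listings: List[Dict], prefix: str = "allegro_listings") -> str:
    """Write scraped listings to a timestamped JSON file and return its path"""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = f"{prefix}_{timestamp}.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(listings, f, ensure_ascii=False, indent=2)  # keep Polish characters readable
    return path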
