# Web Scraping for Lead Generation: Build Your Own B2B Database

 by [Ziad Shamndy](https://scrapfly.io/blog/author/ziad) Jun 23, 2026 28 min read [\#python](https://scrapfly.io/blog/tag/python) [\#scrapeguide](https://scrapfly.io/blog/tag/scrapeguide) 


Most B2B teams rent the same five databases and still cannot find the niche accounts their ICP actually lives in. The fix is to build the database yourself by scraping public sources. Writing the scraper is the easy part.

An LLM produces a working parser in seconds. Running dozens of scrapers without getting blocked, then resolving the same company across sources, is the real engineering problem. That is what this guide addresses.



## Key Takeaways

- Building your own lead database beats renting one when your ICP is niche, when you need freshness you control, or when aggregator coverage is thin.
- The most productive public sources are company websites and team pages, professional platforms like LinkedIn and Crunchbase, local directories like Google Maps and Yelp, job boards for hiring signals, social platforms, and government registries.
- Writing the scraper is now trivial. Running many source scrapers without getting blocked, and surviving layout drift over time, is the real engineering challenge.
- Cross-source deduplication and scheduled re-crawls are the steps most builds skip. They are also the steps that determine whether the database stays useful past the first week.
- Compliance is operational, not a one-time checkbox. Collect only publicly available data, treat EU B2B contacts as personal data under GDPR, and honor opt-outs immediately.








## Why Build Your Own B2B Lead Database Instead of Renting One from ZoomInfo or Apollo?

Build your own when rented databases miss your niche, when you need freshness you control, or when per-lead cost at scale no longer makes sense. Rent when you need broad, ready-made coverage fast.

Apollo, ZoomInfo, and Sales Navigator work well for common ICPs. The gaps appear in narrow verticals and specific geographies. A developer on r/b2b\_sales described building lists for Australian energy developers, data center operators with active expansion projects, and beverage co-packers across three countries. None of those ICPs exist as a clean segment in any aggregator.

Building gives you coverage of any public source, not just sources the aggregator indexed. It also gives you control over the refresh cadence and the schema, including custom fields like technology stack or open-position count.

The tradeoff is time. A rented database is ready today. A scraped one takes a week to build and ongoing work to maintain. Many teams run both: rented for broad outreach, scraped for the niche segments the aggregators miss.

One caveat worth stating: LinkedIn and ZoomInfo prohibit scraping in their Terms of Service. Scrapfly has guides for [scraping LinkedIn](https://scrapfly.io/blog/posts/how-to-scrape-linkedin) and [scraping ZoomInfo](https://scrapfly.io/blog/posts/how-to-scrape-zoominfo), but the recommended build path runs through cleaner public sources with far less legal exposure.

| Scenario | Build your own | Rent a database |
|---|---|---|
| Niche vertical or geography | Wins | Thin coverage |
| Broad outreach at speed | Slower to start | Wins |
| Custom fields and scoring | Full control | Limited |
| Freshness-critical data | Full control | Varies by plan |
| Tight per-lead budget at scale | Infrastructure cost | Per-seat / per-lead cost |

With the build-vs-rent case settled, the next question is where to find the data.



## Which Public Sources Can You Scrape for B2B Leads?

Before picking sources, define the target record. A usable lead has three layers: company-level firmographics, contact-level details, and signal data that makes the record actionable.

Here is the schema that drives the rest of the pipeline:

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class LeadRecord:
    # Company-level
    domain: str              # dedup key
    company_name: str
    industry: Optional[str] = None
    employee_count: Optional[int] = None
    revenue_range: Optional[str] = None
    location: Optional[str] = None
    founding_year: Optional[int] = None
    funding_stage: Optional[str] = None
    tech_stack: List[str] = field(default_factory=list)
    # Contact-level
    contact_name: Optional[str] = None
    contact_title: Optional[str] = None
    contact_email: Optional[str] = None
    contact_phone: Optional[str] = None
    linkedin_url: Optional[str] = None
    # Signal data
    open_positions: int = 0
    recent_news: List[str] = field(default_factory=list)
    review_activity: Optional[str] = None
    # Metadata
    source_urls: List[str] = field(default_factory=list)
    last_enriched: Optional[str] = None
    quality_score: float = 0.0
```



The `domain` field is the dedup key across all sources. Every field added from a new source updates this one record rather than creating a duplicate.

The table below maps each source category to what it yields, the ICP it fits best, and the friction you should expect:

| Source | What it yields | Best ICP fit | Typical friction |
|---|---|---|---|
| Company websites / team pages | Names, titles, emails, tech stack, job signals | Any B2B | Low to none |
| LinkedIn | People, titles, company pages, connections | All B2B verticals | Login wall, rate limits |
| Crunchbase | Funding, investors, employee range, leadership | Tech, startups, VC-backed | Moderate rate limits |
| ZoomInfo | Revenue range, org hierarchy, verified contacts | Enterprise sales | Login wall, aggressive anti-bot |
| G2 | Software buyers, company size, tech stack | SaaS vendors | Moderate |
| Wellfound / AngelList | Startup founders, role data, equity signals | Startup recruiting | Moderate |
| SimilarWeb | Traffic, category, tech signals | Digital/SaaS | Moderate |
| Google Maps | Name, phone, address, website, rating | Local/SMB | JS rendering, rate limits |
| Yelp | Same as Maps plus review signals | Local services | Moderate anti-bot |
| Yellow Pages | Phone, address, category | Local B2B | Low |
| TripAdvisor | Hospitality operators, ratings, contact | Hospitality vertical | Low to moderate |
| Trustpilot | Review activity, company size signals | Any B2B with reviews | Low |
| Indeed / Glassdoor | Hiring companies, roles, tech stack, culture | Growth-stage buyers | JS rendering |
| Twitter / X | Company activity, founders, social signals | SMB, creator economy | Rate limits, auth |
| YouTube | Creator and brand contact signals | Creator economy | Low |
| Instagram / TikTok | Brand accounts, creator contacts | D2C, creator economy | Auth walls |
| Reddit | Community operators, niche founders | Developer, niche B2B | Low |
| Government registries | Legal entity, officers, procurement activity | Regulated industries | Portal friction |

### Company Websites and Team Pages

Company websites are the most authoritative source in the pipeline. The `/about`, `/team`, and `/contact` pages carry leadership names, titles, and email addresses. The `/careers` page reveals growth stage and technology requirements from job descriptions. The footer usually holds the legal entity name, social links, and sometimes a direct phone number.

No login wall blocks this data. The friction is scale: visiting hundreds of company sites requires rotating IPs and handling inconsistent HTML across domains. That is why the extraction section covers AI-based field normalization.

For a more detailed implementation, including complete code examples and extraction techniques, refer to our dedicated tutorial:

[Web Scraping Emails using Python](https://scrapfly.io/blog/posts/how-to-scrape-emails-using-python): how to crawl pages and extract email addresses using Python, and the common challenges involved.

### Professional and Company-Intel Platforms

LinkedIn gives you decision maker names, titles, and company pages with employee counts and recent posts. [Scraping LinkedIn](https://scrapfly.io/blog/posts/how-to-scrape-linkedin) requires session management and careful rate limiting because the platform throttles hard and bans scrapers that exceed its request thresholds. For API alternatives, see the [LinkedIn API and alternatives guide](https://scrapfly.io/blog/posts/guide-to-linkedin-api-and-alternatives).

Crunchbase exposes funding rounds, investor names, and leadership profiles for technology companies. It is the best source for funding stage and investor data. See the [Crunchbase scraping guide](https://scrapfly.io/blog/posts/how-to-scrape-crunchbase).

ZoomInfo carries revenue ranges and organizational hierarchy but sits behind an aggressive login wall. Use it selectively and read the [ZoomInfo scraping guide](https://scrapfly.io/blog/posts/how-to-scrape-zoominfo) before building against it.

G2 is valuable for software buyer leads: company names, sizes, and the software categories they have reviewed. See [how to scrape G2 company data and reviews](https://scrapfly.io/blog/posts/how-to-scrape-g2-company-data-and-reviews).

Wellfound (formerly AngelList) surfaces startup founders and early employees. It is the best source for pre-Series B company data. See [how to scrape Wellfound](https://scrapfly.io/blog/posts/how-to-scrape-wellfound-aka-angellist).

SimilarWeb provides traffic estimates and technology signals useful for digital-first ICP scoring. See [how to scrape SimilarWeb](https://scrapfly.io/blog/posts/how-to-scrape-similarweb).

### Local Business Directories

Google Maps is the strongest source for local and service-business leads. Each listing exposes the business name, phone number, address, website URL, and aggregate rating. The [Google Maps scraping guide](https://scrapfly.io/blog/posts/how-to-scrape-google-maps) covers the full extraction flow.

Yelp adds review-based signals on top of directory data and covers service verticals that Maps underrepresents. See [how to scrape Yelp](https://scrapfly.io/blog/posts/how-to-scrape-yelpcom).

Yellow Pages remains a reliable source for local B2B contacts with minimal anti-bot protection. See [how to scrape Yellow Pages](https://scrapfly.io/blog/posts/how-to-scrape-yellowpages).

TripAdvisor is the go-to source for hospitality and venue operators. See [how to scrape TripAdvisor](https://scrapfly.io/blog/posts/how-to-scrape-tripadvisor). Trustpilot adds review-activity signals that work as a proxy for company size and customer volume. See [how to scrape Trustpilot](https://scrapfly.io/blog/posts/how-to-scrape-trustpilot-com-reviews).

### Hiring Signals from Job Boards

Indeed, Glassdoor, and Google Jobs surface which companies are actively expanding and what roles and technologies they need. A spike in engineering postings at a company that has never hired engineers before is a buying signal for developer tools.

Job postings also expose technology requirements. A posting that asks for Salesforce admin experience tells you the company runs Salesforce. That single field improves ICP scoring more than most firmographic lookups.
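
As a rough illustration of that idea, the sketch below scans a job description for technology names. The `TECH_KEYWORDS` map and the `tech_from_posting` helper are hypothetical names chosen for this example, and the keyword list is only a starting point to extend for your own ICP.

```python
import re

# Hypothetical keyword map; extend it with the tools that matter for your ICP
TECH_KEYWORDS = {
    "Salesforce": r"\bsalesforce\b",
    "HubSpot": r"\bhubspot\b",
    "Snowflake": r"\bsnowflake\b",
    "Kubernetes": r"\b(kubernetes|k8s)\b",
}


def tech_from_posting(text: str) -> list[str]:
    """Return the technologies named in a job description, in map order."""
    return [
        name
        for name, pattern in TECH_KEYWORDS.items()
        if re.search(pattern, text, re.IGNORECASE)
    ]


posting = "Seeking a Salesforce admin with 3+ years of experience. HubSpot a plus."
print(tech_from_posting(posting))
```

Word-boundary matching keeps short names from firing inside longer words. A mention in a posting is a hint rather than proof, so it is safer to treat the result as a scoring input than as a confirmed fact about the company.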

### Social and Creator Platforms

Twitter/X, YouTube, Instagram, TikTok, and Reddit are best for creator-economy and personal-brand leads rather than enterprise contacts. They also surface company activity signals that do not show up in directories.

See the guides for [scraping Twitter/X](https://scrapfly.io/blog/posts/how-to-scrape-twitter), [scraping YouTube](https://scrapfly.io/blog/posts/how-to-scrape-youtube), [scraping Instagram](https://scrapfly.io/blog/posts/how-to-scrape-instagram), [scraping TikTok](https://scrapfly.io/blog/posts/how-to-scrape-tiktok-python-json), and [scraping Reddit](https://scrapfly.io/blog/posts/how-to-scrape-reddit-social-data).

### Government Registries and Public Records

Business registries, SAM.gov, and IRS Form 990 filings expose legal entity names, registered officers, and procurement activity. These sources are especially valuable for regulated industries and government contracting leads.

Government portals often introduce additional challenges, such as ASP.NET session tokens, CAPTCHAs, and aggressive cloud-IP blocking. Because of this friction and the need to normalize records across jurisdictions, treat government registries as a secondary enrichment source rather than the primary discovery channel. For more details, see the public records scraping guide.

With the source map in hand, the next challenge is extracting from sources that actively resist automated access.



## How Do You Extract Lead Data Without Getting Blocked?

Start with the cheapest path that works: check for a hidden JSON endpoint in the browser's network tab before writing an HTML parser. Then escalate to managed unblocking, AI-based extraction, and orchestrated crawling when sources block cloud IPs or render with JavaScript.

An LLM can write a working scraper for any single site in seconds. The bottleneck is not writing scrapers. It is running dozens of them continuously across sources that actively block you and change their markup without warning.

The failure modes at multi-source scale are predictable:

- **Cloud-IP and datacenter blocks.** Scrapers that work locally return 403s when running from AWS or GCP ranges. LinkedIn, Google Maps, and most major directories block datacenter IPs aggressively.
- **JavaScript-rendered listings.** Directory results and profile pages often load after the initial HTML response. A requests-only scraper sees an empty shell where the data should be.
- **Per-site rate limits and login walls.** LinkedIn throttles hard. Safe scraping means small batches, IP rotation, and exponential backoff on 429 responses.
- **Layout drift across sources.** Every source uses different HTML structure. Selectors written for one site break when the site redeploys. Multiply by a dozen sources and maintenance becomes the job.
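
The exponential backoff mentioned in the rate-limit bullet can be sketched as follows. This is a minimal illustration, not a production retry layer: `fetch_with_backoff` is a hypothetical helper, and it assumes you pass in your own `fetch` callable that returns a `(status_code, body)` tuple.

```python
import random
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch(url), retrying with exponential backoff while it returns 429."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            # Any other status (200, 403, 500, ...) goes back to the caller as-is
            return status, body
        if attempt == max_retries:
            break
        # Delay doubles on each attempt (1s, 2s, 4s, ...) up to max_delay,
        # plus up to 10 percent jitter so parallel workers do not retry in lockstep
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")


# Demo with a stub that is rate limited twice, then succeeds (no real waiting)
_responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
status, body = fetch_with_backoff(
    lambda url: next(_responses), "https://example.com/listing", sleep=lambda s: None
)
print(status, body)
```

Injecting `sleep` keeps the helper testable without real delays. In a real scraper the `fetch` callable would wrap whatever HTTP client you already use.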

See [how to find and scrape hidden APIs](https://scrapfly.io/blog/posts/how-to-scrape-hidden-apis) for the network-tab approach. For a thorough breakdown of blocking techniques, see [how to bypass anti-bot protection when web scraping](https://scrapfly.io/blog/posts/how-to-bypass-anti-bot-protection-when-web-scraping).

The escalation path maps directly to the friction type. For sources that block cloud IPs, render with JavaScript, or enforce rate limits, use the [Web Scraping API](https://scrapfly.io/products/web-scraping-api) with ASP enabled. ASP rotates residential IPs, renders JavaScript in a real browser context, and manages the challenge response automatically.

```python
import asyncio
from scrapfly import ScrapflyClient, ScrapeConfig

SCRAPFLY = ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY")

async def fetch_lead_page(url: str) -> str:
    result = await SCRAPFLY.async_scrape(ScrapeConfig(
        url,
        asp=True,        # Anti Scraping Protection bypass
        render_js=True,  # Full browser rendering
        country="US",    # Route through proxies located in the target market
    ))
    return result.content

html = asyncio.run(fetch_lead_page("https://web-scraping.dev/products"))
print(html[:500])
```



For heterogeneous HTML across many sources, per-site selectors multiply and break constantly. The [AI Extraction API](https://scrapfly.io/products/extraction-api) solves this by reading the page and returning the target schema fields directly, without selectors.

```python
from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig

SCRAPFLY = ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY")

def ai_extract_lead(url: str) -> dict:
    result = SCRAPFLY.scrape(ScrapeConfig(
        url,
        asp=True,
        render_js=True,
    ))
    extraction = SCRAPFLY.extract(ExtractionConfig(
        body=result.content,
        content_type="text/html",
        extraction_prompt=(
            "Extract the following fields: company name, contact email, "
            "contact phone, location (city and country), and a short business description."
        ),
    ))
    return extraction.result

fields = ai_extract_lead("https://web-scraping.dev/products")
print(fields)
```



For paginated directories and scheduled re-crawls, the [Crawler API](https://scrapfly.io/products/crawler-api) orchestrates the URL queue and handles pagination automatically. For a comparison of when to use each tool, see [Scraper API vs Crawler API](https://scrapfly.io/blog/posts/scraper-api-vs-crawler-api).

The decision is straightforward: plain requests for unprotected pages and hidden JSON endpoints; Web Scraping API with ASP for blocked, JS-rendered, or rate-limited sources; AI Extraction when you are reconciling many differently structured pages; Crawler API to orchestrate paginated directories at scale.

With reliable extraction in place, the next layer is finding and verifying the contact details that make a record actionable.



## How Do You Find and Verify Contact Emails and Phone Numbers?

Pull contact details from the company's own pages first, fall back to email-pattern inference, then verify every address before it enters the database. High bounce rates are the fastest way to get an outreach domain flagged as spam.

The most reliable email sources are the company's own `/contact`, `/about`, and `/team` pages. These pages expose `mailto:` links and common address patterns like `info@`, `hello@`, `sales@`, and `first.last@domain.com`. The footer often holds a direct phone number and social links.

```python
import re
import asyncio
import httpx

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}")

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

async def extract_emails(url: str) -> list:
    async with httpx.AsyncClient(headers=HEADERS, follow_redirects=True, timeout=15) as client:
        resp = await client.get(url)
    # mailto: links first (highest precision)
    mailto = re.findall(r'mailto:([^"\'> ]+)', resp.text)
    # regex pass over full page text
    plain = EMAIL_RE.findall(resp.text)
    # strip mailto query strings, then surrounding slashes, whitespace, and punctuation
    cleaned = {e.split("?")[0].strip("\\/ \t.,;") for e in mailto + plain}
    emails = {e.lower() for e in cleaned if "@" in e}
    # remove image and asset false positives
    return [e for e in emails if not e.endswith((".png", ".jpg", ".gif", ".svg", ".css"))]

results = asyncio.run(extract_emails("https://scrapfly.io/contact"))
print(results)
```



When the company page does not expose a direct email, infer the pattern from a known name and the company domain. Most B2B companies use `firstname@domain.com`, `firstname.lastname@domain.com`, or `firstlast@domain.com`. You can verify which pattern is live without sending a real message.
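
A minimal sketch of the pattern-generation step, assuming ASCII names and the three patterns listed above. `candidate_emails` is a hypothetical helper. It only produces candidates, and each one still has to pass verification before it enters the database.

```python
import re


def candidate_emails(full_name: str, domain: str) -> list[str]:
    """Build the common address patterns for a person at a company domain."""
    # Keep alphabetic name parts only; accents and apostrophes are dropped here,
    # which is a simplification real names will sometimes break
    parts = re.findall(r"[a-z]+", full_name.lower())
    domain = domain.lower().strip()
    if not parts or not domain:
        return []
    first, last = parts[0], parts[-1]
    candidates = [f"{first}@{domain}"]
    if len(parts) > 1:
        candidates += [f"{first}.{last}@{domain}", f"{first}{last}@{domain}"]
    return candidates


print(candidate_emails("Jane Smith", "example-consultancy.com"))
```

If the company site exposes even one personal address, compare it against these candidates first. A single known address usually reveals the pattern for the whole domain.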

Verification prevents bounces. The self-hosted path runs an MX record lookup to confirm the domain accepts mail, then an SMTP RCPT check to confirm the mailbox exists without delivering anything. Paid verifiers like NeverBounce and ZeroBounce handle this at scale and flag catch-all domains, which accept any address and cannot be verified individually.
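
The SMTP half of the self-hosted path might look like the sketch below, using only the standard-library `smtplib`. It assumes the MX host has already been resolved (for example with a DNS library such as `dnspython`), and the `helo_name` and `mail_from` defaults are placeholders to replace with a domain you control. `rcpt_check` and `classify_rcpt_code` are hypothetical helper names.

```python
import smtplib


def classify_rcpt_code(code: int) -> str:
    """Map an SMTP RCPT TO reply code to a coarse verification status."""
    if code in (250, 251):
        return "accepted"
    if code in (550, 551, 553):
        return "rejected"
    # Greylisting (4xx), policy blocks, and anything else stay undecided
    return "unknown"


def rcpt_check(
    mx_host: str,
    address: str,
    helo_name: str = "verifier.example.com",   # placeholder, use your own host
    mail_from: str = "probe@example.com",      # placeholder, use your own sender
    timeout: float = 10.0,
) -> str:
    """Open an SMTP session, issue RCPT TO, and quit before any DATA is sent."""
    try:
        with smtplib.SMTP(mx_host, 25, timeout=timeout) as smtp:
            smtp.helo(helo_name)
            smtp.mail(mail_from)
            code, _ = smtp.rcpt(address)
    except (smtplib.SMTPException, OSError):
        return "unknown"
    return classify_rcpt_code(code)


print(classify_rcpt_code(250), classify_rcpt_code(550), classify_rcpt_code(451))
```

An `accepted` result on a catch-all domain says nothing about the individual mailbox, which is why the catch-all flag from a paid verifier is still useful. Many networks also block outbound port 25, so expect `unknown` from typical cloud hosts.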

Phone numbers exposed by Google Maps and directories arrive in inconsistent formats. Normalize them to E.164 format (`+1XXXXXXXXXX`) for storage so CRM deduplication works correctly across sources.
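
As a small illustration, the helper below normalizes US and Canadian numbers to E.164 with the standard library only. `to_e164_nanp` is a hypothetical name, and the sketch assumes every input is a North American number. For international data, a dedicated library such as `phonenumbers` is the safer choice.

```python
import re
from typing import Optional


def to_e164_nanp(raw: str) -> Optional[str]:
    """Normalize a US/Canada number to E.164, or return None if it does not fit."""
    digits = re.sub(r"\D", "", raw or "")
    # Drop a leading country code so both 10- and 11-digit inputs converge
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"+1{digits}"


for raw in ["(512) 555-0142", "+1 512.555.0142", "512-555-0142 ", "555-0142"]:
    print(repr(raw), "->", to_e164_nanp(raw))
```

Returning `None` for anything that does not resolve to ten digits keeps malformed values out of the dedup key instead of storing a guess.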

Verified contacts are only useful if they resolve to a single canonical company record. That is the job the next section covers.



## How Do You Deduplicate and Resolve the Same Company Across Sources?

Deduplicate by matching on the most reliable key you have: the normalized domain. Fall back to fuzzy name matching, then composite matching. The same company appears as "Acme Corp," "Acme Corporation," and "acme-corp.com" across sources, and treating them as three records breaks every downstream process.

This is the step almost every build skips. Pulling from six source types guarantees duplicates and conflicting field values. There is no shared company ID across public sources, so you build your own matching logic.

The matching strategies, in confidence order:

**Domain matching (highest confidence).** Strip the protocol, remove `www.`, lowercase everything, and match exactly. Two records with the same normalized domain are the same company.

**Name matching (medium confidence).** Strip legal suffixes (Inc, LLC, Ltd, Corp, GmbH, Co, PLC), normalize whitespace, lowercase, and fuzzy-match against a threshold. A similarity score above 0.85 is a confident match for most company names.

**Composite matching (highest precision).** Combine normalized domain, normalized name, and city. Require two of three to match for a merge. This catches cases where the domain differs (a subsidiary uses a different domain) but name and location align.

```python
import re
from urllib.parse import urlparse
from difflib import SequenceMatcher

# Short and long legal forms, so "Acme Corp" and "Acme Corporation" normalize alike
LEGAL_SUFFIXES = re.compile(
    r"\b(incorporated|corporation|company|limited|"
    r"inc|llc|ltd|corp|co|gmbh|plc|sas|bv|ag|sa)\b\.?",
    re.IGNORECASE,
)

def normalize_domain(raw: str) -> str:
    """Strip protocol, path, and www., and lowercase the host."""
    if not raw:
        return ""
    parsed = urlparse(raw if "//" in raw else f"https://{raw}")
    # Lowercase before removing "www." so "WWW.Acme.com" is handled too
    host = (parsed.netloc or parsed.path).lower()
    return re.sub(r"^www\.", "", host).strip("/")

def normalize_name(name: str) -> str:
    """Drop legal suffixes and punctuation, lowercase, collapse whitespace."""
    name = LEGAL_SUFFIXES.sub("", name or "")
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def name_similarity(a: str, b: str) -> float:
    na, nb = normalize_name(a), normalize_name(b)
    # Two empty names would otherwise score 1.0 and merge unrelated records
    if not na or not nb:
        return 0.0
    return SequenceMatcher(None, na, nb).ratio()

def is_duplicate(a: dict, b: dict) -> bool:
    # 1. Exact match on normalized domain (highest confidence)
    domain_a = normalize_domain(a.get("domain", ""))
    domain_b = normalize_domain(b.get("domain", ""))
    if domain_a and domain_b and domain_a == domain_b:
        return True
    # 2. Fall back to fuzzy name matching
    score = name_similarity(a.get("company_name", ""), b.get("company_name", ""))
    return score > 0.85


if __name__ == "__main__":
    pairs = [
        (
            {"domain": "https://www.acme.com", "company_name": "Acme Corp"},
            {"domain": "acme.com",             "company_name": "Acme Corporation"},
        ),
        (
            {"domain": "alpha.io",  "company_name": "Alpha Inc"},
            {"domain": "beta.io",   "company_name": "Beta LLC"},
        ),
        (
            {"domain": "",          "company_name": "Globex Corporation"},
            {"domain": "",          "company_name": "Globex Corp"},
        ),
    ]
    for a, b in pairs:
        dup = is_duplicate(a, b)
        sim = name_similarity(a["company_name"], b["company_name"])
        print(f"{a['company_name']!r:30} vs {b['company_name']!r:30}  duplicate={dup}  name_sim={sim:.2f}")
```
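
The two-of-three composite rule described above is not covered by the domain and name checks alone. One way to sketch it is shown below. The `city` key is an assumption made for this example (the schema earlier stores a free-text `location`), and `composite_match` is a hypothetical helper rather than part of any library.

```python
import re
from difflib import SequenceMatcher


def _clean(text) -> str:
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return re.sub(r"\s+", " ", (text or "").lower()).strip()


def _domain(raw) -> str:
    """Reduce a URL or bare host to its host without protocol or www."""
    host = re.sub(r"^[a-z]+://", "", _clean(raw))
    return re.sub(r"^www\.", "", host).split("/")[0]


def composite_match(a: dict, b: dict, name_threshold: float = 0.85) -> bool:
    """Merge only when at least two of domain, name, and city agree."""
    votes = 0
    dom_a, dom_b = _domain(a.get("domain")), _domain(b.get("domain"))
    if dom_a and dom_a == dom_b:
        votes += 1
    name_a, name_b = _clean(a.get("company_name")), _clean(b.get("company_name"))
    if name_a and name_b and SequenceMatcher(None, name_a, name_b).ratio() > name_threshold:
        votes += 1
    city_a, city_b = _clean(a.get("city")), _clean(b.get("city"))
    if city_a and city_a == city_b:
        votes += 1
    return votes >= 2


parent = {"domain": "acme.com", "company_name": "Acme Robotics", "city": "Austin"}
subsidiary = {"domain": "acme-robotics.io", "company_name": "Acme Robotics", "city": "austin"}
namesake = {"domain": "acme.co.uk", "company_name": "Acme Robotics", "city": "Leeds"}
print(composite_match(parent, subsidiary), composite_match(parent, namesake))
```

Requiring two signals keeps a shared name alone from merging unrelated companies in different cities, while still catching the subsidiary case where only the domain differs.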



After matching, decide which field value wins when sources conflict. A practical rule: prefer the company's own website for description and contact details, the government registry for the legal entity name, and the directory for phone number and address. Track provenance by storing a `source_url` and `retrieved_at` timestamp per attribute so you know where each value came from and how old it is.
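
A minimal sketch of that precedence rule with per-attribute provenance follows. The `FIELD_PRIORITY` table, the source labels (`website`, `registry`, `directory`), the field names, and the nested attribute layout are all illustrative assumptions, not a fixed schema.

```python
from datetime import datetime, timezone

# Most trusted source first for each field; labels are illustrative
FIELD_PRIORITY = {
    "description":   ["website", "directory", "registry"],
    "contact_email": ["website", "directory", "registry"],
    "legal_name":    ["registry", "website", "directory"],
    "contact_phone": ["directory", "website", "registry"],
    "address":       ["directory", "website", "registry"],
}


def merge_attribute(record: dict, field: str, value, source: str, source_url: str) -> bool:
    """Store value under record[field] unless a more trusted source already set it.

    Returns True when the record changed.
    """
    if value in (None, "", []):
        return False
    order = FIELD_PRIORITY.get(field, [])

    def rank(label: str) -> int:
        # Unknown sources rank below every listed source
        return order.index(label) if label in order else len(order)

    current = record.get(field)
    if current is not None and rank(current["source"]) < rank(source):
        return False  # existing value comes from a more trusted source
    record[field] = {
        "value": value,
        "source": source,
        "source_url": source_url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
    return True


record = {}
merge_attribute(record, "contact_phone", "+15125550142", "website", "https://example.com/contact")
merge_attribute(record, "contact_phone", "+15125550199", "directory", "https://directory.example/listing/1")
merge_attribute(record, "contact_phone", "+15125550000", "registry", "https://registry.example/entity/1")
print(record["contact_phone"]["value"], record["contact_phone"]["source"])
```

A value from an equally ranked source overwrites the stored one, so a fresh crawl of the same source updates the timestamp along with the value.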



## How Do You Enrich Records with Firmographics and Buying Signals?

Enrichment starts with a seed and layers on firmographics, technology stack, and buying signals by scraping the company's own surfaces. The goal is a scored record that can be filtered and prioritized before outreach.

The seed-to-enriched flow has four stages. First, fetch the company site and extract the base firmographic fields using AI Extraction. Second, detect the technology stack from HTML signatures. Third, count open job postings as a growth signal. Fourth, compute a quality score that weights each field category.

Technology-stack detection scans the page HTML for known markers: HubSpot tracking scripts, Salesforce widgets, Shopify checkout URLs, Segment analytics. A simple regex pass against known patterns is enough for scoring purposes.

Buying signals make a record actionable. A company that posted five new engineering roles in the past 30 days is expanding. A company that posted its first sales hire is entering a new market. A spike in Glassdoor reviews often precedes a technology procurement cycle.
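
To turn the hiring signal into a number for the `open_positions` field, a sketch along these lines may help. It assumes each scraped posting has already been reduced to a dict with a `title` and a `posted_on` date. Both that shape and the `count_recent_postings` helper are hypothetical.

```python
from datetime import date, timedelta


def count_recent_postings(
    postings: list[dict],
    today: date,
    window_days: int = 30,
    keyword: str = "",
) -> int:
    """Count postings published within the window whose title contains keyword."""
    cutoff = today - timedelta(days=window_days)
    return sum(
        1
        for p in postings
        if cutoff <= p["posted_on"] <= today
        and keyword.lower() in p["title"].lower()
    )


postings = [
    {"title": "Senior Backend Engineer", "posted_on": date(2026, 6, 1)},
    {"title": "Data Engineer",           "posted_on": date(2026, 5, 20)},
    {"title": "Account Executive",       "posted_on": date(2026, 6, 3)},
    {"title": "Platform Engineer",       "posted_on": date(2026, 3, 1)},
]
today = date(2026, 6, 9)
print(count_recent_postings(postings, today, keyword="engineer"))
print(count_recent_postings(postings, today))
```

Passing `today` explicitly makes the count reproducible when you re-score historical snapshots.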

```python
import asyncio
import json
import sqlite3
from datetime import datetime, timezone

from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig

SCRAPFLY = ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY")

DB_PATH = "leads.db"
QUALITY_THRESHOLD = 60.0

# In production: populate from a Google Maps or directory scrape
SEED_URLS = [
    "https://web-scraping.dev/products",   # neutral demo target
]


def _init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS leads (
            domain TEXT PRIMARY KEY,
            company_name TEXT,
            contact_email TEXT,
            quality_score REAL,
            last_enriched TEXT
        )
    """)
    conn.commit()
    conn.close()


async def ai_extract_lead(html: str) -> dict:
    # Works on HTML that was already fetched, so each page is scraped only once
    extraction = await SCRAPFLY.async_extraction(ExtractionConfig(
        body=html,
        content_type="text/html",
        extraction_prompt=(
            "Extract company contact information and return a JSON object with these fields: "
            "company_name, domain (website domain only, e.g. example.com), "
            "contact_email, contact_phone, contact_name, contact_title, location."
        ),
    ))
    data = extraction.data
    # Guard against a non-dict payload so callers can rely on .get()
    return data if isinstance(data, dict) else {}


def detect_tech_stack(html: str) -> list:
    signatures = {
        "HubSpot": "js.hs-scripts.com",
        "Salesforce": "salesforce.com/servlet",
        "Segment": "cdn.segment.com",
        "Shopify": "cdn.shopify.com",
        "React": "react.development.js",
    }
    return [name for name, marker in signatures.items() if marker in html]


def score_record(record: dict) -> float:
    # open_positions is not populated in this demo, so the maximum here is 80
    score = 0.0
    if record.get("company_name"): score += 20
    if record.get("contact_email"): score += 25
    if record.get("contact_name"): score += 15
    if record.get("tech_stack"): score += 20
    if record.get("open_positions", 0) > 0: score += 20
    return score


def upsert_lead(record: dict):
    # Bind only the stored columns; .get() turns a missing optional field into NULL
    params = {
        "domain": record["domain"],
        "company_name": record.get("company_name"),
        "contact_email": record.get("contact_email"),
        "quality_score": record.get("quality_score") or 0.0,
    }
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO leads (domain, company_name, contact_email, quality_score, last_enriched)
        VALUES (:domain, :company_name, :contact_email, :quality_score, datetime('now'))
        ON CONFLICT(domain) DO UPDATE SET
            company_name = excluded.company_name,
            contact_email = excluded.contact_email,
            quality_score = excluded.quality_score,
            last_enriched = excluded.last_enriched
    """, params)
    conn.commit()
    conn.close()


async def run_pipeline(seed_urls: list) -> list:
    records = []
    seen_domains = set()

    for url in seed_urls:
        # 1. Fetch the page once, then extract fields from that HTML
        result = await SCRAPFLY.async_scrape(ScrapeConfig(
            url, asp=True, render_js=True,
        ))
        fields = await ai_extract_lead(result.content)
        domain = (fields.get("domain") or "").lower().strip()
        if not domain:
            continue

        # 2. Dedup check on the lowercased domain
        if domain in seen_domains:
            continue
        seen_domains.add(domain)
        fields["domain"] = domain

        # 3. Enrich and score
        fields["tech_stack"] = detect_tech_stack(result.content)
        fields["quality_score"] = score_record(fields)
        fields["last_enriched"] = datetime.now(timezone.utc).date().isoformat()
        fields["source_urls"] = [url]

        # 4. Store
        upsert_lead(fields)
        records.append(fields)

    return records


if __name__ == "__main__":
    _init_db()
    results = asyncio.run(run_pipeline(SEED_URLS))
    print(f"Pipeline complete: {len(results)} records written")
    for r in results:
        print(json.dumps(r, indent=2))
```



Third-party enrichment APIs like Hunter.io for emails or Clearbit for firmographics are optional accelerators. They are useful when a source has no contact page and pattern inference produces no confident match. They are not a replacement for owning the extraction pipeline.

See the [AI Extraction API](https://scrapfly.io/products/extraction-api) for the field-extraction step in the enrichment flow.

Enriched records are only valuable if they stay current. The next section covers storage, delivery, and the maintenance loop that keeps the database fresh.



## How Do You Store, Refresh, and Deliver Leads to Your CRM?

Store the normalized records in a simple structured store, push qualified leads to your CRM using domain and email as the merge key, then keep the database current with scheduled re-crawls. A one-time pull is a depreciating asset.

A CSV or SQLite table keyed on normalized domain is enough for a starter build. SQLite handles hundreds of thousands of records without configuration, supports upserts natively, and lives in a single file. Graduating to a real database is a separate decision that belongs in a later sprint, not the initial build.

CRM delivery maps your schema fields to the CRM's contact and company objects. For HubSpot, Salesforce, or Airtable, the pattern is the same: map fields, set a quality threshold, and upsert on (domain, email) so re-runs update existing records instead of creating duplicates.

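```python
import sqlite3
import httpx

DB_PATH = "leads.db"
QUALITY_THRESHOLD = 60.0
HUBSPOT_TOKEN = "YOUR_HUBSPOT_TOKEN"

def upsert_lead(record: dict):
    # Bind only the stored columns; .get() turns a missing optional field into NULL
    params = {
        "domain": record["domain"],
        "company_name": record.get("company_name"),
        "contact_email": record.get("contact_email"),
        "quality_score": record.get("quality_score") or 0.0,
    }
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO leads (domain, company_name, contact_email, quality_score, last_enriched)
        VALUES (:domain, :company_name, :contact_email, :quality_score, datetime('now'))
        ON CONFLICT(domain) DO UPDATE SET
            company_name = excluded.company_name,
            contact_email = excluded.contact_email,
            quality_score = excluded.quality_score,
            last_enriched = excluded.last_enriched
    """, params)
    conn.commit()
    conn.close()

    # Only qualified leads that have an email are pushed to the CRM
    if params["quality_score"] >= QUALITY_THRESHOLD and params["contact_email"]:
        push_to_hubspot(record)

def push_to_hubspot(record: dict):
    payload = {
        "properties": {
            "company": record.get("company_name", ""),
            "email": record.get("contact_email", ""),
            "website": f"https://{record.get('domain', '')}",
        }
    }
    resp = httpx.post(
        "https://api.hubapi.com/crm/v3/objects/contacts",
        json=payload,
        headers={"Authorization": f"Bearer {HUBSPOT_TOKEN}"},
        timeout=10,
    )
    # This call only creates contacts. A conflict status (409) signals that the
    # contact already exists; it is treated as a no-op here, and a full upsert
    # would follow up with an update. Any other error status is raised.
    if resp.status_code != 409:
        resp.raise_for_status()
```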



Contact data decays fast. People change jobs, companies relocate, phone numbers are reassigned. A lead database from six months ago has already lost meaningful accuracy. B2B contact data is commonly cited as decaying at 25 to 30 percent per year, which means between one in four and one in three records goes stale within 12 months.

Keep the database fresh with four simple rules:

- Add a `last_enriched` timestamp and let `quality_score` drop as records age.
- Re-crawl high-value accounts quarterly; re-crawl job boards more often.
- Watch bounce rates: above 5 percent on a segment means it's time to re-crawl.
- On re-crawl, update only changed fields instead of overwriting the whole record.
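
The first two rules can be sketched as a pair of small helpers. This is a minimal illustration, not a tuned model: `decayed_score`, `needs_recrawl`, the 30 percent `annual_decay` default, and the 90-day window are all assumptions chosen to mirror the numbers quoted above.

```python
from datetime import date, datetime


def decayed_score(base_score: float, last_enriched: str, today: date,
                  annual_decay: float = 0.30) -> float:
    """Discount a quality score by the age of the record.

    annual_decay is an assumed tuning knob, not a measured constant.
    """
    enriched_on = datetime.fromisoformat(last_enriched).date()
    age_days = max((today - enriched_on).days, 0)
    # Compound decay: after one year the score keeps (1 - annual_decay) of its value
    factor = (1.0 - annual_decay) ** (age_days / 365.0)
    return round(base_score * factor, 1)


def needs_recrawl(last_enriched: str, today: date, max_age_days: int = 90) -> bool:
    """Flag records older than the re-crawl window (90 days is roughly quarterly)."""
    enriched_on = datetime.fromisoformat(last_enriched).date()
    return (today - enriched_on).days > max_age_days
```

`datetime.fromisoformat` accepts both a plain date such as `2026-06-09` and the `YYYY-MM-DD HH:MM:SS` strings that SQLite's `datetime('now')` writes, so the helpers work on either form of `last_enriched`.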

The [Crawler API](https://scrapfly.io/products/crawler-api) handles the scoped re-crawl across a list of company URLs. For delta detection patterns, see [Crawler APIs for monitoring website changes](https://scrapfly.io/blog/posts/crawler-apis-for-monitoring-website-changes-maintaining-ai-chatbots). The re-crawl cadence itself runs from your own scheduler, whether that is a cron job, a GitHub Action, or a workflow runner.
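
Whatever runs the re-crawl, the "update only changed fields" rule comes down to a field-level merge. The sketch below is one way to do it; `merge_changed_fields` is a hypothetical helper, and the choice to ignore empty values from the fresh crawl is an assumption that protects stored data from a failed extraction.

```python
def merge_changed_fields(existing: dict, fresh: dict) -> tuple[dict, list[str]]:
    """Return the merged record and the names of the fields that changed.

    Empty or missing values in the fresh crawl never overwrite stored data.
    """
    merged = dict(existing)
    changed = []
    for field, new_value in fresh.items():
        if new_value in (None, "", []):
            continue  # a failed extraction should not erase a known value
        if existing.get(field) != new_value:
            merged[field] = new_value
            changed.append(field)
    return merged, changed


stored = {
    "domain": "example-consultancy.com",
    "contact_title": "Founder",
    "contact_email": "jane@example-consultancy.com",
}
fresh = {
    "domain": "example-consultancy.com",
    "contact_title": "CEO",
    "contact_email": "",
}
merged, changed = merge_changed_fields(stored, fresh)
print(changed)  # ['contact_title']
```

The list of changed field names doubles as a cheap audit log: an empty list means the re-crawl confirmed the record, and only the timestamp needs to move.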



## Worked Example: Building a Niche B2B Lead List End to End

A real build runs one niche query through the whole pipeline: discover companies from a directory, visit each company site for contacts and signals, dedup against what you already have, enrich and score, and write the normalized record.

The niche for this example is local software consultancies in a single city. The discovery source is Google Maps (see the [Google Maps scraping guide](https://scrapfly.io/blog/posts/how-to-scrape-google-maps) for the full extraction). The pipeline visits each company's own site for contact details and signals, then writes a scored record.

The output record for each company looks like this:

```python
{
    "domain": "example-consultancy.com",
    "company_name": "Example Consultancy LLC",
    "location": "Austin, TX",
    "contact_name": "Jane Smith",
    "contact_title": "Founder",
    "contact_email": "jane@example-consultancy.com",
    "employee_count": 12,
    "tech_stack": ["HubSpot", "React"],
    "open_positions": 3,
    "quality_score": 100.0,
    "source_urls": [
        "https://maps.google.com/?cid=...",
        "https://example-consultancy.com/contact"
    ],
    "last_enriched": "2026-06-09",
}
```



The orchestration loop below ties the stages together. In production, replace `SEED_URLS` with a list extracted from the Google Maps scraper for your target search query.

```python
import asyncio
import sqlite3
from datetime import date

from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig

SCRAPFLY = ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY")

DB_PATH = "leads.db"
QUALITY_THRESHOLD = 60.0

# In production: populate from a Google Maps or directory scrape
SEED_URLS = [
    "https://web-scraping.dev/products",   # neutral demo target
]


def _init_db():
    """Create the leads table if it does not exist yet."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS leads (
            domain TEXT PRIMARY KEY,
            company_name TEXT,
            contact_email TEXT,
            quality_score REAL,
            last_enriched TEXT
        )
    """)
    conn.commit()
    conn.close()


def ai_extract_lead(html: str) -> dict:
    """Extract lead fields from HTML that has already been fetched."""
    extraction = SCRAPFLY.extract(ExtractionConfig(
        body=html,
        content_type="text/html",
        extraction_prompt=(
            "Extract company contact information and return a JSON object with these fields: "
            "company_name, domain (website domain only, e.g. example.com), "
            "contact_email, contact_phone, contact_name, contact_title, location."
        ),
    ))
    return extraction.data


def detect_tech_stack(html: str) -> list:
    """Match known script URLs in the page source to infer the tech stack."""
    signatures = {
        "HubSpot": "js.hs-scripts.com",
        "Salesforce": "salesforce.com/servlet",
        "Segment": "cdn.segment.com",
        "Shopify": "cdn.shopify.com",
        "React": "react.development.js",
    }
    return [name for name, marker in signatures.items() if marker in html]


def score_record(record: dict) -> float:
    """Score completeness and buying signals on a 0 to 100 scale."""
    score = 0.0
    if record.get("company_name"): score += 20
    if record.get("contact_email"): score += 25
    if record.get("contact_name"): score += 15
    if record.get("tech_stack"): score += 20
    if record.get("open_positions", 0) > 0: score += 20
    return score


def upsert_lead(record: dict):
    # Bind only the stored columns; fields the extraction left out become NULL
    params = {
        key: record.get(key)
        for key in ("domain", "company_name", "contact_email", "quality_score")
    }
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO leads (domain, company_name, contact_email, quality_score, last_enriched)
        VALUES (:domain, :company_name, :contact_email, :quality_score, datetime('now'))
        ON CONFLICT(domain) DO UPDATE SET
            company_name = excluded.company_name,
            contact_email = excluded.contact_email,
            quality_score = excluded.quality_score,
            last_enriched = excluded.last_enriched
    """, params)
    conn.commit()
    conn.close()


async def run_pipeline(seed_urls: list) -> list:
    records = []
    seen_domains = set()

    for url in seed_urls:
        # 1. Fetch the page once, then extract fields from the same HTML
        result = await SCRAPFLY.async_scrape(ScrapeConfig(
            url, asp=True, render_js=True,
        ))
        fields = ai_extract_lead(result.content)
        if not fields.get("domain"):
            continue

        # 2. Dedup check
        if fields["domain"] in seen_domains:
            continue
        seen_domains.add(fields["domain"])

        # 3. Enrich and score
        fields["tech_stack"] = detect_tech_stack(result.content)
        fields["quality_score"] = score_record(fields)
        fields["last_enriched"] = date.today().isoformat()
        fields["source_urls"] = [url]

        # 4. Store
        upsert_lead(fields)
        records.append(fields)

    return records


if __name__ == "__main__":
    _init_db()
    results = asyncio.run(run_pipeline(SEED_URLS))
    print(f"Pipeline complete: {len(results)} records written")
```



This pattern scales horizontally. Add more seed URLs from more sources and run the pipeline on a schedule, and the database grows and stays current without manual work. The key insight from r/b2b\_sales: targeted lists built from public sources consistently outperform mass-scraped generic data for conversion, even when they are smaller.
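
Once the pipeline has run, the records worth delivering can be pulled back out of the `leads` table. The sketch below assumes the schema from the worked example; `qualified_leads` and `to_csv` are hypothetical helper names, and CSV is just one convenient hand-off format for a CRM import.

```python
import csv
import io
import sqlite3

QUALITY_THRESHOLD = 60.0
COLUMNS = ["domain", "company_name", "contact_email", "quality_score"]


def qualified_leads(conn: sqlite3.Connection,
                    threshold: float = QUALITY_THRESHOLD) -> list[dict]:
    """Return leads at or above the threshold that have an email, best first."""
    rows = conn.execute(
        """
        SELECT domain, company_name, contact_email, quality_score
        FROM leads
        WHERE quality_score >= ?
          AND contact_email IS NOT NULL
          AND contact_email != ''
        ORDER BY quality_score DESC
        """,
        (threshold,),
    ).fetchall()
    return [dict(zip(COLUMNS, row)) for row in rows]


def to_csv(leads: list[dict]) -> str:
    """Serialize leads to CSV text for a CRM import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(leads)
    return buffer.getvalue()
```

Calling `to_csv(qualified_leads(sqlite3.connect("leads.db")))` produces the export. Filtering in SQL rather than in Python keeps the threshold logic next to the data as the table grows.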

Extraction and enrichment are technical problems with clear solutions. Compliance is an operational discipline that runs in parallel.



## How Do You Stay Compliant When Scraping B2B Leads?

Compliance runs in parallel with every scrape and every outreach; it is not a one-time legal review. The short rules:

- **Public data only.** Only scrape what an unauthenticated browser can see. Never access login-walled databases.
- **GDPR.** EU work emails are personal data. Include your organization name, purpose, and a clear opt-out in every cold message.
- **CCPA.** California residents can request to know what you hold and ask for deletion even in B2B contexts.
- **Opt-outs.** Keep a blocklist, check it before every send, and honor removal requests within 30 days.
- **ToS and robots.txt.** Respect `robots.txt`. Scraping a platform that prohibits it carries contract risk even when the data is public. hiQ Labs v. LinkedIn (2022) is the cautionary example.

The lowest-risk sources are company websites, government registries, and local business directories.
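
Two of the rules above are easy to enforce in code: the `robots.txt` check before a fetch and the blocklist check before a send. The sketch below uses the standard library's `urllib.robotparser`; the `lead-pipeline-bot` user agent and both function names are illustrative assumptions, and the `robots.txt` body is passed in as text so the check works on a file that has already been downloaded.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "lead-pipeline-bot"  # hypothetical crawler name


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def filter_opted_out(recipients: list[str], blocklist: set[str]) -> list[str]:
    """Drop any address on the opt-out blocklist (case-insensitive)."""
    blocked = {email.strip().lower() for email in blocklist}
    return [r for r in recipients if r.strip().lower() not in blocked]


robots = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(robots, "https://example.com/team"))           # True
print(allowed_by_robots(robots, "https://example.com/private/staff"))  # False
print(filter_opted_out(["Jane@Example.com", "bob@example.com"], {"jane@example.com"}))
# ['bob@example.com']
```

Running both checks inside the pipeline, rather than as a manual review step, means a new opt-out or a changed `robots.txt` takes effect on the next run.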



## FAQ

**Is it legal to scrape public data for lead generation?** Scraping publicly available business data is generally low-risk in the US. In hiQ Labs v. LinkedIn (Ninth Circuit, 2022), the court found that scraping public pages likely does not violate the Computer Fraud and Abuse Act. That ruling is not a blanket green light: hiQ later lost on breach-of-contract grounds, Terms of Service violations still carry legal risk, and GDPR treats EU B2B contacts as personal data regardless of how the data was collected.







**Can I use AI or ChatGPT to build a lead database?** An LLM can write the scraper and normalize extracted fields. It cannot keep dozens of source scrapers running past anti-bot systems or surviving layout drift. The reliability layer (managed unblocking, AI Extraction, and scheduled re-crawls) is what makes an AI-written pipeline hold up in production.







**Can I build a B2B lead database for free?** Plain requests against unprotected sources cost nothing beyond compute time. Managed unblocking and AI Extraction add cost but earn it back quickly once you scale across blocked, JS-rendered, and heterogeneous sources where unmanaged scrapers fail consistently.







**How is this different from buying a list from ZoomInfo or Apollo?** Buying gives instant broad coverage. Building lets you reach niche ICPs the aggregators miss, control freshness, add custom fields, and pay infrastructure cost instead of per-lead cost. Many teams do both.







**What is the best source for B2B contact emails?** A company's own `/contact`, `/about`, and `/team` pages are the most authoritative source. Verify every address before outreach to keep bounce rates low. Purchased email lists are often stale and carry deliverability risk.









## Conclusion

Building a B2B lead database from public sources is the way to reach the niche accounts rented databases miss. The hard part is never writing the scraper. It is running many of them reliably, resolving the same company across sources, and keeping the records current as people change jobs and companies move.

Start with hidden APIs and plain requests for the easy sources. Escalate to the [Web Scraping API](https://scrapfly.io/products/web-scraping-api) with ASP for sources that block cloud IPs or render with JavaScript. Use the [AI Extraction API](https://scrapfly.io/products/extraction-api) to normalize heterogeneous HTML across sources without maintaining per-site selectors.

Scrapfly's Web Scraping API with ASP, AI Extraction API, and [Crawler API](https://scrapfly.io/products/crawler-api) are the production-grade backbone for the sources that block cloud IPs and the re-crawl layer that keeps the database fresh.



**Legal Disclaimer and Precautions**

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect:

- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens protected by GDPR.
- Do not repurpose *entire* public datasets which can be illegal in some countries.

Scrapfly does not offer legal advice but these are good general rules to follow. For more you should consult a lawyer.

 
