# 10 Best Public Data Sources for Lead Generation in 2026

 by [Mayada Shaaban](https://scrapfly.io/blog/author/mayada-shaaban-90143e67) Jul 24, 2026 19 min read [\#api](https://scrapfly.io/blog/tag/api) [\#data-parsing](https://scrapfly.io/blog/tag/data-parsing) 


 

 
Search "public data sources for lead generation" and most results sell you a subscription. The top hit is a two-year-old Reddit thread whose top reply is "just buy the database, it's $99." Yet plenty of lead-grade data is public.

The catch is which sources give contactable records versus market stats, and how to get the data out. Below are 10 public sources ranked by lead-gen value, each with its fields, access method, freshness, and an honest caveat.

Related: [Web Scraping Emails using Python](https://scrapfly.io/blog/posts/how-to-scrape-emails-using-python), a tutorial on crawling pages and extracting email addresses with Python, plus the common challenges along the way.


## Key Takeaways

- **Registries and SEC EDGAR rank highest** for B2B leads: identity, execs, financials.
- **Public sources give firmographics, not contacts**; add emails and phones separately.
- **Access is the real difference**: free APIs, bulk downloads, or scrape-only.
- **Local directories and licensing boards** win the niche and local SMB lists.
- **Census and BLS size markets**; registry and directory sources produce the lead records.



## Which Public Data Source Is Best for Lead Generation?

For B2B lead building, official business registries are the best starting point, with SEC EDGAR layered on top for US public companies.

Local directories like Google Maps and Yelp win for local SMB lists. Census and BLS size markets rather than find individual leads.

Match the source to the job you're doing:

- **Company universe:** official business registries.
- **Decision-makers:** LinkedIn public profiles.
- **Local SMBs:** Google Maps, Yelp, and Yellow Pages.
- **Government buyers:** SAM.gov and USASpending.
- **Nonprofits:** IRS 990 filings.
- **Timing signals:** job boards.
- **Funding stage:** Crunchbase.

The table below covers all 10 sources at a glance. The access column is the one most directory pages skip, and it decides how much engineering each source costs you.

| \# | Source | Best for | Key fields you get | Access method | Freshness |
|---|---|---|---|---|---|
| 1 | Official business registries | Company universe | Legal name, registration number, status, address, directors | API + bulk (UK); scrape (US states) | Daily (UK), varies (US) |
| 2 | SEC EDGAR | Public-company financials | Revenue, headcount, executives, events | Free API + bulk | Real-time on filing |
| 3 | LinkedIn public profiles | Decision-maker mapping | Titles, seniority, career history | Scrape only | Continuous |
| 4 | Local directories | Local SMB lists | Name, address, phone, website, reviews | Scrape only | Continuous |
| 5 | SAM.gov + USASpending | Government contractors | Registrations, awards, amounts, agencies | Free API + bulk | Near real-time |
| 6 | IRS 990 filings | Nonprofit prospecting | Revenue, officers, compensation, mission | Free API + bulk | 1-2 year lag |
| 7 | Licensing boards | Regulated professions | Name, license status, address, phone | Scrape only (most) | Varies by state |
| 8 | Job boards | Hiring and growth signals | Titles, locations, tech stack, post dates | Scrape only | Continuous |
| 9 | Crunchbase | Funding-stage targeting | Funding rounds, investors, acquisitions | Paid API; web scrape | Continuous |
| 10 | Census + BLS | Market sizing | Businesses by industry and county, wages | Free API + bulk | Periodic, release lag |

Each entry below explains the fields, access path, and caveat in more depth.


## How Did We Rank These Public Data Sources?

We ranked each source on four things: lead-gen yield, access friction, freshness, and coverage breadth. Yield asks whether it produces records you can act on. Friction covers API, bulk, or scrape access, plus auth and rate limits.

A source with a free API of company records outranks one that buries the same data behind 50 state search forms.

We included only genuinely public or free-to-access sources. That excludes the commercial ZoomInfo and Apollo category by definition, since this list covers what you can get without buying. Those databases get one honest comparison later.

We also left out dataset portals built for data science, such as Kaggle and AWS Open Data. They're useful for training models, but the data shape is wrong for lead generation.

One more note on neutrality: Scrapfly isn't ranked here. It appears only in the extraction section, where the access method matters.


## 1. Official Business Registries: Best for Building Your Company Universe

Official business registries are the best foundation for a B2B company universe, because every registered company has to file with one. Most countries run a mandatory company register.

The UK's Companies House is the model, with more than 5 million companies and daily updates. It also offers a free API and a free bulk data product.

The fields are exactly what firmographic targeting needs. You get legal name, registration number, and status (active or dissolved).

Beyond the basics, you also get registered address, industry code, incorporation date, and named directors or controlling owners. That's enough to build and segment a clean company list.

Two caveats matter. First, registries hold firmographics only, so there are no emails or phone numbers here. Second, 50 Secretary of State portals split US coverage, each with a different search interface.

That fragmentation creates a normalization problem you must solve before the data is usable. The UK API needs a free registered key, while US state portals are mostly scrape-only.
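As a rough sketch of the UK side, the snippet below builds an authenticated search request and flattens one search hit into a lead row. It assumes the Companies House public data API's `/search/companies` endpoint, HTTP Basic auth with the API key as the username and an empty password, and response fields such as `title` and `company_number`. Treat all of those as assumptions to check against the current API documentation. The helper names are ours, not part of any official client.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumed API root for the Companies House public data API.
API_ROOT = "https://api.company-information.service.gov.uk"

def build_search_request(query, api_key, items_per_page=20):
    """Build (but do not send) a company search request."""
    params = urllib.parse.urlencode({"q": query, "items_per_page": items_per_page})
    # Assumption: Basic auth with the key as username and a blank password.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return urllib.request.Request(
        f"{API_ROOT}/search/companies?{params}",
        headers={"Authorization": f"Basic {token}"},
    )

def to_lead_row(item):
    """Flatten one search hit into the firmographic fields used for targeting."""
    return {
        "name": item.get("title"),
        "number": item.get("company_number"),
        "status": item.get("company_status"),
        "incorporated": item.get("date_of_creation"),
        "address": item.get("address_snippet"),
    }

def search_companies(query, api_key):
    """Send the request and return normalized rows (needs network and a real key)."""
    request = build_search_request(query, api_key)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return [to_lead_row(item) for item in json.load(resp).get("items", [])]
```

Calling `search_companies("acme", "YOUR_KEY")` would return a list of rows in one shape, which is the same target shape you would map each US state portal into.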

Registries give you the "who exists" layer, and the next source goes far deeper for US public companies, down to financials and named executives.


## 2. SEC EDGAR: Best for Public-Company Financials and Executives

SEC EDGAR is the best free source for public-company intelligence, giving no-auth access to every US public company filing. The 10-K and 10-Q reports carry revenue, headcount, and risk factors.

Proxy statements list executives and their compensation. The 8-K filings flag leadership changes and acquisitions, which are strong timing signals.

The full-text search API plus bulk archives make this practical at scale, and filings appear effectively in real time on their filing dates. You can query by keyword and form type, then pull the companies that match.

```python
import httpx

SEC_UA = {"User-Agent": "Scrapfly Lead Research research@scrapfly.io"}

def edgar_full_text_search(query, form="10-K", limit=3):
    url = "https://efts.sec.gov/LATEST/search-index"
    params = {"q": query, "forms": form}
    resp = httpx.get(url, params=params, headers=SEC_UA, timeout=30)
    resp.raise_for_status()
    rows = []
    for hit in resp.json()["hits"]["hits"][:limit]:
        src = hit["_source"]
        rows.append({
            "company": src["display_names"][0],
            "form": src["file_type"],
            "filed": src["file_date"],
        })
    return rows

for row in edgar_full_text_search('"artificial intelligence"'):
    print(row)
```


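The script above runs a full-text search filtered to the 10-K form and returns the filing company, document type, and filing date for each hit. The `file_type` value appears to describe the matched document rather than the parent filing, so amendments and exhibits such as `10-K/A` or `EX-99` can show up alongside plain 10-Ks. It uses the public EDGAR full-text endpoint with the descriptive User-Agent the SEC asks automated clients to send.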

```text
{'company': 'Artificial Intelligence Technology Solutions Inc.  (AITX)  (CIK 0001498148)', 'form': 'EX-99', 'filed': '2021-06-01'}
{'company': 'Catalyst Crew Technologies Corp.  (CCTC)  (CIK 0001477960)', 'form': '10-K/A', 'filed': '2026-05-04'}
```


Running it returns live filing records you can act on.
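To join these hits to other sources, it helps to split the `display_names` string into a name, ticker, and CIK. The sketch below assumes the format seen in the sample output above: a name, an optional ticker in parentheses, then `CIK` and a number. `parse_display_name` is a hypothetical helper, not part of any SEC client.

```python
import re

# Assumed pattern, inferred from the sample output:
# "Company Name  (TICKER)  (CIK 0000000000)", where the ticker part is optional.
DISPLAY_NAME = re.compile(
    r"^(?P<name>.*?)\s*(?:\((?P<ticker>[^()]*)\)\s*)?\(CIK (?P<cik>\d+)\)\s*$"
)

def parse_display_name(display_name):
    """Split an EDGAR display name into name, ticker, and CIK."""
    match = DISPLAY_NAME.match(display_name)
    if match is None:
        # Unknown shape: keep the raw string rather than guessing.
        return {"name": display_name.strip(), "ticker": None, "cik": None}
    return {
        "name": match["name"].strip(),
        "ticker": match["ticker"],
        "cik": match["cik"],
    }
```

The CIK is the stable join key here, since company names and tickers change over time.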

The one caveat: EDGAR covers public companies only, so it's no help for SMB prospecting, and the filings are dense. The next source flips the focus from companies to the people who decide.


## 3. LinkedIn Public Profiles: Best for Decision-Maker Mapping

LinkedIn's public layer is the best source for mapping decision-makers to companies. Public profiles and company pages carry job titles, seniority, career history, and approximate headcount.

That data is exactly what you need to find the right contact inside a target account.

In practice this is scrape-only, and the protection is heavy. LinkedIn runs some of the strongest anti-bot defenses of any source on this list.

The legal picture depends on profile visibility settings. We keep the legal detail in the FAQ and the disclaimer rather than ruling on it here.

Treat LinkedIn as an identification layer. Use it to map who holds which role, then add verified contact data from elsewhere.

For the collection mechanics, see [our LinkedIn scraping guide](https://scrapfly.io/blog/posts/how-to-scrape-linkedin). For the official endpoint limits, see [the LinkedIn API guide](https://scrapfly.io/blog/posts/guide-to-linkedin-api-and-alternatives).

LinkedIn maps people at companies of any size, and the next source is where local small businesses show up.


## 4. Local Business Directories: Best for Local SMB Lead Lists

Local directories are the richest public source of small-business records, covering the long tail that commercial databases miss.

Google Maps, Yelp, and Yellow Pages together list business name, category, address, phone, website, opening hours, and review volume. Review volume doubles as a rough size and quality proxy.

Access is scrape-only at scale. The Google Maps API exists, but its pricing makes building large lists from it uneconomical, so directory pages are where the value sits.

The phone and website fields are the real prize, since they're the closest thing to contact data in this whole list.

Watch two things. Listings decay, so closed businesses linger for months. The same business also appears across all three directories, so dedupe before you act.
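Here is a minimal sketch of that dedupe step. It assumes each scraped listing is a dict with hypothetical `name`, `phone`, and `zip` keys, and that numbers are US-style, so the last ten digits identify a phone line. Real pipelines usually add fuzzy name matching on top.

```python
import re

def listing_key(listing):
    """Build a match key: phone digits if present, else normalized name + ZIP."""
    digits = re.sub(r"\D", "", listing.get("phone") or "")
    if len(digits) >= 10:
        # Assumption: US numbers, so drop any leading country code.
        return ("phone", digits[-10:])
    name = re.sub(r"[^a-z0-9]+", " ", (listing.get("name") or "").lower()).strip()
    return ("name_zip", name, (listing.get("zip") or "").strip()[:5])

def dedupe_listings(listings):
    """Keep one record per key, filling missing fields from later duplicates."""
    merged = {}
    for listing in listings:
        key = listing_key(listing)
        if key not in merged:
            merged[key] = dict(listing)
            continue
        for field, value in listing.items():
            # Only fill gaps; the first source seen wins on conflicts.
            if value and not merged[key].get(field):
                merged[key][field] = value
    return list(merged.values())
```

Matching on phone first is a deliberate choice: names vary across directories ("Joe's Pizza" versus "Joes Pizza LLC"), while the phone number tends to stay the same.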

For validating and deduplicating scraped listings, see [How to Ensure Web Scrapped Data Quality](https://scrapfly.io/blog/posts/how-to-ensure-web-scrapped-data-quality).

Local directories cover commercial businesses, and the next source covers a buyer most lists ignore: the government and its contractors.


## 5. SAM.gov and USASpending: Best for Government Contractors

SAM.gov and USASpending are the best public sources for selling into the government market. SAM.gov lists every entity registered to do business with the US federal government, including registrations and certifications.

USASpending records who won what, with contract awards, amounts, and agencies, behind a no-key public API.

The lead-gen angle is twofold. You can sell to government contractors as a B2B-adjacent market. You can also qualify any prospect by the federal contract revenue they've already won.

A company with active awards has budget and a track record you can read for free.

```python
import httpx

def usaspending_top_recipients(naics="541512", limit=3):
    url = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
    payload = {
        "filters": {
            "award_type_codes": ["A", "B", "C", "D"],
            "naics_codes": [naics],
            "time_period": [{"start_date": "2024-10-01", "end_date": "2025-09-30"}],
        },
        "fields": ["Recipient Name", "Award Amount", "Awarding Agency"],
        "sort": "Award Amount",
        "order": "desc",
        "limit": limit,
    }
    resp = httpx.post(url, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["results"]

for r in usaspending_top_recipients():
    print(r["Recipient Name"], "-", r["Award Amount"], "-", r["Awarding Agency"])
```


The script above pulls the top federal contractors in an industry by award amount, using the USASpending award search endpoint with no API key.

```text
SCIENCE APPLICATIONS INTERNATIONAL CORPORATION - 2087211597.56 - Department of State
L3HARRIS TECHNOLOGIES, INC. - 1826088581.86 - Department of Commerce
ORACLE HEALTH GOVERNMENT SERVICES, INC. - 1496628663.21 - Department of Veterans Affairs
```


The response is a ranked list of named contractors and dollar amounts.
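To qualify prospects by federal revenue, you can roll individual awards up per recipient. The sketch below works on the `results` rows returned by the script above and relies only on its `Recipient Name` and `Award Amount` keys; `total_awards_by_recipient` is a hypothetical helper name.

```python
from collections import defaultdict

def total_awards_by_recipient(results):
    """Sum award amounts per recipient from spending_by_award result rows."""
    totals = defaultdict(float)
    for row in results:
        # Normalize casing and whitespace so the same recipient groups together.
        name = (row.get("Recipient Name") or "").strip().upper()
        amount = row.get("Award Amount")
        if name and amount is not None:
            totals[name] += float(amount)
    # Largest federal revenue first.
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
```

Feeding it a larger `limit` from the search call gives a per-company total you can use as a qualification threshold.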

USASpending throttles under heavy burst use, and SAM.gov's search UI is slow, so bulk extracts beat the interface for full pulls. Government buyers are one niche vertical, and the next source opens another: tax-exempt organizations.


## 6. IRS 990 Filings: Best for Nonprofit Prospecting

IRS 990 filings are the best public source for nonprofit prospecting, because every US tax-exempt organization has to file one. A 990 reports revenue, expenses, named officers with their compensation, and the organization's mission.

ProPublica's Nonprofit Explorer makes all of it searchable through a free API.

Nonprofits buy software, services, and insurance like any other organization, and their 990s hand you the budget and the decision-makers for free. That combination of disclosed finances and named officers is rare among public sources.

```python
import httpx

def propublica_nonprofits(query, limit=3):
    url = "https://projects.propublica.org/nonprofits/api/v2/search.json"
    resp = httpx.get(url, params={"q": query}, timeout=30)
    resp.raise_for_status()
    rows = []
    for org in resp.json()["organizations"][:limit]:
        rows.append({"name": org["name"], "ein": org["ein"], "state": org.get("state")})
    return rows

for row in propublica_nonprofits("foundation"):
    print(row)
```


The script above searches the Nonprofit Explorer by name and returns each organization with its EIN and state.

```text
{'name': 'Foundation Foundation', 'ein': 475374786, 'state': 'CA'}
{'name': 'The Foundation Foundation', 'ein': 203487592, 'state': 'AZ'}
{'name': 'Springwell Foundationi', 'ein': 932444083, 'state': 'CA'}
```


It returns matching filers you can pull full financials for by EIN.
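As a sketch of that follow-up step, the helpers below build the per-organization URL and pick the latest financials out of its response. They assume the Nonprofit Explorer v2 organization endpoint and field names such as `filings_with_data`, `tax_prd_yr`, `totrevenue`, and `totfuncexpns`; verify those against ProPublica's API documentation before relying on them.

```python
def organization_url(ein):
    """URL for one organization's detail record (assumed v2 endpoint shape)."""
    return f"https://projects.propublica.org/nonprofits/api/v2/organizations/{ein}.json"

def latest_financials(org_payload):
    """Return revenue and expenses from the most recent filing with data, if any."""
    filings = org_payload.get("filings_with_data") or []
    if not filings:
        return None
    # Assumed field: tax_prd_yr holds the filing's tax year as an integer.
    latest = max(filings, key=lambda filing: filing.get("tax_prd_yr") or 0)
    return {
        "ein": org_payload.get("organization", {}).get("ein"),
        "tax_year": latest.get("tax_prd_yr"),
        "revenue": latest.get("totrevenue"),
        "expenses": latest.get("totfuncexpns"),
    }
```

Fetching `organization_url(ein)` with the same `httpx.get(...).json()` pattern as above and passing the result to `latest_financials` gives one budget row per nonprofit.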

The trade-off is lag, since 990 data typically runs one to two years behind, and you get officer names without direct contact data. The next source trades national coverage for a different edge: regulated, license-gated professions.


## 7. Professional Licensing Boards: Best for Regulated Professions

Professional licensing boards are the best public source for reaching regulated trades, because the license itself is a public record. Every US state licenses contractors, medical professionals, attorneys, insurance agents, and cosmetologists.

Each state runs a public lookup with name, license status, business address, and sometimes a phone number.

The payoff is prebuilt vertical lists. You can assemble every licensed HVAC contractor in Texas or every active CPA in a county.

Commercial databases cover that kind of niche roster poorly or let it go stale. For trade-specific outreach, the precision is hard to match.

The friction is real, though. There's 50-state fragmentation with no shared schema, most portals are scrape-only, and some rate-limit aggressively. You'll build per-state extractors and normalize the output yourself.
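A per-state field map is one simple way to handle that normalization. The sketch below is illustrative only: the column names in `FIELD_MAPS` and the status words are made up, since every real portal labels its fields differently.

```python
# Hypothetical column names: real portals differ, so each state gets its own map.
FIELD_MAPS = {
    "TX": {"name": "LICENSEE NAME", "status": "LICENSE STATUS", "license_no": "LICENSE NUMBER"},
    "CA": {"name": "Business Name", "status": "Status", "license_no": "License #"},
}
# Assumed status vocabulary; extend it as you meet new portals.
ACTIVE_WORDS = {"active", "current", "clear"}

def normalize_license(state, raw):
    """Map one raw portal record into a shared schema."""
    mapping = FIELD_MAPS[state]
    row = {field: (raw.get(column) or "").strip() for field, column in mapping.items()}
    row["state"] = state
    # Collapse repeated whitespace that scraped table cells often carry.
    row["name"] = " ".join(row["name"].split())
    row["is_active"] = row["status"].lower() in ACTIVE_WORDS
    return row
```

Adding a state then means adding one dictionary entry rather than touching the downstream code.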

Licensing data targets niche verticals, and the next source adds the dimension of timing, telling you which companies are moving right now.


## 8. Job Boards: Best for Hiring and Growth Signals

Job boards are the best public source for buying-intent and growth signals, because a job posting shows where a company is investing. Indeed and Glassdoor postings reveal a lot.

A company hiring five SDRs is scaling sales. One hiring DevOps is spending on servers, and posting keywords expose the tech stack.

Access is scrape-only at scale, and postings are time-sensitive, so freshness is the entire value. A two-week-old posting is still a live signal, while a two-year-old one is noise.

The caveat is that postings are signals, not records. Pair them with a registry or directory source to attach the signal to a real company entity.
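To make the freshness point concrete, here is a small sketch that counts recent postings per company for one role keyword. The posting dicts with `company`, `title`, and ISO-formatted `posted` keys are a hypothetical shape for already-scraped data, and the 30-day window is an arbitrary default.

```python
from collections import Counter
from datetime import date, timedelta

def hiring_signals(postings, today, max_age_days=30, keyword="sdr"):
    """Count fresh postings per company whose title contains the keyword."""
    cutoff = today - timedelta(days=max_age_days)
    counts = Counter()
    for post in postings:
        posted = date.fromisoformat(post["posted"])
        # Stale postings are dropped entirely rather than down-weighted.
        if posted >= cutoff and keyword.lower() in post["title"].lower():
            counts[post["company"]] += 1
    return counts.most_common()
```

The company names it returns still need matching against a registry or directory record before they count as leads.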

Related reading: [Guide to Understanding and Developing LLM Agents](https://scrapfly.io/blog/posts/practical-guide-to-llm-agents).


## 9. Crunchbase: Best for Funding-Stage Targeting

Crunchbase is the best public source for targeting companies by funding stage. Its public profiles carry funding rounds, investors, and acquisition news. Together they are the standard proxy for "has budget and is growing."

A company that recently closed a Series B is a different prospect than a bootstrapped shop.

On access, the free tier is browsable, the official API costs money, and you can scrape the public profile layer.

Collecting the public pages reaches Crunchbase without a paid plan. Our [Crunchbase scraping guide](https://scrapfly.io/blog/posts/how-to-scrape-crunchbase) shows how.

The coverage skews toward funded startups, and data quality drops off outside funded tech. Don't expect it to cover traditional or local businesses well.

Funding data points you at growing companies, and the final source zooms out to whole markets.


## 10. US Census Bureau and BLS: Best for Market Sizing

The US Census Bureau and the Bureau of Labor Statistics are the best public sources for sizing markets, not finding leads. County Business Patterns from the Census counts businesses by industry and county.

Demographic and income data describes territories, and BLS series track employment and wages. All of it ships through free APIs and bulk files.

Use these for total addressable market sizing, territory planning, and validating where your ideal customer profile concentrates. If you're deciding which metros to staff or which verticals to chase, this is the data that settles it.

The limit is that everything here is aggregate, so there are no individual company records, and most series carry a release lag.
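For a feel of what the aggregate data looks like, the sketch below ranks counties by establishment count from a County Business Patterns response. It assumes the Census API's usual JSON shape (a header row followed by data rows) and the dataset path and variable names `NAME`, `ESTAB`, and `NAICS2017`; confirm these against the Census API documentation for the year you query.

```python
def cbp_url(naics, state_fips, year=2021):
    """Build a County Business Patterns query URL (assumed path and variables)."""
    return (
        f"https://api.census.gov/data/{year}/cbp"
        f"?get=NAME,ESTAB&NAICS2017={naics}&for=county:*&in=state:{state_fips}"
    )

def top_counties(table, limit=5):
    """Rank counties by establishment count from a header-plus-rows response."""
    header, *rows = table
    records = [dict(zip(header, row)) for row in rows]
    # Values arrive as strings, so convert before sorting.
    records.sort(key=lambda record: int(record["ESTAB"]), reverse=True)
    return [(record["NAME"], int(record["ESTAB"])) for record in records[:limit]]
```

Passing the parsed JSON from `cbp_url("541512", "48")` into `top_counties` would list the Texas counties with the most firms in that industry code, which is a territory-planning answer rather than a lead list.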

Census and BLS tell you where the opportunity is, while the registry and directory sources earlier produce the records you contact. With all 10 sources covered, the next question is when public data beats a bought list.


## Public Data vs. Purchased Databases: When Is Each Right?

Back to that Reddit thread: "just buy the database, it's $99." The replies underneath asked the obvious question, which database, and nobody could name a good one.

That gap is the whole decision in miniature, so here's an honest framework instead of a verdict.

Buying wins when speed matters most. A purchased list gets you to a first campaign fast and often ships with verified emails and phones. It makes sense for small one-off lists where building a pipeline isn't worth it.

Public data plus extraction wins on the other axes:

- **Niche, local, and regulated verticals** that vendors cover poorly or keep stale.
- **Freshness-sensitive signals** like hiring or new filings that a quarterly list can't track.
- **Cost at scale**, where per-record pricing stops making sense.
- **Differentiation**, because when every competitor buys the same list, that list is the floor, not an edge.

Most teams land on a hybrid: public sources for the company universe and timing signals, purchased enrichment for verified contact fields.

One named contrast, once: the ZoomInfo and Apollo class of databases sell exactly that contact layer. There's some irony that ZoomInfo's own public layer is a scrape target, as shown in [our ZoomInfo guide](https://scrapfly.io/blog/posts/how-to-scrape-zoominfo).

One more free option worth naming: many US public libraries give patrons free access to commercial business databases like Data Axle.

The caveats are honest ones, with no email fields and records that can run stale, but for a one-off local list it costs nothing. Whichever mix you choose, the scrape-only sources raise the same question of how to get the data out at scale.


## How Do You Extract Data From These Sources at Scale?

Pulling data from these 10 sources comes down to three access tiers, the same access column from the table earlier. Tier one is the free APIs (Companies House, USASpending, ProPublica, Census).

Tier two is bulk downloads (registry data products, EDGAR archives). Tier three is scrape-only (directories, the LinkedIn public layer, job boards, and most licensing portals).

For tiers one and two, the advice is neutral. Prefer the API when one exists, and respect the documented rate limits.

Reach for bulk files when you need the full universe, since they beat paginating an endpoint thousands of times. These tiers need no special tooling beyond an HTTP client and patience.
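Respecting rate limits in practice usually means retrying with exponential backoff. The sketch below wraps any fetch function you supply; `RateLimited` and `fetch_with_backoff` are hypothetical names, and the delays are placeholder defaults rather than values documented by any of these APIs.

```python
import random
import time

class RateLimited(Exception):
    """Raised by the caller's fetch function on HTTP 429 or a similar signal."""

def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

Your `fetch` callable would wrap a request such as the USASpending call above and raise `RateLimited` when the response status is 429.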

Tier three is where it gets hard, because the scrape-only sources are exactly the ones with serious anti-bot protection. Expect these failure modes:

- Datacenter IPs get blocked on directory sites like Maps and Yelp.
- JavaScript-rendered listings return empty HTML shells without a real browser.
- Licensing portals ban you after a burst of requests.
- Search-result pagination triggers CAPTCHAs.

This is the tier where Scrapfly fits, and only this tier. The next section explains how its APIs handle the fetching and the cross-portal normalization that tier three demands.


## Collecting Public Lead Data With Scrapfly


Scrapfly's [Web Scraping API](https://scrapfly.io/products/web-scraping-api) is a single HTTP endpoint for collecting web data at scale, with a **99.99% success rate** across **130M+ proxies in 190+ countries**.

- [Anti-Scraping Protection bypass](https://scrapfly.io/docs/scrape-api/anti-scraping-protection) - automatically defeats Cloudflare, DataDome, PerimeterX, Akamai, and 90+ other bot systems.
- [Smart proxy rotation](https://scrapfly.io/docs/scrape-api/proxy) - residential and datacenter pools with country and ASN-level geo-targeting.
- [JavaScript rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering) - render SPAs and dynamic pages through real cloud browsers.
- [Browser automation scenarios](https://scrapfly.io/docs/scrape-api/javascript-scenario) - scroll, click, fill forms, and wait for elements without managing a browser fleet.
- [Format conversion](https://scrapfly.io/docs/scrape-api/getting-started#api_param_format) - return pages as HTML, JSON, clean text, or LLM-ready Markdown.
- [Session management](https://scrapfly.io/docs/scrape-api/session) - keep cookies, headers, and IPs consistent across multi-step flows.
- [Smart caching](https://scrapfly.io/docs/scrape-api/getting-started#api_param_cache) - cache successful responses to cut cost on repeat scraping jobs.
- [Python](https://scrapfly.io/docs/sdk/python), [TypeScript](https://scrapfly.io/docs/sdk/typescript), [Scrapy](https://scrapfly.io/docs/sdk/scrapy), and [no-code integrations](https://scrapfly.io/docs/integration/getting-started) including Make, n8n, Zapier, LangChain, and LlamaIndex.

Scrapfly also pairs with the [AI Extraction API](https://scrapfly.io/docs/extraction-api/getting-started). It normalizes records that differ across every state portal and directory schema.

That cross-portal normalization is the exact pain of 50-state registries and licensing boards. The per-source guides linked above carry working code for the scrape-only tier in Python, TypeScript, [Go](https://go.dev/), and [Rust](https://www.rust-lang.org/).




## FAQ

**Is it legal to scrape public data for lead generation?**

Public-record and public-page data is generally accessible, but rules govern how you use contact data, and GDPR and CCPA standards differ by region. Check your region's rules and the site's terms before you collect or contact.

**What business records are public in the US?**

Public US business records include company registrations from Secretaries of State, SEC filings, professional licenses, permits, court records, federal procurement awards, and nonprofit 990 filings. Most are searchable online for free.

**Can ChatGPT or AI tools build a lead list from public data?**

Large language models can classify and normalize records, but they can't fetch protected pages at scale on their own. Pair an extraction pipeline that handles the fetching with AI parsing that structures the results.

**What is the best free alternative to commercial lead databases?**

The strongest free combination is official business registries for the company universe, local directories for SMBs, and licensing boards for regulated verticals. Together they cover most of what a paid list offers, minus the verified contact fields.


## Summary

Public data covers far more lead-grade ground than the "buy a list" reflex suggests. Official registries and SEC EDGAR build your company universe.

Local directories and licensing boards reach the SMB and regulated long tail. Job boards and Crunchbase add timing and funding signals, while Census and BLS size the market around all of it.

The deciding factor between any two sources is access. Free APIs from Companies House, USASpending, and ProPublica cost you only an HTTP client. The scrape-only sources behind anti-bot protection are where the real engineering lives.

Most teams end up hybrid: public sources for the company universe and signals, purchased enrichment for verified contacts.

For the scrape-only tier, the Web Scraping API plus AI Extraction handle the fetching and the cross-portal normalization. That frees your time for the leads instead of the plumbing.


**Legal Disclaimer and Precautions**

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect:

- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens protected by GDPR.
- Do not repurpose *entire* public datasets which can be illegal in some countries.

Scrapfly does not offer legal advice but these are good general rules to follow. For more you should consult a lawyer.

 

