# How to Scrape Government and Public Records Data

 by [Mayada Shaaban](https://scrapfly.io/blog/author/mayada-shaaban-90143e67) Jun 30, 2026 17 min read [\#data-parsing](https://scrapfly.io/blog/tag/data-parsing) [\#hidden-api](https://scrapfly.io/blog/tag/hidden-api) [\#python](https://scrapfly.io/blog/tag/python) [\#scrapeguide](https://scrapfly.io/blog/tag/scrapeguide) 

The first time you scrape a government site, it works locally, then dies in production. Once requests come from a cloud IP, the portal returns empty responses. The records are public by design; the technology guarding them is the real barrier.

So let's map it out. You'll see which government and public-record sources are worth scraping and why those portals fight back, then how to get past the blocks and make 50 inconsistent formats usable.

[10 Best Public Data Sources for Lead Generation in 2026](https://scrapfly.io/blog/posts/best-public-data-sources-for-lead-generation): A ranked directory of 10 public data sources for B2B lead generation, with the fields, access method, and freshness of each.


## Key Takeaways

- **Public records are public by law.** The barrier to access is technical, not legal.
- **The highest-value targets:** federal contracts, business registries, and court dockets.
- **Check bulk downloads and hidden JSON APIs first.** Parse HTML only when neither exists.
- **The real work is not fetching,** but normalizing records that differ across states.
- **Cloud-IP blocks** are why managed unblocking and AI extraction earn their place.



## Why Is Government and Public Records Data Worth Scraping?

Government data is valuable for two reasons: it is authoritative and it is universal. Companies file it under legal obligation rather than volunteering it, and every business, license, and court case leaves a record that anyone can look up.

That makes the data intent-rich. A party sits on a register because of what it does: a new LLC files with a Secretary of State, a contractor holds a state license, a vendor bids on a federal contract. Each record reflects a real action.

The value shows up across use cases. State business registries surface new-company signals. SAM.gov and USASpending.gov expose procurement intelligence on federal spending worth over $700 billion a year. Court dockets and SEC EDGAR filings carry due-diligence signals.

So if all this is free and legal to collect, why is it hard? The data spreads across thousands of separate systems, many built on aging technology that chokes on fast bots.


## Which Government and Public Records Sources Can You Scrape?

The highest-value sources fall into three buckets: federal databases, state and local portals, and nonprofit registries. Each holds different contents behind a different access method, so your approach depends on which bucket your target sits in.

The matrix below maps each source family to what it contains, how you get the data, and the friction you should expect. Use it to jump straight to your target.

| Source family | What it contains | Access method | Typical friction |
|---|---|---|---|
| SAM.gov | Open contract opportunities over $25k | API (key required) | Slow key approval, rate limits |
| USASpending.gov | Awarded federal contracts and grants | API (no auth) | Pagination, undocumented throttle |
| SEC EDGAR | 10-K, 10-Q, 8-K, 13F, Form 4 filings | API and bulk | Required user-agent header |
| State business registries | Company registrations, officers, status | Scrape or paid bulk | 50 separate systems, legacy tech |
| Licensing boards | Doctors, lawyers, contractors, advisors | Scrape-only | JavaScript search, captcha |
| Court dockets | Cases, parties, filings, dispositions | PACER and county portals | Logins, no date search, fees |
| IRS Form 990 / ProPublica | Nonprofit officers, finances, boards | API and bulk | Low friction |
| FCC and FDA databases | Licenses, approvals, enforcement | API and scrape | Mixed formats |


### Federal Sources: SAM.gov, USASpending.gov, SEC EDGAR, FCC and FDA

Federal data is the most API-friendly bucket, though not uniformly. SAM.gov lists open opportunities over $25,000, and USASpending.gov exposes awarded contracts through a no-auth endpoint.

SEC EDGAR serves filings like the 10-K, 10-Q, and 13F. It caps requests at 10 per second and needs an identifying user-agent header.
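The two EDGAR requirements above (an identifying user-agent and a ceiling of 10 requests per second) can be sketched as a small helper. This is a minimal illustration, not SEC-provided code: the `Throttle` class, the `edgar_headers` function, and the user-agent wording are hypothetical names chosen for this example.

```python
import time

SEC_MAX_REQUESTS_PER_SECOND = 10  # the ceiling mentioned above


def edgar_headers(contact: str) -> dict:
    """Build headers with an identifying User-Agent (wording is illustrative)."""
    return {"User-Agent": f"public-records-research {contact}"}


class Throttle:
    """Space calls so they never exceed `rate` requests per second."""

    def __init__(self, rate: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock, self.sleep = clock, sleep
        self.last_call = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self.clock()
        delay = 0.0
        if self.last_call is not None:
            delay = max(0.0, self.min_interval - (now - self.last_call))
        if delay:
            self.sleep(delay)
        self.last_call = now + delay
        return delay
```

In use, you would create one `Throttle(SEC_MAX_REQUESTS_PER_SECOND)` and call `throttle.wait()` before each `requests.get(url, headers=edgar_headers("you@example.com"))`. The injectable `clock` and `sleep` arguments exist only so the pacing logic can be tested without real delays.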

Beyond those, the FCC and FDA publish licensing, approval, and enforcement databases. EDGAR and USASpending are pleasant to work with. Many other federal systems are not, and you fall back to scraping.


### State and Local Sources: Business Registries, Licensing Boards, and Court Dockets

This is where friction concentrates. With no central federal company register, 50 separate Secretary of State registries each run their own search flow and HTML. Licensing boards add another layer for doctors, lawyers, contractors, and advisors.

Court records are equally fragmented. Federal cases live in PACER at $0.10 per page; state and county dockets spread across thousands of systems, alongside building-permit and property records. These paginated registries fit the Crawler API, covered later.


### Nonprofit and Institutional Sources: IRS Form 990 and ProPublica

Nonprofit data is the lowest-friction bucket. IRS Form 990 filings list a nonprofit's officers, board members, and finances. ProPublica's Nonprofit Explorer aggregates them into a searchable index with its own API.
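As a rough sketch of how low the friction is, the helpers below build request URLs for the Nonprofit Explorer API. The base path and the `search.json` and `organizations/{ein}.json` routes reflect my understanding of ProPublica's version 2 API and are assumptions to confirm against its documentation; the function names are made up for this example.

```python
from urllib.parse import urlencode

# Assumed base path for the Nonprofit Explorer API; verify before relying on it.
API_BASE = "https://projects.propublica.org/nonprofits/api/v2"


def search_url(query: str, page: int = 0) -> str:
    """URL for a keyword search over nonprofits (pages assumed zero-indexed)."""
    return f"{API_BASE}/search.json?{urlencode({'q': query, 'page': page})}"


def organization_url(ein: str) -> str:
    """URL for one organization, accepting EINs written with or without a dash."""
    digits = "".join(ch for ch in ein if ch.isdigit())
    return f"{API_BASE}/organizations/{digits}.json"
```

Each URL can be fetched with a plain `requests.get(...)` and read with `.json()`. How the API treats EINs with leading zeros is something to check on a real record before running a bulk job.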

So why do so many of these portals break a scraper that works perfectly on your machine?


## Why Do Government Portals Block Scrapers or Break Your Code?

Government portals break scrapers more through aging architecture, fragmentation, and inconsistent throttling than deliberate anti-bot defense. Which failure mode you hit decides whether plain code suffices or you need managed unblocking.

Here are the failure modes you'll meet most often, each with its symptom:

- **Datacenter IP ranges get blocked.** A scraper that passes locally returns empty responses from AWS or GCP IPs, and some proxy vendors refuse .gov targets outright (r/webscraping, 2024). The symptom is an empty body with no error code.
- **Legacy ASP.NET session state.** Many state and county portals rely on ViewState, hidden form fields, and tokens like `__RequestVerificationToken`. A naive GET misses the token, so the POST fails with a 403 or a redirect back to the search form.
- **Undocumented rate limits.** SAM.gov returns 429s after roughly 20 requests per minute even though its docs imply 60 (dev.to, April 2026). The symptom is intermittent 429s that backoff fixes.
- **CAPTCHA on bulk access.** Some registries gate bulk or repeated queries behind a captcha challenge once they notice volume.
- **Fragmented architecture.** A national view of any record type means scraping dozens to hundreds of separate systems, each with its own search flow and HTML (OpenCorporates, September 2025).
- **JavaScript-rendered search results.** Modern portals load results after the initial HTML, so a requests-only scraper sees an empty shell.

California's business search shows this plainly. A plain HTTP request returns a near-empty page whose `<body>` is just an empty `<div id="root">`, with the real UI loaded by a JavaScript bundle below it.


Load the same URL in a real browser and that empty `root` fills in with the full search application.
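One way to catch this case automatically is to measure how much visible text the raw HTML carries once scripts and styles are ignored. The check below is a heuristic sketch, not a reliable detector: `looks_like_js_shell` is a made-up helper and the 200-character threshold is an arbitrary assumption you would tune per site.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping script, style, and noscript content."""

    SKIPPED = ("script", "style", "noscript")

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_like_js_shell(html: str, min_chars: int = 200) -> bool:
    """True when the page has too little visible text to hold real results."""
    parser = VisibleText()
    parser.feed(html)
    return len(" ".join(parser.chunks)) < min_chars
```

When the check returns `True` for a page that shows results in a browser, that is a signal to look for the underlying JSON endpoint or switch to a rendering browser.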


These modes split into two groups: problems plain code can handle, and problems that need managed unblocking. Our guide on how to [bypass anti-bot protection](https://scrapfly.io/blog/posts/how-to-bypass-anti-bot-protection-when-web-scraping) covers the IP and captcha mechanics.

Before reaching for either group, though, start with the cheapest path that works.


## How Do You Start Scraping a Government Site the Right Way?

Before you write a single selector, check for an official bulk download or API. Then look for a hidden JSON endpoint, and parse HTML only when neither exists. This order spares you from maintaining brittle scrapers against data the agency already offers.


### Check for Bulk Downloads and Official APIs First

Many agencies publish the whole dataset behind a polished UI that hides a plain file directory. Bulk files and official APIs are faster, more stable, and lighter on the server than scraping page by page. Always look for them first.

Cost and licensing vary. Some states sell bulk data at widely different prices, and bulk exports often exclude officers or historical records. That gap pushes some teams back to scraping.

When a bulk file exists, harvesting it is a short loop. The example below scans a file-listing page, finds every archive link, and saves the first into a date-stamped folder.

```python
import datetime
import os

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

index_url = "https://ftp.gnu.org/gnu/ncurses/"   # swap in your agency's bulk-data directory
headers = {"User-Agent": "data-collector/1.0"}

# Fetch the directory listing and collect every archive link on it.
resp = requests.get(index_url, timeout=30, headers=headers)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
links = [urljoin(index_url, a["href"]) for a in soup.select("a[href]")]
files = [u for u in links if u.lower().endswith((".csv", ".zip", ".gz"))]

# Save into a folder stamped with today's date, e.g. bulk_20260617.
out_dir = f"bulk_{datetime.date.today():%Y%m%d}"
os.makedirs(out_dir, exist_ok=True)
for url in files[:1]:                             # first file only; drop the slice to fetch all
    name = url.rsplit("/", 1)[-1]
    with requests.get(url, stream=True, timeout=60, headers=headers) as r:
        r.raise_for_status()                      # fail loudly instead of saving an error page
        with open(os.path.join(out_dir, name), "wb") as f:
            for chunk in r.iter_content(8192):
                f.write(chunk)
    print(f"saved {name} -> {out_dir}/")
```


The script prints `saved ncurses-4.2.tar.gz -> bulk_20260617/` after finding 21 file links. Point `index_url` at a real bulk-data directory, adjust the extensions, and the loop pulls a full dataset. Our [BeautifulSoup guide](https://scrapfly.io/blog/posts/web-scraping-with-python-beautifulsoup) covers the parsing.


### Find the Hidden JSON API Before You Scrape the HTML

When there's no bulk file, open the browser network tab. Many portals render from an internal JSON endpoint you can call directly, which is far more stable than parsing HTML that changes layout.

The USASpending `spending_by_award` endpoint needs no key. SAM.gov's open-opportunities API requires one that takes about 10 business days to approve.

The call below posts a filter body to the public USASpending endpoint and prints the largest awards.

```python
import requests

url = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
payload = {
    "filters": {
        "award_type_codes": ["A", "B", "C", "D"],   # contract award types
        "time_period": [{"start_date": "2024-10-01", "end_date": "2025-09-30"}],
    },
    "fields": ["Award ID", "Recipient Name", "Awarding Agency", "Award Amount"],
    "page": 1, "limit": 5, "sort": "Award Amount", "order": "desc",
}
resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()                             # surface 4xx/5xx instead of a KeyError below
for row in resp.json()["results"]:
    print(row["Recipient Name"], "|", row["Awarding Agency"], "|", row["Award Amount"])
```


This returns real awards, led by `HUMANA GOVERNMENT BUSINESS INC | Department of Defense | 51269205263.03`. No authentication and no HTML parsing, only structured records. Our guide to [scraping hidden APIs](https://scrapfly.io/blog/posts/how-to-scrape-hidden-apis) shows how to find these endpoints on any site.


### Parse and Paginate the HTML When There Is No API

When you must scrape the page, parse it with BeautifulSoup and handle the portal's pagination. That might be query-string page numbers, or offset and cursor patterns.
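The three pagination patterns named above can be sketched as generators. This is an illustration under assumptions: the parameter names (`page`, `offset`, `limit`) and the `fetch_page` callback contract are hypothetical, and a real portal will use its own names.

```python
from urllib.parse import urlencode


def page_number_urls(base: str, pages: int, param: str = "page"):
    """Query-string page numbers: ?page=1, ?page=2, ..."""
    for number in range(1, pages + 1):
        yield f"{base}?{urlencode({param: number})}"


def offset_urls(base: str, total: int, page_size: int):
    """Offset pagination: ?offset=0&limit=100, ?offset=100&limit=100, ..."""
    for offset in range(0, total, page_size):
        yield f"{base}?{urlencode({'offset': offset, 'limit': page_size})}"


def follow_cursor(fetch_page, start=None, max_pages: int = 1000):
    """Cursor pagination: fetch_page(cursor) returns (records, next_cursor)."""
    cursor = start
    for _ in range(max_pages):          # hard cap guards against cursor loops
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break
```

The cursor variant takes a callback so that the HTTP call and the parsing stay site-specific while the loop and its safety cap stay reusable.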

For court systems with no date search, you reconstruct case-number sequences like `2016-CV-000001` (LSC Civil Court Data Initiative). The pattern below walks a reconstructed range with a polite delay.

```python
import time

import requests


def parse_docket(html):
    """Placeholder parser: replace with selectors for your court portal."""
    return {"length": len(html)}


base = "https://www.judici.com/courts/cases/case_history.jsp?court=IL008015J&case={case_id}"
for seq in range(1, 51):
    case_id = f"2016-CV-{seq:06d}"          # reconstruct sequential dockets
    resp = requests.get(base.format(case_id=case_id), timeout=20)
    if resp.status_code == 200:
        parse_docket(resp.text)             # selectors are site-specific
    time.sleep(2)                           # throttle, and run off-peak
```


The selectors stay conceptual because every court portal differs. The discipline holds everywhere: add delays, run off-peak, test on samples, and scrape only public data. Our [Python web scraping guide](https://scrapfly.io/blog/posts/web-scraping-with-python) covers the foundations.

[How to Scrape Hidden Web Data](https://scrapfly.io/blog/posts/how-to-scrape-hidden-web-data): The visible HTML doesn't always represent the whole dataset available on the page. In this article, we'll be taking a look at scraping of hidden web data. What is it and how can we scrape it using Python?

That covers the easy targets. Portals that block cloud IPs, render with JavaScript, or throw captcha need a different approach.


## How Do You Get Past Anti-Bot Blocks on Government Portals?

When a portal blocks cloud IPs, throws captcha, or renders with JavaScript, plain requests stop working. You escalate to residential IPs and a managed unblocking layer. Reach for a crawler when registries paginate with predictable URLs.

Recall the developer whose scraper died on GCP while two proxy vendors refused .gov access. The fix is residential or mobile IPs plus a request fingerprint that mimics a real browser. Our roundup of [residential proxy providers](https://scrapfly.io/blog/posts/top-5-residential-proxy-providers) covers the IP side.

For JavaScript-rendered, captcha-gated, or rate-limited portals, the Web Scraping API with Anti-Scraping Protection rotates residential IPs, renders JavaScript, and clears the challenge. It returns the rendered HTML or JSON.
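As a hedged sketch of what such a call looks like over plain HTTP, the helper below assembles a request URL for the scrape endpoint. The endpoint path and the parameter names (`key`, `url`, `asp`, `render_js`, `proxy_pool`, `country`) and the pool name are my understanding of the Scrapfly API and should be checked against its documentation; `build_scrape_url` is a hypothetical helper, not part of any SDK.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the Web Scraping API documentation.
SCRAPFLY_ENDPOINT = "https://api.scrapfly.io/scrape"


def build_scrape_url(api_key: str, target_url: str,
                     render_js: bool = True, country: str = "us") -> str:
    """Assemble a scrape request with unblocking and residential IPs enabled."""
    params = {
        "key": api_key,
        "url": target_url,
        "asp": "true",                               # Anti-Scraping Protection
        "render_js": "true" if render_js else "false",
        "proxy_pool": "public_residential_pool",     # assumed pool name
        "country": country,
    }
    return f"{SCRAPFLY_ENDPOINT}?{urlencode(params)}"
```

The resulting URL can be fetched with `requests.get(...)`. The Python SDK used later in this article wraps the same options, so prefer it when it is already a dependency.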

For paginated registries with predictable URLs, the [Crawler API](https://scrapfly.io/crawler-api) runs scoped crawls instead of hand-rolled pagination logic.

One rule decides the tool: plain requests are enough for bulk files and open APIs. Reach for managed unblocking on cloud-IP blocks, captcha, JavaScript rendering, or undocumented throttling. The table below maps friction to a tool.

| Friction | Right approach |
|---|---|
| Bulk file available | Plain requests download |
| Open JSON API | Direct API call |
| JavaScript-rendered results | Web Scraping API with ASP |
| CAPTCHA or cloud-IP block | Web Scraping API with ASP |
| Paginated registry, predictable URLs | Crawler API |

For captcha specifically, see our guide on how to [bypass captcha while scraping](https://scrapfly.io/blog/posts/how-to-bypass-captcha-while-web-scraping). To choose between the two products, our [Scraper API vs Crawler API](https://scrapfly.io/blog/posts/scraper-api-vs-crawler-api) comparison breaks down the trade-offs.

[How to Avoid Web Scraper IP Blocking?](https://scrapfly.io/blog/posts/how-to-avoid-web-scraping-blocking-ip-addresses): How IP addresses are used in web scraping blocking. Understanding IP metadata and fingerprinting techniques to avoid web scraper blocks.

Getting the bytes back is only half the job. The harder half is making 50 different formats agree.


## How Do You Normalize Records Across 50 States and Agencies?

The hardest part of public-records scraping is not fetching the data; it is making 50 formats comparable. Every state names, codes, and structures the same field differently, so per-state selectors break constantly. This is the normalization tax.

The heterogeneity is concrete. Business status reads "Active," "In Existence," or "Good Standing" across states. Courts file eviction cases as "unlawful detainer," "forcible entry and detainer," or "dispossessory."

The naive fix fails at scale. Writing and maintaining around 50 custom parsers, plus mapping tables that grow every time a state tweaks a field, is a workstream teams underestimate. Sites change, selectors break, and the maintenance never ends.
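To make that maintenance burden concrete, here is a minimal sketch of the mapping-table approach for just two fields. The status labels come from the examples above; the three date formats and the `unknown` fallback are illustrative assumptions, and every new state adds entries to both tables.

```python
from datetime import datetime

# Each new state label needs another row here.
STATUS_MAP = {
    "active": "active",
    "in existence": "active",
    "good standing": "active",
    "in good standing": "active",
}

# Assumed date layouts: 03/14/2019, March 14, 2019, 14-MAR-19.
DATE_FORMATS = ("%m/%d/%Y", "%B %d, %Y", "%d-%b-%y")


def normalize_status(raw: str) -> str:
    """Map a state-specific status label to a canonical value."""
    return STATUS_MAP.get(raw.strip().lower(), "unknown")


def normalize_date(raw: str) -> str:
    """Try each known layout and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

This works until a state adds a label or changes its date layout, at which point a record silently becomes `unknown` or raises. That is the upkeep the paragraph above describes.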

AI extraction takes a different route. Instead of per-state selectors, you give a model a schema and let it pull the same fields from any layout. A Florida and a Texas page then produce the same record shape, surviving layout drift without rewrites.

```python
from scrapfly import ScrapflyClient, ExtractionConfig

client = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
record_html = """
<div class="record">
  <span class="biz">Acme Holdings LLC</span>
  <span class="standing">In Good Standing</span>
  <span class="jurisdiction">Texas</span>
  <span class="kind">Domestic Limited Liability Company</span>
  <span class="filed">03/14/2019</span>
</div>"""

extraction = client.extract(ExtractionConfig(
    body=record_html,
    content_type="text/html",
    charset="utf-8",
    extraction_prompt=(
        "Extract these canonical fields from this business registry record: "
        "entity_name, status (normalize to active/inactive), state, "
        "entity_type, registration_date (ISO 8601). Return JSON."
    ),
))
print(extraction.extraction_result["data"])
```


This returns `{'entity_name': 'Acme Holdings LLC', 'status': 'active', 'state': 'Texas', 'entity_type': 'Domestic Limited Liability Company', 'registration_date': '2019-03-14'}`.

The model mapped "In Good Standing" to `active` and reformatted the date, with no selector written. The table below shows that across three states.

| State | Raw status | Raw date | Canonical record |
|---|---|---|---|
| Texas | In Good Standing | 03/14/2019 | status: active, date: 2019-03-14 |
| Florida | Active | March 14, 2019 | status: active, date: 2019-03-14 |
| Delaware | In Existence | 14-MAR-19 | status: active, date: 2019-03-14 |

Normalization does not end at field mapping. You still own entity resolution and dedup, since the same firm appears in many states with name variants and no national ID.
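
A common first step toward entity resolution is a blocking key that groups likely duplicates for review. The sketch below is one simple way to build such a key, not a complete matcher: the suffix list is a small illustrative sample, and `match_key` and `group_candidates` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Illustrative sample of legal suffixes; real lists are much longer.
LEGAL_SUFFIXES = {"llc", "inc", "incorporated", "corp", "corporation",
                  "co", "company", "ltd", "limited", "lp", "llp"}


def match_key(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal suffixes from a name."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def group_candidates(records: list) -> dict:
    """Group records by key and keep only keys shared by several records."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record["entity_name"])].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}
```

Records that share a key are candidates, not confirmed matches: "Acme Holdings LLC" and "Acme Holdings Inc" collapse to the same key even though they may be different legal entities, so a second check on address or officers is still needed.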

You also own provenance, tracking the source URL and retrieval date per attribute. Freshness matters too: ingest deltas where states publish them. The [AI Extraction API](https://scrapfly.io/docs/extraction-api/getting-started) handles only the field mapping.
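
Per-attribute provenance can be as simple as wrapping each value with where and when it was retrieved. The structure below is one possible shape, offered as a sketch; `SourcedValue` and `with_provenance` are names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourcedValue:
    """A single attribute plus the page and time it was retrieved from."""
    value: str
    source_url: str
    retrieved_at: str  # ISO 8601 timestamp in UTC


def with_provenance(fields: dict, source_url: str, retrieved_at=None) -> dict:
    """Attach the same source URL and retrieval time to every field."""
    stamp = (retrieved_at or datetime.now(timezone.utc)).isoformat()
    return {name: SourcedValue(value, source_url, stamp)
            for name, value in fields.items()}
```

Storing provenance per attribute rather than per record matters once you merge sources, because a merged record can take its status from one state portal and its officers from another.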

With normalization handled in principle, one target makes the whole flow concrete.


## Worked Example: Pulling Federal Contracts from SAM.gov and USASpending

Federal contracts live in two systems: SAM.gov for opportunities and USASpending.gov for awards. A useful scraper joins them, starting from the no-auth USASpending API. This pattern is illustrative, not production code.

The two-systems problem drives the design. SAM.gov posts opportunities over $25,000, but its API key takes about 10 business days to approve. USASpending exposes awards with no auth, so you start there and add SAM.gov detail later.

Three handling details map back to earlier sections:

- Pull attachment URLs from the SAM.gov `resourceLinks` field that most scrapers ignore.
- Add exponential backoff for the undocumented 429s at around 20 requests per minute.
- Normalize agency names with a lookup or AI extraction.

```python
import time

import requests

API_URL = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
AGENCY = {"Veterans Affairs, Department of": "VA", "Department of Veterans Affairs": "VA"}


def fetch_awards(naics, pages=2):
    """Fetch contract awards for one NAICS code, backing off on 429s."""
    records = []
    for page in range(1, pages + 1):
        body = {
            "filters": {"award_type_codes": ["A", "B", "C", "D"], "naics_codes": [naics]},
            "fields": ["Award ID", "Recipient Name", "Awarding Agency", "Award Amount"],
            "page": page, "limit": 100,
        }
        for attempt in range(5):
            r = requests.post(API_URL, json=body, timeout=30)
            if r.status_code == 429:
                time.sleep(2 ** attempt)        # back off the undocumented throttle
                continue
            r.raise_for_status()                # surface other HTTP errors
            records.extend(r.json()["results"])
            break
        else:
            # All five attempts were throttled; stop instead of skipping the page silently.
            raise RuntimeError(f"page {page} still throttled after 5 attempts")
    return records


def normalize(record, naics):
    """Map one award into the canonical record shape."""
    return {
        "opportunity_id": record["Award ID"],
        "agency": AGENCY.get(record["Awarding Agency"], record["Awarding Agency"]),
        "naics_code": naics,                     # the code the search was filtered on
        "estimated_value": record["Award Amount"],
        "attachments": [],                       # from SAM.gov resourceLinks when a key is available
    }


naics = "541512"
awards = fetch_awards(naics)
if awards:                                       # guard against an empty result set
    print(normalize(awards[0], naics))
```


The flow fetches awards, applies backoff on 429s, normalizes the agency name, and emits a record. Each record has `opportunity_id`, `agency`, `naics_code`, `estimated_value`, and `attachments`.

The SAM.gov detail and `resourceLinks` attachments arrive with the key, and the Web Scraping API handles those protected endpoints. This pattern generalizes to registries, licensing boards, and 990s; a future guide ships target-specific scrapers.
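
Once a SAM.gov key is approved, filling the `attachments` field is mostly a matter of reading `resourceLinks` from each opportunity. The sketch below works on an already-parsed response. Apart from `resourceLinks`, the key names (`opportunitiesData`, `noticeId`) are my assumption about the response shape and should be verified against a real response.

```python
def attachments_by_notice(response_json: dict) -> dict:
    """Map each opportunity's notice ID to its list of attachment URLs."""
    attachments = {}
    for opportunity in response_json.get("opportunitiesData", []):
        # The field can be missing or null when a notice has no attachments.
        links = opportunity.get("resourceLinks") or []
        attachments[opportunity.get("noticeId")] = list(links)
    return attachments
```

Joining these back to awards still needs a shared identifier between the two systems, which is left out of this sketch.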

## Scrapfly: Government Data Collection at Scale


Scrapfly's [Web Scraping API](https://scrapfly.io/web-scraping-api) is a single HTTP endpoint for collecting web data at scale, with a **99.99% success rate** across **130M+ proxies in 190+ countries**.

- [Anti-Scraping Protection bypass](https://scrapfly.io/docs/scrape-api/anti-scraping-protection) - automatically defeats Cloudflare, DataDome, PerimeterX, Akamai, and 90+ other bot systems.
- [Smart proxy rotation](https://scrapfly.io/docs/scrape-api/proxy) - residential and datacenter pools with country and ASN level geo-targeting.
- [JavaScript rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering) - render SPAs and dynamic pages through real cloud browsers.
- [Browser automation scenarios](https://scrapfly.io/docs/scrape-api/javascript-scenario) - scroll, click, fill forms, and wait for elements without managing a browser fleet.
- [Format conversion](https://scrapfly.io/docs/scrape-api/getting-started#api_param_format) - return pages as HTML, JSON, clean text, or LLM-ready Markdown.
- [Session management](https://scrapfly.io/docs/scrape-api/session) - keep cookies, headers, and IPs consistent across multi-step flows.
- [Smart caching](https://scrapfly.io/docs/scrape-api/getting-started#api_param_cache) - cache successful responses to cut cost on repeat scraping jobs.
- [Python](https://scrapfly.io/docs/sdk/python), [TypeScript](https://scrapfly.io/docs/sdk/typescript), [Scrapy](https://scrapfly.io/docs/sdk/scrapy), and [no-code integrations](https://scrapfly.io/docs/integration/getting-started) including Make, n8n, Zapier, LangChain, and LlamaIndex.




## FAQ

**Is it legal to scrape data from government websites?**

Public records carry low legal risk, but "public" is not "unrestricted." Respect each site's terms and send identifying headers where a site requires them, as SEC EDGAR does. In hiQ v. LinkedIn, the Ninth Circuit found scraping public pages likely doesn't break the CFAA.

**Do government sites have official APIs or bulk downloads?**

Many do, and you should always check first. USASpending and SEC EDGAR offer APIs, and states like New York publish open datasets. Others sell bulk files or offer nothing, which is when the Web Scraping API becomes necessary.

**Will I get blocked scraping government records, and can the site detect it?**

Yes. Many portals block datacenter IPs, throttle requests, or gate bulk access behind captcha. Residential IPs, polite rate limiting, and managed unblocking reduce blocks. Scraping from cloud IPs without any of them is the fastest route to empty responses.

**Are court records publicly accessible?**

Most US court records are public, but access fragments across PACER at $0.10 per page and thousands of state and county systems. Many require a login or lack a date search. Open-source tools like Juriscraper handle much of the federal and appellate layer.

**What is the fine for scraping public data?**

There is no fine for scraping public data itself. Legal exposure comes from how you access and use it: violating terms of service, exceeding authorized access under the CFAA, or misusing personal data.


## Summary

Government and public records are free, authoritative, and legal to collect. The challenge is technical: fragmentation, aging portals, IP blocks, and the normalization tax.

The path is an escalation. Start with bulk downloads and hidden JSON APIs, and parse HTML when you must. Move to residential IPs and managed unblocking when portals fight back. Lean on AI extraction when you reconcile many sources into one schema.

For easy targets, plain requests and official APIs remain the right call. Other portals block cloud IPs, and many records need normalizing across 50 states. For those, Scrapfly's Web Scraping API, Crawler API, and AI Extraction are the production path.


**Legal Disclaimer and Precautions**

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect:

- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens protected by GDPR.
- Do not repurpose *entire* public datasets, which can be illegal in some countries.

Scrapfly does not offer legal advice, but these are good general rules to follow. For more, you should consult a lawyer.

 