# News & Media Web Scraping

##  Every article, as soon as it publishes. 

 Monitor public news sources, extract clean article text, and feed structured data into your pipelines. Scrapfly handles JS rendering, consent banners, and anti-bot protection so you can focus on the signal - not the scraper. Always respect publisher ToS and robots.txt.

 [ Get Free API Key ](https://scrapfly.io/register) [ Web Scraping API ](https://scrapfly.io/products/web-scraping-api) 

 1,000 free credits. No credit card required. 

 

  

 

 

 

---

## Article

\+ metadata extraction model

 



 

## 5B+

scrapes / month platform-wide

 



 

## 99%+

success rate on public web targets

 



 

## JSON

or Markdown output, ready to pipe

 



 

 

 

---

## Turn every headline into structured signal.

One API call returns clean article text, byline, publish date, and topic - no HTML parsing required.

 `Article URL + Extraction = Clean Article` 

 

---

## Everything You Need for News Data

From article extraction to freshness pipelines. Always on the public web.

 

### Article Extraction

Point Scrapfly at any public article URL and get back clean, structured data. The built-in `article` extraction model pulls headline, author, publish date, topic tags, and full body text. No CSS selector maintenance required.

  **Article URL** any publicly accessible news page 

 

**AI Extraction** `extraction_model=article` - headline, author, date, body

 

  **Clean Markdown / JSON** structured output ready for search, analysis, or LLMs 

 

 

 



 

 

### Topic & Sector Monitoring

Poll sitemaps and RSS feeds on a schedule to catch new articles within one polling interval of publication. Deduplicate by URL hash and forward only net-new stories to your downstream system.

**Sitemap** XML feed support

**Scheduled** poll rate you control

**Dedup** URL-hash based

 

Finance

Politics

Technology

Healthcare

Energy

Markets
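The poll-and-deduplicate loop described above can be sketched with the Python standard library alone. This is a minimal illustration, not Scrapfly's implementation: the function names (`sitemap_urls`, `url_hash`, `net_new`), the sample sitemap, and the in-memory `seen_hashes` set are all assumptions made for the example. In production you would likely persist the seen hashes between polls.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def url_hash(url: str) -> str:
    """Stable fingerprint for a URL (SHA-256 hex digest)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def net_new(urls: list[str], seen: set[str]) -> list[str]:
    """Return URLs whose hash is not in `seen`, recording them as seen."""
    fresh = []
    for url in urls:
        digest = url_hash(url)
        if digest not in seen:
            seen.add(digest)
            fresh.append(url)
    return fresh


# Hypothetical sitemap content standing in for a fetched feed.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/a</loc></url>
  <url><loc>https://example.com/news/b</loc></url>
</urlset>"""

seen_hashes: set[str] = set()
first_poll = net_new(sitemap_urls(SAMPLE), seen_hashes)
second_poll = net_new(sitemap_urls(SAMPLE), seen_hashes)
print(first_poll)   # both URLs are new on the first poll
print(second_poll)  # nothing is new when the sitemap is unchanged
```

Hashing the URL rather than storing it keeps the seen-set entries a fixed size, which helps when a feed carries long tracking parameters.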

 

 



 

### Consent Banners & Cookie Walls

Many news sites gate content behind GDPR consent popups or soft cookie walls before rendering the article body. Cloud Browser handles these automatically, clicking through consent flows so your extraction sees the full page. This applies only to publicly accessible articles; always honor each site's robots.txt.

  **Consent Popups** GDPR / CCPA consent flows dismissed automatically 

 

  **Reader View Fallback** Extraction API parses visible text even on noisy layouts 

 

 

 



 

 

### Sentiment & Entity Signals

Extract clean article text then pass it to your NLP or LLM pipeline for named entity recognition, sentiment scoring, or topic classification. Scrapfly delivers the clean input; the analysis is yours to own.

**NER** entity extraction

**Sentiment** tone scoring

**Topics** classification
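To show where the hand-off happens, here is a deliberately tiny tone scorer that takes clean article text as input. It is a toy sketch, not a recommended model: the word lists and the `tone_score` function are hypothetical, and a real pipeline would swap in your own NLP library or LLM call at the same point.

```python
import re

# Hypothetical, tiny word lists for illustration only.
POSITIVE = {"gain", "growth", "record", "surge", "win"}
NEGATIVE = {"loss", "decline", "lawsuit", "crash", "cut"}


def tone_score(text: str) -> float:
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Stand-in for the body text returned by article extraction.
article_body = "Shares surge to a record after strong growth, despite one lawsuit."
print(tone_score(article_body))  # 3 positive hits, 1 negative hit -> 0.5
```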

 

 



 

 ### Freshness Pipelines

From sitemap discovery to enriched output in a single pipeline. The Scrapfly Crawler handles link discovery and scheduling; extraction runs in parallel; your database gets clean structured rows.

**Sitemap / RSS** discover new article URLs automatically

**Crawl** follow links, handle pagination

**Extract** AI extraction model: article

**Enrich + Store** structured JSON to your downstream system
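The discover, extract, and store stages can be sketched as one small function. This is an assumption-laden outline rather than the Scrapfly Crawler itself: `run_pipeline`, the `articles` table, and `fake_extract` (a stand-in for a real extraction call) are invented for the example, and the loop runs sequentially for clarity where a real pipeline would extract in parallel.

```python
import sqlite3
from typing import Callable, Iterable


def run_pipeline(
    urls: Iterable[str],
    extract: Callable[[str], dict],
    db: sqlite3.Connection,
) -> int:
    """Extract each URL not yet stored and insert one row per article."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, headline TEXT, body TEXT)"
    )
    inserted = 0
    for url in urls:
        already = db.execute(
            "SELECT 1 FROM articles WHERE url = ?", (url,)
        ).fetchone()
        if already:
            continue  # skip stories stored by an earlier run
        article = extract(url)
        db.execute(
            "INSERT INTO articles (url, headline, body) VALUES (?, ?, ?)",
            (url, article.get("headline"), article.get("body")),
        )
        inserted += 1
    db.commit()
    return inserted


def fake_extract(url: str) -> dict:
    """Stand-in for a real extraction call; returns made-up fields."""
    return {"headline": f"Headline for {url}", "body": "..."}


conn = sqlite3.connect(":memory:")
discovered = ["https://example.com/news/a", "https://example.com/news/b"]
first_run = run_pipeline(discovered, fake_extract, conn)
second_run = run_pipeline(
    discovered + ["https://example.com/news/c"], fake_extract, conn
)
print(first_run, second_run)  # 2 rows on the first run, 1 new row on the second
```

Checking the database before extracting means a recurring run only spends requests on stories it has not stored yet.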

 

 

 



 

 

 ### Anti-Bot Bypass

News and media sites use a range of bot-detection stacks to protect their infrastructure. Scrapfly bypasses all major vendors at the HTTP and browser level, using genuine Chrome fingerprints - so your scraper looks like a reader, not a robot.

[Cloudflare](https://scrapfly.io/bypass/cloudflare)

[DataDome](https://scrapfly.io/bypass/datadome)

[Akamai](https://scrapfly.io/bypass/akamai)

[PerimeterX](https://scrapfly.io/bypass/perimeterx)

[Imperva](https://scrapfly.io/bypass/incapsula)

[Kasada](https://scrapfly.io/bypass/kasada)

 

 [See full bypass coverage](https://scrapfly.io/bypass) 



 

 

 

---


## The full Scrapfly platform, ready for news data.

Every product works together. Use the Web Scraping API for individual articles, the Crawler for full-site discovery, or Cloud Browser for JS-heavy pages with consent flows.

### Web Scraping API

One endpoint to fetch any public news page. Anti-bot bypass, proxy rotation, JS rendering, and geo-targeting all handled server-side.

 [ Landing page ](https://scrapfly.io/products/web-scraping-api) [ Documentation ](https://scrapfly.io/docs/scrape-api/getting-started) 

 

### Extraction API

AI-powered field extraction. Pass the article extraction model and get headline, author, date, and body text back as JSON - no CSS selectors.

 [ Landing page ](https://scrapfly.io/products/extraction-api) [ Documentation ](https://scrapfly.io/docs/extraction-api/getting-started) 

 

### Screenshot API

Capture pixel-perfect screenshots of article pages - useful for archiving, compliance, or visual monitoring.

 [ Landing page ](https://scrapfly.io/products/screenshot-api) [ Documentation ](https://scrapfly.io/docs/screenshot-api/getting-started) 

 

### Crawler

Discover and queue every article on a domain. Crawl sitemaps, follow pagination, and schedule recurring runs.

 [ Landing page ](https://scrapfly.io/products/crawler-api) [ Documentation ](https://scrapfly.io/docs/crawler-api/getting-started) 

 

### Cloud Browser

Full stealth Chromium via CDP. Required for JS-heavy news pages, consent flows, or soft-gate article access.

 [ Landing page ](https://scrapfly.io/scrapium) [ Documentation ](https://scrapfly.io/docs/cloud-browser-api/getting-started) 

 

 

 [See all products](https://scrapfly.io/products/web-scraping-api) 

 



 

---

## Scrape a News Article as Markdown

Fetch a public news article and receive clean Markdown output - ready for search, analysis, or LLMs. Always respect publisher ToS and robots.txt.

 

Anti-bot bypass, JS rendering, and clean Markdown output from a real news article.


    

```python
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

client = ScrapflyClient(key="API KEY")  # replace with your own API key

api_response: ScrapeApiResponse = client.scrape(
    ScrapeConfig(
        # public article URL to scrape
        url="https://www.nytimes.com/2023/12/29/business/dealbook/stock-market-forecasts-2024.html",
        asp=True,  # enable bypass of anti-scraping protection
        render_js=True,  # enable headless browser (if necessary)
        country="US",  # set location for region-specific data
        # use AI to extract data
        extraction_model="article",
    )
)
# use the AI-extracted data
print(api_response.scrape_result["extracted_data"]["data"])
# or parse the HTML yourself
print(api_response.content)
```

```typescript
import {
    ScrapflyClient, ScrapeConfig
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" }); // replace with your own API key

const api_response = await client.scrape(
    new ScrapeConfig({
        // public article URL to scrape
        url: 'https://www.nytimes.com/2023/12/29/business/dealbook/stock-market-forecasts-2024.html',
        asp: true, // enable bypass of anti-scraping protection
        render_js: true, // enable headless browser (if necessary)
        country: 'US', // set location for region-specific data
        // use AI to extract data
        extraction_model: 'article'
    })
);
// use the AI-extracted data
console.log(api_response.result['extracted_data']['data']);
// or parse the HTML yourself
console.log(api_response.result['content']);
```

```shell
# HTTPie syntax: `==` adds a URL query parameter
http https://api.scrapfly.io/scrape \
key==$SCRAPFLY_KEY \
url==https://www.nytimes.com/2023/12/29/business/dealbook/stock-market-forecasts-2024.html \
asp==true \
render_js==true \
country==US \
extraction_model==article
```
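If you prefer a plain HTTP client over the SDKs, the same request can be assembled by hand. The sketch below only builds the URL, using the endpoint and query parameters from the HTTP example above; `build_scrape_url` and the `example.com` target are hypothetical names for illustration, and sending the request is left to whichever client you use.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://api.scrapfly.io/scrape"  # endpoint from the HTTP example


def build_scrape_url(api_key: str, target_url: str) -> str:
    """Assemble the GET URL with the same query parameters as the HTTP example."""
    params = {
        "key": api_key,
        "url": target_url,  # urlencode percent-encodes the target URL
        "asp": "true",
        "render_js": "true",
        "country": "US",
        "extraction_model": "article",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


request_url = build_scrape_url("YOUR_API_KEY", "https://example.com/news/a?ref=home")
print(request_url)

# Round-trip check: the target URL survives encoding intact.
decoded = parse_qs(urlsplit(request_url).query)
print(decoded["url"][0])
```

Encoding the target URL matters because its own query string (here `?ref=home`) would otherwise be read as extra parameters of the API call.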

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

 

 

---

## Automate with AI & Workflows

 Use AI assistants ([Claude](https://scrapfly.io/docs/mcp/integrations/claude-desktop), [ChatGPT](https://scrapfly.io/docs/mcp/integrations/openai)) or automation tools ([n8n](https://scrapfly.io/docs/mcp/integrations/n8n), [Make](https://scrapfly.io/integration/make), [Zapier](https://scrapfly.io/integration/zapier)) to monitor news sources from a simple prompt.

 

 ### What You Can Do

- **Monitor Headlines:** "Track breaking news about climate change from major outlets"
- **Research Topics:** "Gather latest articles on quantum computing from MIT News and Nature"
- **Track Mentions:** "Alert me when my company is mentioned in tech news"
- **Sentiment Analysis:** "How is the media covering this event across different sources?"
 
 



 

 ### How It Works

1. **Connect** the Scrapfly [MCP Server](https://scrapfly.io/docs/mcp/getting-started) to your AI assistant
2. **Tell it** what news or media you want to monitor
3. **Get updates** - AI gathers and summarizes articles for you
 
Ideal for journalists, researchers, PR teams, and staying informed on public web content.

 [ View AI Assistant Examples ](https://scrapfly.io/docs/mcp/examples) 

 



 

 

 

---

## Frequently Asked Questions

 

  ### IS NEWS WEB SCRAPING LEGAL?

 Scraping publicly accessible news pages is generally lawful in most jurisdictions when you respect each publisher's robots.txt, do not reproduce full article bodies for redistribution, and stay within fair-use boundaries. Always review the specific terms of service for each target and consult legal counsel for your use case. For more context see our [web scraping laws](https://scrapfly.io/is-web-scraping-legal) article.

 

   ### HOW DOES SCRAPFLY HANDLE ANTI-BOT PROTECTION ON NEWS SITES?

 News sites often deploy bot-detection stacks at the HTTP and browser level. Scrapfly bypasses these by sending genuine Chrome TLS and HTTP fingerprints (via Curlium) and, when JavaScript execution is required, running a patched stealth Chromium (Scrapium). The correct tool is selected automatically per target - you just call the API. For more, see our [Cloudflare bypass](https://scrapfly.io/bypass/cloudflare) and [DataDome bypass](https://scrapfly.io/bypass/datadome) guides.

 

   ### WHAT IS THE ARTICLE EXTRACTION MODEL?

 The `article` extraction model is a built-in AI schema in the Scrapfly Extraction API. It automatically identifies and returns the headline, author, publish date, section tags, and full body text from any news article page - without you writing CSS selectors or XPath. Pass `extraction_model=article` to the scrape endpoint and the structured data arrives alongside the raw HTML.
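Once the structured data arrives, it usually helps to map it onto a typed record before storing it. The sketch below shows one way to do that; the `Article` dataclass, the `to_article` helper, and every key name (`headline`, `author`, `date`, `tags`, `body`) are assumptions for illustration, so check the actual response of your scrape before relying on them.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Article:
    headline: str
    author: Optional[str] = None
    published: Optional[str] = None
    tags: list = field(default_factory=list)
    body: str = ""


def to_article(extracted: dict) -> Article:
    """Map an extracted-data dict onto Article, tolerating missing keys.

    The key names used here are assumptions, not a documented schema.
    """
    data: dict[str, Any] = extracted
    return Article(
        headline=data.get("headline", ""),
        author=data.get("author"),
        published=data.get("date"),
        tags=list(data.get("tags") or []),
        body=data.get("body", ""),
    )


# Hypothetical extracted payload with some fields missing.
sample = {"headline": "Markets rally", "date": "2024-01-02", "body": "Stocks rose."}
article = to_article(sample)
print(article.headline, article.author, article.tags)
```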

 

   ### WHAT IS A WEB SCRAPING API?

A [Web Scraping API](https://scrapfly.io/products/web-scraping-api) is a service that abstracts away the complexities of web scraping - anti-bot bypass, proxy management, JS rendering, and rate limiting - so developers can focus on extracting and using data rather than maintaining infrastructure.

 

   ### CAN I SCRAPE PAYWALLED NEWS ARTICLES?

 Scrapfly works with publicly accessible content on the web. Bypassing hard paywalls that require a paid subscription violates publisher terms of service and is not a supported use case. Many news sites offer free preview paragraphs or metered access - Scrapfly can extract that publicly visible content. For paywalled content, work directly with the publisher's data licensing program.

 

   ### HOW DO I ACCESS THE SCRAPFLY API?

The [Web Scraping API](https://scrapfly.io/products/web-scraping-api) can be called from any HTTP client - curl, httpie, or any language with an HTTP library. For first-class support we provide [Python](https://scrapfly.io/docs/sdk/python) and [TypeScript](https://scrapfly.io/docs/sdk/typescript) SDKs. Sign up for a free account to get 1,000 credits with no credit card required.

 

   ### ARE PROXIES ALONE ENOUGH TO SCRAPE NEWS SITES?

No. Modern news sites can identify datacenter proxies and flag them even when IP addresses rotate frequently. Effective access requires genuine browser fingerprints at the TLS, HTTP/2, and browser-JS layers - not just an IP change. The [Web Scraping API](https://scrapfly.io/products/web-scraping-api) combines proxy routing with full fingerprint coherence so your requests look like a real reader.

 

  

 

  ---

### Monitor the public web. Start in minutes.

1,000 free credits. No credit card required. Scale to millions of requests per month when you are ready.

 

 [ Get Free API Key ](https://scrapfly.io/register) [See all use cases](https://scrapfly.io/use-case/web-scraping)