WARC Format Reference

The WARC (Web ARChive) format is an industry-standard file format for archiving web content. The Scrapfly Crawler API uses WARC files to give you complete, archival-quality snapshots of your crawled data.

What is WARC?

WARC is standardized as ISO 28500:2017. A WARC file captures complete HTTP request/response pairs, including headers, status codes, and response bodies.

Key Benefits

  • Complete Data - Captures full HTTP transactions (request + response)
  • Industry Standard - Widely supported by archival and analysis tools
  • Compressed Storage - Gzip compression for efficient storage
  • Offline Processing - Query and analyze data without API calls
  • Long-term Archival - Format designed for preservation
  • Tool Ecosystem - Many libraries and tools available

WARC File Structure

A WARC file contains a series of records. Each record has:

  • WARC Headers - Metadata about the record (record type, IDs, timestamps)
  • HTTP Headers - HTTP request or response headers (if applicable)
  • Payload - The actual content (HTML, JSON, binary data, etc.)
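
The layering described above can be made concrete with a small parser. The following is a minimal sketch using only the Python standard library, assuming a single uncompressed record held in memory; the sample record and the `parse_record` helper are made up for illustration and are not part of any Scrapfly tooling.

```python
# Minimal sketch: split one uncompressed WARC record into its parts.
# The sample record below is invented for illustration.

def parse_record(raw: bytes):
    """Return (version, warc_headers, http_headers, payload) for one record."""
    # The WARC header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    warc_headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        warc_headers[name.strip()] = value.strip()

    # Content-Length counts the bytes of the record block that follows.
    block = rest[: int(warc_headers["Content-Length"])]

    http_headers, payload = None, block
    if warc_headers.get("WARC-Type") in ("request", "response"):
        # For HTTP records the block is "HTTP headers, blank line, body".
        http_head, _, payload = block.partition(b"\r\n\r\n")
        http_headers = http_head.decode("iso-8859-1").split("\r\n")
    return version, warc_headers, http_headers, payload


http_block = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>hello</html>"
)
warc_head = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    b"Content-Length: " + str(len(http_block)).encode("ascii") + b"\r\n"
    b"\r\n"
)
# Each record is followed by two CRLFs before the next record starts.
sample = warc_head + http_block + b"\r\n\r\n"

version, warc_headers, http_headers, payload = parse_record(sample)
print(warc_headers["WARC-Type"], http_headers[0], payload)
```

For real files, a dedicated reader is preferable, since it also handles gzip members, chunked bodies, and malformed records.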

Record Types

Record Type | Description | Content
warcinfo | File metadata and crawl information | Crawler version, settings, timestamps
request | HTTP request sent to the server | Request method, URL, headers, body
response | HTTP response received from the server | Status code, headers, response body (HTML, JSON, etc.)
conversion | Extracted/converted content | Markdown, text, or clean HTML extracted from the response

Scrapfly Custom WARC Headers

In addition to standard WARC headers, Scrapfly adds custom metadata to help you analyze and process your crawled data more effectively.

Custom Headers for All Records

Header | Type | Description
WARC-Scrape-Log-Id | String | Unique identifier for the scraping log entry. Use it to track individual page scrapes, look up detailed logs in the dashboard, and cross-reference with billing data.
WARC-Scrape-Country | String (ISO 3166-1 alpha-2) | Country code of the proxy used (e.g., US, GB, FR). Useful for analyzing geo-specific content variations.

Custom Headers for Response Records

Header | Type | Description
WARC-Scrape-Duration | Float (seconds) | Time taken to complete the HTTP request (e.g., 1.234). Useful for performance analysis and identifying slow pages.
WARC-Scrape-Retry | Integer | Number of retry attempts for this request (0 means the first attempt succeeded). Helps identify problematic URLs that required retries.

Example WARC Record with Custom Headers
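
The original example record is not reproduced here. As an illustrative stand-in, the sketch below shows what a response record's WARC header block might look like with the four custom headers, and converts them to the types listed in the tables above. Every value in `SAMPLE_HEADERS` is invented, and `scrape_meta` is a hypothetical helper name.

```python
from typing import Optional, TypedDict


class ScrapeMeta(TypedDict):
    log_id: Optional[str]
    country: Optional[str]
    duration: Optional[float]
    retry: Optional[int]


# Illustrative WARC header block; every value here is invented.
SAMPLE_HEADERS = """\
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://example.com/
WARC-Scrape-Log-Id: EXAMPLE-LOG-ID
WARC-Scrape-Country: US
WARC-Scrape-Duration: 1.234
WARC-Scrape-Retry: 0
"""


def scrape_meta(header_block: str) -> ScrapeMeta:
    """Extract the Scrapfly custom headers, using None for missing ones."""
    headers = {}
    for line in header_block.splitlines()[1:]:  # skip the "WARC/1.0" line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    duration = headers.get("WARC-Scrape-Duration")
    retry = headers.get("WARC-Scrape-Retry")
    return {
        "log_id": headers.get("WARC-Scrape-Log-Id"),
        "country": headers.get("WARC-Scrape-Country"),
        "duration": float(duration) if duration is not None else None,
        "retry": int(retry) if retry is not None else None,
    }


print(scrape_meta(SAMPLE_HEADERS))
```

Returning None for absent headers keeps the helper usable on record types that do not carry the response-only headers.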

Downloading WARC Files

WARC files are available once your crawler completes (is_finished: true).

The file is returned as crawl.warc.gz (gzip-compressed for efficient transfer).

Reading WARC Files

WARC files can be read using various tools and libraries in different programming languages.

Python - warcio Library

warcio is the recommended Python library for reading WARC files.

Installation

Reading WARC Files

Filtering Specific Records

JavaScript/Node.js - node-warc

node-warc provides WARC parsing for Node.js applications.

Installation

Reading WARC Files

Java - jwat

JWAT is a Java library for reading and writing WARC files.

Maven Dependency

Reading WARC Files

Go - go-warc

gowarc is a Go library for reading and writing WARC files.

Installation

Reading WARC Files

Rust - warc_parser

warc_parser is a high-performance Rust library for reading WARC files, originally developed for Common Crawl.

Installation

Reading WARC Files

Performance Filtering

Rust's performance makes it ideal for processing large WARC archives efficiently.

C++ - warcpp

warcpp is a single-header C++ parser for WARC files with modern error handling using std::variant.

Installation

Reading WARC Files

Efficient Error Handling

warcpp uses std::variant for type-safe error handling without exceptions.

PHP - Mixnode WARC Reader

mixnode-warcreader-php provides native PHP support for reading WARC files, both raw and gzipped.

Installation

Reading WARC Files

Filtering Specific Records

Command-Line Tools

warcio (Python CLI)

Extract and inspect WARC files from the command line.

zgrep - Search Compressed WARC

Search for specific content without decompressing.

gunzip - Decompress WARC

Common Use Cases

Long-term Archival

Store complete snapshots of websites for historical preservation, compliance, or research purposes using an industry-standard format.

Offline Analysis

Download once and analyze locally without additional API calls. Perfect for data science, ML training sets, or bulk processing.

Performance Monitoring

Use WARC-Scrape-Duration and WARC-Scrape-Retry to identify slow pages, analyze performance patterns, and optimize crawling strategies.
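
As a rough sketch of such an analysis, the function below summarizes header values that have already been pulled out of the response records. The input shape (a tuple of URL, duration header, retry header) and the 2-second threshold are assumptions made for this example.

```python
from statistics import mean


def performance_report(rows, slow_threshold=2.0):
    """Summarize scrape timing.

    rows: iterable of (url, duration_header, retry_header), where the
    header values are strings as read from the WARC record, or None.
    """
    durations, slow, retried = [], [], []
    for url, duration, retry in rows:
        if duration is not None:
            seconds = float(duration)
            durations.append(seconds)
            if seconds > slow_threshold:
                slow.append((url, seconds))
        if retry is not None and int(retry) > 0:
            retried.append((url, int(retry)))
    return {
        "count": len(durations),
        "mean_seconds": mean(durations) if durations else None,
        # Slowest pages first.
        "slow": sorted(slow, key=lambda item: item[1], reverse=True),
        "retried": retried,
    }


report = performance_report([
    ("https://example.com/a", "0.5", "0"),
    ("https://example.com/b", "3.0", "2"),
    ("https://example.com/c", None, None),  # headers missing
])
print(report)
```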

Geo-specific Analysis

Compare content variations across regions using WARC-Scrape-Country. Analyze geo-blocking, localized pricing, or regional content differences.
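
One simple way to spot regional differences is to hash each response body and compare the digests per URL. The sketch below assumes the (url, country, body) triples have already been extracted from the response records; `geo_variations` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict


def geo_variations(rows):
    """Return {url: {country: sha256}} for URLs whose body differs by country.

    rows: iterable of (url, country, body_bytes); records without a
    country header (country is None) are ignored.
    """
    digests = defaultdict(dict)  # url -> {country: sha256 hex digest}
    for url, country, body in rows:
        if country is None:
            continue
        digests[url][country] = hashlib.sha256(body).hexdigest()
    return {
        url: by_country
        for url, by_country in digests.items()
        if len(set(by_country.values())) > 1
    }


varying = geo_variations([
    ("https://example.com/price", "US", b"$10"),
    ("https://example.com/price", "FR", b"10 EUR"),
    ("https://example.com/about", "US", b"About us"),
    ("https://example.com/about", "FR", b"About us"),
])
print(sorted(varying))
```

Byte-exact hashing is strict: pages with timestamps or session tokens will always differ, so comparing the extracted text of conversion records may give cleaner results.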

Converting WARC to Parquet

Convert WARC archives to Apache Parquet format for efficient querying, analytics, and long-term storage. Parquet's columnar layout lets query engines read only the columns a query needs, and bloom filters on the URL column let them skip row groups that cannot contain a given URL.

Python Implementation with Bloom Filters

This example converts WARC to Parquet with bloom filter indexing on URLs for fast lookups.

Installation

Conversion Script

Querying Parquet with DuckDB

Once converted to Parquet, you can query your crawl data with SQL. With bloom filters in place, exact-match URL lookups remain fast even on multi-GB files.

Partitioning for Large Crawls

For crawls with millions of URLs, partition by date or country for even faster queries.

Best Practices

Recommended Practices

  • Keep files compressed - Use .warc.gz for storage efficiency (10x+ compression)
  • Use streaming readers - Process large files without loading them into memory
  • Index WARC-Scrape-Log-Id - For fast lookups and cross-referencing
  • Store original WARC files - For audit trails and reprocessing
  • Leverage custom headers - For analytics and debugging
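
Indexing WARC-Scrape-Log-Id could look like the following sketch, which stores one row per record in SQLite. The table layout and the helper names are assumptions for illustration; the byte offset is whatever position your reader reports for the start of each record. The index is deliberately non-unique, on the assumption that a request and its response may share a log ID.

```python
import sqlite3


def build_log_index(rows, db_path=":memory:"):
    """Index records by log ID.

    rows: iterable of (log_id, rec_type, url, byte_offset); rows whose
    log_id is None (header missing) are skipped.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "log_id TEXT NOT NULL, rec_type TEXT, url TEXT, byte_offset INTEGER)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_log_id ON records (log_id)")
    conn.executemany(
        "INSERT INTO records VALUES (?, ?, ?, ?)",
        (row for row in rows if row[0] is not None),
    )
    conn.commit()
    return conn


def lookup(conn, log_id):
    """Return [(rec_type, url, byte_offset), ...] for one log ID."""
    return conn.execute(
        "SELECT rec_type, url, byte_offset FROM records "
        "WHERE log_id = ? ORDER BY byte_offset",
        (log_id,),
    ).fetchall()


conn = build_log_index([
    ("id-1", "request", "https://example.com/", 0),
    ("id-1", "response", "https://example.com/", 512),
    (None, "warcinfo", None, 1024),
])
print(lookup(conn, "id-1"))
```

Pass a file path as `db_path` to keep the index alongside the archive instead of rebuilding it on each run.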

Common Pitfalls

  • Don't load entire files into memory - Use streaming iterators instead
  • Remember to decompress - Open .warc.gz files with gzip.open when parsing by hand; readers such as warcio decompress automatically
  • Multiple records per URL - WARC files may contain retries and redirects
  • Custom headers are optional - Check for None before using them
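
The last two pitfalls can be handled together. The sketch below keeps one record per URL, preferring the attempt with the highest WARC-Scrape-Retry value and treating a missing header as 0. It assumes each attempt appears as its own record, which the source does not state explicitly, and `final_attempts` is a hypothetical helper name.

```python
def final_attempts(rows):
    """Keep one entry per URL, preferring the latest attempt.

    rows: iterable of (url, retry_header, payload) in file order, where
    retry_header is the WARC-Scrape-Retry value as a string, or None.
    Returns {url: (retry, payload)}; ties go to the record seen last.
    """
    best = {}
    for url, retry_header, payload in rows:
        # The custom header is optional, so guard against None.
        retry = int(retry_header) if retry_header is not None else 0
        if url not in best or retry >= best[url][0]:
            best[url] = (retry, payload)
    return best


result = final_attempts([
    ("https://example.com/a", "0", b"error page"),
    ("https://example.com/a", "1", b"real page"),
    ("https://example.com/b", None, b"no retry header"),
])
print(result)
```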

Next Steps

External Resources

Summary