WARC Format Reference
The WARC (Web ARChive) format is an industry-standard file format for archiving web content. Scrapfly Crawler API uses WARC files to provide you with complete, archival-quality snapshots of your crawled data.
WARC files are the most efficient way to retrieve and archive crawled data, especially for large crawls spanning hundreds to thousands of pages or more. They provide complete HTTP transaction data in a compressed, industry-standard format that can be processed offline without additional API calls.
What is WARC?
WARC (Web ARChive) is an ISO standard (ISO 28500:2017) for archiving web content. It captures complete HTTP request/response pairs, including headers, status codes, and response bodies.
Key Benefits
- Complete Data - Captures full HTTP transactions (request + response)
- Industry Standard - Universally supported by archival and analysis tools
- Compressed Storage - Gzip compression for efficient storage
- Offline Processing - Query and analyze data without API calls
- Long-term Archival - Format designed for preservation
- Tool Ecosystem - Many libraries and tools available
WARC File Structure
A WARC file contains a series of records. Each record has:
- WARC Headers - Metadata about the record (record type, IDs, timestamps)
- HTTP Headers - HTTP request or response headers (if applicable)
- Payload - The actual content (HTML, JSON, binary data, etc.)
Record Types
| Record Type | Description | Content |
|---|---|---|
| warcinfo | File metadata and crawl information | Crawler version, settings, timestamps |
| request | HTTP request sent to the server | Request method, URL, headers, body |
| response | HTTP response received from server | Status code, headers, response body (HTML, JSON, etc.) |
| conversion | Extracted/converted content | Markdown, text, or clean HTML extracted from response |
Scrapfly Custom WARC Headers
In addition to standard WARC headers, Scrapfly adds custom metadata to help you analyze and process your crawled data more effectively.
Custom Headers for All Records
| Header | Type | Description |
|---|---|---|
| WARC-Scrape-Log-Id | String | Unique identifier for the scraping log entry. Use it to cross-reference WARC records with their scrape logs for fast lookups, audit trails, and debugging. |
| WARC-Scrape-Country | String (ISO 3166) | ISO 3166-1 alpha-2 country code of the proxy used (e.g., US, GB, FR). Useful for analyzing geo-specific content variations. |
Custom Headers for Response Records
| Header | Type | Description |
|---|---|---|
| WARC-Scrape-Duration | Float (seconds) | Time taken to complete the HTTP request in seconds (e.g., 1.234). Useful for performance analysis and identifying slow pages. |
| WARC-Scrape-Retry | Integer | Number of retry attempts for this request (0 means the first attempt succeeded). Helps identify problematic URLs that required retries. |
Example WARC Record with Custom Headers
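A response record carrying the custom headers looks like the following. The URL, record ID, date, log ID, and content length below are illustrative placeholders; the header names match the tables above.

```
WARC/1.1
WARC-Type: response
WARC-Record-ID: <urn:uuid:a1b2c3d4-0000-0000-0000-000000000000>
WARC-Date: 2024-01-15T10:30:00Z
WARC-Target-URI: https://example.com/products
WARC-Scrape-Log-Id: 0123456789abcdef
WARC-Scrape-Country: US
WARC-Scrape-Duration: 1.234
WARC-Scrape-Retry: 0
Content-Type: application/http; msgtype=response
Content-Length: 5120

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!doctype html>
<html>...</html>
```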
Downloading WARC Files
WARC files are available once your crawler completes (is_finished: true). The file is returned as crawl.warc.gz, gzip-compressed for efficient transfer.
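As a minimal sketch, the archive can be streamed to disk with any HTTP client. The URL and API key below are placeholders; the actual download URL comes from your crawler results (see the retrieval methods documentation).

```python
import requests

# Placeholder values: obtain the real WARC download URL and your API key
# from the crawler results / retrieval methods documentation.
warc_url = "https://api.scrapfly.io/..."  # hypothetical, replace with the real URL
params = {"key": "YOUR_API_KEY"}

# Stream the gzip-compressed archive to disk without loading it into memory.
with requests.get(warc_url, params=params, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    with open("crawl.warc.gz", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```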
Reading WARC Files
WARC files can be read using various tools and libraries in different programming languages.
Python - warcio Library
warcio is the recommended Python library for reading WARC files.
Installation
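Install from PyPI:

```
pip install warcio
```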
Reading WARC Files
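A minimal sketch that iterates over every record in the compressed archive (ArchiveIterator handles the gzip layer transparently); the file name crawl.warc.gz is assumed:

```python
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    # ArchiveIterator streams records one at a time, so large archives
    # never need to be loaded fully into memory.
    for record in ArchiveIterator(stream):
        rec_type = record.rec_type  # warcinfo, request, response, conversion
        url = record.rec_headers.get_header("WARC-Target-URI")
        if rec_type == "response":
            status = record.http_headers.get_statuscode() if record.http_headers else None
            body = record.content_stream().read()  # raw payload bytes
            print(rec_type, url, status, len(body))
```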
Filtering Specific Records
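To pull out only what you need, filter on the record type and the Scrapfly custom headers. A sketch that yields only response records, optionally restricted to a given proxy country:

```python
from warcio.archiveiterator import ArchiveIterator

def iter_responses(path, country=None):
    """Yield (url, record) for response records, optionally filtered by proxy country."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            if country and record.rec_headers.get_header("WARC-Scrape-Country") != country:
                continue
            yield record.rec_headers.get_header("WARC-Target-URI"), record

for url, record in iter_responses("crawl.warc.gz", country="US"):
    html = record.content_stream().read()
    print(url, len(html))
```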
JavaScript/Node.js - node-warc
node-warc provides WARC parsing for Node.js applications.
Installation
Reading WARC Files
Java - jwat
JWAT is a Java library for reading and writing WARC files.
Maven Dependency
Reading WARC Files
Go - go-warc
gowarc is a Go library for reading and writing WARC files.
Installation
Reading WARC Files
Rust - warc_parser
warc_parser is a high-performance Rust library for reading WARC files, originally developed for Common Crawl.
Installation
Reading WARC Files
Performance Filtering
Rust's performance makes it ideal for processing large WARC archives efficiently.
C++ - warcpp
warcpp is a single-header C++ parser for WARC files with modern error handling using std::variant.
Installation
Reading WARC Files
Efficient Error Handling
warcpp uses std::variant for type-safe error handling without exceptions.
PHP - Mixnode WARC Reader
mixnode-warcreader-php provides native PHP support for reading WARC files, both raw and gzipped.
Installation
Reading WARC Files
Filtering Specific Records
Command-Line Tools
warcio (Python CLI)
Extract and inspect WARC files from the command line.
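For example, assuming warcio is installed and the archive is named crawl.warc.gz:

```
# List records as JSON lines (offset, record type, target URI by default)
warcio index crawl.warc.gz

# Verify record digests and gzip integrity
warcio check crawl.warc.gz
```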
zgrep - Search Compressed WARC
Search for specific content without decompressing.
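For example (file name assumed to be crawl.warc.gz):

```
# Count response records in the compressed archive
zgrep -c "WARC-Type: response" crawl.warc.gz

# Find records for a specific URL
zgrep "WARC-Target-URI: https://example.com/" crawl.warc.gz
```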
gunzip - Decompress WARC
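To work with the uncompressed .warc file directly:

```
# -k keeps the original .gz file; drop it if you no longer need the compressed copy
gunzip -k crawl.warc.gz

# or stream-decompress to a new file
gunzip -c crawl.warc.gz > crawl.warc
```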
Common Use Cases
Long-term Archival
Store complete snapshots of websites for historical preservation, compliance, or research purposes using an industry-standard format.
Offline Analysis
Download once and analyze locally without additional API calls. Perfect for data science, ML training sets, or bulk processing.
Performance Monitoring
Use WARC-Scrape-Duration and WARC-Scrape-Retry to identify slow pages, analyze performance patterns, and optimize crawling strategies.
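As a sketch of what this looks like in practice, the following scans response records with warcio and reports the slowest pages along with their retry counts (header names as documented above; file name assumed):

```python
from warcio.archiveiterator import ArchiveIterator

timings = []  # (duration_seconds, retries, url)
with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        headers = record.rec_headers
        duration = float(headers.get_header("WARC-Scrape-Duration") or 0)
        retries = int(headers.get_header("WARC-Scrape-Retry") or 0)
        timings.append((duration, retries, headers.get_header("WARC-Target-URI")))

# Ten slowest pages
for duration, retries, url in sorted(timings, reverse=True)[:10]:
    print(f"{duration:>7.3f}s  retries={retries}  {url}")
```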
Geo-specific Analysis
Compare content variations across regions using WARC-Scrape-Country. Analyze geo-blocking, localized pricing, or regional content differences.
Converting WARC to Parquet
Convert WARC archives to Apache Parquet format for efficient querying, analytics, and long-term storage. Parquet's columnar format with bloom filter indexing enables lightning-fast URL lookups and SQL-based analysis.
Why Parquet?
- Columnar storage: Query only the columns you need (URL, status, country) without reading entire records
- Bloom filters: O(1) URL lookups instead of scanning entire archives
- Compression: 5-10x better compression than gzipped WARC
- SQL queries: Use DuckDB, ClickHouse, or Spark for complex analysis
- Schema evolution: Add new columns without rewriting data
Python Implementation with Bloom Filters
This example converts WARC to Parquet with bloom filter indexing on URLs for fast lookups.
Installation
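Assuming the conversion is done with warcio and pyarrow and queried with DuckDB:

```
pip install warcio pyarrow duckdb
```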
Conversion Script
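A minimal sketch of the conversion: it flattens each response record into one row (URL, status, country, duration, retries, payload size) and writes a ZSTD-compressed Parquet file sorted by URL. Note that bloom filter writing is exposed only in recent pyarrow versions, so check the pyarrow.parquet writer options for your version; this sketch relies on sorting plus column statistics.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from warcio.archiveiterator import ArchiveIterator

def warc_to_rows(path):
    """Flatten each response record into one dict per row."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            h = record.rec_headers
            yield {
                "url": h.get_header("WARC-Target-URI"),
                "date": h.get_header("WARC-Date"),
                "status": int(record.http_headers.get_statuscode() or 0)
                          if record.http_headers else None,
                "country": h.get_header("WARC-Scrape-Country"),
                "duration": float(h.get_header("WARC-Scrape-Duration") or 0),
                "retries": int(h.get_header("WARC-Scrape-Retry") or 0),
                "body_size": len(record.content_stream().read()),
            }

# Sorting by URL keeps similar URLs in the same row groups, which improves
# row-group pruning when querying by URL.
rows = sorted(warc_to_rows("crawl.warc.gz"), key=lambda r: r["url"] or "")
table = pa.Table.from_pylist(rows)

pq.write_table(
    table,
    "crawl.parquet",
    compression="zstd",       # good speed/size balance, as noted below
    row_group_size=100_000,   # smaller row groups improve selectivity
    write_statistics=True,    # min/max stats enable row-group pruning
    # Bloom filter options are version-dependent in pyarrow; consult the
    # pyarrow.parquet writer documentation for your installed version.
)
```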
Querying Parquet with DuckDB
Once converted to Parquet, you can query your crawl data with SQL. Bloom filters make URL lookups instant, even on multi-GB files.
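A sketch using the DuckDB Python API; the column names match the conversion script above:

```python
import duckdb

# Point lookup: statistics (and bloom filters, if written) let DuckDB skip
# row groups that cannot contain this URL.
duckdb.sql("""
    SELECT url, status, duration
    FROM 'crawl.parquet'
    WHERE url = 'https://example.com/products'
""").show()

# Aggregate analysis, e.g. geo-specific differences and slow regions
duckdb.sql("""
    SELECT country, count(*) AS pages, avg(duration) AS avg_seconds
    FROM 'crawl.parquet'
    GROUP BY country
    ORDER BY avg_seconds DESC
""").show()
```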
Partitioning for Large Crawls
For crawls with millions of URLs, partition by date or country for even faster queries.
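For example, with pyarrow you can write a Hive-partitioned file tree by country (a date column would work the same way), and DuckDB will then scan only the relevant partitions:

```python
import duckdb
import pyarrow.parquet as pq

# Re-read the single-file output from the conversion script above and
# rewrite it as crawl_parquet/country=US/..., crawl_parquet/country=GB/..., etc.
table = pq.read_table("crawl.parquet")
pq.write_to_dataset(table, root_path="crawl_parquet", partition_cols=["country"])

# Only the country=US files are scanned for this query.
duckdb.sql("""
    SELECT count(*) AS pages
    FROM read_parquet('crawl_parquet/**/*.parquet', hive_partitioning = true)
    WHERE country = 'US'
""").show()
```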
Performance Tips
- Bloom filters: Always enable on URL column for O(1) lookups
- Partitioning: Partition large datasets by country or date to query only relevant files
- Compression: Use ZSTD for best balance of speed and compression (better than GZIP)
- Row groups: Smaller row groups (50k-100k) improve query selectivity
- Statistics: Enable column statistics for query optimization
Best Practices
Recommended Practices
- Use .warc.gz for storage efficiency (10x+ compression)
- Process large files as streams, without loading them fully into memory
- Keep the WARC-Scrape-Log-Id of each record: for fast lookups and cross-referencing, for audit trails and reprocessing, and for analytics and debugging
Common Pitfalls
- Loading a whole archive into memory: use streaming iterators instead
- Treating .warc.gz as plain text: use gzip.open (or a gzip-aware WARC library) before reading
- Assuming one record per URL: WARC files may contain retries and redirects
- Assuming every record has HTTP headers: check for None before using
Next Steps
- Learn about all retrieval methods available for crawler results
- Understand crawler billing and how WARC downloads are charged
- Explore crawler configuration options
- View the complete crawler API specification
External Resources
- ISO 28500:2017 WARC Standard - Official WARC specification
- warcio (Python) - Recommended Python library
- node-warc (JavaScript) - Node.js WARC library
- JWAT (Java) - Java WARC library
- gowarc (Go) - Go WARC library