FastWARC
FastWARC is a high-performance Python library for parsing WARC (Web ARChive) files, written in C++/Cython. It supports WARC/1.0 and WARC/1.1 streams with GZip and LZ4 compression and is designed to be substantially faster than pure-Python alternatives such as WARCIO. FastWARC is part of the ChatNoir Resiliparse toolkit. The current version is 0.16.0, and the library is under active development.
Warnings
- gotcha FastWARC is not a drop-in replacement for WARCIO. Its API is inspired by WARCIO but designed for performance, meaning direct migration may require code adjustments.
- gotcha Malformed WARC records (e.g., missing Content-Length, non-standard line endings) in archives like ClueWeb can cause parsing issues. By default, `strict_mode=True`, which may cause iteration to terminate early on such records.
- gotcha Automatic HTTP parsing (`parse_http=True`) incurs a performance overhead. If you only need WARC metadata or raw content and not parsed HTTP headers, pass `parse_http=False` to skip it.
- gotcha Verifying record digests (e.g., `record.verify_block_digest()`) creates an in-memory copy of the remaining record stream to preserve its contents for further processing. This can consume significant memory for very large records.
- breaking FastWARC explicitly does not support the legacy ARC (Archive Record) format for simplicity and performance reasons.
- gotcha Pre-built Linux binaries are compiled on an older `manylinux` base system for compatibility, which may not offer optimal performance on modern systems.
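The digest warning above follows from how stream hashing works: computing a digest reads the stream to its end, so the bytes are gone unless a copy is kept. The sketch below illustrates this with the standard library only. It is not FastWARC's implementation; `block_digest` is a hypothetical helper, and the `sha1:` plus Base32 output format is assumed from the convention commonly seen in `WARC-Block-Digest` headers.

```python
import base64
import hashlib
import io


def block_digest(stream, algorithm="sha1", chunk_size=64 * 1024):
    """Hash a record block chunk by chunk and format it like a WARC digest value."""
    hasher = hashlib.new(algorithm)
    # Chunked reads keep memory use bounded, but they consume the stream:
    # the bytes cannot be read again unless the caller kept a copy or can seek back.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    encoded = base64.b32encode(hasher.digest()).decode("ascii")
    return f"{algorithm}:{encoded}"


# Made-up record block used only for this illustration
block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nHello, FastWARC!\n"
stream = io.BytesIO(block)

digest = block_digest(stream)
leftover = stream.read()  # nothing is left to read after hashing

print(digest)
print(f"Bytes left in stream after hashing: {len(leftover)}")
```

A library that wants the record to stay readable after verification therefore has to buffer what it hashes, which is where the memory cost for very large records comes from.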
Install
- pip install fastwarc
- pip install fastwarc[fsspec]
- sudo apt install build-essential python3-dev zlib1g-dev liblz4-dev && pip install --no-binary fastwarc fastwarc
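After installing, it can be useful to confirm which version ended up in the environment. A minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); `installed_version` is a hypothetical helper name, not part of FastWARC.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("fastwarc")
if version is None:
    print("FastWARC is not installed; run: pip install fastwarc")
else:
    print(f"FastWARC {version} is installed")
```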
Imports
- ArchiveIterator
from fastwarc.warc import ArchiveIterator
- WarcRecord
from fastwarc.warc import WarcRecord
- WarcRecordType
from fastwarc.warc import WarcRecordType
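The imported names are typically used together: `ArchiveIterator` walks a binary stream and `WarcRecordType` selects which records it yields. The sketch below collects the target URIs of response records with HTTP parsing switched off. `sample_warc` and `response_uris` are hypothetical helpers and the record contents are made-up demo data. Extra keyword arguments are forwarded to `ArchiveIterator` unchanged, so options named in the warnings (such as `strict_mode`) can be passed through; their exact behaviour should be checked against the FastWARC documentation.

```python
import io


def sample_warc():
    """Build a two-record WARC/1.0 byte string in memory (made-up demo data)."""
    def record(headers, body):
        # Content-Length is derived from the body; each record ends with two CRLFs.
        head = [b"WARC/1.0"] + headers + [b"Content-Length: %d" % len(body)]
        return b"\r\n".join(head) + b"\r\n\r\n" + body + b"\r\n\r\n"

    info = record(
        [b"WARC-Type: warcinfo",
         b"WARC-Date: 2023-01-01T12:00:00Z",
         b"WARC-Record-ID: <urn:uuid:example-warcinfo>"],
        b"software: example\r\n",
    )
    response = record(
        [b"WARC-Type: response",
         b"WARC-Date: 2023-01-01T12:00:01Z",
         b"WARC-Record-ID: <urn:uuid:example-response>",
         b"WARC-Target-URI: http://example.com/",
         b"Content-Type: application/http; msgtype=response"],
        b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nHello\n",
    )
    return info + response


def response_uris(stream, **iterator_options):
    """Collect WARC-Target-URI values of response records without parsing HTTP."""
    # Imported inside the function so this module still loads where FastWARC is absent.
    from fastwarc.warc import ArchiveIterator, WarcRecordType

    options = {"record_types": WarcRecordType.response, "parse_http": False}
    options.update(iterator_options)  # caller-supplied options override the defaults
    return [record.headers.get("WARC-Target-URI")
            for record in ArchiveIterator(stream, **options)]
```

With FastWARC installed, `response_uris(io.BytesIO(sample_warc()))` should list only the URI of the response record, since the `warcinfo` record is filtered out by `record_types`.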
Quickstart
import os
from fastwarc.warc import ArchiveIterator, WarcRecordType

def make_record(headers, body):
    # Serialize one WARC/1.0 record. Content-Length is computed from the body,
    # and the record ends with the two CRLFs the WARC format requires.
    head = [b'WARC/1.0'] + headers + [b'Content-Length: %d' % len(body)]
    return b'\r\n'.join(head) + b'\r\n\r\n' + body + b'\r\n\r\n'

# Create a dummy WARC file for demonstration purposes
warc_path = 'example.warc'
info_body = b'info: This is a dummy WARC file created for the FastWARC quickstart example.\n'
http_body = b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nHello, FastWARC!\n'
with open(warc_path, 'wb') as f:
    f.write(make_record([b'WARC-Type: warcinfo', b'WARC-Date: 2023-01-01T12:00:00Z',
                         b'WARC-Record-ID: <urn:uuid:example-warcinfo>'], info_body))
    # The WARC Content-Type header marks the response block as an HTTP message
    f.write(make_record([b'WARC-Type: response', b'WARC-Date: 2023-01-01T12:00:01Z',
                         b'WARC-Record-ID: <urn:uuid:example-response>',
                         b'WARC-Target-URI: http://example.com/',
                         b'Content-Type: application/http; msgtype=response'], http_body))

try:
    # ArchiveIterator reads from a binary stream; parse HTTP for response records
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream, parse_http=True):
            if record.record_type == WarcRecordType.warcinfo:
                print(f"WARC Info Record ID: {record.record_id}")
                print(f"Content: {record.reader.read().decode('utf-8').strip()}")
            elif record.record_type == WarcRecordType.response:
                print(f"\nResponse Record URL: {record.headers.get('WARC-Target-URI')}")
                if record.http_headers is not None:
                    print(f"HTTP Status: {record.http_headers.status_code}")
                print(f"Payload: {record.reader.read().decode('utf-8').strip()}")
except Exception as e:
    print(f"An error occurred during WARC processing: {e}")
finally:
    # Clean up the dummy WARC file
    os.remove(warc_path)