FastWARC

0.16.0 · active · verified Tue Apr 14

FastWARC is a high-performance Python library for parsing WARC (Web ARChive) files, written in C++ and Cython. It reads WARC/1.0 and WARC/1.1 streams, including ones compressed with GZip or LZ4, and is considerably faster than pure-Python alternatives such as WARCIO. FastWARC is part of the ChatNoir Resiliparse toolkit; the current version is 0.16.0 and the project is actively developed.

Quickstart

This quickstart iterates over the records of a WARC file with `ArchiveIterator`. It shows how to access record metadata such as the record ID and the target URI, and how to read a record's content. For HTTP response records, it also shows how to access the parsed HTTP headers. The example first writes a small dummy WARC file so that it can be run as-is.

import os
from fastwarc.warc import ArchiveIterator, WarcRecordType

# Create a dummy WARC file for demonstration purposes. Content-Length is computed
# from each block, and every record ends with two CRLFs, so the file is well-formed.
warcinfo_block = b'info: This is a dummy WARC file created for FastWARC quickstart example.\n'
warcinfo_record = (
    b'WARC/1.0\r\nWARC-Type: warcinfo\r\nWARC-Date: 2023-01-01T12:00:00Z\r\n'
    b'WARC-Record-ID: <urn:uuid:example-warcinfo>\r\n'
    b'Content-Length: %d\r\n\r\n' % len(warcinfo_block)
) + warcinfo_block + b'\r\n\r\n'

http_block = b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nHello, FastWARC!\n'
response_record = (
    b'WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2023-01-01T12:00:01Z\r\n'
    b'WARC-Record-ID: <urn:uuid:example-response>\r\n'
    b'WARC-Target-URI: http://example.com/\r\n'
    # Declare the block as an HTTP message so that its HTTP headers get parsed
    b'Content-Type: application/http; msgtype=response\r\n'
    b'Content-Length: %d\r\n\r\n' % len(http_block)
) + http_block + b'\r\n\r\n'

warc_path = 'example.warc'
with open(warc_path, 'wb') as f:
    f.write(warcinfo_record + response_record)

try:
    # ArchiveIterator expects a binary stream rather than a file path
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream, parse_http=True):
            if record.record_type == WarcRecordType.warcinfo:
                print(f"WARC Info Record ID: {record.record_id}")
                print(f"Content: {record.reader.read().decode('utf-8').strip()}")
            elif record.record_type == WarcRecordType.response:
                # The target URL is stored in the WARC-Target-URI header
                print(f"\nResponse Record URL: {record.headers['WARC-Target-URI']}")
                if record.http_headers is not None:
                    print(f"HTTP Status: {record.http_headers.status_code}")
                # With HTTP parsed, the reader yields the payload after the HTTP headers
                print(f"Payload: {record.reader.read().decode('utf-8').strip()}")
except Exception as e:
    print(f"An error occurred during WARC processing: {e}")
finally:
    # Clean up the dummy WARC file
    os.remove(warc_path)
