Streaming WARC (and ARC) IO library

1.8.1 · active · verified Sun Apr 12

warcio is a Python library (v1.8.1) for fast, low-level, streaming input and output of Web ARChive (WARC) and ARC files, following the WARC 1.0 and 1.1 ISO standards. It processes an archive as a stream of records rather than loading whole files. Developed by Webrecorder, it can both read existing archives and capture HTTP/S traffic directly into WARC files. The library is actively maintained, and recent updates add support for remote sources such as S3 and HTTPS URLs.
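
To make the record-at-a-time model concrete, here is a rough standard-library sketch of walking an uncompressed WARC stream one record at a time. It is not warcio's implementation: the function name `iter_warc_records` and the sample bytes are hypothetical, and it assumes well-formed, uncompressed records with a `Content-Length` header. Real archives (gzip members, ARC files, malformed records) are what the library is for.

```python
import io


def iter_warc_records(stream):
    """Yield (version, headers, block) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration only; not part of warcio.
    """
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # empty line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly the record block; nothing beyond it is buffered.
        block = stream.read(int(headers["Content-Length"]))
        yield version.strip().decode("ascii"), headers, block


# Two tiny made-up 'resource' records.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/a\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/b\r\n"
    b"Content-Length: 4\r\n"
    b"\r\n"
    b"bye!"
    b"\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(sample)))
for version, headers, block in records:
    print(version, headers["WARC-Target-URI"], block)
```

Because the generator only reads one record's headers and block before yielding, memory use stays bounded by the largest record rather than the size of the archive.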

Warnings

Install

Imports

Quickstart

This quickstart demonstrates both writing and reading WARC files. The writing section uses `warcio.capture_http` to record the HTTP traffic of a `requests` call into a WARC file; it is commented out by default because it makes a live network request. The reading section iterates over the resulting file with `warcio.archiveiterator.ArchiveIterator` and prints details of each `response`, `request`, and `warcinfo` record. A commented-out S3 URL is included as an example of a remote source.

import os
from warcio.capture_http import capture_http
import requests  # must be imported after capture_http so its traffic can be captured
from warcio.archiveiterator import ArchiveIterator

# --- Writing a WARC file by capturing HTTP traffic ---
output_warc_file = 'example.warc.gz'

# capture_http works by monkey-patching http.client, so requests must be
# imported AFTER warcio.capture_http.
# Uncomment the block below to make a live request and record it.
# WARC-IP-Address is added by capture_http itself when the remote IP is
# available; it is not set through an environment variable.
# with capture_http(output_warc_file, warc_version='1.1') as writer:
#     resp = requests.get('http://httpbin.org/get?q=test')
#     print(f"Captured GET request to {resp.url} with status {resp.status_code}")
#
# print(f"WARC file '{output_warc_file}' created successfully.")

# --- Reading records from the WARC file ---
# Example of a remote source (for S3, ensure warcio[s3] is installed).
# Note: the built-in open() used below only handles local paths, so a remote
# URL needs a different way of opening the stream.
# remote_warc_url = 's3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1701389650426.47/warc/CC-MAIN-20231130201438-00000-ip-10-2-12-106.warc.gz'
# The local file must exist from the writing step, or provide your own.
input_warc_source = output_warc_file

if os.path.exists(input_warc_source):
    print(f"\n--- Reading records from '{input_warc_source}' ---")
    try:
        with open(input_warc_source, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == 'response':
                    uri = record.rec_headers.get_header('WARC-Target-URI')
                    # http_headers is None for records without an HTTP envelope
                    status = record.http_headers.get_statuscode() if record.http_headers else 'N/A'
                    print(f"  Response Record: URI={uri}, Status={status}")
                elif record.rec_type == 'request':
                    uri = record.rec_headers.get_header('WARC-Target-URI')
                    print(f"  Request Record: URI={uri}")
                elif record.rec_type == 'warcinfo':
                    filename = record.rec_headers.get_header('WARC-Filename')
                    print(f"  Warcinfo Record: Filename={filename}")
    except FileNotFoundError:
        # The file was removed between the existence check and open()
        print(f"Error: Local WARC file '{input_warc_source}' not found. Skipping read example.")
    except Exception as e:
        print(f"An error occurred while reading the WARC file: {e}")
else:
    print(f"Local WARC file '{input_warc_source}' not found. Skipping read example. Uncomment the writing section to create it.")
