Streaming WARC (and ARC) IO library
warcio is a Python library (v1.8.1) for fast, low-level, streaming input/output of Web ARChive (WARC) and ARC files, following the WARC 1.0 and 1.1 ISO standards. It processes an archive as a stream of records rather than loading entire files. Developed by Webrecorder, it supports both reading existing archives and capturing HTTP/S traffic directly into WARC files. The library is actively maintained; recent updates add support for remote sources such as S3 and HTTP/HTTPS URLs.
Warnings
- gotcha To utilize remote file system capabilities (e.g., reading/writing to S3 or HTTP/HTTPS URLs), you must explicitly install optional dependencies like `fsspec` and `s3fs`. Use `pip install warcio[s3]` or `pip install warcio[all]`.
- deprecated Older versions of `warcio` (prior to 1.7.5) might have used `pkg_resources` for version checks, which is deprecated. While `warcio` itself has updated to `importlib` for this, users might still encounter `DeprecationWarning` messages depending on their `setuptools` or `pip` versions, or if other dependencies still use `pkg_resources`.
- breaking Support for `python setup.py test` was dropped because `setuptools` version 72 removed the long-deprecated `test` command. Workflows that relied on `python setup.py test` to run `warcio`'s tests, or their own tests against it, will fail; invoke a test runner such as `pytest` directly instead.
- gotcha `open_or_default` was re-added in `v1.8.1` as an alias for `fsspec_open`. This suggests the name was removed or renamed in `v1.8.0`, so code that uses `open_or_default` may fail with `ImportError` or `AttributeError` on `1.8.0`. If you depend on this function, upgrade from older `1.x` versions straight to `1.8.1` or later.
- gotcha For very large-scale web crawls (tera- or petabyte scale), `warcio` (being pure Python) might be less performant than C++/Cython alternatives like `FastWARC`. `FastWARC` offers speedups but is not a drop-in replacement and lacks ARC file support.
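The `pkg_resources` warning above concerns package version lookups. As a minimal sketch of the `importlib.metadata` approach (assuming Python 3.8 or newer), the helper below returns the installed version of a package or `None` if it is missing. The name `installed_version` is hypothetical and is not part of warcio.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Look up warcio without importing pkg_resources (and without its warning).
warcio_version = installed_version("warcio")
if warcio_version is None:
    print("warcio is not installed")
else:
    print(f"warcio {warcio_version} is installed")
```

Because the lookup only reads package metadata, it works even when importing the package itself would fail.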
Install
- pip install warcio
- pip install warcio[s3]
Imports
- ArchiveIterator
from warcio.archiveiterator import ArchiveIterator
- capture_http
from warcio.capture_http import capture_http
- WARCWriter
from warcio.warcwriter import WARCWriter
- ArcWarcRecord
from warcio.recordloader import ArcWarcRecord
Quickstart
import os

# capture_http must be imported BEFORE requests so that its patching of
# the HTTP client library takes effect.
from warcio.capture_http import capture_http
import requests  # noqa: E402 (deliberately imported after capture_http)
from warcio.archiveiterator import ArchiveIterator

# --- Writing a WARC file by capturing HTTP traffic ---
output_warc_file = 'example.warc.gz'

# Uncomment to capture a live request (requires network access).
# The remote IP is recorded as WARC-IP-Address by default.
# with capture_http(output_warc_file, warc_version='1.1'):
#     resp = requests.get('http://httpbin.org/get?q=test')
#     print(f"Captured GET request to {resp.url} with status {resp.status_code}")
#
# print(f"WARC file '{output_warc_file}' created successfully.")

# --- Reading records from the WARC file ---
# The built-in open() and os.path.exists() used below only handle local paths.
# For remote files (e.g., S3), ensure warcio[s3] is installed and open the
# stream with an fsspec-aware opener instead of open().
# remote_warc_url = 's3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1701389650426.47/warc/CC-MAIN-20231130201438-00000-ip-10-2-12-106.warc.gz'
# If using a local file, ensure it exists from the writing step or provide your own.
input_warc_source = output_warc_file

if os.path.exists(input_warc_source):
    print(f"\n--- Reading records from '{input_warc_source}' ---")
    try:
        with open(input_warc_source, 'rb') as stream:
            # ArchiveIterator yields one record at a time from the stream.
            for record in ArchiveIterator(stream):
                if record.rec_type == 'response':
                    uri = record.rec_headers.get_header('WARC-Target-URI')
                    status = record.http_headers.get_statuscode() if record.http_headers else 'N/A'
                    print(f"  Response Record: URI={uri}, Status={status}")
                elif record.rec_type == 'request':
                    uri = record.rec_headers.get_header('WARC-Target-URI')
                    print(f"  Request Record: URI={uri}")
                elif record.rec_type == 'warcinfo':
                    filename = record.rec_headers.get_header('WARC-Filename')
                    print(f"  Warcinfo Record: Filename={filename}")
    except FileNotFoundError:
        # The file can still disappear between the exists() check and open().
        print(f"Error: Local WARC file '{input_warc_source}' not found. Skipping read example.")
    except Exception as e:
        print(f"An error occurred while reading the WARC file: {e}")
else:
    print(f"Local WARC file '{input_warc_source}' not found. Skipping read example. Uncomment the writing section to create it.")