{"id":5927,"library":"fastwarc","title":"FastWARC","description":"FastWARC is a high-performance Python library for parsing WARC (Web ARChive) files, written in C++/Cython. It supports WARC/1.0 and WARC/1.1 streams with GZip and LZ4 compression, offering significant speed improvements over pure Python alternatives like WARCIO. FastWARC is part of the ChatNoir Resiliparse toolkit and is currently at version 0.16.0, with active development.","status":"active","version":"0.16.0","language":"en","source_language":"en","source_url":"https://github.com/chatnoir-eu/chatnoir-resiliparse","tags":["WARC","web archives","parsing","performance","Cython","C++","data processing"],"install":[{"cmd":"pip install fastwarc","lang":"bash","label":"Standard installation"},{"cmd":"pip install fastwarc[fsspec]","lang":"bash","label":"With fsspec for remote filesystems"},{"cmd":"sudo apt install build-essential python3-dev zlib1g-dev liblz4-dev && pip install --no-binary fastwarc fastwarc","lang":"bash","label":"Build from source for optimal Linux performance"}],"dependencies":[{"reason":"Enables reading WARC files from remote filesystems and URLs.","package":"fsspec","optional":true}],"imports":[{"symbol":"ArchiveIterator","correct":"from fastwarc.warc import ArchiveIterator"},{"symbol":"WarcRecord","correct":"from fastwarc.warc import WarcRecord"},{"symbol":"WarcRecordType","correct":"from fastwarc.warc import WarcRecordType"}],"quickstart":{"code":"import os\nfrom fastwarc.warc import ArchiveIterator, WarcRecordType\n\n# Create a dummy WARC file for demonstration purposes.\n# Each Content-Length must match the exact byte length of its record block,\n# and each record is terminated by two CRLFs.\ndummy_warc_content = b'WARC/1.0\\r\\nWARC-Type: warcinfo\\r\\nWARC-Date: 2023-01-01T12:00:00Z\\r\\nWARC-Record-ID: <urn:uuid:example-warcinfo>\\r\\nContent-Length: 103\\r\\n\\r\\ninfo: This is a dummy WARC file created for FastWARC quickstart example.\\n123456789012345678901234567890\\r\\n\\r\\nWARC/1.0\\r\\nWARC-Type: response\\r\\nWARC-Date: 2023-01-01T12:00:01Z\\r\\nWARC-Record-ID: <urn:uuid:example-response>\\r\\nWARC-Target-URI: http://example.com/\\r\\nContent-Type: application/http; msgtype=response\\r\\nContent-Length: 62\\r\\n\\r\\nHTTP/1.1 200 OK\\r\\nContent-Type: text/plain\\r\\n\\r\\nHello, FastWARC!\\n\\r\\n\\r\\n'\nwith open('example.warc', 'wb') as f:\n    f.write(dummy_warc_content)\n\nwarc_path = 'example.warc'\n\nif not os.path.exists(warc_path):\n    print(f\"Error: WARC file '{warc_path}' not found. Please ensure it exists.\")\nelse:\n    try:\n        # Iterate over WARC records, parsing HTTP for response records\n        for record in ArchiveIterator(warc_path, parse_http=True):\n            if record.record_type == WarcRecordType.warcinfo:\n                print(f\"WARC Info Record ID: {record.record_id}\")\n                print(f\"Content: {record.reader.read().decode('utf-8').strip()}\")\n            elif record.record_type == WarcRecordType.response:\n                # The target URL is stored in the WARC-Target-URI header\n                print(f\"\\nResponse Record URL: {record.headers.get('WARC-Target-URI')}\")\n                if record.http_headers:\n                    print(f\"HTTP Status: {record.http_headers.status_code}\")\n                print(f\"Payload: {record.reader.read().decode('utf-8').strip()}\")\n    except Exception as e:\n        print(f\"An error occurred during WARC processing: {e}\")\n    finally:\n        # Clean up the dummy WARC file\n        os.remove(warc_path)\n","lang":"python","description":"This quickstart demonstrates how to iterate through records in a WARC file using `ArchiveIterator`. It shows how to access record metadata such as `record_id` and the `WARC-Target-URI` header, and how to read the record content. For HTTP response records, it also shows how to access parsed HTTP headers. A dummy WARC file is created so the example is runnable."},"warnings":[{"fix":"Review FastWARC documentation for API differences when migrating from WARCIO.","message":"FastWARC is not a drop-in replacement for WARCIO. 
Its API is inspired by WARCIO but designed for performance, meaning direct migration may require code adjustments.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Pass `strict_mode=False` to `ArchiveIterator` for more lenient parsing of non-compliant WARC files. Be aware this might affect how record boundaries are determined.","message":"Malformed WARC records (e.g., missing Content-Length, non-standard line endings) in archives like ClueWeb can cause parsing issues. `strict_mode` defaults to `True`, which may cause iteration to terminate early on such records.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Set `parse_http=False` in the `ArchiveIterator` constructor. You can parse HTTP headers later on a per-record basis using `record.parse_http()` if needed.","message":"Automatic HTTP parsing (`parse_http=True`) can incur a performance overhead. If you only need WARC metadata or raw content and not parsed HTTP headers, HTTP parsing can be skipped.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For large records, ensure they fit into memory or set `consume=True` when calling digest verification methods if you do not need to preserve the stream contents for subsequent operations. This avoids creating a stream copy.","message":"Verifying record digests (e.g., `record.verify_block_digest()`) creates an in-memory copy of the remaining record stream to preserve its contents for further processing. 
This can consume significant memory for very large records.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If you require ARC format compatibility, you will need to use a different library such as WARCIO.","message":"FastWARC explicitly does not support the legacy ARC (Archive Record) format for simplicity and performance reasons.","severity":"breaking","affected_versions":"All versions"},{"fix":"For the best performance on Linux, it is recommended to build FastWARC from source by installing build dependencies (`build-essential`, `python3-dev`, `zlib1g-dev`, `liblz4-dev`) and then using `pip install --no-binary fastwarc fastwarc`.","message":"Pre-built Linux binaries are compiled on an older `manylinux` base system for compatibility, which may not offer optimal performance on modern systems.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z","problems":[]}