{"id":5095,"library":"warcio","title":"Streaming WARC (and ARC) IO library","description":"warcio is a Python library (v1.8.1) for fast, low-level, streaming input/output of Web ARChive (WARC) and ARC files, adhering to WARC 1.0 and 1.1 ISO standards. It focuses on processing a stream of web archive records rather than entire files. Developed by Webrecorder, it includes features for both reading existing archives and capturing HTTP/S traffic directly into WARC files. The library is actively maintained, with recent updates adding support for remote file systems like S3 and HTTPS.","status":"active","version":"1.8.1","language":"en","source_language":"en","source_url":"https://github.com/webrecorder/warcio","tags":["warc","arc","web archive","streaming","io","webrecorder"],"install":[{"cmd":"pip install warcio","lang":"bash","label":"Base Installation"},{"cmd":"pip install warcio[s3]","lang":"bash","label":"With S3/Remote File System Support"}],"dependencies":[{"reason":"Minimal external dependency for Python 3.7+.","package":"six","optional":false},{"reason":"Required for remote file system access (e.g., HTTP, S3, GCS). Installed automatically with 'warcio[all]' or 'warcio[s3]'.","package":"fsspec","optional":true},{"reason":"Specifically for Amazon S3 remote file system support. 
Installed with 'warcio[s3]'.","package":"s3fs","optional":true}],"imports":[{"symbol":"ArchiveIterator","correct":"from warcio.archiveiterator import ArchiveIterator"},{"symbol":"capture_http","correct":"from warcio.capture_http import capture_http"},{"symbol":"WARCWriter","correct":"from warcio.warcwriter import WARCWriter"},{"note":"WARCRecord for manual creation is part of warcwriter since v1.6.","wrong":"from warcio.record import WARCRecord","symbol":"WARCRecord","correct":"from warcio.warcwriter import WARCRecord"}],"quickstart":{"code":"import os\n\nfrom warcio.capture_http import capture_http\nfrom warcio.archiveiterator import ArchiveIterator\n\n# requests must be imported AFTER capture_http so that its traffic can be captured\nimport requests\n\n# --- Writing a WARC file by capturing HTTP traffic ---\noutput_warc_file = 'example.warc.gz'\n\n# Uncomment the block below to capture a request into the WARC file\n# with capture_http(output_warc_file, warc_version='1.1') as writer:\n#     # You can optionally set WARC-IP-Address for records if available\n#     os.environ['WARC_IP_ADDRESS'] = '192.168.1.1' # Example\n#     resp = requests.get('http://httpbin.org/get?q=test')\n#     print(f\"Captured GET request to {resp.url} with status {resp.status_code}\")\n#     del os.environ['WARC_IP_ADDRESS'] # Clean up env var\n#\n# print(f\"WARC file '{output_warc_file}' created successfully.\")\n\n# --- Reading records from the WARC file (or a remote one) ---\n# For remote files (e.g., S3), ensure warcio[s3] is installed\n# remote_warc_url = 's3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1701389650426.47/warc/CC-MAIN-20231130201438-00000-ip-10-2-12-106.warc.gz'\n# If using a local file, ensure it exists from the writing step or provide your own\ninput_warc_source = output_warc_file # or remote_warc_url\n\nif os.path.exists(output_warc_file):\n    print(f\"\\n--- Reading records from '{input_warc_source}' ---\")\n    try:\n        with open(input_warc_source, 'rb') as stream:\n            for record in ArchiveIterator(stream):\n                if 
record.rec_type == 'response':\n                    uri = record.rec_headers.get_header('WARC-Target-URI')\n                    status = record.http_headers.get_statuscode() if record.http_headers else 'N/A'\n                    print(f\"  Response Record: URI={uri}, Status={status}\")\n                elif record.rec_type == 'request':\n                    uri = record.rec_headers.get_header('WARC-Target-URI')\n                    print(f\"  Request Record: URI={uri}\")\n                elif record.rec_type == 'warcinfo':\n                    filename = record.rec_headers.get_header('WARC-Filename')\n                    print(f\"  Warcinfo Record: Filename={filename}\")\n    except FileNotFoundError:\n        print(f\"Error: Local WARC file '{output_warc_file}' not found. Skipping read example.\")\n    except Exception as e:\n        print(f\"An error occurred while reading the WARC file: {e}\")\nelse:\n    print(f\"Local WARC file '{output_warc_file}' not found. Skipping read example. Uncomment the writing section to create it.\")\n","lang":"python","description":"This quickstart demonstrates both writing and reading WARC files. The writing section uses `warcio.capture_http` to automatically capture HTTP traffic from a `requests` call into a WARC file. The reading section then iterates through the created WARC file using `warcio.archiveiterator.ArchiveIterator`, printing details of each record. The example includes commented-out code for creating the WARC file and for reading from a remote S3 URL, highlighting the flexibility of the library."},"warnings":[{"fix":"Install `warcio` with the appropriate extras: `pip install warcio[s3]` for S3, or `pip install warcio[all]` for all optional features.","message":"To utilize remote file system capabilities (e.g., reading/writing to S3 or HTTP/HTTPS URLs), you must explicitly install optional dependencies like `fsspec` and `s3fs`. 
Use `pip install warcio[s3]` or `pip install warcio[all]`.","severity":"gotcha","affected_versions":">=1.8.0"},{"fix":"Ensure your `setuptools` and `pip` are updated to their latest versions to minimize `pkg_resources` warnings. `warcio` v1.7.5 migrated to `importlib` for version retrieval.","message":"Older versions of `warcio` (prior to 1.7.5) might have used `pkg_resources` for version checks, which is deprecated. While `warcio` itself has updated to `importlib` for this, users might still encounter `DeprecationWarning` messages depending on their `setuptools` or `pip` versions, or if other dependencies still use `pkg_resources`.","severity":"deprecated","affected_versions":"<1.7.5 (and potentially later due to transitive dependencies or environment setup)"},{"fix":"Directly use `pytest` or other standard test runners instead of `python setup.py test`.","message":"The `setup.py test` command was removed from `warcio` because `setuptools` version 72 removed support for it. Workflows that run `warcio`'s test suite via `python setup.py test` will break.","severity":"breaking","affected_versions":">=1.7.5"},{"fix":"Upgrade to `warcio` v1.8.1 or later to ensure `open_or_default` is available as an alias for `fsspec_open`. If on `1.8.0`, use `fsspec_open` directly.","message":"The function `open_or_default` was re-added as an alias for `fsspec_open` in `v1.8.1`. This implies that `open_or_default` might have been removed or renamed in `v1.8.0`, potentially causing `AttributeError` or `NameError` for users upgrading from older `1.x` versions to `1.8.0` before `1.8.1` was released, if they were using this specific function.","severity":"gotcha","affected_versions":"Potentially 1.8.0 (fixed in 1.8.1)"},{"fix":"Evaluate performance needs for large datasets. For maximum speed, consider `FastWARC` while being aware of its API differences and lack of ARC support. 
Otherwise, `warcio` remains a robust option for general use.","message":"For very large-scale web crawls (tera- or petabyte scale), `warcio` (being pure Python) might be less performant than C++/Cython alternatives like `FastWARC`. `FastWARC` offers speedups but is not a drop-in replacement and lacks ARC file support.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}