warc3-wet
warc3-wet is a Python library for reading and writing ARC and WARC (Web ARChive) files, the formats used to store web crawls. It is a fork of the original `warc` repository, updated for Python 3 compatibility and to handle issues with specific datasets such as ClueWeb09. The current version is 0.2.5, released on July 17, 2024. Releases are infrequent and focus on maintenance and compatibility.
Warnings
- gotcha Despite the package name `warc3-wet`, the module to import is `warc`. Users who go by the PyPI package name might try `import warc3_wet` or `from warc3_wet import warc`; neither is the correct import.
- breaking This library is a Python 3 port and fork of an older, 'now dead' Python 2 `warc` library. The interface is largely unchanged, but compatibility with Python 2 applications that use the original `warc` library is not guaranteed, so Python 2 codebases will need to be migrated.
- gotcha The official documentation for `warc3-wet` points to `http://warc.readthedocs.org/`, which is the documentation for the *original* `warc` library. While the interface is stated to be largely unchanged, be aware that any installation instructions on that external documentation may not apply to `warc3-wet`.
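The naming mismatch in the first warning can be checked from Python itself. The following is a minimal sketch using only the standard library; `describe_install` is a hypothetical helper name, not part of `warc3-wet`. Note that a `warc` module may also be importable because the original `warc` distribution is installed, so the two checks are reported separately.

```python
import importlib.metadata
import importlib.util

DIST_NAME = "warc3-wet"  # name used with pip / PyPI
MODULE_NAME = "warc"     # name used in import statements


def describe_install():
    """Report whether the distribution is installed and the module importable.

    Hypothetical helper for illustration only.
    """
    try:
        version = importlib.metadata.version(DIST_NAME)
    except importlib.metadata.PackageNotFoundError:
        version = None  # the warc3-wet distribution is not installed
    # find_spec looks the module up without importing it
    importable = importlib.util.find_spec(MODULE_NAME) is not None
    return {
        "distribution": DIST_NAME,
        "version": version,
        "module": MODULE_NAME,
        "importable": importable,
    }


print(describe_install())
```

If `version` is `None` while `importable` is `True`, the `warc` module is most likely coming from a different distribution than `warc3-wet`.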
Install
pip install warc3-wet
Imports
- warc
import warc
Quickstart
import os

import warc

# Assuming 'test.warc.wet' is a valid WARC/WET file.
# For demonstration, we create a dummy file if it doesn't exist.
# In a real scenario, you would have an actual WARC/WET file.
created_dummy = not os.path.exists("test.warc.wet")
if created_dummy:
    # newline="" keeps the explicit "\r\n" line endings untranslated on every platform
    with open("test.warc.wet", "w", newline="") as f:
        # Record 1: warcinfo with an empty content block
        f.write("WARC/1.0\r\n")
        f.write("WARC-Type: warcinfo\r\n")
        f.write("WARC-Date: 2023-01-01T12:00:00Z\r\n")
        f.write("WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n")
        f.write("Content-Length: 0\r\n")
        f.write("\r\n")      # blank line ends the header
        f.write("\r\n\r\n")  # two CRLFs end the record
        # Record 2: response whose content block is exactly 50 bytes
        f.write("WARC/1.0\r\n")
        f.write("WARC-Type: response\r\n")
        f.write("WARC-Target-URI: http://example.com/\r\n")
        f.write("WARC-Date: 2023-01-01T12:00:01Z\r\n")
        f.write("WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000002>\r\n")
        f.write("Content-Length: 50\r\n")
        f.write("Content-Type: text/plain\r\n")
        f.write("\r\n")                    # blank line ends the header
        f.write("HTTP/1.1 200 OK\r\n")     # 17 bytes
        f.write("Content-Length: 11\r\n")  # 20 bytes
        f.write("\r\n")                    # 2 bytes
        f.write("Hello World")             # 11 bytes
        f.write("\r\n\r\n")                # two CRLFs end the record

# Iterate over the records and print the target URI and length where present
with warc.open("test.warc.wet") as f:
    for record in f:
        if 'WARC-Target-URI' in record and 'Content-Length' in record:
            print(f"URI: {record['WARC-Target-URI']}, Length: {record['Content-Length']}")

# Clean up the dummy file, but only if this script created it
if created_dummy:
    os.remove("test.warc.wet")
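Writing test records by hand is error-prone because `Content-Length` must equal the exact byte length of the content block and every record must end with two CRLF pairs. The sketch below derives the length from the payload instead of hard-coding it. It uses only the standard library; `build_record` is a hypothetical helper for illustration, not part of the `warc` module.

```python
def build_record(headers, payload=b""):
    """Serialize one WARC/1.0 record as bytes.

    Hypothetical helper: Content-Length is always computed from the payload,
    so any Content-Length entry passed in `headers` is ignored.
    """
    lines = ["WARC/1.0"]
    for name, value in headers.items():
        if name.lower() == "content-length":
            continue  # derived from the payload below
        lines.append(f"{name}: {value}")
    lines.append(f"Content-Length: {len(payload)}")
    # Header lines are CRLF-separated and followed by one blank line
    head = "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n"
    # The content block is followed by two CRLFs that terminate the record
    return head + payload + b"\r\n\r\n"


record = build_record(
    {
        "WARC-Type": "response",
        "WARC-Target-URI": "http://example.com/",
        "WARC-Date": "2023-01-01T12:00:01Z",
        "WARC-Record-ID": "<urn:uuid:00000000-0000-0000-0000-000000000002>",
    },
    b"Hello World",
)
print(record.decode("utf-8"))
```

Several records built this way can be concatenated and written to a file opened in binary mode (`"wb"`), which also avoids any newline translation.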