warc3-wet

0.2.5 · active · verified Tue Apr 14

warc3-wet is a Python library for reading and writing ARC and WARC (Web ARChive) files, the formats commonly used to store web crawls. It is a fork of the original `warc` repository, updated for Python 3 compatibility and to handle issues with specific datasets such as ClueWeb09. The current version is 0.2.5, released on July 17, 2024. Releases are infrequent and focus on maintenance and compatibility updates.

Warnings

Install

pip install warc3-wet

Imports

import warc

Quickstart

This quickstart shows how to open a WARC or WET file and iterate over its records. It first writes a small dummy file so the example runs without external data, then reads it back and prints the target URI and content length of every record that carries both headers, and finally deletes the dummy file.

import os

import warc

PATH = "test.warc.wet"

# Payload of the response record: a tiny HTTP response with an 11-byte body.
HTTP_PAYLOAD = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"Hello World"
)

# For demonstration, create a dummy file if it doesn't exist.
# In a real scenario, you would have an actual WARC/WET file.
# Binary mode keeps "\r\n" intact on every platform. Each record is a header
# block, a blank line, the payload, and a closing "\r\n\r\n".
created = not os.path.exists(PATH)
if created:
    with open(PATH, "wb") as f:
        f.write(b"WARC/1.0\r\n")
        f.write(b"WARC-Type: warcinfo\r\n")
        f.write(b"WARC-Date: 2023-01-01T12:00:00Z\r\n")
        f.write(b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n")
        f.write(b"Content-Length: 0\r\n")
        f.write(b"\r\n")      # end of headers
        f.write(b"\r\n\r\n")  # end of record (empty payload)
        f.write(b"WARC/1.0\r\n")
        f.write(b"WARC-Type: response\r\n")
        f.write(b"WARC-Target-URI: http://example.com/\r\n")
        f.write(b"WARC-Date: 2023-01-01T12:00:01Z\r\n")
        f.write(b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000002>\r\n")
        # Content-Length must match the payload size exactly.
        f.write(b"Content-Length: %d\r\n" % len(HTTP_PAYLOAD))
        f.write(b"Content-Type: text/plain\r\n")
        f.write(b"\r\n")      # end of headers
        f.write(HTTP_PAYLOAD)
        f.write(b"\r\n\r\n")  # end of record

# Print the URI and length of every record that has both headers.
with warc.open(PATH) as f:
    for record in f:
        if 'WARC-Target-URI' in record and 'Content-Length' in record:
            print(f"URI: {record['WARC-Target-URI']}, Length: {record['Content-Length']}")

# Clean up the dummy file, but only if this script created it.
if created:
    os.remove(PATH)
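
The record layout that the dummy file relies on can be sketched with the standard library alone. The snippet below is not how warc3-wet is implemented. It is a minimal illustration of WARC/1.0 framing: a version line, header lines, a blank line, exactly `Content-Length` bytes of payload, and a closing pair of CRLFs. The helper names `build_record` and `iter_records` are hypothetical and exist only for this example.

```python
import io


def build_record(headers, payload=b""):
    """Serialize one WARC/1.0 record; Content-Length is computed from payload."""
    lines = [b"WARC/1.0"]
    for name, value in headers.items():
        lines.append(f"{name}: {value}".encode("utf-8"))
    lines.append(f"Content-Length: {len(payload)}".encode("utf-8"))
    head = b"\r\n".join(lines) + b"\r\n\r\n"  # header block plus blank line
    return head + payload + b"\r\n\r\n"       # payload plus record terminator


def iter_records(stream):
    """Yield (headers, payload) pairs from a binary stream of WARC records."""
    while True:
        version = stream.readline()
        if not version:
            return  # clean end of stream
        if not version.startswith(b"WARC/"):
            raise ValueError(f"bad version line: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line == b"":
                raise ValueError("truncated header block")
            if line == b"\r\n":
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly as many payload bytes as the header announces.
        payload = stream.read(int(headers["Content-Length"]))
        if stream.read(4) != b"\r\n\r\n":
            raise ValueError("record is not terminated by two CRLFs")
        yield headers, payload


# Two in-memory records: an empty warcinfo record and a WET-style text record.
data = build_record({"WARC-Type": "warcinfo"}) + build_record(
    {"WARC-Type": "conversion", "WARC-Target-URI": "http://example.com/"},
    b"Hello World",
)
records = list(iter_records(io.BytesIO(data)))
for headers, payload in records:
    print(headers.get("WARC-Target-URI"), headers["Content-Length"])
```

Computing `Content-Length` from the payload, rather than typing it by hand, avoids the most common mistake when building test files: a declared length that does not match the bytes that follow, which makes a reader lose track of where the next record starts.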
