warc3-wet-clueweb09

0.2.5 · active · verified Thu Apr 16

A Python library for parsing ARC and WARC files, with fixes and optimizations for ClueWeb09 WET (Web Extracted Text) files. It provides an interface for iterating over the records in these compressed archives. The current version is 0.2.5; as a pre-1.0 release, its API may still change, and it is maintained on an as-needed basis.
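
To make concrete what iterating over records in a compressed archive involves, here is a minimal stdlib-only sketch. It is not this library's implementation: `iter_warc_records` is a hypothetical helper, and it assumes every record declares a correct `Content-Length`, which real ClueWeb09 files do not always do.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) pairs from a binary stream of WARC records.

    Illustrative only: assumes each record has a correct Content-Length.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        headers = {"version": line.strip().decode("ascii", errors="replace")}
        # Header fields run until the first empty line (or end of stream).
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        payload = stream.read(length)  # exactly the declared number of bytes
        yield headers, payload


# Two hand-made records, gzip-compressed in memory for the demonstration.
raw = (
    b"WARC/1.0\r\nWARC-Type: conversion\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: conversion\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
)
compressed = gzip.compress(raw)

with gzip.open(io.BytesIO(compressed), "rb") as f:
    records = list(iter_warc_records(f))

for headers, payload in records:
    print(headers["WARC-Type"], payload)
```

Reading the payload by declared length rather than scanning for the next `WARC/1.0` line keeps the sketch short, but it also means a single wrong `Content-Length` derails every record after it, which is the kind of problem a ClueWeb09-specific parser has to tolerate.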

Quickstart

This quickstart creates a dummy ClueWeb09 WET file (`dummy.wet.gz`) and then uses `warc3-wet-clueweb09` to parse its records. It opens the gzipped file in binary mode and iterates over the resulting `Warc3Record` objects.

import gzip
import os
from warc3_wet_clueweb09 import Warc3Record

# Create a dummy WET.gz file for demonstration purposes
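# Content-Length must equal the payload size in bytes, and the WARC format
# ends each record with two CRLF pairs after the payload.
dummy_wet_content = (
    b"WARC/1.0\r\nWARC-Type: wet\r\nWARC-Record-ID: <urn:uuid:1>\r\n"
    b"Content-Type: text/plain\r\nContent-Length: 22\r\n\r\n"
    b"Hello ClueWeb09 World!\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: wet\r\nWARC-Record-ID: <urn:uuid:2>\r\n"
    b"Content-Type: text/plain\r\nContent-Length: 17\r\n\r\n"
    b"Another line here\r\n\r\n"
)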

dummy_filepath = "dummy.wet.gz"
with gzip.open(dummy_filepath, "wb") as f:
    f.write(dummy_wet_content)

# Now, parse records from the dummy file
try:
    with gzip.open(dummy_filepath, 'rb') as f:
        print(f"Reading records from: {dummy_filepath}")
        for i, record in enumerate(Warc3Record.parse_records(f)):
            print(f"--- Record {i+1} ---")
            print(f"WARC-Type: {record.warc_type}")
            print(f"WARC-Record-ID: {record.warc_record_id}")
            if record.content:
                print(f"Content (first 50 chars): {record.content.decode('utf-8', errors='ignore').strip()[:50]}...")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy file
    if os.path.exists(dummy_filepath):
        os.remove(dummy_filepath)
        print(f"Cleaned up: {dummy_filepath}")
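
In hand-written fixtures, the declared `Content-Length` easily drifts from the actual payload size ("Hello ClueWeb09 World!" is 22 bytes, for example). The sketch below computes the length from the encoded payload instead. `build_wet_record` is a hypothetical helper for test data, not part of `warc3-wet-clueweb09`, and the header fields simply mirror the ones used in the quickstart.

```python
def build_wet_record(record_id: str, text: str) -> bytes:
    """Serialize one minimal WARC record with a computed Content-Length.

    Hypothetical fixture helper; not part of warc3-wet-clueweb09.
    """
    payload = text.encode("utf-8")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: wet",
        f"WARC-Record-ID: <urn:uuid:{record_id}>",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",  # counted in bytes, not characters
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # The WARC format ends each record with two CRLF pairs after the payload.
    return head + payload + b"\r\n\r\n"


first = build_wet_record("1", "Hello ClueWeb09 World!")
second = build_wet_record("2", "Another line here")
dummy_wet_content = first + second
```

The resulting `dummy_wet_content` can be written with `gzip.open(dummy_filepath, "wb")` exactly as in the quickstart.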
