warc3-wet-clueweb09
A Python library for parsing ARC and WARC files, with fixes and optimizations specific to ClueWeb09 WET (Web Extracted Text) files. It provides an interface for iterating over the records in these compressed archives. The current version is 0.2.5; as a pre-1.0 release its API may still change, and the project is maintained on an as-needed basis.
Common errors
-
AttributeError: module 'warc3_wet_clueweb09' has no attribute 'Warc3Record'
cause `Warc3Record` is being accessed as an attribute of the imported module (`warc3_wet_clueweb09.Warc3Record`) instead of being imported by name.
fix Change the import from `import warc3_wet_clueweb09` to `from warc3_wet_clueweb09 import Warc3Record`, then refer to `Warc3Record` directly.
-
OSError: Not a gzipped file
cause The file passed to `gzip.open` is not a valid gzip stream: it is uncompressed, compressed in another format, or corrupted.
fix Verify that the input `.wet.gz` file really is gzip-compressed and intact; a `.gz` extension alone does not guarantee it. A gzip file begins with the bytes `1f 8b`. Also make sure the archive is read in binary mode (`'rb'`); with `gzip.open`, plain `'r'` already means binary, and text mode is `'rt'`.
-
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position X: invalid start byte
cause Binary content from a WARC record is being decoded as UTF-8, but it contains bytes that are not valid UTF-8 (for example, non-text content or text in a different encoding).
fix When decoding `record.content`, use `record.content.decode('utf-8', errors='ignore')` to drop the undecodable bytes, or `record.content.decode('utf-8', errors='replace')` to substitute U+FFFD for them. If the content's actual encoding is known, decode with that encoding instead.
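The decoding advice above can be sketched with the standard library alone. The helper name `decode_content` and the `cp1252` fallback are illustrative assumptions, not part of this library; `raw` stands in for the bytes you would get from `record.content`.

```python
def decode_content(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Try each encoding strictly, then fall back to lossy UTF-8.

    Hypothetical helper: the name and the default encoding list are
    assumptions for illustration only.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding rejects the bytes; try the next one
    # Last resort: keep going, marking every undecodable byte with U+FFFD.
    return raw.decode("utf-8", errors="replace")


raw = b"ok\xffok"  # 0xff can never start a UTF-8 sequence

# errors="ignore" silently drops the bad byte ...
ignored = raw.decode("utf-8", errors="ignore")
# ... while errors="replace" leaves a visible U+FFFD marker in its place.
replaced = raw.decode("utf-8", errors="replace")
# The fallback chain decodes the same bytes as cp1252 instead.
fallback = decode_content(raw)

print(repr(ignored), repr(replaced), repr(fallback))
```

Trying strict decoding first keeps valid UTF-8 lossless; the lossy path only runs when every listed encoding has rejected the bytes.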
Warnings
- gotcha This library is specifically designed with 'fixes for ClueWeb09 WET files'. While it may work for general WARC/WET files, its behavior and parsing accuracy are optimized for the ClueWeb09 dataset. Using it for other WARC archives might lead to unexpected parsing errors or incomplete data extraction.
- gotcha The library is in a pre-1.0 version (0.2.5). This implies that the API might not be stable, and breaking changes could be introduced in minor updates. Always review the release notes when upgrading.
- gotcha Open `.wet.gz` files with `gzip.open` in binary read mode (`'rb'`). Plain `open()` hands the parser compressed bytes and text mode (`'rt'`) hands it `str`, so parsing can fail. Passing a file that is not gzip-compressed to `gzip.open` raises `OSError: Not a gzipped file`; incorrect file handling can also surface as `struct.error`.
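A quick pre-flight check can catch a mislabeled archive before any parsing starts. The sketch below uses only the standard library; `is_gzip_file` and `open_wet` are hypothetical helper names, not functions provided by this package.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def is_gzip_file(path: str) -> bool:
    """Return True if the file begins with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def open_wet(path: str):
    """Open a .wet.gz archive in binary mode, failing early if it is not gzip."""
    if not is_gzip_file(path):
        raise ValueError(f"{path} does not start with the gzip magic bytes")
    return gzip.open(path, "rb")


# Demonstration with two temporary files: one really gzipped, one only named .gz.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "sample.wet.gz")
    bad = os.path.join(tmp, "plain.wet.gz")
    with gzip.open(good, "wb") as fh:
        fh.write(b"hello")
    with open(bad, "wb") as fh:
        fh.write(b"hello")  # uncompressed despite the extension

    good_is_gzip = is_gzip_file(good)
    bad_is_gzip = is_gzip_file(bad)
    with open_wet(good) as fh:
        payload = fh.read()

print(good_is_gzip, bad_is_gzip, payload)
```

The magic-byte test only confirms the header; a truncated or corrupted archive can still fail later while it is being read.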
Install
-
pip install warc3-wet-clueweb09
Imports
- Warc3Record
wrong
import warc3_wet_clueweb09
record = warc3_wet_clueweb09.Warc3Record()
correct
from warc3_wet_clueweb09 import Warc3Record
Quickstart
import gzip
import os

from warc3_wet_clueweb09 import Warc3Record

# Create a dummy WET.gz file for demonstration purposes.
# Content-Length is the payload size in bytes; each record ends with \r\n\r\n.
dummy_wet_content = (
    b"WARC/1.0\r\n"
    b"WARC-Type: wet\r\n"
    b"WARC-Record-ID: <urn:uuid:1>\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 22\r\n"
    b"\r\n"
    b"Hello ClueWeb09 World!\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: wet\r\n"
    b"WARC-Record-ID: <urn:uuid:2>\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 17\r\n"
    b"\r\n"
    b"Another line here\r\n\r\n"
)
dummy_filepath = "dummy.wet.gz"
with gzip.open(dummy_filepath, "wb") as f:
    f.write(dummy_wet_content)

# Now parse records from the dummy file (binary mode is required).
try:
    with gzip.open(dummy_filepath, "rb") as f:
        print(f"Reading records from: {dummy_filepath}")
        for i, record in enumerate(Warc3Record.parse_records(f)):
            print(f"--- Record {i + 1} ---")
            print(f"WARC-Type: {record.warc_type}")
            print(f"WARC-Record-ID: {record.warc_record_id}")
            if record.content:
                # Decode leniently so stray non-UTF-8 bytes do not abort the loop.
                text = record.content.decode("utf-8", errors="ignore").strip()
                print(f"Content (first 50 chars): {text[:50]}...")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy file
    if os.path.exists(dummy_filepath):
        os.remove(dummy_filepath)
        print(f"Cleaned up: {dummy_filepath}")
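To make the record layout in the dummy data concrete, here is a minimal standard-library sketch that walks the same structure: a version line, `Name: value` headers, a blank line, then exactly `Content-Length` bytes of payload. It is not this library's parser; `iter_records` is a hypothetical name, and the sketch assumes well-formed input in which every record carries a `Content-Length` header.

```python
import io


def iter_records(stream):
    """Yield (headers, body) pairs from a binary stream of WARC-style records.

    Illustrative only: assumes every record has a Content-Length header.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        headers = {"version": line.strip().decode("ascii")}
        # Header lines run until the first empty line.
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes, not "until the next line".
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: wet\r\n"
    b"WARC-Record-ID: <urn:uuid:1>\r\n"
    b"Content-Length: 22\r\n"
    b"\r\n"
    b"Hello ClueWeb09 World!\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: wet\r\n"
    b"WARC-Record-ID: <urn:uuid:2>\r\n"
    b"Content-Length: 17\r\n"
    b"\r\n"
    b"Another line here\r\n\r\n"
)

records = list(iter_records(io.BytesIO(sample)))
for headers, body in records:
    print(headers["WARC-Record-ID"], len(body), body)
```

Because the payload is read by byte count, a `Content-Length` that does not match the actual payload size shifts every later record boundary, which is why the header values in hand-built test data need to be exact. The same function works on a file object returned by `gzip.open(path, "rb")`, since it only relies on `readline` and `read`.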