{"id":7856,"library":"warc3-wet-clueweb09","title":"warc3-wet-clueweb09","description":"A Python library designed to efficiently parse and work with ARC and WARC files, specifically tailored with fixes and optimizations for ClueWeb09 WET (Web Extracted Text) files. It provides an interface to iterate over records within these compressed archives. The current version is 0.2.5, indicating a pre-1.0 status with potential for future API changes, and it's maintained on an as-needed basis.","status":"active","version":"0.2.5","language":"en","source_language":"en","source_url":"https://github.com/seanmacavaney/warc3-clueweb","tags":["warc","wet","clueweb09","web archives","parsing","data extraction"],"install":[{"cmd":"pip install warc3-wet-clueweb09","lang":"bash","label":"Install latest version"}],"dependencies":[],"imports":[{"note":"The Warc3Record class is directly exposed at the package root, not nested under the package name after a simple import.","wrong":"import warc3_wet_clueweb09\nrecord = warc3_wet_clueweb09.Warc3Record()","symbol":"Warc3Record","correct":"from warc3_wet_clueweb09 import Warc3Record"}],"quickstart":{"code":"import gzip\nimport os\nfrom warc3_wet_clueweb09 import Warc3Record\n\n# Create a dummy WET.gz file for demonstration purposes.\n# Content-Length must equal the payload size in bytes, and each record ends with two CRLFs.\ndummy_wet_content = b\"WARC/1.0\\r\\nWARC-Type: wet\\r\\nWARC-Record-ID: <urn:uuid:1>\\r\\nContent-Type: text/plain\\r\\nContent-Length: 22\\r\\n\\r\\nHello ClueWeb09 World!\\r\\n\\r\\nWARC/1.0\\r\\nWARC-Type: wet\\r\\nWARC-Record-ID: <urn:uuid:2>\\r\\nContent-Type: text/plain\\r\\nContent-Length: 17\\r\\n\\r\\nAnother line here\\r\\n\\r\\n\"\n\ndummy_filepath = \"dummy.wet.gz\"\nwith gzip.open(dummy_filepath, \"wb\") as f:\n    f.write(dummy_wet_content)\n\n# Now, parse records from the dummy file\ntry:\n    with gzip.open(dummy_filepath, 'rb') as f:\n        print(f\"Reading records from: {dummy_filepath}\")\n        for i, record in enumerate(Warc3Record.parse_records(f)):\n            print(f\"--- Record {i+1} ---\")\n    
        print(f\"WARC-Type: {record.warc_type}\")\n            print(f\"WARC-Record-ID: {record.warc_record_id}\")\n            if record.content:\n                print(f\"Content (first 50 chars): {record.content.decode('utf-8', errors='ignore').strip()[:50]}...\")\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\nfinally:\n    # Clean up the dummy file\n    if os.path.exists(dummy_filepath):\n        os.remove(dummy_filepath)\n        print(f\"Cleaned up: {dummy_filepath}\")\n","lang":"python","description":"This quickstart demonstrates how to create a dummy ClueWeb09 WET.gz file and then use `warc3-wet-clueweb09` to parse records from it. It shows the basic steps of opening the gzipped file in binary mode and iterating through the `Warc3Record` objects."},"warnings":[{"fix":"Always verify parsed data when using with non-ClueWeb09 WARC/WET files. Consider more general WARC libraries like `warcio` or `warc` for broader compatibility.","message":"This library is specifically designed with 'fixes for ClueWeb09 WET files'. While it may work for general WARC/WET files, its behavior and parsing accuracy are optimized for the ClueWeb09 dataset. Using it for other WARC archives might lead to unexpected parsing errors or incomplete data extraction.","severity":"gotcha","affected_versions":"<1.0"},{"fix":"Pin the library version in your `requirements.txt` (e.g., `warc3-wet-clueweb09==0.2.5`) to ensure consistent behavior across deployments. Test thoroughly after any version upgrade.","message":"The library is in a pre-1.0 version (0.2.5). This implies that the API might not be stable, and breaking changes could be introduced in minor updates. Always review the release notes when upgrading.","severity":"gotcha","affected_versions":"<1.0"},{"fix":"Always open gzipped WARC/WET files using `gzip.open(filepath, 'rb')`. 
Ensure the file exists and is indeed gzipped.","message":"When opening `.wet.gz` files, use `gzip.open` in binary read mode (`'rb'`). Opening the file with plain `open()` hands the parser compressed bytes, and opening it in text mode (`'rt'`) yields `str` instead of `bytes`; either is likely to make parsing fail. Calling `gzip.open` on a file that is not actually gzipped raises `OSError: Not a gzipped file` (`gzip.BadGzipFile` on Python 3.8+) on the first read.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Change your import statement from `import warc3_wet_clueweb09` to `from warc3_wet_clueweb09 import Warc3Record`.","cause":"You are trying to access `Warc3Record` as an attribute of the imported module, but it's exposed directly at the package root for direct import.","error":"AttributeError: module 'warc3_wet_clueweb09' has no attribute 'Warc3Record'"},{"fix":"Verify that the input `.wet.gz` file is genuinely gzip-compressed and not corrupted or truncated, and that its extension matches its actual compression. The open mode is not the cause: `gzip.open` treats `'r'` and `'rb'` identically as binary reads.","cause":"The file passed to `gzip.open` is not valid gzip data: it is uncompressed, corrupted, or compressed in a different format. The error is raised on the first read, not when `gzip.open` is called.","error":"OSError: Not a gzipped file"},{"fix":"When decoding `record.content`, use `record.content.decode('utf-8', errors='ignore')` to drop undecodable bytes, or `record.content.decode('utf-8', errors='replace')` to substitute U+FFFD for them. If the content is known to be in a different encoding, decode with that encoding instead.","cause":"You are attempting to decode binary content from a WARC record using UTF-8, but the content contains bytes that are not valid in a UTF-8 sequence (e.g., non-text content, or text in a different encoding).","error":"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position X: invalid start byte"}]}