LM Dataformat
LM Dataformat (lm-dataformat) is a Python utility for storing and reading text datasets used in large language model (LLM) training. It archives documents together with their metadata and streams them back for processing. The current version is 0.0.20, but the project appears to be abandoned: there has been no active development or maintenance since its last release in 2021, and the last GitHub commit is reported as over six years old.
Common errors
-
FileNotFoundError: [Errno 2] No such file or directory: 'output_dir/meta.jsonl.zst'
cause The `commit()` method was not called after adding data to the `Archive`, or the directory specified for `Reader` does not contain a valid, committed archive.
fix Always call `Archive.commit()` after adding all data to finalize the archive. Ensure the path passed to `Reader` is the directory where `Archive.commit()` was executed successfully.
-
"current chunk incomplete" without any jsonl.zst file
cause This error, reported in GitHub issues, suggests a problem during archiving in which a data chunk was not properly written or finalized, potentially due to an incomplete write operation or a corrupted state.
fix This can indicate data corruption or an interruption during archive creation. Ensure sufficient disk space, proper permissions, and that the `commit()` method is called without interruption. Inspect the `output_dir` for partially written `.zst` files. If the problem persists, it may be an unaddressed bug in the abandoned library.
-
TypeError: Object of type bytes is not JSON serializable
cause The `add_data` method expects a string (often JSON stringified) for the document, but raw bytes or a non-serializable Python object was passed.
fix Ensure that any data you wish to store as a document is first converted into a string format, typically by using `json.dumps()` for Python dictionaries/lists, or `.decode('utf-8')` for byte strings.
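The fix for the last error can be sketched as a small normalization step applied before `add_data`. The helper name `to_document_text` is hypothetical and not part of lm-dataformat; it only assumes, as stated above, that documents should be passed as strings.

```python
import json


def to_document_text(obj, encoding="utf-8"):
    """Return a str suitable for use as a document (hypothetical helper)."""
    if isinstance(obj, str):
        # Already a string: pass through unchanged.
        return obj
    if isinstance(obj, (bytes, bytearray)):
        # Raw bytes must be decoded; invalid input raises UnicodeDecodeError.
        return bytes(obj).decode(encoding)
    # Dicts, lists, numbers and similar values are serialized to a JSON string.
    return json.dumps(obj, ensure_ascii=False)
```

A call such as `ar.add_data(to_document_text(payload), meta={...})` would then accept strings, byte strings, and JSON-serializable objects alike.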
Warnings
- breaking The lm-dataformat library appears to be abandoned, with the last PyPI release in August 2021 and the last GitHub commit over six years ago. This means no new features, bug fixes, or compatibility updates for newer Python versions or external libraries are expected.
- gotcha Lack of active maintenance may lead to compatibility issues with newer Python versions (e.g., Python 3.9+) or other evolving ecosystem libraries, potentially causing unexpected errors or silent failures.
- gotcha The library does not provide robust error handling or detailed logging in some cases, which can make debugging issues like corrupted archives or malformed data challenging.
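Because the library offers few diagnostics, a pre-flight check before constructing a `Reader` can make a missing or uncommitted archive easier to spot. The sketch below uses only the standard library. The helper name `find_committed_chunks` is hypothetical, and it assumes committed chunks are files ending in `.jsonl.zst`, which is inferred from the error reports above rather than from documented behavior.

```python
from pathlib import Path


def find_committed_chunks(archive_dir, suffix=".jsonl.zst"):
    """Return sorted, non-empty chunk files in archive_dir (hypothetical helper).

    The default suffix is an assumption about how committed chunks are named.
    """
    root = Path(archive_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"archive directory not found: {root}")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.name.endswith(suffix) and p.stat().st_size > 0
    )
```

An empty result suggests that `commit()` was never called, or that the path points at the wrong directory.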
Install
-
pip install lm-dataformat
Imports
- Archive
from lm_dataformat import Archive
- Reader
from lm_dataformat import Reader
Quickstart
import os
import shutil
import json
from lm_dataformat import Archive, Reader
# Define output directory
output_dir = 'lm_data_archive'
# --- Writing Data ---
print(f"Creating archive in {output_dir}")
ar = Archive(output_dir)
# Add some sample data
ar.add_data(json.dumps({'text': 'This is the first document for LLM training.', 'id': 1}), meta={'source': 'quickstart'})
ar.add_data(json.dumps({'text': 'A second document with different content.', 'id': 2}), meta={'author': 'gemini'})
ar.add_data(json.dumps({'text': 'The third and final document.', 'id': 3}), meta={'source': 'quickstart', 'version': '1.0'})
# Commit changes to finalize the archive
ar.commit()
print("Archive created and committed.")
# --- Reading Data ---
print(f"\nReading data from {output_dir}")
rdr = Reader(output_dir)
doc_count = 0
for doc in rdr.stream_data():
    doc_count += 1
    print(f" Document {doc_count}: {doc}")
print(f"Successfully read {doc_count} documents.")
# Clean up the created directory
print(f"\nCleaning up {output_dir}")
shutil.rmtree(output_dir)
print("Cleanup complete.")
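The quickstart stores each document as a JSON string, so recovering the original fields takes one extra parsing step. The sketch below assumes `stream_data()` yields each document as the same string that was passed to `add_data`. `parse_documents` is a hypothetical helper, and the `sample` list stands in for the reader's output.

```python
import json


def parse_documents(docs):
    """Yield a dict for each JSON-encoded document string, skipping malformed ones."""
    for raw in docs:
        try:
            parsed = json.loads(raw)
        except (TypeError, json.JSONDecodeError):
            continue  # not valid JSON: skip it rather than abort the stream
        if isinstance(parsed, dict):
            yield parsed


# Stand-in for rdr.stream_data(); in real use, pass the reader's generator instead.
sample = [
    json.dumps({"text": "first", "id": 1}),
    "not json",
    json.dumps({"text": "second", "id": 2}),
]
ids = [doc["id"] for doc in parse_documents(sample)]
```

Skipping malformed entries keeps a long stream running. Raising instead would be the safer choice if every document is expected to be valid JSON.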