LM Dataformat

0.0.20 · abandoned · verified Thu Apr 16

LM Dataformat (lm-dataformat) is a Python utility for efficiently storing and reading data files used in large language model (LLM) training. It can archive documents with associated metadata and stream them back for processing. The current version is 0.0.20, but the project appears to be abandoned: there has been no release since 2021 and no GitHub commit in over six years.

Common errors

Warnings

Install

pip install lm-dataformat

Imports

from lm_dataformat import Archive, Reader

Quickstart

This quickstart demonstrates how to create a data archive using `lm-dataformat` by adding multiple JSON-formatted documents with associated metadata, committing the archive, and then reading the stored data back. It uses a temporary directory for demonstration and includes cleanup.

import shutil
import json
from lm_dataformat import Archive, Reader

# Define output directory
output_dir = 'lm_data_archive'

# --- Writing Data ---
print(f"Creating archive in {output_dir}")
ar = Archive(output_dir)

# Add some sample data
ar.add_data(json.dumps({'text': 'This is the first document for LLM training.', 'id': 1}), meta={'source': 'quickstart'})
ar.add_data(json.dumps({'text': 'A second document with different content.', 'id': 2}), meta={'author': 'gemini'})
ar.add_data(json.dumps({'text': 'The third and final document.', 'id': 3}), meta={'source': 'quickstart', 'version': '1.0'})

# Commit changes to finalize the archive
ar.commit()
print("Archive created and committed.")

# --- Reading Data ---
print(f"\nReading data from {output_dir}")
rdr = Reader(output_dir)
doc_count = 0
for doc in rdr.stream_data():
    doc_count += 1
    print(f"  Document {doc_count}: {doc}")

print(f"Successfully read {doc_count} documents.")

# Clean up the created directory
print(f"\nCleaning up {output_dir}")
shutil.rmtree(output_dir)
print("Cleanup complete.")
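On disk, each document added via `add_data` becomes a JSON record of the form `{'text': ..., 'meta': ...}` in a compressed jsonl shard, and `Reader.stream_data(get_meta=True)` yields `(text, meta)` pairs instead of bare documents. The sketch below mirrors that behavior on an uncompressed shard using only the standard library (the record layout and the hypothetical `stream_jsonl` helper are illustrative assumptions; real archives are zstandard-compressed):

```python
import json
import os
import tempfile

# Illustrative records in the {'text': ..., 'meta': ...} layout that
# lm-dataformat writes, one JSON object per line in a jsonl shard.
records = [
    {'text': 'This is the first document for LLM training.', 'meta': {'source': 'quickstart'}},
    {'text': 'A second document with different content.', 'meta': {'author': 'gemini'}},
]

def stream_jsonl(path, get_meta=False):
    """Yield documents from a jsonl shard, optionally paired with metadata."""
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield (rec['text'], rec['meta']) if get_meta else rec['text']

# Write an uncompressed shard, then stream it back with metadata attached.
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, 'data_0.jsonl')
    with open(shard, 'w') as f:
        for rec in records:
            f.write(json.dumps(rec) + '\n')
    docs = list(stream_jsonl(shard, get_meta=True))

print(docs[0])
```

This is the uncompressed analogue of what `Reader.stream_data` does; pass `get_meta=True` to the real reader when you need each document's metadata alongside its text.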
