{"id":9089,"library":"lm-dataformat","title":"LM Dataformat","description":"LM Dataformat (lm-dataformat) is a Python utility for efficiently storing and reading files tailored to large language model (LLM) training. It provides functionality to archive data with associated metadata and to stream documents back for processing. The current version is 0.0.20, but the project appears to be abandoned, with no release since 2021 and no GitHub commit in over six years.","status":"abandoned","version":"0.0.20","language":"en","source_language":"en","source_url":"https://github.com/leogao2/lm_dataformat","tags":["LLM","data storage","data format","archive","natural language processing"],"install":[{"cmd":"pip install lm-dataformat","lang":"bash","label":"PyPI"}],"dependencies":[{"reason":"Used for progress bars during data processing and listed in the package's PyPI metadata, though not strictly required for core functionality.","package":"tqdm","optional":true}],"imports":[{"symbol":"Archive","correct":"from lm_dataformat import Archive"},{"symbol":"Reader","correct":"from lm_dataformat import Reader"}],"quickstart":{"code":"import shutil\nimport json\nfrom lm_dataformat import Archive, Reader\n\n# Define output directory\noutput_dir = 'lm_data_archive'\n\n# --- Writing Data ---\nprint(f\"Creating archive in {output_dir}\")\nar = Archive(output_dir)\n\n# Add some sample data\nar.add_data(json.dumps({'text': 'This is the first document for LLM training.', 'id': 1}), meta={'source': 'quickstart'})\nar.add_data(json.dumps({'text': 'A second document with different content.', 'id': 2}), meta={'author': 'gemini'})\nar.add_data(json.dumps({'text': 'The third and final document.', 'id': 3}), meta={'source': 'quickstart', 'version': '1.0'})\n\n# Commit changes to finalize the archive\nar.commit()\nprint(\"Archive created and committed.\")\n\n# --- Reading Data ---\nprint(f\"\\nReading data from 
{output_dir}\")\nrdr = Reader(output_dir)\ndoc_count = 0\nfor doc in rdr.stream_data():\n    doc_count += 1\n    print(f\"  Document {doc_count}: {doc}\")\n\nprint(f\"Successfully read {doc_count} documents.\")\n\n# Clean up the created directory\nprint(f\"\\nCleaning up {output_dir}\")\nshutil.rmtree(output_dir)\nprint(\"Cleanup complete.\")","lang":"python","description":"This quickstart demonstrates how to create a data archive using `lm-dataformat` by adding multiple JSON-formatted documents with associated metadata, committing the archive, and then reading the stored data back. It uses a temporary directory for demonstration and includes cleanup."},"warnings":[{"fix":"Consider migrating to actively maintained alternatives for LLM data handling, especially for new projects or those requiring long-term stability and security updates.","message":"The lm-dataformat library appears to be abandoned, with the last PyPI release in August 2021 and the last GitHub commit over six years ago. This means no new features, bug fixes, or compatibility updates for newer Python versions or external libraries are expected.","severity":"breaking","affected_versions":"0.0.20 and earlier"},{"fix":"Pin exact dependency versions if using this library in a production environment, and thoroughly test for compatibility. Be prepared to fork and maintain the library yourself or migrate if critical issues arise.","message":"Lack of active maintenance may lead to compatibility issues with newer Python versions or other evolving ecosystem libraries, potentially causing unexpected errors or silent failures.","severity":"gotcha","affected_versions":"0.0.20 and earlier"},{"fix":"Implement custom validation and error handling around `add_data` and `stream_data` calls. 
Ensure data is well-formed before archiving, and add logging to monitor the integrity of the data processing pipeline.","message":"The library does not always provide robust error handling or detailed logging, which can make debugging issues such as corrupted archives or malformed data challenging.","severity":"gotcha","affected_versions":"0.0.20 and earlier"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Always call `Archive.commit()` after adding all data to finalize the archive. Ensure the path passed to `Reader` is the directory where `Archive.commit()` was executed successfully.","cause":"The `commit()` method was not called after adding data to the `Archive`, or the directory specified for `Reader` does not contain a valid, committed archive.","error":"FileNotFoundError: [Errno 2] No such file or directory: 'output_dir/meta.jsonl.zst'"},{"fix":"This can indicate data corruption or an interruption during archive creation. Ensure sufficient disk space, proper permissions, and that the `commit()` method is called without interruption. Inspect the `output_dir` for partially written `.zst` files. If the problem persists, it may be an unaddressed bug in the abandoned library.","cause":"This error, reported in GitHub issues, indicates that a data chunk was never finalized during archiving, likely because a write operation was interrupted or the archive was left in a corrupted state.","error":"\"current chunk incomplete\" without any json1.zst file"},{"fix":"Ensure that any data you wish to store as a document is first converted into a string format, typically by using `json.dumps()` for Python dictionaries/lists, or `.decode('utf-8')` for byte strings.","cause":"The `add_data` method expects a string (often JSON stringified) for the document, but raw bytes or a non-serializable Python object was passed.","error":"TypeError: Object of type bytes is not JSON serializable"}]}