{"id":10203,"library":"rubric","title":"Rubric","description":"Rubric is an open-source Python library designed to define and manage data quality rules for Large Language Model (LLM) datasets. It provides a structured way to validate LLM inputs and outputs against predefined criteria, helping ensure data consistency and reliability. The current version is 2.2.0, and it follows a minor release cadence based on feature additions and bug fixes.","status":"active","version":"2.2.0","language":"en","source_language":"en","source_url":"https://github.com/The-LLM-Data-Company/rubric","tags":["LLM","AI","evaluation","data quality","rubric","validation","pydantic"],"install":[{"cmd":"pip install rubric","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Used for defining data schemas and validation models.","package":"pydantic"},{"reason":"Used for language detection in 'IS_LANGUAGE' rules.","package":"langdetect"},{"reason":"Used for advanced language processing, specifically with 'IS_LANGUAGE' rules. Requires manual model download.","package":"spacy"},{"reason":"Numerical operations, likely underlying data structures.","package":"numpy"},{"reason":"Scientific computing, likely for statistical or advanced data processing.","package":"scipy"}],"imports":[{"symbol":"RubricEngine","correct":"from rubric.engine import RubricEngine"},{"symbol":"Dataset","correct":"from rubric.schemas import Dataset"},{"symbol":"DataQualityRule","correct":"from rubric.schemas import DataQualityRule"},{"symbol":"Severity","correct":"from rubric.schemas import Severity"},{"symbol":"Operator","correct":"from rubric.schemas import Operator"}],"quickstart":{"code":"from rubric.schemas import Dataset, DataQualityRule, Severity, Operator\nfrom rubric.engine import RubricEngine\n\n# Define your dataset schema\nrules = [\n    DataQualityRule(\n        rule_id=\"length_check\",\n        description=\"Responses should be between 10 and 100 characters.\",\n        column=\"response\",\n        operator=Operator.LENGTH_BETWEEN,\n        value=[10, 100],\n        severity=Severity.HIGH,\n        error_message=\"Response length out of range.\"\n    ),\n    DataQualityRule(\n        rule_id=\"language_is_english\",\n        description=\"Responses should be in English.\",\n        column=\"response\",\n        operator=Operator.IS_LANGUAGE,\n        value=\"en\",\n        severity=Severity.MEDIUM,\n        error_message=\"Response is not in English.\"\n    )\n]\n\ndataset_schema = Dataset(rules=rules)\n\n# Initialize RubricEngine with the schema\nrubric_engine = RubricEngine(dataset=dataset_schema)\n\n# Sample data to validate\ndata = [\n    {\"id\": 1, \"prompt\": \"Hello\", \"response\": \"This is a short test.\"}, # Valid\n    {\"id\": 2, \"prompt\": \"Another\", \"response\": \"Too short\"}, # Invalid (length)\n    {\"id\": 3, \"prompt\": \"Translate\", \"response\": \"Ceci n'est pas anglais.\"}, # Invalid (language)\n    {\"id\": 4, \"prompt\": \"Long response\", \"response\": \"a\" * 150} # Invalid (length)\n]\n\n# Validate the data\nvalidation_results = rubric_engine.validate(data)\n\nfor result in validation_results:\n    print(f\"ID: {result.id}, Valid: {result.is_valid}, Errors: {result.errors}\")\n\n# Expected Output:\n# ID: 1, Valid: True, Errors: []\n# ID: 2, Valid: False, Errors: ['Response length out of range.']\n# ID: 3, Valid: False, Errors: ['Response is not in English.']\n# ID: 4, Valid: False, Errors: ['Response length out of range.']","lang":"python","description":"This example demonstrates how to define a dataset schema with data quality rules for response length and language, then use the RubricEngine to validate a list of sample data entries. It showcases the use of `DataQualityRule` with various operators and severity levels."},"warnings":[{"fix":"Run `python -m spacy download en_core_web_sm` in your environment for English. Adjust the model name as needed for other languages.","message":"When using the `IS_LANGUAGE` operator, you must manually download the required spaCy language models. For English, this is `en_core_web_sm`.","severity":"gotcha","affected_versions":">=1.0.0"},{"fix":"Ensure all `DataQualityRule` instances explicitly set the `column` argument, e.g., `column=\"your_data_column_name\"`.","message":"The `DataQualityRule` class introduced a mandatory `column` field in version 1.1.0, specifying which column the rule applies to. Older code that did not specify a column will break.","severity":"breaking","affected_versions":">=1.1.0"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Download the missing spaCy model: `python -m spacy download en_core_web_sm` (replace 'en_core_web_sm' with the appropriate model if using another language).","cause":"The required spaCy language model for the 'IS_LANGUAGE' operator has not been downloaded or is not accessible.","error":"OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a package name or a path to a directory."},{"fix":"Carefully review your `Dataset` and `DataQualityRule` definitions against the library's documentation and examples to ensure all mandatory fields are present and correctly typed.","cause":"The provided data schema (e.g., `Dataset` or `DataQualityRule` instances) does not conform to the expected Pydantic model structure, often due to missing required fields or incorrect types.","error":"pydantic_core._pydantic_core.ValidationError: 1 validation error for Dataset"}]}