{"library":"rubric","title":"Rubric","description":"Rubric is an open-source Python library for defining and managing data quality rules for Large Language Model (LLM) datasets. It provides a structured way to validate LLM inputs and outputs against predefined criteria, helping ensure data consistency and reliability. The current version is 2.2.0; new minor versions are released as features are added and bugs are fixed.","language":"python","status":"active","last_verified":"Fri Apr 17","install":{"commands":["pip install rubric"],"cli":null},"imports":["from rubric.engine import RubricEngine","from rubric.schemas import Dataset","from rubric.schemas import DataQualityRule","from rubric.schemas import Severity","from rubric.schemas import Operator"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"from rubric.schemas import Dataset, DataQualityRule, Severity, Operator\nfrom rubric.engine import RubricEngine\n\n# Define the data quality rules\nrules = [\n    DataQualityRule(\n        rule_id=\"length_check\",\n        description=\"Responses should be between 10 and 100 characters.\",\n        column=\"response\",\n        operator=Operator.LENGTH_BETWEEN,\n        value=[10, 100],\n        severity=Severity.HIGH,\n        error_message=\"Response length out of range.\"\n    ),\n    DataQualityRule(\n        rule_id=\"language_is_english\",\n        description=\"Responses should be in English.\",\n        column=\"response\",\n        operator=Operator.IS_LANGUAGE,\n        value=\"en\",\n        severity=Severity.MEDIUM,\n        error_message=\"Response is not in English.\"\n    )\n]\n\n# Wrap the rules in a dataset schema\ndataset_schema = Dataset(rules=rules)\n\n# Initialize RubricEngine with the schema\nrubric_engine = RubricEngine(dataset=dataset_schema)\n\n# Sample data to validate\ndata = [\n    {\"id\": 1, \"prompt\": \"Hello\", \"response\": \"This is a short test.\"},  # Valid\n    {\"id\": 2, \"prompt\": \"Another\", \"response\": \"Too short\"},  # Invalid (length)\n    {\"id\": 3, \"prompt\": \"Translate\", \"response\": \"Ceci n'est pas anglais.\"},  # Invalid (language)\n    {\"id\": 4, \"prompt\": \"Long response\", \"response\": \"a\" * 150}  # Invalid (length)\n]\n\n# Validate the data\nvalidation_results = rubric_engine.validate(data)\n\nfor result in validation_results:\n    print(f\"ID: {result.id}, Valid: {result.is_valid}, Errors: {result.errors}\")\n\n# Expected output:\n# ID: 1, Valid: True, Errors: []\n# ID: 2, Valid: False, Errors: ['Response length out of range.']\n# ID: 3, Valid: False, Errors: ['Response is not in English.']\n# ID: 4, Valid: False, Errors: ['Response length out of range.']","lang":"python","description":"This example defines a dataset schema with data quality rules for response length and language, then uses the RubricEngine to validate a list of sample data entries. It uses `DataQualityRule` with two operators (`LENGTH_BETWEEN` and `IS_LANGUAGE`) and two severity levels (`HIGH` and `MEDIUM`).","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}