Rubric
Rubric is an open-source Python library for defining and managing data quality rules for Large Language Model (LLM) datasets. It validates LLM inputs and outputs against predefined criteria, which helps keep data consistent and reliable. The current version is 2.2.0; minor releases ship as features are added and bugs are fixed.
Common errors
- OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a package name or a path to a directory.
  cause The required spaCy language model for the `IS_LANGUAGE` operator has not been downloaded or is not accessible.
  fix Download the missing spaCy model: `python -m spacy download en_core_web_sm` (replace `en_core_web_sm` with the appropriate model if using another language).
- pydantic_core._pydantic_core.ValidationError: 1 validation error for Dataset
  cause The provided data schema (e.g., `Dataset` or `DataQualityRule` instances) does not conform to the expected Pydantic model structure, often due to missing required fields or incorrect types.
  fix Review your `Dataset` and `DataQualityRule` definitions against the library's documentation and examples to ensure all mandatory fields are present and correctly typed.
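The first error above can be caught before the engine runs. Models installed with `python -m spacy download` are ordinary Python packages, so a standard-library lookup is enough to tell whether one is present. The sketch below is a minimal pre-flight check; the helper name `model_is_installed` is illustrative and is not part of Rubric or spaCy.

```python
import importlib.util


def model_is_installed(model_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    # find_spec returns None when no importable module of that name exists.
    return importlib.util.find_spec(model_name) is not None


if __name__ == "__main__":
    model = "en_core_web_sm"
    if not model_is_installed(model):
        print(f"Missing spaCy model; run: python -m spacy download {model}")
```

Running a check like this at start-up turns a mid-validation `OSError` into an early, readable message.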
Warnings
- gotcha When using the `IS_LANGUAGE` operator, you must manually download the required spaCy language models. For English, this is `en_core_web_sm`.
- breaking The `DataQualityRule` class introduced a mandatory `column` field in version 1.1.0, specifying which column the rule applies to. Older code that did not specify a column will break.
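When upgrading across 1.1.0, it can help to find rules that lack a `column` before constructing `DataQualityRule` objects. The sketch below assumes the rules are kept as plain dictionaries (for example, loaded from a JSON or YAML file). That storage format and the helper name `rules_missing_column` are assumptions for illustration, not part of Rubric.

```python
from typing import Any


def rules_missing_column(rule_dicts: list[dict[str, Any]]) -> list[str]:
    """Return the rule_id of every rule definition with no usable `column`."""
    missing = []
    for rule in rule_dicts:
        # Treat an absent key, None, and an empty string as "not specified".
        if not rule.get("column"):
            missing.append(rule.get("rule_id", "<unnamed>"))
    return missing


# Hypothetical legacy definitions: the first predates the `column` field.
legacy_rules = [
    {"rule_id": "length_check", "operator": "LENGTH_BETWEEN", "value": [10, 100]},
    {"rule_id": "language_is_english", "column": "response",
     "operator": "IS_LANGUAGE", "value": "en"},
]
print(rules_missing_column(legacy_rules))  # ['length_check']
```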
Install
- pip install rubric
Imports
- RubricEngine
from rubric.engine import RubricEngine
- Dataset
from rubric.schemas import Dataset
- DataQualityRule
from rubric.schemas import DataQualityRule
- Severity
from rubric.schemas import Severity
- Operator
from rubric.schemas import Operator
Quickstart
from rubric.schemas import Dataset, DataQualityRule, Severity, Operator
from rubric.engine import RubricEngine

# Define your dataset schema
rules = [
    DataQualityRule(
        rule_id="length_check",
        description="Responses should be between 10 and 100 characters.",
        column="response",
        operator=Operator.LENGTH_BETWEEN,
        value=[10, 100],
        severity=Severity.HIGH,
        error_message="Response length out of range."
    ),
    DataQualityRule(
        rule_id="language_is_english",
        description="Responses should be in English.",
        column="response",
        operator=Operator.IS_LANGUAGE,
        value="en",
        severity=Severity.MEDIUM,
        error_message="Response is not in English."
    )
]
dataset_schema = Dataset(rules=rules)

# Initialize RubricEngine with the schema
rubric_engine = RubricEngine(dataset=dataset_schema)

# Sample data to validate
data = [
    {"id": 1, "prompt": "Hello", "response": "This is a short test."},  # Valid
    {"id": 2, "prompt": "Another", "response": "Too short"},  # Invalid (length)
    {"id": 3, "prompt": "Translate", "response": "Ceci n'est pas anglais."},  # Invalid (language)
    {"id": 4, "prompt": "Long response", "response": "a" * 150}  # Invalid (length)
]

# Validate the data
validation_results = rubric_engine.validate(data)
for result in validation_results:
    print(f"ID: {result.id}, Valid: {result.is_valid}, Errors: {result.errors}")

# Expected Output:
# ID: 1, Valid: True, Errors: []
# ID: 2, Valid: False, Errors: ['Response length out of range.']
# ID: 3, Valid: False, Errors: ['Response is not in English.']
# ID: 4, Valid: False, Errors: ['Response length out of range.']
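
With larger datasets, a per-row printout becomes hard to scan. The sketch below tallies error messages across results, relying only on the `is_valid` and `errors` attributes that the Quickstart reads. `FakeResult` and `summarize_errors` are hypothetical names: the stand-in class lets the snippet run without Rubric installed, and the real result type may differ.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Stand-in mirroring the attributes the Quickstart reads from each result."""
    id: int
    is_valid: bool
    errors: list = field(default_factory=list)


def summarize_errors(results) -> Counter:
    """Count how often each error message occurs among invalid results."""
    counts = Counter()
    for result in results:
        if not result.is_valid:
            counts.update(result.errors)
    return counts


# Sample results shaped like the Quickstart's expected output.
sample = [
    FakeResult(1, True),
    FakeResult(2, False, ["Response length out of range."]),
    FakeResult(3, False, ["Response is not in English."]),
    FakeResult(4, False, ["Response length out of range."]),
]
for message, count in summarize_errors(sample).most_common():
    print(f"{count} x {message}")
```

In real use, the list returned by `rubric_engine.validate(data)` would be passed in place of `sample`, assuming its items expose the same two attributes.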