Rubric

2.2.0 · active · verified Fri Apr 17

Rubric is an open-source Python library for defining and managing data quality rules for Large Language Model (LLM) datasets. It validates LLM inputs and outputs against predefined criteria to help keep data consistent and reliable. The current version is 2.2.0; minor releases ship as features and bug fixes accumulate.

Common errors

Warnings

Install

Imports

Quickstart

This example defines a dataset schema with two data quality rules on the `response` column (length and language), then uses `RubricEngine` to validate a list of sample entries. Each `DataQualityRule` pairs an operator with a comparison value, a severity level, and an error message.

from rubric.schemas import Dataset, DataQualityRule, Severity, Operator
from rubric.engine import RubricEngine

# Define your dataset schema
rules = [
    DataQualityRule(
        rule_id="length_check",
        description="Responses should be between 10 and 100 characters.",
        column="response",
        operator=Operator.LENGTH_BETWEEN,
        value=[10, 100],
        severity=Severity.HIGH,
        error_message="Response length out of range."
    ),
    DataQualityRule(
        rule_id="language_is_english",
        description="Responses should be in English.",
        column="response",
        operator=Operator.IS_LANGUAGE,
        value="en",
        severity=Severity.MEDIUM,
        error_message="Response is not in English."
    )
]

dataset_schema = Dataset(rules=rules)

# Initialize RubricEngine with the schema
rubric_engine = RubricEngine(dataset=dataset_schema)

# Sample data to validate (each entry is checked against both rules)
data = [
    {"id": 1, "prompt": "Hello", "response": "This is a short test."},  # Valid (21 characters, English)
    {"id": 2, "prompt": "Another", "response": "Too short"},  # Invalid (length: 9 characters)
    {"id": 3, "prompt": "Translate", "response": "Ceci n'est pas anglais."},  # Invalid (language: French)
    {"id": 4, "prompt": "Long response", "response": "a" * 150},  # Invalid (length: 150 characters)
]

# Validate the data
validation_results = rubric_engine.validate(data)

for result in validation_results:
    print(f"ID: {result.id}, Valid: {result.is_valid}, Errors: {result.errors}")

# Expected Output:
# ID: 1, Valid: True, Errors: []
# ID: 2, Valid: False, Errors: ['Response length out of range.']
# ID: 3, Valid: False, Errors: ['Response is not in English.']
# ID: 4, Valid: False, Errors: ['Response length out of range.']
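
To make the length rule's semantics concrete, the following is a minimal, library-independent sketch of what a check like `LENGTH_BETWEEN` with `value=[10, 100]` might do. It does not use or reproduce Rubric's internals: `RowResult` and `check_length_between` are hypothetical names introduced only for illustration, and inclusive bounds are an assumption. The `id`, `is_valid`, and `errors` fields simply mirror the attributes printed in the quickstart.

```python
from dataclasses import dataclass, field


@dataclass
class RowResult:
    """Hypothetical stand-in mirroring the attributes printed in the quickstart."""
    id: int
    is_valid: bool
    errors: list = field(default_factory=list)


def check_length_between(rows, column, low, high, error_message):
    """Flag rows whose `column` length is outside [low, high] (inclusive bounds assumed)."""
    results = []
    for row in rows:
        ok = low <= len(row[column]) <= high
        results.append(RowResult(row["id"], ok, [] if ok else [error_message]))
    return results


rows = [
    {"id": 1, "response": "This is a short test."},  # 21 characters
    {"id": 2, "response": "Too short"},              # 9 characters
    {"id": 4, "response": "a" * 150},                # 150 characters
]

results = check_length_between(
    rows, "response", 10, 100, "Response length out of range."
)

# Collect the ids of rows that failed the check
failed_ids = [r.id for r in results if not r.is_valid]
print(failed_ids)  # [2, 4]
```

Collecting failed ids this way is one simple option for filtering a dataset before further processing; the same pattern should apply to any result object that exposes `id` and `is_valid`.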
