Inspect AI
Inspect AI is an open-source framework for large language model (LLM) evaluations, developed by the UK AI Security Institute. It provides robust tools for prompt engineering, tool usage, multi-turn dialogue, and model-graded evaluation. The library is actively maintained with frequent releases, often multiple times a month, keeping it compatible with and up to date for evaluating frontier models.
Warnings
- gotcha Using the local sandbox (`--sandbox local`) without an 'outer sandbox' (e.g., a container or VM) is explicitly warned against: tools execute directly on the client system, so model-generated code runs with your permissions.
- gotcha Tool usage (e.g., `bash()`, `python()`) is not supported by every LLM model provider. Tools are executed on the client machine (or its sandbox), not within the model provider's environment.
- gotcha API keys for model providers (e.g., OpenAI, Anthropic, Google) must be correctly configured, typically as environment variables (e.g., `OPENAI_API_KEY`). Evaluations will fail if these are missing or incorrect.
- gotcha By default, raw model API request/response logs are only captured and displayed when an error occurs, which can obscure debugging of successful but unexpected model behavior.
- breaking Compatibility with external LLM client libraries (e.g., `openai`, `anthropic`, `mistralai`) frequently requires specific minimum versions due to upstream breaking changes in those packages. For example, `openai` v1.104.1 became a minimum required version due to type changes and web search action renames.
Install
-
pip install inspect-ai
-
pip install inspect-ai openai
Imports
- Task
from inspect_ai import Task
- task
from inspect_ai import task
- Sample
from inspect_ai.dataset import Sample
- generate
from inspect_ai.solver import generate
- exact
from inspect_ai.scorer import exact
Quickstart
import os

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import exact

@task
def hello_world():
    return Task(
        dataset=[
            Sample(input="Just reply with Hello World", target="Hello World"),
        ],
        solver=[generate()],
        scorer=exact(),
    )

# To run this, save it as a Python file (e.g., hello_eval.py)
# and execute from your terminal:
#   export OPENAI_API_KEY=your_openai_api_key  # or set it in a .env file
#   inspect eval hello_eval.py --model openai/gpt-4o

# Setting the key programmatically (less common than the inspect eval CLI):
# os.environ['OPENAI_API_KEY'] = os.environ.get('OPENAI_API_KEY', 'sk-...')
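The `exact()` scorer in the quickstart checks whether the model's output matches the target. A rough plain-Python illustration of exact-match scoring; this is a simplified sketch of the idea, not the library's actual implementation:

```python
def exact_match(output: str, target: str) -> bool:
    """Simplified sketch of exact-match scoring: the stripped model
    output must equal the stripped target string exactly."""
    return output.strip() == target.strip()

# For the quickstart sample, a completion of "Hello World" would
# score as correct; "Hello, World!" would not.
```

Inspect's real scorers return structured `Score` objects rather than booleans, and `exact()` applies its own normalization, so treat this only as an intuition for how a sample is graded.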