Berkeley Function Calling Leaderboard Evaluation

2026.3.23 · active · verified Thu Apr 16

bfcl-eval is the Python library for the Berkeley Function Calling Leaderboard (BFCL), a benchmark that evaluates Large Language Models (LLMs) on their ability to call functions (use tools) correctly. It provides the evaluation pipeline and datasets, including support for multi-step and multi-turn function calls as of the V3 release. The library is actively maintained with frequent updates; the current PyPI release is 2026.3.23.
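Conceptually, BFCL's non-executable categories score a model by comparing the function call the model emits against a ground-truth call, matching the function name and argument values rather than the raw string. The snippet below is a simplified sketch of that idea using Python's `ast` module; it is not the library's internal checker, and the function name and signature are illustrative.

```python
import ast

def call_matches(model_output: str, expected_name: str, expected_args: dict) -> bool:
    """Sketch of AST-style call matching: parse the model's emitted call and
    compare the function name and keyword-argument values to ground truth."""
    try:
        tree = ast.parse(model_output.strip(), mode="eval")
    except SyntaxError:
        return False  # not a parseable call at all
    call = tree.body
    if not isinstance(call, ast.Call):
        return False
    if not (isinstance(call.func, ast.Name) and call.func.id == expected_name):
        return False
    # Evaluate keyword arguments as Python literals so 'celsius' == "celsius", etc.
    try:
        got = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    except ValueError:
        return False  # non-literal argument values
    return got == expected_args

# A matching call passes; a wrong function name or argument value fails.
call_matches("get_weather(city='Berlin', unit='celsius')",
             "get_weather", {"city": "Berlin", "unit": "celsius"})
```

This kind of structural comparison is why the benchmark is robust to formatting differences (quote style, argument order) that a plain string comparison would penalize.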

Install

pip install bfcl-eval

Imports

from bfcl_eval.eval_pipeline import eval_handler

Quickstart

This quickstart demonstrates how to run an evaluation programmatically with `bfcl-eval`. It constructs an `argparse.Namespace` that mimics the command-line arguments expected by `eval_handler.run_eval`, specifying the dataset, model, and other evaluation parameters. Note that for commercial models such as 'gpt-4o', an API key (e.g., `OPENAI_API_KEY`) must be set as an environment variable.

import argparse
import os
from bfcl_eval.eval_pipeline import eval_handler

# Set your OpenAI API key as an environment variable
# For testing, you might use a placeholder, but for actual runs, it's required.
os.environ['OPENAI_API_KEY'] = os.environ.get('OPENAI_API_KEY', 'YOUR_OPENAI_API_KEY_HERE')

# Create a Namespace object to simulate command-line arguments
# These are common arguments required by the `run_eval` method.
args = argparse.Namespace(
    dataset_name='multi_step_9-8-0', # Example dataset, check docs for available ones
    model_name='gpt-4o',          # Model to evaluate, e.g., 'gpt-4o', 'gemini-1.5-pro'
    num_gpus=0,                   # Set to 0 for CPU execution
    batch_size=1,
    num_eval_prompts=1,           # Number of prompts to evaluate (for quick test)
    output_dir='./bfcl_results',  # Directory to save results
    api_key=os.environ['OPENAI_API_KEY'], # Passed via args or env var
    temp=0.7,
    top_p=1.0,
    max_tokens=2000,
    system_prompt_path=None,
    eval_mode='full',
    eval_version='v3',            # Refers to the benchmark version (V1, V2, V3)
    enable_tool_code_execution=False, # Set to True to enable code execution (requires sandboxing)
    enable_parallel=False,
    num_threads=1,
    live_data=False
)

print(f"Starting BFCL evaluation for dataset '{args.dataset_name}' with model '{args.model_name}'...")

try:
    # Run the evaluation pipeline
    results = eval_handler.run_eval(args)
    print("\nEvaluation Complete!")
    print("Results:")
    print(results)
except Exception as e:
    print(f"\nAn error occurred during evaluation: {e}")
    # The env var is always set above, so check for the placeholder value instead.
    if os.environ.get('OPENAI_API_KEY', '') in ('', 'YOUR_OPENAI_API_KEY_HERE'):
        print("Please ensure your OPENAI_API_KEY environment variable is set to a real key.")
    print("Check the dataset name, model name, and API key configurations.")
