Berkeley Function Calling Leaderboard Evaluation
bfcl-eval is the Python library for the Berkeley Function Calling Leaderboard (BFCL), a benchmark that evaluates Large Language Models (LLMs) on their ability to perform function calling. It provides the evaluation pipeline and datasets, including support for multi-step and multi-turn function calls as of the V3 release. The library is actively maintained with frequent releases on PyPI (e.g., version 2026.3.23).
Common errors
- openai.BadRequestError: The model `gpt-4o` does not exist or you do not have access to it.
  Cause: The OpenAI API key is missing, invalid, or your account does not have access to the specified model (e.g., insufficient tier, region restrictions).
  Fix: Set the `OPENAI_API_KEY` environment variable to a valid key and verify that your OpenAI account has access to the model you are requesting.
- ModuleNotFoundError: No module named 'bfcl_eval'
  Cause: The `bfcl-eval` package is not installed in the current Python environment, or the environment is not active.
  Fix: Install the package with `pip install bfcl-eval`. If you are using a virtual environment, make sure it is activated.
- argparse.ArgumentError: argument --dataset_name: invalid choice: 'old_dataset_name' (choose from 'multi_step_9-8-0', 'multi_turn_base_34', ...)
  Cause: The specified dataset name is incorrect, deprecated, or not available in the installed version of `bfcl-eval` or for the chosen `eval_version`.
  Fix: Consult the `bfcl-eval` documentation or the `gorilla` GitHub repository's README for the `dataset_name` values supported by your installed `bfcl-eval` version and target `eval_version`.
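The first two fixes above can be applied directly from the shell; the key value below is a placeholder, not a real credential:

```shell
# Export a valid OpenAI key for the current shell session
# (the value here is a placeholder; substitute your own key).
export OPENAI_API_KEY="your-key-here"
```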
Warnings
- gotcha Many models (e.g., GPT-4o, Gemini) require an API key, supplied either as an environment variable (e.g., `OPENAI_API_KEY`) or directly via `args.api_key`. Forgetting this is a common source of errors.
- breaking The Berkeley Function Calling Leaderboard has evolved through multiple versions (V1, V2, V3), which often involve changes to dataset names, formats, and evaluation methodologies. Using an older `dataset_name` with a newer evaluation pipeline, or vice-versa, can lead to `ArgumentError` or incorrect results.
- gotcha The `bfcl-eval` package is a component of the larger 'Gorilla' project. Users sometimes confuse installing the `bfcl-eval` PyPI package with directly cloning and running scripts from the `Gorilla` GitHub repository's `berkeley-function-call-leaderboard` subdirectory. This can lead to `ModuleNotFoundError` if imports are based on the repository structure instead of the installed package structure.
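The missing-key gotcha above can be caught with a small pre-flight check before the pipeline starts; `require_api_key` is a hypothetical helper for illustration, not part of `bfcl-eval`:

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY"):
    """Return the value of the named key, or None if it is unset or blank.

    Hypothetical pre-flight helper; not part of the bfcl-eval API.
    """
    key = os.environ.get(name, "").strip()
    return key or None

# Example: refuse to start an evaluation without a key.
if require_api_key() is None:
    print("OPENAI_API_KEY is not set; export it before running the evaluation.")
```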
Install
- `pip install bfcl-eval`
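After installing, you can confirm the package is visible to the active interpreter with a quick standard-library check (useful for diagnosing the `ModuleNotFoundError` above):

```python
import importlib.util

# find_spec returns None when the package cannot be imported
# from the current environment.
spec = importlib.util.find_spec("bfcl_eval")
status = "installed" if spec is not None else "not installed"
print(f"bfcl_eval is {status}")
```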
Imports
Either import path may resolve, depending on the installed release; use whichever works in your environment.
- eval_handler
  from bfcl_eval import eval_handler
  from bfcl_eval.eval_pipeline import eval_handler
- EvalMetrics
  from bfcl_eval import EvalMetrics
  from bfcl_eval.eval_pipeline import EvalMetrics
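Since both import paths have appeared across releases, a guarded import that falls back between them is a safe pattern (a sketch; which path resolves depends on your installed version):

```python
# Try the submodule path first, then fall back to the package root;
# leave the name as None if neither resolves (package not installed).
try:
    from bfcl_eval.eval_pipeline import eval_handler
except ImportError:
    try:
        from bfcl_eval import eval_handler
    except ImportError:
        eval_handler = None

if eval_handler is None:
    print("bfcl-eval is not installed; run `pip install bfcl-eval` first.")
```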
Quickstart
import argparse
import os

from bfcl_eval.eval_pipeline import eval_handler

# Set your OpenAI API key as an environment variable.
# The placeholder below is for illustration only; real runs require a valid key.
os.environ['OPENAI_API_KEY'] = os.environ.get('OPENAI_API_KEY', 'YOUR_OPENAI_API_KEY_HERE')

# Create a Namespace object to simulate command-line arguments.
# These are common arguments expected by the `run_eval` method.
args = argparse.Namespace(
    dataset_name='multi_step_9-8-0',   # Example dataset; check the docs for available ones
    model_name='gpt-4o',               # Model to evaluate, e.g., 'gpt-4o', 'gemini-1.5-pro'
    num_gpus=0,                        # Set to 0 for CPU execution
    batch_size=1,
    num_eval_prompts=1,                # Number of prompts to evaluate (for a quick test)
    output_dir='./bfcl_results',       # Directory to save results
    api_key=os.environ['OPENAI_API_KEY'],  # Passed via args or env var
    temp=0.7,
    top_p=1.0,
    max_tokens=2000,
    system_prompt_path=None,
    eval_mode='full',
    eval_version='v3',                 # Benchmark version (V1, V2, V3)
    enable_tool_code_execution=False,  # True enables code execution (requires sandboxing)
    enable_parallel=False,
    num_threads=1,
    live_data=False,
)

print(f"Starting BFCL evaluation for dataset '{args.dataset_name}' with model '{args.model_name}'...")

try:
    # Run the evaluation pipeline
    results = eval_handler.run_eval(args)
    print("\nEvaluation Complete!")
    print("Results:")
    print(results)
except Exception as e:
    print(f"\nAn error occurred during evaluation: {e}")
    # The key is always present after the assignment above, so also
    # treat the placeholder value as "not configured".
    if os.environ.get('OPENAI_API_KEY', '') in ('', 'YOUR_OPENAI_API_KEY_HERE'):
        print("Please ensure your OPENAI_API_KEY environment variable is set correctly.")
    print("Check the dataset name, model name, and API key configurations.")
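The Namespace above can equally be produced by an `argparse.ArgumentParser`, which is convenient for command-line use. The flag names below simply mirror the Quickstart fields and are assumptions for illustration, not a documented CLI:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors a subset of the Namespace fields used in the Quickstart;
    # flag names are assumed, not taken from bfcl-eval's own CLI.
    parser = argparse.ArgumentParser(description="BFCL evaluation (sketch)")
    parser.add_argument("--dataset_name", default="multi_step_9-8-0")
    parser.add_argument("--model_name", default="gpt-4o")
    parser.add_argument("--num_gpus", type=int, default=0)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--num_eval_prompts", type=int, default=1)
    parser.add_argument("--output_dir", default="./bfcl_results")
    parser.add_argument("--eval_version", default="v3")
    return parser

# Parsing an empty argv yields the same defaults as the Quickstart.
args = build_parser().parse_args([])
print(args.model_name)  # prints "gpt-4o"
```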