llama-cpp-python: Python Bindings for llama.cpp
Python bindings for the `llama.cpp` library, enabling efficient local inference of large language models (LLMs) on various hardware, including CPUs and GPUs (NVIDIA, Apple Metal, AMD ROCm). It provides both a high-level API for easy model interaction and a low-level API for direct C API access. The library is actively maintained with frequent updates, often mirroring upstream `llama.cpp` changes, and currently stands at version 0.3.20.
Warnings
- breaking Installation with GPU acceleration (CUDA, Metal, ROCm) often requires setting specific `CMAKE_ARGS` environment variables or using pre-built wheels from custom index URLs, along with correctly configured C++ compilers and GPU toolkits. Default `pip install` typically provides CPU-only support.
- gotcha On Apple Silicon (M1/M2) Macs, `llama-cpp-python` can default to building an x86 version if an ARM64 Python interpreter is not used. This results in significantly slower performance.
- breaking The library closely tracks upstream `llama.cpp` development, which can introduce breaking changes to the underlying C API that may propagate to the Python bindings. Notable past changes include the transition to GGUF model format and changes in KV cache management functions.
- gotcha LLM models require specific chat formats (e.g., 'llama-2', 'chatml') for proper conversational interaction. Using the wrong format can lead to 'weird' or unparsable responses, especially in chat completion APIs.
- gotcha To generate embeddings, you must explicitly enable them by passing `embedding=True` to the `Llama` constructor when initializing the model.
Install
- pip install llama-cpp-python
- pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
- CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
- CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
- CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python
- CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
Note: older releases used the `LLAMA_CUBLAS` / `LLAMA_METAL` / `LLAMA_HIPBLAS` / `LLAMA_OPENBLAS` flag spellings; current versions use the `GGML_*` names shown above.
Imports
- Llama
from llama_cpp import Llama
- LlamaGrammar
from llama_cpp import LlamaGrammar
- LlamaHFTokenizer
from llama_cpp import LlamaHFTokenizer
Quickstart
import os
from llama_cpp import Llama
# Ensure you have a GGUF model downloaded, e.g., to a 'models' directory.
# Example: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
model_path = os.environ.get('LLAMA_MODEL_PATH', './models/llama-2-7b-chat.Q4_K_M.gguf')
# Initialize the Llama model
# Set n_gpu_layers to a value > 0 for GPU acceleration (requires GPU install config)
llm = Llama(
    model_path=model_path,
    n_ctx=2048,      # Context window size
    n_gpu_layers=0,  # Set to > 0 for GPU, or -1 to offload all layers (requires a GPU build)
    verbose=False,   # Suppress llama.cpp verbose output
)
# Generate a completion
prompt = "Q: Name the planets in the solar system? A: "
output = llm(prompt, max_tokens=128, stop=["Q:", "\n"], echo=True)
print(output["choices"][0]["text"])