llama-cpp-python: Python Bindings for llama.cpp

0.3.20 · active · verified Tue Apr 14

Python bindings for the `llama.cpp` library, enabling efficient local inference of large language models (LLMs) on a range of hardware, including CPUs and GPUs (NVIDIA CUDA, Apple Metal, AMD ROCm). It provides both a high-level API for easy model interaction and a low-level API with direct access to the `llama.cpp` C API. The library is actively maintained with frequent releases that track upstream `llama.cpp` changes; the current version is 0.3.20.

Warnings

Install
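A minimal install sketch. The base package installs from PyPI; because the build compiles `llama.cpp` from source, a working C/C++ compiler and CMake are required unless a prebuilt wheel matches your platform. The `CMAKE_ARGS` backend flags below follow the project's README conventions for recent versions; verify them against the version you install, since flag names have changed across releases.

```shell
# CPU-only install (compiles llama.cpp from source if no wheel is available)
pip install llama-cpp-python

# NVIDIA GPU (CUDA) build — assumes the CUDA toolkit is installed
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Apple Silicon (Metal) build
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

Reinstalling with different `CMAKE_ARGS` requires forcing a rebuild (e.g. `pip install --force-reinstall --no-cache-dir llama-cpp-python`), since pip will otherwise reuse the cached CPU-only wheel.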

Imports
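Typical imports, assuming the package is installed as above. The high-level `Llama` class covers most use cases; the `llama_cpp` module itself exposes the low-level ctypes bindings.

```python
# High-level API: the Llama class wraps model loading, tokenization,
# and text/chat completion behind a single object.
from llama_cpp import Llama

# Low-level API: the llama_cpp module exposes ctypes bindings that
# mirror the llama.cpp C API directly.
import llama_cpp
```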

Quickstart

This quickstart demonstrates how to load a GGUF model and generate text using the high-level `Llama` class. It highlights important parameters like `model_path`, `n_ctx` for context size, and `n_gpu_layers` for GPU offloading. Ensure your model is in the GGUF format.

import os
from llama_cpp import Llama

# Ensure you have a GGUF model downloaded, e.g., to a 'models' directory.
# Example: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
model_path = os.environ.get('LLAMA_MODEL_PATH', './models/llama-2-7b-chat.Q4_K_M.gguf')

# Initialize the Llama model
# Set n_gpu_layers to a value > 0 for GPU acceleration (requires GPU install config)
llm = Llama(
    model_path=model_path,
    n_ctx=2048,  # Context window size
    n_gpu_layers=0, # Set to > 0 for GPU, -1 to offload all layers if GPU is available
    verbose=False  # Suppress llama.cpp verbose output
)

# Generate a completion
prompt = "Q: Name the planets in the solar system? A: "
output = llm(prompt, max_tokens=128, stop=["Q:", "\n"], echo=True)

print(output["choices"][0]["text"])
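The `output` returned by the call above is an OpenAI-style completion dict, which is why the text is read via `output["choices"][0]["text"]`. A self-contained sketch of its shape and how to pull fields out of it (the literal values here are illustrative placeholders, not real model output):

```python
# Illustrative response shape: field names follow the OpenAI-style
# completion objects that llama-cpp-python returns; values are made up.
sample_output = {
    "id": "cmpl-xxxxxxxx",
    "object": "text_completion",
    "created": 1700000000,
    "model": "./models/llama-2-7b-chat.Q4_K_M.gguf",
    "choices": [
        {
            # With echo=True, the prompt is included at the start of "text".
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth",
            "index": 0,
            "logprobs": None,
            # "stop" means a stop sequence or EOS ended generation;
            # "length" means max_tokens was reached.
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 6, "total_tokens": 20},
}

text = sample_output["choices"][0]["text"]
finish_reason = sample_output["choices"][0]["finish_reason"]
```

Checking `finish_reason` is useful in practice: a value of `"length"` indicates the completion was truncated by `max_tokens` rather than finishing naturally.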
