{"id":5782,"library":"llama-cpp-python","title":"llama-cpp-python: Python Bindings for llama.cpp","description":"Python bindings for the `llama.cpp` library, enabling efficient local inference of large language models (LLMs) on various hardware, including CPUs and GPUs (NVIDIA, Apple Metal, AMD ROCm). It provides both a high-level API for easy model interaction and a low-level API for direct C API access. The library is actively maintained with frequent updates, often mirroring upstream `llama.cpp` changes, and currently stands at version 0.3.20.","status":"active","version":"0.3.20","language":"en","source_language":"en","source_url":"https://github.com/abetlen/llama-cpp-python","tags":["LLM","bindings","inference","NLP","AI","local-inference","GGUF","CUDA","Metal"],"install":[{"cmd":"pip install llama-cpp-python","lang":"bash","label":"Basic CPU support (builds from source)"},{"cmd":"pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu","lang":"bash","label":"Basic CPU support (pre-built wheel)"},{"cmd":"CMAKE_ARGS=\"-DGGML_CUDA=on\" pip install llama-cpp-python","lang":"bash","label":"With NVIDIA CUDA (requires CUDA Toolkit & C++ compiler)"},{"cmd":"CMAKE_ARGS=\"-DGGML_METAL=on\" pip install llama-cpp-python","lang":"bash","label":"With Apple Metal GPU (macOS, requires arm64 Python)"},{"cmd":"CMAKE_ARGS=\"-DGGML_HIPBLAS=on\" pip install llama-cpp-python","lang":"bash","label":"With AMD ROCm/hipBLAS (requires ROCm toolkit)"},{"cmd":"CMAKE_ARGS=\"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\" pip install llama-cpp-python","lang":"bash","label":"With OpenBLAS (optimized CPU)"}],"dependencies":[{"reason":"Required to build `llama.cpp` from source, which is the default installation method and often necessary for GPU acceleration. 
On Windows, Visual Studio with 'Desktop development with C++' is typically needed; on Linux, `gcc` and `g++`; on macOS, Xcode Command Line Tools.","package":"C/C++ compiler","optional":false},{"reason":"Required for NVIDIA GPU acceleration (cuBLAS backend).","package":"CUDA Toolkit","optional":true},{"reason":"Required for AMD GPU acceleration (hipBLAS backend).","package":"ROCm","optional":true}],"imports":[{"note":"This is the primary high-level class for interacting with loaded models.","symbol":"Llama","correct":"from llama_cpp import Llama"},{"note":"Used for constrained text generation with GBNF grammars (e.g., JSON output).","symbol":"LlamaGrammar","correct":"from llama_cpp import LlamaGrammar"},{"note":"Required for certain models (like Functionary v2) where HuggingFace tokenizers are needed due to discrepancies with `llama.cpp`'s default tokenizer.","symbol":"LlamaHFTokenizer","correct":"from llama_cpp import LlamaHFTokenizer"}],"quickstart":{"code":"import os\nfrom llama_cpp import Llama\n\n# Ensure you have a GGUF model downloaded, e.g., to a 'models' directory.\n# Example: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf\nmodel_path = os.environ.get('LLAMA_MODEL_PATH', './models/llama-2-7b-chat.Q4_K_M.gguf')\n\n# Initialize the Llama model\n# Set n_gpu_layers to a value > 0 for GPU acceleration (requires GPU install config)\nllm = Llama(\n    model_path=model_path,\n    n_ctx=2048,  # Context window size\n    n_gpu_layers=0, # Set to > 0 for GPU, -1 to offload all layers if GPU is available\n    verbose=False  # Suppress llama.cpp verbose output\n)\n\n# Generate a completion\nprompt = \"Q: Name the planets in the solar system? A: \"\noutput = llm(prompt, max_tokens=128, stop=[\"Q:\", \"\\n\"], echo=True)\n\nprint(output[\"choices\"][0][\"text\"])","lang":"python","description":"This quickstart demonstrates how to load a GGUF model and generate text using the high-level `Llama` class. 
It highlights important parameters like `model_path`, `n_ctx` for context size, and `n_gpu_layers` for GPU offloading. Ensure your model is in the GGUF format."},"warnings":[{"fix":"Refer to the official documentation or GitHub README for the specific `CMAKE_ARGS` and installation instructions for your desired backend (e.g., `CMAKE_ARGS=\"-DGGML_CUDA=on\" pip install llama-cpp-python`). Ensure all necessary C++ compilers and GPU toolkits are correctly installed and configured in your system's PATH.","message":"Installation with GPU acceleration (CUDA, Metal, ROCm) often requires setting specific `CMAKE_ARGS` environment variables or using pre-built wheels from custom index URLs, along with correctly configured C++ compilers and GPU toolkits. Default `pip install` typically provides CPU-only support.","severity":"breaking","affected_versions":"All versions"},{"fix":"Ensure you are using an ARM64 version of Python (e.g., from Miniforge). You might need to explicitly specify `CMAKE_ARGS=\"-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on\"` during installation.","message":"On Apple Silicon (M1/M2) Macs, `llama-cpp-python` can default to building an x86 version if an ARM64 Python interpreter is not used. This results in significantly slower performance.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always check the changelog when upgrading `llama-cpp-python`. If upgrading, use `--upgrade --force-reinstall --no-cache-dir` to ensure a clean rebuild from source. Be prepared for potential API adjustments, particularly in low-level usage or integrations with other libraries.","message":"The library closely tracks upstream `llama.cpp` development, which can introduce breaking changes to the underlying C API that may propagate to the Python bindings. 
Notable past changes include the transition to the GGUF model format and changes in KV cache management functions.","severity":"breaking","affected_versions":"Frequent, especially minor version bumps"},{"fix":"When initializing `Llama` for chat, explicitly set the `chat_format` parameter (e.g., `llm = Llama(..., chat_format=\"llama-2\")`). Consult the model card or `llama-cpp-python` documentation for the correct format for your chosen model.","message":"Chat models require specific chat formats (e.g., 'llama-2', 'chatml') for proper conversational interaction. Using the wrong format can lead to garbled or unparsable responses, especially in chat completion APIs.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Initialize your model with `llm = Llama(model_path=\"...\", embedding=True)`. Then use `llm.create_embedding(...)`.","message":"To generate embeddings, you must explicitly enable them by passing `embedding=True` to the `Llama` constructor when initializing the model.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z"}