LlamaIndex Ollama LLM Integration
The `llama-index-llms-ollama` library integrates LlamaIndex with Large Language Models (LLMs) served locally by Ollama. It lets applications use open-source models (such as Llama, Mistral, Gemma, and Phi-3) for completions and chat within a LlamaIndex application, without relying on cloud-hosted LLM services. The current version is 0.10.1, released on March 20, 2026, in keeping with LlamaIndex's rapid release cadence for its integration packages.
Warnings
- gotcha Ollama Server Prerequisite: The Ollama application must be installed and actively running on your local machine, and the desired LLM model (e.g., `llama3.1`) must be pulled using `ollama pull <model_name>` before this integration can connect to it.
- gotcha Default Timeout: The default request timeout (30 seconds) may be too short for larger local LLMs or slower machines, leading to `Timeout` errors; pass a larger `request_timeout` when constructing the LLM.
- gotcha High Memory Usage: Running large local LLMs (e.g., Llama 3.1 8B) through Ollama can be memory-intensive, often requiring 32GB of RAM or more, especially when combined with embedding models.
- gotcha ModuleNotFoundError: Users frequently encounter `ModuleNotFoundError` if `llama-index-llms-ollama` is not installed in the currently active Python environment, or if their IDE (e.g., VS Code's Pylance) is configured to use a different interpreter.
- gotcha Conflicting Ollama Client Versions: There have been reports of conflicts when installing both `llama-index-multi-modal-llms-ollama` and `llama-index-llms-ollama`, because each may pin a different version of the underlying `ollama` Python client.
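Several of the gotchas above boil down to the Ollama server not being reachable. A quick pre-flight check using only the standard library can fail fast with a clear answer; this sketch assumes Ollama's default listen address of `http://localhost:11434`:

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url, False otherwise."""
    try:
        # Ollama's root endpoint responds with a plain "Ollama is running" page.
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat all as "not running".
        return False
```

Calling this before constructing the LLM lets you raise a friendly "start `ollama serve` first" error instead of a raw timeout deep inside a request.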
Install
pip install llama-index-llms-ollama
Imports
- Ollama
from llama_index.llms.ollama import Ollama
Quickstart
# First, ensure Ollama is installed and running, and pull a model:
# On your terminal:
# curl -fsSL https://ollama.com/install.sh | sh
# ollama serve
# ollama pull llama3.1
from llama_index.llms.ollama import Ollama
from llama_index.core.llms import ChatMessage

# Initialize the Ollama LLM. Adjust model and timeout as needed.
# Ensure the model 'llama3.1' has been pulled via 'ollama pull llama3.1'.
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,  # raise from the 30s default if the model is slow
    # context_window=8000,  # optionally cap the context window to limit memory usage
)

# Generate a completion
response_completion = llm.complete("Tell me a short story about a brave knight.")
print("\n--- Completion Response ---")
print(response_completion)

# Send a chat message
messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is the capital of France?"),
]
response_chat = llm.chat(messages)
print("\n--- Chat Response ---")
print(response_chat.message.content)