LlamaIndex OpenAI Multi-Modal LLMs Integration
This library integrates OpenAI's multi-modal Large Language Models (LLMs), such as GPT-4V and GPT-4o, into LlamaIndex for tasks that combine text and image inputs. It lets users leverage OpenAI's image understanding, reasoning, and multi-modal Retrieval-Augmented Generation (RAG) capabilities within the LlamaIndex framework. The current version is 0.6.2, and it is part of the broader LlamaIndex ecosystem for building LLM applications.
Warnings
- deprecated The `OpenAIMultiModal` class and its specific multi-modal LLM abstraction are being phased out in LlamaIndex. Future development is consolidating multi-modal support directly into the unified `OpenAI` LLM class (from `llama_index.llms.openai`) by using `ChatMessage` objects with `ImageBlock` and `TextBlock` content.
- gotcha An OpenAI API key is required and must be set as an environment variable (`OPENAI_API_KEY`) or passed explicitly to the `OpenAIMultiModal` constructor for authentication with OpenAI services.
- gotcha Only specific OpenAI models (e.g., `gpt-4-vision-preview`, `gpt-4o`) support multi-modal (text + image) inputs. Using a text-only model with `OpenAIMultiModal`, or sending it multi-modal prompts, will result in an error or unexpected behavior.
Install
pip install llama-index-multi-modal-llms-openai
Imports
- OpenAIMultiModal
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
- load_image_urls
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
- SimpleDirectoryReader
from llama_index.core import SimpleDirectoryReader
Quickstart
import os

from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Requires OPENAI_API_KEY to be set in the environment (or passed via `api_key`)

# Example image URL; replace with any publicly accessible image
image_urls = [
    "https://docs.llamaindex.ai/en/stable/_static/assets/img/llama-index-logo.png"
]

# Load the URLs into image documents
image_documents = load_image_urls(image_urls)

# Initialize the OpenAI multi-modal LLM
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4o",  # "gpt-4-vision-preview" also works but has been deprecated by OpenAI
    api_key=os.environ["OPENAI_API_KEY"],
    max_new_tokens=300,
)

# Complete a prompt against the image documents
response = openai_mm_llm.complete(
    prompt="What is in the image? Describe it.",
    image_documents=image_documents,
)
print(response.text)