{"id":4614,"library":"llama-index-multi-modal-llms-openai","title":"LlamaIndex OpenAI Multi-Modal LLMs Integration","description":"This library provides an integration for LlamaIndex to use OpenAI's multi-modal Large Language Models (LLMs), such as GPT-4V and GPT-4o, for tasks involving both text and image inputs. It allows users to leverage OpenAI's capabilities for image understanding, reasoning, and multi-modal Retrieval Augmented Generation (RAG) applications within the LlamaIndex framework. The current version is 0.6.2 and it is part of the broader LlamaIndex ecosystem for building LLM applications.","status":"active","version":"0.6.2","language":"en","source_language":"en","source_url":"https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/multi_modal_llms/llama-index-multi-modal-llms-openai","tags":["llama-index","multi-modal","llm","openai","gpt-4v","gpt-4o","rag","vision","image"],"install":[{"cmd":"pip install llama-index-multi-modal-llms-openai","lang":"bash","label":"Install the OpenAI Multi-Modal LLM integration"}],"dependencies":[{"reason":"Core LlamaIndex framework components are required for integration.","package":"llama-index-core","optional":false},{"reason":"Provides the underlying OpenAI API client for model interaction.","package":"openai","optional":false}],"imports":[{"note":"While 'OpenAIMultiModal' is the current class in this package, LlamaIndex is consolidating multi-modal support directly into the 'OpenAI' LLM class within 'llama_index.llms.openai' using ChatMessage content blocks for newer versions of `llama-index-llms-openai`.","wrong":"from llama_index.llms.openai import OpenAI","symbol":"OpenAIMultiModal","correct":"from llama_index.multi_modal_llms.openai import OpenAIMultiModal"},{"note":"Utility for loading images from URLs into ImageDocument objects.","symbol":"load_image_urls","correct":"from llama_index.core.multi_modal_llms.generic_utils import load_image_urls"},{"note":"Used for loading documents, 
including images from local directories.","symbol":"SimpleDirectoryReader","correct":"from llama_index.core import SimpleDirectoryReader"}],"quickstart":{"code":"import os\n\nfrom llama_index.core.multi_modal_llms.generic_utils import load_image_urls\nfrom llama_index.multi_modal_llms.openai import OpenAIMultiModal\n\n# Requires the OPENAI_API_KEY environment variable to be set\nif not os.environ.get(\"OPENAI_API_KEY\"):\n    raise RuntimeError(\"Set OPENAI_API_KEY before running this example\")\n\n# Example image URL; replace with your own image URL, or load local\n# image files with SimpleDirectoryReader instead\nimage_urls = [\n    \"https://docs.llamaindex.ai/en/stable/_static/assets/img/llama-index-logo.png\"\n]\n\n# Load the images into ImageDocument objects\nimage_documents = load_image_urls(image_urls)\n\n# Initialize the multi-modal LLM with a vision-capable model\nopenai_mm_llm = OpenAIMultiModal(\n    model=\"gpt-4o\",\n    api_key=os.environ[\"OPENAI_API_KEY\"],\n    max_new_tokens=300,\n)\n\n# Complete a text prompt grounded in the image documents\nresponse = openai_mm_llm.complete(\n    prompt=\"What is in the image? Describe it.\",\n    image_documents=image_documents,\n)\n\nprint(response.text)","lang":"python","description":"This quickstart shows how to initialize the `OpenAIMultiModal` class with a vision-capable OpenAI model (e.g., `gpt-4o`), load image documents from URLs, and query the LLM with both a text prompt and the provided images. Ensure `OPENAI_API_KEY` is set in your environment."},"warnings":[{"fix":"For new implementations, consider using `from llama_index.llms.openai import OpenAI` and construct multimodal prompts using `ChatMessage` with `ImageBlock` and `TextBlock` content instead.","message":"The `OpenAIMultiModal` class and its specific multi-modal LLM abstraction are being phased out in LlamaIndex. 
Future development is consolidating multi-modal support directly into the unified `OpenAI` LLM class (from `llama_index.llms.openai`) by using `ChatMessage` objects with `ImageBlock` and `TextBlock` content.","severity":"deprecated","affected_versions":">=0.6.2 (future versions of llama-index-llms-openai will replace this functionality)"},{"fix":"Set `export OPENAI_API_KEY='your-key'` in your terminal or `os.environ[\"OPENAI_API_KEY\"] = \"your-key\"` in your code before initialization.","message":"An OpenAI API key is required and must be set as an environment variable (`OPENAI_API_KEY`) or passed explicitly to the `OpenAIMultiModal` constructor for authentication with OpenAI services.","severity":"gotcha","affected_versions":"All"},{"fix":"Specify a vision-capable model such as `model=\"gpt-4o\"` when initializing `OpenAIMultiModal`.","message":"Only vision-capable OpenAI models (e.g., `gpt-4o`; the older `gpt-4-vision-preview` has been deprecated by OpenAI) support multi-modal (text + image) inputs. Using a text-only model with `OpenAIMultiModal` or multimodal prompts will result in an error or unexpected behavior.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}