Phi-4 Multimodal Instruct

microsoft multimodal
text, image, audio

A multimodal, instruction-tuned model from Microsoft's Phi-4 family that can process text, images, and audio.

context window 131K tokens
max output 4K tokens
input price $0.08 / 1M tokens
output price $0.32 / 1M tokens
vision, streaming, reasoning, code-generation, function-calling
released Feb 2025
knowledge cutoff Jun 2024
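
The per-token prices and limits listed above can be turned into a rough per-request cost estimate. The following is a minimal sketch, not an official billing formula: the function name `estimate_cost` is hypothetical, the limits are taken as the rounded figures shown (131K and 4K, so the exact token counts are an assumption), and it assumes billing is linear in token count with no minimums or surcharges.

```python
# Prices as listed above, in USD per 1M tokens.
INPUT_PRICE_PER_M = 0.08
OUTPUT_PRICE_PER_M = 0.32

# Limits as listed above. The listing rounds these ("131K", "4K"),
# so the exact values here are an assumption.
CONTEXT_WINDOW_TOKENS = 131_000
MAX_OUTPUT_TOKENS = 4_000


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return an estimated request cost in USD (hypothetical helper).

    Assumes cost scales linearly with token counts.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > CONTEXT_WINDOW_TOKENS:
        raise ValueError("input exceeds the listed context window")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("output exceeds the listed max output")
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


if __name__ == "__main__":
    # 10,000 input tokens and 1,000 output tokens.
    print(f"${estimate_cost(10_000, 1_000):.5f}")
```

At these rates, output tokens cost four times as much as input tokens, so long prompts with short answers are considerably cheaper than the reverse.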