OpenCLIP

3.3.0 · active · verified Fri Apr 10

OpenCLIP is an open-source implementation of OpenAI's Contrastive Language-Image Pre-training (CLIP) and related models. It supports training CLIP models at scale, loading state-of-the-art pretrained weights, and performing zero-shot image classification and retrieval. Development is active, with regular releases.

Warnings

Install
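OpenCLIP is published on PyPI under the package name open_clip_torch. Assuming a working Python environment with PyTorch available (or letting pip pull it in as a dependency), a typical install is:

```shell
# Core library (models, pretrained weights, tokenizers)
pip install open_clip_torch

# Optional: extra dependencies for training (assumes the package's "training" extra)
pip install 'open_clip_torch[training]'
```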

Imports

Quickstart

This quickstart loads a pre-trained OpenCLIP model, preprocesses a dummy image and a set of text labels, and computes zero-shot similarity probabilities between them. It covers loading the model and tokenizer, normalizing the encoded features, and running inference with gradients disabled.

import torch
from PIL import Image
import open_clip
import io
import base64

# Create a dummy image (in a real scenario, load from file or URL)
dummy_image_data = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII="
image = Image.open(io.BytesIO(base64.b64decode(dummy_image_data))).convert('RGB')

# 1. Load model and preprocessing transforms
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'
)
model.eval() # Set model to evaluation mode

# 2. Get tokenizer
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# 3. Prepare inputs
image_input = preprocess(image).unsqueeze(0) # Add batch dimension
text_input = tokenizer(["a diagram", "a dog", "a cat"])

# 4. Run inference
with torch.no_grad(): # Disable gradient computation for inference
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_input)

    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Compute similarity scores
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", text_probs)

# Optional: Interpret results
labels = ["a diagram", "a dog", "a cat"]
top_prob, top_idx = text_probs[0].max(dim=0)
print(f"Predicted: {labels[top_idx]} ({top_prob.item():.1%} confidence)")
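The 100.0 multiplier in the quickstart plays the role of CLIP's logit scale: cosine similarities of normalized features lie in [-1, 1], so without scaling the softmax would be nearly uniform. (In the actual model this is a learned parameter, `model.logit_scale`, whose exponential is typically close to 100 after training; the quickstart hardcodes that value.) A minimal pure-Python sketch with hypothetical similarity values, not real model output:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image and three text labels
sims = [0.21, 0.28, 0.25]

# Without scaling, differences of a few hundredths barely separate the labels
unscaled = softmax(sims)

# Multiplying by the logit scale sharpens the distribution around the best label
scaled = softmax([100.0 * s for s in sims])

print("unscaled:", unscaled)
print("scaled:  ", scaled)
```

The same sharpening is what turns the quickstart's raw similarities into confident label probabilities.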
