OpenCLIP
OpenCLIP is an open-source implementation of OpenAI's Contrastive Language-Image Pre-training (CLIP) and related models. It supports training CLIP models at scale, loading state-of-the-art pretrained weights, and performing zero-shot image classification and retrieval. The project is actively developed, with regular releases.
Warnings
- gotcha When using `timm`-based image encoders (e.g., ConvNeXt, SigLIP, EVA), ensure you have the latest `timm` library installed. Older versions may result in 'Unknown model' errors.
- breaking The default activation function in newer OpenCLIP model configs is `torch.nn.GELU`, while OpenAI's original models used `QuickGELU`. When loading OpenAI pretrained weights, use a model definition with the `-quickgelu` postfix (e.g., 'ViT-B-32-quickgelu') to match the original training; otherwise you get an accuracy drop, especially noticeable during fine-tuning.
- gotcha Mismatch between installed `torch` and `open-clip-torch` versions can lead to `ModuleNotFoundError` or other runtime issues. Ensure compatible versions are installed, often by following PyTorch's installation instructions for your CUDA version before installing OpenCLIP.
- gotcha OpenAI's original CLIP models were trained and evaluated in mixed precision, so for best parity run OpenCLIP inference inside a mixed-precision context (e.g., `torch.autocast('cuda')`). Without it, embeddings can differ slightly from the reference values and GPU throughput may drop.
- gotcha If you are using models that rely on transformer tokenizers (e.g., certain text encoders), the `transformers` library must be installed separately, as it is an optional dependency for `open-clip-torch`.
Install
- pip install open_clip_torch
- pip install 'open_clip_torch[training]' (training dependencies)
- pip install -U timm (for timm-based image encoders)
- pip install transformers (optional, for Hugging Face text encoders)
Imports
- open_clip
import open_clip
- create_model_and_transforms
model, _, preprocess = open_clip.create_model_and_transforms(...)
- get_tokenizer
tokenizer = open_clip.get_tokenizer(...)
Quickstart
import torch
from PIL import Image
import open_clip
import io
import base64
# Create a dummy image (in a real scenario, load from file or URL)
dummy_image_data = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII="
image = Image.open(io.BytesIO(base64.b64decode(dummy_image_data))).convert('RGB')
# 1. Load model and preprocessing transforms
model, _, preprocess = open_clip.create_model_and_transforms(
'ViT-B-32',
pretrained='laion2b_s34b_b79k'
)
model.eval() # Set model to evaluation mode
# 2. Get tokenizer
tokenizer = open_clip.get_tokenizer('ViT-B-32')
# 3. Prepare inputs
image_input = preprocess(image).unsqueeze(0) # Add batch dimension
text_input = tokenizer(["a diagram", "a dog", "a cat"])
# 4. Run inference
with torch.no_grad():  # Disable gradient computation for inference
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_input)
    # Normalize features to unit length
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Compute similarity scores
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probabilities:", text_probs)
# Optional: Interpret results
labels = ["a diagram", "a dog", "a cat"]
top_prob, top_idx = text_probs[0].max(dim=0)
print(f"Predicted: {labels[top_idx]} ({top_prob.item():.1%} confidence)")