CLIP (Contrastive Language-Image Pre-training)
CLIP (Contrastive Language-Image Pre-training) is a neural network developed by OpenAI that efficiently learns visual concepts from natural language supervision. It connects computer vision and natural language understanding by training on a vast dataset of image-text pairs, enabling zero-shot classification and text-image retrieval. The PyPI package `openai-clip` (version 1.0.1) is an unofficial distribution of the original OpenAI CLIP library, which is primarily maintained and distributed via its official GitHub repository. The library does not have a formal release cadence.
Warnings
- gotcha The `openai-clip` package on PyPI is an unofficial wrapper around OpenAI's official CLIP GitHub repository. For the most up-to-date and officially supported version, it is recommended to install directly from the OpenAI CLIP GitHub repository.
- breaking Newer versions of `setuptools` (81+) cause build failures due to the removal of `pkg_resources`, which `clip` (from OpenAI's GitHub) might still implicitly use.
- gotcha CLIP's performance can be sensitive to the phrasing of text prompts ('prompt engineering'). Slight variations in wording can significantly impact classification accuracy.
- gotcha CLIP may struggle with tasks requiring precise spatial reasoning, counting, or very fine-grained classification (e.g., distinguishing between similar car models or flower species). It also exhibits poor generalization to images not well-represented in its pre-training data.
- gotcha Models trained on internet-scale data like CLIP can inherit and exhibit social biases present in the training datasets.
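Because accuracy is sensitive to prompt wording, a common mitigation is to wrap each class label in several templates and average the resulting text embeddings ("prompt ensembling"). A minimal sketch of the template-expansion step; the `build_prompts` helper and the template strings here are illustrative, not part of the CLIP API:

```python
def build_prompts(labels, templates):
    """Expand each class label into one prompt string per template."""
    return {label: [t.format(label) for t in templates] for label in labels}

templates = ["a photo of a {}.", "a blurry photo of a {}.", "an illustration of a {}."]
prompts = build_prompts(["dog", "cat"], templates)
print(prompts["dog"])  # three differently phrased prompts for the "dog" class
```

In practice, each per-label prompt list is passed through `clip.tokenize` and `model.encode_text`, and the L2-normalized embeddings are averaged into a single class vector before scoring.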
Install
- pip install openai-clip
- pip install git+https://github.com/openai/CLIP.git
- conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0 && pip install ftfy regex tqdm && pip install git+https://github.com/openai/CLIP.git
Imports
- clip
import clip
- CLIPModel, CLIPProcessor
from transformers import CLIPModel, CLIPProcessor
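The Hugging Face `transformers` port exposes the same checkpoints through a processor/model pair. A sketch, assuming the `openai/clip-vit-base-patch32` checkpoint (downloaded on first run) and a placeholder image path:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("CLIP.png")  # replace with a real image path
# The processor handles both tokenization and image preprocessing.
inputs = processor(text=["a diagram", "a dog", "a cat"],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # shape (1, 3)
print(probs)
```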
Quickstart
import torch
import clip
from PIL import Image
import os
# Ensure 'CLIP.png' exists, or replace it with a valid image path.
# For demonstration, create a 1x1 placeholder PNG if the file is missing.
if not os.path.exists('CLIP.png'):
    import base64
    dummy_image_b64 = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII="
    with open('CLIP.png', 'wb') as f:
        f.write(base64.b64decode(dummy_image_b64))
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs)
# For the 1-pixel placeholder image the probabilities carry little meaning;
# with a real image, the distribution would be clearly skewed toward the matching label.