CLIP (anytorch)
CLIP (Contrastive Language-Image Pre-training) is a neural network trained on a large variety of (image, text) pairs, enabling zero-shot visual classification and multimodal embeddings. The `clip-anytorch` library packages a PyTorch port of the original OpenAI CLIP implementation for easy installation; at the time of writing it is at version 2.6.0.
Common errors
- `ModuleNotFoundError: No module named 'clip'`
  cause: The `clip-anytorch` package is not installed, or the environment it was installed into is not active.
  fix: Run `pip install clip-anytorch` in your active Python environment.
- `RuntimeError: CUDA out of memory. Tried to allocate X GiB (GPU Y; X GiB total capacity; Z GiB already allocated; W GiB free; P MiB reserved in total by PyTorch)`
  cause: The GPU does not have enough memory to load the model or process the current batch size.
  fix: Use a smaller CLIP model (e.g., 'ViT-B/32' instead of 'ViT-L/14'), reduce your batch size if processing multiple items, or move to a GPU with more VRAM.
- `ValueError: Unknown model name '...' (Available models are: 'RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64', 'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px')`
  cause: The model name passed to `clip.load()` is misspelled or is not a pre-trained model supported by the library.
  fix: Check the error message (or the official documentation) for the list of available model names and correct the spelling.
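The `Unknown model name` error can be caught before `clip.load()` by validating the requested name against the published list. The helper below is a hypothetical sketch: `resolve_model_name` is not part of the library, and the hard-coded `AVAILABLE` list is for illustration only; in practice you would pass the list returned by `clip.available_models()` (exposed by the OpenAI reference implementation).

```python
def resolve_model_name(name, available):
    """Return the canonical model name, tolerating case mismatches,
    or raise a ValueError listing the valid options."""
    if name in available:
        return name
    by_lower = {m.lower(): m for m in available}
    if name.lower() in by_lower:
        return by_lower[name.lower()]
    raise ValueError(
        f"Unknown model name {name!r}. Available models are: {sorted(available)}"
    )

# Hard-coded for illustration; use clip.available_models() in practice.
AVAILABLE = ["RN50", "RN101", "RN50x4", "RN50x16", "RN50x64",
             "ViT-B/32", "ViT-B/16", "ViT-L/14", "ViT-L/14@336px"]
print(resolve_model_name("vit-b/32", AVAILABLE))  # recovers 'ViT-B/32'
```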
Warnings
- gotcha CLIP models, especially larger variants (e.g., ViT-L/14), require significant GPU memory. Running on CPU can be very slow.
- gotcha Ensure your PyTorch installation is compatible with your CUDA drivers and GPU hardware. Mismatched versions can lead to `CUDA error` or `Device 'cuda:0' not found`.
- gotcha Pre-trained CLIP models are downloaded on the first `clip.load()` call, which requires an active internet connection and available disk space (typically 200MB - 1GB depending on the model).
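Because the first `clip.load()` call downloads a checkpoint, a quick free-space check can avoid a failed partial download. The sketch below assumes the default cache location used by the OpenAI reference implementation (`~/.cache/clip`; `clip.load()` there also accepts a `download_root` argument); adjust the path if your setup differs.

```python
import os
import shutil

def cache_has_space(cache_dir=os.path.expanduser("~/.cache/clip"),
                    needed_bytes=1_000_000_000):
    """Return True if the drive holding the model cache has at least
    `needed_bytes` free (1 GB covers the largest CLIP checkpoints)."""
    # Walk up to the nearest existing parent if the cache dir is absent.
    probe = cache_dir
    while not os.path.exists(probe):
        probe = os.path.dirname(probe) or "."
    return shutil.disk_usage(probe).free >= needed_bytes

print(cache_has_space())
```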
Install
-
pip install clip-anytorch
Imports
- clip
import clip
- load
model, preprocess = clip.load(...)
- tokenize
text = clip.tokenize(...)
Quickstart
import torch
import clip
from PIL import Image
import os

# Ensure you have a sample image, e.g., 'sample.jpg' in the current directory.
# For demonstration, create a dummy image if one is not present:
if not os.path.exists("sample.jpg"):
    from PIL import ImageDraw
    img = Image.new("RGB", (60, 30), color="red")
    d = ImageDraw.Draw(img)
    d.text((10, 10), "Hello", fill=(255, 255, 0))
    img.save("sample.jpg")
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Load the CLIP model and its preprocessing function
model, preprocess = clip.load("ViT-B/32", device=device)
# Preprocess an image
image_path = "sample.jpg"
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
# Tokenize text
captions = ["a photo of a cat", "a photo of a dog", "a red square with text"]
text = clip.tokenize(captions).to(device)
with torch.no_grad():
    # Encode image and text to get features
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Calculate similarity scores
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probabilities (for image vs text captions):")
for caption, p in zip(captions, probs[0]):
    print(f"  '{caption}': {p:.4f}")
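Under the hood, `logits_per_image` is CLIP's learned temperature (`logit_scale`) times the cosine similarity between the L2-normalised image and text embeddings, and the softmax turns those logits into the printed probabilities. A minimal pure-Python sketch of that final step, using toy 3-d vectors in place of CLIP's real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "image" and "caption" embeddings (CLIP's are much higher-dimensional).
img = [0.2, 0.9, 0.1]
caps = [[0.1, 0.8, 0.2], [0.9, 0.1, 0.0]]

logit_scale = 100.0  # CLIP's learned temperature saturates near this value
logits = [logit_scale * cosine(img, c) for c in caps]
probs = softmax(logits)
print(probs)  # first caption dominates, as its direction is closest to img
```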