CLIP (anytorch)

2.6.0 · active · verified Fri Apr 17

CLIP (Contrastive Language-Image Pre-training) is a neural network trained on a large set of (image, text) pairs, enabling zero-shot image classification and multimodal embeddings. The `clip-anytorch` library is an easy-to-use PyTorch packaging of the original OpenAI CLIP code. It is currently at version 2.6.0.

Common errors

Warnings

Install
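
This section is empty in the source; assuming the package is published on PyPI under the name used in the introduction, installation would be:

```shell
pip install clip-anytorch
```

The package installs the `clip` module imported in the quickstart below.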

Imports

Quickstart

This quickstart demonstrates how to load a pre-trained CLIP model, preprocess an image and a set of captions, and use the model to compute similarity scores (logits) between the image and each caption. It selects a device (GPU or CPU) automatically and creates a minimal dummy image so the example runs standalone.

import torch
import clip
from PIL import Image
import os  # used to check whether the sample image exists

# Ensure you have a sample image, e.g., 'sample.jpg' in the current directory
# For demonstration, let's create a dummy image if not present:
if not os.path.exists("sample.jpg"):
    from PIL import ImageDraw
    img = Image.new("RGB", (60, 30), color="red")
    d = ImageDraw.Draw(img)
    d.text((10, 10), "Hello", fill=(255, 255, 0))
    img.save("sample.jpg")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the CLIP model and its preprocessing function
model, preprocess = clip.load("ViT-B/32", device=device)

# Preprocess an image
image_path = "sample.jpg"
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

# Tokenize the candidate captions (kept in a list so we can reuse them when printing)
labels = ["a photo of a cat", "a photo of a dog", "a red square with text"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # Encode image and text to get features
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Calculate similarity scores
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities (image vs. text captions):")
for label, p in zip(labels, probs[0]):
    print(f"  '{label}': {p:.4f}")
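
To make the final softmax step concrete, here is a minimal NumPy sketch of what `model(image, text)` computes from the encoded features: L2-normalize the embeddings, take scaled cosine similarities, then apply a softmax over captions. The random vectors and the fixed scale of 100.0 are placeholders standing in for real CLIP outputs and the model's learned logit scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings standing in for encode_image / encode_text output
# (1 image x 3 captions, feature dim 512 as in ViT-B/32)
image_features = rng.standard_normal((1, 512))
text_features = rng.standard_normal((3, 512))

# L2-normalize each embedding, as CLIP does before computing logits
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# Scaled cosine similarities between the image and each caption
logits = 100.0 * image_features @ text_features.T

# Numerically stable softmax over captions gives per-caption probabilities
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)
print(probs.shape)  # → (1, 3)
```

The probabilities sum to 1 across captions, matching the `softmax(dim=-1)` call in the quickstart.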
