CLIP (Contrastive Language-Image Pre-training)

1.0.1 · active · verified Wed Apr 15

CLIP (Contrastive Language-Image Pre-training) is a neural network developed by OpenAI that efficiently learns visual concepts from natural language supervision. It connects computer vision and natural language understanding by training on a vast dataset of image-text pairs, enabling zero-shot classification and text-image retrieval. The PyPI package `openai-clip` (version 1.0.1) is an unofficial distribution of the original OpenAI CLIP library, which is primarily maintained and distributed via its official GitHub repository. The library does not have a formal release cadence.

Warnings

The `openai-clip` package on PyPI is not published by OpenAI; the canonical source of the library is the OpenAI GitHub repository, and there is no formal release cadence.

Install
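Per the summary above, the unofficial PyPI distribution is named `openai-clip`. A minimal install sketch (the library depends on PyTorch, which you may want to install first following the instructions on pytorch.org for your platform):

```shell
pip install openai-clip
```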

Imports

Quickstart

This quickstart loads a pre-trained CLIP model (ViT-B/32) and its associated preprocessing function. It then processes a dummy image and a list of text labels, computes the image and text features, and calculates the similarity scores to predict the most relevant text snippet for the image. It runs on CUDA if available, otherwise falls back to CPU.

import torch
import clip
from PIL import Image
import os

# Ensure 'CLIP.png' exists, or replace it with a valid image path.
# For demonstration, write a 1x1 placeholder PNG if the file is missing.
if not os.path.exists("CLIP.png"):
    import base64
    dummy_image_b64 = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII="
    with open("CLIP.png", "wb") as f:
        f.write(base64.b64decode(dummy_image_b64))

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
# Expected output for a blank image and these labels might be somewhat uniform or biased, 
# but demonstrates the process. For a real image, probabilities would be skewed.
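The scoring step in the quickstart can be sketched with plain NumPy, using random vectors as stand-ins for the real image/text embeddings (the 512-dim size matches ViT-B/32's output, and the factor of 100 approximates CLIP's learned logit scale, which is clamped at 100 during training):

```python
import numpy as np

rng = np.random.default_rng(0)
image_features = rng.normal(size=(1, 512))  # stand-in for model.encode_image output
text_features = rng.normal(size=(3, 512))   # stand-in for model.encode_text output

# L2-normalize, so the dot product below is cosine similarity
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# Scaled cosine similarities: one logit per (image, label) pair
logits = 100.0 * image_features @ text_features.T

# Softmax over the text labels, as in logits_per_image.softmax(dim=-1)
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()
print(probs)  # shape (1, 3); one probability per label, summing to 1
```

This is the same computation `model(image, text)` performs internally after encoding, which is why the quickstart's `logits_per_image` can be turned into label probabilities with a single softmax.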
