{"id":6748,"library":"openai-clip","title":"CLIP (Contrastive Language-Image Pre-training)","description":"CLIP (Contrastive Language-Image Pre-training) is a neural network developed by OpenAI that efficiently learns visual concepts from natural language supervision. It connects computer vision and natural language understanding by training on a vast dataset of image-text pairs, enabling zero-shot classification and text-image retrieval. The PyPI package `openai-clip` (version 1.0.1) is an unofficial distribution of the original OpenAI CLIP library, which is primarily maintained and distributed via its official GitHub repository. The library does not have a formal release cadence.","status":"active","version":"1.0.1","language":"en","source_language":"en","source_url":"https://github.com/openai/CLIP","tags":["computer-vision","natural-language-processing","multimodal","zero-shot-learning","image-retrieval","pytorch","openai"],"install":[{"cmd":"pip install openai-clip","lang":"bash","label":"Unofficial PyPI package"},{"cmd":"pip install git+https://github.com/openai/CLIP.git","lang":"bash","label":"Official (GitHub source) - Recommended"},{"cmd":"conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0 && pip install ftfy regex tqdm && pip install git+https://github.com/openai/CLIP.git","lang":"bash","label":"Official (GitHub source) with PyTorch and dependencies (CUDA 11.0)"}],"dependencies":[{"reason":"Core deep learning framework for model execution.","package":"torch"},{"reason":"Provides image datasets, models, and transformations for PyTorch.","package":"torchvision"},{"reason":"Handles text encoding issues in tokenizer.","package":"ftfy"},{"reason":"Used for tokenizer operations.","package":"regex"},{"reason":"For progress bars during model loading/processing.","package":"tqdm"},{"reason":"For image manipulation (PIL.Image).","package":"Pillow"}],"imports":[{"symbol":"clip","correct":"import clip"},{"note":"For using CLIP via Hugging Face 
Transformers, which is an alternative and often recommended approach for ease of use.","symbol":"CLIPModel, CLIPProcessor","correct":"from transformers import CLIPModel, CLIPProcessor"}],"quickstart":{"code":"import base64\nimport os\n\nimport torch\nimport clip\nfrom PIL import Image\n\n# Use an existing 'CLIP.png' or replace it with a valid image path.\n# For demonstration, write a 1x1 placeholder PNG if the file doesn't exist.\nif not os.path.exists('CLIP.png'):\n    dummy_image_b64 = \"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII=\"\n    with open('CLIP.png', 'wb') as f:\n        f.write(base64.b64decode(dummy_image_b64))\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nmodel, preprocess = clip.load(\"ViT-B/32\", device=device)\n\nimage = preprocess(Image.open(\"CLIP.png\")).unsqueeze(0).to(device)\ntext = clip.tokenize([\"a diagram\", \"a dog\", \"a cat\"]).to(device)\n\nwith torch.no_grad():\n    # Embeddings, usable on their own for retrieval or similarity search\n    image_features = model.encode_image(image)\n    text_features = model.encode_text(text)\n\n    # Similarity logits between the image and each text prompt\n    logits_per_image, logits_per_text = model(image, text)\n    probs = logits_per_image.softmax(dim=-1).cpu().numpy()\n\nprint(\"Label probs:\", probs)\n# The 1x1 placeholder image yields no meaningful ranking; it only demonstrates the process.\n# With a real image, the probabilities should favor the best-matching label.\n","lang":"python","description":"This quickstart loads a pre-trained CLIP model (ViT-B/32) and its associated preprocessing function. It then processes a placeholder image and a list of text labels, computes the image and text features, and calculates similarity scores to rank the text labels against the image. 
It runs on CUDA if available and otherwise falls back to CPU."},"warnings":[{"fix":"Use `pip install git+https://github.com/openai/CLIP.git` instead of `pip install openai-clip`.","message":"The `openai-clip` package on PyPI is an unofficial redistribution of OpenAI's official CLIP GitHub repository. For the most up-to-date and officially supported version, install directly from the OpenAI CLIP GitHub repository.","severity":"gotcha","affected_versions":"All versions of `openai-clip` on PyPI (1.0.1+)"},{"fix":"If you encounter build errors during installation, pin `setuptools` below 81 (`pip install 'setuptools<81'`). If pip builds the package in an isolated environment, also pass `--no-build-isolation` so that the pinned version is the one used for the build.","message":"Newer versions of `setuptools` (81+) can cause build failures due to the removal of `pkg_resources`, which `clip` (from OpenAI's GitHub) may still depend on.","severity":"breaking","affected_versions":"Potentially `openai-clip` 1.0.1 when built or run with `setuptools>=81`, or in environments without `setuptools` installed (e.g., fresh Python 3.12+ virtual environments)."},{"fix":"Experiment with different prompt templates (e.g., 'a photo of {class}', 'this is a {class}') and systematically test prompt variations for your specific use case. Consider ensembling predictions from multiple prompts.","message":"CLIP's performance is sensitive to the phrasing of text prompts ('prompt engineering'). Slight variations in wording can significantly change classification accuracy.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Be aware of these limitations. For highly specific or fine-grained tasks, fine-tuning CLIP on domain-specific data or using task-specific models might be necessary. Augmenting prompts or using retrieval-augmented approaches can sometimes help.","message":"CLIP may struggle with tasks requiring precise spatial reasoning, counting, or very fine-grained classification (e.g., distinguishing between similar car models or flower species). 
It also generalizes poorly to images that are not well represented in its pre-training data.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Exercise caution when deploying CLIP in sensitive applications. Conduct thorough bias evaluations for your specific use cases and consider ethical implications. Techniques like debiasing embeddings or careful prompt engineering can mitigate some issues.","message":"Models like CLIP that are trained on internet-scale data can inherit and exhibit social biases present in the training datasets.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}