{"id":9587,"library":"clip-anytorch","title":"CLIP (anytorch)","description":"CLIP (Contrastive Language-Image Pre-training) is a neural network trained on a wide variety of (image, text) pairs, enabling zero-shot visual classification and multimodal embeddings. The `clip-anytorch` library packages the original OpenAI CLIP implementation as a pip-installable PyTorch port, so it can be declared as a regular dependency. It is currently at version 2.6.0.","status":"active","version":"2.6.0","language":"en","source_language":"en","source_url":"https://github.com/rom1504/CLIP","tags":["ml","vision","nlp","clip","pytorch","multimodal","deep-learning"],"install":[{"cmd":"pip install clip-anytorch","lang":"bash","label":"Install library"}],"dependencies":[{"reason":"Core deep learning framework for model execution.","package":"torch"},{"reason":"Provides dataset and model preprocessing utilities, especially for image handling.","package":"torchvision"},{"reason":"Used for fixing unicode text prior to tokenization.","package":"ftfy"},{"reason":"Advanced regular expression operations for text processing.","package":"regex"},{"reason":"Progress bar for model downloads and processing.","package":"tqdm"}],"imports":[{"symbol":"clip","correct":"import clip"},{"note":"`load` is a function directly accessible from the `clip` module, not imported as `from clip import load`.","symbol":"load","correct":"model, preprocess = clip.load(...)"},{"note":"`tokenize` is a function directly accessible from the `clip` module, not imported as `from clip import tokenize`.","symbol":"tokenize","correct":"text = clip.tokenize(...)"}],"quickstart":{"code":"import torch\nimport clip\nfrom PIL import Image\nimport os  # used to check for a sample image on disk\n\n# Ensure you have a sample image, e.g., 'sample.jpg' in the current directory.\n# For demonstration, create a dummy image if one is not present:\nif not os.path.exists(\"sample.jpg\"):\n    from PIL import ImageDraw\n    img = 
Image.new('RGB', (60, 30), color='red')\n    d = ImageDraw.Draw(img)\n    d.text((10, 10), \"Hello\", fill=(255, 255, 0))\n    img.save(\"sample.jpg\")\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nprint(f\"Using device: {device}\")\n\n# Load the CLIP model and its preprocessing function\nmodel, preprocess = clip.load(\"ViT-B/32\", device=device)\n\n# Preprocess an image\nimage_path = \"sample.jpg\"\nimage = preprocess(Image.open(image_path)).unsqueeze(0).to(device)\n\n# Tokenize the candidate captions\nlabels = [\"a photo of a cat\", \"a photo of a dog\", \"a red square with text\"]\ntext = clip.tokenize(labels).to(device)\n\nwith torch.no_grad():\n    # Encode image and text into feature embeddings\n    image_features = model.encode_image(image)\n    text_features = model.encode_text(text)\n\n    # Calculate similarity scores\n    logits_per_image, logits_per_text = model(image, text)\n    probs = logits_per_image.softmax(dim=-1).cpu().numpy()\n\nprint(\"Label probabilities (image vs. text captions):\")\nfor label, p in zip(labels, probs[0]):\n    print(f\"  '{label}': {p:.4f}\")\n","lang":"python","description":"This quickstart loads a pre-trained CLIP model, preprocesses an image and a set of candidate captions, and uses the model to compute similarity scores (logits) between them. It handles device selection (GPU/CPU) and creates a minimal dummy image so the script runs standalone."},"warnings":[{"fix":"Use a GPU-enabled environment. For memory errors, try smaller models (e.g., ViT-B/32) or reduce batch sizes if possible.","message":"CLIP models, especially larger variants (e.g., ViT-L/14), require significant GPU memory. Running on CPU can be very slow.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Refer to the PyTorch installation guide (pytorch.org) to install the correct PyTorch+CUDA version for your system. 
Upgrade GPU drivers if necessary.","message":"Ensure your PyTorch installation is compatible with your CUDA drivers and GPU hardware. Mismatched versions can lead to `CUDA error` or `Device 'cuda:0' not found`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure stable internet access. If behind a proxy, configure `HTTPS_PROXY` environment variables. If disk space is an issue, consider alternative model storage locations if the library supports it (not directly via `clip.load`).","message":"Pre-trained CLIP models are downloaded on the first `clip.load()` call, which requires an active internet connection and available disk space (typically 200MB - 1GB depending on the model).","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Run `pip install clip-anytorch` in your active Python environment.","cause":"The `clip-anytorch` package was not installed or the environment where it was installed is not active.","error":"ModuleNotFoundError: No module named 'clip'"},{"fix":"Use a smaller CLIP model (e.g., 'ViT-B/32' instead of 'ViT-L/14'), reduce your batch size if processing multiple items, or acquire a GPU with more VRAM.","cause":"The GPU does not have enough memory to load the model or process the current batch size.","error":"RuntimeError: CUDA out of memory. Tried to allocate X GiB (GPU Y; X GiB total capacity; Z GiB already allocated; W GiB free; P MiB reserved in total by PyTorch)"},{"fix":"Check the official documentation or the error message itself for the list of available model names and correct the spelling.","cause":"The model name passed to `clip.load()` is misspelled or not a valid pre-trained model supported by the library.","error":"ValueError: Unknown model name '...' (Available models are: 'RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64', 'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px')"}]}