LAION CLAP
LAION CLAP (Contrastive Language-Audio Pretraining) is a Python library providing pre-trained multimodal models that embed both text and audio into a shared latent space. This enables tasks such as audio-text retrieval, zero-shot audio classification, and text-to-audio search. The current version is 1.1.7; releases typically add new model weights, bug fixes, or minor feature enhancements.
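As a sketch of how zero-shot audio classification works on top of the shared latent space: embed each candidate label as text, embed the audio clip, and pick the label whose embedding is most similar. The helper below operates on plain NumPy arrays so it runs as-is; the random vectors are stand-ins for real CLAP embeddings, and `zero_shot_classify` is an illustrative helper, not part of the library.

```python
import numpy as np

def zero_shot_classify(audio_emb, label_embs, labels):
    """Rank candidate labels by cosine similarity to an audio embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    L = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = L @ a                      # cosine similarity per label
    e = np.exp(sims - sims.max())     # softmax over labels for readable scores
    probs = e / e.sum()
    order = np.argsort(-sims)
    return [(labels[i], float(probs[i])) for i in order]

# Dummy embeddings stand in for CLAP text/audio outputs.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=512)
label_embs = rng.normal(size=(3, 512))
ranking = zero_shot_classify(audio_emb, label_embs, ["dog bark", "rain", "siren"])
print(ranking[0])  # best-matching label and its softmax score
```

With real CLAP embeddings, the top-ranked label is the zero-shot prediction for the clip.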
Common errors
- ModuleNotFoundError: No module named 'clap_module'
  cause: Importing `CLAP` from `clap_module.model` after installing `laion-clap` from PyPI; `clap_module` is an internal layout, not the public API.
  fix: Use the public import path: `import laion_clap` and instantiate the model via `laion_clap.CLAP_Module`.
- RuntimeError: CUDA out of memory. Tried to allocate XXX MiB (GPU XXX; XXX MiB total capacity; XXX MiB already allocated; XXX MiB free; XXX MiB reserved in total by PyTorch)
  cause: The audio/text batch size or the model itself exceeds available GPU memory.
  fix: Reduce the batch size for text or audio inputs. If a large CLAP variant is loaded, switch to a smaller checkpoint or a GPU with more memory.
- soundfile.LibsndfileError: Error opening 'path/to/audio.wav': File not found.
  cause: The audio file path is wrong or the file does not exist; `soundfile` raises this when asked to open a missing file.
  fix: Double-check the path, confirm the file exists and is readable, and verify that relative paths resolve against the current working directory.
- KeyError: 'CLAP_512' (or a similar model/checkpoint name)
  cause: Requesting a CLAP model or checkpoint that does not exist or whose name is misspelled.
  fix: Verify the exact checkpoint names in the `laion-clap` documentation or GitHub repository, and make sure your installed library version supports the requested checkpoint.
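The CUDA-OOM mitigation above, reducing batch size, can be sketched as a generic chunking helper. `embed_fn` is a hypothetical stand-in for any embedding call (e.g. a CLAP text- or audio-embedding method); the helper itself is plain Python.

```python
def embed_in_batches(items, embed_fn, batch_size=8):
    """Embed a long list in small chunks to bound peak (GPU) memory."""
    embeddings = []
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        embeddings.extend(embed_fn(chunk))  # one small forward pass per chunk
    return embeddings

# Usage with a toy embed_fn that "embeds" strings as their lengths:
result = embed_in_batches(
    [f"caption {i}" for i in range(20)],
    lambda xs: [len(x) for x in xs],
    batch_size=6,
)
print(len(result))  # 20
```

Shrink `batch_size` until the per-chunk forward pass fits in GPU memory.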
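For the `LibsndfileError` above, a cheap guard is to validate the path before handing it to `soundfile`, so the failure message names the resolved path and working directory. `load_audio_checked` is an illustrative helper; the `soundfile` import is deferred so the path check works even where the library is absent.

```python
from pathlib import Path

def load_audio_checked(path):
    """Fail fast with a clear message if the audio file is missing."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"Audio file not found: {p.resolve()} (cwd: {Path.cwd()})"
        )
    import soundfile as sf  # deferred import; only needed for real files
    return sf.read(str(p))  # -> (data, samplerate)
```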
Warnings
- gotcha The CLAP model downloads its pre-trained weights (approx. 600MB-1.5GB depending on the version) to a cache directory on the first initialization. This can be slow and requires an active internet connection. Ensure sufficient disk space and network connectivity.
- gotcha When processing audio, CLAP typically expects a specific sampling rate (e.g., 48000 Hz) and mono channel. Providing audio with different characteristics without resampling can lead to suboptimal embeddings or errors.
- gotcha Using the CLAP model on CPU can be significantly slower than using a GPU (CUDA). For larger batches or real-time applications, a CUDA-enabled GPU is highly recommended.
- deprecated Older examples or internal code may refer to `clap_module.model.CLAP`. When installing via `pip install laion-clap`, that import path is incorrect; import the package as `import laion_clap` and use `laion_clap.CLAP_Module` instead.
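The 48 kHz/mono gotcha above can be handled with a small preprocessing step. The sketch below uses naive linear-interpolation resampling to stay dependency-light; for production quality, prefer a proper resampler such as `librosa.resample` or `torchaudio.transforms.Resample`. `to_clap_audio` is an illustrative helper, not a library function.

```python
import numpy as np

CLAP_SR = 48000  # sampling rate CLAP expects

def to_clap_audio(audio, sr):
    """Downmix to mono and resample to 48 kHz (naive linear interpolation)."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:                      # (channels, samples) -> mono
        audio = audio.mean(axis=0)
    if sr != CLAP_SR:
        n_out = int(round(len(audio) * CLAP_SR / sr))
        t_in = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(t_out, t_in, audio).astype(np.float32)
    return audio

# One second of stereo 16 kHz audio becomes 48000 mono samples:
stereo = np.zeros((2, 16000), dtype=np.float32)
mono48k = to_clap_audio(stereo, sr=16000)
print(mono48k.shape)  # (48000,)
```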
Install
- pip install laion-clap
- pip install laion-clap[full]
Imports
- CLAP_Module
import laion_clap
from laion_clap import CLAP_Module
Quickstart
import torch
import laion_clap
# Determine device (CUDA if available, otherwise CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Initialize the CLAP model and load the default pretrained checkpoint.
# load_ckpt() downloads the weights on first run (hundreds of MB).
model = laion_clap.CLAP_Module(enable_fusion=False, device=device)
model.load_ckpt()
# --- Text Embedding Example ---
text_data = [
    "A clear audio recording of a dog barking.",
    "The sound of waves crashing on the shore."
]
text_embeddings = model.get_text_embedding(text_data, use_tensor=True)
print(f"Text embeddings shape: {text_embeddings.shape}")
# --- Audio Embedding Example ---
# For a runnable quickstart without actual audio files, we generate
# dummy audio data; in a real scenario you'd load files instead.
# CLAP expects mono audio at a 48 kHz sampling rate, shaped (N, T).
sample_rate = 48000
duration_seconds = 5
# A batch of 2 mono clips (2 x 5 seconds at 48 kHz)
dummy_audio = torch.randn(2, sample_rate * duration_seconds)
audio_embeddings = model.get_audio_embedding_from_data(x=dummy_audio, use_tensor=True)
print(f"Audio embeddings shape: {audio_embeddings.shape}")
# --- Similarity Calculation ---
# Normalize embeddings so the dot product is cosine similarity.
text_norm = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
audio_norm = audio_embeddings / audio_embeddings.norm(dim=-1, keepdim=True)
similarity = text_norm @ audio_norm.T
print(f"\nSimilarity scores (text x audio):\n{similarity.detach().cpu().numpy()}")
# With meaningful audio, each caption should score highest against its own clip.
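Building on the similarity computation in the quickstart, text-to-audio search is just a ranking over one row of the text-audio similarity matrix. The snippet below works on plain NumPy arrays (random vectors standing in for CLAP embeddings) so it runs without the model; `search_audio` is an illustrative helper.

```python
import numpy as np

def search_audio(query_emb, audio_embs, top_k=3):
    """Return indices of the top_k audio embeddings most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    A = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    sims = A @ q                      # cosine similarity per audio clip
    return np.argsort(-sims)[:top_k]  # best matches first

rng = np.random.default_rng(1)
audio_embs = rng.normal(size=(10, 512))
query = audio_embs[4] + 0.01 * rng.normal(size=512)  # near-duplicate of clip 4
top = search_audio(query, audio_embs, top_k=3)
print(top[0])  # 4 -- the near-duplicate ranks first
```

In a real pipeline, `query_emb` would come from `get_text_embedding` on the search phrase and `audio_embs` from pre-computed embeddings of an audio catalogue.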