LAION CLAP

1.1.7 · active · verified Fri Apr 17

LAION CLAP (Contrastive Language-Audio Pretraining) is a Python library that provides a pre-trained multimodal model capable of understanding and embedding both text and audio inputs into a shared latent space. This allows for tasks like audio-text retrieval, zero-shot audio classification, and text-to-audio search. The current version is 1.1.7, and releases are typically made to incorporate new model weights, bug fixes, or minor feature enhancements.
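All three tasks rest on the same mechanism: text and audio are mapped into the shared space (CLAP's joint embedding dimension is 512), and relevance is scored by cosine similarity between L2-normalized embeddings. A minimal sketch of that scoring step, using random vectors as stand-ins for real CLAP outputs:

```python
import torch

torch.manual_seed(0)
dim = 512  # CLAP's joint embedding dimension

# Mock embeddings standing in for real CLAP outputs:
# 3 candidate captions, 1 query audio clip
text_emb = torch.randn(3, dim)
audio_emb = torch.randn(1, dim)

# Cosine similarity = dot product of L2-normalized vectors
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
audio_emb = audio_emb / audio_emb.norm(dim=-1, keepdim=True)

scores = audio_emb @ text_emb.T  # shape (1, 3)
best = scores.argmax(dim=-1)     # index of the best-matching caption
print(scores.shape, best.shape)
```

With real embeddings, ranking `scores` per audio clip gives text-to-audio search, ranking per caption gives audio-text retrieval.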

Install

pip install laion_clap

Imports

import torch
import laion_clap

Quickstart

This quickstart initializes the CLAP model, generates embeddings for text and (dummy) audio inputs, and computes the similarity between them. `load_ckpt()` downloads the pretrained checkpoint automatically on the first run. For real audio files, either pass a list of file paths to `get_audio_embedding_from_filelist`, or load the waveforms at 48 kHz with a library like `soundfile`, `torchaudio`, or `librosa` and pass them to `get_audio_embedding_from_data`.

import torch
import laion_clap

# Initialize the CLAP model (HTSAT audio encoder + RoBERTa text encoder).
# CLAP_Module selects CUDA automatically when it is available.
model = laion_clap.CLAP_Module(enable_fusion=False)

# Download and load the default pretrained checkpoint.
# The first run fetches the weights (a large download, on the order of GB).
model.load_ckpt()

# --- Text Embedding Example ---
text_data = [
    "A clear audio recording of a dog barking.",
    "The sound of waves crashing on the shore."
]
text_embeddings = model.get_text_embedding(text_data, use_tensor=True)
print(f"Text embeddings shape: {text_embeddings.shape}")  # (2, 512)

# --- Audio Embedding Example ---
# For a runnable quickstart without needing actual audio files,
# we generate dummy audio data. In a real scenario, you'd load files.
# CLAP expects audio at a 48 kHz sampling rate, mono channel.
sample_rate = 48000
duration_seconds = 5

# Batch of 2 mono waveforms, shape (2, 5 seconds at 48 kHz)
dummy_audio = torch.randn(2, sample_rate * duration_seconds)

audio_embeddings = model.get_audio_embedding_from_data(x=dummy_audio, use_tensor=True)
print(f"Audio embeddings shape: {audio_embeddings.shape}")  # (2, 512)

# --- Similarity Calculation ---
# Normalize embeddings so the dot product is cosine similarity
text_embeddings_norm = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
audio_embeddings_norm = audio_embeddings / audio_embeddings.norm(dim=-1, keepdim=True)

similarity = text_embeddings_norm @ audio_embeddings_norm.T
print(f"\nSimilarity scores (text x audio):\n{similarity.detach().cpu().numpy()}")
# With real audio (not random noise), text[0] should score highest against a
# barking clip and text[1] against a clip of waves.
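The same similarity matrix is the basis for zero-shot audio classification: embed one caption per class label, then apply a softmax over each audio clip's similarities to all labels to get class probabilities. A sketch of that final step with mock similarity scores, so it runs without the model; the temperature of 100.0 is the usual CLIP-style logit scale, used here purely for illustration:

```python
import torch

# Mock cosine similarities for 2 audio clips vs 3 class captions,
# standing in for real CLAP outputs
logits = torch.tensor([[0.31, 0.05, 0.02],
                       [0.04, 0.28, 0.07]])

# Scale by a temperature to sharpen the distribution, then softmax
# each row into a probability distribution over the class labels.
probs = (logits * 100.0).softmax(dim=-1)

pred = probs.argmax(dim=-1)  # predicted class index per clip
print(pred)  # tensor([0, 1])
```

Each row of `probs` sums to 1, so the scores can be read directly as per-clip class probabilities.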
