S3Tokenizer

0.3.0 · active · verified Thu Apr 16

S3Tokenizer is a Python library that provides a reverse-engineered PyTorch implementation of the Supervised Semantic Speech Tokenizer (S3Tokenizer), originally proposed in CosyVoice. It supports high-throughput batch inference and online extraction of speech codes. The current version is 0.3.0. Releases are frequent and regularly add support for newer CosyVoice versions and improve audio processing.

Quickstart

This quickstart loads an S3Tokenizer model, loads an audio file, and extracts discrete speech codes from it. The example uses a GPU when one is available and falls back to CPU otherwise. It also generates a dummy WAV file so that it can run immediately.

import os

import torch  # imported at module level: needed below even when the dummy file already exists
import s3tokenizer

# The example below uses a generated dummy file so it runs without extra assets.
# The dummy contains random noise, so the resulting codes carry no speech content;
# for a meaningful result, replace it with a real .wav file, for example:
# https://github.com/xingchensong/S3Tokenizer/blob/main/s3tokenizer/assets/BAC009S0764W0121.wav

# Create a dummy .wav file if it doesn't exist for a runnable example
dummy_wav_path = "dummy_audio.wav"
if not os.path.exists(dummy_wav_path):
    try:
        import torch
        import torchaudio
    except ImportError:
        # Without torch/torchaudio the dummy file cannot be generated
        raise SystemExit("torch or torchaudio not found. Cannot create dummy audio. Please provide a real .wav file.")
    sample_rate = 16000
    duration_seconds = 5
    # Low-amplitude noise keeps samples inside the [-1, 1] range expected for float audio
    waveform = 0.1 * torch.randn(1, sample_rate * duration_seconds)
    torchaudio.save(dummy_wav_path, waveform, sample_rate)
    print(f"Created dummy audio file: {dummy_wav_path}")


# Load the tokenizer model, preferring CUDA if available
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = s3tokenizer.load_model("speech_tokenizer_v1").to(device)
print(f"Tokenizer model loaded on device: {device}")

# Load an audio file and convert it to the log-mel features that quantize() expects
# Replace `dummy_wav_path` with your actual audio file path if not using the dummy
if os.path.exists(dummy_wav_path):
    audio = s3tokenizer.load_audio(dummy_wav_path)  # 1-D waveform tensor
    mel = s3tokenizer.log_mel_spectrogram(audio)
    # padding() batches a list of mels and returns the padded tensor plus per-item lengths
    mels, mels_lens = s3tokenizer.padding([mel])

    # Quantize the features to get speech codes
    speech_codes, speech_codes_lens = tokenizer.quantize(mels.to(device), mels_lens.to(device))

    print(f"Shape of extracted speech codes: {speech_codes.shape}")
    print(f"Length of speech codes: {speech_codes_lens[0].item()}")
else:
    print(f"Error: Audio file not found at {dummy_wav_path}.")
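The quickstart handles a single file, while the library is described as supporting batch inference. The `quantize` call takes a lengths tensor alongside its input and returns code lengths alongside the codes, which is the usual pattern for batching variable-length sequences: pad every item to a common length, carry the true lengths, and trim afterwards. The sketch below illustrates that general pattern in plain Python. `pad_batch` and `unpad` are hypothetical helpers written for this illustration, not s3tokenizer functions, and the library's own batching may differ in detail.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    seqs: Sequence[Sequence[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[int]]:
    """Right-pad variable-length sequences to a common length.

    Hypothetical helper for illustration only; not part of s3tokenizer.
    Returns the padded rows and the original length of each row.
    """
    lengths = [len(s) for s in seqs]
    max_len = max(lengths, default=0)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
    return padded, lengths


def unpad(batch: Sequence[Sequence[float]], lengths: Sequence[int]) -> List[List[float]]:
    """Trim each padded row back to its true length (also hypothetical)."""
    return [list(row[:n]) for row, n in zip(batch, lengths)]


# Three "utterances" of different lengths stand in for per-file features.
features = [[0.1, 0.2, 0.3], [0.4], [0.5, 0.6]]
padded, lengths = pad_batch(features)
restored = unpad(padded, lengths)
print(padded)
print(lengths)
```

With real model output the same trimming step applies: only the first `speech_codes_lens[i]` entries of row `i` in `speech_codes` are meaningful, and the rest is padding.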
