S3Tokenizer
S3Tokenizer is a Python library providing a reverse-engineered PyTorch implementation of the Supervised Semantic Speech Tokenizer (S3Tokenizer) originally proposed in CosyVoice. It enables high-throughput batch inference and online speech-code extraction. The current version is 0.3.0, and the project maintains a rapid release cadence, frequently adding support for newer CosyVoice versions and improved audio processing.
Common errors
- speech_tokenizer_v3_25hz tokens produce very different reconstruction vs CosyVoice tokens (same shape/length, very different codes)
  - cause: Reported inconsistency or fidelity issues with the `speech_tokenizer_v3_25hz` model's output compared to the original CosyVoice tokens, despite matching shape and length.
  - fix: Monitor the project's GitHub issues (#49) for updates or official fixes. Consider using earlier, more stable model versions (e.g., `speech_tokenizer_v1`, `speech_tokenizer_v2_25hz`) if `v3_25hz` exhibits critical reconstruction discrepancies for your application.
- RuntimeError: No CUDA GPUs are available
  - cause: Attempting to load the model onto a CUDA device (`.cuda()` or `.to("cuda")`) when no compatible NVIDIA GPU or CUDA installation is detected on the system.
  - fix: Ensure PyTorch with CUDA support is correctly installed (`pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`, or the index URL matching your CUDA version). If no GPU is available, load the model on CPU: `tokenizer = s3tokenizer.load_model("...").to("cpu")`.
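The CPU-fallback fix can be sketched with plain PyTorch; the `load_model` call is shown in a comment since it fetches model weights, and everything beyond `torch.cuda.is_available()` here is illustrative:

```python
import torch

# Select a device defensively: fall back to CPU when no CUDA GPU is
# present, which avoids "RuntimeError: No CUDA GPUs are available".
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# With s3tokenizer installed, the model would then be moved to that device:
#   import s3tokenizer
#   tokenizer = s3tokenizer.load_model("speech_tokenizer_v1").to(device)
```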
Warnings
- gotcha: Automatic long-audio processing, introduced in v0.2.0 and refined in v0.2.5, transparently handles audio longer than 30 seconds by segmenting it with a sliding window (30-second window, 4-second overlap). No explicit user action is required, but advanced users should be aware of this internal behavior for specific use cases or debugging.
- gotcha: When upgrading for CosyVoice3 support, ensure you use the correct model identifier, such as `speech_tokenizer_v3_25hz`. The new models are supported, but an open issue reports reconstruction-quality differences from the original CosyVoice tokens for `v3_25hz` models.
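The sliding-window behavior above can be illustrated with a small segmentation sketch. The function name and exact boundary handling below are assumptions for illustration, not s3tokenizer API; only the 30-second window and 4-second overlap come from the documented behavior:

```python
def sliding_window_segments(num_samples, sample_rate=16000,
                            window_s=30.0, overlap_s=4.0):
    """Split an audio length into (start, end) sample ranges using a
    30 s window with 4 s overlap (illustrative, not library internals)."""
    window = int(window_s * sample_rate)
    hop = int((window_s - overlap_s) * sample_rate)  # 26 s stride
    segments = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        segments.append((start, end))
        if end == num_samples:
            break
        start += hop
    return segments

# 70 s of 16 kHz audio -> three overlapping 30 s (or shorter final) segments
segs = sliding_window_segments(70 * 16000)
```

Consecutive segments share 4 seconds of audio, which lets the library stitch codes across boundaries without discontinuities.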
Install
- pip install s3tokenizer
Imports
- load_model
  import s3tokenizer
  tokenizer = s3tokenizer.load_model("speech_tokenizer_v1")
- load_audio
  import s3tokenizer
  audio = s3tokenizer.load_audio("path/to/audio.wav")
Quickstart
import os

import torch
import s3tokenizer

# Create a dummy .wav file if one doesn't exist so the example is runnable.
# In a real use case, replace this with your own .wav file path, e.g. the
# sample from the repo assets:
# https://github.com/xingchensong/S3Tokenizer/blob/main/s3tokenizer/assets/BAC009S0764W0121.wav
dummy_wav_path = "dummy_audio.wav"
if not os.path.exists(dummy_wav_path):
    try:
        import torchaudio
        sample_rate = 16000
        duration_seconds = 5
        waveform = torch.randn(1, sample_rate * duration_seconds)
        torchaudio.save(dummy_wav_path, waveform, sample_rate)
        print(f"Created dummy audio file: {dummy_wav_path}")
    except ImportError:
        print("torchaudio not found. Cannot create dummy audio; please provide a real .wav file.")
        raise SystemExit(1)

# Load the tokenizer model, preferring CUDA if available, otherwise CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = s3tokenizer.load_model("speech_tokenizer_v1").to(device)
print(f"Tokenizer model loaded on device: {device}")

# Load the audio (a 1-D tensor of 16 kHz samples), compute its log-mel
# spectrogram, and pad it into a batch of one.
audio = s3tokenizer.load_audio(dummy_wav_path)
mels, mels_lens = s3tokenizer.padding([s3tokenizer.log_mel_spectrogram(audio)])

# Quantize the mel features to get speech codes.
speech_codes, speech_codes_lens = tokenizer.quantize(mels.to(device), mels_lens.to(device))
print(f"Shape of extracted speech codes: {speech_codes.shape}")
print(f"Length of speech codes: {speech_codes_lens.item()}")