{"id":7697,"library":"s3tokenizer","title":"S3Tokenizer","description":"S3Tokenizer is a Python library that provides a reverse-engineered PyTorch implementation of the Supervised Semantic Speech Tokenizer (S3Tokenizer), originally proposed in CosyVoice. It enables high-throughput batch inference and online speech code extraction. The current version is 0.3.0; the project releases frequently, adding support for newer CosyVoice tokenizer versions and improved audio processing (e.g., automatic long-audio handling).","status":"active","version":"0.3.0","language":"en","source_language":"en","source_url":"https://github.com/xingchensong/S3Tokenizer","tags":["audio","speech","tokenizer","pytorch","cosyvoice","nlp","ai","machine-learning"],"install":[{"cmd":"pip install s3tokenizer","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core deep learning framework for model implementation.","package":"torch","optional":false},{"reason":"Audio processing functionalities, tightly integrated with PyTorch.","package":"torchaudio","optional":false},{"reason":"Progress bar for iterative processes (e.g., batch processing).","package":"tqdm","optional":false},{"reason":"Fundamental package for numerical computing in Python.","package":"numpy","optional":false},{"reason":"Flexible and powerful tensor operations for neural networks.","package":"einops","optional":false},{"reason":"Runtime for ONNX models, used for converting original ONNX weights to PyTorch.","package":"onnxruntime","optional":false},{"reason":"Library for reading and writing sound files.","package":"soundfile","optional":false}],"imports":[{"note":"Primary entry point for loading S3Tokenizer models.","symbol":"load_model","correct":"import s3tokenizer\ntokenizer = s3tokenizer.load_model(\"speech_tokenizer_v1\")"},{"note":"Utility function for loading an audio file as a 16 kHz mono tensor.","symbol":"load_audio","correct":"import s3tokenizer\naudio = 
s3tokenizer.load_audio(\"path/to/audio.wav\")"}],"quickstart":{"code":"import s3tokenizer\nimport torch\nimport os\n\n# Create a dummy .wav file if one doesn't exist so the example runs end-to-end.\n# In real use, replace `dummy_wav_path` with your own .wav file, e.g. the sample\n# shipped with the repo:\n# https://github.com/xingchensong/S3Tokenizer/blob/main/s3tokenizer/assets/BAC009S0764W0121.wav\ndummy_wav_path = \"dummy_audio.wav\"\nif not os.path.exists(dummy_wav_path):\n    import torchaudio\n    sample_rate = 16000\n    duration_seconds = 5\n    waveform = torch.randn(1, sample_rate * duration_seconds)\n    torchaudio.save(dummy_wav_path, waveform, sample_rate)\n    print(f\"Created dummy audio file: {dummy_wav_path}\")\n\n# Load the tokenizer model, preferring CUDA if available\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ntokenizer = s3tokenizer.load_model(\"speech_tokenizer_v1\").to(device)\nprint(f\"Tokenizer model loaded on device: {device}\")\n\n# Load the audio, compute its log-mel spectrogram, and pad to a batch of one\naudio = s3tokenizer.load_audio(dummy_wav_path)\nmel = s3tokenizer.log_mel_spectrogram(audio)\nmels, mels_lens = s3tokenizer.padding([mel])\n\n# Quantize the mel features into discrete speech codes\ncodes, codes_lens = tokenizer.quantize(mels.to(device), mels_lens.to(device))\n\nprint(f\"Shape of extracted speech codes: {codes.shape}\")\nprint(f\"Length of speech codes: {codes_lens[0].item()}\")","lang":"python","description":"This quickstart shows how to load an S3Tokenizer model, load an audio file, compute its log-mel spectrogram, and extract discrete speech codes. It prefers the GPU when available, falls back to CPU, and generates a dummy WAV file so the example runs immediately. Note that `tokenizer.quantize` consumes batched log-mel features (produced via `log_mel_spectrogram` and `padding`), not raw waveforms."},"warnings":[{"fix":"No direct fix needed, but be aware of the internal long audio processing mechanism. Consult the GitHub README for details on windowing and overlap if fine-grained control is required.","message":"Automatic long audio processing introduced in v0.2.0 (and refined in v0.2.5) transparently handles audio longer than 30 seconds by segmenting it with a sliding window (30-second window, 4-second overlap). While this requires no explicit user action, advanced users should be aware of this internal behavior for specific use cases or debugging.","severity":"gotcha","affected_versions":">=0.2.0"},{"fix":"Explicitly specify the desired model version (e.g., `s3tokenizer.load_model(\"speech_tokenizer_v3_25hz\")`). If encountering reconstruction quality issues with `v3_25hz`, consider testing with previous model versions (`v1`, `v2`) or referring to GitHub issues for updates.","message":"When upgrading to support CosyVoice3, ensure you are using the correct model identifier, such as `speech_tokenizer_v3_25hz`. While new models are supported, an open issue suggests potential differences in reconstruction quality compared to original CosyVoice tokens for `v3_25hz` models.","severity":"gotcha","affected_versions":">=0.2.5"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Monitor the project's GitHub issues (#49) for updates or official fixes. Consider using earlier, more stable model versions (e.g., `speech_tokenizer_v1`, `speech_tokenizer_v2_25hz`) if `v3_25hz` exhibits critical reconstruction discrepancies for your application.","cause":"Reported inconsistency or fidelity issues with the `speech_tokenizer_v3_25hz` model's output compared to the original CosyVoice tokens, despite matching shape and length.","error":"speech_tokenizer_v3_25hz tokens produce very different reconstruction vs CosyVoice tokens (same shape/length, very different codes)"},{"fix":"Ensure PyTorch with CUDA support is correctly installed (`pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118` or appropriate CUDA version). If no GPU is available, load the model to CPU: `tokenizer = s3tokenizer.load_model(\"...\").to(\"cpu\")`.","cause":"Attempting to load the model onto a CUDA-enabled device (`.cuda()` or `.to(\"cuda\")`) when no compatible NVIDIA GPU or CUDA installation is detected on the system.","error":"RuntimeError: No CUDA GPUs are available"}]}