FunASR

1.3.1 · active · verified Thu Apr 16

FunASR is a fundamental end-to-end speech recognition toolkit from Alibaba DAMO Academy, currently at version 1.3.1. It covers Automatic Speech Recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, and Speaker Diarization. The library is actively maintained and frequently releases new models and features, such as Fun-ASR-Nano-2512, which supports 31 languages and low-latency real-time transcription.

Common errors

Warnings

Install
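A minimal install sketch based on FunASR's published instructions; the exact PyTorch build (CPU vs. a specific CUDA version) depends on your environment, so treat the first command as a placeholder:

```shell
# Install PyTorch first (FunASR depends on torch/torchaudio; pick the right build at pytorch.org)
pip3 install torch torchaudio

# Install FunASR itself from PyPI
pip3 install -U funasr

# Optional: hub clients used for automatic model downloads
pip3 install -U modelscope huggingface_hub
```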

Imports

Quickstart

This quickstart demonstrates non-streaming Automatic Speech Recognition (ASR) with the `AutoModel` class. It downloads the `paraformer-zh` model (paired here with `fsmn-vad` for voice activity detection and a punctuation model), transcribes a remote audio file, and prints the result. You can switch to a local audio file or run on GPU by changing the `device` argument.

from funasr import AutoModel
import os

# You might need to set an environment variable for modelscope token if hitting rate limits or private models
# os.environ['MODELSCOPE_API_TOKEN'] = 'your_token_here'

# Initialize the ASR model, will download 'paraformer-zh' from ModelScope if not local
# 'paraformer-zh' is a multi-functional model, with VAD and PUNC integrated.
# Use a public audio URL for demonstration
model = AutoModel(model="paraformer-zh", 
                  vad_model="fsmn-vad", 
                  punc_model="ct-punc-c",
                  device="cpu") # Specify 'cuda:0' for GPU if available

# Example audio input: a remote URL or a local file path
# For a local file, ensure it exists, e.g., 'path/to/your/audio.wav'
# Using a provided example audio from FunASR's repository
audio_input = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav"

print(f"Processing audio: {audio_input}")

# Perform speech recognition
# The generate method returns a list of dictionaries with transcription results
result = model.generate(input=audio_input)

# Print the transcription result
if result and result[0].get('text'):
    print(f"Transcription: {result[0]['text']}")
else:
    print("No transcription result found.")

# Example of VAD (Voice Activity Detection)
# model_vad = AutoModel(model="fsmn-vad", device="cpu")
# vad_result = model_vad.generate(input=audio_input)
# print(f"VAD Result: {vad_result}")
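The snippets below are a small, self-contained sketch of post-processing such results. The `results` and `vad_value` variables are illustrative mock data shaped like FunASR's typical output (one dict per input with `key` and `text` fields for ASR; `[start_ms, end_ms]` pairs in the `value` field for VAD), not output from a real run:

```python
# Mock of what AutoModel.generate typically returns for ASR:
# one dict per input, with at least a 'key' (input id) and 'text' field.
# (Illustrative sample data, not a real transcription.)
results = [
    {"key": "vad_example", "text": "hello world this is a test"},
    {"key": "second_clip", "text": ""},
]

def collect_transcripts(results):
    """Map each input key to its transcription, skipping empty results."""
    return {r["key"]: r["text"] for r in results if r.get("text")}

# Mock of a VAD 'value' field: speech segments as [start_ms, end_ms] pairs.
vad_value = [[0, 1000], [2000, 3500]]

def speech_seconds(vad_value):
    """Total speech duration in seconds from [start_ms, end_ms] segment pairs."""
    return sum(end - beg for beg, end in vad_value) / 1000.0

print(collect_transcripts(results))  # {'vad_example': 'hello world this is a test'}
print(speech_seconds(vad_value))     # 2.5
```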
