Azure Cognitive Services Speech SDK for Python
The Microsoft Cognitive Services Speech SDK for Python (current version 1.49.0) provides robust capabilities for integrating speech-to-text, text-to-speech, and speech translation into Python applications. It supports both real-time and non-real-time scenarios across various platforms, enabling developers to build intelligent speech-enabled features. The library maintains an active release cadence with frequent updates.
Warnings
- breaking Standard text-to-speech voices were retired on August 31, 2024. Applications using these voices must migrate to neural voices to avoid service disruption.
- breaking Support for Intent Recognition and Speaker Recognition has been removed due to service retirement.
- gotcha Network connectivity problems caused by firewalls, proxies, or incorrect endpoint configuration are common; the SDK may fail silently without raising a clear exception.
- gotcha On Windows, the Speech SDK requires the Microsoft Visual C++ Redistributable for Visual Studio 2015-2022 to be installed.
- gotcha Latency issues, especially with large SSML files or certain neural voices (e.g., F1 tier), can lead to partial audio output or 'Internal Server Error' due to timeouts.
- gotcha Authentication failures often stem from incorrect API keys, expired tokens, or mismatches between the specified region/endpoint in code and the actual Azure resource deployment.
Install
-
pip install azure-cognitiveservices-speech
Imports
- speechsdk
import azure.cognitiveservices.speech as speechsdk
- SpeechConfig
speechsdk.SpeechConfig
- AudioOutputConfig
speechsdk.audio.AudioOutputConfig
- SpeechSynthesizer
speechsdk.SpeechSynthesizer
- SpeechRecognizer
speechsdk.SpeechRecognizer
- ResultReason
speechsdk.ResultReason
Quickstart
import os
import azure.cognitiveservices.speech as speechsdk
# This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION" to be set.
# Replace with your own subscription key and service region, e.g. "westus" or "eastus".
speech_key = os.environ.get('SPEECH_KEY', '')
speech_region = os.environ.get('SPEECH_REGION', '') # e.g., 'westus'
if not speech_key or not speech_region:
    print("Please set the SPEECH_KEY and SPEECH_REGION environment variables.")
    exit()
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)
# The neural multilingual voice can speak different languages based on the input text.
speech_config.speech_synthesis_voice_name='en-US-AvaMultilingualNeural'
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
print("Enter some text that you want to speak (type 'exit' to quit) >")
while True:
    text = input()
    if text.lower() == 'exit':
        break
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()
    if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print(f"Speech synthesized for text: [{text}]")
    elif speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_synthesis_result.cancellation_details
        print(f"Speech synthesis canceled: {cancellation_details.reason}")
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            if cancellation_details.error_details:
                print(f"Error details: {cancellation_details.error_details}")
            print("Did you set the speech resource key and region environment variables correctly?")