Google Cloud Speech-to-Text Python Client


The `google-cloud-speech` Python client library provides access to the Google Cloud Speech-to-Text API. It lets developers convert audio to text using Google's neural network models, with support for a wide range of languages and audio formats. As of version 2.38.0, the library is actively maintained, with releases roughly monthly or bi-monthly.

pip install google-cloud-speech
Error: ModuleNotFoundError: No module named 'google.cloud'
Cause: The `google-cloud-speech` library is not installed in the Python environment actually being used, or an outdated/conflicting `google-cloud` package was installed instead.
Fix: Install the library with pip: `pip install google-cloud-speech`.
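A quick standard-library check can confirm whether the package is visible to the interpreter you are actually running (pip sometimes installs into a different environment). This is a minimal sketch; `is_importable` is a hypothetical helper name:

```python
import importlib.util
import sys

def is_importable(module_name: str) -> bool:
    """Return True if module_name resolves in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # find_spec raises this when a parent package (e.g. `google`) is itself missing.
        return False

if __name__ == "__main__":
    # Print the interpreter path: pip may have targeted a different one.
    print(f"Interpreter: {sys.executable}")
    if is_importable("google.cloud.speech"):
        print("google-cloud-speech is installed.")
    else:
        print("Not found. Run: pip install google-cloud-speech")
```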
Error: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials.
Cause: The application cannot find valid Google Cloud credentials to authenticate with the Speech-to-Text API, or the authenticated identity lacks the necessary permissions.
Fix: Set up Application Default Credentials by running `gcloud auth application-default login`, or set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account key JSON file. Ensure the associated service account has the 'Cloud Speech-to-Text API User' role.
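When the error is intermittent or environment-specific, it can help to check locally which credentials source (if any) would be picked up. The sketch below checks the two common local sources on Linux/macOS: the environment variable, and the well-known file written by `gcloud auth application-default login`. `describe_adc_source` is a hypothetical helper, and the gcloud file path differs on Windows (`%APPDATA%\gcloud\...`):

```python
import os
from pathlib import Path

def describe_adc_source() -> str:
    """Report which local Application Default Credentials source appears configured."""
    env_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if env_path:
        if Path(env_path).is_file():
            return f"env var GOOGLE_APPLICATION_CREDENTIALS -> {env_path}"
        # A set-but-dangling path is a common cause of DefaultCredentialsError.
        return f"GOOGLE_APPLICATION_CREDENTIALS is set but file is missing: {env_path}"
    gcloud_adc = Path.home() / ".config" / "gcloud" / "application_default_credentials.json"
    if gcloud_adc.is_file():
        return f"gcloud ADC file -> {gcloud_adc}"
    return "no local credentials found; run `gcloud auth application-default login`"
```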
Error: google.api_core.exceptions.InvalidArgument: 400 Must use single channel (mono) audio, but WAV header indicates 2 channels.
Cause: The `RecognitionConfig` sent with the request specifies single-channel (mono) audio, but the provided file has multiple channels (e.g., stereo), or vice versa.
Fix: Ensure the `audio_channel_count` in your `RecognitionConfig` (e.g., `speech.RecognitionConfig(audio_channel_count=2, ...)`) matches the actual number of channels in your audio file. Either convert the audio to mono or explicitly specify the correct channel count.
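Rather than hard-coding channel count and sample rate, you can read them from the WAV header with the standard-library `wave` module and build the config from the file itself. A minimal sketch (`wav_properties` is a hypothetical helper; the commented usage assumes `google-cloud-speech` is installed):

```python
import wave

def wav_properties(path: str) -> dict:
    """Read channel count, sample rate, and sample width from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hertz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
        }

# Usage when building a request:
# props = wav_properties("input.wav")
# config = speech.RecognitionConfig(
#     encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
#     sample_rate_hertz=props["sample_rate_hertz"],
#     audio_channel_count=props["channels"],
#     language_code="en-US",
# )
```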
Breaking: The Speech-to-Text V2 API is not a drop-in replacement for V1. It features a modernized interface, new features, and different pricing. Existing V1 code will require modification to use V2.
Fix: Refer to the official documentation for the V2 API client library and update your code to use the new V2 models and request structures. Be aware of the `SpeechClient` vs `speech_v2.SpeechClient` instantiation.
Gotcha: The most common error is `DefaultCredentialsError`, indicating that the client cannot find valid authentication credentials.
Fix: Ensure the `GOOGLE_APPLICATION_CREDENTIALS` environment variable points to your service account JSON key file, or explicitly pass `credentials` to the `SpeechClient` constructor. Make sure the service account has the 'Cloud Speech-to-Text User' role.
Gotcha: Incorrect audio encoding, sample rate, or format (e.g., trying to transcribe an MP3 with a `LINEAR16` config) leads to transcription errors or poor results. For files stored in Google Cloud Storage, the URI must be in `gs://bucket-name/object-name` format.
Fix: Always match the `RecognitionConfig` parameters (such as `encoding` and `sample_rate_hertz`) to the actual properties of your audio file. For GCS files, ensure the URI is correctly formatted. Consider an audio conversion library if your input format is not directly supported or needs normalization.
Gotcha: Streaming transcription of longer audio (especially for certain non-English languages) may encounter intermittent failures around the 4-minute mark due to internal streaming limits or processing complexities.
Fix: For very long audio, consider asynchronous batch processing via `LongRunningRecognize`. For streaming, ensure stable network connectivity and consider breaking audio into shorter segments or adjusting `streaming_limit` if applicable. Simplify audio input by speaking clearly and minimizing background noise.
Gotcha: When using streaming recognition with `interim_results=True` in the V2 API, the `responses_iterator` might block until all requests are done instead of yielding results immediately, which can be unexpected for real-time applications.
Fix: This is an ongoing issue being tracked. Monitor the GitHub repository for updates and potential workarounds. You may need to adjust your application's logic to handle delayed interim results, or consider alternative streaming patterns if real-time interim results are critical.
| python | os / libc     | status | wheel | install | import | disk  |
|--------|---------------|--------|-------|---------|--------|-------|
| 3.9    | alpine (musl) | -      | -     | 1.61s   |        | 70.9M |
| 3.9    | slim (glibc)  | -      | -     | 1.24s   |        | 69M   |
| 3.10   | alpine (musl) | -      | -     | 1.76s   |        | 70.7M |
| 3.10   | slim (glibc)  | -      | -     | 1.06s   |        | 68M   |
| 3.11   | alpine (musl) | -      | -     | 2.52s   |        | 75.7M |
| 3.11   | slim (glibc)  | -      | -     | 1.57s   |        | 73M   |
| 3.12   | alpine (musl) | -      | -     | 2.58s   |        | 67.1M |
| 3.12   | slim (glibc)  | -      | -     | 2.06s   |        | 65M   |
| 3.13   | alpine (musl) | -      | -     | 2.56s   |        | 66.7M |
| 3.13   | slim (glibc)  | -      | -     | 2.44s   |        | 64M   |

This quickstart demonstrates how to transcribe a local audio file using the Google Cloud Speech-to-Text client library. It covers client instantiation, reading audio content, configuring recognition settings, and processing the transcription response. Ensure you have a Google Cloud project with the Speech-to-Text API enabled and your `GOOGLE_APPLICATION_CREDENTIALS` environment variable pointing to a service account key file with appropriate permissions.

import os
from google.cloud import speech

# Set the path to your service account key file
# This is typically done via the GOOGLE_APPLICATION_CREDENTIALS environment variable.
# For local testing, you might set it in code (not recommended for production).
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/your/keyfile.json"

def transcribe_audio(audio_file_path):
    client = speech.SpeechClient()

    with open(audio_file_path, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    try:
        response = client.recognize(config=config, audio=audio)
        for result in response.results:
            print(f"Transcript: {result.alternatives[0].transcript}")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage (replace with your actual audio file)
if __name__ == "__main__":
    # Requirements:
    # - GOOGLE_APPLICATION_CREDENTIALS set to a service account key file, e.g.
    #   export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json"
    # - A 'test.wav' file in this directory: LINEAR16 (16-bit PCM), 16000 Hz, mono.
    #   A silent placeholder can be generated for a dry run:
    #   import numpy as np
    #   from scipy.io.wavfile import write as write_wav
    #   write_wav('test.wav', 16000, np.zeros(16000, dtype=np.int16))
    audio_test_file = "test.wav"
    print(f"Attempting to transcribe: {audio_test_file}")
    transcribe_audio(audio_test_file)