Google’s Speech-to-Text API: A New Milestone in AI Technology Applications

Speech recognition is an interesting task, and a large number of APIs have appeared on the market in recent years. When it comes to real speech data, however, especially recordings from customer service centers, the task becomes more challenging. If a customer service conversation lasts about ten minutes, only a handful of offerings can handle audio of that length (Google, Amazon, IBM, Microsoft, Nuance, Rev.ai, open source Wavenet, open source CMU Sphinx). In this article, I will show how to send speech recognition requests to Speech-to-Text from the command line using Google Cloud, which provides speech recognition and transcription services for 125 languages.

Google Speech-to-Text API

Google offers three types of recognition requests, depending on the length and nature of the audio content.

Request type            | Audio duration
Synchronous requests    | Up to about 1 minute
Asynchronous requests   | Up to about 480 minutes (long audio)
Streaming requests      | Up to about 5 minutes (real-time)

Synchronous requests

For a synchronous request, the audio should be no longer than about one minute. For this type of request, you do not need to upload your data to Google Cloud servers: you can keep the speech files on a local computer or server and call Google's API to transcribe them into text. This is the type of request we will detail in this article.
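
As a preview of the request format detailed later in this article, a local file can be sent inline by base64-encoding it into the request's audio.content field instead of referencing a Cloud Storage URI. The following is a minimal sketch, assuming a local 16 kHz FLAC file named audio.flac and GNU coreutils base64 (on macOS, use base64 -i audio.flac):

$ base64 -w 0 audio.flac > audio.b64
$ cat > sync-local-request.json <<EOF
{
    "config": {
        "encoding": "FLAC",
        "sampleRateHertz": 16000,
        "languageCode": "en-US"
    },
    "audio": {
        "content": "$(cat audio.b64)"
    }
}
EOF
$ curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1/speech:recognize \
    -d @sync-local-request.json

The inline payload is still subject to the roughly one-minute limit for synchronous requests.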

Asynchronous requests

This type of request handles audio of up to approximately 480 minutes (8 hours). For these requests, users must first upload their data to Google Cloud Storage.
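
As an illustration only (the rest of this article uses synchronous requests), an asynchronous request targets the speech:longrunningrecognize endpoint, points at audio stored in a Cloud Storage bucket, and returns an operation name that you poll for the finished transcript. The bucket, file, and operation names below are placeholders:

$ curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1/speech:longrunningrecognize \
    -d '{
          "config": {"encoding": "FLAC", "sampleRateHertz": 16000, "languageCode": "en-US"},
          "audio": {"uri": "gs://YOUR_BUCKET/long-call-recording.flac"}
        }'
# The response contains an operation "name"; poll it until "done" is true:
$ curl -s -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1/operations/OPERATION_NAME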

Streaming requests

Streaming requests are suited to audio captured live, for example directly from a microphone, when real-time transcription is required. Each streaming session is limited to roughly 5 minutes of audio. Streaming recognition is performed over gRPC through the client libraries rather than the REST endpoint used in this article.

Preparation

Some initial setup is needed before you start; it can be completed by following the instructions below.

The following operations must be completed before sending requests to the Speech-to-Text API:

  • Enable the Speech-to-Text API on Google Cloud
      1. Ensure billing is enabled for Speech-to-Text
      2. Create and/or assign one or more service accounts to Speech-to-Text
      3. Download the service account credential key
  • Set up authentication environment variables
  • (Optional) Create a new Google Cloud Storage bucket to access your speech data

Step 1: Enable the Speech-to-Text API on GCP

Step 2: Create and/or assign one or more service accounts to Speech-to-Text

Step 3: Download the service account credential key

Step 4: Set up authentication environment variables
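
If you prefer the command line to the Cloud Console, Steps 1 through 4 can also be carried out with gcloud. This is a sketch with placeholder project, account, and file names; it assumes your account has permission to manage services and service accounts, and depending on your project's IAM policy you may also need to grant the new account additional roles:

$ gcloud services enable speech.googleapis.com --project=YOUR_PROJECT_ID
$ gcloud iam service-accounts create speech-to-text-client \
    --display-name="Speech-to-Text client" --project=YOUR_PROJECT_ID
$ gcloud iam service-accounts keys create speech-key.json \
    --iam-account=speech-to-text-client@YOUR_PROJECT_ID.iam.gserviceaccount.com
$ export GOOGLE_APPLICATION_CREDENTIALS=speech-key.json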

Activate Cloud Shell

1. Create a new Node.js application directory:
   $ mkdir speech-to-text-nodejs
2. Set the speech-to-text-nodejs directory as your Cloud Shell workspace and open it:
   $ cd speech-to-text-nodejs; cloudshell open-workspace .
3. Upload the key file to the current speech-to-text-nodejs working directory.
4. Set the key as the default credential:
   $ export GOOGLE_APPLICATION_CREDENTIALS=XXXXXX.json

Send an audio file transcription request

Step 1: Create a JSON request file with the following content, and save it as a plain-text file named sync-speechtotext.json.

{
    "config": {
        "encoding":"FLAC",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "enableWordTimeOffsets": false
    },
    "audio": {
        "uri":"gs://cloud-samples-tests/speech/brooklyn.flac"
    }
}

Explanation: In this JSON fragment, the audio file is encoded in FLAC format, has a sampling rate of 16000 Hz, and is located at the given URI in Google Cloud Storage.
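
As a small variant of the request above, setting enableWordTimeOffsets to true asks the API to return per-word timestamps; each alternative in the response then includes a words list with startTime and endTime values:

{
    "config": {
        "encoding": "FLAC",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "enableWordTimeOffsets": true
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
    }
}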

Step 2: Use curl to send the speech recognition request:

$ curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1/speech:recognize \
    -d @sync-speechtotext.json

The request returns the following result:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.9828748
        }
      ],
      "resultEndTime": "1.770s",
      "languageCode": "en-us"
    }
  ],
  "totalBilledTime": "2s",
  "requestId": "866425412681567144"
}
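
If you only need the transcript text, the response can be piped through jq (assuming jq is available in your shell):

$ curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1/speech:recognize \
    -d @sync-speechtotext.json \
    | jq -r '.results[].alternatives[0].transcript'
how old is the Brooklyn Bridge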

Congratulations! You have sent your first request to Speech-to-Text!

If you receive an error or empty response from Speech-to-Text, please refer to troubleshooting and debugging steps.

A common error when authentication with Speech-to-Text fails is a 403 response indicating that application default credentials are not available. In the example below, the error details also report SERVICE_DISABLED, which means the Speech-to-Text API has not been enabled for the project.

$ curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1/speech:recognize \
    -d @sync-speechtotext.json
{
  "error": {
    "code": 403,
    "message": "Your application has authenticated using end user credentials from the Google Cloud SDK or Google Cloud Shell which are not supported by the speech.googleapis.com. We recommend configuring the billing/quota_project setting in gcloud or using a service account through the auth/impersonate_service_account setting. For more information about service accounts and how to use them in your application, see https://cloud.google.com/docs/authentication/. If you are getting this error with curl or similar tools, you may need to specify 'X-Goog-User-Project' HTTP header for quota and billing purposes. For more information regarding 'X-Goog-User-Project' header, please check https://cloud.google.com/apis/docs/system-parameters.",
    "status": "PERMISSION_DENIED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "SERVICE_DISABLED",
        "domain": "googleapis.com",
        "metadata": {
          "consumer": "projects/618104708054",
          "service": "speech.googleapis.com"
        }
      }
    ]
  }
}

Create a service account for your project and download its key (a JSON file) to your development environment. Then set the location of that JSON file in an environment variable named GOOGLE_APPLICATION_CREDENTIALS.

The following command verifies that the GOOGLE_APPLICATION_CREDENTIALS environment variable points to a valid service account key JSON file:

$ cat $GOOGLE_APPLICATION_CREDENTIALS

If it is not set correctly, you can set the key file as the default credential:

$ export GOOGLE_APPLICATION_CREDENTIALS=XXXXXX.json

The Google Speech-to-Text API can only handle specific audio encodings. The following table lists the supported encodings:

Encoder/Decoder        | Name                                          | Lossless | Usage description
MP3                    | MPEG Audio Layer III                          | No       | MP3 encoding is a beta feature and is only available in v1p1beta1; see the RecognitionConfig reference documentation.
FLAC                   | Free Lossless Audio Codec                     | Yes      | Streams require a bit depth of 16 or 24 bits.
LINEAR16               | Linear PCM                                    | Yes      | 16-bit linear pulse-code modulation (PCM) encoding; the file header must contain the sample rate.
MULAW                  | μ-law                                         | No       | 8-bit PCM encoding.
AMR                    | Adaptive Multi-Rate Narrowband                | No       | Sample rate must be 8000 Hz.
AMR_WB                 | Adaptive Multi-Rate Wideband                  | No       | Sample rate must be 16000 Hz.
OGG_OPUS               | Opus-encoded audio frames in an Ogg container | No       | Sample rate must be 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz.
SPEEX_WITH_HEADER_BYTE | Speex Wideband                                | No       | Sample rate must be 16000 Hz.
WEBM_OPUS              | WebM Opus                                     | No       | Sample rate must be 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz.

Note: FLAC is both an audio codec and an audio file format. To transcribe audio with FLAC encoding, you must provide a .flac file, which includes a header containing metadata. Speech-to-Text also supports WAV files containing audio encoded with LINEAR16 or MULAW.

This means that if your audio file uses an encoding the Google API does not support, you need to convert it to one that it does.
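
For example, a stereo MP3 recording can be converted to 16 kHz mono FLAC before it is sent to the API. This sketch assumes ffmpeg is installed and uses placeholder file names; sox or another converter works just as well:

$ ffmpeg -i customer-call.mp3 -ac 1 -ar 16000 -sample_fmt s16 customer-call.flac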

If you can choose the encoding when recording the source material, use a lossless encoding such as FLAC or LINEAR16 for better speech recognition performance. For guidelines on choosing the appropriate codec for your task, refer to the best practices.
