Speech recognition is an interesting task, and a large number of APIs have appeared on the market recently. Speech data from customer service centers is more challenging, however: a typical conversation lasts about ten minutes, and only a few offerings can handle recordings of that length (Google, Amazon, IBM, Microsoft, Nuance, Rev.ai, open-source Wavenet, open-source CMU Sphinx). In this article, I detail how to send speech recognition requests to Google Cloud Speech-to-Text from the command line. The service supports speech recognition and transcription for 125 languages.
- Google Speech-to-text API
- Step 1: Enable the Speech-to-Text API on GCP
- Step 2: Create and/or assign one or more service accounts to Speech-to-Text
- Step 3: Download the service account credential key
- Step 4: Set up authentication environment variables
- Send an audio file transcription request
- Google Speech-to-Text API can handle specific types of speech encoding.
Google Speech-to-text API
Google offers three types of API requests, distinguished by the length of audio they accept.
Request type | Audio duration limit |
Synchronous requests | Around 1 minute |
Asynchronous requests | Around 480 minutes (long audio) |
Streaming requests | Around 5 minutes (real-time) |
Synchronous requests
For synchronous requests, the audio should be no longer than approximately 1 minute. With this type of request, the user does not need to upload the audio to Google Cloud Storage first. This is convenient because speech files can stay on a local computer or server and be sent to Google’s API for transcription. This is the request type detailed in this article.
Asynchronous requests
This type of request accepts up to approximately 480 minutes (8 hours) of speech content. For this type of request, users are required to upload their audio to Google Cloud Storage.
Streaming requests
Streaming requests are suitable when users speak directly into a microphone and real-time transcription is required. With this type of request, the audio should last no more than about 5 minutes.
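The three limits above can be turned into a small helper that picks a request type from the audio duration. This is only a sketch: the function name `choose_request_type` is hypothetical, and the limits are the approximate figures from the table, not exact quotas.

```python
# Approximate duration limits taken from the table above (in seconds).
SYNC_LIMIT = 1 * 60
STREAMING_LIMIT = 5 * 60
ASYNC_LIMIT = 480 * 60


def choose_request_type(duration_seconds, realtime=False):
    """Return 'synchronous', 'asynchronous' or 'streaming' for a recording.

    Hypothetical helper: it only encodes the approximate limits listed
    in this article and does not call the API.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if realtime:
        if duration_seconds > STREAMING_LIMIT:
            raise ValueError("streaming audio should stay under about 5 minutes")
        return "streaming"
    if duration_seconds <= SYNC_LIMIT:
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT:
        return "asynchronous"
    raise ValueError("audio is longer than about 480 minutes; split it first")


if __name__ == "__main__":
    # A ten-minute customer service call does not fit a synchronous request.
    print(choose_request_type(10 * 60))
```

With these figures, a ten-minute call falls into the asynchronous category, so the audio would have to be uploaded to Google Cloud Storage first.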
Preparation
Some initial setup is needed before starting. The following steps must be completed before sending requests to the Speech-to-Text API:
- Enable the Speech-to-Text API on Google Cloud
- Ensure billing is enabled for Speech-to-Text
- Create and/or assign one or more service accounts to Speech-to-Text
- Download the service account credential key
- Set up authentication environment variables
- (Optional) Create a new Google Cloud Storage bucket to access your speech data
Step 1: Enable the Speech-to-Text API on GCP
Step 2: Create and/or assign one or more service accounts to Speech-to-Text
Step 3: Download the service account credential key
Step 4: Set up authentication environment variables
Activate Cloud Shell
1. Create a new Node.js application directory:
$ mkdir speech-to-text-nodejs
2. Set the speech-to-text-nodejs directory as your Cloud Shell workspace and open it:
$ cd speech-to-text-nodejs; cloudshell open-workspace .
3. Upload the key file to the current speech-to-text-nodejs working directory.
4. Set the key as the default credential:
$ export GOOGLE_APPLICATION_CREDENTIALS=XXXXXX.json
Send an audio file transcription request
Step 1: Create a JSON request file with the following text, and save it as a plain text document named sync-speechtotext.json.
{
  "config": {
    "encoding": "FLAC",
    "sampleRateHertz": 16000,
    "languageCode": "en-US",
    "enableWordTimeOffsets": false
  },
  "audio": {
    "uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
  }
}
Explanation: In this JSON fragment, the audio file is encoded in FLAC format, has a sampling rate of 16000 Hz, and is located at the given URI in Google Cloud Storage.
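The same request body can also be generated programmatically. The sketch below is an illustration only: the helper name `build_sync_request` is hypothetical, and the `content` branch assumes that a local file is sent inline as base64-encoded bytes in the `audio.content` field instead of `audio.uri`.

```python
import base64
import json


def build_sync_request(audio_uri=None, audio_bytes=None, encoding="FLAC",
                       sample_rate_hertz=16000, language_code="en-US"):
    """Build the JSON body for a synchronous recognition request.

    Hypothetical helper. Exactly one of audio_uri (a gs:// location)
    or audio_bytes (raw bytes of a local file) must be given.
    """
    if (audio_uri is None) == (audio_bytes is None):
        raise ValueError("provide exactly one of audio_uri or audio_bytes")
    if audio_uri is not None:
        audio = {"uri": audio_uri}
    else:
        # Assumption: inline audio is sent base64-encoded in "content".
        audio = {"content": base64.b64encode(audio_bytes).decode("ascii")}
    return {
        "config": {
            "encoding": encoding,
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            "enableWordTimeOffsets": False,
        },
        "audio": audio,
    }


if __name__ == "__main__":
    body = build_sync_request(
        audio_uri="gs://cloud-samples-tests/speech/brooklyn.flac")
    # Redirect this output to sync-speechtotext.json to reuse it with curl.
    print(json.dumps(body, indent=2))
```

Printing the body rather than writing a file keeps the sketch free of side effects; redirecting the output produces the same file as the manual step.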
Step 2: Use curl to send the speech:recognize request:
$ curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    -d @sync-speechtotext.json
The request returns a result like the following:
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.9828748
        }
      ],
      "resultEndTime": "1.770s",
      "languageCode": "en-us"
    }
  ],
  "totalBilledTime": "2s",
  "requestId": "866425412681567144"
}
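To use the result in a script, the transcript and confidence can be pulled out of the JSON response. The following is a minimal sketch assuming the response has the shape shown above; the function name `best_transcripts` is hypothetical, and it takes the first alternative of each result as the preferred one.

```python
import json

# The response shown above, reproduced as a string for the example.
SAMPLE_RESPONSE = """
{
  "results": [
    {
      "alternatives": [
        {"transcript": "how old is the Brooklyn Bridge",
         "confidence": 0.9828748}
      ],
      "resultEndTime": "1.770s",
      "languageCode": "en-us"
    }
  ],
  "totalBilledTime": "2s",
  "requestId": "866425412681567144"
}
"""


def best_transcripts(response):
    """Return (transcript, confidence) for the first alternative of each result."""
    pairs = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue  # skip results without any alternative
        top = alternatives[0]
        pairs.append((top.get("transcript", ""), top.get("confidence", 0.0)))
    return pairs


if __name__ == "__main__":
    for transcript, confidence in best_transcripts(json.loads(SAMPLE_RESPONSE)):
        print(f"{confidence:.2f}  {transcript}")
```

An empty response (no `results` key) simply yields an empty list, which is a convenient signal that nothing was recognized.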
Congratulations! You have sent your first request to Speech-to-Text!
If you receive an error or an empty response from Speech-to-Text, refer to the troubleshooting and debugging steps.
A common error when authentication with Speech-to-Text fails is a 403 response, indicating that valid “application default credentials” are not available.
$ curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    -d @sync-speechtotext.json
{
  "error": {
    "code": 403,
    "message": "Your application has authenticated using end user credentials from the Google Cloud SDK or Google Cloud Shell which are not supported by the speech.googleapis.com. We recommend configuring the billing/quota_project setting in gcloud or using a service account through the auth/impersonate_service_account setting. For more information about service accounts and how to use them in your application, see https://cloud.google.com/docs/authentication/. If you are getting this error with curl or similar tools, you may need to specify 'X-Goog-User-Project' HTTP header for quota and billing purposes. For more information regarding 'X-Goog-User-Project' header, please check https://cloud.google.com/apis/docs/system-parameters.",
    "status": "PERMISSION_DENIED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "SERVICE_DISABLED",
        "domain": "googleapis.com",
        "metadata": {
          "consumer": "projects/618104708054",
          "service": "speech.googleapis.com"
        }
      }
    ]
  }
}
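When scripting against the API, it helps to detect such an error body before looking for transcripts. The sketch below assumes the error has the shape shown above; `describe_error` is a hypothetical helper, and the sample dictionary is an abbreviated copy of that response.

```python
# Abbreviated copy of the error response shown above.
SAMPLE_ERROR = {
    "error": {
        "code": 403,
        "message": "Your application has authenticated using end user credentials ...",
        "status": "PERMISSION_DENIED",
        "details": [
            {
                "@type": "type.googleapis.com/google.rpc.ErrorInfo",
                "reason": "SERVICE_DISABLED",
                "domain": "googleapis.com",
                "metadata": {"service": "speech.googleapis.com"},
            }
        ],
    }
}


def describe_error(response):
    """Return a short summary of an error response, or None if there is none."""
    error = response.get("error")
    if error is None:
        return None
    # Collect the machine-readable reasons from the details entries.
    reasons = [d["reason"] for d in error.get("details", []) if "reason" in d]
    return {
        "code": error.get("code"),
        "status": error.get("status"),
        "reasons": reasons,
    }


if __name__ == "__main__":
    print(describe_error(SAMPLE_ERROR))
```

The `reason` values are worth logging alongside the status code, since they are shorter and easier to match on than the free-text message.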
Your project needs a service account, and the key (JSON file) for that service account must be downloaded to your development environment. Then, set the location of the JSON file in an environment variable named GOOGLE_APPLICATION_CREDENTIALS.
The following command verifies that the GOOGLE_APPLICATION_CREDENTIALS environment variable points to a readable service account key JSON file.
$ cat "$GOOGLE_APPLICATION_CREDENTIALS"
If it is not set correctly, you can set the key file as the default credential:
$ export GOOGLE_APPLICATION_CREDENTIALS=XXXXXX.json
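The same check can be scripted so that it gives a clearer message than `cat`. This is a sketch under assumptions: `check_credentials` is a hypothetical helper, and it assumes that a service account key file is a JSON object whose `type` field is `service_account`.

```python
import json
import os


def check_credentials(env=None):
    """Return 'ok' or a short description of what is wrong with the key file."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"no file at {path}"
    try:
        with open(path, encoding="utf-8") as handle:
            key = json.load(handle)
    except (OSError, ValueError) as exc:
        return f"cannot read {path} as JSON: {exc}"
    # Assumption: service account keys carry "type": "service_account".
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return f"{path} does not look like a service account key"
    return "ok"


if __name__ == "__main__":
    print(check_credentials())
```

Passing a dictionary as `env` makes the helper easy to try without touching the real environment.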
Google Speech-to-Text API can handle specific types of speech encoding.
The API supports various audio encodings. The following table lists the supported audio codecs:
Codec | Name | Lossless | Usage notes |
MP3 | MPEG Audio Layer III | No | MP3 encoding is a beta feature and is only available in v1p1beta1. For more information, please refer to the RecognitionConfig reference documentation. |
FLAC | Free Lossless Audio Codec | Yes | Streams require a bit depth of 16 or 24 bits |
LINEAR16 | Linear PCM | Yes | 16-bit linear pulse code modulation (PCM) encoding. The file header must contain the sample rate. |
MULAW | μ-law | No | 8-bit PCM encoding |
AMR | Adaptive Multi-Rate Narrowband | No | Sample rate must be 8000 Hz |
AMR_WB | Adaptive Multi-Rate Wideband | No | Sample rate must be 16000 Hz |
OGG_OPUS | Opus-encoded audio frames in an Ogg container | No | Sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz or 48000 Hz |
SPEEX_WITH_HEADER_BYTE | Speex wideband | No | Sample rate must be 16000 Hz |
WEBM_OPUS | WebM Opus | No | Sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz or 48000 Hz |
Note: FLAC is both an audio codec and an audio file format. To transcribe audio using FLAC encoding, you must provide a file in the .flac format, which includes a file header containing metadata. Speech-to-Text also supports WAV files containing audio encoded with LINEAR16 or MULAW.
This means that if your audio file uses an encoding the Google API does not support, you need to convert it to a supported encoding first.
If you can choose the encoding when recording or exporting the source material, use a lossless encoding such as FLAC or LINEAR16 for better speech recognition performance. For guidelines on choosing the appropriate codec for your task, refer to the best practices.
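The sample rate constraints from the table can be checked before a request is sent. The following sketch only encodes the rules listed in this article; the names `check_encoding`, `SAMPLE_RATE_RULES` and `SUPPORTED_ENCODINGS` are hypothetical, and the check does not inspect the audio file itself.

```python
# Encodings listed in the table above.
SUPPORTED_ENCODINGS = {
    "MP3", "FLAC", "LINEAR16", "MULAW", "AMR", "AMR_WB",
    "OGG_OPUS", "SPEEX_WITH_HEADER_BYTE", "WEBM_OPUS",
}

# Sample rates (Hz) that the table requires for specific encodings.
SAMPLE_RATE_RULES = {
    "AMR": {8000},
    "AMR_WB": {16000},
    "OGG_OPUS": {8000, 12000, 16000, 24000, 48000},
    "SPEEX_WITH_HEADER_BYTE": {16000},
    "WEBM_OPUS": {8000, 12000, 16000, 24000, 48000},
}


def check_encoding(encoding, sample_rate_hertz):
    """Return a list of problems; an empty list means the pair looks fine."""
    if encoding not in SUPPORTED_ENCODINGS:
        return [f"{encoding} is not a listed encoding; convert the audio first"]
    problems = []
    allowed = SAMPLE_RATE_RULES.get(encoding)
    if allowed is not None and sample_rate_hertz not in allowed:
        problems.append(
            f"{encoding} requires a sample rate in {sorted(allowed)}, "
            f"got {sample_rate_hertz}"
        )
    return problems


if __name__ == "__main__":
    print(check_encoding("FLAC", 16000))  # no listed constraint is violated
    print(check_encoding("AMR", 16000))   # AMR is limited to 8000 Hz
```

Running such a check locally is cheaper than discovering a mismatch from an API error after the audio has been sent.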