Speech-to-Text Transcription

Audio Toolbox™ enables you to interface with third-party speech-to-text APIs from MATLAB®.

To interface with third-party speech-to-text APIs, you need the extended Audio Toolbox functionality from File Exchange and an identification key for each third-party API that you use. To begin, download the extended functionality. Then follow the instructions for interfacing with your chosen third-party speech-to-text API.

Download Extended Audio Toolbox Functionality from File Exchange

Navigate to the speech2text functionality on File Exchange, download the files, and then extract the content of the archive to your workspace. The download contains:

  • speechClient: An object that specifies which third-party API to interface with, along with additional API-specific properties that define the interface.

  • speech2text: A function that takes an audio signal in MATLAB as input and returns text and additional information, as transcribed through the third-party interface.

Interface with Google Speech

Get Google Speech Authentication Keys

To interface with Google Speech from MATLAB, you must first create an account with Google Speech and generate an authorization key. The following steps describe how to create the authorization key. The steps are also described in the Google documentation.

  1. Navigate to the APIs & Services->Credentials panel in the Cloud Platform Console.

  2. Select Create credentials, then select API key from the drop-down menu.

  3. Click the Create button. A dialog box displays your newly created key.

  4. Once you have the API authorization key, create a JSON file:

    • Add the following content to the JSON file, replacing yourAuthenticationKey with the authentication key you created in step 3:

      {
          "key" : "yourAuthenticationKey"
      }

    • Name the JSON file Google_Credentials_Speech2text.json and save it to a secure location.
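The JSON file from step 4 can also be generated from MATLAB rather than typed by hand. The following is a minimal sketch using the base MATLAB functions jsonencode and jsondecode. The key value is a placeholder, and the temporary folder is an assumption made only so that the sketch runs anywhere; for a real key, choose a secure location instead.

```matlab
% Placeholder value (hypothetical); substitute the key from step 3.
credentials = struct('key','yourAuthenticationKey');

% Assumption: tempdir is used only for illustration. Store a real key
% in a secure location.
credentialsFile = fullfile(tempdir,'Google_Credentials_Speech2text.json');

% Encode the struct as JSON text and write it to the file.
fid = fopen(credentialsFile,'w');
assert(fid ~= -1,'Could not open %s for writing.',credentialsFile);
fprintf(fid,'%s',jsonencode(credentials));
fclose(fid);

% Read the file back to confirm that it parses as valid JSON.
readBack = jsondecode(fileread(credentialsFile));
```

Reading the file back is a quick way to catch a malformed file before making a transcription request.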

Create speechClient Object

Create a speechClient object to interface with the Google Speech-to-Text API:

transcriber = speechClient('Google')
transcriber = 

  speechClient with no properties.

You can specify the recognition configuration for the Google speech client by specifying name-value pairs during creation. For example, to create a speech client that tells the Google speech service that the input speech is Australian English, specify languageCode as 'en-AU'.

transcriber = speechClient('Google','languageCode','en-AU')
transcriber = 

  speechClient with properties:

    languageCode: 'en-AU'

By default, the language code is set to 'en-US'.

The speechClient object does not perform input checks on the name-value pairs. Specify name-value pairs as described in the Google recognition configuration documentation. You cannot set the encoding or sampleRateHertz fields using speechClient. The encoding field is always set to FLAC, and sampleRateHertz is specified through the speech2text function.
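Because speechClient does not check its inputs, a mistyped value is only detected once the request reaches the service. One possible guard is to check the value yourself before creating the client. The sketch below is not part of the downloaded functionality. The list of codes is a short, hypothetical allow-list chosen for illustration, and the variable names are assumptions.

```matlab
% Hypothetical allow-list; extend it with the codes your application uses.
supportedCodes = {'en-US','en-AU','en-GB'};

% Value that you intend to pass as the languageCode name-value pair.
languageCode = 'en-AU';

% Exact, case-sensitive comparison against the allow-list.
isSupported = ismember(languageCode,supportedCodes);
if ~isSupported
    error('Unsupported languageCode: %s',languageCode);
end
```

If the check passes, languageCode can then be passed to speechClient as shown in the example above.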

Perform Speech-to-Text Transcription

Call the speech2text function with a speechClient object, the speech you want to transcribe, and the sample rate.

[speech,SampleRate] = audioread('Counting-16-44p1-mono-15secs.wav');

text = speech2text(transcriber,speech,SampleRate)
text =

  5×2 table

              TRANSCRIPT               CONFIDENCE
    _______________________________    __________

    "1"                                 0.79176  
    " 2"                                0.77258  
    " 3"                                0.79722  
    " 4"                                0.73335  
    " five six seven eight nine 10"     0.89762  

The speech2text function outputs a table containing the transcript and confidence of the transcript. The output table can contain additional variables depending on the configuration of the transcriber. For example, if you create a transcriber that outputs multiple alternatives for each word, then an ALTERNATIVES variable is added to the text output:

transcriber = speechClient('Google','languageCode','en-US','maxAlternatives',2);
text = speech2text(transcriber,speech,SampleRate)
text =

  5×3 table

              TRANSCRIPT               CONFIDENCE    ALTERNATIVES
    _______________________________    __________    ____________

    "1"                                 0.79176      [1×2 table] 
    " 2"                                0.77258      [1×2 table] 
    " 3"                                0.79721      [1×2 table] 
    " 4"                                0.73335      [1×2 table] 
    " five six seven eight nine 10"     0.89762      [1×2 table] 
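Because the output is an ordinary MATLAB table, you can post-process it with standard table operations. The sketch below rebuilds the TRANSCRIPT and CONFIDENCE variables from the values shown in the first Google example so that it runs without a service request. The variable name results and the threshold value are assumptions made for illustration.

```matlab
% Values copied from the first Google output table shown in this section.
TRANSCRIPT = ["1"; " 2"; " 3"; " 4"; " five six seven eight nine 10"];
CONFIDENCE = [0.79176; 0.77258; 0.79722; 0.73335; 0.89762];
results = table(TRANSCRIPT,CONFIDENCE);

% Keep only the segments at or above a hypothetical confidence threshold.
threshold = 0.75;
reliable = results(results.CONFIDENCE >= threshold,:);

% Join all segments into one string. The segments already carry their
% own leading spaces, so an empty delimiter is used.
fullTranscript = join(results.TRANSCRIPT,"");
```

With these values, reliable drops the " 4" segment, whose confidence is 0.73335.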

You can also set the HTTP client timeout for your transcription request by setting the HTTPTimeOut name-value pair for speech2text:

text = speech2text(transcriber,speech,SampleRate,'HTTPTimeOut',25);

Interface with IBM Watson Speech

Get IBM Watson Speech Service Credentials

To interface with IBM Watson Speech from MATLAB, you must first create an account with IBM Bluemix and obtain a user name and password. The following steps describe how to create an account. The steps are also described in the IBM Documentation.

  1. Navigate to the Speech to Text service and sign up for a free Bluemix account or log in to your existing account.

  2. After you log in, enter speech-to-text-tutorial in the Service name field of the Speech to Text page. Click Create.

  3. Copy the credentials created:

    1. Click Service credentials

    2. Click View credentials under Actions

    3. Copy the user name and password

  4. Once you have the Service credentials, create a JSON file:

    • Add the following content to the JSON file, replacing yourUserName and yourPassword with the user name and password you created in step 3:

      {
          "key" : "yourUserName",
          "password" : "yourPassword"
      }

    • Name the JSON file IBM_Credentials_Speech2text.json and save it to a secure location.
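Before making a request, it can help to confirm that the credentials file parses and contains both fields from step 4. The following is a minimal sketch using base MATLAB functions only. It first writes a file with placeholder values to a temporary folder, which is an assumption made so that the sketch is self-contained; point credentialsFile at your own file instead.

```matlab
% Write an illustrative file with placeholder values (hypothetical).
credentialsFile = fullfile(tempdir,'IBM_Credentials_Speech2text.json');
placeholder = struct('key','yourUserName','password','yourPassword');
fid = fopen(credentialsFile,'w');
assert(fid ~= -1,'Could not open %s for writing.',credentialsFile);
fprintf(fid,'%s',jsonencode(placeholder));
fclose(fid);

% Parse the file and confirm that both expected fields are present.
credentials = jsondecode(fileread(credentialsFile));
requiredFields = {'key','password'};
hasAllFields = all(isfield(credentials,requiredFields));
```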

Create speechClient Object

Create a speechClient object to interface with the IBM Speech-to-Text API:

transcriber = speechClient('IBM')
transcriber = 

  speechClient with no properties.

You can specify the recognition configuration for the IBM speech client by specifying name-value pairs during creation. For example, to create a speech client that tells the IBM speech service that the input speech is narrowband English, specify model as 'en-US_NarrowbandModel'.

transcriber = speechClient('IBM','model','en-US_NarrowbandModel')
transcriber = 

  speechClient with properties:

    model: 'en-US_NarrowbandModel'

By default, the language model is set to 'en-US_BroadbandModel'.

The speechClient object does not perform input checks on the name-value pairs. Specify name-value pairs as described in the IBM Watson Speech to Text API reference. When you specify the sample rate using speech2text, the IBM speech service resamples the audio to the bandwidth specified by the model.
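One way to choose between the two English models is to base the choice on the sample rate of the recording. The sketch below assumes that broadband models target audio sampled at 16 kHz or higher and that narrowband models target 8 kHz audio. Treat the 16 kHz cutoff as an assumption and confirm it against the IBM documentation. The sample rate is hard-coded here and would normally come from audioread.

```matlab
% Hypothetical sample rate in Hz; normally the second output of audioread.
SampleRate = 44100;

% Assumption: broadband models suit audio sampled at 16 kHz or higher,
% and narrowband models suit lower-rate (telephony) audio.
if SampleRate >= 16000
    model = 'en-US_BroadbandModel';
else
    model = 'en-US_NarrowbandModel';
end
```

The resulting value can be passed as the model name-value pair when creating the speech client.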

Perform Speech-to-Text Transcription

Call the speech2text function with a speechClient object, the speech you want to transcribe, and the sample rate.

[speech,SampleRate] = audioread('Counting-16-44p1-mono-15secs.wav');

text = speech2text(transcriber,speech,SampleRate)
text =

  6×2 table

    TRANSCRIPT     CONFIDENCE
    ___________    __________

    "five six "       0.98   
    "so "            0.311   
    "eight "         0.492   
    "not "           0.511   
    "and "           0.501   
    "ten "           0.248   

The speech2text function outputs a table containing the transcript and confidence of the transcript. The output table can contain additional variables depending on the configuration of the transcriber. For example, if you create a transcriber that outputs multiple alternatives for each word, then an ALTERNATIVES variable is added to the text output. Create a new speech client, but this time use the default model (en-US_BroadbandModel) and set max_alternatives to two. The default model, en-US_BroadbandModel, performs much better than en-US_NarrowbandModel for the audio signal in this example.

transcriber = speechClient('IBM','max_alternatives',2);
text = speech2text(transcriber,speech,SampleRate)
text =

  10×3 table

    TRANSCRIPT    CONFIDENCE    ALTERNATIVES
    __________    __________    ____________

     "one "         0.994       [1×2 table] 
     "to "          0.995       [1×2 table] 
     "three "       0.971       [1×2 table] 
     "for "             1       [1×2 table] 
     "five "        0.997       [1×2 table] 
     "six "         0.997       [1×2 table] 
     "seven "       0.996       []          
     "eight "       0.969       [1×2 table] 
     "nine "        0.987       [1×2 table] 
     "then "        0.553       [1×2 table] 
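The confidence values can point you to the words that most need review. The sketch below rebuilds the TRANSCRIPT and CONFIDENCE variables from the output above, so it runs without a service request, and locates the least certain word. The variable names are assumptions made for illustration.

```matlab
% Values copied from the IBM output table shown above.
TRANSCRIPT = ["one "; "to "; "three "; "for "; "five "; ...
              "six "; "seven "; "eight "; "nine "; "then "];
CONFIDENCE = [0.994; 0.995; 0.971; 1; 0.997; 0.997; 0.996; 0.969; 0.987; 0.553];
results = table(TRANSCRIPT,CONFIDENCE);

% Locate the word with the lowest confidence score.
[lowestConfidence,idx] = min(results.CONFIDENCE);

% Remove the trailing space that the service appends to each word.
leastCertainWord = strip(results.TRANSCRIPT(idx));
```

Here the least certain word is "then". Note that "to" and "for" carry high confidence scores even though the recording is of a person counting, so a high score does not guarantee the intended word.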

You can also set the HTTP client timeout for your transcription request by setting the HTTPTimeOut name-value pair for speech2text:

text = speech2text(transcriber,speech,SampleRate,'HTTPTimeOut',25);

Interface with Microsoft Azure Speech

Get Microsoft Azure Speech API Keys

To interface with Microsoft Azure Speech from MATLAB, you must first create subscription keys. The following steps describe how to create an account and obtain the keys.

  1. Navigate to Cognitive Services and sign up for a free Azure account or log in to your existing account.

  2. After you log in, from the Cognitive Services page, click Speech APIs and then Get API Key for Bing Speech.

  3. Copy the keys created.

  4. Once you have the keys, create a JSON file:

    • Add the following content to the JSON file, replacing yourKey1 and yourKey2 with the keys you created in step 3:

      {
          "Key1" : "yourKey1",
          "Key2" : "yourKey2"
      }

    • Name the JSON file Microsoft_Credentials_Speech2text.json and save it to a secure location.

Create speechClient Object

Create a speechClient object to interface with the Microsoft Speech-to-Text API:

transcriber = speechClient('Microsoft')
transcriber = 

  speechClient with no properties.

You can specify the recognition configuration for the Microsoft speech client by specifying name-value pairs during creation. For example, to create a speech client that tells the Microsoft speech service that the input speech is a dictation, specify recognition as 'dictation'.

transcriber = speechClient('Microsoft','recognition','dictation')
transcriber = 

  speechClient with properties:

    recognition: 'dictation'

By default, the language parameter is set to en-US, the format parameter is set to detailed, and the recognition parameter is set to Interactive.

The speechClient object does not perform input checks on the name-value pairs. Specify name-value pairs as described in Get started with speech recognition by using the REST API.

Perform Speech-to-Text Transcription

Call the speech2text function with a speechClient object, the speech you want to transcribe, and the sample rate.

[speech,SampleRate] = audioread('SpeechDFT-16-8-mono-5secs.wav');

text = speech2text(transcriber,speech,SampleRate)
text =

  5×5 table

    CONFIDENCE                                        LEXICAL                                                                              ITN                                                                             MASKEDITN                                                                           DISPLAY                                     
    __________    _______________________________________________________________________________    _______________________________________________________________________________    _______________________________________________________________________________    ________________________________________________________________________________

      0.8549      'the discrete fourier transform of a real valued signal is conjugate cometric'     'the discrete fourier transform of a real valued signal is conjugate cometric'     'the discrete fourier transform of a real valued signal is conjugate cometric'     'The discrete fourier transform of a real valued signal is conjugate cometric.' 
      0.8369      'the discrete fourier transform of a real valued signal is conjugate symmetric'    'the discrete fourier transform of a real valued signal is conjugate symmetric'    'the discrete fourier transform of a real valued signal is conjugate symmetric'    'The discrete fourier transform of a real valued signal is conjugate symmetric.'
      0.8369      'discrete fourier transform of a real valued signal is conjugate cometric'         'discrete fourier transform of a real valued signal is conjugate cometric'         'discrete fourier transform of a real valued signal is conjugate cometric'         'Discrete fourier transform of a real valued signal is conjugate cometric.'     
     0.81006      'the discrete fourier transform of a real valued signal is conjugates ametric'     'the discrete fourier transform of a real valued signal is conjugates ametric'     'the discrete fourier transform of a real valued signal is conjugates ametric'     'The discrete fourier transform of a real valued signal is conjugates ametric.' 
     0.81006      'discrete fourier transform of a real valued signal is conjugate symmetric'        'discrete fourier transform of a real valued signal is conjugate symmetric'        'discrete fourier transform of a real valued signal is conjugate symmetric'        'Discrete fourier transform of a real valued signal is conjugate symmetric.'    

The speech2text function outputs a table containing the lexical, ITN, masked ITN, and display forms of the recognized text, along with the confidence scores.
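Each row of the Microsoft output is a competing hypothesis for the same utterance, so a common next step is to rank the rows by confidence. The sketch below rebuilds only the CONFIDENCE and DISPLAY variables from the output above, so it runs without a service request. The variable names results, ranked, and bestText are assumptions made for illustration.

```matlab
% Values copied from the Microsoft output table shown above.
CONFIDENCE = [0.8549; 0.8369; 0.8369; 0.81006; 0.81006];
DISPLAY = { ...
    'The discrete fourier transform of a real valued signal is conjugate cometric.'; ...
    'The discrete fourier transform of a real valued signal is conjugate symmetric.'; ...
    'Discrete fourier transform of a real valued signal is conjugate cometric.'; ...
    'The discrete fourier transform of a real valued signal is conjugates ametric.'; ...
    'Discrete fourier transform of a real valued signal is conjugate symmetric.'};
results = table(CONFIDENCE,DISPLAY);

% Sort the hypotheses from most to least confident and take the first.
ranked = sortrows(results,'CONFIDENCE','descend');
bestText = ranked.DISPLAY{1};
```

In this example the top-ranked hypothesis contains "cometric" while the second-ranked one contains "symmetric", so it is worth inspecting more than the first row when accuracy matters.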

You can also set the HTTP client timeout for your transcription request by setting the HTTPTimeOut name-value pair for speech2text:

text = speech2text(transcriber,speech,SampleRate,'HTTPTimeOut',25);

Billing Details

Audio Toolbox enables you to interface with third-party speech-to-text APIs. However, the third-party speech APIs are not free for extended use. Consult the documentation for each API for pricing details.
