Machine Learning and Deep Learning for Audio

Dataset management, labeling, and augmentation; segmentation and feature extraction for audio, speech, and acoustic applications

Audio Toolbox™ provides functionality to develop machine learning and deep learning solutions for audio, speech, and acoustic applications, including speaker identification, speech command recognition, and acoustic scene recognition.

  • Use audioDatastore to ingest large audio data sets and process files in parallel.

  • Use Audio Labeler to build audio data sets by annotating audio recordings manually and automatically.

  • Use audioDataAugmenter to create randomized pipelines of built-in or custom signal processing methods for augmenting and synthesizing audio data sets.

  • Use audioFeatureExtractor to extract combinations of different features while sharing intermediate computations.
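The workflow in the list above can be sketched end to end. This is a minimal example, not a complete recipe: the folder name "audioData" is a hypothetical path, and the code assumes 16 kHz recordings organized into one subfolder per label.

```matlab
% Ingest a labeled audio data set (hypothetical folder "audioData",
% one subfolder per class), assuming all files are sampled at 16 kHz.
fs = 16e3;
ads = audioDatastore("audioData", ...
    IncludeSubfolders=true,LabelSource="foldernames");

% Randomized augmentation pipeline: each method is applied with the
% given probability when AugmentationMode is "probabilistic".
aug = audioDataAugmenter(AugmentationMode="probabilistic", ...
    TimeStretchProbability=0.5, ...
    PitchShiftProbability=0.5, ...
    NoiseInjectionProbability=0.5,SNRRange=[10 20]);

% Feature extractor that computes several features while sharing
% intermediate computations such as the windowed FFT.
afe = audioFeatureExtractor(SampleRate=fs, ...
    mfcc=true,spectralCentroid=true,pitch=true);

audioIn = read(ads);                 % read one file from the datastore
augmented = augment(aug,audioIn,fs); % table of augmented variants
features = extract(afe,augmented.Audio{1}); % features-by-frame matrix
```

Augmenting before extraction, as here, is typical when the goal is a larger and more varied training set; for evaluation data you would extract features from the unmodified audio.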

Audio Toolbox also provides access to third-party APIs for text-to-speech and speech-to-text, and it includes pretrained VGGish and YAMNet models so that you can perform transfer learning, classify sounds, and extract feature embeddings. Using pretrained networks requires Deep Learning Toolbox™.
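As a brief sketch of using the pretrained models mentioned above (requires Deep Learning Toolbox; the file name "speech.wav" is a hypothetical recording):

```matlab
% Classify sounds in a recording with the pretrained YAMNet model.
[audioIn,fs] = audioread("speech.wav");   % hypothetical audio file
sounds = classifySound(audioIn,fs);        % detected sound classes

% Or work with the network directly, e.g. for transfer learning:
net = yamnet;                              % load pretrained YAMNet
S = yamnetPreprocess(audioIn,fs);          % mel spectrogram input frames
scores = predict(net,S);                   % per-frame class activations
```

For feature embeddings rather than class scores, the pretrained VGGish model can be used in the same way via `vggish` and `vggishPreprocess`.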

Featured Examples

Speaker Verification Using i-Vectors

Speaker verification, or authentication, is the task of confirming that a speaker is who they purport to be. Speaker verification has been an active research area for many years. An early performance breakthrough was to use a Gaussian mixture model and universal background model (GMM-UBM) [1] on acoustic features (usually mel frequency cepstral coefficients, or MFCCs). For an example, see Speaker Verification Using Gaussian Mixture Models.

One of the main difficulties of GMM-UBM systems is intersession variability. Joint factor analysis (JFA) was proposed to compensate for this variability by separately modeling inter-speaker variability and channel or session variability [2] [3]. However, [4] discovered that channel factors in JFA also contained information about the speakers and proposed combining the channel and speaker spaces into a total variability space. Intersession variability was then compensated for by using backend procedures, such as linear discriminant analysis (LDA) and within-class covariance normalization (WCCN), followed by scoring, such as the cosine similarity score. [5] proposed replacing the cosine similarity scoring with a probabilistic LDA (PLDA) model. [11] and [12] proposed a method to Gaussianize the i-vectors and therefore make Gaussian assumptions in the PLDA model, referred to as G-PLDA or simplified PLDA.

While i-vectors were originally proposed for speaker verification, they have been applied to many problems, such as language recognition, speaker diarization, emotion recognition, age estimation, and anti-spoofing [10]. Recently, deep learning techniques have been proposed to replace i-vectors with d-vectors or x-vectors [8] [6].
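The cosine similarity scoring mentioned above reduces to a one-line computation. In this hypothetical sketch, `w1` and `w2` are the enrollment and test i-vectors (column vectors of equal length), and `threshold` is a decision threshold tuned on a development set:

```matlab
% Cosine similarity score between enrollment i-vector w1 and test
% i-vector w2 (hypothetical column vectors). A higher score indicates
% the two utterances are more likely from the same speaker.
score = (w1'*w2)/(norm(w1)*norm(w2));
accept = score > threshold;   % threshold tuned on a development set
```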