vadnetPostprocess

Postprocess frame-based VAD probabilities

Since R2023a


Syntax

roi = vadnetPostprocess(audioIn,fs,netOut)

roi = vadnetPostprocess(___,Name=Value)

[roi,probs] = vadnetPostprocess(___)

vadnetPostprocess(___)

Description

roi = vadnetPostprocess(audioIn,fs,netOut) postprocesses the speech probabilities output by a voice activity detection (VAD) network and returns indices corresponding to the beginning and end of speech within the audio signal.


roi = vadnetPostprocess(___,Name=Value) specifies options using one or more name-value arguments. For example, vadnetPostprocess(audioIn,fs,netOut,MergeThreshold=0.5) merges speech regions that are separated by 0.5 seconds or less.


[roi,probs] = vadnetPostprocess(___) also returns the probability of voice activity per sample in the input audio signal.


vadnetPostprocess(___) with no output arguments plots the input signal and the detected speech regions.


Examples


Detect Speech with Pretrained VAD Model


Read in an audio signal containing speech and music and listen to the sound.

[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
sound(audioIn,fs)

Use vadnetPreprocess to preprocess the audio by computing a mel spectrogram.

features = vadnetPreprocess(audioIn,fs);

Call audioPretrainedNetwork to obtain a pretrained VAD neural network.

net = audioPretrainedNetwork("vadnet");

Pass the preprocessed audio through the network to obtain the probability of speech in each frame.

probs = predict(net,features);

Use vadnetPostprocess to postprocess the network output and determine the boundaries of the speech regions in the signal.

roi = vadnetPostprocess(audioIn,fs,probs)

roi = 2×2

           1       63120
       83600      150000

Plot the audio with the detected speech regions.

vadnetPostprocess(audioIn,fs,probs)

[Figure: plot titled "Detected Speech", with x-axis Time (s) and y-axis Amplitude, showing the audio signal with the detected speech regions marked.]

Customize VAD Postprocessing


Read in an audio signal containing speech and music and listen to the sound.

[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
sound(audioIn,fs)

Preprocess the audio and pass it through the pretrained VADNet model.

features = vadnetPreprocess(audioIn,fs);
net = audioPretrainedNetwork("vadnet");
probs = predict(net,features);

Call vadnetPostprocess with the merge threshold set to 1 to merge detected speech regions that are separated by 1 second or less.

vadnetPostprocess(audioIn,fs,probs,MergeThreshold=1)

[Figure: plot titled "Detected Speech", with x-axis Time (s) and y-axis Amplitude, showing the audio signal with the merged speech regions marked.]

Get Probability of Voice Activity Per Sample of Audio


Read in an audio signal containing speech and music.

[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");

Preprocess the audio and pass it through the pretrained VADNet model.

features = vadnetPreprocess(audioIn,fs);
net = audioPretrainedNetwork("vadnet");
out = predict(net,features);

Call vadnetPostprocess with an additional output variable to get the probabilities of speech in each sample of the signal.

[roi,probs] = vadnetPostprocess(audioIn,fs,out);

Plot the audio signal along with the voice activity probability.

t = (0:length(audioIn)-1)/fs;
plot(t,audioIn,t,probs,"r")
legend("Audio signal","Probability of speech",Location="best")
xlabel("Time (s)")
title("Voice Activity Probability")

[Figure: plot titled "Voice Activity Probability", with x-axis Time (s), showing two lines: the audio signal and the probability of speech.]

Input Arguments


`audioIn` — Audio input
column vector

Audio input signal, specified as a column vector (single channel).

Data Types: single | double

`fs` — Sample rate (Hz)
positive scalar

Sample rate in Hz, specified as a positive scalar.

Data Types: single | double

`netOut` — VAD network output
vector

VAD network output, specified as a vector representing the probabilities of speech in each audio frame.

Data Types: single | double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: vadnetPostprocess(audioIn,fs,netOut,ApplyEnergyVAD=true)

`MergeThreshold` — Merge threshold
`0.25` (default) | nonnegative scalar

Merge threshold in seconds, specified as a nonnegative scalar. The function merges speech regions that are separated by a duration less than or equal to the specified threshold. Set the threshold to Inf to not merge any detected regions.
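The merge rule can be illustrated with a minimal sketch in base MATLAB. This is not the function's actual implementation: the region values and the variable names (`mergeThreshold`, `gaps`, `merged`) are hypothetical, and measuring the gap from the end of one region to the start of the next is an assumption.

```matlab
% Hypothetical candidate regions (sample indices) at a 16 kHz sample rate.
fs = 16000;
roi = [1 16000; 18000 40000; 60000 80000];
mergeThreshold = 0.25;   % seconds

% Gap between the end of each region and the start of the next, in seconds
% (assumed definition of "separation").
gaps = (roi(2:end,1) - roi(1:end-1,2))/fs;

% Keep a boundary only where the gap exceeds the threshold; regions on either
% side of a smaller gap are joined. The special Inf setting is not modeled here.
keep = gaps > mergeThreshold;
merged = [roi([true; keep],1), roi([keep; true],2)];
```

Here the first two regions are 0.125 seconds apart and are joined, while the third region, 1.25 seconds away, stays separate.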

`LengthThreshold` — Length threshold
`0.25` (default) | nonnegative scalar

Length threshold in seconds, specified as a nonnegative scalar. The function does not return speech regions that have a duration less than or equal to the specified threshold.
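As a rough illustration of the length rule, the following sketch drops regions whose duration does not exceed the threshold. The region values and variable names are hypothetical, and counting the duration as the inclusive number of samples divided by the sample rate is an assumption rather than documented behavior.

```matlab
% Hypothetical regions (sample indices) at a 16 kHz sample rate.
fs = 16000;
roi = [1 16000; 20000 22000; 40000 80000];
lengthThreshold = 0.25;   % seconds

% Duration of each region in seconds (assumed: inclusive sample count / fs).
durations = (roi(:,2) - roi(:,1) + 1)/fs;

% Keep only the regions that are strictly longer than the threshold.
kept = roi(durations > lengthThreshold,:);
```

The middle region lasts about 0.125 seconds, so it is removed and the other two are kept.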

`ActivationThreshold` — Probability threshold to start a speech segment
`0.5` (default) | scalar in the range [0, 1]

Probability threshold to start a speech segment, specified as a scalar in the range [0, 1].

Data Types: single | double

`DeactivationThreshold` — Probability threshold to end a speech segment
`0.25` (default) | scalar in the range [0, 1]

Probability threshold to end a speech segment, specified as a scalar in the range [0, 1].

Data Types: single | double
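Using separate thresholds to start and end a segment is a form of hysteresis. The sketch below shows one way such a rule could work on frame probabilities. It is an assumption-laden illustration, not the function's implementation: the probability values are made up, and the exact comparisons (`>=` to activate, `<` to deactivate) are guesses.

```matlab
% Hypothetical frame-level speech probabilities.
p = [0.1 0.3 0.6 0.4 0.3 0.2 0.1 0.7 0.8 0.2];
activation = 0.5;      % start a segment at or above this probability (assumed >=)
deactivation = 0.25;   % end a segment below this probability (assumed <)

isSpeech = false(size(p));
active = false;
for k = 1:numel(p)
    if ~active && p(k) >= activation
        active = true;        % probability rose high enough to start speech
    elseif active && p(k) < deactivation
        active = false;       % probability fell low enough to end speech
    end
    isSpeech(k) = active;
end
```

Frames 4 and 5 have probabilities below the activation threshold but above the deactivation threshold, so in this sketch they remain part of the segment that started at frame 3.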

`ApplyEnergyVAD` — Apply energy-based VAD
`false` (default) | `true`

Apply energy-based VAD to the speech regions detected by the neural network, specified as true or false.

Data Types: logical

Output Arguments


`roi` — Speech regions
N-by-2 matrix

Speech regions, returned as an N-by-2 matrix of indices into the input signal, where N is the number of individual speech regions detected. The first column contains the index of the start of a speech region, and the second column contains the index of the end of a region.
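Because roi holds sample indices, it can be used directly to index the signal or converted to time. The following sketch assumes a placeholder signal of zeros and reuses the roi values from the first example; the variable names `roiSeconds` and `segments` are illustrative only.

```matlab
% Placeholder signal and the regions reported in the first example.
fs = 16000;
audioIn = zeros(160000,1);          % hypothetical stand-in for real audio
roi = [1 63120; 83600 150000];

% Convert indices to seconds, treating sample 1 as time 0.
roiSeconds = (roi - 1)/fs;

% Extract each speech region into its own cell.
segments = cell(size(roi,1),1);
for k = 1:size(roi,1)
    segments{k} = audioIn(roi(k,1):roi(k,2));
end
```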

`probs` — Probability of speech per sample
column vector

Probability of speech per sample of the input audio signal, returned as a column vector with the same size as the input signal.
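One possible use of the per-sample probabilities is to summarize how confident the detection is within each region. The sketch below uses synthetic stand-ins for probs and roi, and `meanProb` is a hypothetical name, not an output of the function.

```matlab
% Synthetic per-sample probabilities and one matching region.
probs = [0.1*ones(100,1); 0.9*ones(200,1); 0.2*ones(100,1)];
roi = [101 300];

% Average speech probability inside each detected region.
meanProb = zeros(size(roi,1),1);
for k = 1:size(roi,1)
    meanProb(k) = mean(probs(roi(k,1):roi(k,2)));
end
```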

Algorithms

The vadnetPostprocess function postprocesses the VAD network output using the following steps.

1. Apply activation and deactivation thresholds to posterior probabilities to determine candidate speech regions.
2. Optionally, apply energy-based VAD to refine the detected speech regions.
3. Merge speech regions that are close to each other according to the merge threshold.
4. Remove speech regions that are shorter than or equal to the length threshold.
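To make the idea of candidate regions concrete, the sketch below converts a logical speech mask into start and end indices, in the same N-by-2 layout as roi. It is a generic edge-detection technique and is not taken from the function's source; the mask values and variable names are hypothetical.

```matlab
% Hypothetical speech decisions, one per sample (or per frame).
mask = logical([0 0 1 1 1 0 0 1 1 0]).';

% Pad with zeros so regions touching either end still produce an edge.
d = diff(double([false; mask; false]));

% A +1 step marks the first index of a region; a -1 step marks the index
% just after its last element.
starts = find(d == 1);
ends = find(d == -1) - 1;
regions = [starts ends];
```

The resulting matrix has one row per run of consecutive true values, which later steps such as merging and length filtering can then operate on.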

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).

Version History

Introduced in R2023a


R2024b: Output probabilities of voice activity

Use an additional output argument to get the per-sample probabilities of voice activity in a signal.
