detectspeechnn
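Syntax
roi = detectspeechnn(audioIn,fs)
roi = detectspeechnn(audioIn,fs,Name=Value)
detectspeechnn(___)
Description
roi = detectspeechnn(audioIn,fs) returns the indices of audioIn that mark the start and end of each detected speech region.
roi = detectspeechnn(audioIn,fs,Name=Value) specifies options using one or more name-value arguments. For example, detectspeechnn(audioIn,fs,MergeThreshold=0.5) merges speech regions that are separated by 0.5 seconds or less.
detectspeechnn(___) with no output arguments plots the input signal and the detected speech regions.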
This function requires both Audio Toolbox™ and Deep Learning Toolbox™.
Examples
Detect Speech in Audio Signal
Read in an audio signal containing speech and music and listen to the sound.
[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
sound(audioIn,fs)
Call detectspeechnn on the signal to obtain the regions of interest (ROIs), in samples, containing speech.
roi = detectspeechnn(audioIn,fs)
roi = 2×2
1 63120
83600 150000
Convert the ROIs from samples to seconds.
roiSeconds = (roi-1)/fs
roiSeconds = 2×2
0 3.9449
5.2249 9.3749
Plot the audio waveform with the speech regions.
detectspeechnn(audioIn,fs)
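As a small follow-on sketch, the ROI matrix can also be turned into per-region durations. The values of roi and fs below are copied from the output above (16 kHz is the rate implied by the conversion to seconds). The variable names durations and totalSpeech are illustrative, and the calculation assumes that both the start and end indices are inclusive.

```matlab
% ROIs and sample rate taken from the example output above.
roi = [1 63120; 83600 150000];   % [start end] sample indices
fs = 16000;                      % Hz, implied by roiSeconds above

% Assuming both endpoints are inclusive, each region spans
% (end - start + 1) samples.
durations = (roi(:,2) - roi(:,1) + 1)/fs;   % seconds, one entry per region
totalSpeech = sum(durations);               % total seconds of detected speech
```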
Refine Speech Regions with Energy-Based VAD
Read in an audio signal containing a speaker repeating the phrase "volume up".
[audioIn,fs] = audioread("MaleVolumeUp-16-mono-6secs.ogg");
Compare detected speech regions by calling detectspeechnn with and without the application of an energy-based voice activity detector (VAD) in postprocessing.
tiledlayout(2,1)
nexttile()
detectspeechnn(audioIn,fs)
nexttile()
detectspeechnn(audioIn,fs,ApplyEnergyVAD=true)
Adjust Postprocessing Parameters for Detecting Speech
Read in an audio signal.
[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
Call detectspeechnn with no output arguments to display a plot of the detected speech regions.
detectspeechnn(audioIn,fs);
Modify the parameters used in the postprocessing algorithm and see how they affect the detected speech regions. For more information about the VAD postprocessing algorithm, see Postprocessing.
mergeThreshold = 1.3; % seconds
lengthThreshold = 0.25; % seconds
activationThreshold = 0.5; % probability
deactivationThreshold = 0.25; % probability
applyEnergyVAD = false;
detectspeechnn(audioIn,fs,MergeThreshold=mergeThreshold, ...
    LengthThreshold=lengthThreshold, ...
    ActivationThreshold=activationThreshold, ...
    DeactivationThreshold=deactivationThreshold, ...
    ApplyEnergyVAD=applyEnergyVAD)
Detect Speech in Streaming Audio
Use detectspeechnn to detect the presence of speech in a streaming audio signal.
Create a dsp.AudioFileReader object to stream an audio file for processing. Set the SamplesPerFrame property to read 100 ms nonoverlapping chunks from the signal.
afr = dsp.AudioFileReader("MaleVolumeUp-16-mono-6secs.ogg");
analysisDuration = 0.1; % seconds
afr.SamplesPerFrame = floor(analysisDuration*afr.SampleRate);
The neural network architecture of detectspeechnn does not retain state between calls, and it performs best when analyzing larger chunks of audio. When you use detectspeechnn in a streaming scenario, the accuracy, computational efficiency, and latency requirements of your application determine the analysis duration and whether to overlap analysis chunks.
Create a timescope object to plot the audio signal and the detected speech regions. Create an audioDeviceWriter object to play the audio as you stream it.
scope = timescope(NumInputPorts=2, ...
    SampleRate=afr.SampleRate, ...
    TimeSpanSource="property",TimeSpan=5, ...
    YLimits=[-1.2,1.2], ...
    ShowLegend=true,ChannelNames=["Audio","Detected Speech"]);
adw = audioDeviceWriter(afr.SampleRate);
In a streaming loop:
Read in a 100 ms chunk from the audio file.
Use detectspeechnn to detect any regions of speech in the frame.
Use sigroi2binmask to convert the region indices to a binary mask.
Plot the audio signal and the detected speech.
Play the audio with the device writer.
while ~isDone(afr)
    audioIn = afr();
    segments = detectspeechnn(audioIn,afr.SampleRate,LengthThreshold=0.01);
    mask = sigroi2binmask(segments,afr.SamplesPerFrame);
    scope(audioIn,mask)
    adw(audioIn);
end
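For readers unfamiliar with sigroi2binmask, the following hand-written sketch shows the kind of conversion it is used for in the loop: turning a matrix of region limits into a per-sample logical mask. It is an illustration only, not the toolbox implementation, and it assumes the region limits are inclusive sample indices. The frame length and region values are hypothetical.

```matlab
% Hypothetical frame of 10 samples with two detected regions.
frameLength = 10;
segments = [2 4; 7 9];           % [start end] sample indices, inclusive

mask = false(frameLength,1);     % one logical value per sample
for k = 1:size(segments,1)
    mask(segments(k,1):segments(k,2)) = true;   % mark the region as speech
end
```

If segments has no rows, the loop body never runs and the mask stays all false.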
Input Arguments
audioIn
— Audio input
column vector
Audio input signal, specified as a column vector (single channel).
Data Types: single | double
fs
— Sample rate (Hz)
positive scalar
Sample rate in Hz, specified as a positive scalar.
Data Types: single | double
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: detectspeechnn(audioIn,fs,ApplyEnergyVAD=true)
MergeThreshold
— Merge threshold
0.25 (default) | nonnegative scalar
Merge threshold in seconds, specified as a nonnegative scalar. The function merges speech regions that are separated by a duration less than or equal to the specified threshold. Set the threshold to Inf to not merge any detected regions.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
LengthThreshold
— Length threshold
0.25 (default) | nonnegative scalar
Length threshold in seconds, specified as a nonnegative scalar. The function does not return speech regions that have a duration less than or equal to the specified threshold.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
ActivationThreshold
— Probability threshold to start a speech segment
0.5 (default) | scalar in the range [0, 1]
Probability threshold to start a speech segment, specified as a scalar in the range [0, 1].
Data Types: single | double
DeactivationThreshold
— Probability threshold to end a speech segment
0.25 (default) | scalar in the range [0, 1]
Probability threshold to end a speech segment, specified as a scalar in the range [0, 1].
Data Types: single | double
ApplyEnergyVAD
— Apply energy-based voice activity detector
false (default) | true
Option to apply an energy-based voice activity detector (VAD) to the speech regions detected by the neural network, specified as true or false.
Data Types: logical
Output Arguments
roi
— Speech regions
N-by-2 matrix
Speech regions, returned as an N-by-2 matrix of indices into the input signal, where N is the number of individual speech regions detected. The first column contains the index of the start of a speech region, and the second column contains the index of the end of a region.
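As a minimal sketch of how the N-by-2 output can be used, the following code keeps only the samples that fall inside the detected regions. The short signal and the roi values are hypothetical stand-ins for a real recording and a real detectspeechnn output, and the indexing assumes inclusive start and end indices.

```matlab
% Hypothetical 10-sample signal and two detected regions.
audioIn = (1:10).';
roi = [2 4; 8 9];                % [start end] sample indices, inclusive

% Collect each region in a cell, then concatenate into one column vector.
segments = cell(size(roi,1),1);
for k = 1:size(roi,1)
    segments{k} = audioIn(roi(k,1):roi(k,2));
end
speechOnly = vertcat(segments{:});
```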
Algorithms
Preprocessing
The detectspeechnn function preprocesses the audio data using the following steps.
Resample the audio to 16 kHz.
Compute a centered short-time Fourier transform (STFT) using a 25 ms periodic Hamming window and 10 ms hop length. Pad the signal so that the first window is centered at 0 s.
Convert the STFT to a power spectrogram.
Apply a mel filter bank with 40 bands to obtain a mel spectrogram.
Convert the mel spectrogram to a log scale.
Standardize each of the mel bands to have zero mean and standard deviation of 1.
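To make the numbers in the steps above concrete, the following sketch converts the window and hop durations to samples at 16 kHz and illustrates the final standardization step on placeholder data. It is not the function's internal code. The random matrix stands in for a log-mel spectrogram, and both the bands-by-frames orientation and the N-1 normalization of std are assumptions made for illustration.

```matlab
fs = 16000;                          % sample rate after resampling (Hz)
winLength = round(0.025*fs);         % 25 ms window in samples
hopLength = round(0.010*fs);         % 10 ms hop in samples
overlapLength = winLength - hopLength;

% Placeholder log-mel spectrogram: 40 bands (rows) by 50 frames (columns).
rng(0)
logMel = 3*randn(40,50) + 5;

% Standardize each band across time to zero mean and unit standard
% deviation. The N-1 normalization used by std here is an assumption.
standardized = (logMel - mean(logMel,2))./std(logMel,0,2);
```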
Neural Network Inference
The preprocessed data is passed to a pretrained VAD neural network. The network outputs represent the probability of speech in each frame of audio in the input spectrogram.
The neural network is a ported version of the vad-crdnn-libriparty pretrained model provided by SpeechBrain [1], which combines convolutional, recurrent, and fully connected layers.
Postprocessing
The detectspeechnn function postprocesses the VAD network output using the following steps.
Apply activation and deactivation thresholds to posterior probabilities to determine candidate speech regions.
Optionally, apply energy-based VAD to refine the detected speech regions.
Merge speech regions that are close to each other according to the merge threshold.
Remove speech regions that are shorter than or equal to the length threshold.
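The steps above can be sketched in plain MATLAB as follows. This is an illustration under stated assumptions, not the function's implementation: the probabilities, the 10 ms frame hop, and the threshold values are hypothetical, the energy-based VAD step is omitted, and the exact boundary conventions (a region starts when the probability reaches the activation threshold and ends when it drops below the deactivation threshold) are assumed.

```matlab
% Hypothetical per-frame speech probabilities and frame hop.
p = [0.1 0.6 0.4 0.3 0.2 0.1 0.7 0.8 0.1 0.1 0.1 0.1 0.9 0.1];
hop = 0.01;                      % seconds per frame (assumed)
activationThreshold = 0.5;
deactivationThreshold = 0.25;
mergeThreshold = 0.025;          % seconds
lengthThreshold = 0.015;         % seconds

% Hysteresis thresholding: open a region when p reaches the activation
% threshold, close it when p drops below the deactivation threshold.
regions = zeros(0,2);            % [startFrame endFrame], inclusive
inSpeech = false;
for n = 1:numel(p)
    if ~inSpeech && p(n) >= activationThreshold
        inSpeech = true;
        startFrame = n;
    elseif inSpeech && p(n) < deactivationThreshold
        inSpeech = false;
        regions(end+1,:) = [startFrame n-1]; %#ok<AGROW>
    end
end
if inSpeech                      % close a region still open at the end
    regions(end+1,:) = [startFrame numel(p)];
end

% Merge neighboring regions whose gap is at most the merge threshold.
merged = zeros(0,2);
for k = 1:size(regions,1)
    if isempty(merged)
        merged = regions(k,:);
        continue
    end
    gap = (regions(k,1) - merged(end,2) - 1)*hop;   % silence between regions
    if gap <= mergeThreshold
        merged(end,2) = regions(k,2);               % extend previous region
    else
        merged(end+1,:) = regions(k,:); %#ok<AGROW>
    end
end

% Remove regions whose duration is at most the length threshold.
durations = (merged(:,2) - merged(:,1) + 1)*hop;
kept = merged(durations > lengthThreshold,:);
```

Using two thresholds rather than one keeps a region open while the probability hovers between them, which avoids splitting a single utterance on brief dips.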
References
[1] Ravanelli, Mirco, et al. "SpeechBrain: A General-Purpose Speech Toolkit." arXiv, June 8, 2021. http://arxiv.org/abs/2106.04624
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
Usage notes and limitations:
Variable-size input is not supported.
The sample rate fs must be constant.
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Version History
Introduced in R2023a