Note: This page has been translated by MathWorks. Click here to see

To view all translated materials including this page, select Country from the country navigator on the bottom of this page.

To view all translated materials including this page, select Country from the country navigator on the bottom of this page.

Detect presence of speech in audio signal

The `voiceActivityDetector`

System
object™ detects the presence of speech in an audio segment. You can also use the
`voiceActivityDetector`

System
object to output an estimate of the noise variance per frequency bin.

To detect the presence of speech:

Create the

`voiceActivityDetector`

object and set its properties.Call the object with arguments, as if it were a function.

To learn more about how System objects work, see What Are System Objects? (MATLAB).

`VAD = voiceActivityDetector`

`VAD = voiceActivityDetector(Name,Value)`

`VAD = voiceActivityDetector`

creates a System
object, `VAD`

, that detects the presence of speech independently
across each input channel.

`VAD = voiceActivityDetector(`

sets
each property `Name,Value`

)`Name`

to the specified `Value`

.
Unspecified properties have default values.

`VAD = voiceActivityDetector('InputDomain','Frequency')`

creates a System
object, `VAD`

, that accepts frequency-domain input.```
[probability,noiseEstimate]
= VAD(audioIn)
```

`[`

applies a voice activity detector on the input, `probability`

,`noiseEstimate`

]
= VAD(`audioIn`

)`audioIn`

, and returns
the probability that speech is present. It also returns the estimated noise variance per
frequency bin.

To use an object function, specify the
System
object as the first input argument. For
example, to release system resources of a System
object named `obj`

, use
this syntax:

release(obj)

The `voiceActivityDetector`

implements the algorithm described in [1].

If `InputDomain`

is specified as `'Time'`

, the input
signal is windowed and then converted to the frequency domain according to the
`Window`

, `SidelobeAttenuation`

, and
`FFTLength`

properties. If `InputDomain`

is specified
as frequency, the input is assumed to be a windowed discrete time Fourier transform (DTFT) of
an audio signal. The signal is then converted to the power domain. Noise variance is estimated
according to [2]. The posterior and prior
SNR are estimated according to the Minimum Mean-Square Error (MMSE) formula described in [3]. A log likelihood ratio
test and Hidden Markov Model (HMM)-based hang-over scheme determine the probability that the
current frame contains speech, according to [1].

[1] Sohn, Jongseo., Nam Soo Kim, and
Wonyong Sung. "A Statistical Model-Based Voice Activity Detection." *Signal
Processing Letters IEEE*. Vol. 6, No. 1, 1999.

[2] Martin, R. "Noise Power Spectral
Density Estimation Based on Optimal Smoothing and Minimum Statistics." *IEEE
Transactions on Speech and Audio Processing*. Vol. 9, No. 5, 2001, pp.
504–512.

[3] Ephraim, Y., and D. Malah. "Speech
Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator."
*IEEE Transactions on Acoustics, Speech, and Signal Processing*. Vol.
32, No. 6, 1984, pp. 1109–1121.