YAMNet sound classification network
Audio Toolbox / Deep Learning
The YAMNet block uses a pretrained YAMNet sound classification network, trained on the AudioSet dataset, to predict audio events from the AudioSet ontology.
features — Mel spectrograms
96-by-64 matrix | 96-by-64-by-1-by-N array
Mel spectrograms, specified as a 96-by-64 matrix or a 96-by-64-by-1-by-N array, where:
96 –– Number of 10 ms frames in each mel spectrogram
64 –– Number of mel bands, spanning 125 Hz to 7.5 kHz
N –– Number of mel spectrograms
You can use the YAMNet Preprocess block to generate mel spectrograms. The dimensions of these spectrograms are 96-by-64.
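In MATLAB, you can generate correctly sized input with the yamnetPreprocess function from Audio Toolbox, for example to validate data before wiring it into the block. A minimal sketch, assuming an audio file named speech.wav (a placeholder name) is on the path:

```matlab
% Read an audio signal and convert it to YAMNet mel spectrograms.
% "speech.wav" is a placeholder file name.
[audioIn, fs] = audioread("speech.wav");

% features is 96-by-64-by-1-by-N, where N depends on the signal length.
features = yamnetPreprocess(audioIn, fs);
size(features)
```

Each 96-by-64 page of the returned array is one mel spectrogram that the block can classify.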
sound — Predicted sound label
Predicted sound label, returned as an enumerated scalar.
scores — Predicted activations or scores
Predicted activation or score values for each supported sound label, returned as a 1-by-521 vector, where 521 is the number of classes in YAMNet.
labels — Class labels for predicted scores
Class labels for predicted scores, returned as a 1-by-521 vector.
Mini-batch size — Size of mini-batches
128 (default) | positive integer
Size of mini-batches to use for prediction, specified as a positive integer. Larger mini-batch sizes require more memory but can lead to faster predictions.
Classification — Select to output sound classification
on (default) | off
Enable the output port sound, which outputs the classified sound.
Predictions — Output all scores and associated labels
off (default) | on
Enable the output ports scores and labels, which output all predicted scores and associated class labels.
The block accepts mel spectrograms of size 96-by-64 or 96-by-64-by-1-by-N, and computes up to three outputs from these spectrograms:
sound: The label of the most likely sound. The block outputs one sound label for each 96-by-64 spectrogram input.
scores: 1-by-521 vector. Each element is the predicted score for one supported sound label.
labels: 1-by-521 vector. Each element is a sound label, in the same order as the corresponding scores.
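The relationship between the outputs can be sketched in MATLAB: the sound output corresponds to the highest-scoring label. Assuming scores and labels hold the block's two vector outputs for one spectrogram:

```matlab
% The classified sound is the label whose score is largest.
[~, idx] = max(scores);        % scores is 1-by-521
predictedSound = labels(idx);  % corresponds to the sound output port
```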
 Gemmeke, Jort F., Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–80. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952261.
 Hershey, Shawn, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, et al. “CNN Architectures for Large-Scale Audio Classification.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 131–35. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952132.
C/C++ Code Generation
Generate C and C++ code using Simulink® Coder™.
Usage notes and limitations:
The Language parameter in the Configuration Parameters > Code Generation general category must be set to C++.
For ERT-based targets, the Support: variable-size signals parameter in the Code Generation > Interface pane must be enabled.
For a list of networks and layers supported for code generation, see Networks and Layers Supported for Code Generation (MATLAB Coder).