Compress Machine Learning Model for Memory-Limited Hardware

This example shows how to reduce the size of a machine learning model for deployment to memory-limited hardware. To demonstrate the model compression workflow, the example builds models for the acoustic scene classification (ASC) task, which classifies environments from the sounds they produce. ASC is a generic multiclass classification problem that is foundational for context awareness in devices, robots, and other applications [1].

Assume that you want to build a model for hearing aids where the available memory size is 30 KB. First, simplify the multiclass ASC task to a binary classification problem, and then perform these steps:

  • Reduce the number of features by selecting important features.

  • Optimize hyperparameters with coupled constraints, which limit the size of a machine learning model.

  • Quantize model parameters.

For more details on optimizing hyperparameters to reduce the memory size, see More About.

Load Data

Load the acousticscenes data set, and display the variables in the data set.

load("acousticscenes.mat")
whos
  Name           Size               Bytes  Class          Attributes

  xEval        300x286             686400  double                   
  xTest        300x286             686400  double                   
  xTrain      1500x286            3432000  double                   
  yEval        300x1                 2102  categorical              
  yTest        300x1                 2102  categorical              
  yTrain      1500x1                 3302  categorical              

xTrain, xEval, and xTest contain features extracted from the TUT acoustic scene data set using wavelet scattering. yTrain, yEval, and yTest contain acoustic scene labels of 15 different types for xTrain, xEval, and xTest, respectively. In this example, you use xTrain and yTrain to train models and xTest and yTest to test the accuracy of the trained models. During the optimization step, you use xEval and yEval as a holdout validation set.

The TUT acoustic scene data set provides development data (TUT-acoustic-scenes-2017-development [3]) and test data (TUT-acoustic-scenes-2017-evaluation [4]). The development data provides a 4-fold cross-validation setup. xTrain and xEval are from the subsets of the training and evaluation sets (respectively) defined by the first fold of the cross-validation setup, and xTest is from the subset of the test data set. The example Acoustic Scene Recognition Using Late Fusion (Audio Toolbox) describes how you can obtain these variables from a subset of the TUT acoustic scene data set.

Normalize the data sets.

[xTrain,mu,sigma] = normalize(xTrain);
xEval = normalize(xEval,center=mu,scale=sigma);
xTest = normalize(xTest,center=mu,scale=sigma);

Select Classification Model Types

Select types of classification models for this example by using the Classification Learner app.

  1. On the Apps tab, open the apps gallery. Then, in the Machine Learning and Deep Learning group, click Classification Learner.

  2. On the Classification Learner tab, in the File section, click New Session and select From Workspace. In the dialog box, specify yTrain as the response variable, and specify the variables in xTrain as predictors.

  3. In the Models section of the app, click All. This option selects all the model presets available for your data set.

  4. In the Train section, click Train All and select Train All.

You can compare trained models based on accuracy scores, visualize results by plotting class predictions, and check performance using the confusion matrix and ROC curve. For more details on Classification Learner, see Train Classification Models in Classification Learner App.

In this example, you work with these five model types:

  • Bilayered neural network

  • Linear discriminant

  • Random subspace ensemble with discriminant analysis learners

  • Linear SVM

  • Logistic regression

Create a variable containing the model names.

MdlNames = ["Bilayered NN","Linear Discriminant", ...
    "Subspace Discriminant","Linear SVM","Logistic Regression"]';

Train Multiclass Classification Models

Train the five models using fitting functions at the command line, and then reduce the size of the trained models by using the compact function. The compact function discards information that is not necessary for prediction.

SVM and logistic regression models support only binary classification. Therefore, use the fitcecoc function to train a multiclass classification model with linear SVM learners and a multiclass classification model with logistic regression learners. For the logistic regression model, use a templateLinear learner; in this case, you do not use the compact function because fitcecoc returns a compact model object (CompactClassificationECOC).

rng("default") % For reproducibility
multiMdls = cell(5,1);

% Bilayered NN
multiMdls{1} = compact(fitcnet(xTrain,yTrain,LayerSizes=[10 10]));

% Linear Discriminant
multiMdls{2} = compact(fitcdiscr(xTrain,yTrain));

% Subspace Discriminant
multiMdls{3} = compact(fitcensemble(xTrain,yTrain, ...
    Method="Subspace",Learners="discriminant", ...
    NumLearningCycles=30,NPredToSample=25));

% Linear SVM
multiMdls{4} = compact(fitcecoc(xTrain,yTrain));

% Logistic Regression
tLinear = templateLinear(Learner="logistic");
multiMdls{5} = fitcecoc(xTrain,yTrain,Learners=tLinear);

Specify the output display format as bank to display two digits after the decimal point.

format("bank")

Test the models with the test data set using the helper function helperMdlMetrics. This function returns a table of model metrics, including the model accuracy as a percentage and the model size in KB. The code for the helperMdlMetrics function appears at the end of this example.

multiMdlTbl = helperMdlMetrics(multiMdls,xTest,yTest);
tbl1 = multiMdlTbl;
tbl1.Properties.RowNames = MdlNames;
disp(tbl1)
                             Accuracy    Model Size
                             ________    __________

    Bilayered NN              54.33         36.17  
    Linear Discriminant       53.33       2776.71  
    Subspace Discriminant     50.67        881.54  
    Linear SVM                34.33        901.90  
    Logistic Regression       50.00       1937.67  

The size of each model is more than 30 KB, and the accuracy value is approximately 50% for most models.

Simplify Problem as Binary Classification

For the hearing aid application, assume you want to distinguish only between background sounds and sounds from specific sources, instead of classifying sounds into the 15 types included in the data set. Group the sounds into two categories (AllAround and Directional) by using the mergecats function.

AllAround = ["beach","forest_path","park","office","home", ...
    "library","city_center","residential_area"];
Directional = ["train","bus","car","tram","grocery_store", ...
    "metro_station","cafe/restaurant"];
yTrainMapped = mergecats(yTrain,AllAround,"AllAround");
yTrainMapped = mergecats(yTrainMapped,Directional,"Directional");
yEvalMapped = mergecats(yEval,AllAround,"AllAround");
yEvalMapped = mergecats(yEvalMapped,Directional,"Directional");
yTestMapped = mergecats(yTest,AllAround,"AllAround");
yTestMapped = mergecats(yTestMapped,Directional,"Directional");

Create a grouped scatter plot of the first two principal components to see whether the two groups are separable.

figure
[~,score] = pca(xTrain);
gscatter(score(:,1),score(:,2),yTrainMapped)
xlabel("First principal component")
ylabel("Second principal component")

Train Binary Classification Models

Train the models for the binary sound labels yTrainMapped. For the linear SVM model, reduce the memory size by discarding the support vectors with the discardSupportVectors function. The model can still predict new data using the linear predictor coefficients stored in the Beta property of the model. For the logistic regression model, the fitclinear function returns a compact model that does not store the training data.

rng("default")
binaryMdls = cell(5,1);

% Bilayered NN
binaryMdls{1} = compact(fitcnet(xTrain,yTrainMapped,LayerSizes=[10 10]));

% Linear Discriminant
binaryMdls{2} = compact(fitcdiscr(xTrain,yTrainMapped));

% Subspace Discriminant
binaryMdls{3} = compact(fitcensemble(xTrain,yTrainMapped, ...
    Method="Subspace",Learners="discriminant",NumLearningCycles=30,NPredToSample=25));

% Linear SVM
binaryMdls{4} = discardSupportVectors(compact(fitcsvm(xTrain,yTrainMapped)));

% Logistic Regression
binaryMdls{5} = fitclinear(xTrain,yTrainMapped,Learner="logistic");

Test the binary classification models with the test data set yTestMapped.

binaryMdlTbl = helperMdlMetrics(binaryMdls,xTest,yTestMapped);
tbl2 = table(multiMdlTbl,binaryMdlTbl);
tbl2.Properties.RowNames = MdlNames;
tbl2.Properties.VariableNames = ["Multiclass","Binary"];
disp(tbl2)
                                   Multiclass                  Binary        
                             Accuracy    Model Size    Accuracy    Model Size
                             ______________________    ______________________

    Bilayered NN              54.33         36.17       99.33         31.89  
    Linear Discriminant       53.33       2776.71       98.00       1314.90  
    Subspace Discriminant     50.67        881.54       99.33        552.08  
    Linear SVM                34.33        901.90       97.00          8.74  
    Logistic Regression       50.00       1937.67       98.67         18.60  

The trained models accurately classify the acoustic scenes for the binary classification problem. The linear SVM and logistic regression models are smaller than 30 KB.

Train Models with Fewer Features

You can make machine learning models smaller without losing too much accuracy by building models that use only important features. xTrain, xTest, and xEval include 286 features. Rank the features by using the fscmrmr function, and then select the 50 top-ranked features.

idx = fscmrmr(xTrain,yTrainMapped);
xTrainSelected = xTrain(:,idx(1:50));
xEvalSelected = xEval(:,idx(1:50));
xTestSelected = xTest(:,idx(1:50));

Train binary classification models using the selected features.

rng("default")
feat50binaryMdls = cell(5,1);

% Bilayered NN
feat50binaryMdls{1} = compact(fitcnet(xTrainSelected,yTrainMapped,LayerSizes=[10 10]));

% Linear Discriminant
feat50binaryMdls{2} = compact(fitcdiscr(xTrainSelected,yTrainMapped));

% Subspace Discriminant
feat50binaryMdls{3} = compact(fitcensemble(xTrainSelected,yTrainMapped, ...
    Method="Subspace",Learners="discriminant",NumLearningCycles=30,NPredToSample=25));

% Linear SVM
feat50binaryMdls{4} = discardSupportVectors(compact(fitcsvm(xTrainSelected,yTrainMapped)));

% Logistic Regression
feat50binaryMdls{5} = fitclinear(xTrainSelected,yTrainMapped,Learner="logistic");

Test the models with the test data set yTestMapped.

feat50binaryMdlTbl = helperMdlMetrics(feat50binaryMdls,xTestSelected,yTestMapped);
tbl3 = table(multiMdlTbl,binaryMdlTbl,feat50binaryMdlTbl);
tbl3.Properties.RowNames = MdlNames;
tbl3.Properties.VariableNames = ["Multiclass","Binary","50 Features"];
disp(tbl3)
                                   Multiclass                  Binary                 50 Features      
                             Accuracy    Model Size    Accuracy    Model Size    Accuracy    Model Size
                             ______________________    ______________________    ______________________

    Bilayered NN              54.33         36.17       99.33         31.89       90.67         11.38  
    Linear Discriminant       53.33       2776.71       98.00       1314.90       95.33         51.70  
    Subspace Discriminant     50.67        881.54       99.33        552.08       91.33        541.91  
    Linear SVM                34.33        901.90       97.00          8.74       96.33          4.82  
    Logistic Regression       50.00       1937.67       98.67         18.60       97.00         12.18  

In addition to the linear SVM and logistic regression models, the bilayered neural network model is now also smaller than 30 KB. However, reducing the number of features decreases the accuracy of the trained models.

Restore the default display format.

format("default")

Optimize Neural Network with Coupled Constraints

Find optimal model hyperparameters while limiting the memory use of the models. The constraints depend on the type of machine learning model. For example, you can limit the number of support vectors for an SVM model or limit the number of weight parameters in a neural network model. For more details on Bayesian optimization and an example for an SVM model, see Constraints in Bayesian Optimization. This example shows optimization with coupled constraints for a bilayered neural network model.
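
For instance, a coupled constraint that limits the number of support vectors in an SVM model can be expressed as a two-output objective function, as in this minimal sketch. The function name, the optimized hyperparameters, and the maxSV limit are illustrative assumptions, not part of this example.

function [objective,constraint] = helperSVMConstrained(params,xTrain,yTrain,xEval,yEval,maxSV)
% Train an SVM with the candidate hyperparameters, and score it on the
% holdout validation set.
mdl = fitcsvm(xTrain,yTrain, ...
    BoxConstraint=params.BoxConstraint, ...
    KernelScale=params.KernelScale);
objective = loss(mdl,xEval,yEval);
% The Alpha property has one row per support vector. bayesopt treats a
% positive constraint value as infeasible.
constraint = numel(mdl.Alpha) - maxSV;
end

You would pass a handle to a function like this to bayesopt with NumCoupledConstraints=1, as this example does for the neural network below.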

For optimization with coupled constraints, specify the hyperparameters to optimize and define a customized objective function. Then, use the bayesopt function to find the optimal hyperparameters based on the objective function.

First, get the default hyperparameters of the bilayered neural network model by using the hyperparameters function.

params_bilayeredNet = hyperparameters("fitcnet",xTrainSelected,yTrainMapped);

Modify the first, third, and ninth hyperparameters, which correspond to NumLayers, Standardize, and Layer_3_Size, so that they are not optimized. In this way, you build a bilayered model and use the training data without standardization, because the data is already standardized.

params_bilayeredNet(1).Range = [1 2]; % NumLayers
params_bilayeredNet(1).Optimize = false;
params_bilayeredNet(3).Optimize = false; % Standardize
params_bilayeredNet(9).Optimize = false; % Layer_3_Size

Use the customized objective function helperOptimizeConstrainedBilayer, which trains a bilayered neural network model using a given set of parameters for the training data set, and returns the loss for the holdout validation set. The code for the helperOptimizeConstrainedBilayer function appears at the end of this example. The function also accepts the upper limit for the number of weight parameters in the model and returns a constraint value. A positive constraint value indicates that the number of parameters is greater than the specified limit.

Define a function handle fun that takes the hyperparameters and calls the helperOptimizeConstrainedBilayer function. Specify the upper limit for the number of weight parameters as 300.

fun = @(params)helperOptimizeConstrainedBilayer(params,xTrainSelected,yTrainMapped,xEvalSelected,yEvalMapped,300);

When you call the bayesopt function, specify the objective function as fun and specify the hyperparameters as params_bilayeredNet. Also, specify NumCoupledConstraints as 1 to indicate that the objective function has one coupled constraint. For reproducibility, set the random seed and use the expected-improvement-plus acquisition function.

rng("default")
resultNN = bayesopt(fun,params_bilayeredNet, ...
    AcquisitionFunctionName="expected-improvement-plus", ...
    NumCoupledConstraints=1);
|==================================================================================================================================================|
| Iter | Eval   | Objective   | Objective   | BestSoFar   | BestSoFar   | Constraint1  |  Activations |       Lambda | Layer_1_Size | Layer_2_Size |
|      | result |             | runtime     | (observed)  | (estim.)    |              |              |              |              |              |
|==================================================================================================================================================|
|    1 | Infeas |    0.076667 |      3.8313 |         NaN |    0.076667 |      2.4e+03 |         none |   7.6806e-06 |           15 |          115 |
|    2 | Best   |        0.07 |      1.1425 |        0.07 |    0.070445 |         -196 |         none |    0.0001221 |            2 |            1 |
|    3 | Infeas |     0.46667 |     0.15246 |        0.07 |    0.070862 |     1.39e+03 |      sigmoid |       45.438 |           26 |           14 |
|    4 | Best   |    0.063333 |      1.3051 |    0.063333 |    0.063353 |        -52.5 |         tanh |   2.6069e-05 |            4 |            8 |
|    5 | Accept |     0.11333 |      1.4743 |    0.063333 |    0.063423 |        -58.5 |         relu |   2.2423e-05 |            4 |            7 |
|    6 | Accept |        0.07 |      1.1222 |    0.063333 |    0.063344 |         -196 |         none |    0.0001411 |            2 |            1 |
|    7 | Infeas |    0.046667 |      1.5327 |    0.063333 |     0.06318 |     1.95e+04 |         tanh |   1.2269e-07 |          300 |           16 |
|    8 | Infeas |     0.11333 |       5.227 |    0.063333 |    0.063575 |     9.47e+04 |         tanh |     0.045218 |          298 |          267 |
|    9 | Accept |     0.46667 |    0.023516 |    0.063333 |    0.063332 |         -196 |         none |       9.1357 |            2 |            1 |
|   10 | Infeas |     0.46667 |    0.025527 |    0.063333 |    0.063332 |     1.42e+03 |         relu |       3.0052 |           30 |            7 |
|   11 | Best   |    0.046667 |      2.0311 |    0.046667 |    0.046678 |         -172 |         relu |    6.691e-09 |            2 |            7 |
|   12 | Accept |    0.046667 |      1.0284 |    0.046667 |    0.046675 |        -52.5 |         tanh |   6.7859e-09 |            4 |            8 |
|   13 | Accept |    0.086667 |      2.4386 |    0.046667 |    0.046686 |         -172 |         relu |   1.1251e-07 |            2 |            7 |
|   14 | Accept |     0.46667 |    0.024936 |    0.046667 |     0.04668 |        -58.5 |         tanh |       60.245 |            4 |            7 |
|   15 | Best   |        0.03 |      1.0594 |        0.03 |    0.030086 |        -58.5 |         tanh |    0.0011383 |            4 |            7 |
|   16 | Infeas |     0.12333 |     0.12629 |        0.03 |     0.03007 |          296 |      sigmoid |    6.766e-09 |           10 |            8 |
|   17 | Accept |    0.076667 |     0.71763 |        0.03 |    0.030071 |         -146 |         none |   8.2973e-09 |            3 |            1 |
|   18 | Best   |    0.023333 |      1.0659 |    0.023333 |    0.026599 |        -58.5 |         tanh |    0.0009958 |            4 |            7 |
|   19 | Accept |    0.026667 |        1.01 |    0.023333 |     0.02661 |        -52.5 |         tanh |    0.0009402 |            4 |            8 |
|   20 | Accept |        0.05 |      1.3193 |    0.023333 |    0.026601 |         -226 |      sigmoid |    1.086e-05 |            1 |            8 |
|==================================================================================================================================================|
| Iter | Eval   | Objective   | Objective   | BestSoFar   | BestSoFar   | Constraint1  |  Activations |       Lambda | Layer_1_Size | Layer_2_Size |
|      | result |             | runtime     | (observed)  | (estim.)    |              |              |              |              |              |
|==================================================================================================================================================|
|   21 | Accept |    0.036667 |      1.0198 |    0.023333 |    0.027248 |         -110 |         tanh |   0.00090677 |            3 |            8 |
|   22 | Infeas |    0.053333 |      5.9702 |    0.023333 |    0.027181 |     1.41e+04 |         tanh |   0.00048938 |          283 |            1 |
|   23 | Infeas |     0.12333 |     0.37451 |    0.023333 |    0.027429 |     1.71e+04 |         relu |   7.1367e-09 |          238 |           23 |
|   24 | Accept |    0.076667 |     0.92349 |    0.023333 |    0.029543 |         -248 |         none |   6.7138e-07 |            1 |            1 |
|   25 | Accept |    0.046667 |      1.3113 |    0.023333 |     0.02962 |         -226 |         tanh |   1.1434e-07 |            1 |            8 |
|   26 | Accept |    0.043333 |      1.3654 |    0.023333 |    0.029659 |         -168 |      sigmoid |   9.1787e-07 |            2 |            8 |
|   27 | Accept |    0.043333 |     0.71783 |    0.023333 |    0.029584 |         -226 |         tanh |    0.0018534 |            1 |            8 |
|   28 | Infeas |        0.06 |      3.8672 |    0.023333 |    0.030036 |     1.31e+04 |      sigmoid |   2.3192e-06 |          257 |            2 |
|   29 | Accept |    0.066667 |       1.257 |    0.023333 |    0.026647 |         -226 |         tanh |   0.00050488 |            1 |            8 |
|   30 | Accept |    0.036667 |     0.70965 |    0.023333 |    0.028015 |        -52.5 |         tanh |    0.0044111 |            4 |            8 |

__________________________________________________________
Optimization completed.
MaxObjectiveEvaluations of 30 reached.
Total function evaluations: 30
Total elapsed time: 60.5813 seconds
Total objective function evaluation time: 44.1746

Best observed feasible point:
    Activations     Lambda      Layer_1_Size    Layer_2_Size
    ___________    _________    ____________    ____________

       tanh        0.0009958         4               7      

Observed objective function value = 0.023333
Estimated objective function value = 0.029092
Function evaluation time = 1.0659
Observed constraint violations =[ -58.500000 ]

Best estimated feasible point (according to models):
    Activations     Lambda      Layer_1_Size    Layer_2_Size
    ___________    _________    ____________    ____________

       tanh        0.0011383         4               7      

Estimated objective function value = 0.028015
Estimated function evaluation time = 1.0464
Estimated constraint violations =[ -58.501089 ]

bayesopt finds optimal hyperparameters that minimize an error in the holdout validation set and satisfy the constraint. Extract the best point in the optimization results resultNN by using the bestPoint function.

[optimalParams,CriterionValue1,iteration] = bestPoint(resultNN)
optimalParams=1×4 table
    Activations     Lambda      Layer_1_Size    Layer_2_Size
    ___________    _________    ____________    ____________

       tanh        0.0011383         4               7      

CriterionValue1 = 0.0332
iteration = 15

Train the bilayered neural network model with the optimal hyperparameters.

rng("default")
modelNNOpt = compact(fitcnet(xTrainSelected,yTrainMapped, ...
    Activations=char(optimalParams.Activations), ...
    LayerSizes=[optimalParams.Layer_1_Size optimalParams.Layer_2_Size], ...
    Lambda=optimalParams.Lambda));

Find the accuracy and size of the trained model.

OptimizedNNAccuracy = (1-loss(modelNNOpt,xTestSelected,yTestMapped))*100
OptimizedNNAccuracy = 93.3333
OptimizedNNSize = whos("modelNNOpt").bytes/1024
OptimizedNNSize = 8.3555

Quantize Model Parameters with Simulink Block

You can also reduce the memory footprint of a machine learning model by quantizing model parameters with a Simulink block. Statistics and Machine Learning Toolbox™ provides various prediction blocks that allow you to import a trained machine learning model into a Simulink model. In the prediction blocks, you can specify the data types for some or all model parameters, such as single precision, fixed point, or half precision. For an example of fixed-point conversion, see Human Activity Recognition Simulink Model for Fixed-Point Deployment.
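
As a rough illustration of the saving (outside Simulink, and not part of the block workflow), casting a double-precision weight matrix to single precision halves its storage. This sketch uses a random matrix sized like the first-layer weights of the optimized network (50 inputs by 4 units):

% Casting double-precision weights to single halves their storage.
W = randn(50,4);          % sized like the optimized first layer (50-by-4)
Wsingle = single(W);
whos W Wsingle            % 1600 bytes versus 800 bytes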

This example provides the Simulink model slexAcousticSceneClassificationNNPredictExample.slx, which includes the ClassificationNeuralNetwork Predict block. Open this model.

SimMdlName = 'slexAcousticSceneClassificationNNPredictExample'; 
open_system(SimMdlName)

Double-click the ClassificationNeuralNetwork Predict block to open the Block Parameters dialog box. You can specify the data types for the model parameters on the Data Types tab. To reduce the memory size, specify the data types for the layers as single. For details on specifying data types, see Specify Data Types Using Data Type Assistant (Simulink).

Prepare the input data for the Simulink model. Convert the predictor data (xTestSelected) to single precision by using the single function.

soundInput.time = (0:size(xTestSelected,1)-1)';
soundInput.signals(1).values = single(xTestSelected);
soundInput.signals(1).dimensions = size(xTestSelected,2);

Simulate the Simulink model and assign the result to the out variable.

out = sim(SimMdlName);

Find the accuracy of the predict block using the data logged in the To Workspace (Simulink) block.

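% The block logs numeric class labels; map the two unique values (in
% sorted order) to the category names.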
pred = categorical(out.simout.Data,unique(out.simout.Data),["AllAround","Directional"]);
QuantizedNNAccuracy = sum(pred == yTestMapped)/length(yTestMapped)*100
QuantizedNNAccuracy = 93.3333

Find the size of the quantized model parameters.

p = Simulink.Mask.get("slexAcousticSceneClassificationNNPredictExample/ClassificationNeuralNetwork Predict");
vars = p.getWorkspaceVariables;
blockParams = vars(end).Value;
save("params.mat","blockParams")
s = dir("params.mat");
QuantizedNNSize = s.bytes/1024
QuantizedNNSize = 2.4951

Model Compression Summary

Display the changes in model size and accuracy during the model compression workflow for the bilayered neural network model. In general, the model loses some accuracy as you apply additional model compression schemes.

NNAccuracy = [multiMdlTbl{1,"Accuracy"} binaryMdlTbl{1,"Accuracy"} ...
    feat50binaryMdlTbl{1,"Accuracy"} ...
    OptimizedNNAccuracy QuantizedNNAccuracy];
NNSize = [multiMdlTbl{1,"Model Size"} binaryMdlTbl{1,"Model Size"} ...
    feat50binaryMdlTbl{1,"Model Size"} ...
    OptimizedNNSize QuantizedNNSize];
ModelType = ["Multiclass","Binary","50 Features","Optimized","Single Precision"];

figure
yyaxis left
b = bar(NNSize);
xtips = b.XEndPoints;
ytips = b.YEndPoints;
labels = string(round(b.YData,2));
text(xtips,ytips,labels,HorizontalAlignment="center",VerticalAlignment="bottom", ...
    Color='#0072BD')
ylabel("Model Size [KB]")
yyaxis right
plot(NNAccuracy,"-o")
ylabel("Accuracy [%]")
xticklabels(ModelType)
grid on

For the bilayered neural network model, the model size decreases to less than 30 KB after you reduce the number of features. Constrained optimization and conversion to single precision further reduce the model size.

The accuracy of the initial multiclass classification model is lower than that of the other models because the multiclass model classifies sounds into 15 types. After you simplify the multiclass problem into a binary classification problem, the models accurately classify more than 90% of the test data. Reducing the number of features leads to a loss of accuracy, but the constrained optimization step improves accuracy, and converting the data to single precision does not reduce accuracy.

Helper Functions

The helperMdlMetrics function takes a cell array of trained models (Mdls) and test data sets (X and Y) and returns a table of model metrics that includes the model accuracy as a percentage and the model size in KB. The helper function uses the whos function to estimate the model size. However, the size returned by the whos function can be larger than the actual model size required in the generated code for deployment. For example, the generated code does not include information that is not needed for prediction. Consider a CompactClassificationECOC model that uses logistic regression learners. The binary learners in a CompactClassificationECOC model object in the MATLAB workspace contain the ModelParameters property. However, the model prepared for deployment in the generated code does not contain this property.

function tbl = helperMdlMetrics(Mdls,X,Y)
numMdl = length(Mdls);
metrics = NaN(numMdl,2);
for i = 1 : numMdl
    Mdl = Mdls{i};
    MdlInfo = whos("Mdl");
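    % Accuracy as a percentage and model size in KB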
    metrics(i,:) = [(1-loss(Mdl,X,Y))*100 MdlInfo.bytes/1024];
end
tbl = array2table(metrics, ...
    VariableNames=["Accuracy","Model Size"]);
end

The helperOptimizeConstrainedBilayer function trains a bilayered neural network model using a given set of parameters for the training data, and returns the loss for the holdout validation set. In addition, the function accepts the upper limit (maxSize) for the number of weight parameters in the model and returns a constraint value. A positive constraint value indicates that the number of parameters is greater than the specified limit maxSize.

function [objective,constraint] = helperOptimizeConstrainedBilayer(params,xTrain,yTrain,xEval,yEval,maxSize)
mdl = fitcnet(xTrain,yTrain, ...
    Activations=char(params.Activations), ...
    LayerSizes=[params.Layer_1_Size params.Layer_2_Size], ...
    Lambda=params.Lambda);
objective = loss(mdl,xEval,yEval);

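% Estimate the number of weight parameters (excluding biases): the
% input-to-layer-1, layer-1-to-layer-2, and layer-2-to-output weights.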
numClasses = size(unique(yTrain),1);
sizeEst = size(xTrain,2)*params.Layer_1_Size + ...
    params.Layer_1_Size*params.Layer_2_Size + ...
    params.Layer_2_Size*numClasses;
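% bayesopt treats a nonpositive constraint value as feasible; the 0.5
% offset keeps the integer-valued estimate strictly off the boundary.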
constraint = sizeEst - maxSize - 0.5;
end

More About

For optimization with coupled constraints, you can consider limiting these quantities to reduce the memory use, depending on the type of machine learning model:

  • Decision tree — Minimum number of leaf node observations (MinLeafSize) and the maximum number of decision splits (MaxNumSplits). A decision tree model has a small memory footprint. For an illustration of these limits, see the sketch after this list.

  • Linear discriminant and logistic regression — Number of features and classes. Both a linear discriminant model and a logistic regression model have a small to medium memory footprint.

  • Shallow neural network — Number of fully connected layers and the number of hidden units in each layer (LayerSizes). A shallow neural network model has a small to medium memory footprint.

  • k-nearest neighbor — Training data size, the number of nearest neighbors (NumNeighbors), and the maximum number of data points in the leaf node for the Kd-tree algorithm (BucketSize). A k-nearest neighbor model has a medium memory footprint.

  • Support vector machine (SVM) — Number of support vectors, determined by the box constraint (BoxConstraint). An SVM has a medium to large memory footprint. For an SVM model that uses the linear kernel function, you can reduce the footprint by discarding support vectors from the model using the discardSupportVectors function. The reduced SVM model can still predict new data using the predictor coefficients (Beta property) stored in the model.

  • Ensemble — Number of learners and the size of each learner determined by NumLearningCycles and Learners. An ensemble has a medium to large memory footprint.

  • Gaussian process regression (regression only) — Size of the active set (ActiveSetSize). A Gaussian process regression model has a medium to large memory footprint.
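
As a minimal sketch (synthetic data, not part of this example) of the decision tree limits named above, shrinking MaxNumSplits and raising MinLeafSize yields a smaller compact model:

% Compare a default tree with a deliberately constrained tree.
rng("default")
X = randn(1000,10);
Y = categorical(X(:,1) + X(:,2) > 0);   % synthetic binary labels
bigTree = compact(fitctree(X,Y));
smallTree = compact(fitctree(X,Y,MaxNumSplits=15,MinLeafSize=20));
whos bigTree smallTree                  % compare the Bytes column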

For deployment to memory-limited hardware, a recommended practice is to specify training data using a matrix, not a table. If you specify training data using a table, some model properties, such as PredictorNames, can account for a considerable proportion of the model memory footprint.
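
A minimal sketch (synthetic data and hypothetical variable names, not part of this example) of this effect, where the table-trained model stores the long predictor names:

% Compare a model trained on a table with one trained on a matrix.
rng("default")
X = randn(500,100);
Y = categorical(randi(2,500,1));
names = "sensor_channel_" + string(1:100) + "_wavelet_feature";
mdlTable = compact(fitctree(array2table(X,VariableNames=names),Y));
mdlMatrix = compact(fitctree(X,Y));
whos mdlTable mdlMatrix                 % the table-trained model is larger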

References

[1] Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. "Acoustic Scene Classification: An Overview of DCASE 2017 Challenge Entries." In Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC), 2018.

[2] Lostanlen, Vincent, and Joakim Andén. "Binaural Scene Classification with Wavelet Scattering." Technical Report, DCASE2016 Challenge, 2016.

[3] TUT Acoustic Scenes 2017, Development Dataset.

[4] TUT Acoustic Scenes 2017, Evaluation Dataset.
