Loss Function and Model Quality Metrics

What is a Loss Function?

The System Identification Toolbox™ software estimates model parameters by minimizing the error between the model output and the measured response. This error, called loss function or cost function, is a positive function of prediction errors e(t). In general, this function is a weighted sum of squares of the errors. For a model with ny-outputs, the loss function V(θ) has the following general form:

$V (θ) = \frac{1}{N} \sum_{t = 1}^{N} e^{T} (t, θ) W (θ) e (t, θ)$

where:

N is the number of data samples.
e(t,θ) is ny-by-1 error vector at a given time t, parameterized by the parameter vector θ.
W(θ) is the weighting matrix, specified as a positive semidefinite matrix. If W is a diagonal matrix, you can think of it as a way to control the relative importance of outputs during multi-output estimations. When W is a fixed or known weight, it does not depend on θ.

The software determines the parameter values by minimizing V(θ) with respect to θ.

For notational convenience, V(θ) is expressed in its matrix form:

$V (θ) = \frac{1}{N} t r a c e (E^{T} (θ) E (θ) W (θ))$

E(θ) is the error matrix of size N-by-ny. The i:th row of E(θ) represents the error value at time t = i.

The exact form of V(θ) depends on the following factors:

Model structure. For example, whether the model that you want to estimate is an ARX or a state-space model.
Estimator and estimation options. For example, whether you are using n4sid or ssest estimator and specifying options such as Focus and OutputWeight.

Options to Configure the Loss Function

You can configure the loss function for your application needs. The following estimation options, when available for the estimator, configure the loss function:

Estimation Option	Description	Notes
`Focus`	`Focus` option affects how e(t) in the loss function is computed: When `Focus` is `'prediction'`, e(t) represents 1-step ahead prediction error: $e_{p} (t) = y_{m e a s u r e d} (t) - y_{p r e d i c t e d} (t)$ When `Focus` is `'simulation'`, e(t) represents the simulation error: $e_{s} (t) = y_{m e a s u r e d} (t) - y_{s i m u l a t e d} (t)$ Note For models whose noise component is trivial, (H(q) = 1), e_p(t), and e_s(t) are equivalent. The `Focus` option can also be interpreted as a weighting filter in the loss function. For more information, see Effect of Focus and WeightingFilter Options on the Loss Function.	Specify the `Focus` option in the estimation option sets. The estimation option sets for `oe` and `tfest` do not have a `Focus` option because the noise-component for the estimated models is trivial, and so e_p(t) and e_s(t) are equivalent.
`WeightingFilter`	When you specify a weighting filter, prefiltered prediction or simulation error is minimized: $e_{f} (t) = ℒ (e (t))$ where $ℒ (.)$ is a linear filter. The `WeightingFilter` option can be interpreted as a custom weighting filter that is applied to the loss function. For more information, see Effect of Focus and WeightingFilter Options on the Loss Function.	Specify the `WeightingFilter` option in the estimation option sets. Not all options for `WeightingFilter` are available for all estimation commands.
`EnforceStability`	When `EnforceStability` is `true`, the minimization objective also contains a constraint that the estimated model must be stable.	Specify the `EnforceStability` option in the estimation option sets. The estimation option sets for `procest` and `ssregest` commands do not have an `EnforceStability` option. These estimation commands always yield a stable model. The estimation commands `tfest` and `oe` always yield a stable model when used with time-domain estimation data. Identifying unstable plants requires data collection under a closed loop with a stabilizing feedback controller. A reliable estimation of the plant dynamics requires a sufficiently rich noise component in the model structure to separate out the plant dynamics from feedback effects. As a result, models that use a trivial noise component (H(q) = 1), such as models estimated by `tfest` and `oe` commands, do not estimate good results for unstable plants.
`OutputWeight`	`OutputWeight` option configures the weighting matrix W(θ) in the loss function and lets you control the relative importance of output channels during multi-output estimations. When `OutputWeight` is `'noise'`, W(θ) equals the inverse of the estimated variance of error e(t): $W (θ) = {(\frac{1}{N} E^{T} (θ) E (θ))}^{- 1}$ Because W depends on θ, the weighting is determined as a part of the estimation. Minimization of the loss function with this weight simplifies the loss function to: $V (θ) = d e t (\frac{1}{N} E^{T} (θ) E (θ))$ Using the inverse of the noise variance is the optimal weighting in the maximum likelihood sense. When `OutputWeight` is an ny-by-ny positive semidefinite matrix, a constant weighting is used. This loss function then becomes a weighted sum of squared errors.	Specify the `OutputWeight` option in the estimation option sets. Not all options for `OutputWeight` are available for all estimation commands. `OutputWeight` is not available for polynomial model estimation because such models are always estimated one output at a time. `OutputWeight` cannot be `'noise'` when `SearchMethod` is `'lsqnonlin'`.
`ErrorThreshold`	`ErrorThreshold` option specifies the threshold for when to adjust the weight of large errors from quadratic to linear. Errors larger than `ErrorThreshold` times the estimated standard deviation have a linear weight in the loss function. $V (θ) = \frac{1}{N} (\sum_{t \in I} e^{T} (t, θ) W (θ) e (t, θ) + \sum_{t \in J} v^{T} (t, θ) W (θ) v (t, θ))$ where: I represents those time instants for which $\begin{array}{l} \| e (t) \| < ρ * σ \end{array}$ , where ρ is the error threshold. J represents the complement of I, that is, the time instants for which $\| e (t) \| > = ρ * σ$ . σ is the estimated standard deviation of the error. The error v(t,θ) is defined as: $v (t, θ) = e (t, θ) * σ \frac{ρ}{\sqrt{\| e (t, θ) \|}}$	Specify the `ErrorThreshold` option in the estimation option sets. A typical value for the error threshold ρ = 1.6 minimizes the effect of data outliers on the estimation results.
`Regularization`	`Regularization` option modifies the loss function to add a penalty on the variance of the estimated parameters. The loss function is set up with the goal of minimizing the prediction errors. It does not include specific constraints on the variance (a measure of reliability) of estimated parameters. This can sometimes lead to models with large uncertainty in estimated model parameters, especially when the model has many parameters. `Regularization` introduces an additional term in the loss function that penalizes the model flexibility: $V (θ) = \frac{1}{N} \sum_{t = 1}^{N} e^{T} (t, θ) W (θ) e (t, θ) + \frac{1}{N} λ {(θ - θ^{})}^{T} R (θ - θ^{})$ The second term is a weighted (R) and scaled (λ) variance of the estimated parameter set θ about its nominal value θ*.	Specify the `Regularization` option in the estimation option sets. For linear-in-parameter models (FIR models) and ARX models, you can compute optimal values of the regularization variables R and λ using the `arxRegul` command.

Effect of `Focus` and `WeightingFilter` Options on the Loss Function

The Focus option can be interpreted as a weighting filter in the loss function. The WeightingFilter option is an additional custom weighting filter that is applied to the loss function.

To understand the effect of Focus and WeightingFilter, consider a linear single-input single-output model:

$y (t) = G (q, θ) u (t) + H (q, θ) e (t)$

Where G(q,θ) is the measured transfer function, H(q,θ) is the noise model, and e(t) represents the additive disturbances modeled as white Gaussian noise. q is the time-shift operator.

In frequency domain, the linear model can be represented as:

$Y (ω) = G (ω, θ) U (ω) + H (ω, θ) E (ω)$

where Y(ω), U(ω), and E(ω) are the Fourier transforms of the output, input, and output error, respectively. G(ω,θ) and H(ω,θ) represent the frequency response of the input-output and noise transfer functions, respectively.

The loss function to be minimized for the SISO model is given by:

$V (θ) = \frac{1}{N} \sum_{t = 1}^{N} e^{T} (t, θ) e (t, θ)$

Using Parseval’s Identity, the loss function in frequency-domain is:

$V (θ, ω) = \frac{1}{N} {‖ E (ω) ‖}^{2}$

Substituting for E(ω) gives:

$V (θ, ω) = \frac{1}{N} {‖ \frac{Y (ω)}{U (ω)} - G (θ, ω)) ‖}^{2} \frac{{‖ U (ω) ‖}^{2}}{{‖ H (θ, ω) ‖}^{2}}$

Thus, you can interpret minimizing the loss function V as fitting G(θ,ω) to the empirical transfer function $Y (ω) / U (ω)$ , using $\frac{{‖ U (ω) ‖}^{2}}{{‖ H (θ, ω) ‖}^{2}}$ as a weighting filter. This corresponds to specifying Focus as 'prediction'. The estimation emphasizes frequencies where input has more power ( ${‖ U (ω) ‖}^{2}$ is greater) and de-emphasizes frequencies where noise is significant ( ${‖ H (θ, ω) ‖}^{2}$ is large).

When Focus is specified as 'simulation', the inverse weighting with ${‖ H (θ, ω) ‖}^{2}$ is not used. That is, only the input spectrum is used to weigh the relative importance of the estimation fit in a specific frequency range.

When you specify a linear filter $ℒ$ as WeightingFilter, it is used as an additional custom weighting in the loss function.

$V (θ) = \frac{1}{N^{2}} {‖ \frac{Y (ω)}{U (ω)} - G (θ)) ‖}^{2} \frac{{‖ U (ω) ‖}^{2}}{{‖ H (θ) ‖}^{2}} {‖ ℒ (ω) ‖}^{2}$

Here $ℒ (ω)$ is the frequency response of the filter. Use $ℒ (ω)$ to enhance the fit of the model response to observed data in certain frequencies, such as to emphasize the fit close to system resonant frequencies.

The estimated value of input-output transfer function G is the same as what you get if you instead first prefilter the estimation data with $ℒ (.)$ using idfilt, and then estimate the model without specifying WeightingFilter. However, the effect of $ℒ (.)$ on the estimated noise model H depends on the choice of Focus:

Focus is 'prediction' — The software minimizes the weighted prediction error $e_{f} (t) = ℒ (e_{p} (t))$ , and the estimated model has the form:
$y (t) = G (q) u (t) + H_{1} (q) e (t)$
Where $H_{1} (q) = H (q) / ℒ (q)$ . Thus, the estimation with prediction focus creates a biased estimate of H. This is the same estimated noise model you get if you instead first prefilter the estimation data with $ℒ (.)$ using idfilt, and then estimate the model.
When H is parameterized independent of G, you can treat the filter $ℒ (.)$ as a way of affecting the estimation bias distribution. That is, you can shape the trade-off between fitting G to the system frequency response and fitting $H / ℒ$ to the disturbance spectrum when minimizing the loss function. For more details see, section 14.4 in System Identification: Theory for the User, Second Edition, by Lennart Ljung, Prentice Hall PTR, 1999.
Focus is 'simulation' — The software first estimates G by minimizing the weighted simulation error $e_{f} (t) = ℒ (e_{s} (t))$ , where $e_{s} (t) = y_{m e a s u r e d} (t) - G (q) u_{m e a s u r e d} (t)$ . Once G is estimated, the software fixes it and computes H by minimizing pure prediction errors e(t) using unfiltered data. The estimated model has the form:
$y (t) = G (q) u (t) + H e (t)$
If you prefilter the data first, and then estimate the model, you get the same estimate for G but get a biased noise model $H / ℒ$ .

Thus, the WeightingFilter has the same effect as prefiltering the estimation data for estimation of G. For estimation of H, the effect of WeightingFilter depends upon the choice of Focus. A prediction focus estimates a biased version of the noise model $H / ℒ$ , while a simulation focus estimates H. Prefiltering the estimation data, and then estimating the model always gives $H / ℒ$ as the noise model.

Model Quality Metrics

After you estimate a model, use model quality metrics to assess the quality of identified models, compare different models, and pick the best one. The Report.Fit property of an identified model stores various metrics such as FitPercent, LossFcn, FPE, MSE, AIC, nAIC, AICc, and BIC values.

FitPercent, LossFcn, and MSE are measures of the actual quantity that is minimized during the estimation. For example, if Focus is 'simulation', these quantities are computed for the simulation error e_s (t). Similarly, if you specify the WeightingFilter option, then LossFcn, FPE, and MSE are computed using filtered residuals e_f (t).
FPE, AIC, nAIC, AICc, and BIC measures are computed as properties of the output disturbance according to the relationship:
$y (t) = G (q) u (t) + H (q) e (t)$
G(q) and H(q) represent the measured and noise components of the estimated model.
Regardless of how the loss function is configured, the error vector e(t) is computed as 1-step ahead prediction error using a given model and a given dataset. This implies that even when the model is obtained by minimizing the simulation error e_s (t), the FPE and various AIC values are still computed using the prediction error e_p (t). The actual value of e_p (t) is determined using the pe command with prediction horizon of 1 and using the initial conditions specified for the estimation.
These metrics contain two terms — one for describing the model accuracy and another to describe its complexity. For example, in FPE, $d e t (\frac{1}{N} E^{T} E)$ describes the model accuracy and $\frac{1 + \frac{n p}{N}}{1 - \frac{n p}{N}}$ describes the model complexity.
By comparing models using these criteria, you can pick a model that gives the best (smallest criterion value) trade-off between accuracy and complexity.

Quality Metric	Description
`FitPercent`	Normalized Root Mean Squared Error (NRMSE) expressed as a percentage, defined as: $F i t P e r c e n t = 100 (1 - \frac{‖ y_{m e a s u r e d} - y_{m o d e l} ‖}{‖ y_{m e a s u r e d} - \bar{y_{m e a s u r e d}} ‖})$ where: y_measured is the measured output data. $\bar{y_{m e a s u r e d}}$ is its (channel-wise) mean. y_model is the simulated or predicted response of the model, governed by the `Focus`. \|\|.\|\| indicates the 2-norm of a vector. For input or output data, `FitPercent` is an n_y-by-n_exp matrix, where n_y is the number of outputs and n_exp is the number of experiments. For FRD data, `FitPercent` is an n_y-by-n_u matrix, where n_u is the number of inputs (for FRD data, n_exp is always one). `FitPercent` varies between `-Inf` (bad fit) to 100 (perfect fit). If the value is equal to zero, then the model is no better at fitting the measured data than a straight line equal to the mean of the data.
`LossFcn`	Value of the loss function when the estimation completes. It contains effects of error thresholds, output weight, and regularization used for estimation.
`MSE`	Mean Squared Error measure, defined as: $M S E = \frac{1}{N} \sum_{t = 1}^{N} e^{T} (t) e (t)$ where: e(t) is the signal whose norm is minimized for estimation. N is the number of data samples in the estimation dataset.
`FPE`	Akaike’s Final Prediction Error (FPE), defined as: $F P E = d e t (\frac{1}{N} E^{T} E) (\frac{1 + \frac{n_{p}}{N}}{1 - \frac{n_{p}}{N}})$ where: n_p is the number of free parameters in the model. n_p includes the number of estimated initial states. N is the number of samples in the estimation dataset. E is the N-by-n_y matrix of prediction errors, where n_y is the number of output channels.
`AIC`	A raw measure of Akaike's Information Criterion, defined as: $A I C = N * l o g (d e t (\frac{1}{N} E^{T} E)) + 2 * n_{p} + N (n_{y} * \log (2 π) + 1)$
`AICc`	Small sample-size corrected Akaike's Information Criterion, defined as: $A I C c = A I C + 2 * n_{p} * \frac{(n_{p} + 1)}{(N - n_{p} - 1)}$ This metric is often more reliable for picking a model of optimal complexity from a list of candidate models when the data size N is small.
`nAIC`	Normalized measure of Akaike's Information Criterion, defined as: $n A I C = l o g (d e t (\frac{1}{N} E^{T} E)) + \frac{2 * n_{p}}{N}$
`BIC`	Bayesian Information Criterion, defined as: $B I C = N * l o g (d e t (\frac{1}{N} E^{T} E)) + N * (n_{y} * \log (2 π) + 1) + n_{p} * log (N)$