# lbfgsState

State of limited-memory BFGS (L-BFGS) solver

Since R2023a

## Description

An `lbfgsState` object stores information about steps in the L-BFGS algorithm.

The L-BFGS algorithm is a quasi-Newton method that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. The L-BFGS algorithm is best suited for small networks and data sets that you can process in a single batch.

Use `lbfgsState` objects in conjunction with the `lbfgsupdate` function to train a neural network using the L-BFGS algorithm.
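For example, a training loop typically creates the state object once and passes it through each `lbfgsupdate` call. A minimal sketch, assuming `net` is a `dlnetwork` object and `lossFcn` is a loss function with the signature `[loss,gradients] = lossFcn(net)`:

```
% Thread the solver state through repeated L-BFGS updates.
solverState = lbfgsState;
for iteration = 1:30
    [net,solverState] = lbfgsupdate(net,lossFcn,solverState);
end
```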

## Creation

### Syntax

```
solverState = lbfgsState
solverState = lbfgsState(Name=Value)
```

### Description


`solverState = lbfgsState` creates an L-BFGS state object with a history size of 10 and an initial inverse Hessian factor of 1.


`solverState = lbfgsState(Name=Value)` sets the `HistorySize` and `InitialInverseHessianFactor` properties using one or more name-value arguments.
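For example, this command creates a state object with a longer history and a smaller initial inverse Hessian factor (illustrative values):

```
solverState = lbfgsState(HistorySize=15,InitialInverseHessianFactor=0.5);
```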

## Properties


### L-BFGS State

`HistorySize`

Number of state updates to store, specified as a positive integer. The default value is `10`. Values between 3 and 20 suit most tasks.

The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.

After creating the `lbfgsState` object, this property is read-only.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`InitialInverseHessianFactor`

Initial value that characterizes the approximate inverse Hessian matrix, specified as a positive scalar. The default value is `1`.

To save memory, the L-BFGS algorithm does not store and invert the dense Hessian matrix $B$. Instead, the algorithm uses the approximation ${B}_{k-m}^{-1}\approx {\lambda }_{k}I$, where $m$ is the history size, the inverse Hessian factor ${\lambda }_{k}$ is a scalar, and $I$ is the identity matrix. The algorithm stores only the scalar inverse Hessian factor and updates it at each step.

The initial inverse Hessian factor is the value of ${\lambda }_{0}$.

After creating the `lbfgsState` object, this property is read-only.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`InverseHessianFactor`

Value that characterizes the approximate inverse Hessian matrix, specified as a positive scalar.

To save memory, the L-BFGS algorithm does not store and invert the dense Hessian matrix $B$. Instead, the algorithm uses the approximation ${B}_{k-m}^{-1}\approx {\lambda }_{k}I$, where $m$ is the history size, the inverse Hessian factor ${\lambda }_{k}$ is a scalar, and $I$ is the identity matrix. The algorithm stores only the scalar inverse Hessian factor and updates it at each step.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

Since R2023b

`InitialGradientsNorm`

Norm of the initial gradients, specified as a `dlarray` scalar or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `InitialGradientsNorm` is the first value that the `GradientsNorm` property takes. Otherwise, `InitialGradientsNorm` is `[]`.
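For example, you can use `InitialGradientsNorm` in a relative stopping test. A minimal sketch, assuming `solverState` is the output of a previous `lbfgsupdate` call:

```
% Stop when the gradient norm falls below 0.1% of its initial value.
if solverState.GradientsNorm < 1e-3*solverState.InitialGradientsNorm
    disp("Relative gradient tolerance met.")
end
```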

`StepHistory`

Step history, specified as a cell array.

The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.

Data Types: `cell`

`GradientsDifferenceHistory`

Gradients difference history, specified as a cell array.

The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.

Data Types: `cell`

`HistoryIndices`

History indices, specified as a row vector.

`HistoryIndices` is a 1-by-`HistorySize` vector, where `StepHistory(i)` and `GradientsDifferenceHistory(i)` correspond to iteration `HistoryIndices(i)`.

Data Types: `double`
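These histories drive the standard L-BFGS two-loop recursion, which applies the approximate inverse Hessian to the current gradient without ever forming the matrix. The sketch below is illustrative only, not toolbox code: `lbfgsDirection` is a hypothetical helper, and it assumes the histories are plain column vectors ordered oldest to newest, with `s{i}` from `StepHistory`, `y{i}` from `GradientsDifferenceHistory`, `g` the current gradient, and `lambda` the `InverseHessianFactor`.

```
function d = lbfgsDirection(s,y,g,lambda)
% Hypothetical helper: approximate d = B^-1 * g with the two-loop recursion.
m = numel(s);
alpha = zeros(m,1);
rho = zeros(m,1);
q = g;

% First loop: walk the history from newest to oldest.
for i = m:-1:1
    rho(i) = 1/(y{i}'*s{i});
    alpha(i) = rho(i)*(s{i}'*q);
    q = q - alpha(i)*y{i};
end

% Seed with the scalar inverse Hessian approximation lambda*I.
d = lambda*q;

% Second loop: walk the history from oldest to newest.
for i = 1:m
    beta = rho(i)*(y{i}'*d);
    d = d + (alpha(i) - beta)*s{i};
end
end
```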

### Iteration Information

`Loss`

Loss, specified as a `dlarray` scalar, a numeric scalar, or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `Loss` is the first output of the loss function that you pass to the `lbfgsupdate` function. Otherwise, `Loss` is `[]`.

`Gradients`

Gradients, specified as a `dlarray` object, a numeric array, a cell array, a structure, a table, or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `Gradients` is the second output of the loss function that you pass to the `lbfgsupdate` function. Otherwise, `Gradients` is `[]`.

`AdditionalLossFunctionOutputs`

Additional loss function outputs, specified as a cell array.

If the state object is the output of the `lbfgsupdate` function, then `AdditionalLossFunctionOutputs` is a cell array containing additional outputs of the loss function that you pass to the `lbfgsupdate` function. Otherwise, `AdditionalLossFunctionOutputs` is a 1-by-0 cell array.

Data Types: `cell`
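For example, a loss function can return extra diagnostics alongside the loss and gradients. A minimal sketch, where `myModelLoss` is a hypothetical function with the signature `[loss,gradients,Y] = myModelLoss(net,X,T)` and `Y` is the network predictions:

```
% Third and later outputs of the loss function are captured in the state.
lossFcn = @(net) dlfeval(@myModelLoss,net,X,T);
[net,solverState] = lbfgsupdate(net,lossFcn,solverState);
Y = solverState.AdditionalLossFunctionOutputs{1};  % predictions
```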

`StepNorm`

Norm of the step, specified as a `dlarray` scalar, a numeric scalar, or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `StepNorm` is the norm of the step that the `lbfgsupdate` function calculates. Otherwise, `StepNorm` is `[]`.

`GradientsNorm`

Norm of the gradients, specified as a `dlarray` scalar, a numeric scalar, or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `GradientsNorm` is the norm of the second output of the loss function that you pass to the `lbfgsupdate` function. Otherwise, `GradientsNorm` is `[]`.

`LineSearchStatus`

Status of the line search algorithm, specified as `""`, `"completed"`, or `"failed"`.

If the state object is the output of the `lbfgsupdate` function, then `LineSearchStatus` is one of these values:

• `"completed"` — The algorithm finds a learning rate that satisfies the `LineSearchMethod` and `MaxNumLineSearchIterations` options that the `lbfgsupdate` function uses.

• `"failed"` — The algorithm fails to find a learning rate that satisfies the `LineSearchMethod` and `MaxNumLineSearchIterations` options that the `lbfgsupdate` function uses.

Otherwise, `LineSearchStatus` is `""`.
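In a custom training loop, you might stop updating when the line search fails, because continuing without a suitable learning rate can stall progress. A minimal sketch:

```
if solverState.LineSearchStatus == "failed"
    break   % no suitable learning rate found; stop training
end
```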

`LineSearchMethod`

Method the solver uses to find a suitable learning rate, specified as `"weak-wolfe"`, `"strong-wolfe"`, `"backtracking"`, or `""`.

If the state object is the output of the `lbfgsupdate` function, then `LineSearchMethod` is the line search method that the `lbfgsupdate` function uses. Otherwise, `LineSearchMethod` is `""`.

`MaxNumLineSearchIterations`

Maximum number of line search iterations, specified as a nonnegative integer.

If the state object is the output of the `lbfgsupdate` function, then `MaxNumLineSearchIterations` is the maximum number of line search iterations that the `lbfgsupdate` function uses. Otherwise, `MaxNumLineSearchIterations` is `0`.

Data Types: `double`
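Both line search settings are recorded from the update call rather than set on the state object. A minimal sketch, assuming `lbfgsupdate` accepts the `LineSearchMethod` and `MaxNumLineSearchIterations` name-value arguments that these properties describe:

```
[net,solverState] = lbfgsupdate(net,lossFcn,solverState, ...
    LineSearchMethod="strong-wolfe", ...
    MaxNumLineSearchIterations=50);
```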

## Examples


Create an L-BFGS solver state object.

`solverState = lbfgsState`
```
solverState = 
  LBFGSState with properties:

          InverseHessianFactor: 1
                   StepHistory: {}
    GradientsDifferenceHistory: {}
                HistoryIndices: [1x0 double]

   Iteration Information

                          Loss: []
                     Gradients: []
 AdditionalLossFunctionOutputs: {1x0 cell}
                 GradientsNorm: []
                      StepNorm: []
              LineSearchStatus: ""
```

Read the transmission casing data from the CSV file `"transmissionCasingData.csv"`.

```
filename = "transmissionCasingData.csv";
tbl = readtable(filename,TextType="String");
```

Convert the labels for prediction to categorical using the `convertvars` function.

```
labelName = "GearToothCondition";
tbl = convertvars(tbl,labelName,"categorical");
```

To train a network using categorical features, convert the categorical predictors to categorical variables using the `convertvars` function, specifying a string array containing the names of all the categorical input variables.

```
categoricalPredictorNames = ["SensorCondition" "ShaftCondition"];
tbl = convertvars(tbl,categoricalPredictorNames,"categorical");
```

Loop over the categorical input variables. For each variable, convert the categorical values to one-hot encoded vectors using the `onehotencode` function.

```
for i = 1:numel(categoricalPredictorNames)
    name = categoricalPredictorNames(i);
    tbl.(name) = onehotencode(tbl.(name),2);
end
```

View the first few rows of the table.

`head(tbl)`
```
    SigMean     SigMedian    SigRMS    SigVar     SigPeak    SigPeak2Peak    SigSkewness    SigKurtosis    SigCrestFactor    SigMAD     SigRangeCumSum    SigCorrDimension    SigApproxEntropy    SigLyapExponent    PeakFreq    HighFreqPower    EnvPower    PeakSpecKurtosis    SensorCondition    ShaftCondition    GearToothCondition
    ________    _________    ______    _______    _______    ____________    ___________    ___________    ______________    _______    ______________    ________________    ________________    _______________    ________    _____________    ________    ________________    _______________    ______________    __________________

    -0.94876    -0.9722      1.3726    0.98387    0.81571    3.6314          -0.041525      2.2666         2.0514            0.8081     28562             1.1429              0.031581            79.931             0           6.75e-06         3.23e-07    162.13              0  1               1  0              No Tooth Fault
    -0.97537    -0.98958     1.3937    0.99105    0.81571    3.6314          -0.023777      2.2598         2.0203            0.81017    29418             1.1362              0.037835            70.325             0           5.08e-08         9.16e-08    226.12              0  1               1  0              No Tooth Fault
      1.0502     1.0267      1.4449    0.98491    2.8157     3.6314          -0.04162       2.2658         1.9487            0.80853    31710             1.1479              0.031565            125.19             0           6.74e-06         2.85e-07    162.13              0  1               0  1              No Tooth Fault
      1.0227     1.0045      1.4288    0.99553    2.8157     3.6314          -0.016356      2.2483         1.9707            0.81324    30984             1.1472              0.032088            112.5              0           4.99e-06         2.4e-07     162.13              0  1               0  1              No Tooth Fault
      1.0123     1.0024      1.4202    0.99233    2.8157     3.6314          -0.014701      2.2542         1.9826            0.81156    30661             1.1469              0.03287             108.86             0           3.62e-06         2.28e-07    230.39              0  1               0  1              No Tooth Fault
      1.0275     1.0102      1.4338    1.0001     2.8157     3.6314          -0.02659       2.2439         1.9638            0.81589    31102             1.0985              0.033427            64.576             0           2.55e-06         1.65e-07    230.39              0  1               0  1              No Tooth Fault
      1.0464     1.0275      1.4477    1.0011     2.8157     3.6314          -0.042849      2.2455         1.9449            0.81595    31665             1.1417              0.034159            98.838             0           1.73e-06         1.55e-07    230.39              0  1               0  1              No Tooth Fault
      1.0459     1.0257      1.4402    0.98047    2.8157     3.6314          -0.035405      2.2757         1.955             0.80583    31554             1.1345              0.0353              44.223             0           1.11e-06         1.39e-07    230.39              0  1               0  1              No Tooth Fault
```

Extract the training data.

```
predictorNames = ["SigMean" "SigMedian" "SigRMS" "SigVar" "SigPeak" "SigPeak2Peak" ...
    "SigSkewness" "SigKurtosis" "SigCrestFactor" "SigMAD" "SigRangeCumSum" ...
    "SigCorrDimension" "SigApproxEntropy" "SigLyapExponent" "PeakFreq" ...
    "HighFreqPower" "EnvPower" "PeakSpecKurtosis" "SensorCondition" "ShaftCondition"];
XTrain = table2array(tbl(:,predictorNames));
numInputFeatures = size(XTrain,2);
```

Extract the targets and convert them to one-hot encoded vectors.

```
TTrain = tbl.(labelName);
TTrain = onehotencode(TTrain,2);
numClasses = size(TTrain,2);
```

Convert the predictors and targets to `dlarray` objects with format `"BC"` (batch, channel).

```
XTrain = dlarray(XTrain,"BC");
TTrain = dlarray(TTrain,"BC");
```

Define the network architecture.

```
numHiddenUnits = 16;

layers = [
    featureInputLayer(numInputFeatures)
    fullyConnectedLayer(numHiddenUnits)
    layerNormalizationLayer
    reluLayer
    fullyConnectedLayer(numClasses)
    softmaxLayer];

net = dlnetwork(layers);
```

Define the `modelLoss` function, listed in the Model Loss Function section of the example. This function takes as input a neural network, input data, and targets. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.

The `lbfgsupdate` function requires a loss function with the syntax `[loss,gradients] = f(net)`. Create a variable that parameterizes the evaluated `modelLoss` function to take a single input argument.

`lossFcn = @(net) dlfeval(@modelLoss,net,XTrain,TTrain);`

Initialize an L-BFGS solver state object with a maximum history size of 3 and an initial inverse Hessian approximation factor of 1.1.

```
solverState = lbfgsState( ...
    HistorySize=3, ...
    InitialInverseHessianFactor=1.1);
```

Train the network for a maximum of 200 iterations. Stop training early when the norm of the gradients or the norm of the step is smaller than 0.00001, or when the line search fails. Print the training loss at the first iteration and every 10 iterations thereafter.

```
maxIterations = 200;
gradientTolerance = 1e-5;
stepTolerance = 1e-5;

iteration = 0;

while iteration < maxIterations
    iteration = iteration + 1;

    [net, solverState] = lbfgsupdate(net,lossFcn,solverState);

    if iteration==1 || mod(iteration,10)==0
        fprintf("Iteration %d: Loss: %d\n",iteration,solverState.Loss);
    end

    if solverState.GradientsNorm < gradientTolerance || ...
            solverState.StepNorm < stepTolerance || ...
            solverState.LineSearchStatus == "failed"
        break
    end
end
```
```
Iteration 1: Loss: 9.343236e-01
Iteration 10: Loss: 4.721475e-01
Iteration 20: Loss: 4.678575e-01
Iteration 30: Loss: 4.666964e-01
Iteration 40: Loss: 4.665921e-01
Iteration 50: Loss: 4.663871e-01
Iteration 60: Loss: 4.662519e-01
Iteration 70: Loss: 4.660451e-01
Iteration 80: Loss: 4.645303e-01
Iteration 90: Loss: 4.591753e-01
Iteration 100: Loss: 4.562556e-01
Iteration 110: Loss: 4.531167e-01
Iteration 120: Loss: 4.489444e-01
Iteration 130: Loss: 4.392228e-01
Iteration 140: Loss: 4.347853e-01
Iteration 150: Loss: 4.341757e-01
Iteration 160: Loss: 4.325102e-01
Iteration 170: Loss: 4.321948e-01
Iteration 180: Loss: 4.318990e-01
Iteration 190: Loss: 4.313784e-01
Iteration 200: Loss: 4.311314e-01
```

### Model Loss Function

The `modelLoss` function takes as input a neural network `net`, input data `X`, and targets `T`. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.

```
function [loss, gradients] = modelLoss(net, X, T)

Y = forward(net,X);
loss = crossentropy(Y,T);
gradients = dlgradient(loss,net.Learnables);

end
```

## Algorithms

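### Limited-Memory BFGS

In outline, the L-BFGS algorithm [1] updates the learnable parameters at each iteration using a step of the form

$${W}_{k+1} = {W}_{k} - {\eta }_{k}{B}_{k}^{-1}\nabla J\left({W}_{k}\right),$$

where ${W}_{k}$ denotes the learnable parameters at iteration $k$, ${\eta }_{k}$ is the learning rate that the line search determines, ${B}_{k}$ is an approximation of the Hessian matrix, and $\nabla J\left({W}_{k}\right)$ is the gradient of the loss with respect to the learnable parameters. Rather than forming ${B}_{k}^{-1}$ explicitly, the algorithm reconstructs its action on the gradient from the stored step and gradient-difference histories, seeded with the scalar approximation ${B}_{k-m}^{-1}\approx {\lambda }_{k}I$ described in the Properties section.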

### References

[1] Liu, Dong C., and Jorge Nocedal. "On the Limited Memory BFGS Method for Large Scale Optimization." *Mathematical Programming* 45, no. 1 (August 1989): 503-528. https://doi.org/10.1007/BF01589116.