resume
Resume fitting LDA model
Syntax
updatedMdl = resume(ldaMdl,bag)
updatedMdl = resume(ldaMdl,counts)
updatedMdl = resume(___,Name,Value)
Description
updatedMdl = resume(ldaMdl,bag) returns an updated LDA model by training for more iterations on the bag-of-words or bag-of-n-grams model bag. The input bag must be the same model used to fit ldaMdl.
updatedMdl = resume(ldaMdl,counts) returns an updated LDA model by training for more iterations on the documents represented by the matrix of word counts counts. The input counts must be the same matrix used to fit ldaMdl.
updatedMdl = resume(___,Name,Value) specifies additional options using one or more name-value pair arguments.
Examples
Resume Fitting of LDA Model
To reproduce the results in this example, set rng to 'default'.
rng('default')
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" ... ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154
Fit an LDA model with four topics. The resume function does not support the default solver for fitlda. Set the LDA solver to be collapsed variational Bayes, zeroth order.
numTopics = 4;
mdl = fitlda(bag,numTopics,'Solver','cvb0')
=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |  iterations   |
=====================================================================================
|          0 |       0.06 |            |  3.292e+03 |         1.000 |             0 |
|          1 |       0.01 | 1.4970e-01 |  1.147e+03 |         1.000 |             0 |
|          2 |       0.00 | 7.1229e-03 |  1.091e+03 |         1.000 |             0 |
|          3 |       0.00 | 8.1261e-03 |  1.031e+03 |         1.000 |             0 |
|          4 |       0.00 | 8.8626e-03 |  9.703e+02 |         1.000 |             0 |
|          5 |       0.00 | 8.5486e-03 |  9.154e+02 |         1.000 |             0 |
|          6 |       0.00 | 7.4632e-03 |  8.703e+02 |         1.000 |             0 |
|          7 |       0.00 | 6.0480e-03 |  8.356e+02 |         1.000 |             0 |
|          8 |       0.00 | 4.5955e-03 |  8.102e+02 |         1.000 |             0 |
|          9 |       0.00 | 3.4068e-03 |  7.920e+02 |         1.000 |             0 |
|         10 |       0.00 | 2.5353e-03 |  7.788e+02 |         1.000 |             0 |
|         11 |       0.02 | 1.9089e-03 |  7.690e+02 |         1.222 |            10 |
|         12 |       0.00 | 1.2486e-03 |  7.626e+02 |         1.176 |             7 |
|         13 |       0.00 | 1.1243e-03 |  7.570e+02 |         1.125 |             7 |
|         14 |       0.00 | 9.1253e-04 |  7.524e+02 |         1.079 |             7 |
|         15 |       0.00 | 7.5878e-04 |  7.486e+02 |         1.039 |             6 |
|         16 |       0.00 | 6.6181e-04 |  7.454e+02 |         1.004 |             6 |
|         17 |       0.00 | 6.0400e-04 |  7.424e+02 |         0.974 |             6 |
|         18 |       0.00 | 5.6244e-04 |  7.396e+02 |         0.948 |             6 |
|         19 |       0.00 | 5.0548e-04 |  7.372e+02 |         0.926 |             5 |
|         20 |       0.00 | 4.2796e-04 |  7.351e+02 |         0.905 |             5 |
=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |  iterations   |
=====================================================================================
|         21 |       0.00 | 3.4941e-04 |  7.334e+02 |         0.887 |             5 |
|         22 |       0.00 | 2.9495e-04 |  7.320e+02 |         0.871 |             5 |
|         23 |       0.00 | 2.6300e-04 |  7.307e+02 |         0.857 |             5 |
|         24 |       0.00 | 2.5200e-04 |  7.295e+02 |         0.844 |             4 |
|         25 |       0.00 | 2.4150e-04 |  7.283e+02 |         0.833 |             4 |
|         26 |       0.00 | 2.0549e-04 |  7.273e+02 |         0.823 |             4 |
|         27 |       0.00 | 1.6441e-04 |  7.266e+02 |         0.813 |             4 |
|         28 |       0.01 | 1.3256e-04 |  7.259e+02 |         0.805 |             4 |
|         29 |       0.00 | 1.1094e-04 |  7.254e+02 |         0.798 |             4 |
|         30 |       0.00 | 9.2849e-05 |  7.249e+02 |         0.791 |             4 |
=====================================================================================
mdl = 
  ldaModel with properties:

                     NumTopics: 4
             WordConcentration: 1
            TopicConcentration: 0.7908
      CorpusTopicProbabilities: [0.2654 0.2531 0.2480 0.2336]
    DocumentTopicProbabilities: [154x4 double]
        TopicWordProbabilities: [3092x4 double]
                    Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" ... ] (1x3092 string)
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1x1 struct]
View information about the fit.
mdl.FitInfo
ans = struct with fields:
TerminationCode: 1
TerminationStatus: "Relative tolerance on log-likelihood satisfied."
NumIterations: 30
NegativeLogLikelihood: 6.3042e+04
Perplexity: 724.9445
Solver: "cvb0"
History: [1x1 struct]
Resume fitting the LDA model with a lower log-likelihood tolerance.
tolerance = 1e-5;
updatedMdl = resume(mdl,bag, ...
    'LogLikelihoodTolerance',tolerance)
=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |  iterations   |
=====================================================================================
|         30 |       0.00 |            |  7.249e+02 |         0.791 |             0 |
|         31 |       0.01 | 8.0569e-05 |  7.246e+02 |         0.785 |             3 |
|         32 |       0.00 | 7.4692e-05 |  7.242e+02 |         0.779 |             3 |
|         33 |       0.00 | 6.9802e-05 |  7.239e+02 |         0.774 |             3 |
|         34 |       0.00 | 6.1154e-05 |  7.236e+02 |         0.770 |             3 |
|         35 |       0.00 | 5.3163e-05 |  7.233e+02 |         0.766 |             3 |
|         36 |       0.00 | 4.7807e-05 |  7.231e+02 |         0.762 |             3 |
|         37 |       0.00 | 4.1820e-05 |  7.229e+02 |         0.759 |             3 |
|         38 |       0.00 | 3.6237e-05 |  7.227e+02 |         0.756 |             3 |
|         39 |       0.00 | 3.1819e-05 |  7.226e+02 |         0.754 |             2 |
|         40 |       0.00 | 2.7772e-05 |  7.224e+02 |         0.751 |             2 |
|         41 |       0.00 | 2.5238e-05 |  7.223e+02 |         0.749 |             2 |
|         42 |       0.00 | 2.2052e-05 |  7.222e+02 |         0.747 |             2 |
|         43 |       0.00 | 1.8471e-05 |  7.221e+02 |         0.745 |             2 |
|         44 |       0.00 | 1.5638e-05 |  7.221e+02 |         0.744 |             2 |
|         45 |       0.00 | 1.3735e-05 |  7.220e+02 |         0.742 |             2 |
|         46 |       0.00 | 1.2298e-05 |  7.219e+02 |         0.741 |             2 |
|         47 |       0.00 | 1.0905e-05 |  7.219e+02 |         0.739 |             2 |
|         48 |       0.00 | 9.5581e-06 |  7.218e+02 |         0.738 |             2 |
=====================================================================================
updatedMdl = 
  ldaModel with properties:

                     NumTopics: 4
             WordConcentration: 1
            TopicConcentration: 0.7383
      CorpusTopicProbabilities: [0.2679 0.2517 0.2495 0.2309]
    DocumentTopicProbabilities: [154x4 double]
        TopicWordProbabilities: [3092x4 double]
                    Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" ... ] (1x3092 string)
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1x1 struct]
View information about the fit.
updatedMdl.FitInfo
ans = struct with fields:
TerminationCode: 1
TerminationStatus: "Relative tolerance on log-likelihood satisfied."
NumIterations: 48
NegativeLogLikelihood: 6.3001e+04
Perplexity: 721.8357
Solver: "cvb0"
History: [1x1 struct]
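The before-and-after comparison in this example can also be done programmatically by reading the FitInfo fields shown above. The following is a minimal sketch, assuming a small synthetic word-count matrix (hypothetical data, not the sonnets) so that it runs without the example file. The variable names infoBefore and infoAfter are illustrative only.

```matlab
% Hypothetical data: 50 documents (rows) by 40 vocabulary words (columns).
rng('default')
counts = randi([0 3],50,40);

% Fit with a solver that resume supports, then resume with a tighter tolerance.
mdl = fitlda(counts,3,'Solver','cvb0','Verbose',0);
updatedMdl = resume(mdl,counts,'LogLikelihoodTolerance',1e-5,'Verbose',0);

% Compare the iteration count and training perplexity of the two fits.
infoBefore = mdl.FitInfo;
infoAfter = updatedMdl.FitInfo;
fprintf('Iterations: %d -> %d\n',infoBefore.NumIterations,infoAfter.NumIterations)
fprintf('Perplexity: %.4f -> %.4f\n',infoBefore.Perplexity,infoAfter.Perplexity)
```

The exact numbers depend on the data and random seed, so none are quoted here.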
Input Arguments
ldaMdl
— Input LDA model
ldaModel object
Input LDA model, specified as an ldaModel object. To resume fitting a model, you must fit ldaMdl with solver 'savb', 'avb', or 'cvb0'.
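Because resume accepts only models fitted with certain solvers, it can help to check the solver recorded in FitInfo before calling it. The sketch below assumes synthetic word counts (hypothetical data) and a helper variable supportedSolvers introduced here for illustration. It reads the Solver field that appears in the FitInfo display in the example above.

```matlab
% Hypothetical data: 50 documents (rows) by 40 vocabulary words (columns).
rng('default')
counts = randi([0 3],50,40);
mdl = fitlda(counts,3,'Solver','avb','Verbose',0);

% Solvers that this page lists as supported by resume.
supportedSolvers = ["savb" "avb" "cvb0"];

if ismember(mdl.FitInfo.Solver,supportedSolvers)
    % Continue training on the same data used for the original fit.
    mdl = resume(mdl,counts,'Verbose',0);
else
    warning('Solver "%s" is not supported by resume. Refit with a supported solver.', ...
        mdl.FitInfo.Solver)
end
```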
bag
— Input model
bagOfWords object | bagOfNgrams object
Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats each n-gram as a single word.
counts
— Frequency counts of words
matrix of nonnegative integers
Frequency counts of words, specified as a matrix of nonnegative integers. If you specify 'DocumentsIn' to be 'rows', then the value counts(i,j) corresponds to the number of times the jth word of the vocabulary appears in the ith document. Otherwise, the value counts(i,j) corresponds to the number of times the ith word of the vocabulary appears in the jth document.
Note
The argument bag or counts must be the same data used to fit ldaMdl.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'LogLikelihoodTolerance',0.001 specifies a log-likelihood tolerance of 0.001.
DocumentsIn
— Orientation of documents
'rows' (default) | 'columns'
Orientation of documents in the word count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:
'rows' – Input is a matrix of word counts with rows corresponding to documents.
'columns' – Input is a transposed matrix of word counts with columns corresponding to documents.
This option applies only if you specify the input documents as a matrix of word counts.
Note
If you orient your word count matrix so that documents correspond to columns and specify 'DocumentsIn','columns', then you might experience a significant reduction in optimization-execution time.
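A minimal sketch of the column orientation, assuming a synthetic word-count matrix (hypothetical data). The transposed matrix countsT stores one document per column, and the same matrix and the same 'DocumentsIn' setting are passed to both fitlda and resume.

```matlab
% Hypothetical data: 50 documents (rows) by 40 vocabulary words (columns).
rng('default')
counts = randi([0 3],50,40);

% Transpose so that each column is a document.
countsT = counts';

% Fit and resume on the transposed matrix, stating the orientation each time.
mdl = fitlda(countsT,3,'Solver','cvb0','DocumentsIn','columns','Verbose',0);
updatedMdl = resume(mdl,countsT,'DocumentsIn','columns', ...
    'LogLikelihoodTolerance',1e-5,'Verbose',0);

% The model still reports one row of topic probabilities per document.
size(updatedMdl.DocumentTopicProbabilities)
```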
FitTopicConcentration
— Option for fitting topic concentration parameter
true | false
Option for fitting topic concentration, specified as the comma-separated pair consisting of 'FitTopicConcentration' and either true or false.
The default value is the value used to fit ldaMdl.
Example: 'FitTopicConcentration',true
Data Types: logical
FitTopicProbabilities
— Option for fitting topic probabilities
true | false
Option for fitting topic probabilities, specified as the comma-separated pair consisting of 'FitTopicProbabilities' and either true or false.
The default value is the value used to fit ldaMdl.
The function fits the Dirichlet prior $\alpha = \alpha_0 (p_1 \; p_2 \; \cdots \; p_K)$ on the topic mixtures, where $\alpha_0$ is the topic concentration and $p_1, \ldots, p_K$ are the corpus topic probabilities, which sum to 1.
Example: 'FitTopicProbabilities',true
Data Types: logical
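The Dirichlet prior on the topic mixtures can be recovered from a fitted model as the topic concentration multiplied by the corpus topic probabilities. The following sketch assumes synthetic word counts (hypothetical data) and uses the TopicConcentration and CorpusTopicProbabilities properties shown in the example output above. The names alpha0, p, and alpha are illustrative.

```matlab
% Hypothetical data: 50 documents (rows) by 40 vocabulary words (columns).
rng('default')
counts = randi([0 3],50,40);
mdl = fitlda(counts,3,'Solver','cvb0','Verbose',0);
updatedMdl = resume(mdl,counts,'LogLikelihoodTolerance',1e-5,'Verbose',0);

% Topic concentration (a scalar) and corpus topic probabilities (one per topic).
alpha0 = updatedMdl.TopicConcentration;
p = updatedMdl.CorpusTopicProbabilities;

% Dirichlet prior parameters on the topic mixtures, one entry per topic.
alpha = alpha0*p
```

For the sonnets example above, the same product would use the displayed values 0.7383 and [0.2679 0.2517 0.2495 0.2309].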
LogLikelihoodTolerance
— Relative tolerance on log-likelihood
0.0001 (default) | positive scalar
Relative tolerance on log-likelihood, specified as the comma-separated pair consisting of 'LogLikelihoodTolerance' and a positive scalar. The optimization terminates when this tolerance is reached.
Example: 'LogLikelihoodTolerance',0.001
IterationLimit
— Maximum number of iterations
100 (default) | positive integer
Maximum number of iterations, specified as the comma-separated pair consisting of 'IterationLimit' and a positive integer.
This option supports models fitted with batch solvers only ('avb' and 'cvb0'). Models fitted with the 'cgs' solver cannot be resumed.
Example: 'IterationLimit',200
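One way to use this option is to stop an initial fit early and then resume it with a larger limit. The sketch below assumes synthetic word counts (hypothetical data). It does not assume whether resume treats the limit as a total or as additional iterations; it only reads NumIterations from FitInfo afterward.

```matlab
% Hypothetical data: 50 documents (rows) by 40 vocabulary words (columns).
rng('default')
counts = randi([0 3],50,40);

% Stop the initial fit after a small number of iterations.
mdl = fitlda(counts,3,'Solver','cvb0','IterationLimit',5,'Verbose',0);
disp(mdl.FitInfo.TerminationStatus)

% Resume with a larger iteration limit on the same data.
updatedMdl = resume(mdl,counts,'IterationLimit',50,'Verbose',0);
disp(updatedMdl.FitInfo.NumIterations)
```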
DataPassLimit
— Maximum number of passes through data
1 (default) | positive integer
Maximum number of passes through the data, specified as the comma-separated pair consisting of 'DataPassLimit' and a positive integer.
If you specify 'DataPassLimit' but not 'MiniBatchLimit', then the default value of 'MiniBatchLimit' is ignored. If you specify both 'DataPassLimit' and 'MiniBatchLimit', then resume uses the argument that results in processing the fewest observations.
This option supports models fitted with stochastic solvers only ('savb').
Example: 'DataPassLimit',2
MiniBatchLimit
— Maximum number of mini-batch passes
positive integer
Maximum number of mini-batch passes, specified as the comma-separated pair consisting of 'MiniBatchLimit' and a positive integer.
If you specify 'MiniBatchLimit' but not 'DataPassLimit', then resume ignores the default value of 'DataPassLimit'. If you specify both 'MiniBatchLimit' and 'DataPassLimit', then resume uses the argument that results in processing the fewest observations. The default value is ceil(numDocuments/MiniBatchSize), where numDocuments is the number of input documents.
This option supports models fitted with stochastic solvers only ('savb').
Example: 'MiniBatchLimit',200
MiniBatchSize
— Mini-batch size
1000 (default) | positive integer
Mini-batch size, specified as the comma-separated pair consisting of 'MiniBatchSize' and a positive integer. The function processes MiniBatchSize documents in each iteration.
This option supports models fitted with stochastic solvers only ('savb').
Example: 'MiniBatchSize',512
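The stochastic-solver options can be combined when resuming a model fitted with 'savb'. This is a minimal sketch, assuming a synthetic word-count matrix (hypothetical data) with 200 documents, so that a mini-batch size of 50 gives four mini-batches per pass through the data.

```matlab
% Hypothetical data: 200 documents (rows) by 40 vocabulary words (columns).
rng('default')
counts = randi([0 3],200,40);

% Initial fit with the stochastic solver: one pass in mini-batches of 50.
mdl = fitlda(counts,3,'Solver','savb', ...
    'MiniBatchSize',50,'DataPassLimit',1,'Verbose',0);

% Resume on the same data, allowing up to two passes.
updatedMdl = resume(mdl,counts, ...
    'MiniBatchSize',50,'DataPassLimit',2,'Verbose',0);

disp(updatedMdl.FitInfo.Solver)
```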
ValidationData
— Validation data
[] (default) | bagOfWords object | bagOfNgrams object | sparse matrix of word counts
Validation data to monitor optimization convergence, specified as the comma-separated pair consisting of 'ValidationData' and a bagOfWords object, a bagOfNgrams object, or a sparse matrix of word counts. If the validation data is a matrix, then the data must have the same orientation and the same number of words as the input documents.
ValidationFrequency
— Frequency of model validation
positive integer
Frequency of model validation in number of iterations, specified as the comma-separated pair consisting of 'ValidationFrequency' and a positive integer.
The default value depends on the solver used to fit the model. For the stochastic solver, the default value is 10. For the other solvers, the default value is 1.
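The two validation options can be used together. The sketch below assumes a synthetic word-count matrix (hypothetical data) split by rows into a training part and a held-out part. The held-out part is converted to a sparse matrix with the same orientation and the same number of words as the training data, as the ValidationData description requires.

```matlab
% Hypothetical data: 60 documents (rows) by 40 vocabulary words (columns).
rng('default')
counts = randi([0 3],60,40);

% Hold out the last 10 documents for validation.
trainCounts = counts(1:50,:);
validationCounts = sparse(counts(51:60,:));

% Fit on the training documents only.
mdl = fitlda(trainCounts,3,'Solver','cvb0','Verbose',0);

% Resume on the same training data, validating every 5 iterations.
updatedMdl = resume(mdl,trainCounts, ...
    'ValidationData',validationCounts, ...
    'ValidationFrequency',5, ...
    'LogLikelihoodTolerance',1e-5, ...
    'Verbose',0);
```

Setting 'Verbose' to 1 instead displays the progress table during training.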
Verbose
— Verbosity level
1 (default) | 0
Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and one of the following:
0 – Do not display verbose output.
1 – Display progress information.
Example: 'Verbose',0
Output Arguments
updatedMdl
— Updated LDA model
ldaModel object
Updated LDA model, returned as an ldaModel object.
Version History
Introduced in R2017b