fitlda

Fit latent Dirichlet allocation (LDA) model

Syntax

mdl = fitlda(bag,numTopics)

mdl = fitlda(counts,numTopics)

mdl = fitlda(___,Name,Value)

Description

A latent Dirichlet allocation (LDA) model is a topic model which discovers underlying topics in a collection of documents and infers word probabilities in topics. If the model was fit using a bag-of-n-grams model, then the software treats the n-grams as individual words.

mdl = fitlda(bag,numTopics) fits an LDA model with numTopics topics to the bag-of-words or bag-of-n-grams model bag.

mdl = fitlda(counts,numTopics) fits an LDA model to the documents represented by a matrix of frequency counts.

mdl = fitlda(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Fit LDA Model

To reproduce the results in this example, set rng to 'default'.

rng('default')

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    "contracted"    …    ]
        NumWords: 3092
    NumDocuments: 154

Fit an LDA model with four topics.

numTopics = 4;
mdl = fitlda(bag,numTopics)

Initial topic assignments sampled in 0.263378 seconds.
=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |   iterations  |
=====================================================================================
|          0 |       0.17 |            |  1.215e+03 |         1.000 |             0 |
|          1 |       0.02 | 1.0482e-02 |  1.128e+03 |         1.000 |             0 |
|          2 |       0.02 | 1.7190e-03 |  1.115e+03 |         1.000 |             0 |
|          3 |       0.01 | 4.3796e-04 |  1.118e+03 |         1.000 |             0 |
|          4 |       0.01 | 9.4193e-04 |  1.111e+03 |         1.000 |             0 |
|          5 |       0.01 | 3.7079e-04 |  1.108e+03 |         1.000 |             0 |
|          6 |       0.01 | 9.5777e-05 |  1.107e+03 |         1.000 |             0 |
=====================================================================================

mdl = 
  ldaModel with properties:

                     NumTopics: 4
             WordConcentration: 1
            TopicConcentration: 1
      CorpusTopicProbabilities: [0.2500 0.2500 0.2500 0.2500]
    DocumentTopicProbabilities: [154×4 double]
        TopicWordProbabilities: [3092×4 double]
                    Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ]
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1×1 struct]

Visualize the topics using word clouds.

figure
for topicIdx = 1:4
    subplot(2,2,topicIdx)
    wordcloud(mdl,topicIdx);
    title("Topic: " + topicIdx)
end

[Figure: word clouds titled Topic: 1, Topic: 2, Topic: 3, and Topic: 4]

Fit LDA Model to Word Count Matrix

Fit an LDA model to a collection of documents represented by a word count matrix.

To reproduce the results of this example, set rng to 'default'.

rng('default')

Load the example data. sonnetsCounts.mat contains a matrix of word counts and a corresponding vocabulary of preprocessed versions of Shakespeare's sonnets. The value counts(i,j) corresponds to the number of times the jth word of the vocabulary appears in the ith document.

load sonnetsCounts.mat
size(counts)

ans = 1×2

         154        3092

Fit an LDA model with 7 topics. To suppress the verbose output, set 'Verbose' to 0.

numTopics = 7;
mdl = fitlda(counts,numTopics,'Verbose',0);

Visualize multiple topic mixtures using stacked bar charts. Visualize the topic mixtures of the first three input documents.

topicMixtures = transform(mdl,counts(1:3,:));
figure
barh(topicMixtures,'stacked')
xlim([0 1])
title("Topic Mixtures")
xlabel("Topic Probability")
ylabel("Document")
legend("Topic "+ string(1:numTopics),'Location','northeastoutside')

[Figure: stacked bar chart titled Topic Mixtures, with x-axis Topic Probability, y-axis Document, and legend entries Topic 1 through Topic 7]

Predict Top LDA Topics of Documents

To reproduce the results in this example, set rng to 'default'.

rng('default')

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    "contracted"    …    ]
        NumWords: 3092
    NumDocuments: 154

Fit an LDA model with 20 topics.

numTopics = 20;
mdl = fitlda(bag,numTopics)

Initial topic assignments sampled in 0.513255 seconds.
=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |   iterations  |
=====================================================================================
|          0 |       0.04 |            |  1.159e+03 |         5.000 |             0 |
|          1 |       0.05 | 5.4884e-02 |  8.028e+02 |         5.000 |             0 |
|          2 |       0.04 | 4.7400e-03 |  7.778e+02 |         5.000 |             0 |
|          3 |       0.04 | 3.4597e-03 |  7.602e+02 |         5.000 |             0 |
|          4 |       0.03 | 3.4662e-03 |  7.430e+02 |         5.000 |             0 |
|          5 |       0.03 | 2.9259e-03 |  7.288e+02 |         5.000 |             0 |
|          6 |       0.03 | 6.4180e-05 |  7.291e+02 |         5.000 |             0 |
=====================================================================================

mdl = 
  ldaModel with properties:

                     NumTopics: 20
             WordConcentration: 1
            TopicConcentration: 5
      CorpusTopicProbabilities: [0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500 0.0500]
    DocumentTopicProbabilities: [154×20 double]
        TopicWordProbabilities: [3092×20 double]
                    Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ]
                    TopicOrder: 'initial-fit-probability'
                       FitInfo: [1×1 struct]

Predict the top topics for an array of new documents.

newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
topicIdx = predict(mdl,newDocuments)

topicIdx = 2×1

    19
     8

Visualize the predicted topics using word clouds.

figure
subplot(1,2,1)
wordcloud(mdl,topicIdx(1));
title("Topic " + topicIdx(1))
subplot(1,2,2)
wordcloud(mdl,topicIdx(2));
title("Topic " + topicIdx(2))

[Figure: word clouds titled Topic 19 and Topic 8]

Input Arguments

`bag` — Input model
`bagOfWords` object | `bagOfNgrams` object

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats each n-gram as a single word.

`numTopics` — Number of topics
positive integer

Number of topics, specified as a positive integer. For an example showing how to choose the number of topics, see Choose Number of Topics for LDA Model.

Example: 200

`counts` — Frequency counts of words
matrix of nonnegative integers

Frequency counts of words, specified as a matrix of nonnegative integers. If you specify 'DocumentsIn' to be 'rows', then the value counts(i,j) corresponds to the number of times the jth word of the vocabulary appears in the ith document. Otherwise, the value counts(i,j) corresponds to the number of times the ith word of the vocabulary appears in the jth document.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Solver','avb' specifies to use approximate variational Bayes as the solver.

Solver Options

`Solver` — Solver for optimization
`'cgs'` (default) | `'savb'` | `'avb'` | `'cvb0'`

Solver for optimization, specified as the comma-separated pair consisting of 'Solver' and one of the following:

Stochastic Solver

'savb' – Use stochastic approximate variational Bayes [1] [2]. This solver is best suited for large datasets and can fit a good model in fewer passes of the data.

Batch Solvers

'cgs' – Use collapsed Gibbs sampling [3]. This solver can be more accurate at the cost of taking longer to run. The resume function does not support models fitted with CGS.
'avb' – Use approximate variational Bayes [4]. This solver typically runs more quickly than collapsed Gibbs sampling and collapsed variational Bayes, but can be less accurate.
'cvb0' – Use collapsed variational Bayes, zeroth order [4] [5]. This solver can be more accurate than approximate variational Bayes at the cost of taking longer to run.

For an example showing how to compare solvers, see Compare LDA Solvers.

Example: 'Solver','savb'
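
As a minimal sketch of choosing a non-default solver, the following code fits the sonnets word count matrix used in the examples above with the 'avb' solver, stops early, and then continues training with the resume function. This is possible here because the model was not fitted with CGS. The iteration limit of 10 and the choice of 7 topics are arbitrary values for illustration.

```matlab
rng('default')
load sonnetsCounts.mat            % provides counts, as in the word count example

numTopics = 7;
% Fit with approximate variational Bayes, stopping early on purpose.
mdl = fitlda(counts,numTopics,'Solver','avb','IterationLimit',10,'Verbose',0);

% Continue fitting the same model on the same data.
mdl = resume(mdl,counts);
```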

`LogLikelihoodTolerance` — Relative tolerance on log-likelihood
`0.0001` (default) | positive scalar

Relative tolerance on log-likelihood, specified as the comma-separated pair consisting of 'LogLikelihoodTolerance' and a positive scalar. The optimization terminates when this tolerance is reached.

Example: 'LogLikelihoodTolerance',0.001

`FitTopicProbabilities` — Option for fitting corpus topic probabilities
`true` (default) | `false`

Option for fitting corpus topic probabilities, specified as the comma-separated pair consisting of 'FitTopicProbabilities' and either true or false.

The function fits the Dirichlet prior $α = α_{0} (\begin{matrix} p_{1} & p_{2} & \dots & p_{K} \end{matrix})$ on the topic mixtures, where $α_{0}$ is the topic concentration and $p_{1}, \dots, p_{K}$ are the corpus topic probabilities which sum to 1.

Example: 'FitTopicProbabilities',false

Data Types: logical

`FitTopicConcentration` — Option for fitting topic concentration
`true` | `false`

Option for fitting topic concentration, specified as the comma-separated pair consisting of 'FitTopicConcentration' and either true or false.

For the batch solvers 'cgs', 'avb', and 'cvb0', the default for FitTopicConcentration is true. For the stochastic solver 'savb', the default is false.

Example: 'FitTopicConcentration',false

Data Types: logical

`InitialTopicConcentration` — Initial estimate of the topic concentration
`numTopics/4` (default) | nonnegative scalar

Initial estimate of the topic concentration, specified as the comma-separated pair consisting of 'InitialTopicConcentration' and a nonnegative scalar. The function sets the concentration per topic to TopicConcentration/NumTopics. For more information, see Latent Dirichlet Allocation.
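
The arithmetic behind the concentration per topic can be sketched in a few lines. The example assumes 20 topics, the default initial topic concentration numTopics/4, and uniform corpus topic probabilities. The variable names alpha0, p, and alpha are illustrative and are not properties of the model.

```matlab
numTopics = 20;
alpha0 = numTopics/4;                 % default initial topic concentration (5)
p = ones(1,numTopics)/numTopics;      % uniform corpus topic probabilities
alpha = alpha0*p;                     % Dirichlet prior on the topic mixtures

disp(alpha(1))                        % concentration per topic, alpha0/numTopics
```

With these values, each element of the prior equals 5/20 = 0.25, and the elements sum to the topic concentration.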

Example: 'InitialTopicConcentration',25

`TopicOrder` — Topic Order
`'initial-fit-probability'` (default) | `'unordered'`

Topic order, specified as one of the following:

'initial-fit-probability' – Sort the topics by the corpus topic probabilities of the input document set (the CorpusTopicProbabilities property).
'unordered' – Do not sort the topics.

`WordConcentration` — Word concentration
`1` (default) | nonnegative scalar

Word concentration, specified as the comma-separated pair consisting of 'WordConcentration' and a nonnegative scalar. The software sets the Dirichlet prior on the topics (the word probabilities per topic) to be a symmetric Dirichlet distribution whose concentration parameter has the value WordConcentration/numWords, where numWords is the vocabulary size of the input documents. For more information, see Latent Dirichlet Allocation.

`DocumentsIn` — Orientation of documents
`'rows'` (default) | `'columns'`

Orientation of documents in the word count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:

'rows' – Input is a matrix of word counts with rows corresponding to documents.
'columns' – Input is a transposed matrix of word counts with columns corresponding to documents.

This option only applies if you specify the input documents as a matrix of word counts.

Note

If you orient your word count matrix so that documents correspond to columns and specify 'DocumentsIn','columns', then you might experience a significant reduction in optimization-execution time.
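
A minimal sketch of the column orientation, assuming the sonnetsCounts.mat data from the examples above. The matrix is transposed so that each column is one document, and 'DocumentsIn' tells fitlda how to read it. The variable name countsT is illustrative.

```matlab
rng('default')
load sonnetsCounts.mat            % counts is documents-by-words (154-by-3092)

countsT = counts';                % words-by-documents: each column is one document
mdl = fitlda(countsT,7,'DocumentsIn','columns','Verbose',0);
```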

Batch Solver Options

`IterationLimit` — Maximum number of iterations
`100` (default) | positive integer

Maximum number of iterations, specified as the comma-separated pair consisting of 'IterationLimit' and a positive integer.

This option supports batch solvers only ('cgs', 'avb', or 'cvb0').

Example: 'IterationLimit',200

Stochastic Solver Options

`DataPassLimit` — Maximum number of passes through data
`1` (default) | positive integer

Maximum number of passes through the data, specified as the comma-separated pair consisting of 'DataPassLimit' and a positive integer.

If you specify 'DataPassLimit' but not 'MiniBatchLimit', then the default value of 'MiniBatchLimit' is ignored. If you specify both 'DataPassLimit' and 'MiniBatchLimit', then fitlda uses the argument that results in processing the fewest observations.

This option supports only the stochastic ('savb') solver.

Example: 'DataPassLimit',2

`MiniBatchLimit` — Maximum number of mini-batch passes
positive integer

Maximum number of mini-batch passes, specified as the comma-separated pair consisting of 'MiniBatchLimit' and a positive integer.

If you specify 'MiniBatchLimit' but not 'DataPassLimit', then fitlda ignores the default value of 'DataPassLimit'. If you specify both 'MiniBatchLimit' and 'DataPassLimit', then fitlda uses the argument that results in processing the fewest observations. The default value is ceil(numDocuments/MiniBatchSize), where numDocuments is the number of input documents.
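
The interaction between the two limits can be worked through numerically. The following sketch is one reading of the rule above, not the function's internal logic: it assumes a hypothetical corpus of 10000 documents, assumes that every mini-batch holds exactly MiniBatchSize documents, and compares how many documents each limit would allow.

```matlab
numDocuments   = 10000;   % hypothetical corpus size
miniBatchSize  = 1000;    % default MiniBatchSize
dataPassLimit  = 2;       % hypothetical user setting
miniBatchLimit = 15;      % hypothetical user setting

batchesPerPass  = ceil(numDocuments/miniBatchSize);     % default MiniBatchLimit for this corpus
docsFromPasses  = dataPassLimit*numDocuments;           % documents allowed by DataPassLimit
docsFromBatches = miniBatchLimit*miniBatchSize;         % documents allowed by MiniBatchLimit
docsProcessed   = min(docsFromPasses,docsFromBatches);  % the smaller limit applies
```

Under these assumptions, 'MiniBatchLimit' is the binding limit, because 15 mini-batches cover 15000 documents while two full passes would cover 20000.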

This option supports only the stochastic ('savb') solver.

Example: 'MiniBatchLimit',200

`MiniBatchSize` — Mini-batch size
`1000` (default) | positive integer

Mini-batch size, specified as the comma-separated pair consisting of 'MiniBatchSize' and a positive integer. The function processes MiniBatchSize documents in each iteration.

This option supports only the stochastic ('savb') solver.

Example: 'MiniBatchSize',512

`LearnRateDecay` — Learning rate decay
`0.5` (default) | positive scalar less than or equal to 1

Learning rate decay, specified as the comma-separated pair consisting of 'LearnRateDecay' and a positive scalar less than or equal to 1.

For mini-batch t, the function sets the learning rate to $η (t) = 1 / {(1 + t)}^{κ}$ , where $κ$ is the learning rate decay.

If LearnRateDecay is close to 1, then the learning rate decays faster and the model learns mostly from the earlier mini-batches. If LearnRateDecay is close to 0, then the learning rate decays slower and the model continues to learn from more mini-batches. For more information, see Stochastic Solver.

This option supports the stochastic solver only ('savb').

Example: 'LearnRateDecay',0.75
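
To see the effect of the decay value, the learning rate formula above can be evaluated for the first few mini-batches. This sketch compares the default decay of 0.5 with the largest allowed value of 1. The function handle eta and the variable names are illustrative.

```matlab
eta = @(t,kappa) 1./(1 + t).^kappa;   % learning rate for mini-batch t

t = 1:5;
etaSlow = eta(t,0.5);    % default decay: the rate shrinks slowly
etaFast = eta(t,1);      % decay of 1: the rate shrinks quickly

disp([t; etaSlow; etaFast])
```

For every mini-batch, the rate with decay 1 is no larger than the rate with decay 0.5, so later mini-batches contribute less when the decay is close to 1.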

Display Options

`ValidationData` — Validation data
`[]` (default) | `bagOfWords` object | `bagOfNgrams` object | sparse matrix of word counts

Validation data to monitor optimization convergence, specified as the comma-separated pair consisting of 'ValidationData' and a bagOfWords object, a bagOfNgrams object, or a sparse matrix of word counts. If the validation data is a matrix, then the data must have the same orientation and the same number of words as the input documents.

`ValidationFrequency` — Frequency of model validation
positive integer

Frequency of model validation in number of iterations, specified as the comma-separated pair consisting of 'ValidationFrequency' and a positive integer.

The default value depends on the solver used to fit the model. For the stochastic solver, the default value is 10. For the other solvers, the default value is 1.
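
A minimal sketch of monitoring convergence on held-out data, assuming the sonnetsCounts.mat data from the examples above. The split after document 120 and the validation frequency of 5 are arbitrary choices, and the variable names are illustrative. The validation matrix keeps documents in rows and has the same number of words as the training matrix.

```matlab
rng('default')
load sonnetsCounts.mat

% Hold out the last 34 documents; the split point is arbitrary.
countsTrain = counts(1:120,:);
countsValidation = sparse(counts(121:end,:));

mdl = fitlda(countsTrain,7, ...
    'ValidationData',countsValidation, ...
    'ValidationFrequency',5);
```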

`Verbose` — Verbosity level
`1` (default) | `0`

Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and one of the following:

0 – Do not display verbose output.
1 – Display progress information.

Example: 'Verbose',0

Output Arguments

`mdl` — Output LDA model
`ldaModel` object

Output LDA model, returned as an ldaModel object.

More About

Latent Dirichlet Allocation

A latent Dirichlet allocation (LDA) model is a document topic model which discovers underlying topics in a collection of documents and infers word probabilities in topics. LDA models a collection of D documents as topic mixtures $θ_{1}, \dots, θ_{D}$ , over K topics characterized by vectors of word probabilities $φ_{1}, \dots, φ_{K}$ . The model assumes that the topic mixtures $θ_{1}, \dots, θ_{D}$ , and the topics $φ_{1}, \dots, φ_{K}$ follow a Dirichlet distribution with concentration parameters $α$ and $β$ respectively.

The topic mixtures $θ_{1}, \dots, θ_{D}$ are probability vectors of length K, where K is the number of topics. The entry $θ_{d i}$ is the probability of topic i appearing in the dth document. The topic mixtures correspond to the rows of the DocumentTopicProbabilities property of the ldaModel object.

The topics $φ_{1}, \dots, φ_{K}$ are probability vectors of length V, where V is the number of words in the vocabulary. The entry $φ_{i v}$ corresponds to the probability of the vth word of the vocabulary appearing in the ith topic. The topics $φ_{1}, \dots, φ_{K}$ correspond to the columns of the TopicWordProbabilities property of the ldaModel object.
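
Because the topic mixtures and the topics are probability vectors, the rows of DocumentTopicProbabilities and the columns of TopicWordProbabilities should each sum to 1. The following sketch checks this on a model fitted to the sonnetsCounts.mat data from the examples above. The variable names rowSums and colSums are illustrative.

```matlab
rng('default')
load sonnetsCounts.mat
mdl = fitlda(counts,7,'Verbose',0);

rowSums = sum(mdl.DocumentTopicProbabilities,2);   % one sum per document
colSums = sum(mdl.TopicWordProbabilities,1);       % one sum per topic

disp(max(abs(rowSums - 1)))    % largest deviation from 1 over documents
disp(max(abs(colSums - 1)))    % largest deviation from 1 over topics
```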

Given the topics $φ_{1}, \dots, φ_{K}$ and Dirichlet prior $α$ on the topic mixtures, LDA assumes the following generative process for a document:

1. Sample a topic mixture $θ \sim \text{Dirichlet} (α)$ . The random variable $θ$ is a probability vector of length K, where K is the number of topics.
2. For each word in the document:
    a. Sample a topic index $z \sim \text{Categorical} (θ)$ . The random variable z is an integer from 1 through K, where K is the number of topics.
    b. Sample a word $w \sim \text{Categorical} (φ_{z})$ . The random variable w is an integer from 1 through V, where V is the number of words in the vocabulary, and represents the corresponding word in the vocabulary.
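
The generative process can be simulated directly. The following is a sketch under stated assumptions, not the fitting algorithm: the sizes K, V, and N are hypothetical, the topics are random stand-ins, and the prior is fixed to the symmetric case with all concentration values equal to 1, which can be sampled by normalizing independent exponential draws. The helper drawCategorical is illustrative.

```matlab
rng(0)                       % fix the seed so the sketch is repeatable
K = 3;  V = 5;  N = 8;       % hypothetical numbers of topics, vocabulary words, and words per document

% Hypothetical topics: column i holds the word probabilities of topic i (V-by-K).
Phi = rand(V,K);
Phi = Phi ./ sum(Phi,1);

% Sample theta ~ Dirichlet(1,...,1) by normalizing independent Exponential(1) draws.
g = -log(rand(1,K));
theta = g / sum(g);

% Draw one index from a categorical distribution with probability vector p.
drawCategorical = @(p) min(numel(p), 1 + sum(rand > cumsum(p)));

z = zeros(1,N);
w = zeros(1,N);
for n = 1:N
    z(n) = drawCategorical(theta);          % topic index for the nth word
    w(n) = drawCategorical(Phi(:,z(n)));    % word index drawn from topic z(n)
end
```

The vector w holds the indices of the generated words in the vocabulary, and z holds the latent topic index behind each word.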

Under this generative process, the joint distribution of a document with words $w_{1}, \dots, w_{N}$ , with topic mixture $θ$ , and with topic indices $z_{1}, \dots, z_{N}$ is given by

$p (θ, z, w | α, φ) = p (θ | α) \prod_{n = 1}^{N} p (z_{n} | θ) p (w_{n} | z_{n}, φ),$

where N is the number of words in the document. Summing the joint distribution over z and then integrating over $θ$ yields the marginal distribution of a document w:

$p (w | α, φ) = \int_{θ} p (θ | α) \prod_{n = 1}^{N} \sum_{z_{n}} p (z_{n} | θ) p (w_{n} | z_{n}, φ) d θ .$
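
Spelling out the sum over a single topic index may help connect this formula to the entries of the topic mixtures and topics. As a sketch using the entry notation defined above: the categorical distributions give $p (z_{n} = i | θ) = θ_{i}$ and $p (w_{n} | z_{n} = i, φ) = φ_{i w_{n}}$ , so the inner sum is the probability of the nth word under the mixture of topics, and the marginal distribution can be rewritten accordingly.

```latex
\[
\sum_{z_{n}} p (z_{n} | θ) \, p (w_{n} | z_{n}, φ) = \sum_{i = 1}^{K} θ_{i} φ_{i w_{n}},
\]
\[
p (w | α, φ) = \int_{θ} p (θ | α) \prod_{n = 1}^{N} \left( \sum_{i = 1}^{K} θ_{i} φ_{i w_{n}} \right) d θ .
\]
```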

The following diagram illustrates the LDA model as a probabilistic graphical model. Shaded nodes are observed variables, unshaded nodes are latent variables, nodes without outlines are the model parameters. The arrows highlight dependencies between random variables and the plates indicate repeated nodes.

Dirichlet Distribution

The Dirichlet distribution is a multivariate generalization of the beta distribution and is the conjugate prior of the categorical and multinomial distributions. Given the number of categories $K \geq 2$ , and concentration parameter $α$ , where $α$ is a vector of positive reals of length K, the probability density function of the Dirichlet distribution is given by

$p (θ ∣ α) = \frac{1}{B (α)} \prod_{i = 1}^{K} θ_{i}^{α_{i} - 1},$

where B denotes the multivariate Beta function given by

$B (α) = \frac{\prod_{i = 1}^{K} Γ (α_{i})}{Γ (\sum_{i = 1}^{K} α_{i})} .$

A special case of the Dirichlet distribution is the symmetric Dirichlet distribution. The symmetric Dirichlet distribution is characterized by the concentration parameter $α$ , where all the elements of $α$ are the same.
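
The density can be evaluated numerically with the base MATLAB function gammaln, which avoids overflow in the gamma functions. The following sketch uses a symmetric concentration parameter with all elements equal to 1 and an arbitrary point theta on the probability simplex. The variable names are illustrative.

```matlab
alpha = [1 1 1];               % symmetric concentration parameter, K = 3
theta = [0.2 0.3 0.5];         % a point on the probability simplex

% log B(alpha) = sum(log Gamma(alpha_i)) - log Gamma(sum(alpha_i))
logB = sum(gammaln(alpha)) - gammaln(sum(alpha));

% log p(theta | alpha) = -log B(alpha) + sum((alpha_i - 1)*log(theta_i))
logDensity = -logB + sum((alpha - 1).*log(theta));
density = exp(logDensity);
```

In this case the exponents are all zero and B(α) = Γ(1)³/Γ(3) = 1/2, so the density is the constant 2 everywhere on the simplex.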

Stochastic Solver

The stochastic solver processes documents in mini-batches. It updates the per-topic word probabilities using a weighted sum of the probabilities calculated from each mini-batch, and the probabilities from all previous mini-batches.

For mini-batch t, the solver sets the learning rate to $η (t) = 1 / {(1 + t)}^{κ}$ , where $κ$ is the learning rate decay.

The function uses the learning rate to update $Φ$ , the matrix of word probabilities per topic, by setting

$Φ^{(t)} = (1 - η (t)) Φ^{(t - 1)} + η (t) Φ^{(t *)},$

where $Φ^{(t *)}$ is the matrix learned from mini-batch t, and $Φ^{(t - 1)}$ is the matrix learned from mini-batches 1 through t-1.

Before learning begins (when t = 0), the function initializes the word probabilities per topic $Φ^{(0)}$ with random values.
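
The update rule can be sketched with random stand-ins for the per-mini-batch estimates. This is an illustration of the weighted sum only, not the solver itself: the sizes V and K are hypothetical, and PhiBatch replaces the matrix that the solver would learn from each mini-batch. The helper normalizeColumns is illustrative.

```matlab
rng(0)
V = 6;  K = 3;  kappa = 0.5;          % hypothetical sizes; default learning rate decay

normalizeColumns = @(A) A ./ sum(A,1);
Phi = normalizeColumns(rand(V,K));    % random initialization of the word probabilities

for t = 1:4
    eta = 1/(1 + t)^kappa;                      % learning rate for mini-batch t
    PhiBatch = normalizeColumns(rand(V,K));     % stand-in for the estimate from mini-batch t
    Phi = (1 - eta)*Phi + eta*PhiBatch;         % weighted update
end

disp(sum(Phi,1))                      % each column still sums to 1
```

Because each update is a convex combination of two matrices whose columns are probability vectors, the columns of the result remain probability vectors.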

References

[1] Foulds, James, Levi Boyles, Christopher DuBois, Padhraic Smyth, and Max Welling. "Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation." In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 446–454. ACM, 2013.

[2] Hoffman, Matthew D., David M. Blei, Chong Wang, and John Paisley. "Stochastic variational inference." The Journal of Machine Learning Research 14, no. 1 (2013): 1303–1347.

[3] Griffiths, Thomas L., and Mark Steyvers. "Finding scientific topics." Proceedings of the National Academy of Sciences 101, no. suppl 1 (2004): 5228–5235.

[4] Asuncion, Arthur, Max Welling, Padhraic Smyth, and Yee Whye Teh. "On smoothing and inference for topic models." In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 27–34. AUAI Press, 2009.

[5] Teh, Yee W., David Newman, and Max Welling. "A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation." In Advances in neural information processing systems, pp. 1353–1360. 2007.

Version History

Introduced in R2017b

R2018b: `fitlda` sorts topics

Starting in R2018b, fitlda, by default, sorts the topics in descending order of the topic probabilities of the input document set. This behavior makes it easier to find the topics with the highest probabilities.

In previous versions, fitlda does not change the topic order. To reproduce the earlier behavior, set the 'TopicOrder' option to 'unordered'.
