trainWordEmbedding

Train word embedding

Description

emb = trainWordEmbedding(filename) trains a word embedding using the training data stored in the text file filename. The file is a collection of documents stored in UTF-8 with one document per line and words separated by whitespace.

emb = trainWordEmbedding(documents) trains a word embedding using documents. The function writes documents to a temporary file with writeTextDocument and then trains the embedding on that file.
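As a rough sketch of what this syntax does, the following code writes the documents to a temporary file and then trains on that file. The variable name tmpFile and the use of tempname are assumptions for illustration; the temporary file that the function creates internally may differ.

```matlab
% Tokenize the example sonnets, one document per line.
str = extractFileText("sonnetsPreprocessed.txt");
documents = tokenizedDocument(split(str,newline));

% Write the documents to a temporary text file (hypothetical file name).
tmpFile = [tempname '.txt'];
writeTextDocument(documents,tmpFile);

% Train on the file, then remove it.
emb = trainWordEmbedding(tmpFile,'Verbose',0);
delete(tmpFile);
```

Passing documents directly is shorter and avoids managing the file yourself.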

emb = trainWordEmbedding(___,Name,Value) specifies additional options using one or more name-value pair arguments. For example, 'Dimension',50 specifies the word embedding dimension to be 50.

Examples

Train a word embedding of dimension 100 using the example text file sonnetsPreprocessed.txt. This file contains preprocessed versions of Shakespeare's sonnets, with one sonnet per line and words separated by a space.

filename = "sonnetsPreprocessed.txt";
emb = trainWordEmbedding(filename)
Training: 100% Loss: 3.19099  Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: ["thy"    "thou"    "love"    "thee"    "doth"    "mine"    "shall"    "eyes"    "sweet"    "time"    "nor"    "beauty"    "yet"    "art"    "heart"    "o"    "thine"    "hath"    "fair"    "make"    "still"    ...    ] (1x401 string)

View the word embedding in a text scatter plot using tsne.

words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)

Train a word embedding using the example data sonnetsPreprocessed.txt. This file contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Train a word embedding using trainWordEmbedding.

emb = trainWordEmbedding(documents)
Training: 100% Loss: 3.35065  Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: ["thy"    "thou"    "love"    "thee"    "doth"    "mine"    "shall"    "eyes"    "sweet"    "time"    "nor"    "beauty"    "yet"    "art"    "heart"    "o"    "thine"    "hath"    "fair"    "make"    "still"    ...    ] (1x401 string)

Visualize the word embedding in a text scatter plot using tsne.

words = emb.Vocabulary;
V = word2vec(emb,words);
XY = tsne(V);
textscatter(XY,words)

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Specify the word embedding dimension to be 50. To reduce the number of words discarded by the model, set 'MinCount' to 3. To train for longer, set the number of epochs to 10.

emb = trainWordEmbedding(documents, ...
    'Dimension',50, ...
    'MinCount',3, ...
    'NumEpochs',10)
Training: 100% Loss: 3.09153  Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 50
    Vocabulary: ["thy"    "thou"    "love"    "thee"    "doth"    "mine"    "shall"    "eyes"    "sweet"    "time"    "nor"    "beauty"    "yet"    "art"    "heart"    "o"    "thine"    "hath"    "fair"    "make"    "still"    ...    ] (1x750 string)

View the word embedding in a text scatter plot using tsne.

words = emb.Vocabulary;
V = word2vec(emb, words);
XY = tsne(V);
textscatter(XY,words)

Input Arguments

filename: Name of the file, specified as a string scalar, character vector, or a 1-by-1 cell array containing a character vector.

Data Types: string | char | cell

documents: Input documents, specified as a tokenizedDocument array.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Dimension',50 specifies the word embedding dimension to be 50.
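The two ways of writing name-value arguments can be compared side by side. This is a minimal sketch that assumes the example file sonnetsPreprocessed.txt is on the path; the variable names embA and embB are illustrative.

```matlab
str = extractFileText("sonnetsPreprocessed.txt");
documents = tokenizedDocument(split(str,newline));

% R2021a and later: Name=Value syntax.
embA = trainWordEmbedding(documents,Dimension=50,Verbose=0);

% Any release: quoted names separated by commas.
embB = trainWordEmbedding(documents,'Dimension',50,'Verbose',0);
```

Both calls request the same options. The trained vectors can still differ from run to run.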

Dimension of the word embedding, specified as the comma-separated pair consisting of 'Dimension' and a nonnegative integer.

Example: 300

Size of the context window, specified as the comma-separated pair consisting of 'Window' and a nonnegative integer.

Example: 10

Model, specified as the comma-separated pair consisting of 'Model' and 'skipgram' (skip gram) or 'cbow' (continuous bag-of-words).

Example: 'cbow'

Factor to determine the word discard rate, specified as the comma-separated pair consisting of 'DiscardFactor' and a positive scalar. The function discards a word from the input window with probability 1 - sqrt(t/f) - t/f, where f is the unigram probability of the word and t is DiscardFactor. Usually, DiscardFactor is in the range of 1e-3 through 1e-5.

Example: 0.005
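To get a feel for the discard probability 1 - sqrt(t/f) - t/f, the following sketch evaluates it for a few unigram probabilities. The values of t and f are hypothetical, and clamping negative results to zero (reading them as "never discard") is an assumption made here for illustration, not documented behavior.

```matlab
t = 1e-4;                 % DiscardFactor (illustrative value)
f = [1e-2 1e-3 1e-4];     % hypothetical unigram probabilities

ratio = t ./ f;
pDiscard = 1 - sqrt(ratio) - ratio;

% Assumption: treat negative values as "never discard".
pDiscard = max(pDiscard,0);
```

With these inputs, pDiscard is approximately [0.89 0.58 0]: the most frequent word is dropped from the window most of the time, and a word whose unigram probability equals t is always kept.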

Loss function, specified as the comma-separated pair consisting of 'LossFunction' and 'ns' (negative sampling), 'hs' (hierarchical softmax), or 'softmax' (softmax).

Example: 'hs'

Number of negative samples for the negative sampling loss function, specified as the comma-separated pair consisting of 'NumNegativeSamples' and a positive integer. This option is only valid when LossFunction is 'ns'.

Example: 10
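Because NumNegativeSamples applies only to the negative sampling loss, one way to make the intent explicit is to set both options together. This is a minimal sketch that assumes the example file sonnetsPreprocessed.txt is on the path; the value 10 is illustrative.

```matlab
emb = trainWordEmbedding("sonnetsPreprocessed.txt", ...
    'LossFunction','ns', ...
    'NumNegativeSamples',10, ...
    'Verbose',0);
```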

Number of epochs for training, specified as the comma-separated pair consisting of 'NumEpochs' and a positive integer.

Example: 10

Minimum count of words to include in the embedding, specified as the comma-separated pair consisting of 'MinCount' and a positive integer. The function removes from the vocabulary any word that appears fewer than MinCount times in the training data.

Example: 10
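The effect of MinCount can be seen by comparing vocabulary sizes. This sketch assumes the example file sonnetsPreprocessed.txt is on the path; the thresholds 10 and 2 and the variable names are chosen only for illustration.

```matlab
filename = "sonnetsPreprocessed.txt";

% A higher threshold keeps only frequent words.
embStrict = trainWordEmbedding(filename,'MinCount',10,'Verbose',0);

% A lower threshold keeps rarer words as well.
embLoose = trainWordEmbedding(filename,'MinCount',2,'Verbose',0);

numStrict = numel(embStrict.Vocabulary);
numLoose = numel(embLoose.Vocabulary);
```

numLoose should be at least as large as numStrict, because every word that meets the higher threshold also meets the lower one.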

Inclusive range for subword n-grams, specified as the comma-separated pair consisting of 'NGramRange' and a vector of two nonnegative integers [min max]. If you do not want to use n-grams, then set 'NGramRange' to [0 0].

Example: [5 10]

Initial learn rate, specified as the comma-separated pair consisting of 'InitialLearnRate' and a positive scalar.

Example: 0.01

Rate for updating the learn rate, specified as the comma-separated pair consisting of 'UpdateRate' and a positive integer. The learn rate decreases linearly to zero in steps, with one step every N words, where N is the value of UpdateRate.

Example: 50
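One plausible reading of this stepwise linear decay can be sketched numerically. The formula below is an assumption made for illustration and is not taken from the function's implementation; totalWords and wordsSeen are hypothetical quantities.

```matlab
lr0 = 0.05;          % InitialLearnRate (illustrative)
N = 100;             % UpdateRate
totalWords = 1000;   % hypothetical: words per epoch times number of epochs

% Number of words processed at a few points during training.
wordsSeen = 0:250:1000;

% The learn rate changes only at multiples of N words.
wordsAtLastUpdate = floor(wordsSeen/N)*N;
lr = lr0*(1 - wordsAtLastUpdate/totalWords);
```

Under these assumptions, lr is [0.05 0.04 0.025 0.015 0]. A smaller UpdateRate makes the steps finer, so the schedule follows the straight line more closely.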

Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and one of the following:

  • 0 – Do not display verbose output.

  • 1 – Display progress information.

Example: 'Verbose',0

Output Arguments

Output word embedding, returned as a wordEmbedding object.
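As a short illustration of working with the returned object, the following sketch looks up a word vector and then finds nearby words. It assumes the example file sonnetsPreprocessed.txt is on the path and that "love" is in the resulting vocabulary, as it is in the example output shown earlier on this page.

```matlab
emb = trainWordEmbedding("sonnetsPreprocessed.txt",'Verbose',0);

% Map a vocabulary word to its vector (1-by-Dimension).
vec = word2vec(emb,"love");

% Find the five vocabulary words closest to that vector.
nearest = vec2word(emb,vec,5);
```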

More About

Language Considerations

File input to the trainWordEmbedding function requires words separated by whitespace.

For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.

To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.
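For instance, text that has already been split into words can be wrapped without further tokenization. The word lists below are made up for illustration.

```matlab
% Each cell holds one document as a string array of tokens.
pretokenized = { ...
    ["an" "example" "of" "a" "short" "sentence"]
    ["a" "second" "short" "sentence"]};

% Keep the tokens exactly as given.
documents = tokenizedDocument(pretokenized,'TokenizeMethod','none');
```

The resulting array can then be passed to trainWordEmbedding in place of a file name.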

Tips

The training algorithm uses the number of threads given by the function maxNumCompThreads. To learn how to change the number of threads used by MATLAB®, see maxNumCompThreads.
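A minimal sketch of limiting the thread count for one training run and then restoring the earlier setting is shown below. The value 2 is illustrative, and the code assumes the example file sonnetsPreprocessed.txt is on the path.

```matlab
% Limit MATLAB to two computational threads and remember the
% previous setting.
previousThreads = maxNumCompThreads(2);

emb = trainWordEmbedding("sonnetsPreprocessed.txt",'Verbose',0);

% Restore the previous setting.
maxNumCompThreads(previousThreads);
```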

Version History

Introduced in R2017b