addDocument

Add documents to bag-of-words or bag-of-n-grams model

Syntax

newBag = addDocument(bag,documents)

Description

newBag = addDocument(bag,documents) adds documents to the bag-of-words or bag-of-n-grams model bag.

Examples

Add Documents to Bag-of-Words Model

Create a bag-of-words model from an array of tokenized documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"]
        NumWords: 7
    NumDocuments: 2

Create another array of tokenized documents and add it to the same bag-of-words model.

documents = tokenizedDocument([ 
    "a third example of a short sentence" 
    "another short sentence"]);
newBag = addDocument(bag,documents)

newBag = 
  bagOfWords with properties:

          Counts: [4x9 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"    "third"    "another"]
        NumWords: 9
    NumDocuments: 4
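
As a short sketch of what the call does to the two models, the following repeats the example and compares the input and output. It assumes that bagOfWords objects behave as ordinary MATLAB value objects, so the input bag is left unchanged and the added documents appear only in newBag; the variable name moreDocuments is introduced here for illustration.

```matlab
% Build the initial model from two documents.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);

% Add two more documents; the result is returned as a new model.
moreDocuments = tokenizedDocument([
    "a third example of a short sentence"
    "another short sentence"]);
newBag = addDocument(bag,moreDocuments);

% One row of Counts per document, one column per vocabulary word.
sizeBefore = size(bag.Counts);      % 2 documents, 7 words
sizeAfter  = size(newBag.Counts);   % 4 documents, 9 words

% Words that appear only in the added documents.
addedWords = setdiff(newBag.Vocabulary,bag.Vocabulary);
```

To keep updating a single model, assign the output back to the same variable, for example bag = addDocument(bag,moreDocuments).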

Import Text from Multiple Files Using a File Datastore

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords

bag = 
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag

bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4

Input Arguments

`bag` — Input bag-of-words or bag-of-n-grams model
`bagOfWords` object | `bagOfNgrams` object

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.

`documents` — Input documents
`tokenizedDocument` array | string array | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
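
The accepted forms of documents can be sketched as follows. This is a minimal illustration rather than an exhaustive one; the variable names words and docs and the sample sentences are made up for the example.

```matlab
% Start from an empty model.
bag = bagOfWords;

% A string row vector is treated as ONE document; each element is a word.
words = ["a" "short" "sentence"];
bag = addDocument(bag,words);

% A cell array of character vectors (row vector) is also a single document.
bag = addDocument(bag,{'another','short','sentence'});

% To add several documents in one call, use a tokenizedDocument array.
docs = tokenizedDocument([
    "a third sentence"
    "a fourth sentence"]);
bag = addDocument(bag,docs);

numDocs = bag.NumDocuments;   % total number of documents added so far
```

The first two calls add one document each, and the last call adds one document per element of the tokenizedDocument array.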

Output Arguments

`newBag` — Output model
`bagOfWords` object | `bagOfNgrams` object

Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag.
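
A brief sketch of the type-preserving behavior with a bag-of-n-grams model is shown below. It relies on bagOfNgrams counting bigrams by default; the variable names moreDocuments and outputClass are illustrative.

```matlab
% Create a bag-of-n-grams model (bigrams by default).
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfNgrams(documents);

% Add one more document to the n-grams model.
moreDocuments = tokenizedDocument("a third short sentence");
newBag = addDocument(bag,moreDocuments);

% The output has the same class as the input model.
outputClass = class(newBag);
```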

Version History

Introduced in R2017b
