removeEmptyDocuments

Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model

Syntax

newDocuments = removeEmptyDocuments(documents)

newBag = removeEmptyDocuments(bag)

[___,idx] = removeEmptyDocuments(___)

Description

newDocuments = removeEmptyDocuments(documents) removes the documents that contain no words from documents.

newBag = removeEmptyDocuments(bag) removes the documents that contain no words or n-grams from the bag-of-words or bag-of-n-grams model bag.

[___,idx] = removeEmptyDocuments(___) also returns the indices of the removed documents.

Examples

Remove Empty Documents from Array

Remove documents containing no words from an array of tokenized documents.

Create an array of tokenized documents that includes empty documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    ""
    "a second short sentence"
    ""])

documents = 
  4x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    0 tokens:
    4 tokens: a second short sentence
    0 tokens:

Remove the empty documents.

newDocuments = removeEmptyDocuments(documents)

newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence

Remove Empty Documents from Bag-of-Words Model

Remove documents containing no words from a bag-of-words model.

Create a bag-of-words model from an array of tokenized documents.

documents = tokenizedDocument([
    "An example of a short sentence."
    ""
    "A second short sentence."
    ""]);
bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [4x9 double]
      Vocabulary: ["An"    "example"    "of"    "a"    "short"    "sentence"    "."    "A"    "second"]
        NumWords: 9
    NumDocuments: 4

Remove the empty documents from the bag-of-words model.

newBag = removeEmptyDocuments(bag)

newBag = 
  bagOfWords with properties:

          Counts: [2x9 double]
      Vocabulary: ["An"    "example"    "of"    "a"    "short"    "sentence"    "."    "A"    "second"]
        NumWords: 9
    NumDocuments: 2
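
The same call also accepts a bag-of-n-grams model, which the examples above do not show. The following is a minimal sketch, assuming the default behavior of bagOfNgrams (bigrams) and reusing the documents from the bag-of-words example. The printed values are left out on purpose; run the code to inspect them.

```matlab
% Same four documents as in the bag-of-words example; two are empty.
documents = tokenizedDocument([
    "An example of a short sentence."
    ""
    "A second short sentence."
    ""]);

% Build a bag-of-n-grams model (assumed default: bigrams).
bag = bagOfNgrams(documents);

% Remove the documents that contain no n-grams.
newBag = removeEmptyDocuments(bag);

% The output model has the same type as the input model.
disp(class(newBag))
disp(newBag.NumDocuments)
```

Because the model counts n-grams rather than single words, a document can be treated as empty here when it has no n-grams, as stated in the description of the bag syntax.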

Remove Documents and Corresponding Labels

Remove documents containing no words from an array, then use the indices of the removed documents to remove the corresponding labels.

Create an array of tokenized documents that includes empty documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    ""
    "a second short sentence"
    ""])

documents = 
  4x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    0 tokens:
    4 tokens: a second short sentence
    0 tokens:

Create a vector of labels.

labels = ["T"; "F"; "F"; "T"]

labels = 4x1 string
    "T"
    "F"
    "F"
    "T"

Remove the empty documents and get the indices of the removed documents.

[newDocuments, idx] = removeEmptyDocuments(documents)

newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence

idx = 2×1

     2
     4

Remove the corresponding labels from labels.

labels(idx) = []

labels = 2x1 string
    "T"
    "F"

Input Arguments

`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

`bag` — Input bag-of-words or bag-of-n-grams model
`bagOfWords` object | `bagOfNgrams` object

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.

Output Arguments

`newDocuments` — Output documents
`tokenizedDocument` array

Output documents, returned as a tokenizedDocument array.

`newBag` — Output model
`bagOfWords` object | `bagOfNgrams` object

Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag.

`idx` — Indices of removed documents
vector of positive integers

Indices of removed documents, returned as a vector of positive integers.
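
As a sanity check on idx, the indices can be compared with the positions of the zero-length documents. This sketch assumes the doclength function from the same toolbox, which returns the number of tokens in each document; it illustrates the relationship and is not a description of how removeEmptyDocuments is implemented.

```matlab
documents = tokenizedDocument([
    "an example of a short sentence"
    ""
    "a second short sentence"
    ""]);

% Indices reported by removeEmptyDocuments.
[~, idx] = removeEmptyDocuments(documents);

% Positions of the documents that contain zero tokens.
emptyPositions = find(doclength(documents) == 0);

% Compare the two as column vectors.
tf = isequal(idx(:), emptyPositions(:));
disp(tf)
```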

Version History

Introduced in R2017b
