removeInfrequentNgrams

Remove infrequently seen n-grams from bag-of-n-grams model

Syntax

newBag = removeInfrequentNgrams(bag,count)

newBag = removeInfrequentNgrams(bag,count,'NgramLengths',lengths)

newBag = removeInfrequentNgrams(___,'IgnoreCase',true)

Description

newBag = removeInfrequentNgrams(bag,count) removes the n-grams that appear at most count times in total from the bag-of-n-grams model bag. By default, the function is case sensitive.

newBag = removeInfrequentNgrams(bag,count,'NgramLengths',lengths) removes only the n-grams with lengths specified by lengths. By default, the function is case sensitive.

newBag = removeInfrequentNgrams(___,'IgnoreCase',true) removes the n-grams that appear at most count times, ignoring case. If n-grams differ only by case, then their counts are merged before the threshold is applied.
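As a small illustration of the difference between the case-sensitive default and 'IgnoreCase', the following sketch builds a toy bag of bigrams. The documents and variable names are made up for illustration and are not part of the example data. The bigrams "The cat" and "the cat" each appear once, so with a threshold of 1 one would expect them to be removed when counted separately and kept once their counts are merged.

```matlab
% Toy documents (hypothetical, for illustration only).
documents = tokenizedDocument([
    "The cat sat"
    "the cat ran"
    "a dog ran"
    "a dog sat"]);

% bagOfNgrams counts bigrams by default.
bag = bagOfNgrams(documents);

% Case-sensitive (default): "The cat" and "the cat" are counted separately.
bagCaseSensitive = removeInfrequentNgrams(bag,1);

% Ignoring case: counts of n-grams that differ only by case are merged.
bagIgnoreCase = removeInfrequentNgrams(bag,1,'IgnoreCase',true);

% Compare how many n-grams survive in each case.
numKeptCaseSensitive = bagCaseSensitive.NumNgrams
numKeptIgnoreCase = bagIgnoreCase.NumNgrams
```

Comparing the two NumNgrams values is a quick way to check whether case differences affect which n-grams fall under the threshold for your data.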

Examples


Remove Infrequent N-Grams from Bag-of-N-Grams Model


Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model that counts bigrams (pairs of words) and trigrams (triples of words).

bag = bagOfNgrams(documents,'NgramLengths',[2 3])

bag = 
  bagOfNgrams with properties:

          Counts: [154x18022 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    ...    ] (1x3092 string)
          Ngrams: [18022x3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154

Remove n-grams of any length that appear two or fewer times in total.

bag = removeInfrequentNgrams(bag,2)

bag = 
  bagOfNgrams with properties:

          Counts: [154x103 double]
      Vocabulary: ["thine"    "thy"    "self"    "sweet"    "thou"    "time"    "why"    "dost"    "upon"    "eye"    "thee"    "ten"    "beauty"    "love"    "wilt"    "dear"    "truth"    "own"    "yet"    "hast"    "mens"    ...    ] (1x73 string)
          Ngrams: [103x3 string]
    NgramLengths: [2 3]
       NumNgrams: 103
    NumDocuments: 154

Remove bigrams that appear four or fewer times in total.

bag = removeInfrequentNgrams(bag,4,'NgramLengths',2)

bag = 
  bagOfNgrams with properties:

          Counts: [154x41 double]
      Vocabulary: ["thine"    "thy"    "sweet"    "thou"    "dost"    "upon"    "why"    "thee"    "ten"    "love"    "dear"    "hast"    "true"    "mine"    "beauty"    "fair"    "own"    "self"    "art"    "times"    "shouldst"    ...    ] (1x30 string)
          Ngrams: [41x3 string]
    NgramLengths: [2 3]
       NumNgrams: 41
    NumDocuments: 154

Input Arguments


`bag` — Input bag-of-n-grams model
`bagOfNgrams` object

Input bag-of-n-grams model, specified as a bagOfNgrams object.

`count` — Count threshold
positive integer

Count threshold, specified as a positive integer. The function removes the n-grams whose total number of occurrences across all documents is less than or equal to count.
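To make the threshold concrete, the sketch below tallies the total count of each n-gram by summing the columns of the Counts property, then compares that tally with the result of removeInfrequentNgrams. The documents and the variable names totals, count, and expectedKept are assumptions for illustration. The point it shows is that an n-gram appearing exactly count times is removed, not kept.

```matlab
% Toy documents (hypothetical, for illustration only).
documents = tokenizedDocument([
    "a short sentence"
    "another short sentence"
    "yet another short sentence"]);
bag = bagOfNgrams(documents);   % bigrams by default

% Total occurrences of each n-gram across all documents.
totals = full(sum(bag.Counts,1));

% Show each n-gram next to its total count.
table(join(bag.Ngrams," ",2),totals', ...
    'VariableNames',{'Ngram','TotalCount'})

% N-grams are kept only if they appear more than count times.
count = 2;
expectedKept = nnz(totals > count);

newBag = removeInfrequentNgrams(bag,count);
newBag.Ngrams
```

Inspecting the totals in this way before calling the function can help when choosing a threshold for a larger corpus.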

`lengths` — N-gram lengths
positive integer | vector of positive integers

N-gram lengths, specified as a positive integer or a vector of positive integers.

If you specify lengths, the function removes infrequent n-grams of the specified lengths only. If you do not specify lengths, then the function removes infrequent n-grams regardless of length.

Example: [1 2 3]
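The following sketch shows the effect of restricting removal to one n-gram length. The documents are made up for illustration. The bag counts both bigrams and trigrams, but only bigrams are subject to the threshold, so trigrams that appear just once should remain in the output.

```matlab
% Toy documents (hypothetical, for illustration only).
documents = tokenizedDocument([
    "a short sentence"
    "another short sentence"]);

% Count both bigrams and trigrams.
bag = bagOfNgrams(documents,'NgramLengths',[2 3]);

% Remove only the bigrams that appear once or not at all.
% Trigrams are left untouched, whatever their counts.
newBag = removeInfrequentNgrams(bag,1,'NgramLengths',2);

% Remaining n-grams, one per row.
newBag.Ngrams
```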

Output Arguments


`newBag` — Output bag-of-n-grams model
`bagOfNgrams` object

Output bag-of-n-grams model, returned as a bagOfNgrams object.

Version History

Introduced in R2018a
