rakeKeywords

Extract keywords using RAKE

Since R2020b

Syntax

tbl = rakeKeywords(documents)

tbl = rakeKeywords(documents,Name=Value)

Description

tbl = rakeKeywords(documents) extracts keywords and respective scores using the Rapid Automatic Keyword Extraction (RAKE) algorithm. The function supports English, Japanese, German, and Korean text. To learn how to use rakeKeywords for other languages, see Language Considerations.

tbl = rakeKeywords(documents,Name=Value) specifies additional options using one or more name-value arguments.

Tip

By default, the rakeKeywords function uses stop words and punctuation characters as delimiters when it extracts keywords. When using the default values for the Delimiters and MergingDelimiters options, do not remove stop words or punctuation characters from the input text.

Examples

Extract Keywords Using RAKE

Create an array of tokenized documents containing the text data.

textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

Extract the keywords using the rakeKeywords function.

tbl = rakeKeywords(documents)

tbl=12×3 table
                     Keyword                     DocumentNumber    Score
    _________________________________________    ______________    _____

    "MATLAB"        "provides"    "tools"              1             8  
    "MATLAB"        ""            ""                   1             2  
    "scientists"    "and"         "engineers"          1             2  
    "scientists"    ""            ""                   1             1  
    "engineers"     ""            ""                   1             1  
    "Analyze"       "text"        ""                   2             4  
    "import"        "text"        ""                   2             4  
    "images"        ""            ""                   2             1  
    "Analyze"       "text"        ""                   3             4  
    "images"        ""            ""                   3             1  
    "videos"        ""            ""                   3             1  
    "MATLAB"        ""            ""                   3             1

If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

For readability, transform the multi-word keywords into a single string using the join and strip functions.

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl

tbl=12×3 table
             Keyword              DocumentNumber    Score
    __________________________    ______________    _____

    "MATLAB provides tools"             1             8  
    "MATLAB"                            1             2  
    "scientists and engineers"          1             2  
    "scientists"                        1             1  
    "engineers"                         1             1  
    "Analyze text"                      2             4  
    "import text"                       2             4  
    "images"                            2             1  
    "Analyze text"                      3             4  
    "images"                            3             1  
    "videos"                            3             1  
    "MATLAB"                            3             1

Specify Maximum Number of Keywords Per Document

Create an array of tokenized documents containing the text data.

textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

Extract the top two keywords using the rakeKeywords function and setting the MaxNumKeywords option to 2.

tbl = rakeKeywords(documents,MaxNumKeywords=2)

tbl=6×3 table
                 Keyword                  DocumentNumber    Score
    __________________________________    ______________    _____

    "MATLAB"     "provides"    "tools"          1             8  
    "MATLAB"     ""            ""               1             2  
    "Analyze"    "text"        ""               2             4  
    "import"     "text"        ""               2             4  
    "Analyze"    "text"        ""               3             4  
    "images"     ""            ""               3             1

For readability, transform the multi-word keywords into a single string using the join and strip functions.

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl

tbl=6×3 table
            Keyword            DocumentNumber    Score
    _______________________    ______________    _____

    "MATLAB provides tools"          1             8  
    "MATLAB"                         1             2  
    "Analyze text"                   2             4  
    "import text"                    2             4  
    "Analyze text"                   3             4  
    "images"                         3             1

Input Arguments

`documents` — Input documents
`tokenizedDocument` array | string array | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
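As a minimal sketch of the string array form, the following passes a single document as a row vector of words. The words are illustrative and were split by hand; the period is kept as its own element so that the default punctuation delimiters can apply.

```matlab
% A single document given as a 1-by-N string array, one word per element.
words = ["MATLAB" "provides" "tools" "for" "scientists" "and" "engineers" "."];

% The input is one document, so every returned row refers to document 1.
tbl = rakeKeywords(words);
```

For more than one document, or to let the software split the text into words, convert the text with tokenizedDocument first.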

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: rakeKeywords(documents,MaxNumKeywords=20) returns at most 20 keywords per document.
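The two syntaxes can be compared side by side. This is a small sketch with an illustrative input document; both calls request the same option.

```matlab
documents = tokenizedDocument("Analyze text and images. You can import text and images.");

% R2021a and later: Name=Value syntax.
tblNew = rakeKeywords(documents,MaxNumKeywords=2);

% Any release: comma-separated name and value, with the name in quotes.
tblOld = rakeKeywords(documents,"MaxNumKeywords",2);
```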

`MaxNumKeywords` — Maximum number of keywords to return per document
`Inf` (default) | positive integer

Maximum number of keywords to return per document, specified as a positive integer or Inf.

If MaxNumKeywords is Inf, then the function returns all identified keywords.

`IgnoreKeywordCase` — Option to ignore keyword case
`0` (`false`) (default) | `1` (`true`)

Option to ignore keyword case, specified as one of the following:

0 (false) – extract case-sensitive keywords.
1 (true) – extract keywords ignoring case. Use this option when you expect the same keywords to appear with variations in letter case and want to treat them as the same keyword, for example, the words "analytics", "Analytics", and "ANALYTICS".

When IgnoreKeywordCase is 1, the function returns keywords with the most commonly occurring letter case pattern. When two or more patterns appear with the same frequency, then the function returns the keyword with the letter case pattern that occurs first in the input.
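One way to see the effect of this option is to run the function twice on text that repeats a word with different letter case. The sentence below is made up for illustration, and the exact keywords returned depend on the default delimiters.

```matlab
str = "Text analytics is useful. Analytics tools support text Analytics.";
documents = tokenizedDocument(str);

% Case-sensitive extraction (default): "analytics" and "Analytics" are
% treated as different tokens.
tblSensitive = rakeKeywords(documents);

% Case-insensitive extraction: letter case variations are treated as the
% same keyword.
tblIgnoreCase = rakeKeywords(documents,IgnoreKeywordCase=true);
```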

`Delimiters` — Tokens for splitting documents into keywords
string array | character vector | cell array of character vectors

Tokens for splitting documents into keywords, specified as a string array, a character vector, or a cell array of character vectors. If Delimiters is a character vector, then it must represent a single delimiter.

The default list of delimiters is a list of punctuation characters.

If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

To specify delimiters for merging, use the MergingDelimiters option.

Data Types: char | string | cell

`MergingDelimiters` — Delimiters also used for merging keywords
string array | character vector | cell array of character vectors

Delimiters also used for merging keywords, specified as a string array, a character vector, or a cell array of character vectors. If MergingDelimiters is a character vector, then it must represent a single delimiter.

The default list of merging delimiters is the list of stop words given by the stopWords function.

If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

To specify delimiters that should not be used for merging, use the Delimiters option.

Data Types: char | string | cell
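As a sketch of customizing the merging delimiters, the following starts from the default English stop words and adds one extra word. The added word "provides" is an arbitrary choice for illustration, and the Delimiters option is left at its default.

```matlab
documents = tokenizedDocument("MATLAB provides tools for scientists and engineers.");

% Default English stop words, reshaped to a row so the concatenation
% below does not depend on the orientation of the stopWords output.
defaultStopWords = reshape(stopWords,1,[]);

% Treat "provides" as an additional merging delimiter.
mergingDelimiters = [defaultStopWords "provides"];

tbl = rakeKeywords(documents,MergingDelimiters=mergingDelimiters);
```

Because merging delimiters also split the text, "provides" is no longer expected to appear as a word that begins or ends a candidate keyword.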

`IgnoreDelimiterCase` — Option to ignore delimiter case
`1` (`true`) (default) | `0` (`false`)

Option to ignore delimiter case, specified as one of the following:

1 (true) – ignore delimiter case.
0 (false) – use case-sensitive delimiters. Use this option when you expect keywords and delimiters that differ only by case, for example, the delimiter "and" and the acronym "AND".
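The following sketch contrasts the two settings on a made-up sentence in which "AND" and "OR" are names rather than conjunctions. It assumes that the default stop word list contains the lowercase words "and" and "or".

```matlab
documents = tokenizedDocument("Connect the AND gate and the OR gate.");

% Default: delimiter case is ignored, so "AND" and "OR" can match the
% stop words "and" and "or".
tblDefault = rakeKeywords(documents);

% Case-sensitive delimiters: uppercase "AND" and "OR" no longer match
% lowercase stop words and can remain part of a keyword.
tblCaseSensitive = rakeKeywords(documents,IgnoreDelimiterCase=false);
```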

Output Arguments

`tbl` — Extracted keywords and scores
table

Extracted keywords and scores, returned as a table with the following variables:

Keyword – Extracted keyword, returned as a 1-by-maxNgramLength string array, where maxNgramLength is the number of words in the longest keyword.
DocumentNumber – Document number containing the corresponding keyword.
Score – Score of keyword.

If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

If a keyword contains multiple words, then the ith element of the corresponding string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

For more information, see Rapid Automatic Keyword Extraction.
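Because short keywords are padded with empty strings, the number of words in each keyword can be recovered by counting the nonempty entries in each row. The sketch below adds that count as a table variable; the variable name NumWords is hypothetical and not part of the function output.

```matlab
documents = tokenizedDocument("MATLAB provides tools for scientists and engineers.");
tbl = rakeKeywords(documents);

% Count the nonempty entries in each row of the padded Keyword array.
tbl.NumWords = sum(tbl.Keyword ~= "",2);

% List the highest-scoring keywords first.
tbl = sortrows(tbl,"Score","descend");
```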

More About

Language Considerations

The rakeKeywords function supports English, Japanese, German, and Korean text only.

The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the function uses punctuation characters and the stop words given by the stopWords function as delimiters, where the stop word language is given by the language details of the input documents.

For other languages, specify an appropriate set of delimiters using the Delimiters and MergingDelimiters options.
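As a rough sketch for a language outside the supported set, the following supplies a custom list of merging delimiters for French text. The sentence and the stop word list are hand-picked for illustration only; a real application would need a more complete list. The Delimiters option is left at its default list of punctuation characters.

```matlab
str = "Le traitement du signal et la classification de texte sont des outils utiles.";
documents = tokenizedDocument(str);

% Hand-picked French function words (illustrative, not a complete list).
frenchStopWords = ["le" "la" "les" "du" "de" "des" "et" "sont"];

tbl = rakeKeywords(documents,MergingDelimiters=frenchStopWords);
```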

Tips

You can experiment with different keyword extraction algorithms to see what works best with your data. Because the RAKE algorithm uses a delimiter-based approach to extract candidate keywords, the extracted keywords can be very long. Alternatively, you can try extracting keywords using the TextRank algorithm, which starts with individual tokens as candidate keywords and then merges them when appropriate. To extract keywords using TextRank, use the textrankKeywords function. To learn more, see Extract Keywords from Text Data Using TextRank.
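A simple way to compare the two approaches is to run both functions on the same documents and inspect the results. The input sentence here is illustrative.

```matlab
documents = tokenizedDocument("MATLAB provides tools for scientists and engineers.");

% Delimiter-based candidates (RAKE).
tblRake = rakeKeywords(documents);

% Token-based candidates that are merged when appropriate (TextRank).
tblTextRank = textrankKeywords(documents);
```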

Algorithms

Rapid Automatic Keyword Extraction

For each document, the rakeKeywords function extracts keywords independently using the following steps based on [1]:

Determine candidate keywords:
- Extract sequences of tokens between the delimiters specified by the Delimiters and MergingDelimiters options. The function treats each sequence as a single candidate keyword.
Calculate scores for the candidate keywords:
- Create an undirected, weighted graph with nodes corresponding to the individual tokens in the candidate keywords.
- Add edges between nodes where tokens co-occur in a candidate keyword, including self co-occurrences. Weight each edge by the number of candidate keywords containing that co-occurrence.
- Score each token using the formula deg(token) / freq(token), where deg(token) is the sum of the edge weights for the specified token and freq(token) is the number of times that the specified token occurs in the document.
- For each candidate keyword, assign a score given by the sum of scores of the contained tokens.
Extract top keywords from candidates:
- If there are multiple instances of the same pair of candidate keywords separated by the same single merging delimiter, then merge the candidate keywords and the delimiter into a single keyword and sum the corresponding scores.
- Return the top k keywords, where k is given by the MaxNumKeywords option.
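The scoring step can be sketched in plain MATLAB for the first document of the examples on this page. This is an illustration under stated assumptions, not the implementation used by rakeKeywords: the candidate keywords are written by hand (inferred from the example output, which lists no keyword containing "used"), and deg(token) is computed as the sum of the co-occurrence counts for the token, including self co-occurrences.

```matlab
% Candidate keywords for the first example document, written by hand.
candidates = {["MATLAB" "provides" "tools"],"scientists","engineers", ...
    "MATLAB","scientists","engineers"};

tokens = unique([candidates{:}],"stable");
numTokens = numel(tokens);

% C(i,j) counts the candidates that contain both token i and token j.
% The diagonal holds the self co-occurrences.
C = zeros(numTokens);
freq = zeros(1,numTokens);
for k = 1:numel(candidates)
    [~,idx] = ismember(candidates{k},tokens);
    u = unique(idx);
    C(u,u) = C(u,u) + 1;
    % Count every occurrence of each token in the document.
    freq = freq + accumarray(idx(:),1,[numTokens 1])';
end

% Token score: deg(token) / freq(token).
deg = sum(C,2)';
tokenScore = deg ./ freq;

% Candidate score: sum of the scores of the contained tokens.
candidateScore = cellfun(@(c) sum(tokenScore(ismember(tokens,c))),candidates);

% "scientists" and "engineers" occur twice separated by "and", so the
% merged keyword "scientists and engineers" gets the summed score.
mergedScore = candidateScore(2) + candidateScore(3);
```

Here candidateScore is [8 1 1 2 1 1] and mergedScore is 2, which agrees with the scores reported for document 1 in the first example: 8 for "MATLAB provides tools", 2 for "MATLAB", 2 for "scientists and engineers", and 1 each for "scientists" and "engineers".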

Language Details

tokenizedDocument objects contain details about the tokens including language details. The language details of the input documents determine the behavior of rakeKeywords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the Language option of tokenizedDocument. To view the token details, use the tokenDetails function.
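As a short sketch, the following sets the language manually for a German sentence, inspects the token details, and then extracts keywords. The sentence is illustrative.

```matlab
str = "MATLAB bietet Werkzeuge für Wissenschaftler und Ingenieure.";

% Specify the language instead of relying on automatic detection.
documents = tokenizedDocument(str,Language="de");

% The Language variable of the token details shows the language in use.
tdetails = tokenDetails(documents);

% The default stop words follow the language details of the documents.
tbl = rakeKeywords(documents);
```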

References

[1] Rose, Stuart, Dave Engel, Nick Cramer, and Wendy Cowley. "Automatic keyword extraction from individual documents." Text mining: applications and theory 1 (2010): 1-20.

Version History

Introduced in R2020b
