replace

Replace substrings in documents

Syntax

newDocuments = replace(documents,old,new)

Description

newDocuments = replace(documents,old,new) replaces all occurrences of the substring or pattern old in documents with new.

Tip

Use the replace function to replace substrings within the words of documents by specifying substrings or patterns. To replace entire words or n-grams in documents, use the replaceWords and replaceNgrams functions, respectively.

Examples

Replace Substrings in Documents

Replace words in a document array.

documents = tokenizedDocument([
    "an extreme example"
    "another extreme example"])

documents = 
  2x1 tokenizedDocument:

    3 tokens: an extreme example
    3 tokens: another extreme example

newDocuments = replace(documents,"example","sentence")

newDocuments = 
  2x1 tokenizedDocument:

    3 tokens: an extreme sentence
    3 tokens: another extreme sentence

Replace substrings of the words.

newDocuments = replace(documents,"ex","X-")

newDocuments = 
  2x1 tokenizedDocument:

    3 tokens: an X-treme X-ample
    3 tokens: another X-treme X-ample
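
The distinction drawn in the tip above, substring replacement versus whole-word replacement, can be sketched by running replace and replaceWords on the same input. This is a minimal illustration, not part of the original example set: the variable names bySubstring and byWord are hypothetical, and joinWords is used only to turn each document back into a single string for comparison. Because replace matches substrings inside tokens, it is expected to alter the token "another" as well as "an", whereas replaceWords should change only tokens that equal "an" exactly.

```matlab
% Illustrative documents (same as the example above, redefined so the
% snippet runs on its own). Requires Text Analytics Toolbox.
documents = tokenizedDocument([
    "an extreme example"
    "another extreme example"]);

% Substring replacement: "an" is matched anywhere inside a token.
bySubstring = replace(documents,"an","the");

% Whole-word replacement: only tokens equal to "an" are changed.
byWord = replaceWords(documents,"an","the");

% Join the tokens of each document back into one string to compare results.
disp(joinWords(bySubstring))
disp(joinWords(byWord))
```

Comparing the two displayed results shows whether a given substitution is safe to do at the substring level or should be restricted to whole tokens.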

Replace Substrings in Documents Using Patterns

Replace digits in documents using a digits pattern.

Create an array of tokenized documents.

textData = [
    "Text Analytics Toolbox provides over 50 functions to analyze text data."
    "The bm25Similarity function measures document similarity."];
documents = tokenizedDocument(textData);

Replace instances of consecutive digits with the token "<NUMBER>" using the replace function. Specify a digits pattern using the digitsPattern function.

pat = digitsPattern;
newDocuments = replace(documents,pat,"<NUMBER>")

newDocuments = 
  2x1 tokenizedDocument:

    12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
     7 tokens: The bm<NUMBER>Similarity function measures document similarity .

Notice that the function replaces the digits in the token "bm25Similarity".

To replace tokens consisting entirely of digits, use the replace function and specify a pattern that also includes text boundaries. Specify text boundaries using the textBoundary function.

pat = textBoundary + digitsPattern + textBoundary;
newDocuments = replace(documents,pat,"<NUMBER>")

newDocuments = 
  2x1 tokenizedDocument:

    12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
     7 tokens: The bm25Similarity function measures document similarity .

In this case, the function does not replace the digits in the token "bm25Similarity".

Input Arguments

`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

`old` — Substring or pattern to replace
string array | character vector | cell array of character vectors | `pattern` array

Substring or pattern to replace, specified as a string array, character vector, cell array of character vectors, or pattern array.

`new` — New substring
string array | character vector | cell array of character vectors

New substring, specified as a string array, character vector, or cell array of character vectors.

Data Types: string | char | cell

Output Arguments

`newDocuments` — Output documents
`tokenizedDocument` array

Output documents, returned as a tokenizedDocument array.

Version History

Introduced in R2017b
