Unicode decomposed normalized form (NFD)
Normalize String to Unicode Canonical Decomposition Form
Strings that look identical can have different underlying representations. The Unicode canonical decomposition form (NFD) ensures that canonically equivalent strings share a single, unique binary representation.
Consider the string "jalapeño", which contains 8 letters.
str = "jalapeño";
strlength(str)
ans = 8
Normalize the string using the textanalytics.unicode.nfd function. On some systems, the output string appears to be identical to the input string.
newStr = textanalytics.unicode.nfd(str)
newStr = "jalapeño"
View the number of code points in the new string. The normalized representation includes one extra code point. In this case, the function splits the accented letter "ñ" into two separate code points.
strlength(newStr)
ans = 9
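To see which code points the normalization produced, one option is to convert the string to numeric values. The following is a minimal sketch rather than part of the original example; it assumes Text Analytics Toolbox is available, and the variable name codes is hypothetical. Because "ñ" (U+00F1) canonically decomposes into "n" (U+006E) followed by the combining tilde (U+0303), the seventh and eighth values are expected to be those two code points.

```matlab
str = "jalapeño";
newStr = textanalytics.unicode.nfd(str);

% Convert the string to a character vector, then to numeric UTF-16 code units.
% For characters in the Basic Multilingual Plane, each code unit is one code point.
codes = double(char(newStr));

% Print each value in U+XXXX notation, separated by spaces.
fprintf("U+%04X ", codes);
fprintf("\n");
```

Running the same commands on str instead of newStr should show a single value for "ñ", which accounts for the difference in length.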
Extract the seventh and eighth code points in the normalized string. On some systems, the output appears to be a single character.
extractBetween(newStr,7,8)
ans = "ñ"
Check whether the strings str and newStr are equal using the == operator. The operator returns 0 because the strings have different underlying representations.
tf = str == newStr
tf = logical 0
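A common way to compare strings that may differ only in representation is to normalize both operands first. The sketch below illustrates the idea under the assumption that Text Analytics Toolbox is installed; the helper isEquivalent is a hypothetical name introduced here, not a toolbox function.

```matlab
str = "jalapeño";
newStr = textanalytics.unicode.nfd(str);

% Hypothetical helper: compare two strings after normalizing both to NFD.
isEquivalent = @(a,b) textanalytics.unicode.nfd(a) == textanalytics.unicode.nfd(b);

% Direct comparison is sensitive to the underlying representation.
tfRaw = str == newStr

% Comparison after normalization ignores the representation difference.
tfNormalized = isEquivalent(str,newStr)
```

Because normalizing an already normalized string leaves it unchanged, the second comparison treats the two strings as equal even though the first does not.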
str — Input text
string array | character vector | cell array of character vectors
Input text, specified as a string array, character vector, or cell array of character vectors.
["An example of a short sentence."; "A second short
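The following sketch shows how a string array input might be passed to the function. It assumes that the function normalizes each element independently and returns an array of the same size, and that the literals below are stored in precomposed form; the variable names words, normalized, and lengths are hypothetical.

```matlab
% Hypothetical input: a 2-by-1 string array containing accented letters.
words = ["jalapeño"; "café"];

% Normalize every element (assumed to operate element-wise).
normalized = textanalytics.unicode.nfd(words);

% Compare lengths before (first column) and after (second column) normalization.
lengths = [strlength(words) strlength(normalized)]
```

If the input literals are precomposed, each normalized element should be one code unit longer than its original.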
Unicode Normalization Forms
For more information about Unicode normalization forms, see Unicode Standard Annex #15: Unicode Normalization Forms.
Whistler, Ken, ed. "Unicode Standard Annex #15: Unicode Normalization Forms." Unicode Technical Reports, August 27, 2021. https://unicode.org/reports/tr15/.
Introduced in R2021a