tokenDetails

Details of tokens in tokenized document array

Syntax

tdetails = tokenDetails(documents)

Description

tdetails = tokenDetails(documents) returns a table of token details for the tokens in the tokenizedDocument array documents.

Examples

View Token Details of Documents

Create a tokenized document array.

str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)

      Token       DocumentNumber    LineNumber       Type        Language
    __________    ______________    __________    ___________    ________

    "This"              1               1         letters           en   
    "is"                1               1         letters           en   
    "an"                1               1         letters           en   
    "example"           1               1         letters           en   
    "document"          1               1         letters           en   
    "."                 1               1         punctuation       en   
    "It"                1               1         letters           en   
    "has"               1               1         letters           en

The Type variable contains the type of each token. View the emoticons in the documents.

idx = tdetails.Type == "emoticon";
tdetails(idx,:)

ans=2×5 table
    Token    DocumentNumber    LineNumber      Type      Language
    _____    ______________    __________    ________    ________

    ":)"           2               1         emoticon       en   
    ":D"           3               1         emoticon       en
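
The same table can be summarized instead of filtered. The following is a minimal sketch, assuming the three example documents above and assuming that Type is returned as a categorical variable (as the unquoted display suggests). The names typeCounts and emoticonTokens are illustrative and do not appear in the original example.

```matlab
% Recreate the example documents (same text as in the example above).
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);
tdetails = tokenDetails(documents);

% Count how many tokens of each type occur across all documents.
% groupsummary adds a GroupCount variable to the result.
typeCounts = groupsummary(tdetails,"Type");

% Extract only the emoticon tokens as a string array.
emoticonTokens = tdetails.Token(tdetails.Type == "emoticon");
```

Displaying typeCounts gives one row per token type, which can be a quick check of what a corpus contains before further preprocessing.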

Add Sentence Details to Documents

Create a tokenized document array.

str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);

Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)

      Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    __________    ______________    ______________    __________    ___________    ________

    "This"              1                 1               1         letters           en   
    "is"                1                 1               1         letters           en   
    "an"                1                 1               1         letters           en   
    "example"           1                 1               1         letters           en   
    "document"          1                 1               1         letters           en   
    "."                 1                 1               1         punctuation       en   
    "It"                1                 2               1         letters           en   
    "has"               1                 2               1         letters           en

View the token details of the second sentence of the third document.

idx = tdetails.DocumentNumber == 3 & ...
    tdetails.SentenceNumber == 2;
tdetails(idx,:)

ans=6×6 table
       Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    ___________    ______________    ______________    __________    ___________    ________

    "It"                 3                 2               1         letters           en   
    "also"               3                 2               1         letters           en   
    "has"                3                 2               1         letters           en   
    "two"                3                 2               1         letters           en   
    "sentences"          3                 2               1         letters           en   
    "."                  3                 2               1         punctuation       en
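
The selected rows can also be turned back into text. The following is a minimal sketch, assuming the same three documents as above; the variable name sentenceText is illustrative. Joining tokens with spaces only approximates the original sentence, because punctuation tokens are also separated by a space.

```matlab
% Recreate the example documents and add sentence details.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = addSentenceDetails(tokenizedDocument(str));
tdetails = tokenDetails(documents);

% Select the tokens of the second sentence of the third document.
idx = tdetails.DocumentNumber == 3 & ...
    tdetails.SentenceNumber == 2;

% Join the selected tokens with spaces along the first dimension.
sentenceText = join(tdetails.Token(idx)," ",1);
```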

Add Part-of-Speech Details to Documents

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)

       Token       DocumentNumber    LineNumber     Type      Language
    ___________    ______________    __________    _______    ________

    "fairest"            1               1         letters       en   
    "creatures"          1               1         letters       en   
    "desire"             1               1         letters       en   
    "increase"           1               1         letters       en   
    "thereby"            1               1         letters       en   
    "beautys"            1               1         letters       en   
    "rose"               1               1         letters       en   
    "might"              1               1         letters       en

Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)

       Token       DocumentNumber    SentenceNumber    LineNumber     Type      Language     PartOfSpeech 
    ___________    ______________    ______________    __________    _______    ________    ______________

    "fairest"            1                 1               1         letters       en       adjective     
    "creatures"          1                 1               1         letters       en       noun          
    "desire"             1                 1               1         letters       en       noun          
    "increase"           1                 1               1         letters       en       noun          
    "thereby"            1                 1               1         letters       en       adverb        
    "beautys"            1                 1               1         letters       en       noun          
    "rose"               1                 1               1         letters       en       noun          
    "might"              1                 1               1         letters       en       auxiliary-verb
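
Once the PartOfSpeech variable is present, it can be used to filter the table in the same way as Type. The following is a minimal sketch that, to stay self-contained, assumes a short illustrative sentence in place of the sonnets data. The names isNoun, nounDetails, and nouns are illustrative, and the tags assigned to individual words depend on the tagger.

```matlab
% Illustrative text (an assumption, not the sonnets data used above).
str = "The quick brown fox jumps over the lazy dog.";
documents = tokenizedDocument(str);

% Add part-of-speech tags, then retrieve the token details.
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% Keep only the rows tagged as nouns.
isNoun = tdetails.PartOfSpeech == "noun";
nounDetails = tdetails(isNoun,:);

% List the distinct noun tokens.
nouns = unique(nounDetails.Token);
```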

Input Arguments

`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

Output Arguments

`tdetails` — Table of token details
table

Table of token details. tdetails has the following variables:

Name	Description
`Token`	Token text, returned as a string scalar.
`DocumentNumber`	Index of document that the token belongs to, returned as a positive integer.
`SentenceNumber`	Sentence number of token in document, returned as a positive integer. If these details are missing, then first add sentence details to `documents` using the `addSentenceDetails` function.
`LineNumber`	Line number of token in document, returned as a positive integer.
`Type`	The type of token, returned as one of these types:
	`letters`: string of letter characters only
	`digits`: string of digits only
	`punctuation`: string of punctuation and symbol characters only
	`email-address`: detected email address
	`web-address`: detected web address
	`hashtag`: detected hashtag (starts with `"#"` character followed by a letter)
	`at-mention`: detected at-mention (starts with `"@"` character, followed by 1 to 15 ASCII letter, digit, or underscore characters)
	`emoticon`: detected emoticon
	`emoji`: detected emoji
	`other`: does not belong to the previous types and is not a custom type
	If these details are missing, then first add type details to `documents` using the `addTypeDetails` function.
`Language`	Language of the token, returned as one of these languages:
	`en`: English
	`ja`: Japanese
	`de`: German
	`ko`: Korean
	These language details determine the behavior of the `removeStopWords`, `addPartOfSpeechDetails`, `normalizeWords`, `addSentenceDetails`, and `addEntityDetails` functions on the tokens.
	If these details are missing, then first add language details to `documents` using the `addLanguageDetails` function. For more information about language support in Text Analytics Toolbox™, see Language Considerations.
`PartOfSpeech`	Part of speech tag, returned as one of these tags:
	`adjective`: Adjective
	`adposition`: Adposition
	`adverb`: Adverb
	`auxiliary-verb`: Auxiliary verb
	`coord-conjunction`: Coordinating conjunction
	`determiner`: Determiner
	`interjection`: Interjection
	`noun`: Noun
	`numeral`: Numeral
	`particle`: Particle
	`pronoun`: Pronoun
	`proper-noun`: Proper noun
	`punctuation`: Punctuation
	`subord-conjunction`: Subordinating conjunction
	`symbol`: Symbol
	`verb`: Verb
	`other`: Other
	If these details are missing, then first add part-of-speech details to `documents` using the `addPartOfSpeechDetails` function.
`Entity`	Entity tag, returned as one of these tags:
	`location`: detected location
	`organization`: detected organization
	`person`: detected person
	`other`: detected entity, not belonging to the above categories
	`non-entity`: no entity detected
	If these details are missing, then first add entity details to `documents` using the `addEntityDetails` function.
`Lemma`	Lemma form. If these details are missing, then first add lemma details to `documents` using the `addLemmaDetails` function.
`Head`	Grammatical dependency head, returned as the index of the token that this token modifies. If these details are missing, then first add grammatical dependency details to `documents` using the `addDependencyDetails` function.
`Dependency`	Grammatical dependency type, returned as one of these tags. The dependency types listed here are only a subset. For a complete list of dependency types, including subtypes, see [1].
	`acl`: clausal modifier of noun (adnominal clause)
	`advcl`: adverbial clause modifier
	`advmod`: adverbial modifier
	`amod`: adjectival modifier
	`appos`: appositional modifier
	`aux`: auxiliary
	`case`: case marking
	`cc`: coordinating conjunction
	`ccomp`: clausal complement
	`clf`: classifier
	`compound`: compound
	`conj`: conjunct
	`cop`: copula
	`csubj`: clausal subject
	`dep`: unspecified dependency
	`det`: determiner
	`discourse`: discourse element
	`dislocated`: dislocated elements
	`expl`: expletive
	`fixed`: fixed multiword expression
	`flat`: flat multiword expression
	`goeswith`: goes with
	`iobj`: indirect object
	`list`: list
	`mark`: marker
	`nmod`: nominal modifier
	`nsubj`: nominal subject
	`nummod`: numeric modifier
	`obj`: object
	`obl`: oblique nominal
	`orphan`: orphan
	`parataxis`: parataxis
	`punct`: punctuation
	`reparandum`: overridden disfluency
	`root`: root
	`vocative`: vocative
	`xcomp`: open clausal complement
	If these details are missing, then first add grammatical dependency details to `documents` using the `addDependencyDetails` function.
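
Several of these variables appear only after the corresponding add function has been called, so code that relies on one of them may want to check for it first. The following is a minimal sketch, assuming a single illustrative document; the names vars and hasSentenceNumber are illustrative. It tests for SentenceNumber through the table's Properties.VariableNames and adds sentence details only if the variable is missing.

```matlab
% Illustrative document (an assumption for this sketch).
documents = tokenizedDocument("This is an example document. It has two sentences.");
tdetails = tokenDetails(documents);

% List the variables that tokenDetails currently returns.
vars = string(tdetails.Properties.VariableNames);

% Add sentence details only if SentenceNumber is not yet available.
if ~ismember("SentenceNumber",vars)
    documents = addSentenceDetails(documents);
    tdetails = tokenDetails(documents);
end

% Confirm that the variable is now part of the table.
vars = string(tdetails.Properties.VariableNames);
hasSentenceNumber = ismember("SentenceNumber",vars);
```

The same pattern applies to the other optional variables, with the matching add function named in each table row.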

References

[1] Universal Dependency Relations. https://universaldependencies.org/u/dep/index.html

Version History

Introduced in R2018a

R2018b: `tokenDetails` returns token type `emoji` for emoji characters

Starting in R2018b, tokenizedDocument detects emoji characters and the tokenDetails function reports these tokens with type "emoji". This makes it easier to analyze text containing emoji characters.

In R2018a, tokenDetails reports emoji characters with type "other". To find the tokens with type "emoji" or "other" in either release, use the logical index idx = tdetails.Type == "emoji" | tdetails.Type == "other", where tdetails is a table of token details.
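
As a minimal sketch of that release-independent check, assuming an illustrative string that contains one emoji character separated from the surrounding words by spaces (the text and the name emojiLike are not from the original page):

```matlab
% Illustrative text containing an emoji character (an assumption).
documents = tokenizedDocument("Great day 😀 today");
tdetails = tokenDetails(documents);

% Match tokens typed "emoji" (R2018b and later) or "other" (R2018a).
idx = tdetails.Type == "emoji" | tdetails.Type == "other";

% Rows of the token details table for the matched tokens.
emojiLike = tdetails(idx,:);
```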
