Japanese Language Support

This topic summarizes the Text Analytics Toolbox™ features that support Japanese text. For an example showing how to analyze Japanese text data, see Analyze Japanese Text Data.


Tokenization

The tokenizedDocument function automatically detects Japanese input. Alternatively, set the 'Language' option in tokenizedDocument to 'ja'. This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
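
As a minimal sketch of setting the language explicitly, the following code passes the 'Language' option and then inspects the language details. The sample sentence is an assumption: it is joined from the token output shown in the tokenization example in this topic.

```matlab
% Sample sentence (assumed; joined from token output shown in this topic).
str = "空に星が輝き、瞬いている。";

% Set the language explicitly instead of relying on automatic detection.
documents = tokenizedDocument(str,'Language','ja');

% Inspect the language recorded for the tokens.
tdetails = tokenDetails(documents);
unique(tdetails.Language)
```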

To specify additional MeCab options for tokenization, create a mecabOptions object. To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.
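
The following sketch shows the shape of that workflow. It assumes that calling mecabOptions with no arguments returns the default Japanese tokenizer options. Any custom settings, such as a user dictionary, would be supplied as name-value arguments in that call. The sample sentence is joined from the token output shown in this topic.

```matlab
% Sample sentence (assumed; joined from token output shown in this topic).
str = "空に星が輝き、瞬いている。";

% Create a MeCab options object. Called without arguments here, on the
% assumption that this yields the default Japanese tokenizer options.
options = mecabOptions;

% Tokenize using the MeCab options object.
documents = tokenizedDocument(str,'TokenizeMethod',options)
```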

Tokenize Japanese Text

Tokenize Japanese text using tokenizedDocument. The function automatically detects Japanese text.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
documents = tokenizedDocument(str)
documents = 
  4x1 tokenizedDocument:

     6 tokens: 恋 に 悩み 、 苦しむ 。
     6 tokens: 恋 の 悩み で 苦しむ 。
    10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
    10 tokens: 空 の 星 が 輝き を 増し て いる 。

Part of Speech Details

The tokenDetails function, by default, includes part of speech details with the token details.

Get Part of Speech Details of Japanese Text

Tokenize Japanese text using tokenizedDocument.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで 苦しむ。"];
documents = tokenizedDocument(str);

For Japanese text, you can get the part-of-speech details using tokenDetails. For English text, you must first use addPartOfSpeechDetails.

tdetails = tokenDetails(documents);
head(tdetails)
     Token     DocumentNumber    LineNumber       Type        Language    PartOfSpeech     Lemma       Entity  
    _______    ______________    __________    ___________    ________    ____________    _______    __________

    "恋"             1               1         letters           ja       noun            "恋"       non-entity
    "に"             1               1         letters           ja       adposition      "に"       non-entity
    "悩み"           1               1         letters           ja       verb            "悩む"      non-entity
    "、"             1               1         punctuation       ja       punctuation     "、"       non-entity
    "苦しむ"          1               1         letters           ja       verb            "苦しむ"    non-entity
    "。"             1               1         punctuation       ja       punctuation     "。"       non-entity
    "恋"             2               1         letters           ja       noun            "恋"       non-entity
    "の"             2               1         letters           ja       adposition      "の"       non-entity
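
Because the token details are returned as a table, you can filter on the PartOfSpeech variable. The following sketch selects only the tokens tagged as nouns. The variable name isNoun is illustrative, and the first sample sentence is an assumption joined from the token table above.

```matlab
% Sample sentences (the first is joined from the token table above).
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで 苦しむ。"];
documents = tokenizedDocument(str);
tdetails = tokenDetails(documents);

% Logical index of the rows whose part-of-speech tag is "noun".
isNoun = tdetails.PartOfSpeech == "noun";

% Show the token, its document number, and its lemma for the noun rows.
tdetails(isNoun,["Token" "DocumentNumber" "Lemma"])
```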

Named Entity Recognition

The tokenDetails function, by default, includes entity details with the token details.

Add Named Entity Tags to Japanese Text

Tokenize Japanese text using tokenizedDocument.

str = [
documents = tokenizedDocument(str);

For Japanese text, the software automatically adds named entity tags, so you do not need to use the addEntityDetails function. The software detects person names, locations, organizations, and other named entities. To view the entity details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)
       Token        DocumentNumber    LineNumber     Type      Language    PartOfSpeech       Lemma          Entity  
    ____________    ______________    __________    _______    ________    ____________    ____________    __________

    "マリー"               1               1         letters       ja       proper-noun     "マリー"         person    
    "さん"                1               1         letters       ja       noun            "さん"           person    
    "は"                  1               1         letters       ja       adposition      "は"            non-entity
    "ボストン"             1               1         letters       ja       proper-noun     "ボストン"        location  
    "から"                1               1         letters       ja       adposition      "から"           non-entity
    "ニューヨーク"          1               1         letters       ja       proper-noun     "ニューヨーク"    location  
    "に"                  1               1         letters       ja       adposition      "に"            non-entity
    "引っ越し"             1               1         letters       ja       verb            "引っ越す"        non-entity

View the words tagged with entity "person", "location", "organization", or "other". These are the words that are not tagged "non-entity".

idx = tdetails.Entity ~= "non-entity";
tdetails(idx,:).Token
ans = 11x1 string

Stop Words

To remove stop words from documents according to the token language details, use removeStopWords. For a list of Japanese stop words, set the 'Language' option in stopWords to 'ja'.
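
For instance, a short sketch of retrieving the Japanese stop word list and previewing it might look like this. The variable name words is illustrative.

```matlab
% Get the list of Japanese stop words.
words = stopWords('Language','ja');

% Check how many stop words the list contains and preview the first few.
numel(words)
words(1:5)
```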

Remove Japanese Stop Words

Tokenize Japanese text using tokenizedDocument. The function automatically detects Japanese text.

str = [
documents = tokenizedDocument(str);

Remove stop words using removeStopWords. The function uses the language details from documents to determine which language's stop words to remove.

documents = removeStopWords(documents)
documents = 
  3x1 tokenizedDocument:

     4 tokens: 静か 、 とても 穏やか
    10 tokens: 企業 顧客 データ 利用 、 今年 売り上げ 調べる 出来 。
     5 tokens: 先生 。 英語 教え 。


Lemmatization

To lemmatize tokens according to the token language details, use normalizeWords and set the 'Style' option to 'lemma'.
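
As a minimal sketch, the following code sets the 'Style' option explicitly. The sample sentence is an assumption: it is joined from the token output shown in the tokenization example in this topic.

```matlab
% Sample sentence (assumed; joined from token output shown in this topic).
str = "空に星が輝き、瞬いている。";
documents = tokenizedDocument(str);

% Lemmatize the tokens, requesting lemmatization explicitly.
documents = normalizeWords(documents,'Style','lemma')
```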

Lemmatize Japanese Text

Tokenize Japanese text using the tokenizedDocument function. The function automatically detects Japanese text.

str = [
documents = tokenizedDocument(str);

Lemmatize the tokens using normalizeWords.

documents = normalizeWords(documents)
documents = 
  4x1 tokenizedDocument:

    10 tokens: 空 に 星 が 輝く 、 瞬く て いる 。
    10 tokens: 空 の 星 が 輝き を 増す て いる 。
     9 tokens: 駅 まで は 遠い て 、 歩ける ない 。
     7 tokens: 遠く の 駅 まで 歩ける ない 。

Language-Independent Features

Word and N-Gram Counting

The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
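
The following sketch counts words and bigrams in a small Japanese corpus. The sample sentences are an assumption: they are joined from the token output shown in the tokenization example in this topic. The variable names are illustrative.

```matlab
% Sample sentences (assumed; joined from token output shown in this topic).
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
documents = tokenizedDocument(str);

% Count individual words and list the most frequent ones.
bag = bagOfWords(documents);
topkwords(bag,5)

% Count bigrams (n-grams of length 2) and list the most frequent ones.
bagBigrams = bagOfNgrams(documents,'NgramLengths',2);
topkngrams(bagBigrams,5)
```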

Modeling and Prediction

The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
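
As a sketch of the modeling workflow, the following code fits a two-topic LDA model and a two-component LSA model to a bag-of-words model. The sample sentences are joined from the token output shown in the tokenization example in this topic. The corpus is far too small to give meaningful topics, and the number of topics is an arbitrary choice for illustration.

```matlab
% Sample sentences (assumed; joined from token output shown in this topic).
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
documents = tokenizedDocument(str);
bag = bagOfWords(documents);

% Fit an LDA model with two topics (arbitrary choice), without verbose output.
numTopics = 2;
ldaMdl = fitlda(bag,numTopics,'Verbose',0);

% List the top words of the first topic.
topkwords(ldaMdl,5,1)

% Fit an LSA model with two components (arbitrary choice).
lsaMdl = fitlsa(bag,2);
```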

The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
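
The following sketch trains a small word embedding on tokenized Japanese text and looks up one word vector. The sample sentences are joined from the token output shown in the tokenization example in this topic. The option values are illustrative choices for a tiny corpus, not recommended settings.

```matlab
% Sample sentences (assumed; joined from token output shown in this topic).
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
documents = tokenizedDocument(str);

% Train a low-dimensional embedding. MinCount is set to 1 so that words
% appearing only once in this tiny corpus are kept in the vocabulary.
emb = trainWordEmbedding(documents, ...
    'Dimension',10, ...
    'MinCount',1, ...
    'NumEpochs',20, ...
    'Verbose',0);

% Map a word from the corpus to its embedding vector.
vec = word2vec(emb,"星");
size(vec)
```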

