stopWords

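Description

Words like "a", "and", "to", and "the" (known as stop words) can add noise to data. A stop word list gives you a starting point for building a custom list of words to remove before analysis.

To remove the default list of stop words from tokenized documents using the language details of the documents, use removeStopWords. To remove a custom list of words from tokenized documents, use removeWords.

This function returns stop word lists for English, Japanese, German, and Korean.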
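
The difference between the two removal functions can be sketched as follows. This is a minimal illustration, not part of the original page: the sentence and the variable names `cleanedDefault` and `cleanedCustom` are made up, and the `'Language'` option of `tokenizedDocument` is assumed to be available in your release.

```matlab
% Tokenize a short English sentence (illustrative text, not the sonnets data).
documents = tokenizedDocument("an example of a short sentence",'Language','en');

% Default list: removeStopWords chooses the list from the language details.
cleanedDefault = removeStopWords(documents);

% Custom list: removeWords removes exactly the words that you pass in.
cleanedCustom = removeWords(documents,["an" "of" "a"]);
```

Use removeStopWords when the default list is sufficient, and removeWords when you need control over exactly which words are removed.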

words = stopWords returns a string array of common English words that can be removed from documents before analysis.

words = stopWords('Language',language) specifies the stop word language.
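
As a brief sketch of the second syntax, the following requests the Korean list by its language code. The variable names `wordsKo` and `numWords` are illustrative, and the snippet assumes a release whose stop word lists include Korean.

```matlab
% Request the Korean stop word list by its language code.
wordsKo = stopWords('Language','ko');

% Count the entries in the returned string array.
numWords = numel(wordsKo);
```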

Examples

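To remove the default list of stop words using the language details of the documents, use removeStopWords.

To remove a custom list of stop words, use the removeWords function. You can use the stop word list returned by the stopWords function as a starting point.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.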

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the first few documents.

documents(1:5)
ans = 
  5x1 tokenizedDocument:

    70 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory thou contracted thine own bright eyes feedst thy lights flame selfsubstantial fuel making famine abundance lies thy self thy foe thy sweet self cruel thou art worlds fresh ornament herald gaudy spring thine own bud buriest thy content tender churl makst waste niggarding pity world else glutton eat worlds due grave thee
    71 tokens: forty winters shall besiege thy brow dig deep trenches thy beautys field thy youths proud livery gazed tatterd weed small worth held asked thy beauty lies treasure thy lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd thy beautys thou couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made thou art old thy blood warm thou feelst cold
    65 tokens: look thy glass tell face thou viewest time face form another whose fresh repair thou renewest thou dost beguile world unbless mother fair whose uneard womb disdains tillage thy husbandry fond tomb selflove stop posterity thou art thy mothers glass thee calls back lovely april prime thou windows thine age shalt despite wrinkles thy golden time thou live rememberd die single thine image dies thee
    71 tokens: unthrifty loveliness why dost thou spend upon thy self thy beautys legacy natures bequest gives nothing doth lend frank lends free beauteous niggard why dost thou abuse bounteous largess thee give profitless usurer why dost thou great sum sums yet canst live traffic thy self alone thou thy self thy sweet self dost deceive nature calls thee gone acceptable audit canst thou leave thy unused beauty tombed thee lives th executor
    61 tokens: hours gentle work frame lovely gaze every eye doth dwell play tyrants same unfair fairly doth excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet

Create a list of stop words that starts with the output of the stopWords function.

customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];

Remove the custom stop words from the documents and view the first few documents.

documents = removeWords(documents,customStopWords);
documents(1:5)
ans = 
  5x1 tokenizedDocument:

    62 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory contracted thine own bright eyes feedst lights flame selfsubstantial fuel making famine abundance lies self foe sweet self cruel art worlds fresh ornament herald gaudy spring thine own bud buriest content tender churl makst waste niggarding pity world else glutton eat worlds due grave
    61 tokens: forty winters shall besiege brow dig deep trenches beautys field youths proud livery gazed tatterd weed small worth held asked beauty lies treasure lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd beautys couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made art old blood warm feelst cold
    52 tokens: look glass tell face viewest time face form another whose fresh repair renewest beguile world unbless mother fair whose uneard womb disdains tillage husbandry fond tomb selflove stop posterity art mothers glass calls back lovely april prime windows thine age shalt despite wrinkles golden time live rememberd die single thine image dies
    52 tokens: unthrifty loveliness why spend upon self beautys legacy natures bequest gives nothing lend frank lends free beauteous niggard why abuse bounteous largess give profitless usurer why great sum sums yet canst live traffic self alone self sweet self deceive nature calls gone acceptable audit canst leave unused beauty tombed lives th executor
    59 tokens: hours gentle work frame lovely gaze every eye dwell play tyrants same unfair fairly excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet
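
One way to confirm that a removal like the one above took effect is to check the remaining vocabulary against the custom list. The following is a self-contained sketch on a two-document toy corpus rather than the sonnets data; the example text and the variable names `bag` and `remaining` are assumptions made for illustration.

```matlab
% Build a tiny corpus and the same kind of custom stop word list.
documents = tokenizedDocument(["thou art fair" "thy sweet self"]);
customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];

% Remove the custom stop words.
documents = removeWords(documents,customStopWords);

% Collect the remaining vocabulary and look for leftover stop words.
bag = bagOfWords(documents);
remaining = intersect(bag.Vocabulary,customStopWords);
```

If `remaining` is empty, none of the listed words survived the removal.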

Get a list of English stop words using the stopWords function. Reshape the output for readability.

words = stopWords;
reshape(words,[25 9])
ans = 25x9 string
    "a"          "but"         "during"     "hows"       "it's"     "said"         "this"       "we’re"      "who’ve"  
    "about"      "by"          "each"       "however"    "it’s"     "says"         "those"      "we've"      "whove"   
    "above"      "can"         "either"     "i"          "its"      "see"          "through"    "we’ve"      "will"    
    "across"     "can't"       "for"        "i'd"        "let's"    "she"          "to"         "weve"       "with"    
    "after"      "can’t"       "from"       "i’d"        "let’s"    "she'd"        "too"        "were"       "within"  
    "all"        "cant"        "given"      "i'll"       "lets"     "she’d"        "towards"    "what"       "without" 
    "along"      "cannot"      "had"        "i’ll"       "may"      "shed"         "under"      "what's"     "won't"   
    "also"       "could"       "has"        "i'm"        "me"       "she'll"       "until"      "what’s"     "won’t"   
    "am"         "couldn't"    "have"       "i’m"        "more"     "she’ll"       "us"         "whats"      "would"   
    "an"         "couldn’t"    "having"     "im"         "most"     "shell"        "use"        "when"       "wouldn't"
    "and"        "couldnt"     "he"         "i've"       "much"     "should"       "used"       "when's"     "wouldn’t"
    "any"        "did"         "he'd"       "i’ve"       "must"     "since"        "uses"       "when’s"     "you"     
    "are"        "didn't"      "he’d"       "ive"        "my"       "so"           "using"      "whens"      "you'd"   
    "aren't"     "didn’t"      "hed"        "if"         "no"       "some"         "very"       "where"      "you’d"   
    "aren’t"     "didnt"       "he'll"      "in"         "not"      "such"         "want"       "whether"    "youd"    
    "arent"      "do"          "he’ll"      "instead"    "now"      "than"         "was"        "which"      "you'll"  
    "as"         "does"        "her"        "into"       "of"       "that"         "wasn't"     "while"      "you’ll"  
    "at"         "doesn't"     "here"       "is"         "on"       "the"          "wasn’t"     "who"        "youll"   
    "be"         "doesn’t"     "hers"       "isn't"      "one"      "their"        "wasnt"      "who'll"     "you're"  
    "because"    "doesnt"      "him"        "isn’t"      "only"     "them"         "we"         "who’ll"     "you’re"  
    "been"       "doing"       "himself"    "isnt"       "or"       "then"         "we'd"       "wholl"      "youre"   
    "before"     "done"        "his"        "it"         "other"    "there"        "we’d"       "who's"      "you've"  
    "being"      "don't"       "how"        "it'll"      "our"      "therefore"    "we'll"      "who’s"      "you’ve"  
    "between"    "don’t"       "how's"      "it’ll"      "out"      "these"        "we’ll"      "whos"       "youve"   
    "both"       "dont"        "how’s"      "itll"       "over"     "they"         "we're"      "who've"     "your"    

Get a list of Japanese stop words using the stopWords function. Reshape the output for readability.

words = stopWords('Language','ja');
reshape([words strings(1,8)],[35 11])
ans = 35x11 string
    "あそこ"      "さらい"      "なかば"      "下"    "今"    "地"      "列"    "秋"      "本当"     "う"       "どう" 
    "あたり"      "さん"       "なに"       "字"    "部"    "員"      "事"    "冬"      "確か"     "え"       "な"   
    "あちら"      "しかた"      "など"       "年"    "課"    "線"      "士"    "一"      "時点"     "お"       "ない" 
    "あっち"      "しよう"      "なん"       "月"    "係"    "点"      "台"    "二"      "全部"     "か"       "なり" 
    "あと"       "すか"       "はじめ"      "日"    "外"    "書"      "集"    "三"      "関係"     "が"       "なる" 
    "あな"       "ずつ"       "はず"       "時"    "類"    "品"      "様"    "四"      "近く"     "こそ"     "に"   
    "あなた"      "すね"       "はるか"      "分"    "達"    "力"      "所"    "五"      "方法"     "この"     "ね"   
    "あれ"       "すべて"      "ひと"       "秒"    "気"    "法"      "歴"    "六"      "我々"     "さ"       "の"   
    "いくつ"      "ぜんぶ"      "ひとつ"      "週"    "室"    "感"      "器"    "七"      "違い"     "さえ"     "ので" 
    "いつ"       "そう"       "ふく"       "火"    "口"    "作"      "名"    "八"      "多く"     "し"       "のに" 
    "いま"       "そこ"       "ぶり"       "水"    "誰"    "元"      "情"    "九"      "扱い"     "しか"     "は"   
    "いや"       "そちら"      "べつ"       "木"    "用"    "手"      "連"    "十"      "新た"     "する"     "ばかり"
    "いろいろ"    "そっち"      "へん"       "金"    "界"    "数"      "毎"    "百"      "その後"    "ず"       "へ"   
    "うち"       "そで"       "ぺん"       "土"    "会"    "彼"      "式"    "千"      "半ば"     "せる"     "ほど" 
    "おおまか"    "それ"       "ほう"       "国"    "首"    "彼女"    "簿"    "万"      "結局"     "そして"    "ます" 
    "おまえ"      "それぞれ"    "ほか"       "都"    "男"    "子"      "回"    "億"      "様々"     "その"     "ませ" 
    "おれ"       "それなり"    "まさ"       "道"    "女"    "内"      "匹"    "兆"      "以前"     "た"       "また" 
    "がい"       "たくさん"    "まし"       "府"    "別"    "楽"      "個"    "下記"    "以後"     "たい"     "まで" 
    "かく"       "たち"       "まとも"      "県"    "話"    "喜"      "席"    "上記"    "以降"     "ただ"     "も"   
    "かたち"      "たび"       "まま"       "市"    "私"    "怒"      "束"    "時間"    "未満"     "だ"       "や"   
    "かやの"      "ため"       "みたい"      "区"    "屋"    "哀"      "歳"    "今回"    "以上"     "だけ"     "やら" 
    "から"       "だめ"       "みつ"       "町"    "店"    "輪"      "目"    "前回"    "以下"     "だに"     "よ"   
    "がら"       "ちゃ"       "みなさん"    "村"    "家"    "頃"      "通"    "場合"    "幾つ"     "だの"     "より" 
    "きた"       "ちゃん"      "みんな"      "各"    "場"    "化"      "面"    "一つ"    "毎日"     "ち"       "れる" 
    "くせ"       "てん"       "もと"       "第"    "等"    "境"      "円"    "年生"    "自体"     "って"     "わ"   
    "ここ"       "とおり"      "もの"       "方"    "見"    "俺"      "玉"    "自分"    "向こう"    "て"       "を"   
    "こっち"      "とき"       "もん"       "何"    "際"    "奴"      "枚"    "ヶ所"    "何人"     "で"       "ん"   
    "こと"       "どこ"       "やつ"       "的"    "観"    "高"      "前"    "ヵ所"    "手段"     "でし"     ""     
    "ごと"       "どこか"      "よう"       "度"    "段"    "校"      "後"    "カ所"    "同じ"     "です"     ""     
    "こちら"      "ところ"      "よそ"       "文"    "略"    "婦"      "左"    "箇所"    "感じ"     "では"     ""     
      ⋮
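
The call to reshape above pads the list with empty strings because the number of elements must exactly fill the requested grid. Rather than hard-coding the padding, it can be computed from the list length, as in this sketch. The names `numCols`, `numRows`, `padding`, and `grid` are illustrative and do not appear in the original example.

```matlab
% Get the stop word list as a row vector of strings.
words = stopWords('Language','ja');

% Choose the number of columns and derive the number of rows.
numCols = 11;
numRows = ceil(numel(words)/numCols);

% Pad with empty strings so that the element count fills the grid.
padding = strings(1,numRows*numCols - numel(words));
grid = reshape([words padding],[numRows numCols]);
```

Computing the padding this way keeps the display code working if the length of the list changes between releases.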

Get a list of German stop words using the stopWords function. Reshape the output for readability.

words = stopWords('Language','de');
reshape([words strings(1,7)],[25 8])
ans = 25x8 string
    "ab"         "dann"      "doch"       "hattet"     "jene"        "mein"       "seine"      "welcher"
    "aber"       "das"       "du"         "her"        "jenem"       "meine"      "seinem"     "welches"
    "alle"       "dass"      "durch"      "hin"        "jenen"       "meinem"     "seinen"     "wenn"   
    "allem"      "daß"       "ein"        "hätte"      "jener"       "meinen"     "seiner"     "wer"    
    "allen"      "dein"      "eine"       "hättest"    "jenes"       "meiner"     "seines"     "werde"  
    "aller"      "deine"     "einem"      "hättet"     "kann"        "meines"     "sich"       "werden" 
    "alles"      "deinem"    "einen"      "ich"        "kannst"      "mich"       "sie"        "weshalb"
    "als"        "deiner"    "einer"      "ihm"        "kein"        "mir"        "sind"       "wie"    
    "also"       "deines"    "eines"      "ihn"        "keine"       "mit"        "so"         "wieder" 
    "am"         "dem"       "er"         "ihr"        "keinem"      "muss"       "um"         "wieso"  
    "an"         "den"       "es"         "ihre"       "keinen"      "musst"      "und"        "wir"    
    "andere"     "denn"      "euch"       "ihrem"      "keiner"      "musste"     "uns"        "wirst"  
    "anderem"    "der"       "euer"       "ihren"      "keines"      "muß"        "unter"      "wo"     
    "anderen"    "derer"     "eure"       "ihrer"      "können"      "müssen"     "vom"        "während"
    "anderer"    "des"       "eurem"      "ihres"      "könnte"      "müssten"    "von"        "zu"     
    "anderes"    "dessen"    "euren"      "im"         "könnten"     "nach"       "vor"        "zum"    
    "auch"       "dich"      "eures"      "in"         "könntest"    "nicht"      "war"        "zur"    
    "auf"        "die"       "für"        "ins"        "ließ"        "nichts"     "waren"      "über"   
    "aus"        "dies"      "ganz"       "ist"        "man"         "noch"       "warst"      ""       
    "bei"        "diese"     "gar"        "ja"         "manche"      "nun"        "warum"      ""       
    "bin"        "diesem"    "habe"       "jede"       "manchem"     "nur"        "was"        ""       
    "bis"        "diesen"    "haben"      "jedem"      "manchen"     "ob"         "weil"       ""       
    "bist"       "dieser"    "hat"        "jeden"      "mancher"     "oder"       "welche"     ""       
    "da"         "dieses"    "hatte"      "jeder"      "manches"     "seid"       "welchem"    ""       
    "damit"      "dir"       "hattest"    "jedes"      "mehr"        "sein"       "welchen"    ""       

Input Arguments

Stop word language, specified as one of the following:

  • 'en' – English

  • 'ja' – Japanese

  • 'de' – German

  • 'ko' – Korean

For more information about language support in Text Analytics Toolbox™, see Language Considerations.

More About

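Language Considerations

The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only.

To remove stop words from other languages, use removeWords and specify your own list of stop words to remove.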
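
For a language without a built-in list, the approach described above might look like the following sketch. The French word list is hand-picked for illustration and is not shipped with the toolbox; the example sentence and the variable name `frenchStopWords` are also assumptions.

```matlab
% Hypothetical, hand-picked French stop words (not a toolbox-provided list).
frenchStopWords = ["le" "la" "les" "de" "et" "un" "une"];

% Tokenize an illustrative French sentence.
documents = tokenizedDocument("le chat et la souris");

% Remove the user-specified stop words.
documents = removeWords(documents,frenchStopWords);
```

In practice, you would replace the short list here with a fuller list suited to your language and data.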

Version History

Introduced in R2017b