Prepare Text Data for Analysis

This example shows how to create a function that cleans and preprocesses text data for analysis.

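Large text data sets often contain noise that degrades statistical analysis. For example, text data can contain the following:

Variations in case, for example "new" and "New"
Variations in word forms, for example "walk" and "walking"
Words that add noise, for example stop words such as "the" and "of"
Punctuation and special characters
HTML and XML tags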
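
As a minimal sketch of two of these noise sources, the snippet below strips markup with eraseTags and normalizes case with lower. The sample string is made up for illustration and is not taken from the example data.

```matlab
% Hypothetical sample string (not from factoryReports.csv).
str = "The New mixer is <b>walking</b> off the belt.";

% Strip HTML/XML tags, then convert to lowercase so that "New" and "new"
% are counted as the same word.
str = eraseTags(str);
str = lower(str)
```

Word-form variations such as "walking" are not handled here; they are addressed by the lemmatization step later in this example.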

The following word clouds illustrate word frequency analysis applied to the raw text data from factory reports and to a preprocessed version of the same text data.

Load and Extract Text Data

Load the example data. The file factoryReports.csv contains factory reports, including a text description and a categorical label for each event.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');

Extract the text data from the Description field and the label data from the Category field.

textData = data.Description;
labels = data.Category;
textData(1:10)

ans = 10×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
    "Burst pipe in the constructing agent is spraying coolant."
    "A fuse is blown in the mixer."
    "Things continue to tumble off of the belt."
    "Falling items from the conveyor belt."
    "The scanner reel is split, it will soon begin to curve."

Create Tokenized Documents

Create an array of tokenized documents.

cleanedDocuments = tokenizedDocument(textData);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    10 tokens: Items are occasionally getting stuck in the scanner spools .
    11 tokens: Loud rattling and banging sounds are coming from assembler pistons .
    11 tokens: There are cuts to the power when starting the plant .
     6 tokens: Fried capacitors in the assembler .
     5 tokens: Mixer tripped the fuses .
    10 tokens: Burst pipe in the constructing agent is spraying coolant .
     8 tokens: A fuse is blown in the mixer .
     9 tokens: Things continue to tumble off of the belt .
     7 tokens: Falling items from the conveyor belt .
    13 tokens: The scanner reel is split , it will soon begin to curve .

To improve lemmatization, add part-of-speech details to the documents using addPartOfSpeechDetails. Use the addPartOfSpeechDetails function before removing stop words and lemmatizing.

cleanedDocuments = addPartOfSpeechDetails(cleanedDocuments);

Words like "a", "and", "to", and "the" (known as stop words) can add noise to data. Remove a list of stop words using the removeStopWords function. Use the removeStopWords function before using the normalizeWords function.

cleanedDocuments = removeStopWords(cleanedDocuments);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    7 tokens: Items occasionally getting stuck scanner spools .
    8 tokens: Loud rattling banging sounds coming assembler pistons .
    5 tokens: cuts power starting plant .
    4 tokens: Fried capacitors assembler .
    4 tokens: Mixer tripped fuses .
    7 tokens: Burst pipe constructing agent spraying coolant .
    4 tokens: fuse blown mixer .
    6 tokens: Things continue tumble off belt .
    5 tokens: Falling items conveyor belt .
    8 tokens: scanner reel split , soon begin curve .
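
If the default list is not enough for a particular domain, one possible approach is to inspect it with stopWords and then remove additional words with removeWords. The sketch below assumes a made-up sentence and a hypothetical custom word list; neither comes from the example data.

```matlab
% Hypothetical sentence for illustration.
documents = tokenizedDocument("The mixer tripped the fuses in the plant.");

% Inspect the size of the default stop word list.
words = stopWords;
numel(words)

% Remove the default stop words first.
documents = removeStopWords(documents);

% Then remove extra domain-specific words. This list is an assumption
% made for this sketch. Note that removeWords is case sensitive by default.
customStopWords = ["plant" "factory"];
documents = removeWords(documents,customStopWords)
```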

Lemmatize the words using normalizeWords.

cleanedDocuments = normalizeWords(cleanedDocuments,'Style','lemma');
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    7 tokens: items occasionally get stuck scanner spool .
    8 tokens: loud rattle bang sound come assembler piston .
    5 tokens: cut power start plant .
    4 tokens: fry capacitor assembler .
    4 tokens: mixer trip fuse .
    7 tokens: burst pipe constructing agent spray coolant .
    4 tokens: fuse blow mixer .
    6 tokens: thing continue tumble off belt .
    5 tokens: fall item conveyor belt .
    8 tokens: scanner reel split , soon begin curve .
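
For comparison, normalizeWords also supports stemming through the 'Style','stem' option. Stems are not always dictionary words, which is one reason this example uses lemmatization. The following sketch applies both styles to one sentence from the example data so that the results can be inspected side by side.

```matlab
% One sentence from the example data.
documents = tokenizedDocument( ...
    "Loud rattling and banging sounds are coming from assembler pistons.");

% Part-of-speech details help the lemmatizer choose the right root form.
documents = addPartOfSpeechDetails(documents);

% Reduce words to dictionary root forms.
lemmas = normalizeWords(documents,'Style','lemma')

% Reduce words to stems, which may not be dictionary words.
stems = normalizeWords(documents,'Style','stem')
```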

Erase the punctuation from the documents.

cleanedDocuments = erasePunctuation(cleanedDocuments);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    6 tokens: items occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
    6 tokens: burst pipe constructing agent spray coolant
    3 tokens: fuse blow mixer
    5 tokens: thing continue tumble off belt
    4 tokens: fall item conveyor belt
    6 tokens: scanner reel split soon begin curve

Remove words with 2 or fewer characters and words with 15 or more characters.

cleanedDocuments = removeShortWords(cleanedDocuments,2);
cleanedDocuments = removeLongWords(cleanedDocuments,15);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

    6 tokens: items occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
    6 tokens: burst pipe constructing agent spray coolant
    3 tokens: fuse blow mixer
    5 tokens: thing continue tumble off belt
    4 tokens: fall item conveyor belt
    6 tokens: scanner reel split soon begin curve
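
None of the first ten documents change in this step, because every remaining token has between 3 and 14 characters. To see the length filters take effect, here is a small sketch on a made-up sentence that is not part of the example data.

```matlab
% Hypothetical sentence containing several very short words.
documents = tokenizedDocument("an example of a short sentence");

% Remove words with 2 or fewer characters ("an", "of", "a").
documents = removeShortWords(documents,2);

% Remove words with 15 or more characters (none in this sentence).
documents = removeLongWords(documents,15)
```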

Create Bag-of-Words Model

Create a bag-of-words model.

cleanedBag = bagOfWords(cleanedDocuments)

cleanedBag = 
  bagOfWords with properties:

          Counts: [480×352 double]
      Vocabulary: [1×352 string]
        NumWords: 352
    NumDocuments: 480

Remove words that appear two times or fewer in the bag-of-words model.

cleanedBag = removeInfrequentWords(cleanedBag,2)

cleanedBag = 
  bagOfWords with properties:

          Counts: [480×163 double]
      Vocabulary: [1×163 string]
        NumWords: 163
    NumDocuments: 480
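
The effect of the count threshold is easier to see on a toy model. The sketch below builds a small bag-of-words model from three made-up, already cleaned documents, removes words that appear only once, and lists the remaining words with topkwords. The documents and the threshold of 1 are assumptions chosen for illustration.

```matlab
% Three hypothetical, already preprocessed documents.
documents = tokenizedDocument([
    "fuse blow mixer"
    "mixer trip fuse"
    "scanner reel split"]);

bag = bagOfWords(documents);

% Remove words that appear one time or fewer. Only "fuse" and "mixer"
% appear twice, so only they remain in the vocabulary.
bag = removeInfrequentWords(bag,1);

% List the most frequent remaining words with their counts.
tbl = topkwords(bag,5)
```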

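Some preprocessing steps, such as removeInfrequentWords, can leave empty documents in the bag-of-words model. To ensure that no empty documents remain in the bag-of-words model after preprocessing, use removeEmptyDocuments as the last step.

Remove empty documents from the bag-of-words model and the corresponding labels from labels.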

[cleanedBag,idx] = removeEmptyDocuments(cleanedBag);
labels(idx) = [];
cleanedBag

cleanedBag = 
  bagOfWords with properties:

          Counts: [480×163 double]
      Vocabulary: [1×163 string]
        NumWords: 163
    NumDocuments: 480
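
In this data set no documents are empty, so the model still contains 480 documents. The following sketch, using made-up documents and hypothetical labels, shows a case where a document does become empty and how the returned indices keep the labels aligned with the model.

```matlab
% Three hypothetical documents with hypothetical labels.
documents = tokenizedDocument([
    "fuse blow mixer"
    "scanner reel split"
    "mixer trip fuse"]);
labels = ["Electronic Failure"; "Mechanical Failure"; "Electronic Failure"];

bag = bagOfWords(documents);

% Every word in the second document appears only once, so removing
% infrequent words leaves that document empty.
bag = removeInfrequentWords(bag,1);

% Remove the empty document and the matching label.
[bag,idx] = removeEmptyDocuments(bag);
labels(idx) = [];

% Sanity check: one label per remaining document.
assert(numel(labels) == bag.NumDocuments)
```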

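Create Preprocessing Function

It can be useful to create a function that performs preprocessing so that you can prepare different collections of text data in the same way. For example, you can use a function to preprocess new data using the same steps as the training data.

Create a function that tokenizes and preprocesses the text data so that it can be used for analysis. The function preprocessText performs the following steps: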

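Tokenize the text using tokenizedDocument.
Add part-of-speech details using addPartOfSpeechDetails.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.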

Use the example preprocessing function preprocessText, listed at the end of this example, to prepare the text data.

newText = "The sorting machine is making lots of loud noises.";
newDocuments = preprocessText(newText)

newDocuments = 
  tokenizedDocument:

   6 tokens: sorting machine make lot loud noise

Compare with Raw Data

Compare the preprocessed data with the raw data.

rawDocuments = tokenizedDocument(textData);
rawBag = bagOfWords(rawDocuments)

rawBag = 
  bagOfWords with properties:

          Counts: [480×555 double]
      Vocabulary: [1×555 string]
        NumWords: 555
    NumDocuments: 480

Calculate the reduction in data, measured as the fraction of vocabulary words removed.

numWordsCleaned = cleanedBag.NumWords;
numWordsRaw = rawBag.NumWords;
reduction = 1 - numWordsCleaned/numWordsRaw

reduction = 0.7063
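
The same figure can be checked by hand from the two vocabulary sizes reported above (163 words after cleaning, 555 words in the raw data). The snippet below is only a sketch that repeats the arithmetic with those numbers hard-coded and prints it as a percentage.

```matlab
% Vocabulary sizes taken from the outputs shown in this example.
numWordsCleaned = 163;
numWordsRaw = 555;

% Fraction of the raw vocabulary removed by preprocessing.
reduction = 1 - numWordsCleaned/numWordsRaw;

fprintf("Vocabulary reduced by %.1f%% (%d of %d words removed)\n", ...
    100*reduction, numWordsRaw - numWordsCleaned, numWordsRaw);
```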

Compare the raw data and the cleaned data by visualizing the two bag-of-words models using word clouds.

figure
subplot(1,2,1)
wordcloud(rawBag);
title("Raw Data")
subplot(1,2,2)
wordcloud(cleanedBag);
title("Cleaned Data")

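Preprocessing Function

The function preprocessText performs the following steps in order:

Tokenize the text using tokenizedDocument.
Add part-of-speech details using addPartOfSpeechDetails.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.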

function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Remove a list of stop words then lemmatize the words. To improve
% lemmatization, first use addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end

See Also