bagOfWords

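Bag-of-words model

Description

A bag-of-words model (also known as a term-frequency counter) records the number of times that words appear in each document of a collection.

bagOfWords does not split text into words. To create an array of tokenized documents, see tokenizedDocument.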
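
Because bagOfWords does not split text itself, raw text is typically tokenized first. The following is a minimal sketch of that two-step workflow; the two sample sentences are made up for illustration and are not part of the shipped example data.

```matlab
% Illustrative input: two short sentences (assumed sample text).
str = [
    "an example of a short sentence"
    "a second short sentence"];

% Split each string into words; bagOfWords expects words, not raw text.
documents = tokenizedDocument(str);

% Count the words in each tokenized document.
bag = bagOfWords(documents);

numDocs = bag.NumDocuments;   % one row of counts per input string
numWords = bag.NumWords;      % number of distinct words across both strings
```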

Creation

Syntax

bag = bagOfWords

bag = bagOfWords(documents)

bag = bagOfWords(uniqueWords,counts)

Description

bag = bagOfWords creates an empty bag-of-words model.

bag = bagOfWords(documents) counts the words appearing in documents and returns a bag-of-words model.

bag = bagOfWords(uniqueWords,counts) creates a bag-of-words model using the words in uniqueWords and the corresponding frequency counts in counts.

Input Arguments

`documents` — Input documents
`tokenizedDocument` array | string array | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
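
As a small sketch of the non-tokenizedDocument case, a string row vector is treated as one document whose elements are the words. The word list below is invented for illustration.

```matlab
% A single document given as a 1-by-N string array of words (assumed sample).
words = ["an" "example" "of" "a" "short" "sentence" "a" "short" "one"];
bag = bagOfWords(words);

numDocs = bag.NumDocuments;   % the whole row vector counts as one document

% Look up the count for "a" by matching against the vocabulary.
countA = full(bag.Counts(1,bag.Vocabulary == "a"));
```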

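`uniqueWords` — Unique word list
string vector | cell array of character vectors

Unique word list, specified as a string vector or a cell array of character vectors. If uniqueWords contains <missing>, then the function ignores the missing values. The size of uniqueWords must be 1-by-V, where V is the number of columns of counts.

Example: ["an" "example" "list"]

Data Types: string | cell

`counts` — Frequency counts of words
matrix of nonnegative integers

Frequency counts of words corresponding to uniqueWords, specified as a matrix of nonnegative integers. The value counts(i,j) corresponds to the number of times the word uniqueWords(j) appears in the ith document.

counts must have numel(uniqueWords) columns.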
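
To make the row and column convention concrete, here is a brief sketch with made-up counts: rows index documents and columns index the entries of uniqueWords.

```matlab
% Three words and two documents (values chosen only for illustration).
uniqueWords = ["an" "example" "list"];
counts = [ ...
    1 2 0;    % document 1
    0 1 3];   % document 2

bag = bagOfWords(uniqueWords,counts);

% counts(2,3): how often uniqueWords(3) ("list") appears in document 2.
c = full(bag.Counts(2,3));
numDocs = bag.NumDocuments;   % equals the number of rows of counts
```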

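Properties

`Counts` — Word counts per document
sparse matrix

Word counts per document, specified as a sparse matrix.

`NumDocuments` — Number of documents seen
nonnegative integer

Number of documents seen, specified as a nonnegative integer.

`NumWords` — Number of unique words in model
nonnegative integer

Number of unique words in the model, specified as a nonnegative integer.

`Vocabulary` — Unique words in model
string vector

Unique words in the model, specified as a string vector.

Data Types: string

Object Functions

`encode`	Encode documents as matrix of word or n-gram counts
`tfidf`	Term frequency-inverse document frequency (tf-idf) matrix
`topkwords`	Most important words in bag-of-words model or LDA topic
`addDocument`	Add documents to bag-of-words or bag-of-n-grams model
`removeDocument`	Remove documents from bag-of-words or bag-of-n-grams model
`removeEmptyDocuments`	Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
`removeWords`	Remove selected words from documents or bag-of-words model
`removeInfrequentWords`	Remove words with low counts from bag-of-words model
`join`	Combine multiple bag-of-words or bag-of-n-grams models
`wordcloud`	Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model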

Examples

Create Bag-of-Words Model

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

        NumWords: 3092
          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ] (1×3092 string)
    NumDocuments: 154

View the top 10 words and their total counts.

tbl = topkwords(bag,10)

tbl=10×2 table
     Word      Count
    _______    _____

    "thy"       281 
    "thou"      234 
    "love"      162 
    "thee"      161 
    "doth"       88 
    "mine"       63 
    "shall"      59 
    "eyes"       56 
    "sweet"      55 
    "time"       53

Create Bag-of-Words Model from Unique Words and Counts

Create a bag-of-words model using a string array of unique words and a matrix of word counts.

uniqueWords = ["a" "an" "another" "example" "final" "sentence" "third"];
counts = [ ...
    1 2 0 1 0 1 0;
    0 0 3 1 0 4 0;
    1 0 0 5 0 3 1;
    1 0 0 1 7 0 0];
bag = bagOfWords(uniqueWords,counts)

bag = 
  bagOfWords with properties:

        NumWords: 7
          Counts: [4×7 double]
      Vocabulary: ["a"    "an"    "another"    "example"    "final"    "sentence"    "third"]
    NumDocuments: 4

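Import Text from Multiple Files Using a File Datastore

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify extractFileText as the read function.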

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords

bag = 
  bagOfWords with properties:

        NumWords: 0
          Counts: []
      Vocabulary: [1×0 string]
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag

bag = 
  bagOfWords with properties:

        NumWords: 276
          Counts: [4×276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    "by"    …    ] (1×276 string)
    NumDocuments: 4

Remove Stop Words from Bag-of-Words Model

Remove the stop words from a bag-of-words model by passing a list of stop words to removeWords. Stop words are words such as "a", "the", and "in" that are commonly removed from text before analysis.

documents = tokenizedDocument([
    "an example of a short sentence" 
    "a second short sentence"]);
bag = bagOfWords(documents);
newBag = removeWords(bag,stopWords)

newBag = 
  bagOfWords with properties:

        NumWords: 4
          Counts: [2×4 double]
      Vocabulary: ["example"    "short"    "sentence"    "second"]
    NumDocuments: 2
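
A related cleanup step from the Object Functions list is removeInfrequentWords. The sketch below reuses the two sample sentences from this example and assumes the two-argument form removeInfrequentWords(bag,count), which drops words whose total count is at most count.

```matlab
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents);

% Drop every word that appears at most once across all documents.
% Only "a", "short", and "sentence" occur twice, so they remain.
newBag = removeInfrequentWords(bag,1);

remaining = newBag.NumWords;
```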

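Most Frequent Words of Bag-of-Words Model

Create a table of the most frequent words of a bag-of-words model. Start by loading the example data from sonnetsPreprocessed.txt, splitting it into documents at newline characters, and tokenizing the documents.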

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

        NumWords: 3092
          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ] (1×3092 string)
    NumDocuments: 154

Find the top five words. When you do not specify k, topkwords returns five words.

T = topkwords(bag);

Find the top 20 words in the model.

k = 20;
T = topkwords(bag,k)

T=20×2 table
      Word      Count
    ________    _____

    "thy"        281 
    "thou"       234 
    "love"       162 
    "thee"       161 
    "doth"        88 
    "mine"        63 
    "shall"       59 
    "eyes"        56 
    "sweet"       55 
    "time"        53 
    "beauty"      52 
    "nor"         52 
    "art"         51 
    "yet"         51 
    "o"           50 
    "heart"       50 
      ⋮

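Create Tf-idf Matrix

Create a term frequency-inverse document frequency (tf-idf) matrix from a bag-of-words model. Start by loading the example data from sonnetsPreprocessed.txt, splitting it into documents at newline characters, and tokenizing the documents.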

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

        NumWords: 3092
          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ] (1×3092 string)
    NumDocuments: 154

Create a tf-idf matrix. View the first 10 rows and columns.

M = tfidf(bag);
full(M(1:10,1:10))

ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    3.8918    2.4720    2.5520
         0         0         0         0         0    4.5287         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0
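
The values above are consistent with raw term counts multiplied by log(N/NT), where N is the number of documents and NT is the number of documents containing the word (for example, log(154/2) ≈ 4.3438). The following sketch recomputes the matrix by hand under that assumption, using a small made-up corpus rather than the sonnets; the variable names are illustrative.

```matlab
% Small illustrative corpus (assumed sample text).
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"
    "a third example"]);
bag = bagOfWords(documents);

counts = full(bag.Counts);          % raw term frequencies
N = bag.NumDocuments;               % number of documents
NT = sum(counts > 0,1);             % documents containing each word
Mmanual = counts .* log(N ./ NT);   % assumed weighting: tf * log(N/NT)

% Compare against tfidf with its default options.
M = full(tfidf(bag));
maxDiff = max(abs(M(:) - Mmanual(:)));
```

A word that occurs in every document, such as "a" here, gets weight log(1) = 0 under this weighting.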

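Create Word Cloud from Bag-of-Words Model

Load the example data from sonnetsPreprocessed.txt, split it into documents at newline characters, and tokenize the documents.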

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    "contracted"    …    ]
        NumWords: 3092
    NumDocuments: 154

Visualize the bag-of-words model using a word cloud.

figure
wordcloud(bag);

Figure contains an object of type wordcloud.

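Create Bag-of-Words Model in Parallel

If your text data is contained in multiple files in a folder, then you can import the text data and create a bag-of-words model in parallel using parfor. If you have Parallel Computing Toolbox™ installed, then the parfor loop runs in parallel. Otherwise, it runs in serial. Use join to combine an array of bag-of-words models into one model.

Create a list of file names. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet.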

filenames = [
    "exampleSonnet1.txt"
    "exampleSonnet2.txt"
    "exampleSonnet3.txt"
    "exampleSonnet4.txt"];

Create a bag-of-words model from a collection of files. Initialize an empty bag-of-words model, and then loop over the files and create a bag-of-words model for each file.

bag = bagOfWords;

numFiles = numel(filenames);
parfor i = 1:numFiles
    filename = filenames(i);
    
    textData = extractFileText(filename);
    document = tokenizedDocument(textData);
    bag(i) = bagOfWords(document);
end

Combine the bag-of-words models using join.

bag = join(bag)

bag = 
  bagOfWords with properties:

        NumWords: 276
          Counts: [4×276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    "by"    …    ] (1×276 string)
    NumDocuments: 4
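
The same per-document build followed by join can be tried without the example files. This is a minimal serial sketch that uses two made-up in-memory sentences in place of the sonnet files.

```matlab
% Illustrative documents standing in for the contents of separate files.
docs = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% Build one model per document, starting from an empty model.
bags = bagOfWords;
for i = 1:numel(docs)
    bags(i) = bagOfWords(docs(i));
end

% Merge the per-document models into a single model.
bag = join(bags);

numDocs = bag.NumDocuments;
numWords = bag.NumWords;
```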

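Tips

If you intend to use a held-out test set for your work, then partition your text data before using bagOfWords. Otherwise, the vocabulary and counts of the bag-of-words model include the test documents, which can bias your analysis.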
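
One way to follow this tip is sketched below: build the model from the training documents only, then use encode to map both partitions onto the training vocabulary. The sample sentences and the fixed index split are assumptions made for illustration; in practice the split would usually be random.

```matlab
% Illustrative corpus (assumed sample text).
textData = [
    "an example of a short sentence"
    "a second short sentence"
    "a third example sentence"
    "an unseen final sentence"];
documents = tokenizedDocument(textData);

% Hypothetical split: first three documents for training, last one held out.
idxTrain = 1:3;
idxTest = 4;

% Build the vocabulary from the training documents only.
bag = bagOfWords(documents(idxTrain));

% Encode both partitions as count matrices over the training vocabulary.
XTrain = encode(bag,documents(idxTrain));
XTest = encode(bag,documents(idxTest));

% Words seen only in the test document do not enter the model.
hasUnseen = any(bag.Vocabulary == "unseen");
```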

Version History

Introduced in R2017b
