
Word-by-Word Text Generation Using Deep Learning

This example shows how to train a deep learning LSTM network to generate text word by word.

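To train a deep learning network for word-by-word text generation, train a sequence-to-sequence LSTM network to predict the next word in a sequence of words. To train the network to predict the next word, specify the responses to be the input sequences shifted by one time step.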
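The shift by one time step can be illustrated on a single short sentence. The following is a minimal sketch in base MATLAB, not the datastore's actual implementation; the sentence is made up, and the token names "startOfText" and "EndOfText" are taken from the code later in this example.

```matlab
% Hypothetical tokenized sentence (illustration only).
words = ["Alice" "was" "very" "tired" "."];

% Predictors: the "start of text" token followed by every word.
X = ["startOfText" words];

% Responses: the same words shifted by one time step,
% closed with the "end of text" token.
Y = [words "EndOfText"];

% Each column pairs an input token with the token to predict next.
pairs = [X; Y]
```

At every time step the response is the token that follows the input token, so the network learns a next-word distribution.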

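This example reads text from a website. It reads and parses the HTML code to extract the relevant text, then uses a custom mini-batch datastore documentGenerationDatastore to input the documents to the network as mini-batches of sequence data. The datastore converts documents to sequences of numeric word indices. The deep learning network is an LSTM network that contains a word embedding layer.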
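To make "sequences of numeric word indices" concrete, the following sketch maps a made-up list of tokens to indices using only base MATLAB functions. It is an illustration of the idea under simplified assumptions (vocabulary ordered by first appearance), not the word encoding that the datastore uses internally.

```matlab
% Hypothetical tokens from one document (illustration only).
words = ["the" "rabbit" "saw" "the" "white" "rabbit"];

% Build a vocabulary in order of first appearance.
vocabulary = unique(words,"stable");

% Replace each word with its position in the vocabulary.
[~,sequence] = ismember(words,vocabulary)

% The mapping is invertible: indexing the vocabulary recovers the words.
recovered = vocabulary(sequence);
```

Repeated words share one index, so the sequence is a compact numeric representation that a word embedding layer can consume.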

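A mini-batch datastore is an implementation of a datastore with support for reading data in batches. You can use a mini-batch datastore as a source of training, validation, test, and prediction data sets for deep learning applications. Use mini-batch datastores to read data that is too large to fit in memory or to perform specific preprocessing operations when reading batches of data.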

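You can adapt the custom mini-batch datastore specified by documentGenerationDatastore.m to your data by customizing its functions. This file is attached to this example as a supporting file. To access this file, open the example as a live script. For an example showing how to create your own custom mini-batch datastore, see Develop Custom Mini-Batch Datastore (Deep Learning Toolbox).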

Load Training Data

Load the training data. Read the HTML code of Alice's Adventures in Wonderland by Lewis Carroll from Project Gutenberg.

url = "https://www.gutenberg.org/files/11/11-h/11-h.htm";
code = webread(url);

Parse HTML Code

The HTML code contains the relevant text inside <p> (paragraph) elements. Extract the relevant text by parsing the HTML code using htmlTree and then finding all the elements with the element name "p".

tree = htmlTree(code);
selector = "p";
subtrees = findElement(tree,selector);
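The same two calls can be tried offline on a small inline HTML string. This is a minimal sketch that assumes Text Analytics Toolbox is installed; the HTML content is made up for illustration.

```matlab
% Made-up HTML with two paragraphs and two non-paragraph elements.
code = "<html><body><h1>Title</h1><p>First paragraph.</p>" + ...
    "<div>Menu</div><p>Second paragraph.</p></body></html>";

% Parse the HTML and keep only the <p> elements.
tree = htmlTree(code);
subtrees = findElement(tree,"p");

% Count the matching elements; the <h1> and <div> elements are not selected.
numParagraphs = numel(subtrees)
```

Selecting by element name filters out headings and navigation markup before any text is extracted.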

Extract the text data from the HTML subtrees using extractHTMLText and view the first 10 paragraphs.

textData = extractHTMLText(subtrees);
textData(1:10)
ans = 10×1 string
    "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”"
    "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her."
    "There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."
    "In another moment down went Alice after it, never once considering how in the world she was to get out again."
    "The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well."
    "Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled “ORANGE MARMALADE”, but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody underneath, so managed to put it into one of the cupboards as she fell past it."
    "“Well!” thought Alice to herself, “after such a fall as this, I shall think nothing of tumbling down stairs! How brave they’ll all think me at home! Why, I wouldn’t say anything about it, even if I fell off the top of the house!” (Which was very likely true.)"
    "Down, down, down. Would the fall never come to an end? “I wonder how many miles I’ve fallen by this time?” she said aloud. “I must be getting somewhere near the centre of the earth. Let me see: that would be four thousand miles down, I think-” (for, you see, Alice had learnt several things of this sort in her lessons in the schoolroom, and though this was not a very good opportunity for showing off her knowledge, as there was no one to listen to her, still it was good practice to say it over) “-yes, that’s about the right distance-but then I wonder what Latitude or Longitude I’ve got to?” (Alice had no idea what Latitude was, or Longitude either, but thought they were nice grand words to say.)"
    "Presently she began again. “I wonder if I shall fall right through the earth! How funny it’ll seem to come out among the people that walk with their heads downward! The Antipathies, I think-” (she was rather glad there was no one listening, this time, as it didn’t sound at all the right word) “-but I shall have to ask them what the name of the country is, you know. Please, Ma’am, is this New Zealand or Australia?” (and she tried to curtsey as she spoke-fancy curtseying as you’re falling through the air! Do you think you could manage it?) “And what an ignorant little girl she’ll think me for asking! No, it’ll never do to ask: perhaps I shall see it written up somewhere.”"
    "Down, down, down. There was nothing else to do, so Alice soon began talking again. “Dinah’ll miss me very much to-night, I should think!” (Dinah was the cat.) “I hope they’ll remember her saucer of milk at tea-time. Dinah my dear! I wish you were down here with me! There are no mice in the air, I’m afraid, but you might catch a bat, and that’s very like a mouse, you know. But do cats eat bats, I wonder?” And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, “Do cats eat bats? Do cats eat bats?” and sometimes, “Do bats eat cats?” for, you see, as she couldn’t answer either question, it didn’t much matter which way she put it. She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, “Now, Dinah, tell me the truth: did you ever eat a bat?” when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over."

Remove the empty paragraphs and view the first 10 remaining paragraphs.

textData(textData == "") = [];
textData(1:10)
ans = 10×1 string
    "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”"
    "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her."
    "There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."
    "In another moment down went Alice after it, never once considering how in the world she was to get out again."
    "The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well."
    "Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled “ORANGE MARMALADE”, but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody underneath, so managed to put it into one of the cupboards as she fell past it."
    "“Well!” thought Alice to herself, “after such a fall as this, I shall think nothing of tumbling down stairs! How brave they’ll all think me at home! Why, I wouldn’t say anything about it, even if I fell off the top of the house!” (Which was very likely true.)"
    "Down, down, down. Would the fall never come to an end? “I wonder how many miles I’ve fallen by this time?” she said aloud. “I must be getting somewhere near the centre of the earth. Let me see: that would be four thousand miles down, I think-” (for, you see, Alice had learnt several things of this sort in her lessons in the schoolroom, and though this was not a very good opportunity for showing off her knowledge, as there was no one to listen to her, still it was good practice to say it over) “-yes, that’s about the right distance-but then I wonder what Latitude or Longitude I’ve got to?” (Alice had no idea what Latitude was, or Longitude either, but thought they were nice grand words to say.)"
    "Presently she began again. “I wonder if I shall fall right through the earth! How funny it’ll seem to come out among the people that walk with their heads downward! The Antipathies, I think-” (she was rather glad there was no one listening, this time, as it didn’t sound at all the right word) “-but I shall have to ask them what the name of the country is, you know. Please, Ma’am, is this New Zealand or Australia?” (and she tried to curtsey as she spoke-fancy curtseying as you’re falling through the air! Do you think you could manage it?) “And what an ignorant little girl she’ll think me for asking! No, it’ll never do to ask: perhaps I shall see it written up somewhere.”"
    "Down, down, down. There was nothing else to do, so Alice soon began talking again. “Dinah’ll miss me very much to-night, I should think!” (Dinah was the cat.) “I hope they’ll remember her saucer of milk at tea-time. Dinah my dear! I wish you were down here with me! There are no mice in the air, I’m afraid, but you might catch a bat, and that’s very like a mouse, you know. But do cats eat bats, I wonder?” And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, “Do cats eat bats? Do cats eat bats?” and sometimes, “Do bats eat cats?” for, you see, as she couldn’t answer either question, it didn’t much matter which way she put it. She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, “Now, Dinah, tell me the truth: did you ever eat a bat?” when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over."

Visualize the text data in a word cloud.

figure
wordcloud(textData);
title("Alice's Adventures in Wonderland")

Prepare Data for Training

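Create a datastore that contains the data for training using documentGenerationDatastore. For the predictors, the datastore converts the documents into sequences of word indices using a word encoding. The first word index for each document corresponds to a "start of text" token. The "start of text" token is given by the string "startOfText". For the responses, the datastore returns categorical sequences of the words shifted by one.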

Tokenize the text data using tokenizedDocument.

documents = tokenizedDocument(textData);
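To see what tokenization produces, it can help to inspect a single short sentence. The sketch below assumes Text Analytics Toolbox is installed and uses a made-up input string.

```matlab
% Tokenize one short, made-up sentence.
document = tokenizedDocument("Oh dear! I shall be late!");

% Convert the document to a string array of its tokens.
tokens = string(document)

% Number of tokens in the document.
numTokens = doclength(document)
```

Punctuation marks become tokens of their own, which is why text generated token by token needs its spacing around punctuation repaired afterward.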

Create a document generation datastore using the tokenized documents.

ds = documentGenerationDatastore(documents);

To reduce the amount of padding added to the sequences, sort the documents in the datastore by sequence length.

ds = sort(ds);
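The effect of sorting on padding can be checked with a small calculation. The following base MATLAB sketch uses made-up sequence lengths and assumes that every sequence in a mini-batch is padded to the length of the longest sequence in that batch.

```matlab
% Hypothetical sequence lengths and mini-batch size (illustration only).
lengths = [5 40 7 38 6 42 4 39];
miniBatchSize = 4;

% Compare the original order with the order sorted by length.
orders = {lengths, sort(lengths)};
padding = zeros(1,2);

for i = 1:2
    len = orders{i};
    for k = 1:miniBatchSize:numel(len)
        % Sequences that fall into the current mini-batch.
        batch = len(k:min(k+miniBatchSize-1,numel(len)));
        % Pad every sequence up to the longest one in the batch.
        padding(i) = padding(i) + sum(max(batch) - batch);
    end
end

% padding(1): unsorted order, padding(2): sorted order.
padding
```

With these lengths the unsorted order needs 147 padding steps and the sorted order needs 15, because sorting groups sequences of similar length into the same mini-batch.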

Create and Train LSTM Network

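Define the LSTM network architecture. To input sequence data into the network, include a sequence input layer and set the input size to 1. Next, include a word embedding layer of dimension 100 with the same number of words as the word encoding. Next, include an LSTM layer with 100 hidden units, followed by a dropout layer with a dropout probability of 0.2. Finally, add a fully connected layer with the same size as the number of classes, a softmax layer, and a classification layer. The number of classes is the number of words in the vocabulary plus an extra class for the "end of text" class.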

inputSize = 1;
embeddingDimension = 100;
numWords = numel(ds.Encoding.Vocabulary);
numClasses = numWords + 1;

layers = [ 
    sequenceInputLayer(inputSize)
    wordEmbeddingLayer(embeddingDimension,numWords)
    lstmLayer(100)
    dropoutLayer(0.2)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];
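A rough count of learnable parameters shows where the model size comes from. The sketch below is a back-of-the-envelope calculation in base MATLAB; the vocabulary size of 3000 is a hypothetical value, and the LSTM count assumes the standard formulation with four gates, each having input weights, recurrent weights, and a bias.

```matlab
% Hypothetical vocabulary size; the real value is numel(ds.Encoding.Vocabulary).
numWords = 3000;
embeddingDimension = 100;
numHiddenUnits = 100;
numClasses = numWords + 1;   % vocabulary plus the "end of text" class

% Word embedding: one vector of length embeddingDimension per word.
embeddingParams = embeddingDimension*numWords;

% LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstmParams = 4*numHiddenUnits*(embeddingDimension + numHiddenUnits + 1);

% Fully connected layer: weights plus one bias per class.
fcParams = numClasses*(numHiddenUnits + 1);

% Dropout, softmax, and classification layers have no learnable parameters.
totalParams = embeddingParams + lstmParams + fcParams
```

Under these assumptions the embedding and fully connected layers dominate (about 300,000 parameters each), so the parameter count grows roughly linearly with the vocabulary size.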

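Specify the training options. Specify the solver to be 'adam'. Train for 300 epochs with a learning rate of 0.01. Set the mini-batch size to 32. To keep the data sorted by sequence length, set the 'Shuffle' option to 'never'. To monitor the training progress, set the 'Plots' option to 'training-progress'. To suppress verbose output, set the 'Verbose' option to false.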

options = trainingOptions('adam', ...
    'MaxEpochs',300, ...
    'InitialLearnRate',0.01, ...
    'MiniBatchSize',32, ...
    'Shuffle','never', ...
    'Plots','training-progress', ...
    'Verbose',false);

Train the network using trainNetwork.

net = trainNetwork(ds,layers,options);

Generate New Text

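To generate the first word of the text, sample a word from a probability distribution that reflects the first words of the texts in the training data. Generate the remaining words by using the trained LSTM network to predict the next time step from the sequence of text generated so far. Keep generating words one by one until the network predicts the "end of text" word.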

To make the first prediction using the network, input the index that represents the "start of text" token. Find the index by using the word2ind function with the word encoding used by the document datastore.

enc = ds.Encoding;
wordIndex = word2ind(enc,"startOfText")
wordIndex = 1

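For the remaining predictions, sample the next word according to the prediction scores of the network. The prediction scores represent the probability distribution of the next word. Sample the words from the vocabulary given by the class names of the output layer of the network.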

vocabulary = string(net.Layers(end).Classes);
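Sampling according to the scores means that a word with score p is chosen with probability p. The following base MATLAB sketch shows one way to do this with a cumulative sum; the four-word vocabulary and its scores are made up, and this is an illustration of the idea rather than the method used inside datasample.

```matlab
% Hypothetical vocabulary and prediction scores (the scores sum to 1).
vocabulary = ["the" "rabbit" "Alice" "EndOfText"];
wordScores = [0.5 0.2 0.2 0.1];

% Cumulative distribution, normalized so that the last entry is exactly 1.
cdf = cumsum(wordScores);
cdf = cdf/cdf(end);

% Draw one word: pick the first bin whose upper edge exceeds a uniform draw.
rng(0)   % fixed seed so that the sketch is repeatable
newWord = vocabulary(find(rand <= cdf,1))

% Draw many words and compare the empirical frequencies with the scores.
numSamples = 10000;
draws = arrayfun(@(~) find(rand <= cdf,1),1:numSamples);
frequencies = accumarray(draws(:),1,[numel(vocabulary) 1])'/numSamples
```

The empirical frequencies approach the scores as the number of samples grows. Because the next word is drawn at random rather than always taking the highest score, repeated runs generate different text.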

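Predict word by word using predictAndUpdateState. For each prediction, input the index of the previous word. Stop predicting when the network predicts the end of text word or when the generated text is 500 characters long. For large collections of data, long sequences, or large networks, predictions on the GPU are usually faster to compute than predictions on the CPU. Otherwise, predictions on the CPU are usually faster to compute. For single time step predictions, use the CPU. To use the CPU for prediction, set the 'ExecutionEnvironment' option of predictAndUpdateState to 'cpu'.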

generatedText = "";
maxLength = 500;
while strlength(generatedText) < maxLength
    % Predict the next word scores.
    [net,wordScores] = predictAndUpdateState(net,wordIndex,'ExecutionEnvironment','cpu');
    
    % Sample the next word.
    newWord = datasample(vocabulary,1,'Weights',wordScores);
    
    % Stop predicting at the end of text.
    if newWord == "EndOfText"
        break
    end
    
    % Add the word to the generated text.
    generatedText = generatedText + " " + newWord;
    
    % Find the word index for the next input.
    wordIndex = word2ind(enc,newWord);
end

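The generation process introduces a whitespace character between each prediction, which means that some punctuation characters appear with unnecessary spaces before and after them. Reconstruct the generated text by removing the spaces before and after the appropriate punctuation characters.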
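The idea can be tried on a short fixed string before applying it to generated text. The sketch below uses base MATLAB only; the input string is made up, and handling curly double quotation marks is an addition made here for illustration.

```matlab
% Made-up text in which every token is separated by a single space.
generatedText = " “ Just about as much right , ” said the Duchess .";

% Remove leading and trailing whitespace.
generatedText = strip(generatedText);

% Remove the space before closing punctuation.
closingCharacters = ["." "," "”"];
generatedText = replace(generatedText," " + closingCharacters,closingCharacters);

% Remove the space after opening punctuation.
openingCharacters = "“";
generatedText = replace(generatedText,openingCharacters + " ",openingCharacters)
```

Adding a string array to " " builds one search pattern per punctuation character, so a single call to replace handles the whole list.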

Remove the spaces that appear before the specified punctuation characters.

punctuationCharacters = ["." "," "’" ")" ":" "?" "!"];
generatedText = replace(generatedText," " + punctuationCharacters,punctuationCharacters);

Remove the spaces that appear after the specified punctuation characters.

punctuationCharacters = ["(" "‘"];
generatedText = replace(generatedText,punctuationCharacters + " ",punctuationCharacters)
generatedText = 
" “ Just about as much right, ” said the Duchess, “ and that’s all the least, ” said the Hatter. “ Fetch me to my witness at the shepherd heart of him."

To generate multiple pieces of text, reset the network state between generations using resetState.

net = resetState(net);
