Generate Text with Deep Learning "Invalid training data. Labels must not contain undefined values" ERROR

I am using the MATLAB example "Generate Text with Deep Learning".
It works fine when I use the Shakespeare text provided in the example, but none of my texts are accepted. I always get the error: "Invalid training data. Labels must not contain undefined values."
My text and code are provided below.
filename = 'RWE Nature.txt';
textData = fileread(filename);
textData = replace(textData," ","");
textData = split(textData,[newline]); % USE NEWLINE TO SPLIT TEXT INTO CELLS
% textData = textData(5:2:end);
textData(1:5) % 154 X 1 string array
startOfTextCharacter = compose("\x0002");
whitespaceCharacter = compose("\x00B7");
endOfTextCharacter = compose("\x2403");
newlineCharacter = compose("\x00B6");
textData = startOfTextCharacter + textData;
textData = replace(textData,[" " newline],[whitespaceCharacter newlineCharacter]);
uniqueCharacters = unique([textData{:}]); % '!'(),-.:;?ABCDEFGHIJKLMNOPRSTUVWYabcdefghijklmnopqrstuvwxyz¶·'
numUniqueCharacters = numel(uniqueCharacters); % 62
%
numDocuments = numel(textData); % 154 SONNETS, 89 PARAGRAPHS IN MAYER
XTrain = cell(1,numDocuments);
YTrain = cell(1,numDocuments);
for i = 1:numel(textData)
    characters = textData{i};
    sequenceLength = numel(characters);
    % Get indices of characters.
    [~,idx] = ismember(characters,uniqueCharacters);
    % Convert characters to vectors.
    X = zeros(numUniqueCharacters,sequenceLength);
    for j = 1:sequenceLength
        X(idx(j),j) = 1;
    end
    % Create vector of categorical responses with end of text character.
    charactersShifted = [cellstr(characters(2:end)')' endOfTextCharacter];
    Y = categorical(charactersShifted);
    XTrain{i} = X;
    YTrain{i} = Y;
end
% textData{1}
inputSize = size(XTrain{1},1);
numHiddenUnits = 200;
numClasses = numel(categories([YTrain{:}]));
layers = [
    sequenceInputLayer(inputSize)
    lstmLayer(numHiddenUnits,'OutputMode','sequence')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];
options = trainingOptions('adam', ...
    'MaxEpochs',500, ...
    'InitialLearnRate',0.01, ...
    'GradientThreshold',2, ...
    'MiniBatchSize',77, ...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'Verbose',false);
% Train the network.
'a'
net = trainNetwork(XTrain,YTrain,layers,options);
'b'
% Generate text using the trained network.
generatedText = generateText(net,uniqueCharacters,startOfTextCharacter,newlineCharacter,whitespaceCharacter,endOfTextCharacter)
'end'
function generatedText = generateText(net,uniqueCharacters,startOfTextCharacter,newlineCharacter,whitespaceCharacter,endOfTextCharacter)
    numUniqueCharacters = numel(uniqueCharacters);
    X = zeros(numUniqueCharacters,1);
    idx = strfind(uniqueCharacters,startOfTextCharacter);
    X(idx) = 1;
    generatedText = "";
    vocabulary = string(net.Layers(end).Classes);
    maxLength = 500;
    while strlength(generatedText) < maxLength
        % Predict the next character scores.
        [net,characterScores] = predictAndUpdateState(net,X,'ExecutionEnvironment','cpu');
        % Sample the next character.
        newCharacter = datasample(vocabulary,1,'Weights',characterScores);
        % Stop predicting at the end of text.
        if newCharacter == endOfTextCharacter
            break
        end
        % Add the character to the generated text.
        generatedText = generatedText + newCharacter;
        % Create a new vector for the next input.
        X(:) = 0;
        idx = strfind(uniqueCharacters,newCharacter);
        X(idx) = 1;
    end
    generatedText = replace(generatedText,[newlineCharacter whitespaceCharacter],[newline " "]);
end

Answers (1)

Ben on 28 Nov 2022
There are a few issues to fix:
  1. The call Y = categorical(charactersShifted) needs a valueset that includes all the unique characters in your dataset: Y = categorical(charactersShifted,allUniqueCharacters).
  2. For that to work with the uniqueCharacters variable, you need to convert it to the same class as charactersShifted, which is a string array.
  3. The endOfTextCharacter must be included in the valueset too; otherwise it will become an <undefined> category in Y.
  4. Finally, the line charactersShifted = [cellstr(characters(2:end)')' endOfTextCharacter]; can prepend an empty "" when characters is only 1 character long. That makes Y have length 2 while X has length 1, so you will get a sequence length mismatch when you try to train.
I think training should work once you resolve these things. Hope that helps.
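The four points above could be combined roughly as follows. This is a minimal sketch, not a tested drop-in fix: the three-line textData is hypothetical toy data standing in for the text file, valueset is a name introduced here for illustration, and dropping blank lines is one assumed way to avoid the one-character sequences from point 4. As far as I know, categorical also turns an empty "" into <undefined>, which would explain the error for texts that contain blank lines.

```matlab
% Hypothetical toy data; the blank second entry mimics an empty line.
textData = ["Nature is a setting"; ""; "that fits equally well"];

startOfTextCharacter = compose("\x0002");
whitespaceCharacter = compose("\x00B7");
endOfTextCharacter = compose("\x2403");
newlineCharacter = compose("\x00B6");

% Point 4 (assumed approach): drop blank lines so that no sequence
% consists of the start-of-text character alone.
textData = textData(strlength(strip(textData)) > 0);

textData = startOfTextCharacter + textData;
textData = replace(textData,[" " newline],[whitespaceCharacter newlineCharacter]);

uniqueCharacters = unique([textData{:}]);   % char row vector
numUniqueCharacters = numel(uniqueCharacters);

% Points 1-3: build the value set as a string array (one string per
% character) and append the end-of-text character.
valueset = unique([string(uniqueCharacters(:)); endOfTextCharacter],'stable');

numDocuments = numel(textData);
XTrain = cell(1,numDocuments);
YTrain = cell(1,numDocuments);
for i = 1:numDocuments
    characters = textData{i};
    sequenceLength = numel(characters);

    % One-hot encode the input characters.
    [~,idx] = ismember(characters,uniqueCharacters);
    X = zeros(numUniqueCharacters,sequenceLength);
    for j = 1:sequenceLength
        X(idx(j),j) = 1;
    end

    % Targets: input shifted by one, ending with the end-of-text character.
    charactersShifted = [string(characters(2:end)')' endOfTextCharacter];
    Y = categorical(charactersShifted,valueset);

    XTrain{i} = X;
    YTrain{i} = Y;
end

% Sanity checks before calling trainNetwork.
assert(~any(isundefined([YTrain{:}])),"Labels contain <undefined> values.")
assert(all(cellfun(@(x,y) size(x,2) == numel(y),XTrain,YTrain)), ...
    "Sequence length mismatch between X and Y.")
```

With a value set like this, numClasses should come out as numel(valueset), while inputSize stays at numUniqueCharacters. The two assert calls are only there to catch <undefined> labels or length mismatches before training starts.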

Release

R2021b
