Hybrid method to sentiment analysis column number error

Question

oliver 2023년 4월 12일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/1945859-hybrid-method-to-sentiment-analysis-column-number-error

편집: Sanguk 2023년 4월 14일

IMBD_reviews_smol.csv

I am trying to perofrm the hybrid approach to sentiment analysis using both vader sentiment scores and machine learning using smv. I am trying to concatenate the sentiment scores with the bag of words features and then predict the sentiment labels for the next test set before evaluating th performance. however my X data is 1 coumn off being correct for the test. The error i recieve is

"Error using classreg.learning.internal.numPredictorsCheck

X data must have 1662 column(s).

Error in classreg.learning.classif.CompactClassificationECOC/predict (line 335)

classreg.learning.internal.numPredictorsCheck(X,...

Error in assessment_hybrid (line 50)

YPred = predict(mdl, XTest);"

my code is

% Load the movie review dataset
filename = "IMBD_reviews_first5000.csv"; 
data = readtable(filename,'TextType','string');
data.sentiment = categorical(data.sentiment);
% Split dataset into training and test sets using holdout
cvp = cvpartition(data.sentiment, 'Holdout', 0.1);
dataTrain = data(cvp.training, :);
dataTest = data(cvp.test, :);
% Extract review text and sentiment labels from training and test set 
textDataTrain = dataTrain.review;
textDataTest = dataTest.review;
YTrain = dataTrain.sentiment;
YTest = dataTest.sentiment;
% Preprocess training set
documents = preprocessText(textDataTrain);
% Create bag of words and remove infrequent words
bag = bagOfWords(documents);
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
% Encode training set using bag of words
XTrain = bag.Counts;
% Train SVM classifier
mdl = fitcecoc(XTrain, YTrain, "Learners", "linear");  
% Preprocess test set
documentsTest = preprocessText(textDataTest);
documentsTrain = preprocessText(textDataTrain);
% Encode test set using bag of words
XTest = encode(bag, documentsTest);
% Compute sentiment scores for training and test sets using VADER
sentimentScoresTrain = vaderSentimentScores(documentsTrain);
sentimentScoresTest = vaderSentimentScores(documentsTest);
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Predict sentiment labels for test set
YPred = predict(mdl, XTest);
% Evaluate performance
accuracy = sum(YPred == YTest) / numel(YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);
confusion = confusionmat(YTest, YPred);
truePositive = confusion(1, 1);
falsePositive = confusion(2, 1);
trueNegative = confusion(2, 2);
falseNegative = confusion(1, 2);
% Compute precision, recall, and F-measure
precision = truePositive / (truePositive + falsePositive);
recall = truePositive / (truePositive + falseNegative);
fMeasure = 2 * precision * recall / (precision + recall);
% Compute accuracy 
accuracy2 = (truePositive + trueNegative) / numel(YTest);
% Display results 
disp(['True positive: ' num2str(truePositive)]);
disp(['False positive: ' num2str(falsePositive)]);
disp(['True negative: ' num2str(trueNegative)]);
disp(['False negative: ' num2str(falseNegative)]);
disp(['Precision: ' num2str(precision)]);
disp(['Recall: ' num2str(recall)]);
disp(['F-measure: ' num2str(fMeasure)]);
function documents = preprocessText(textData)
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
end

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Drew 2023년 4월 12일

1
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/1945859-hybrid-method-to-sentiment-analysis-column-number-error#answer_1214554

편집: Drew 2023년 4월 12일

MATLAB Online에서 열기

Updating this answer based on the comment below:

You have built one svm classifier based on the bag-of-words features. It sounds like what you want to do is build another classifier that uses both the bag-of-words features and the vader sentiment score feature. So, after you concatentate the sentiment scores with the bag-of-words features, you should train a new model on those combined features, and then evaluate that model.

% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Build new svm model using both bag-of-words and vader sentiment scores as
% features
mdl2 = fitcecoc(XTrain, YTrain, "Learners", "linear");  

Original answer:

As the error message indicates, on line 50 of your code (shown below), the number of predictors in XTest does not match the number of predictors used to train the model.

YPred = predict(mdl, XTest);

After training the model, the lines below added one extra predictor, and hence the mismatch.

% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];

The number of predictors at model training time needs to match the number of predictors at model testing time.

댓글 수: 2
없음 표시없음 숨기기

oliver 2023년 4월 12일

do you know how i would change the code to prevent adding one extra predictor

Sanguk 2023년 4월 13일

편집: Sanguk 2023년 4월 14일

MATLAB Online에서 열기

I get this error

"Error using horzcat

Dimensions of arrays being concatenated are not consistent.

Error in reun (line 44)

XTrain = [XTrain, sentimentScoresTrain];"

from running:

filename = "sentiment_irrelevantdrop_Chegg"; 
data = readtable(filename,'TextType','string');
data.sentiment = categorical(data.sentiment);
% Split dataset into training and test sets using holdout
cvp = cvpartition(data.sentiment, 'Holdout', 0.1);
dataTrain = data(cvp.training, :);
dataTest = data(cvp.test, :);
% Extract review text and sentiment labels from training and test set 
textDataTrain = dataTrain.text;
textDataTest = dataTest.text;
YTrain = dataTrain.sentiment;
YTest = dataTest.sentiment;
% Preprocess training set
documents = preprocessText(textDataTrain);
% Create bag of words and remove infrequent words
bag = bagOfWords(documents);
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
% Encode training set using bag of words
XTrain = bag.Counts;
% Train SVM classifier
mdl = fitcecoc(XTrain, YTrain, "Learners", "linear");  
% Preprocess test set
documentsTest = preprocessText(textDataTest);
documentsTrain = preprocessText(textDataTrain);
% Encode test set using bag of words
XTest = encode(bag, documentsTest);
% Compute sentiment scores for training and test sets using VADER
sentimentScoresTrain = vaderSentimentScores(documentsTrain);
sentimentScoresTest = vaderSentimentScores(documentsTest);
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Build new svm model using both bag-of-words and vader sentiment scores as
% features
mdl2 = fitcecoc(XTrain, YTrain, "Learners", "linear");  
% Predict sentiment labels for test set
YPred = predict(mdl, XTest);
% Evaluate performance
accuracy = sum(YPred == YTest) / numel(YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);
confusion = confusionmat(YTest, YPred);
truePositive = confusion(1, 1);
falsePositive = confusion(2, 1);
trueNegative = confusion(2, 2);
falseNegative = confusion(1, 2);
% Compute precision, recall, and F-measure
precision = truePositive / (truePositive + falsePositive);
recall = truePositive / (truePositive + falseNegative);
fMeasure = 2 * precision * recall / (precision + recall);
% Compute accuracy 
accuracy2 = (truePositive + trueNegative) / numel(YTest);
% Display results 
disp(['True positive: ' num2str(truePositive)]);
disp(['False positive: ' num2str(falsePositive)]);
disp(['True negative: ' num2str(trueNegative)]);
disp(['False negative: ' num2str(falseNegative)]);
disp(['Precision: ' num2str(precision)]);
disp(['Recall: ' num2str(recall)]);
disp(['F-measure: ' num2str(fMeasure)]);
function documents = preprocessText(textData)
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
end

btw, thank for your coding Oliver

댓글을 달려면 로그인하십시오.

Hybrid method to sentiment analysis column number error

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 2
없음 표시없음 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

제품

릴리스

Community Treasure Hunt

Hybrid method to sentiment analysis column number error

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 2 없음 표시없음 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

제품

릴리스

Community Treasure Hunt

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 2
없음 표시없음 숨기기