Cleveland heart disease dataset - how to improve test accuracy?

Question

HOCK WENG on 10 May 2024

Direct link to this question: https://kr.mathworks.com/matlabcentral/answers/2117461-cleveland-heart-disease-dataset-how-to-improve-test-accurarcy

Answered: Drew on 28 Jun 2024

processed.cleveland.csv

I'm using a multilayer perceptron with backpropagation to predict heart disease, and I am using the dataset linked here: https://drive.google.com/file/d/1ZuVXGbE6UVQFJ5ab5m1k4LzvDNTtLqYQ/view?usp=sharing

It has 303 records and 5 output classes (0, 1, 2, 3, 4) that represent the severity of heart disease on a scale of 0 to 4.

There's missing data in the dataset that I took care of by replacing missing values with the mean of the respective feature.

Here are the parameters that I set to train the model:

Number of hidden layer neurons = 100 with single hidden layer
Number of output layer neurons = 5
activation function of hidden layer = logsig
activation function of output layer = softmax
training function = trainlm
learning rate = 0.001
Maximum validation failures = 10
Maximum epochs = 5000
Minimum gradient = 1.99e-8

But no matter what I do (adjusting the learning rate, the number of hidden layers, etc.), the test accuracy stays between 55% and 60%, while the training accuracy can exceed 80%. I need the test accuracy to be above 80% too.

How can I achieve my target? Please help me solve this problem, thank you.

Here is the result that I got; I hope it can serve as a reference:

% Load the dataset
data = readtable('C:\Users\User\Desktop\processed.cleveland.csv');
% Convert table to matrix
X = table2array(data(:, 1:13)); % Assuming the first 13 columns are input features
y = table2array(data(:, 14));   % Assuming the last column is the output label
% Handle missing values (if any)
% Replace missing values with the mean of the respective feature
X = fillmissing(X, 'constant', mean(X, 'omitnan'));
% % Normalize the input features
% X = normalize(X);
% Z-score normalization
mu = mean(X, 'omitnan'); % Mean of each feature, ignoring NaNs
sigma = std(X, 0, 'omitnan'); % Standard deviation of each feature, ignoring NaNs
X = (X - mu) ./ sigma; % Perform Z-score normalization
% Split the data into training and testing sets
cv = cvpartition(size(X,1),'HoldOut',0.5); % Hold out 50% of the data for testing
idxTrain = training(cv); % Indices for training set
idxTest = test(cv); % Indices for testing set
X_train = X(idxTrain,:);
y_train = y(idxTrain,:);
X_test = X(idxTest,:);
y_test = y(idxTest,:);
% Define the MLP architecture
hiddenLayerSize = 100; % Single hidden layer with 100 neurons
% Choose activation function for the output layer
outputLayerActivation = 'softmax'; % Softmax for multi-class classification
% Create the MLP model
net = patternnet(hiddenLayerSize);
% Set activation functions for hidden layers
for i = 1:numel(hiddenLayerSize)
    net.layers{i}.transferFcn = 'logsig'; % Log-sigmoid activation
    net.layers{i}.userdata.dropoutFraction = 0.5; % Stored as user data only; patternnet does not apply dropout
end
% Set activation function for output layer
net.layers{end}.transferFcn = outputLayerActivation;
% Set training function
net.trainFcn = 'trainlm';
% Set training options
net.trainParam.lr = 0.001; % Learning rate (not a trainlm parameter; trainlm adapts mu instead)
net.trainParam.max_fail = 10; % Maximum validation failures
net.trainParam.epochs = 5000; % Maximum epochs
net.trainParam.min_grad = 1.99e-8; % Minimum gradient
% Train the MLP model using training data
net = train(net, X_train', ind2vec(y_train'+1)); % '+1' to convert labels to 1-based indexing
% Test the trained model using testing data
y_pred = net(X_test');


Answer 1

Drew on 28 Jun 2024

Direct link to this answer: https://kr.mathworks.com/matlabcentral/answers/2117461-cleveland-heart-disease-dataset-how-to-improve-test-accurarcy#answer_1478476

processed.cleveland.csv

Main ideas:

(1) Collapse the target classes to just two classes, namely, presence or absence of heart disease. As seen at https://archive.ics.uci.edu/dataset/45/heart+disease, "Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0)." So, collapse target values 1,2,3,4 to just "1".

(2) Categorical variables should be properly encoded for use with neural network classifiers. For example, use one-hot encoding on the categorical variables. Again, the info at https://archive.ics.uci.edu/dataset/45/heart+disease indicates which variables are categorical.

(3) This is a small dataset, so the choice of the validation and test data will affect the bias and variance of the observed accuracy. Using k-fold cross-validation, it is easy to observe accuracies over 80% for the two-class problem of presence vs absence of heart disease.
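To make point (3) more concrete, here is a rough back-of-the-envelope sketch of how much an accuracy estimate can move purely because of test-set size. It assumes a true accuracy of 80% and treats each test prediction as an independent Bernoulli trial; both are simplifying assumptions, not measurements from this dataset. The 15% case assumes the default random 70/15/15 division used when training a patternnet.

```matlab
% Rough binomial standard error of an accuracy estimate.
% Assumptions (illustrative only): true accuracy p = 0.80 and
% independent test predictions.
p      = 0.80;
nTotal = 303;                        % records in the Cleveland data
nTest  = [round(0.15*nTotal), ...    % one 15% hold-out split
          round(0.50*nTotal), ...    % the 50% hold-out used in the question
          nTotal];                   % predictions pooled over all k folds
se = sqrt(p*(1-p) ./ nTest);         % standard error for each test-set size
for k = 1:numel(nTest)
    fprintf('n = %3d  ->  SE = %.1f percentage points\n', nTest(k), 100*se(k));
end
```

Under these assumptions the standard error is roughly 6 percentage points for a single 15% test split, about 3 points for a 50% split, and about 2 points when predictions from all folds are pooled, so a single small test split can easily land several points above or below the average.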

Implementation with (1) fitcnet & Classification Learner app, OR (2) patternnet

For classification of this tabular data, either fitcnet or patternnet can be used. See the accepted answer at https://www.mathworks.com/matlabcentral/answers/834428-difference-between-fitcnet-and-patternnet-functions for some similarities and differences. As mentioned in that answer, "Finally, note that fitcnet is available in the Classification Learner app, which facilitates easy comparison of multiple machine learning models for tabular classification problems."

(1) fitcnet and Classification Learner app

Let's first try easy comparison of multiple machine learning models using Classification Learner. First, prepare the data for loading into the Classification Learner app. This little script starts from the data you attached, adds variable names and categorical variable designation, imputes missing values using the mode, and collapses the target to just two classes.

% Load data
data = readtable('processed.cleveland.csv');
% Add Variable Names, info from https://archive.ics.uci.edu/dataset/45/heart+disease
data.Properties.VariableNames = {'age', 'sex', 'cp', 'trestbps', 'chol', ...
    'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'};
% Convert variables to categorical
iscat=[0 1 1 0 0 1 1 0 1 0 1 0 1 0]; % Leave target as double for now
for i=1:width(data)
    if (iscat(i)==1)
        data.(i) = categorical(data.(i));
    end
end
% Replace missing with the mode of that variable. 
for i=1:width(data)
    if (sum(ismissing(data.(i)))) % If true, then this var has some missing values
        % replace missing with mode of the variable
        data(:,i) = fillmissing(data(:,i),'constant',table2array(mode(data(:,i))));
    end
end
% Collapse the targets to 0 or 1, and convert to categorical.
data.(14) = categorical( double( data.(14)>=1 ) );
head(data)
    age    sex    cp    trestbps    chol    fbs    restecg    thalach    exang    oldpeak    slope    ca    thal    num
    ___    ___    __    ________    ____    ___    _______    _______    _____    _______    _____    __    ____    ___

    63      1     1       145       233      1        2         150        0        2.3        3      0      6       0 
    67      1     4       160       286      0        2         108        1        1.5        2      3      3       1 
    67      1     4       120       229      0        2         129        1        2.6        2      2      7       1 
    37      1     3       130       250      0        0         187        0        3.5        3      0      3       0 
    41      0     2       130       204      0        2         172        0        1.4        1      0      3       0 
    56      1     2       120       236      0        0         178        0        0.8        1      0      3       0 
    62      0     4       140       268      0        2         160        0        3.6        3      2      3       1 
    57      0     4       120       354      0        0         163        1        0.6        1      0      3       0 

Now, load the data into the Classification Learner app, and choose 10-fold cross-validation:

Once in the app, choose "All" models, "Optimizable Neural Network", and "Optimizable Ensemble" from the models gallery. After training those models, the following results (or similar) are obtained, with many models achieving over 80% accuracy. The exact results will vary, depending on the cross-validation partition and optimization results. The Optimizable Neural Network achieves 85.5% accuracy on the validation data (this is 10-fold cross-validation accuracy). In this case, it turns out that the optimization process chose a neural network with just one layer, and one node in that layer. So, a very simple neural network can do pretty well for this data.

Next, export the best-performing neural network to the workspace, using the "Export Model" option. Inside the model, the expanded predictor names can be seen (look at trainedModel.ClassificationNeuralNetwork.ExpandedPredictorNames), indicating that fitcnet has automatically done the one-hot encoding, based on which variables are categorical.

>> trainedModel.ClassificationNeuralNetwork.ExpandedPredictorNames

ans =

1×25 cell array

Columns 1 through 8

{'age'} {'sex == 0'} {'sex == 1'} {'cp == 1'} {'cp == 2'} {'cp == 3'} {'cp == 4'} {'trestbps'}

Columns 9 through 15

{'chol'} {'fbs == 0'} {'fbs == 1'} {'restecg == 0'} {'restecg == 1'} {'restecg == 2'} {'thalach'}

Columns 16 through 22

{'exang == 0'} {'exang == 1'} {'oldpeak'} {'slope == 1'} {'slope == 2'} {'slope == 3'} {'ca'}

Columns 23 through 25

{'thal == 3'} {'thal == 6'} {'thal == 7'}
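A similar cross-validated estimate can also be sketched at the command line instead of in the app. The example below is only an illustration: the table tbl is a small synthetic stand-in (not the Cleveland data), the variable names mimic the ones above, and the network uses default fitcnet settings rather than the optimized network found by the app. With the real data, the prepared table and its 'num' response would take the place of tbl.

```matlab
rng(0);                                    % reproducible partition and initial weights
n = 200;
% Synthetic stand-in for the prepared table (NOT the Cleveland data)
age = 30 + 40*rand(n,1);
cp  = categorical(randi(4, n, 1));         % categorical predictor, like 'cp'
num = categorical(double(age + 5*double(cp) + 10*randn(n,1) > 65));
tbl = table(age, cp, num);

% 10-fold cross-validated neural network; fitcnet one-hot encodes the
% categorical predictor automatically.
cvMdl   = fitcnet(tbl, 'num', 'Standardize', true, 'KFold', 10);
cvAcc   = 1 - kfoldLoss(cvMdl);                          % average over all folds
foldAcc = 1 - kfoldLoss(cvMdl, 'Mode', 'individual');    % one value per fold
fprintf('CV accuracy %.3f, per-fold range %.3f to %.3f\n', ...
    cvAcc, min(foldAcc), max(foldAcc));
```

Requesting the loss with 'Mode','individual' exposes the per-fold accuracies, which makes the fold-to-fold spread visible alongside the averaged figure.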

(2) Patternnet

Similar results can be obtained using patternnet, but there will be some differences from fitcnet due to the different training algorithm. Remember to collapse the target classes to just two, and one-hot encode the categorical variables. Also, given the train/validation/test split used by patternnet training, one will generally be looking at the test accuracy, which is roughly similar to looking at the accuracy of one fold in a cross-validation scheme. Due to the smaller sample size, the per-fold accuracy will have much higher variance than the k-fold cross-validation accuracy which is averaged across all folds. I observed "per-fold" test accuracies ranging from a low around 73% to a high around 90%, with the average around 82-83% using a simple patternnet and the default training algorithm (no hyperparameter optimization). In the fitcnet case above, the app doesn't report the per-fold validation accuracy, but the per-fold accuracy will similarly be in a relatively wide range, with the average across all folds being around 85% after hyperparameter optimization (we observed 85.5% above) for a simple neural network.
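As a minimal sketch of the patternnet route, the example below one-hot encodes a categorical predictor with dummyvar, uses a two-class target, and reports accuracy on the test indices chosen by the default random data division. The data are synthetic (not the Cleveland data), and the hidden layer size of 10 is an arbitrary assumption rather than a tuned value.

```matlab
rng(0);                                    % reproducible data and initial weights
n = 200;
% Synthetic stand-in (NOT the Cleveland data)
age = 30 + 40*rand(n,1);
cp  = categorical(randi(4, n, 1));
y   = double(age + 5*double(cp) + 10*randn(n,1) > 65);   % target already collapsed to 0/1

% One-hot encode the categorical predictor and z-score the numeric one.
X = [normalize(age), dummyvar(cp)];        % n-by-5 predictor matrix
T = full(ind2vec(y' + 1));                 % 2-by-n one-hot targets

net = patternnet(10);                      % one small hidden layer, default training function
net.trainParam.showWindow = false;         % suppress the training GUI
[net, tr] = train(net, X', T);             % default random train/validation/test division

% Accuracy on the held-out test indices only
scores = net(X(tr.testInd, :)');
[~, predicted] = max(scores, [], 1);
testAcc = mean(predicted - 1 == y(tr.testInd)');
fprintf('Test accuracy on %d held-out samples: %.3f\n', numel(tr.testInd), testAcc);
```

With the real data, X would be built the same way: the numeric columns side by side with dummyvar applied to each categorical column. Rerunning with different random seeds gives a feel for how much the single-split test accuracy varies.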

If this answer helps you, please remember to accept the answer.
