Why do I get this message : Error using kmeans ---X must have more rows than the number of clusters.

Question

Alayt Abraham Issak on 10 Apr 2019


https://kr.mathworks.com/matlabcentral/answers/455569-why-do-i-get-this-message-error-using-kmeans-x-must-have-more-rows-than-the-number-of-clusters

Commented: Adam Danz on 11 Apr 2019

clear all; close all
load BRI
rng(0,'twister');
% A_train is 2055 x 89 factors
A_train = D_num(:,[1:66,68:89]);
all_factors={'recommended_for_research','umbrella','year','donor','Gov''t_funding_agency'...
    'State-owned_funding_company','Other_private_funding_company','implementing_agency_china'...
    'Pipeline: Pledge','Pipeline: Commitment','Implementation','Completion','Suspended','Cancelled'...
    'Debt forgiveness','Export Credits','Grant','Strategic/Supplier Credit','Debt Rescheduling'...
    'Free-standing Technical Assistance','Scholarship/Training in Donor Country','Joint Venture with Recipient'...
    'Loan','Foreign Direct Investment','ODA-like_flow class','OOF-like_flow class','Vague_flow class','Other flow '...
   'Development Intent','Commercial Intent','Representational Intent','Mixed Intent','amount','Cash/physical_money'...
   'USD_currency','CMY_currency','Other currency','usd_defl_2014','usd_current_publish','usd_current_2019','crs_sector_code'...
   'sources_count','cofinancing_agency','Gov''t_recepient_agency','State-owned_recepient_company','Other_private_agency'...
   'recipient_agencies_count','deflators_used','exchange_rates_used','start_actual','start_planned','end_actual','end_planned'...
   'Beginning_date_since_2000','End_date_since_2014','Planned_start > Actual_start','Planned_start > Actual_start','Planned_start = Actual_start'...
  'Planned_end > Actual_end','Planned_end < Actual_end','Planned_end = Actual_end','Planned_duration','Actual_Duration','year_uncertain'...
  '2019 population','GDP(IMF)_of _reipient ','GDP_per_capita','recipient_count','recipient_cow_code','recipient_oecd_code'...
  'recipient_un_code','recipient_imf_code','Africa ','Middle East','Asia','The Pacific','Latin America and the Caribbean','Central and Eastern Europe'...
  'line_of_credit','is_cofinanced','is_ground_truthing','loan_type','interest_rate','maturity','grace_period','grant_element','source_triangulation'...
  'field_completeness'};
% B_train is 2055 x 1 (1 if debt distressed, 0 if not)
B_train = D_num(:,90);
% Deal with missing GDP (IMF)
no_GDP=(A_train(:,66)==0); % logical mask of rows missing GDP (stored as 0 instead)
avg_age=nanmean(A_train(no_GDP==0,66)); % average GDP over the rows that have a value
A_train(no_GDP==1,66)=avg_age; % fill in the missing GDP entries with that average
% Deal with missing GDP per capita
no_GDP2=(A_train(:,67)==0); % logical mask of rows missing GDP per capita (stored as 0 instead)
avg_age2=nanmean(A_train(no_GDP2==0,67)); % average GDP per capita over the rows that have a value
A_train(no_GDP2==1,67)=avg_age2; % fill in the missing entries with that average

The error occurs here:

k=8; % Number of clusters
dist_type='sqeuclidean'; % Distance metric (others include 'cityblock' (L1), 'cosine', and 'correlation')
[clust,centr]=kmeans(A_train,k,'dist',dist_type); % returns cluster assignments & centroid of each cluster

And I have not been able to continue past this point:

figure(1) 
colstyle = {'cs','rd','b^','go','k+','d',':bs','-mo'}; %define 8 color/style combos for this plot
attribs=[1 2 3]; %categories for x, y, and z axes
for j=1:k 
    q=find(clust==j); %ID numbers of the items in this cluster
    nsample(j)=length(q); %Sample size in the cluster
    survival(j)=mean(B_train(q)); %Survival rate within this cluster
    plot3(A_train(q,attribs(1)),A_train(q,attribs(2)),A_train(q,attribs(3)),colstyle{j}) % 3-D plot with marker types by cluster
    hold on
end
hold off
legend('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5','Cluster 6','Cluster 7','Cluster 8');
xlabel(all_factors(attribs(1)));
ylabel(all_factors(attribs(2)));
zlabel(all_factors(attribs(3)));
figure(2);
silhouette(A_train,clust,dist_type)
% Try various numbers of clusters
nn=100; dist_type='sqeuclidean';
for j=2:nn
    [clust,centr,sumd]=kmeans(A_train,j,'dist',dist_type);
    Dtot(j,1)=sum(sumd);
end
figure(3)
plot(2:nn,Dtot(2:nn),'b-');


Answer 1

Adam Danz on 10 Apr 2019


https://kr.mathworks.com/matlabcentral/answers/455569-why-do-i-get-this-message-error-using-kmeans-x-must-have-more-rows-than-the-number-of-clusters#answer_370031

Edited: Adam Danz on 10 Apr 2019

Possibility 1

Your variable 'A_train' does not have enough rows. You are requesting 8 clusters (k=8) and, as the error indicates, 'A_train' needs to have at least k+1 rows.

[clust,centr]=kmeans(A_train,k,'dist',dist_type);

To confirm this is the problem, call this just before the kmeans() function.

size(A_train)

Possibility 2

Your variable 'A_train' has too many rows that contain at least one NaN value. kmeans() ignores any row that contains at least one NaN value. To check whether you have enough rows that are free of NaN values, run this line:

sum(all(~isnan(A_train), 2))
ans =
     5    % only 5 rows are free of NaN values, which is fewer than k (8)
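
It can also help to see which columns the NaN values come from. The following is a minimal sketch rather than a definitive check: the matrix X is made-up data standing in for 'A_train', and the variable names (completeRows, nComplete, nanPerColumn) are only illustrative.

```matlab
% Hypothetical stand-in for A_train: 6 rows, 3 columns, a few NaNs
X = [1 2 NaN; 4 5 6; NaN 8 9; 10 11 12; 13 NaN 15; 16 17 18];
k = 4;  % requested number of clusters (illustrative value)

completeRows = all(~isnan(X), 2);   % true where a row contains no NaN
nComplete    = sum(completeRows);   % number of rows kmeans can actually use
nanPerColumn = sum(isnan(X), 1);    % count of missing values in each column

fprintf('%d of %d rows are complete\n', nComplete, size(X,1));
if nComplete <= k
    % kmeans requires more usable rows than clusters
    warning('Only %d complete rows; kmeans needs more than k = %d.', nComplete, k);
end
```

With this toy matrix, three rows are complete, so asking for four clusters would trigger the warning.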

Comments

Alayt Abraham Issak on 11 Apr 2019

Edited: Alayt Abraham Issak on 11 Apr 2019

This makes a lot of sense as I do have a lot of NaN values in my data set. This is because there are columns in which I do not know many of the values and so I did not want to tamper with the data by adding values.

However, I have managed to fix the issue by eliminating those columns from the data set. Nonetheless, as they are essential to my analysis, could you recommend another way of using kmeans despite the prevalence of NaN values, or another similarity function?

Adam Danz on 11 Apr 2019

To answer that, I'd step back from thinking about how to implement the analysis and look at the more fundamental problem of classifying data with missing values. There is no easy solution for this problem.

Sometimes missing data only accounts for a small portion of the dataset and those samples can just be ignored. That doesn't seem to be the case with your data.
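
For the case where only a few samples are incomplete, here is a rough sketch of ignoring them explicitly while keeping the cluster labels aligned with the original rows. The data in X and the names keep and clust are hypothetical, and kmeans() requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical data: two well-separated groups plus two rows with a NaN
X = [0 0; 0.1 0.2; 0.2 0.1; 5 5; 5.1 5.2; 5.2 5.1; NaN 1; 3 NaN];
rng(0, 'twister');                    % reproducible starting centroids

keep  = all(~isnan(X), 2);            % rows with a complete set of values
clust = nan(size(X,1), 1);            % NaN marks rows that were not clustered
clust(keep) = kmeans(X(keep,:), 2);   % cluster only the complete rows
```

Here clust has one entry per original row, so it can still be used to index other variables of the same height.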

If you're classifying a matrix with many variables (columns) and there's just one variable that contains most of the missing data, you could run the analysis without that variable as long as it's not an influential variable.
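
As a sketch of that idea, assuming a cutoff on the fraction of missing values per column (the 0.5 threshold, the toy matrix X, and the variable names are all illustrative choices, not a recommendation):

```matlab
% Hypothetical data: column 3 holds most of the missing values
X = [1 2 NaN; 4 5 NaN; 7 8 9; 10 11 NaN; 13 NaN 15];
maxMissing = 0.5;                         % assumed cutoff, pick it deliberately

fracMissing = mean(isnan(X), 1);          % fraction of NaNs in each column
dropCols    = fracMissing > maxMissing;   % columns dominated by missing data
Xreduced    = X(:, ~dropCols);            % matrix without those columns

nBefore = sum(all(~isnan(X), 2));         % complete rows using every column
nAfter  = sum(all(~isnan(Xreduced), 2));  % complete rows after dropping columns
fprintf('Complete rows: %d before, %d after\n', nBefore, nAfter);
```

Comparing nBefore with nAfter shows how many usable rows are recovered; whether the dropped variable is influential still has to be judged separately.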

You could determine the number of rows that contain a complete set of data and reduce the number of clusters accordingly, but that's usually a poor solution since the number of clusters should be chosen with intention.

Some sources suggest that you could fill in missing values with means or random values, but such arbitrary decisions are bad practice and can really throw off the results such that they no longer represent the underlying unknown reality.

A simple search on Google Scholar lists these two papers with >100 citations. They discuss the problem of missing data in classification models and potential solutions.

http://www.jmlr.org/papers/v8/saar-tsechansky07a.html - See the pdf link

https://link.springer.com/chapter/10.1007/BFb0052868
