Clustering using Gower's Distance

MByk on 22 Jul 2025
Edited: Torsten on 22 Jul 2025
Hello all, I have a dataset that includes both categorical and numerical features, and I'm looking to perform clustering on it. I've read that Gower's Distance (code is available) is suitable for handling mixed data types. However, I am getting an "isnan" error. How can I fix the problem? Thanks for the help.
DataSet = readtable("Test.xlsx", 'ReadVariableNames', true);
GowerDst = gower(DataSet);
[Idx, C] = kmedoids(DataSet, 2, 'Distance', GowerDst);
Error using isnan
Invalid data type. Argument must be numeric, char, or logical.
Error in kmedoids (line 220)
wasnan = any(isnan(X),2);
^^^^^^^^
Error in Gower_Distance (line 9)
[Idx, C] = kmedoids(DataSet, 2, 'Distance', GowerDst);
  2 Comments
the cyclist on 22 Jul 2025
Can you upload the data, or a representative sample that illustrates the problem? You can use the paper clip icon in the INSERT section of the toolbar.
MByk on 22 Jul 2025
The dataset is pretty big, so I only uploaded part of it.

Answers (1)

Torsten on 22 Jul 2025
Edited: Torsten on 22 Jul 2025
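To use a distance that is not built in, you have to pass a function handle. kmedoids accepts 'Distance' only as the name of a built-in metric or as a function handle, not as a precomputed distance matrix, and GowerDst is a matrix, so the call cannot work as written. Note also that the first argument must be a numeric matrix: the isnan error itself is raised because DataSet is a table.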
Look at the documentation for "kmedoids" for more details:
@distfun
Custom distance function handle. A distance function has the form
function D2 = distfun(ZI,ZJ)
% calculation of distance
...
where
  • ZI is a 1-by-n vector containing a single observation.
  • ZJ is an m2-by-n matrix containing multiple observations. distfun must accept a matrix ZJ with an arbitrary number of observations.
  • D2 is an m2-by-1 vector of distances, and D2(k) is the distance between observations ZI and ZJ(k,:).
If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle.
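As a minimal sketch of the form quoted above, the following handle reproduces the city block distance on made-up numeric data. The matrix X and the name distfun are hypothetical, chosen only for illustration; the call assumes the Statistics and Machine Learning Toolbox and a release with implicit expansion (R2016b or later).

```matlab
% Hypothetical numeric data: two well-separated groups in 2-D
rng(0);                                   % fixed seed, only for repeatability
X = [randn(10, 2); randn(10, 2) + 8];

% Custom distance in the documented form D2 = distfun(ZI, ZJ):
%   ZI is 1-by-n, ZJ is m2-by-n, and the result must be m2-by-1.
% This particular handle computes the city block (L1) distance.
distfun = @(ZI, ZJ) sum(abs(ZJ - ZI), 2);

% Pass the handle itself (not a matrix) as the 'Distance' value
[idx, C] = kmedoids(X, 2, 'Distance', distfun);
```

Here X is a plain numeric matrix, which is what kmedoids expects as its first argument; a table has to be converted or replaced by row indices before the call.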
  2 Comments
MByk on 22 Jul 2025
Edited: MByk on 22 Jul 2025
Looks like I have to modify the "Gower's Distance" function as well, and that's exactly what I was worried about.
Torsten on 22 Jul 2025
Edited: Torsten on 22 Jul 2025
See Edward Barnard's answer here:
I suggest you check the approach on a built-in distance first: run kmedoids once with a precomputed distance matrix supplied as below and once with the corresponding 'Distance','...' option, then compare the results.
Or take a look at
DataSet = readtable("Test.xlsx", 'ReadVariableNames', true);
Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table. The original column headers are saved in the VariableDescriptions property.
Set 'VariableNamingRule' to 'preserve' to use the original column headers as table variable names.
GowerDst = gower(DataSet);
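K = 2;
N = height(DataSet);   % number of observations (18 in the uploaded sample)
% Cluster the row indices 1..N instead of the table itself; the handle looks
% up the precomputed distances in GowerDst. C then holds medoid row indices.
[idx, C, sumd] = kmedoids((1:N)', K, 'Distance', @(ZI, ZJ) GowerDst(ZJ, ZI));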
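function D = gower(data)
% GOWER  Pairwise Gower distance between the rows of a table.
% Numeric columns contribute range-normalized absolute differences;
% cell, categorical, and other object columns contribute 0/1 mismatches.
    [n, p] = size(data);
    D = zeros(n, n);
    for i = 1:p
        column = data{:, i};
        if isnumeric(column)
            range = max(column) - min(column);
            if range == 0
                continue;   % constant column adds no distance
            end
            d = abs(column - column') / range;
        elseif iscell(column) || iscategorical(column) || isobject(column)
            d = zeros(n, n);
            for j = 1:n
                for k = 1:n
                    % Parenthesis indexing works for cell and categorical
                    % arrays alike; brace indexing fails on categoricals.
                    d(j,k) = ~isequal(column(j), column(k));
                end
            end
        else
            warning('Skipping column %d: unsupported data type', i);
            continue;
        end
        D = D + d;
    end
    D = D / p;   % average over all p columns, including skipped ones
end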
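The check suggested above can be sketched as follows, using the built-in city block distance so that both routes are available. The data X is made up and the variable names are hypothetical; the sketch assumes the Statistics and Machine Learning Toolbox (kmedoids, pdist, squareform). Because kmedoids starts from random seeds, several replicates are requested, and the partitions are compared rather than the raw labels.

```matlab
% Hypothetical numeric test data: two well-separated groups
rng(1);                                   % fixed seed, only for repeatability
X = [randn(9, 2); randn(9, 2) + 10];
N = size(X, 1);
K = 2;

% Route 1: precomputed distance matrix, clustered through row indices
D = squareform(pdist(X, 'cityblock'));
[idx1, ~, sumd1] = kmedoids((1:N)', K, ...
    'Distance', @(ZI, ZJ) D(ZJ, ZI), 'Replicates', 5);

% Route 2: the same metric through the built-in option
[idx2, ~, sumd2] = kmedoids(X, K, 'Distance', 'cityblock', 'Replicates', 5);

% Cluster labels may be permuted between runs, so compare the grouping
% relative to the first observation instead of the label values.
samePartition = isequal(idx1 == idx1(1), idx2 == idx2(1));
fprintf('Same partition: %d, total cost: %.4f vs %.4f\n', ...
    samePartition, sum(sumd1), sum(sumd2));
```

If the two routes agree on a built-in metric, the same index-based handle can be used with the Gower matrix with more confidence. On less clearly separated data the two runs may settle in different local optima, so a mismatch there does not by itself mean the handle is wrong.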

Release: R2025a
