silhouette

Silhouette plot

Syntax

silhouette(X,clust)
silhouette(X,clust,Distance)
silhouette(X,clust,Distance,DistParameter)
s = silhouette(___)
[s,h] = silhouette(___)

Description

silhouette(X,clust) plots cluster silhouettes for the n-by-p input data matrix X, given the cluster assignment clust of each point (observation) in X.

silhouette(X,clust,Distance) plots the silhouettes using the inter-point distance metric specified in Distance.

silhouette(X,clust,Distance,DistParameter) accepts one or more additional distance metric parameter values when you specify Distance as a custom distance function handle @distfun that accepts the additional parameter values.

s = silhouette(___) returns the silhouette values in s for any of the input argument combinations in the previous syntaxes without plotting the cluster silhouettes.

[s,h] = silhouette(___) plots the silhouettes and returns the figure handle h in addition to the silhouette values in s.

Examples

Create silhouette plots from clustered data using different distance metrics.

Generate random sample data.

rng('default')  % For reproducibility
X = [randn(10,2)+3;randn(10,2)-3];

Create a scatter plot of the data.

scatter(X(:,1),X(:,2));
title('Randomly Generated Data');

The scatter plot shows that the data appears to be split into two clusters of equal size.

Partition the data into two clusters using kmeans with the default squared Euclidean distance metric.

clust = kmeans(X,2);

clust contains the cluster indices of the data.

Create a silhouette plot from the clustered data using the default squared Euclidean distance metric.

silhouette(X,clust)

The silhouette plot shows that the data is split into two clusters of equal size. All the points in the two clusters have large silhouette values (0.8 or greater), indicating that the clusters are well separated.

Create a silhouette plot from the clustered data using the Euclidean distance metric.

silhouette(X,clust,'Euclidean')

The silhouette plot shows that the data is split into two clusters of equal size. All the points in the two clusters have large silhouette values (0.6 or greater), indicating that the clusters are well separated.

Compute the silhouette values from clustered data.

Generate random sample data.

rng('default')  % For reproducibility
X = [randn(10,2)+1;randn(10,2)-1];

Cluster the data in X by using kmeans with the city block distance metric (sum of absolute differences).

clust = kmeans(X,2,'distance','cityblock');

clust contains the cluster indices of the data.

Compute the silhouette values from the clustered data. Specify the distance metric as 'cityblock' to indicate that the kmeans clustering is based on the sum of absolute differences.

s = silhouette(X,clust,'cityblock')
s = 20×1

    0.0816
    0.5848
    0.1906
    0.2781
    0.3954
    0.4050
    0.0897
    0.5416
    0.6203
    0.6664
      ⋮

Find silhouette values from clustered data using a custom chi-square distance metric. Verify that the chi-square distance metric reduces to the Euclidean distance metric when every weight (the optional scaling parameter) equals 1.

Generate random sample data.

rng('default'); % For reproducibility
X = [randn(10,2)+3;randn(10,2)-3];

Cluster the data in X using kmeans with the default squared Euclidean distance metric.

clust = kmeans(X,2);

Find silhouette values and create a silhouette plot from the clustered data using the Euclidean distance metric.

[s,h] = silhouette(X,clust,'Euclidean')

s = 20×1

    0.6472
    0.7241
    0.5682
    0.7658
    0.7864
    0.6397
    0.7253
    0.7783
    0.7054
    0.7442
      ⋮

h = 
  Figure (1) with properties:

      Number: 1
        Name: ''
       Color: [0.9400 0.9400 0.9400]
    Position: [348 480 583 437]
       Units: 'pixels'

  Show all properties

The chi-square distance between J-dimensional points x and z is

$$\chi(x,z) = \sqrt{\sum_{j=1}^{J} w_j (x_j - z_j)^2},$$

where $w_j$ is the weight associated with dimension $j$.

Set weights for each dimension and specify the chi-square distance function. The distance function must:

  • Take as input arguments, in this order, one row of X (for example, x), the n-by-p input data matrix (named Z in the function handle below), and a scaling (or weight) parameter w.

  • Calculate the distance from x to each row of the data matrix.

  • Return a vector of length n. Each element of the vector is the distance between the observation corresponding to x and the observation corresponding to one row of the data matrix.

w = [0.4; 0.6]; % Set arbitrary weights for illustration
chiSqrDist = @(x,Z,w)sqrt((bsxfun(@minus,x,Z).^2)*w);

Find silhouette values from the clustered data using the custom distance metric chiSqrDist.

s1 = silhouette(X,clust,chiSqrDist,w)
s1 = 20×1

    0.6288
    0.7239
    0.6244
    0.7696
    0.7957
    0.6688
    0.7386
    0.7865
    0.7223
    0.7572
      ⋮

Set the weight for both dimensions to 1 to use chiSqrDist as the Euclidean distance metric. Find silhouette values and verify that they are the same as the values in s.

w2 = [1; 1];
s2 = silhouette(X,clust,chiSqrDist,w2);
AreValuesEqual = isequal(s2,s)
AreValuesEqual = logical
   1

The silhouette values are the same in s and s2.

Input Arguments

X: Input data, specified as a numeric matrix of size n-by-p. Rows correspond to points, and columns correspond to coordinates.

Data Types: single | double

clust: Cluster assignment, specified as a categorical variable, numeric vector, character matrix, string array, or cell array of character vectors containing a cluster name for each point in X.

silhouette treats NaNs and empty values in clust as missing values and ignores the corresponding rows of X.

Data Types: single | double | char | string | cell | categorical
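The following sketch illustrates that the type used for the cluster labels should not affect the silhouette values. The label names 'high' and 'low' and the hand-made assignment are assumptions chosen for illustration; they are not part of the original examples.

```matlab
rng('default')  % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];

% Hypothetical labels: the first ten rows form one group, the rest another
numLabels  = [ones(10,1); 2*ones(10,1)];
names      = {'high'; 'low'};
cellLabels = names(numLabels);          % 20-by-1 cell array of character vectors
catLabels  = categorical(cellLabels);   % categorical version of the same labels

% Requesting an output returns the values without creating a plot
sNum  = silhouette(X,numLabels);
sCell = silhouette(X,cellLabels);
sCat  = silhouette(X,catLabels);

% Largest difference between the three results
maxDiff = max(abs([sNum - sCell; sNum - sCat]))
```

If the three label types describe the same partition, maxDiff is expected to be zero or negligibly small.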

Distance: Distance metric, specified as a character vector, string scalar, numeric vector, or function handle, as described in this table.

Metric           Description
'Euclidean'      Euclidean distance
'sqEuclidean'    Squared Euclidean distance (default)
'cityblock'      Sum of absolute differences
'cosine'         One minus the cosine of the included angle between points (treated as vectors)
'correlation'    One minus the sample correlation between points (treated as sequences of values)
'Hamming'        Percentage of coordinates that differ
'Jaccard'        Percentage of nonzero coordinates that differ
Vector           A numeric row vector of pairwise distances, in the form created by the pdist function. X is not used in this case, and can safely be set to [].
@distfun         Custom distance function handle. A distance function has the form

function D = distfun(X0,X,DistParameter)
% calculation of distance
...
where

  • X0 is a 1-by-p vector containing a single point (observation) of the input data matrix X.

  • X is an n-by-p matrix of points.

  • DistParameter represents one or more additional parameter values specific to @distfun.

  • D is an n-by-1 vector of distances, and D(k) is the distance between observations X0 and X(k,:).

For more information, see Distance Metrics.
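As a minimal sketch of the distance function form described above, the following handle computes the Chebyshev (maximum coordinate difference) distance. The handle name chebyDist and the hand-made cluster assignment are assumptions for illustration, and the handle takes no additional parameter values.

```matlab
rng('default')  % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];
clust = [ones(10,1); 2*ones(10,1)];   % assumed labels, for illustration only

% X0 is 1-by-p, X is n-by-p, and the result is an n-by-1 vector
chebyDist = @(X0,X) max(abs(bsxfun(@minus,X,X0)),[],2);

% Distances from the first point to every row of X
D = chebyDist(X(1,:),X);

% Silhouette values based on the custom metric
s = silhouette(X,clust,chebyDist);
```

Checking the size of D on a single row, as done here, is a quick way to confirm that a custom function returns an n-by-1 vector before passing it to silhouette.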

Example: 'cosine'

Data Types: char | string | function_handle | single | double
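The vector form of Distance can be useful when the pairwise distances are already available. The sketch below, which assumes a hand-made cluster assignment, passes the output of pdist and compares the result with the values obtained from the data matrix and the same metric.

```matlab
rng('default')  % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];
clust = [ones(10,1); 2*ones(10,1)];   % assumed labels, for illustration only

% Row vector of pairwise city block distances, in pdist form
d = pdist(X,'cityblock');

% X is not used when the distances are supplied, so pass []
sVec = silhouette([],clust,d);

% Reference values computed directly from the data
sRef = silhouette(X,clust,'cityblock');
maxDiff = max(abs(sVec - sRef))
```

The two results are expected to agree up to rounding error.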

DistParameter: Distance metric parameter value, specified as a positive scalar, numeric vector, or numeric matrix. This argument is valid only when you specify a custom distance function handle @distfun that accepts one or more parameter values in addition to the input parameters X0 and X.

Example: silhouette(X,clust,distfun,p1,p2) where p1 and p2 are additional distance metric parameter values for @distfun

Data Types: single | double
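The sketch below shows one way a custom distance function might take two additional parameter values. The weighted Minkowski handle wMinkDist, the weights w, and the exponent p are hypothetical names and values chosen for illustration.

```matlab
rng('default')  % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];
clust = [ones(10,1); 2*ones(10,1)];   % assumed labels, for illustration only

% Weighted Minkowski distance: (sum_j w_j*|x_j - z_j|^p)^(1/p)
% X0 is 1-by-p, X is n-by-p, w is a column of weights, p is the exponent
wMinkDist = @(X0,X,w,p) (abs(bsxfun(@minus,X,X0)).^p * w).^(1/p);

w = [0.4; 0.6];   % arbitrary weights
p = 3;            % arbitrary exponent

% The two extra arguments are passed on to the distance function
s = silhouette(X,clust,wMinkDist,w,p);

% With unit weights and p = 2, the metric is the Euclidean distance
sUnit = silhouette(X,clust,wMinkDist,[1; 1],2);
sEuc  = silhouette(X,clust,'Euclidean');
maxDiff = max(abs(sUnit - sEuc))
```

Comparing a special case against a built-in metric, as in the last lines, is a simple sanity check for a custom distance function.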

Output Arguments

s: Silhouette values, returned as an n-by-1 vector of values ranging from –1 to 1. A silhouette value measures how similar a point is to points in its own cluster, when compared to points in other clusters. A high silhouette value indicates that a point is well matched to its own cluster, and poorly matched to other clusters.

Data Types: single | double

h: Figure handle, returned as a scalar. You can use the figure handle to query and modify figure properties. For more information, see Figure Properties.
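As a brief sketch of working with the returned handle, the following code sets a figure property and adds a title to the axes inside the figure. The cluster assignment and the text strings are assumptions for illustration, and dot notation for properties assumes a release that supports graphics objects (R2014b or later).

```matlab
rng('default')  % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];
clust = [ones(10,1); 2*ones(10,1)];   % assumed labels, for illustration only

% Plot the silhouettes and keep the figure handle
[s,h] = silhouette(X,clust);

% Modify a figure property through the handle
h.Name = 'Cluster silhouettes';

% Retrieve the axes in the figure and add a title
ax = get(h,'CurrentAxes');
title(ax,'Silhouette values by cluster')
```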

More About

Silhouette Value

The silhouette value for each point is a measure of how similar that point is to points in its own cluster, when compared to points in other clusters. The silhouette value $S_i$ for the $i$th point is defined as

$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where $a_i$ is the average distance from the $i$th point to the other points in the same cluster as $i$, and $b_i$ is the minimum average distance from the $i$th point to points in a different cluster, minimized over clusters.
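The definition can be checked by computing the values directly. The following sketch is not the implementation used by silhouette; it is a plain loop that follows the formula, assuming numeric cluster labels 1 to k, at least two points in every cluster, and the squared Euclidean metric.

```matlab
rng('default')  % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];
clust = [ones(10,1); 2*ones(10,1)];   % assumed labels, for illustration only
n = size(X,1);
k = max(clust);

% Pairwise squared Euclidean distances
D = zeros(n);
for i = 1:n
    D(i,:) = sum(bsxfun(@minus,X,X(i,:)).^2,2)';
end

sManual = zeros(n,1);
for i = 1:n
    own = (clust == clust(i));
    own(i) = false;                         % exclude the point itself
    a = mean(D(i,own));                     % average distance within own cluster
    b = Inf;
    for c = setdiff(1:k,clust(i))
        b = min(b,mean(D(i,clust == c)));   % nearest other cluster, on average
    end
    sManual(i) = (b - a)/max(a,b);
end

% Compare with the values returned by silhouette (default metric)
maxDiff = max(abs(sManual - silhouette(X,clust)))
```

Under the stated assumptions, the manual values are expected to match the output of silhouette up to rounding error.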

The silhouette value ranges from –1 to 1. A high silhouette value indicates that i is well matched to its own cluster, and poorly matched to other clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution might have too many or too few clusters. You can use silhouette values as a clustering evaluation criterion with any distance metric.
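One common way to apply this criterion is to compare the mean silhouette value across candidate numbers of clusters. The sketch below assumes kmeans clustering with five replicates and a candidate range of 2 to 4 clusters; both choices are arbitrary and only for illustration.

```matlab
rng('default')  % For reproducibility
X = [randn(10,2)+3; randn(10,2)-3];

kValues = 2:4;                        % candidate numbers of clusters
meanSil = zeros(size(kValues));
for idx = 1:numel(kValues)
    % Cluster with several restarts to reduce sensitivity to initialization
    clust = kmeans(X,kValues(idx),'Replicates',5);
    % Average silhouette value for this solution (no plot is created)
    meanSil(idx) = mean(silhouette(X,clust));
end

% Candidate with the highest mean silhouette value
[~,best] = max(meanSil);
bestK = kValues(best)
```

The evalclusters function offers a similar comparison through its 'silhouette' criterion, which may be more convenient than writing the loop by hand.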

References

[1] Kaufman, L., and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: John Wiley & Sons, Inc., 1990.

Introduced before R2006a