Main Content

cluster

Validate clusters in phylogenetic tree

Syntax

LeafClusters = cluster(Tree, Threshold)
[LeafClusters, NodeClusters] = cluster(Tree, Threshold)
[LeafClusters, NodeClusters, Branches] = cluster(Tree, Threshold)
cluster(..., 'Criterion', CriterionValue, ...)
cluster(..., 'MaxClust', MaxClustValue, ...)
cluster(..., 'Distances', DistancesValue, ...)

Input Arguments

Tree

Phylogenetic tree object created, such as created with the phytree constructor function.

Threshold

Scalar specifying a threshold value.

CriterionValue

Character vector or string specifying the criterion to determine the number of clusters as a function of the species pairwise distances. Choices are:

  • 'maximum' (default) — Maximum within cluster pairwise distance (Wmax). Cluster splitting stops when WmaxThreshold.

  • 'median' — Median within cluster pairwise distance (Wmed). Cluster splitting stops when WmedThreshold.

  • 'average' — Average within cluster pairwise distance (Wavg). Cluster splitting stops when WavgThreshold.

  • 'ratio' — Between/within cluster pairwise distance ratio, defined as

    BWrat = (trace(B)/(k - 1)) / (trace(W)/(n - k))

    where B and W are the between- and within-scatter matrices, respectively. k is the number of clusters, and n is the number of species in the tree. Cluster splitting stops when BWratThreshold.

  • 'gain' — Within cluster pairwise distance gain, defined as

    Wgain = (trace(Wold)/ (trace(W) - 1) * (n - k - 1))

    where W and Wold are the within-scatter matrices for k and k - 1, respectively. k is the number of clusters, and n is the number of species in the tree. Cluster splitting stops when WgainThreshold.

  • 'silhouette' — Average silhouette width (SWavg). SWavg ranges from -1 to +1. Cluster splitting stops when SWavgThreshold. For more information, see silhouette.

MaxClustValue

Positive integer specifying the maximum number of possible clusters for the tested partitions. Default is the number of leaves in the tree.

Tip

When using the 'maximum', 'median', or 'average' criteria, set Threshold to [] (empty) to force cluster to return MaxClustValue clusters. It does so because such metrics monotonically decrease as k increases.

Tip

When using the 'ratio', 'gain', or 'silhouette' criteria, you may find it hard to estimate an appropriate Threshold in advance. Set Threshold to [] (empty) to find the optimal number of clusters below MaxClustValue. Also, set MaxClustValue to a small value to avoid expensive computation due to testing all possible number of clusters.

DistancesValue

Matrix of pairwise distances, such as returned by the seqpdist function, containing biological distances between each pair of sequences. cluster substitutes this matrix for the patristic distances in Tree. For example, this matrix can contain the real sample pairwise distances.

Output Arguments

LeafClusters

Column vector containing a cluster index for each species (leaf) in Tree, a phylogenetic tree object.

NodeClusters

Column vector containing the cluster index for each leaf node and branch node in Tree.

Tip

Use the LeafClusters or NodeClusters output vectors with the handle returned by the plot method to modify graphic elements of the phylogenetic tree object. For more information, see Examples.

Branches

Two-column matrix containing, for each step in the algorithm, the index of the branch being considered and the value of the criterion. Each row corresponds to a step in the algorithm. The first column contains branch indices, and the second column contains criterion values.

Tip

To obtain the whole curve of the criterion versus the number of clusters in Branches, set Threshold to [] (empty) and do not specify a MaxClustValue. Be aware that computation of some criteria can be computationally intensive.

Description

LeafClusters = cluster(Tree, Threshold) returns a column vector containing a cluster index for each species (leaf) in a phylogenetic tree object. It determines the optimal number of clusters as follows:

  • Starting with two clusters (k = 2), selects the partition that optimizes the criterion specified by the 'Criterion' property

  • Increments k by 1 and again selects the optimal partition

  • Continues incrementing k and selecting the optimal partition until a criterion value = Threshold or k = the maximum number of clusters (that is, number of leaves)

  • From all possible k values, selects the k value whose partition optimizes the criterion

[LeafClusters, NodeClusters] = cluster(Tree, Threshold) returns a column vector containing the cluster index for each leaf node and branch node in Tree.

[LeafClusters, NodeClusters, Branches] = cluster(Tree, Threshold) returns a two-column matrix containing, for each step in the algorithm, the index of the branch being considered and the value of the criterion. Each row corresponds to a step in the algorithm. The first column contains branch indices, and the second column contains criterion values.

cluster(..., 'PropertyName', PropertyValue, ...) calls cluster with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Enclose each PropertyName in single quotation marks. Each PropertyName is case insensitive. These property name/property value pairs are as follows:.

cluster(..., 'Criterion', CriterionValue, ...) specifies the criterion to determine the number of clusters as a function of the species pairwise distances.

cluster(..., 'MaxClust', MaxClustValue, ...) specifies the maximum number of possible clusters for the tested partitions. Default is the number of leaves in the tree.

cluster(..., 'Distances', DistancesValue, ...) substitutes the patristic distances in Tree with a user-provided pairwise distance matrix.

Examples

Validate the clusters in a phylogenetic tree:

% Read sequences from a multiple alignment file into a MATLAB
% structure
gagaa = multialignread('aagag.aln');

% Build a phylogenetic tree from the sequences
gag_tree = seqneighjoin(seqpdist(gagaa),'equivar',gagaa);

% Validate the clusters in the tree and find the best partition
% using the 'gain' criterion
[i,j] = cluster(gag_tree,[],'criterion','gain','maxclust',10);

% Use the returned vector of indices to color the branches of each
% cluster in a plot of the tree
h = plot(gag_tree);
set(h.BranchLines(j==2),'Color','b')
set(h.BranchLines(j==1),'Color','r')

References

[1] Dudoit, S. and Fridlyan, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7), research 0036.1–0036.21.

[2] Theodoridis, S. and Koutroumbas, K. (1999). Pattern Recognition (Academic Press), pp. 434–435.

[3] Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis (New York, Wiley).

[4] Calinski, R. and Harabasz, J. (1974). A dendrite method for cluster analysis. Commun Statistics 3, 1–27.

[5] Hartigan, J.A. (1985). Statistical theory in clustering. J Classification 2, 63–76.