Statistics`ClusterAnalysis`Cluster analysis is an unsupervised learning technique used for classification of data. Data elements are partitioned into groups called clusters representing collections of data elements that are proximate based on a distance or dissimilarity function. The distance or dissimilarity function measures how dissimilar data elements are. Identical element pairs have zero distance or dissimilarity and all others have positive distance or dissimilarity. General clustering function. The data argument of FindClusters can be a list of data elements or rules indexing elements and labels. Ways of specifying data in FindClusters. The data elements can be numeric lists, matrices, or tensors, lists of True and False elements, or lists of strings. All data elements must have the same dimensions. This loads the package. Here is a list of numbers. FindClusters clusters the numbers based on their proximity.
Out[3]= |  |
The rule-based data syntax allows for clustering data elements and returning labels for those elements. Here two-dimensional points are clustered and labeled with their positions in the data list.
Out[5]= |  |
The rule-based data syntax can also be used to cluster data based on parts of each data entry. For instance, you might want to cluster data in a data table while ignoring particular columns in the table. Here is a list of data entries. This clusters the data while ignoring the first two elements in each data entry.
Out[7]= |  |
In principle, it is possible to cluster points given in an arbitrary number of dimensions. However, it is difficult at best to visualize the clusters above two or three dimensions. For comparison of optional methods in this documentation, an easily visualizable set of two-dimensional data will be used. The following commands define a set of 300 two-dimensional data points chosen to group in four somewhat nebulous clusters. This defines a command that will display two-dimensional data with different colors for different clusters. This clusters the data based on the proximity of points. Here is a plot of the clusters.
With the default settings, FindClusters has found the four clusters of points. You can also direct FindClusters to find a specific number of clusters. This shows the effect of choosing 3 clusters.
This shows the effect of choosing 5 clusters.
Options for FindClusters. Randomness is used in clustering in two different ways. Some of the methods use a random assignment of some points to a specific number of clusters as a starting point. Randomness may also be used to help determine what seems to be the best number of clusters to use. Changing the random seed for generating the randomness by using FindClusters[{ , , ... }, RandomSeed -> s] may lead to different results for some cases. In principle, clustering techniques can be applied to any set of data. What is needed is a measure of how far apart each element in the set is from other elements: basically a function giving distance between elements. FindClusters[{ , , ... }, DistanceFunction -> f] treats pairs of elements as being less similar when their distances f[ , ] are larger. The function f can be any appropriate distance or dissimilarity function. A dissimilarity function satisfies the following:
If the are vectors of numbers, FindClusters by default uses a squared Euclidean distance. If the are lists of Boolean True and False elements, FindClusters by default uses a dissimilarity based on the normalized fraction of elements that disagree. If the are strings, FindClusters by default uses a distance function based on the number of point changes needed to get from one string to another. Distance and dissimilarity functions for numerical data. This shows the clusters in datapairs found using a Manhattan distance.
Dissimilarities for Boolean vectors are typically calculated by comparing the elements of two Boolean vectors and pairwise and counting the number of true results for various Boolean functions applied to the pairs. It is convenient to summarize each dissimilarity function in terms of four sums representing the four possible elements of the truth table: a, the number of pairs for which and are both True; b, the number of pairs for which is True but is False; c the number of pairs for which is False but is True; and d the number of pairs for which both and are False. Dissimilarity functions for Boolean data. Here is some Boolean data. These are the clusters found using the default dissimilarity for Boolean data.
Out[18]= |  |
Dissimilarity function for string data. The edit distance is determined by counting the number of deletions, insertions, and substitutions required to transform one string into another while preserving the ordering of characters. Here is some string data. This clusters the string data.
Out[20]= |  |
The Method option can be used to specify different methods of clustering. Explicit settings for the Method option. The methods "Agglomerate" and "Optimize" determine how to cluster the data for a particular number of clusters "Agglomerate" uses an agglomerative hierarchical method starting with each member of the set in a cluster of its own and fusing nearest clusters until there are remaining. "Optimize" starts by building a set of representative objects and clustering around those, iterating until a (locally) optimal clustering is found. The default "Optimize" method is based on partitioning around medoids [1]. Additional Method suboptions are available to allow for more control over the clustering. Available suboptions depend on the Method chosen. Suboption for all methods. For a given set of data and distance function, the choice of the best number of clusters may be unclear. With Method->{methodname,"SignificanceTest"->"stest"}, "stest" is used to determine statistically significant clusters to help choose an appropriate number of them. Possible values of "stest" are "Silhouette" and "Gap". The "Silhouette" test uses the silhouette statistic [2] to test how well data are clustered. The "Gap" test uses the gap statistic [3] to determine how well data are clustered. The "Silhouette" test subdivides the data into successively more clusters looking for the first minimum of the silhouette statistic. The "Gap" test compares the dispersion of clusters generated from the data to that derived from a sample of null hypothesis sets. The null hypothesis sets are uniformly randomly distributed data in the box defined by the principal components of the input data. The "Gap" method takes two suboptions: "NullSets" and "Tolerance". The suboption "NullSets" sets the number of null hypothesis sets to compare with the input data. The option "Tolerance" sets the sensitivity. Typically larger values of "Tolerance" will favor fewer clusters being chosen. The default settings are "NullSets"->5 and "Tolerance"->1. This shows the result of clustering datapairs using the "Silhouette" test.
Here are the clusters found using the "Gap" test with the tolerance parameter set to 3. The larger value leads to fewer clusters being selected.
Note that the clusters found in these two examples are identical. The only difference is how the number of clusters is chosen. Suboption for "Agglomerate" method. With Method->{"Agglomerate","Linkage"->linkagefunction}, the specified linkagefunction is used for agglomerative clustering. Possible linkagefunction values are the same as for the Linkage option to the Agglomerate function defined in this package. These are the clusters found using complete linkage hierarchical clustering.
Suboption for "Optimize" method. Here are the clusters determined from a single iteration of the "Optimize" method.
Hierarchical ClusteringHierarchical clustering function. The Agglomerate function computes a cluster hierarchy of a data set. Agglomerate accepts data in the same forms accepted by FindClusters. The output from Agglomerate is a nested Cluster object representing the hierarchical clustering. Here is a small numerical data set. This constructs a hierarchical clustering of the data.
Out[26]= |  |
Here is some Boolean data. This clusters the Boolean data.
Out[28]= |  |
Here is some string data. This clusters the string data.
Out[30]= |  |
An element of the cluster hierarchy. Functions for manipulating Cluster expressions. The ClusterFlatten and ClusterSplit functions are utilities for manipulating Cluster objects. This splits the cluster into the top 3 subclusters.
Out[31]= |  |
This flattens the cluster.
Out[32]= |  |
Options for Agglomerate. The DistanceFunction option is the same as for FindClusters. DistanceFunction defines the distance or dissimilarity between data points, while Linkage defines the dissimilarity between clusters of data points. Possible values for the Linkage option. Linkage methods determine this intercluster dissimilarity, or fusion level, given the dissimilarities between member elements. Common algorithms include single linkage, which selects the smallest distance between elements; complete linkage, which selects the largest distance between elements; and centroid linkage, which uses the dissimilarity between cluster centroids. With Linkage-> f, f is a pure function that defines the linkage algorithm. Distances or dissimilarities between clusters are determined recursively using information about the distances or dissimilarities of the unmerged clusters and their counterparts to determine the distances or dissimilarities for the newly merged cluster. If i and j represent clusters to be merged, new distances or dissimilarities are recursively calculated between this merged cluster and the remaining k clusters. The function f defines the recursion and is passed arguments , where represents the distances or dissimilarities between clusters and n represents the number of data elements in a cluster. The function returns the dissimilarity between k and the cluster formed by merging clusters i and j. This clusters data2 using the average linkage method.
Out[33]= |  |
This clusters data2 using a recursion function. It is equivalent to the average linkage.
Out[34]= |  |
Elements can be labeled using a list of rules or a single rule. Distances or dissimilarities between elements are computed using the expression on the left-hand sides of the rules. The output Cluster object contains the right-hand sides of the rules. This clusters the data showing results by label.
Out[35]= |  |
It is also possible to build a cluster hierarchy directly using a distance or dissimilarity matrix--a matrix that provides the pairwise distances or dissimilarities between all data elements in lieu of a distance or dissimilarity function. In a distance matrix the element in the row and column stores the distance value between the and data elements. Note that only the upper-triangular portion of the matrix is used, that is , since distances and dissimilarities are symmetric. Functions for clustering using distance or dissimilarity matrices. The DistanceMatrix function can be given a DistanceFunction option to define the distance measure. Here is an example distance matrix.
Out[36]= |  |
This clusters the distance matrix directly. Row numbers of the matrix are used to represent the data elements.
Out[37]= |  |
This clusters the distance matrix while specifying the corresponding data elements to store in the Cluster expression.
Out[38]= |  |
The binary tree structure of the hierarchical cluster leads naturally to a graphical representation--the dendrogram. A dendrogram plot shows a visual representation of the merger history by connecting two lines representing clusters with a bar at their fusion level. Dendrogram plotting functions. This generates a basic dendrogram plot.
This generates a dendrogram using the Manhattan distance.
Options for DendrogramPlot. This generates a dendrogram using Ward's linkage method to cluster the data.
The option LeafLabels is provided to support labeling of the dendrogram. For data, a value of LeafLabels->Automatic will use the data element position for the label, but the option can also take a list of expressions or a function. The function will be applied to each data element to generate the label expression. For Cluster objects, the LeafLabels option can take a function but not a list or Automatic, as there is no unambiguous mapping between the labels and data points in a Cluster object. This generates a dendrogram with automatically generated labels.
This generates a dendrogram using a list of labels.
This generates a dendrogram using a label function.
For large data sets, only a summary of the full dendrogram may be desired. Dendrograms may be truncated using the TruncateDendrogram option by providing an integer or a list of two integers. If given as a list of two integers, the first value specifies the fusion level above which mergers should not be shown, while the second indicates the fusion level under which the dendrogram should be truncated. For a single integer n, TruncateDendrogram->n is equivalent to TruncateDendrogram->{1,n}. A value of may be given as a second value to show the full clustering without truncation from below. When a cluster is truncated and labels are specified, a box is substituted for a label indicating the size of the cluster that has been truncated. Here is a larger data set to cluster. It is based on a series of measurements of different species of irises [4]. This turns off tie warnings. This generates a dendrogram truncated to show only the highest 20 mergers.
This highlights the two top clusters.
This generates a dendrogram truncated between levels 2 and 20.
References[1] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons, 1990. [2] P. J. Rousseeuw, "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis," J Comput. Appl. Math., 20, 1987, 53-65. [3] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the Number of Clusters in a Dataset Via the Gap Statistic." Stanford Univ. Tech. report. March 2000. (published Journal of the Royal Statistical Society, B, 63, 2001, 411-423.) [4] R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 1936, 179-188.
|