
Davies–Bouldin index

The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms.[1] It is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent to the dataset. This has the drawback that a good value reported by this method does not imply the best information retrieval.[citation needed]

Preliminaries

Given n-dimensional points, let Ci be a cluster of data points. Let Xj be an n-dimensional feature vector assigned to cluster Ci.

S_i = \left( \frac{1}{T_i} \sum_{j=1}^{T_i} \left\| X_j - A_i \right\|_p^q \right)^{1/q}

Here Ai is the centroid of Ci and Ti is the size of cluster i. Si is the qth root of the qth moment of the points in cluster i about the mean. If q = 1, then Si is the average distance between the feature vectors in cluster i and the centroid of the cluster. Usually the value of p is 2, which makes the distance a Euclidean distance function. Many other distance metrics can be used in the case of manifolds and higher-dimensional data, where the Euclidean distance may not be the best measure for determining the clusters. It is important to note that this distance metric has to match the metric used in the clustering scheme itself for meaningful results.
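The scatter term translates directly into code. The following is a minimal NumPy sketch (the function name and default parameters are illustrative, not part of the original formulation):

```python
import numpy as np

def within_cluster_scatter(points, centroid, p=2, q=2):
    """S_i: q-th root of the mean q-th power of the p-norm distances
    between a cluster's points and its centroid (q = 1 gives the mean distance)."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points - np.asarray(centroid), ord=p, axis=1)  # ||X_j - A_i||_p
    return np.mean(dists ** q) ** (1.0 / q)
```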

M_{i,j} = \left\| A_i - A_j \right\|_p = \left( \sum_{k=1}^{n} \left| a_{k,i} - a_{k,j} \right|^p \right)^{1/p}

Mi,j is a measure of separation between cluster Ci and cluster Cj.
ak,i is the kth element of Ai, and there are n such elements in Ai, for it is an n-dimensional centroid.[inconsistent]

Here k indexes the features of the data, and this is essentially the Euclidean distance between the centers of clusters i and j when p equals 2.
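The separation term is just a p-norm between centroids; a minimal sketch in the same style (illustrative names, p = 2 by default) might be:

```python
import numpy as np

def between_cluster_separation(centroid_i, centroid_j, p=2):
    """M_{i,j}: p-norm distance between the centroids A_i and A_j
    (the ordinary Euclidean distance when p = 2)."""
    a_i = np.asarray(centroid_i, dtype=float)
    a_j = np.asarray(centroid_j, dtype=float)
    return np.linalg.norm(a_i - a_j, ord=p)
```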

Definition

Let Ri,j be a measure of how good the clustering scheme is. This measure, by definition, has to account for Mi,j, the separation between the ith and the jth cluster, which ideally has to be as large as possible, and Si, the within-cluster scatter for cluster i, which has to be as low as possible. Hence the Davies–Bouldin index is defined as the ratio of Si and Mi,j such that these properties are conserved:

  1. R_{i,j} \geq 0
  2. R_{i,j} = R_{j,i}
  3. When S_j \geq S_k and M_{i,j} = M_{i,k}, then R_{i,j} > R_{i,k}.
  4. When S_j = S_k and M_{i,j} \leq M_{i,k}, then R_{i,j} > R_{i,k}.

With this formulation, the lower the value, the better the separation of the clusters and the 'tightness' inside the clusters.

A solution that satisfies these properties is:

R_{i,j} = \frac{S_i + S_j}{M_{i,j}}

This is used to define Di:

D_i \equiv \max_{j \neq i} R_{i,j}

If N is the number of clusters:

\mathit{DB} \equiv \frac{1}{N} \sum_{i=1}^{N} D_i

DB is called the Davies–Bouldin index. It depends both on the data and on the algorithm. Di chooses the worst-case scenario: its value is equal to Ri,j for the cluster most similar to cluster i. There could be many variations of this formulation, such as choosing the average cluster similarity, a weighted average, and so on.
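Putting the scatter and separation terms together, a compact reference implementation might look like the following sketch (the function and variable names are illustrative, not from the original paper; p = q = 2 gives the usual Euclidean setting):

```python
import numpy as np

def davies_bouldin(X, labels, p=2, q=2):
    """Davies–Bouldin index: mean over clusters of the worst-case ratio
    R_{i,j} = (S_i + S_j) / M_{i,j}. Illustrative sketch, not a library API."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    N = len(cluster_ids)
    # Centroids A_i and within-cluster scatters S_i
    centroids = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])
    S = np.array([
        np.mean(np.linalg.norm(X[labels == c] - centroids[k], ord=p, axis=1) ** q) ** (1.0 / q)
        for k, c in enumerate(cluster_ids)
    ])
    D = np.empty(N)
    for i in range(N):
        # R_{i,j} for all j != i; D_i is the worst (largest) of them
        R = [(S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j], ord=p)
             for j in range(N) if j != i]
        D[i] = max(R)
    return D.mean()  # DB: average of the D_i
```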

Explanation

These conditions constrain the index to be symmetric and non-negative. Because it is defined as a function of the ratio of the within-cluster scatter to the between-cluster separation, a lower value means that the clustering is better. The index is the similarity between each cluster and its most similar one, averaged over all clusters, where the similarity measure is Ri,j defined above. This affirms the idea that no cluster should be similar to another, and hence the best clustering scheme essentially minimizes the Davies–Bouldin index. Since the index is an average over the N clusters, a good way to decide how many clusters actually exist in the data is to plot it against the number of clusters it is calculated over: the number of clusters for which the value is lowest is a good estimate of the number of clusters into which the data could ideally be classified. This has applications in deciding the value of k in the k-means algorithm, where k is not known a priori.
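This scan over candidate cluster counts is easy to automate. For example, scikit-learn provides davies_bouldin_score, which can be combined with k-means on a dataset; the sketch below uses synthetic data and illustrative parameter values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic data with four well-separated clusters
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)  # the lowest index suggests the best k
print(scores)
print("Suggested number of clusters:", best_k)
```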

Implementations

The SOM toolbox contains a MATLAB implementation.[2] A MATLAB implementation is also available via the MATLAB Statistics and Machine Learning Toolbox, using the "evalclusters" command.[3]

A Java implementation is found in ELKI, and can be compared to many other clustering quality indexes.

See also

Silhouette (clustering)
Dunn index

Notes and references

  1. ^ Davies, David L.; Bouldin, Donald W. (1979). "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224–227. doi:10.1109/TPAMI.1979.4766909. S2CID 13254783.
  2. ^ "Matlab implementation". Retrieved 12 November 2011.
  3. ^ "Evaluate clustering solutions - MATLAB evalclusters".
