fbpx
Wikipedia

Rand index

The Rand index[1] or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. From a mathematical standpoint, Rand index is related to the accuracy, but is applicable even when class labels are not used.

Example clusterings for a dataset with the kMeans (left) and Mean shift (right) algorithms. The calculated Adjusted Rand index for these two clusterings is

Rand index

Definition

Given a set of   elements   and two partitions of   to compare,  , a partition of S into r subsets, and  , a partition of S into s subsets, define the following:

  •  , the number of pairs of elements in   that are in the same subset in   and in the same subset in  
  •  , the number of pairs of elements in   that are in different subsets in   and in different subsets in  
  •  , the number of pairs of elements in   that are in the same subset in   and in different subsets in  
  •  , the number of pairs of elements in   that are in different subsets in   and in the same subset in  

The Rand index,  , is:[1][2]

 

Intuitively,   can be considered as the number of agreements between   and   and   as the number of disagreements between   and  .

Since the denominator is the total number of pairs, the Rand index represents the frequency of occurrence of agreements over the total pairs, or the probability that   and   will agree on a randomly chosen pair.

  is calculated as  .

Similarly, one can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm. It can be computed using the following formula:

 
where   is the number of true positives,   is the number of true negatives,   is the number of false positives, and   is the number of false negatives.

Properties

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.

In mathematical terms, a, b, c, d are defined as follows:

  •  , where  
  •  , where  
  •  , where  
  •  , where  

for some  

Relationship with classification accuracy

The Rand index can also be viewed through the prism of binary classification accuracy over the pairs of elements in  . The two class labels are "  and   are in the same subset in   and  " and "  and   are in different subsets in   and  ".

In that setting,   is the number of pairs correctly labeled as belonging to the same subset (true positives), and   is the number of pairs correctly labeled as belonging to different subsets (true negatives).

Adjusted Rand index

The adjusted Rand index is the corrected-for-chance version of the Rand index.[1][2][3] Such a correction for chance establishes a baseline by using the expected similarity of all pair-wise comparisons between clusterings specified by a random model. Traditionally, the Rand Index was corrected using the Permutation Model for clusterings (the number and size of clusters within a clustering are fixed, and all random clusterings are generated by shuffling the elements between the fixed clusters). However, the premises of the permutation model are frequently violated; in many clustering scenarios, either the number of clusters or the size distribution of those clusters vary drastically. For example, consider that in K-means the number of clusters is fixed by the practitioner, but the sizes of those clusters are inferred from the data. Variations of the adjusted Rand Index account for different models of random clusterings.[4]

Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.[5]

The contingency table

Given a set S of n elements, and two groupings or partitions (e.g. clusterings) of these elements, namely   and  , the overlap between X and Y can be summarized in a contingency table   where each entry   denotes the number of objects in common between   and   :  .

 

Definition

The original Adjusted Rand Index using the Permutation Model is

 

where   are values from the contingency table.

See also

References

  1. ^ a b c W. M. Rand (1971). "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association. American Statistical Association. 66 (336): 846–850. doi:10.2307/2284239. JSTOR 2284239.
  2. ^ a b Lawrence Hubert and Phipps Arabie (1985). "Comparing partitions". Journal of Classification. 2 (1): 193–218. doi:10.1007/BF01908075.
  3. ^ Nguyen Xuan Vinh, Julien Epps and James Bailey (2009). "Information Theoretic Measures for Clustering Comparison: Is a Correction for Chance Necessary?" (PDF). ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning. ACM. pp. 1073–1080.PDF.
  4. ^ Alexander J Gates and Yong-Yeol Ahn (2017). "The Impact of Random Models on Clustering Similarity" (PDF). Journal of Machine Learning Research. 18: 1–28.
  5. ^ "Comparing Clusterings - An Overview" (PDF).

External links

  • C++ implementation with MATLAB mex files

rand, index, rand, measure, named, after, william, rand, statistics, particular, data, clustering, measure, similarity, between, data, clusterings, form, defined, that, adjusted, chance, grouping, elements, this, adjusted, from, mathematical, standpoint, relat. The Rand index 1 or Rand measure named after William M Rand in statistics and in particular in data clustering is a measure of the similarity between two data clusterings A form of the Rand index may be defined that is adjusted for the chance grouping of elements this is the adjusted Rand index From a mathematical standpoint Rand index is related to the accuracy but is applicable even when class labels are not used Example clusterings for a dataset with the kMeans left and Mean shift right algorithms The calculated Adjusted Rand index for these two clusterings is A R I 0 94 displaystyle ARI approx 0 94 Contents 1 Rand index 1 1 Definition 1 2 Properties 1 3 Relationship with classification accuracy 2 Adjusted Rand index 2 1 The contingency table 2 2 Definition 3 See also 4 References 5 External linksRand index EditDefinition Edit Given a set of n displaystyle n elements S o 1 o n displaystyle S o 1 ldots o n and two partitions of S displaystyle S to compare X X 1 X r displaystyle X X 1 ldots X r a partition of S into r subsets and Y Y 1 Y s displaystyle Y Y 1 ldots Y s a partition of S into s subsets define the following a displaystyle a the number of pairs of elements in S displaystyle S that are in the same subset in X displaystyle X and in the same subset in Y displaystyle Y b displaystyle b the number of pairs of elements in S displaystyle S that are in different subsets in X displaystyle X and in different subsets in Y displaystyle Y c displaystyle c the number of pairs of elements in S displaystyle S that are in the same subset in X displaystyle X and in different subsets in Y displaystyle Y d displaystyle d the number of pairs of elements in S displaystyle S that are in different subsets in X displaystyle X and in the same subset in Y displaystyle Y The Rand index R displaystyle R is 1 2 R a b a b c d a b n 2 displaystyle R frac a b a b c d frac a b n choose 2 Intuitively a b displaystyle a b can be considered as the number of agreements between X displaystyle X and Y displaystyle Y and c d displaystyle c d as the number of disagreements between X displaystyle X and Y displaystyle Y Since the denominator is the total number of pairs the Rand index represents the frequency of occurrence of agreements over the total pairs or the probability that X displaystyle X and Y displaystyle Y will agree on a randomly chosen pair n 2 displaystyle n choose 2 is calculated as n n 1 2 displaystyle n n 1 2 Similarly one can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm It can be computed using the following formula R I T P T N T P F P F N T N displaystyle RI frac TP TN TP FP FN TN dd where T P displaystyle TP is the number of true positives T N displaystyle TN is the number of true negatives F P displaystyle FP is the number of false positives and F N displaystyle FN is the number of false negatives Properties Edit The Rand index has a value between 0 and 1 with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same In mathematical terms a b c d are defined as follows a S displaystyle a S where S o i o j o i o j X k o i o j Y l displaystyle S o i o j mid o i o j in X k o i o j in Y l b S displaystyle b S where S o i o j o i X k 1 o j X k 2 o i Y l 1 o j Y l 2 displaystyle S o i o j mid o i in X k 1 o j in X k 2 o i in Y l 1 o j in Y l 2 c S displaystyle c S where S o i o j o i o j X k o i Y l 1 o j Y l 2 displaystyle S o i o j mid o i o j in X k o i in Y l 1 o j in Y l 2 d S displaystyle d S where S o i o j o i X k 1 o j X k 2 o i o j Y l displaystyle S o i o j mid o i in X k 1 o j in X k 2 o i o j in Y l for some 1 i j n i j 1 k k 1 k 2 r k 1 k 2 1 l l 1 l 2 s l 1 l 2 displaystyle 1 leq i j leq n i neq j 1 leq k k 1 k 2 leq r k 1 neq k 2 1 leq l l 1 l 2 leq s l 1 neq l 2 Relationship with classification accuracy Edit The Rand index can also be viewed through the prism of binary classification accuracy over the pairs of elements in S displaystyle S The two class labels are o i displaystyle o i and o j displaystyle o j are in the same subset in X displaystyle X and Y displaystyle Y and o i displaystyle o i and o j displaystyle o j are in different subsets in X displaystyle X and Y displaystyle Y In that setting a displaystyle a is the number of pairs correctly labeled as belonging to the same subset true positives and b displaystyle b is the number of pairs correctly labeled as belonging to different subsets true negatives Adjusted Rand index EditThe adjusted Rand index is the corrected for chance version of the Rand index 1 2 3 Such a correction for chance establishes a baseline by using the expected similarity of all pair wise comparisons between clusterings specified by a random model Traditionally the Rand Index was corrected using the Permutation Model for clusterings the number and size of clusters within a clustering are fixed and all random clusterings are generated by shuffling the elements between the fixed clusters However the premises of the permutation model are frequently violated in many clustering scenarios either the number of clusters or the size distribution of those clusters vary drastically For example consider that in K means the number of clusters is fixed by the practitioner but the sizes of those clusters are inferred from the data Variations of the adjusted Rand Index account for different models of random clusterings 4 Though the Rand Index may only yield a value between 0 and 1 the adjusted Rand index can yield negative values if the index is less than the expected index 5 The contingency table Edit Given a set S of n elements and two groupings or partitions e g clusterings of these elements namely X X 1 X 2 X r displaystyle X X 1 X 2 ldots X r and Y Y 1 Y 2 Y s displaystyle Y Y 1 Y 2 ldots Y s the overlap between X and Y can be summarized in a contingency table n i j displaystyle left n ij right where each entry n i j displaystyle n ij denotes the number of objects in common between X i displaystyle X i and Y j displaystyle Y j n i j X i Y j displaystyle n ij X i cap Y j X Y Y 1 Y 2 Y s sums X 1 n 11 n 12 n 1 s a 1 X 2 n 21 n 22 n 2 s a 2 X r n r 1 n r 2 n r s a r sums b 1 b 2 b s displaystyle begin array c cccc c atop X diagdown Y amp Y 1 amp Y 2 amp cdots amp Y s amp text sums hline X 1 amp n 11 amp n 12 amp cdots amp n 1s amp a 1 X 2 amp n 21 amp n 22 amp cdots amp n 2s amp a 2 vdots amp vdots amp vdots amp ddots amp vdots amp vdots X r amp n r1 amp n r2 amp cdots amp n rs amp a r hline text sums amp b 1 amp b 2 amp cdots amp b s amp end array Definition Edit The original Adjusted Rand Index using the Permutation Model is A R I i j n i j 2 i a i 2 j b j 2 n 2 1 2 i a i 2 j b j 2 i a i 2 j b j 2 n 2 displaystyle ARI frac left sum ij binom n ij 2 left sum i binom a i 2 sum j binom b j 2 right right binom n 2 left frac 1 2 left sum i binom a i 2 sum j binom b j 2 right left sum i binom a i 2 sum j binom b j 2 right right binom n 2 where n i j a i b j displaystyle n ij a i b j are values from the contingency table See also EditSimple matching coefficientReferences Edit a b c W M Rand 1971 Objective criteria for the evaluation of clustering methods Journal of the American Statistical Association American Statistical Association 66 336 846 850 doi 10 2307 2284239 JSTOR 2284239 a b Lawrence Hubert and Phipps Arabie 1985 Comparing partitions Journal of Classification 2 1 193 218 doi 10 1007 BF01908075 Nguyen Xuan Vinh Julien Epps and James Bailey 2009 Information Theoretic Measures for Clustering Comparison Is a Correction for Chance Necessary PDF ICML 09 Proceedings of the 26th Annual International Conference on Machine Learning ACM pp 1073 1080 PDF Alexander J Gates and Yong Yeol Ahn 2017 The Impact of Random Models on Clustering Similarity PDF Journal of Machine Learning Research 18 1 28 Comparing Clusterings An Overview PDF External links EditC implementation with MATLAB mex files Retrieved from https en wikipedia org w index php title Rand index amp oldid 1110778026 Adjusted Rand index, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.