Wikipedia

Sørensen–Dice coefficient

The Sørensen–Dice coefficient (see below for other names) is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen[1] and Lee Raymond Dice,[2] who published in 1948 and 1945 respectively.

Name

The index is known by several other names, especially Sørensen–Dice index,[3] Sørensen index and Dice's coefficient. Other variations include the "similarity coefficient" or "index", such as Dice similarity coefficient (DSC). Common alternate spellings for Sørensen are Sorenson, Soerenson and Sörenson, and all three can also be seen with the –sen ending.

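Other names include:

F1 score
Czekanowski's binary (non-quantitative) index[4]
Measure of genetic similarity[5]
Zijdenbos similarity index,[6][7] referring to a 1994 paper of Zijdenbos et al.[8][3]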

Formula

Sørensen's original formula was intended to be applied to discrete data. Given two sets, X and Y, it is defined as

$$ DSC = \frac{2|X \cap Y|}{|X| + |Y|} $$

where |X| and |Y| are the cardinalities of the two sets (i.e. the number of elements in each set). The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.
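
As a minimal sketch of the set definition (twice the size of the intersection over the sum of the two cardinalities), the function below computes the coefficient for Python sets. The function name and the handling of two empty sets are illustrative assumptions, not part of the original formula.

```python
def dice_coefficient(x: set, y: set) -> float:
    """Sørensen–Dice coefficient of two finite sets: 2|X ∩ Y| / (|X| + |Y|)."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are treated as identical,
        # because the formula itself is 0/0 in this case.
        return 1.0
    return 2 * len(x & y) / total


# Two shared elements, three elements in each set: 2*2 / (3+3)
print(dice_coefficient({"a", "b", "c"}, {"b", "c", "d"}))
```

Disjoint sets give 0 and identical non-empty sets give 1, matching the stated range of the coefficient.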

When applied to Boolean data, using the definition of true positive (TP), false positive (FP), and false negative (FN), it can be written as

$$ DSC = \frac{2TP}{2TP + FP + FN} $$

It differs from the Jaccard index, which counts true positives only once in both the numerator and the denominator. DSC is the quotient of similarity and ranges between 0 and 1.[9] It can be viewed as a similarity measure over sets.
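
The Boolean form can be sketched as follows, assuming the usual expressions 2TP / (2TP + FP + FN) for Dice and TP / (TP + FP + FN) for Jaccard. The function names and the sample label vectors are hypothetical, chosen only to show that true positives are counted twice in one measure and once in the other.

```python
def dice_from_counts(tp: int, fp: int, fn: int) -> float:
    """Dice coefficient from confusion counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        raise ValueError("undefined when TP, FP and FN are all zero")
    return 2 * tp / denom


def jaccard_from_counts(tp: int, fp: int, fn: int) -> float:
    """Jaccard index from confusion counts: TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    if denom == 0:
        raise ValueError("undefined when TP, FP and FN are all zero")
    return tp / denom


# Hypothetical predicted and reference labels.
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]

tp = sum(p and a for p, a in zip(predicted, actual))
fp = sum(p and not a for p, a in zip(predicted, actual))
fn = sum(a and not p for p, a in zip(predicted, actual))

print(tp, fp, fn)
print(dice_from_counts(tp, fp, fn), jaccard_from_counts(tp, fp, fn))
```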

Similarly to the Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors a and b:

$$ s_v = \frac{2|\mathbf{a} \cdot \mathbf{b}|}{|\mathbf{a}|^2 + |\mathbf{b}|^2} $$

which gives the same result over binary vectors and also yields a more general similarity measure over arbitrary vectors.
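
A small sketch of the vector form, assuming it is 2|a·b| / (|a|² + |b|²) with the ordinary dot product and Euclidean norm. The function name is illustrative. For 0/1 indicator vectors the dot product counts shared elements and each squared norm counts set members, so the value agrees with the set definition.

```python
def dice_vectors(a, b) -> float:
    """Vector form of the Dice coefficient: 2|a·b| / (|a|^2 + |b|^2)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_sq = sum(ai * ai for ai in a) + sum(bi * bi for bi in b)
    if norm_sq == 0:
        raise ValueError("undefined for two zero vectors")
    return 2 * abs(dot) / norm_sq


# Indicator vectors of {a, b, c} and {b, c, d} over the universe (a, b, c, d).
print(dice_vectors([1, 1, 1, 0], [0, 1, 1, 1]))

# The same expression applied to real-valued vectors.
print(dice_vectors([1.0, 2.0], [2.0, 4.0]))
```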

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities.[10]

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:[11]

$$ s = \frac{2 n_t}{n_x + n_y} $$

where $n_t$ is the number of character bigrams found in both strings, $n_x$ is the number of bigrams in string x, and $n_y$ is the number of bigrams in string y. For example, to calculate the similarity between:

night
nacht

We would find the set of bigrams in each word:

{ni,ig,gh,ht}
{na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula, we calculate s = (2 · 1) / (4 + 4) = 0.25.
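
The worked example can be reproduced with a short sketch. It follows the example in treating the bigrams of each word as a set, so repeated bigrams within one string are counted once. The helper names and the return value for strings shorter than two characters are assumptions made for illustration.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_bigram_similarity(x: str, y: str) -> float:
    """String similarity 2*n_t / (n_x + n_y) over sets of character bigrams."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Assumption: strings with no bigrams at all are scored 0.
        return 0.0
    return 2 * len(bx & by) / total


print(sorted(bigrams("night")))                  # ['gh', 'ht', 'ig', 'ni']
print(dice_bigram_similarity("night", "nacht"))  # 0.25
```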

Continuous Dice Coefficient[12]

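For a discrete ground truth and continuous measures, the following formula can be used:

$$ cDC = \frac{2|A \cap B|}{c|A| + |B|} $$

where c can be computed as follows:

$$ c = \frac{\sum_i a_i b_i}{\sum_i a_i \operatorname{sign}(b_i)} $$

If $\sum_i a_i \operatorname{sign}(b_i) = 0$, which means no overlap between A and B, c is set to 1 arbitrarily.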
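
A minimal sketch of the continuous Dice coefficient, under assumptions that the text above does not spell out: A is a 0/1 ground-truth vector with entries a_i, B is a vector of non-negative continuous values b_i, |A ∩ B| is taken as the sum of a_i·b_i, and |A| and |B| as the sums of a_i and b_i. The function name is hypothetical.

```python
def continuous_dice(a, b) -> float:
    """cDC = 2|A ∩ B| / (c|A| + |B|) for binary a and non-negative continuous b."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    overlap = sum(ai * bi for ai, bi in zip(a, b))  # assumed |A ∩ B|
    size_a = sum(a)                                 # assumed |A|
    size_b = sum(b)                                 # assumed |B|
    # sum of a_i * sign(b_i); for b_i >= 0, sign(b_i) is 1 where b_i > 0, else 0
    support = sum(ai for ai, bi in zip(a, b) if bi > 0)
    # With no overlap, c is set to 1 arbitrarily, as stated in the text.
    c = overlap / support if support > 0 else 1.0
    denom = c * size_a + size_b
    if denom == 0:
        raise ValueError("undefined when both inputs are empty")
    return 2 * overlap / denom


# A soft prediction confined to the ground-truth region.
print(continuous_dice([1, 1, 0, 0], [0.5, 0.5, 0.0, 0.0]))
# No overlap between ground truth and prediction.
print(continuous_dice([1, 0], [0.0, 0.7]))
```

Under these assumptions, a prediction confined to the ground-truth region with uniform confidence 0.5 still scores 1, since c rescales |A| by the mean value of b over the overlap.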

Difference from Jaccard

This coefficient is not very different in form from the Jaccard index. In fact, both are equivalent in the sense that given a value for the Sørensen–Dice coefficient $S$, one can calculate the respective Jaccard index value $J$ and vice versa, using the equations $J = S/(2 - S)$ and $S = 2J/(1 + J)$.
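
The equivalence can be checked numerically with a short sketch, assuming the standard relations J = S / (2 − S) and S = 2J / (1 + J). The function names are illustrative, and exact fractions are used so the round trip can be compared without rounding error.

```python
from fractions import Fraction


def dice_to_jaccard(s):
    """Convert a Dice value S to the corresponding Jaccard value J = S / (2 - S)."""
    return s / (2 - s)


def jaccard_to_dice(j):
    """Convert a Jaccard value J to the corresponding Dice value S = 2J / (1 + J)."""
    return 2 * j / (1 + j)


x, y = {1, 2, 3}, {2, 3, 4}
dice = Fraction(2 * len(x & y), len(x) + len(y))  # 2*2 / (3+3) = 2/3
jaccard = Fraction(len(x & y), len(x | y))        # 2 / 4 = 1/2

print(dice_to_jaccard(dice) == jaccard)  # True
print(jaccard_to_dice(jaccard) == dice)  # True
```

Because each mapping is strictly increasing on [0, 1], ranking pairs of sets by either coefficient gives the same order.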

Since the Sørensen–Dice coefficient does not satisfy the triangle inequality, it can be considered a semimetric version of the Jaccard index.[4]

The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function

$$ d = 1 - \frac{2|X \cap Y|}{|X| + |Y|} $$

is not a proper distance metric as it does not satisfy the triangle inequality.[4] The simplest counterexample is given by the three sets {a}, {b}, and {a,b}: the distance between the first two is 1, and the distance between the third and each of the others is one-third. To satisfy the triangle inequality, the sum of any two of these three sides must be greater than or equal to the remaining side. However, the distance between {a} and {a,b} plus the distance between {b} and {a,b} equals 2/3 and is therefore less than the distance between {a} and {b}, which is 1.
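
The counterexample can be verified directly. The sketch below assumes the difference function is d = 1 − 2|X ∩ Y| / (|X| + |Y|) and, for contrast, also evaluates the Jaccard distance 1 − |X ∩ Y| / |X ∪ Y| on the same three sets. The function names are illustrative, and exact fractions avoid rounding issues.

```python
from fractions import Fraction


def dice_distance(x: set, y: set) -> Fraction:
    """1 minus the Sørensen–Dice coefficient of two non-empty sets."""
    return 1 - Fraction(2 * len(x & y), len(x) + len(y))


def jaccard_distance(x: set, y: set) -> Fraction:
    """1 minus the Jaccard index of two non-empty sets."""
    return 1 - Fraction(len(x & y), len(x | y))


a, b, ab = {"a"}, {"b"}, {"a", "b"}

# Dice: 1/3 + 1/3 = 2/3 is less than 1, so the triangle inequality fails.
print(dice_distance(a, ab) + dice_distance(ab, b), dice_distance(a, b))

# Jaccard: 1/2 + 1/2 = 1, so the inequality holds for this triple.
print(jaccard_distance(a, ab) + jaccard_distance(ab, b), jaccard_distance(a, b))
```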

Applications

The Sørensen–Dice coefficient is useful for ecological community data (e.g. Looman & Campbell, 1960[13]). Justification for its use is primarily empirical rather than theoretical (although it can be justified theoretically as the intersection of two fuzzy sets[14]). Compared to Euclidean distance, the Sørensen distance retains sensitivity in more heterogeneous data sets and gives less weight to outliers.[15] Recently, the Dice score (and its variations, e.g. logDice, which takes a logarithm of it) has become popular in computer lexicography for measuring the lexical association score of two given words.[16] logDice is also used as part of the Mash Distance for genome and metagenome distance estimation.[17] Finally, Dice is used in image segmentation, in particular for comparing algorithm output against reference masks in medical applications.[8]

Abundance version

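The expression is easily extended to abundance instead of presence/absence of species. This quantitative version is known by several names:

Quantitative Sørensen–Dice index[4]
Quantitative Sørensen index[4]
Quantitative Dice index[4]
Bray–Curtis similarity (1 minus the Bray–Curtis dissimilarity)[4]
Czekanowski's quantitative index[4]
Steinhaus index[4]
Pielou's percentage similarity[4]
1 minus the Hellinger distance[18]
Proportion of specific agreement[19] or positive agreement[20]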
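
As an illustration of the abundance version, the sketch below uses the form commonly given for the Bray–Curtis similarity, 2·Σ min(x_i, y_i) / (Σ x_i + Σ y_i). This formula is not stated in the text above and is an assumption here, as are the function name and the sample counts.

```python
def quantitative_dice(x, y) -> float:
    """Abundance-based similarity: 2 * sum(min(x_i, y_i)) / (sum(x) + sum(y))."""
    if len(x) != len(y):
        raise ValueError("abundance vectors must have the same length")
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("undefined when both samples are empty")
    shared = sum(min(xi, yi) for xi, yi in zip(x, y))
    return 2 * shared / total


# Hypothetical species counts at two sites: shared = 6 + 0 + 4, totals 17 and 16.
print(quantitative_dice([6, 7, 4], [10, 0, 6]))

# With 0/1 presence data the expression reduces to the set-based coefficient.
print(quantitative_dice([1, 1, 1, 0], [0, 1, 1, 1]))
```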

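See also

Correlation
F1 score
Jaccard index
Hamming distance
Mantel test
Morisita's overlap index
Most frequent k characters
Overlap coefficient
Renkonen similarity index (due to Olavi Renkonen)
Tversky index
Universal adaptive strategy theory (UAST)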

References

  1. ^ Sørensen, T. (1948). "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons". Kongelige Danske Videnskabernes Selskab. 5 (4): 1–34.
  2. ^ Dice, Lee R. (1945). "Measures of the Amount of Ecologic Association Between Species". Ecology. 26 (3): 297–302. doi:10.2307/1932409. JSTOR 1932409.
  3. ^ a b Carass, A.; Roy, S.; Gherman, A.; Reinhold, J.C.; Jesson, A.; et al. (2020). "Evaluating White Matter Lesion Segmentations with Refined Sørensen-Dice Analysis". Scientific Reports. 10 (1): 8242. Bibcode:2020NatSR..10.8242C. doi:10.1038/s41598-020-64803-w. ISSN 2045-2322. PMC 7237671. PMID 32427874.
  4. ^ a b c d e f g h i j Gallagher, E.D., 1999. COMPAH Documentation, University of Massachusetts, Boston
  5. ^ Nei, M.; Li, W.H. (1979). "Mathematical model for studying genetic variation in terms of restriction endonucleases". PNAS. 76 (10): 5269–5273. Bibcode:1979PNAS...76.5269N. doi:10.1073/pnas.76.10.5269. PMC 413122. PMID 291943.
  6. ^ Prescott, J.W.; Pennell, M.; Best, T.M.; Swanson, M.S.; Haq, F.; Jackson, R.; Gurcan, M.N. (2009). "An automated method to segment the femur for osteoarthritis research". 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE. pp. 6364–6367. doi:10.1109/iembs.2009.5333257. PMC 2826829.
  7. ^ Swanson, M.S.; Prescott, J.W.; Best, T.M.; Powell, K.; Jackson, R.D.; Haq, F.; Gurcan, M.N. (2010). "Semi-automated segmentation to assess the lateral meniscus in normal and osteoarthritic knees". Osteoarthritis and Cartilage. 18 (3): 344–353. doi:10.1016/j.joca.2009.10.004. ISSN 1063-4584. PMC 2826568. PMID 19857510.
  8. ^ a b Zijdenbos, A.P.; Dawant, B.M.; Margolin, R.A.; Palmer, A.C. (1994). "Morphometric analysis of white matter lesions in MR images: method and validation". IEEE Transactions on Medical Imaging. 13 (4): 716–724. doi:10.1109/42.363096. ISSN 0278-0062. PMID 18218550.
  9. ^ http://www.sekj.org/PDF/anbf40/anbf40-415.pdf
  10. ^ van Rijsbergen, Cornelis Joost (1979). Information Retrieval. London: Butterworths. ISBN 3-642-12274-4.
  11. ^ Kondrak, Grzegorz; Marcu, Daniel; Knight, Kevin (2003). "Cognates Can Improve Statistical Translation Models" (PDF). Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. pp. 46–48.
  12. ^ Shamir, Reuben R.; Duchin, Yuval; Kim, Jinyoung; Sapiro, Guillermo; Harel, Noam (2018-04-25). "Continuous Dice Coefficient: a Method for Evaluating Probabilistic Segmentations": 306977. doi:10.1101/306977. S2CID 90993940.
  13. ^ Looman, J.; Campbell, J.B. (1960). "Adaptation of Sorensen's K (1948) for estimating unit affinities in prairie vegetation". Ecology. 41 (3): 409–416. doi:10.2307/1933315. JSTOR 1933315.
  14. ^ Roberts, D.W. (1986). "Ordination on the basis of fuzzy set theory". Vegetatio. 66 (3): 123–131. doi:10.1007/BF00039905. S2CID 12573576.
  15. ^ McCune, Bruce & Grace, James (2002) Analysis of Ecological Communities. Mjm Software Design; ISBN 0-9721290-0-6.
  16. ^ Rychlý, P. (2008) A lexicographer-friendly association score. Proceedings of the Second Workshop on Recent Advances in Slavonic Natural Language Processing RASLAN 2008: 6–9
  17. ^ Ondov, Brian D., et al. "Mash: fast genome and metagenome distance estimation using MinHash." Genome biology 17.1 (2016): 1-14.
  18. ^ Bray, J. Roger; Curtis, J. T. (1957). "An Ordination of the Upland Forest Communities of Southern Wisconsin". Ecological Monographs. 27 (4): 326–349. doi:10.2307/1942268. JSTOR 1942268.
  19. ^ Ayappa, Indu; Norman, Robert G (2000). "Non-Invasive Detection of Respiratory Effort-Related Arousals (RERAs) by a Nasal Cannula/Pressure Transducer System". Sleep. 23 (6): 763–771. doi:10.1093/sleep/23.6.763. PMID 11007443.
  20. ^ John Uebersax. "Raw Agreement Indices".
