fbpx
Wikipedia

String metric

In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close.[1] A string metric provides a number indicating an algorithm-specific indication of distance.

The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance).[2] It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.

String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, incremental search, data integration, malware detection,[3] and semantic knowledge integration.

List of string metrics edit

There also exist functions which measure a dissimilarity between strings, but do not necessarily fulfill the triangle inequality, and as such are not metrics in the mathematical sense. An example of such function is the Jaro–Winkler distance.

Selected string measures examples edit

Name Description Example
Hamming distance Only for strings of the same length. Number of changed characters. "karolin" and "kathrin" is 3.
Levenshtein distance and Damerau–Levenshtein distance Generalisation of Hamming distance that allows for different length strings, and (with Damerau) for transpositions kitten and sitting have a distance of 3.
  1. kittensitten (substitution of "s" for "k")
  2. sittensittin (substitution of "i" for "e")
  3. sittinsitting (insertion of "g" at the end).
Jaro–Winkler distance JaroWinklerDist("MARTHA","MARHTA") =
 
  •   is the number of matching characters;
  •   is half the number of transpositions("MARTHA"[3]!=H, "MARHTA"[3]!=T).
Most frequent k characters MostFreqKeySimilarity('research', 'seeking', 2) = 2

References edit

  1. ^ Lu, Jiaheng; et al. (2013). "String similarity measures and joins with synonyms". Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. pp. 373–384. doi:10.1145/2463676.2465313. ISBN 9781450320375. S2CID 2091942.{{cite book}}: CS1 maint: date and year (link)
  2. ^ Navarro, Gonzalo (2001). "A guided tour to approximate string matching". ACM Computing Surveys. 33 (1): 31–88. doi:10.1145/375360.375365. hdl:10533/172862. S2CID 207551224.
  3. ^ Shlomi Dolev; Mohammad, Ghanayim; Alexander, Binun; Sergey, Frenkel; Yeali, S. Sun (2017). "Relationship of Jaccard and edit distance in malware clustering and online identification". 16th IEEE International Symposium on Network Computing and Applications: 369–373.
  4. ^ a b c d e Sam's String Metrics - Computational Linguistics and Phonetics
  5. ^ Russell, David J., et al. "A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences." BMC bioinformatics 11.1 (2010): 1-14.
  6. ^ Cohen, William; Ravikumar, Pradeep; Fienberg, Stephen (2003-08-01). "A Comparison of String Distance Metrics for Name-Matching Tasks": 73–78. {{cite journal}}: Cite journal requires |journal= (help)


External links edit

  • A fairly complete overview at the Wayback Machine
  • Carnegie Mellon University open source library
  • StringMetric project a Scala library of string metrics and phonetic algorithms
  • Natural project a JavaScript natural language processing library which includes implementations of popular string metrics

string, metric, string, distance, redirects, here, distance, between, strings, fingerboard, musical, instruments, action, music, mathematics, computer, science, string, metric, also, known, string, similarity, metric, string, distance, function, metric, that, . String distance redirects here For the distance between strings and the fingerboard in musical instruments see Action music In mathematics and computer science a string metric also known as a string similarity metric or string distance function is a metric that measures distance inverse similarity between two text strings for approximate string matching or comparison and in fuzzy string searching A requirement for a string metric e g in contrast to string matching is fulfillment of the triangle inequality For example the strings Sam and Samuel can be considered to be close 1 A string metric provides a number indicating an algorithm specific indication of distance The most widely known string metric is a rudimentary one called the Levenshtein distance also known as edit distance 2 It operates between two input strings returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another Simplistic string metrics such as Levenshtein distance have expanded to include phonetic token grammatical and character based methods of statistical comparisons String metrics are used heavily in information integration and are currently used in areas including fraud detection fingerprint analysis plagiarism detection ontology merging DNA analysis RNA analysis image analysis evidence based machine learning database data deduplication data mining incremental search data integration malware detection 3 and semantic knowledge integration Contents 1 List of string metrics 2 Selected string measures examples 3 References 4 External linksList of string metrics editLevenshtein distance or its generalization edit distance Damerau Levenshtein distance Sorensen Dice coefficient Block distance or L1 distance or City block distance Hamming distance Simple matching coefficient SMC Jaccard similarity or Jaccard coefficient or Tanimoto coefficient Tversky index Overlap coefficient Variational distance 4 Hellinger distance or Bhattacharyya distance Information radius Jensen Shannon divergence Skew divergence 4 Confusion probability 4 Tau metric an approximation of the Kullback Leibler divergence Fellegi and Sunters metric SFS 4 Maximal matches 4 Grammar based distance 5 TFIDF distance metric 6 There also exist functions which measure a dissimilarity between strings but do not necessarily fulfill the triangle inequality and as such are not metrics in the mathematical sense An example of such function is the Jaro Winkler distance Selected string measures examples editName Description ExampleHamming distance Only for strings of the same length Number of changed characters karol in and kathr in is 3 Levenshtein distance and Damerau Levenshtein distance Generalisation of Hamming distance that allows for different length strings and with Damerau for transpositions kitten and sitting have a distance of 3 kitten sitten substitution of s for k sitten sittin substitution of i for e sittin sitting insertion of g at the end Jaro Winkler distance JaroWinklerDist MARTHA MARHTA d j 1 3 m s 1 m s 2 m t m 1 3 6 6 6 6 6 2 2 6 0 944 displaystyle d j frac 1 3 left frac m s 1 frac m s 2 frac m t m right frac 1 3 left frac 6 6 frac 6 6 frac 6 frac 2 2 6 right 0 944 nbsp m displaystyle m nbsp is the number of matching characters t displaystyle t nbsp is half the number of transpositions MARTHA 3 H MARHTA 3 T Most frequent k characters MostFreqKeySimilarity r e se ar ch see king 2 2References edit Lu Jiaheng et al 2013 String similarity measures and joins with synonyms Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data pp 373 384 doi 10 1145 2463676 2465313 ISBN 9781450320375 S2CID 2091942 a href Template Cite book html title Template Cite book cite book a CS1 maint date and year link Navarro Gonzalo 2001 A guided tour to approximate string matching ACM Computing Surveys 33 1 31 88 doi 10 1145 375360 375365 hdl 10533 172862 S2CID 207551224 Shlomi Dolev Mohammad Ghanayim Alexander Binun Sergey Frenkel Yeali S Sun 2017 Relationship of Jaccard and edit distance in malware clustering and online identification 16th IEEE International Symposium on Network Computing and Applications 369 373 a b c d e Sam s String Metrics Computational Linguistics and Phonetics Russell David J et al A grammar based distance metric enables fast and accurate clustering of large sets of 16S sequences BMC bioinformatics 11 1 2010 1 14 Cohen William Ravikumar Pradeep Fienberg Stephen 2003 08 01 A Comparison of String Distance Metrics for Name Matching Tasks 73 78 a href Template Cite journal html title Template Cite journal cite journal a Cite journal requires journal help External links editString Similarity Metrics for Information Integration A fairly complete overview Archive index at the Wayback Machine Carnegie Mellon University open source library StringMetric project a Scala library of string metrics and phonetic algorithms Natural project a JavaScript natural language processing library which includes implementations of popular string metrics Retrieved from https en wikipedia org w index php title String metric amp oldid 1169461566, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.