fbpx
Wikipedia

Approximate string matching

In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

A fuzzy Mediawiki search for "angry emoticon" has as a suggested result "andré emotions"

Overview Edit

The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:[1]

  • insertion: cotcoat
  • deletion: coatcot
  • substitution: coatcost

These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted:

  • insertion: co*tcoat
  • deletion: coatco*t
  • substitution: coatcost

Some approximate matchers also treat transposition, in which the positions of two letters in the string are swapped, to be a primitive operation.[2]

  • transposition: costcots

Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern. For example, if the pattern is coil, foil differs by one substitution, coils by one insertion, oil by one deletion, and foal by two substitutions. If all operations count as a single unit of cost and the limit is set to one, foil, coils, and oil will count as matches while foal will not.

Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations. Some matchers permit separate assignments of limits and weights to individual groups in the pattern.

Problem formulation and algorithms Edit

One possible definition of the approximate string matching problem is the following: Given a pattern string   and a text string  , find a substring   in T, which, of all substrings of T, has the smallest edit distance to the pattern P.

A brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(n3 m).

A better solution, which was proposed by Sellers[3], relies on dynamic programming. It uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the i first characters of the pattern,  , and any substring   of T that ends at position j.

For each position j in the text T, and each position i in the pattern P, go through all substrings of T ending at position j, and determine which one of them has the minimal edit distance to the i first characters of the pattern P. Write this minimal distance as E(ij). After computing E(ij) for all i and j, we can easily find a solution to the original problem: it is the substring for which E(mj) is minimal (m being the length of the pattern P.)

Computing E(mj) is very similar to computing the edit distance between two strings. In fact, we can use the Levenshtein distance computing algorithm for E(mj), the only difference being that we must initialize the first row with zeros, and save the path of computation, that is, whether we used E(i − 1,j), E(i,j − 1) or E(i − 1,j − 1) in computing E(ij).

In the array containing the E(xy) values, we then choose the minimal value in the last row, let it be E(x2y2), and follow the path of computation backwards, back to the row number 0. If the field we arrived at was E(0, y1), then T[y1 + 1] ... T[y2] is a substring of T with the minimal edit distance to the pattern P.

Computing the E(xy) array takes O(mn) time with the dynamic programming algorithm, while the backwards-working phase takes O(n + m) time.

Another recent idea is the similarity join. When matching database relates to a large scale of data, the O(mn) time with the dynamic programming algorithm cannot work within a limited time. So, the idea is to reduce the number of candidate pairs, instead of computing the similarity of all pairs of strings. Widely used algorithms are based on filter-verification, hashing, Locality-sensitive hashing (LSH), Tries and other greedy and approximation algorithms. Most of them are designed to fit some framework (such as Map-Reduce) to compute concurrently.

On-line versus off-line Edit

Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be processed before searching but the text cannot. In other words, on-line techniques do searching without an index. Early algorithms for on-line approximate matching were suggested by Wagner and Fisher[4] and by Sellers[5]. Both algorithms are based on dynamic programming but solve different problems. Sellers' algorithm searches approximately for a substring in a text while the algorithm of Wagner and Fisher calculates Levenshtein distance, being appropriate for dictionary fuzzy search only.

On-line searching techniques have been repeatedly improved. Perhaps the most famous improvement is the bitap algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. The Bitap algorithm is the heart of the Unix searching utility agrep. A review of on-line searching algorithms was done by G. Navarro.[6]

Although very fast on-line techniques exist, their performance on large data is unacceptable. Text preprocessing or indexing makes searching dramatically faster. Today, a variety of indexing algorithms have been presented. Among them are suffix trees[7], metric trees[8] and n-gram methods.[9][10] A detailed survey of indexing techniques that allows one to find an arbitrary substring in a text is given by Navarro et al.[11] A computational survey of dictionary methods (i.e., methods that permit finding all dictionary words that approximately match a search pattern) is given by Boytsov[12].

Applications Edit

Common applications of approximate matching include spell checking.[13] With the availability of large amounts of DNA data, matching of nucleotide sequences has become an important application.[14] Approximate matching is also used in spam filtering.[15] Record linkage is a common application where records from two disparate databases are matched.

String matching cannot be used for most binary data, such as images and music. They require different algorithms, such as acoustic fingerprinting.

A common command-line tool fzf is often used to integrate approximate string searching into various command-line applications.[1]

See also Edit

References Edit

  1. ^ "Fzf - A Quick Fuzzy File Search from Linux Terminal". www.tecmint.com. 2018-11-08. Retrieved 2022-09-08.
  • ^ Baeza-Yates, R.; Navarro, G. (June 1996). "A faster algorithm for approximate string matching". In Dan Hirchsberg; Gene Myers (eds.). Combinatorial Pattern Matching (CPM'96), LNCS 1075. Irvine, CA. pp. 1–23. CiteSeerX 10.1.1.42.1593.
  • ^ Baeza-Yates, R.; Navarro, G. "Fast Approximate String Matching in a Dictionary" (PDF). Proc. SPIRE'98. IEEE CS Press. pp. 14–22.
  • ^ Boytsov, Leonid (2011). "Indexing methods for approximate dictionary searching: Comparative analysis". Journal of Experimental Algorithmics. 16 (1): 1–91. doi:10.1145/1963190.1963191. S2CID 15635688.
  • ^ Cormen, Thomas; Leiserson, Rivest (2001). Introduction to Algorithms (2nd ed.). MIT Press. pp. 364–7. ISBN 978-0-262-03293-3.
  • ^ Galil, Zvi; Apostolico, Alberto (1997). Pattern matching algorithms. Oxford [Oxfordshire]: Oxford University Press. ISBN 978-0-19-511367-9.
  • ^ Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge, UK: Cambridge University Press. ISBN 978-0-521-58519-4.
  • ^ Myers, G. (May 1999). "A fast bit-vector algorithm for approximate string matching based on dynamic programming" (PDF). Journal of the ACM. 46 (3): 395–415. doi:10.1145/316542.316550. S2CID 1158099.
  • ^ Navarro, Gonzalo (2001). "A guided tour to approximate string matching". ACM Computing Surveys. 33 (1): 31–88. CiteSeerX 10.1.1.96.7225. doi:10.1145/375360.375365. S2CID 207551224.
  • ^ Navarro, Gonzalo; Baeza-Yates, Ricardo; Sutinen, Erkki; Tarhio, Jorma (2001). "Indexing Methods for Approximate String Matching" (PDF). IEEE Data Engineering Bulletin. 24 (4): 19–27.
  • ^ Sellers, Peter H. (1980). "The Theory and Computation of Evolutionary Distances: Pattern Recognition". Journal of Algorithms. 1 (4): 359–73. doi:10.1016/0196-6774(80)90016-4.
  • ^ Skiena, Steve (1998). Algorithm Design Manual (1st ed.). Springer. ISBN 978-0-387-94860-7.
  • ^ Ukkonen, E. (1985). "Algorithms for approximate string matching". Information and Control. 64 (1–3): 100–18. doi:10.1016/S0019-9958(85)80046-2.
  • ^ Wagner, R.; Fischer, M. (1974). "The string-to-string correction problem". Journal of the ACM. 21: 168–73. doi:10.1145/321796.321811. S2CID 13381535.
  • ^ Zobel, Justin; Dart, Philip (1995). "Finding approximate matches in large lexicons". Software: Practice and Experience. 25 (3): 331–345. CiteSeerX 10.1.1.14.3856. doi:10.1002/spe.4380250307. S2CID 6776819.

External links Edit

  • Flamingo Project
  • with recent advances in approximate string matching based on an edit distance threshold.
  • StringMetric project a Scala library of string metrics and phonetic algorithms
  • Natural project a JavaScript natural language processing library which includes implementations of popular string metrics

approximate, string, matching, computer, science, approximate, string, matching, often, colloquially, referred, fuzzy, string, searching, technique, finding, strings, that, match, pattern, approximately, rather, than, exactly, problem, approximate, string, mat. In computer science approximate string matching often colloquially referred to as fuzzy string searching is the technique of finding strings that match a pattern approximately rather than exactly The problem of approximate string matching is typically divided into two sub problems finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately A fuzzy Mediawiki search for angry emoticon has as a suggested result andre emotions Contents 1 Overview 2 Problem formulation and algorithms 3 On line versus off line 4 Applications 5 See also 6 References 7 External linksOverview EditThe closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match This number is called the edit distance between the string and the pattern The usual primitive operations are 1 insertion cot coat deletion coat cot substitution coat costThese three operations may be generalized as forms of substitution by adding a NULL character here symbolized by wherever a character has been deleted or inserted insertion co t coat deletion coat co t substitution coat costSome approximate matchers also treat transposition in which the positions of two letters in the string are swapped to be a primitive operation 2 transposition cost cotsDifferent approximate matchers impose different constraints Some matchers use a single global unweighted cost that is the total number of primitive operations necessary to convert the match to the pattern For example if the pattern is coil foil differs by one substitution coils by one insertion oil by one deletion and foal by two substitutions If all operations count as a single unit of cost and the limit is set to one foil coils and oil will count as matches while foal will not Other matchers specify the number of operations of each type separately while still others set a total cost but allow different weights to be assigned to different operations Some matchers permit separate assignments of limits and weights to individual groups in the pattern Problem formulation and algorithms EditOne possible definition of the approximate string matching problem is the following Given a pattern string P p 1 p 2 p m displaystyle P p 1 p 2 p m and a text string T t 1 t 2 t n displaystyle T t 1 t 2 dots t n find a substring T j j t j t j displaystyle T j j t j dots t j in T which of all substrings of T has the smallest edit distance to the pattern P A brute force approach would be to compute the edit distance to P for all substrings of T and then choose the substring with the minimum distance However this algorithm would have the running time O n3 m A better solution which was proposed by Sellers 3 relies on dynamic programming It uses an alternative formulation of the problem for each position j in the text T and each position i in the pattern P compute the minimum edit distance between the i first characters of the pattern P i displaystyle P i and any substring T j j displaystyle T j j of T that ends at position j For each position j in the text T and each position i in the pattern P go through all substrings of T ending at position j and determine which one of them has the minimal edit distance to the i first characters of the pattern P Write this minimal distance as E i j After computing E i j for all i and j we can easily find a solution to the original problem it is the substring for which E m j is minimal m being the length of the pattern P Computing E m j is very similar to computing the edit distance between two strings In fact we can use the Levenshtein distance computing algorithm for E m j the only difference being that we must initialize the first row with zeros and save the path of computation that is whether we used E i 1 j E i j 1 or E i 1 j 1 in computing E i j In the array containing the E x y values we then choose the minimal value in the last row let it be E x2 y2 and follow the path of computation backwards back to the row number 0 If the field we arrived at was E 0 y1 then T y1 1 T y2 is a substring of T with the minimal edit distance to the pattern P Computing the E x y array takes O mn time with the dynamic programming algorithm while the backwards working phase takes O n m time Another recent idea is the similarity join When matching database relates to a large scale of data the O mn time with the dynamic programming algorithm cannot work within a limited time So the idea is to reduce the number of candidate pairs instead of computing the similarity of all pairs of strings Widely used algorithms are based on filter verification hashing Locality sensitive hashing LSH Tries and other greedy and approximation algorithms Most of them are designed to fit some framework such as Map Reduce to compute concurrently On line versus off line EditTraditionally approximate string matching algorithms are classified into two categories on line and off line With on line algorithms the pattern can be processed before searching but the text cannot In other words on line techniques do searching without an index Early algorithms for on line approximate matching were suggested by Wagner and Fisher 4 and by Sellers 5 Both algorithms are based on dynamic programming but solve different problems Sellers algorithm searches approximately for a substring in a text while the algorithm of Wagner and Fisher calculates Levenshtein distance being appropriate for dictionary fuzzy search only On line searching techniques have been repeatedly improved Perhaps the most famous improvement is the bitap algorithm also known as the shift or and shift and algorithm which is very efficient for relatively short pattern strings The Bitap algorithm is the heart of the Unix searching utility agrep A review of on line searching algorithms was done by G Navarro 6 Although very fast on line techniques exist their performance on large data is unacceptable Text preprocessing or indexing makes searching dramatically faster Today a variety of indexing algorithms have been presented Among them are suffix trees 7 metric trees 8 and n gram methods 9 10 A detailed survey of indexing techniques that allows one to find an arbitrary substring in a text is given by Navarro et al 11 A computational survey of dictionary methods i e methods that permit finding all dictionary words that approximately match a search pattern is given by Boytsov 12 Applications EditCommon applications of approximate matching include spell checking 13 With the availability of large amounts of DNA data matching of nucleotide sequences has become an important application 14 Approximate matching is also used in spam filtering 15 Record linkage is a common application where records from two disparate databases are matched String matching cannot be used for most binary data such as images and music They require different algorithms such as acoustic fingerprinting A common command line tool fzf is often used to integrate approximate string searching into various command line applications 1 See also EditConcept search Jaro Winkler distance Levenshtein distance Locality sensitive hashing Metaphone Needleman Wunsch algorithm Plagiarism detection Regular expressions for fuzzy and non fuzzy matching Smith Waterman algorithm Soundex String metricReferences Edit Fzf A Quick Fuzzy File Search from Linux Terminal www tecmint com 2018 11 08 Retrieved 2022 09 08 Baeza Yates R Navarro G June 1996 A faster algorithm for approximate string matching In Dan Hirchsberg Gene Myers eds Combinatorial Pattern Matching CPM 96 LNCS 1075 Irvine CA pp 1 23 CiteSeerX 10 1 1 42 1593 Baeza Yates R Navarro G Fast Approximate String Matching in a Dictionary PDF Proc SPIRE 98 IEEE CS Press pp 14 22 Boytsov Leonid 2011 Indexing methods for approximate dictionary searching Comparative analysis Journal of Experimental Algorithmics 16 1 1 91 doi 10 1145 1963190 1963191 S2CID 15635688 Cormen Thomas Leiserson Rivest 2001 Introduction to Algorithms 2nd ed MIT Press pp 364 7 ISBN 978 0 262 03293 3 Galil Zvi Apostolico Alberto 1997 Pattern matching algorithms Oxford Oxfordshire Oxford University Press ISBN 978 0 19 511367 9 Gusfield Dan 1997 Algorithms on strings trees and sequences computer science and computational biology Cambridge UK Cambridge University Press ISBN 978 0 521 58519 4 Myers G May 1999 A fast bit vector algorithm for approximate string matching based on dynamic programming PDF Journal of the ACM 46 3 395 415 doi 10 1145 316542 316550 S2CID 1158099 Navarro Gonzalo 2001 A guided tour to approximate string matching ACM Computing Surveys 33 1 31 88 CiteSeerX 10 1 1 96 7225 doi 10 1145 375360 375365 S2CID 207551224 Navarro Gonzalo Baeza Yates Ricardo Sutinen Erkki Tarhio Jorma 2001 Indexing Methods for Approximate String Matching PDF IEEE Data Engineering Bulletin 24 4 19 27 Sellers Peter H 1980 The Theory and Computation of Evolutionary Distances Pattern Recognition Journal of Algorithms 1 4 359 73 doi 10 1016 0196 6774 80 90016 4 Skiena Steve 1998 Algorithm Design Manual 1st ed Springer ISBN 978 0 387 94860 7 Ukkonen E 1985 Algorithms for approximate string matching Information and Control 64 1 3 100 18 doi 10 1016 S0019 9958 85 80046 2 Wagner R Fischer M 1974 The string to string correction problem Journal of the ACM 21 168 73 doi 10 1145 321796 321811 S2CID 13381535 Zobel Justin Dart Philip 1995 Finding approximate matches in large lexicons Software Practice and Experience 25 3 331 345 CiteSeerX 10 1 1 14 3856 doi 10 1002 spe 4380250307 S2CID 6776819 External links EditFlamingo Project Efficient Similarity Query Processing Project with recent advances in approximate string matching based on an edit distance threshold StringMetric project a Scala library of string metrics and phonetic algorithms Natural project a JavaScript natural language processing library which includes implementations of popular string metrics Retrieved from https en wikipedia org w index php title Approximate string matching amp oldid 1170047672, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.