fbpx
Wikipedia

Damerau–Levenshtein distance

In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein[1][2][3]) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions).[4][2]

In his seminal paper,[5] Damerau stated that in an investigation of spelling errors for an information-retrieval system, more than 80% were a result of a single error of one of the four types. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between protein sequences.[6]

Definition edit

To express the Damerau–Levenshtein distance between two strings   and  , a function   is defined, whose value is a distance between an  -symbol prefix (initial substring) of string   and a  -symbol prefix of  .

The restricted distance function is defined recursively as:[7]: A:11 

 
where   is the indicator function equal to 0 when   and equal to 1 otherwise.

Each recursive call matches one of the cases covered by the Damerau–Levenshtein distance:

  •   corresponds to a deletion (from a to b),
  •   corresponds to an insertion (from a to b),
  •   corresponds to a match or mismatch, depending on whether the respective symbols are the same,
  •   corresponds to a transposition between two successive symbols.

The Damerau–Levenshtein distance between a and b is then given by the function value for full strings:  , where   denotes the length of string a, and   is the length of b.

Algorithm edit

Presented here are two algorithms: the first,[8] simpler one, computes what is known as the optimal string alignment distance or restricted edit distance,[7] while the second one[9] computes the Damerau–Levenshtein distance with adjacent transpositions. Adding transpositions adds significant complexity. The difference between the two algorithms consists in that the optimal string alignment algorithm computes the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once, whereas the second one presents no such restriction.

Take for example the edit distance between CA and ABC. The Damerau–Levenshtein distance LD(CAABC) = 2 because CAACABC, but the optimal string alignment distance OSA(CAABC) = 3 because if the operation CAAC is used, it is not possible to use ACABC because that would require the substring to be edited more than once, which is not allowed in OSA, and therefore the shortest sequence of operations is CAAABABC. Note that for the optimal string alignment distance, the triangle inequality does not hold: OSA(CAAC) + OSA(ACABC) < OSA(CAABC), and so it is not a true metric.

Optimal string alignment distance edit

Optimal string alignment distance can be computed using a straightforward extension of the Wagner–Fischer dynamic programming algorithm that computes Levenshtein distance. In pseudocode:

algorithm OSA-distance is input: strings a[1..length(a)], b[1..length(b)] output: distance, integer let d[0..length(a), 0..length(b)] be a 2-d array of integers, dimensions length(a)+1, length(b)+1 // note that d is zero-indexed, while a and b are one-indexed. for i := 0 to length(a) inclusive do d[i, 0] := i for j := 0 to length(b) inclusive do d[0, j] := j for i := 1 to length(a) inclusive do for j := 1 to length(b) inclusive do  if a[i] = b[j] then  cost := 0  else  cost := 1  d[i, j] := minimum(d[i-1, j] + 1, // deletion   d[i, j-1] + 1, // insertion   d[i-1, j-1] + cost) // substitution  if i > 1 and j > 1 and a[i] = b[j-1] and a[i-1] = b[j] then  d[i, j] := minimum(d[i, j],   d[i-2, j-2] + 1) // transposition return d[length(a), length(b)] 

The difference from the algorithm for Levenshtein distance is the addition of one recurrence:

if i > 1 and j > 1 and a[i] = b[j-1] and a[i-1] = b[j] then d[i, j] := minimum(d[i, j],  d[i-2, j-2] + 1) // transposition 

Distance with adjacent transpositions edit

The following algorithm computes the true Damerau–Levenshtein distance with adjacent transpositions; this algorithm requires as an additional parameter the size of the alphabet Σ, so that all entries of the arrays are in [0, |Σ|):[7]: A:93 

algorithm DL-distance is input: strings a[1..length(a)], b[1..length(b)] output: distance, integer da := new array of |Σ| integers for i := 1 to |Σ| inclusive do da[i] := 0 let d[−1..length(a), −1..length(b)] be a 2-d array of integers, dimensions length(a)+2, length(b)+2 // note that d has indices starting at −1, while a, b and da are one-indexed. maxdist := length(a) + length(b) d[−1, −1] := maxdist for i := 0 to length(a) inclusive do d[i, −1] := maxdist d[i, 0] := i for j := 0 to length(b) inclusive do d[−1, j] := maxdist d[0, j] := j for i := 1 to length(a) inclusive do db := 0 for j := 1 to length(b) inclusive do  k := da[b[j]]  ℓ := db  if a[i] = b[j] then  cost := 0  db := j  else  cost := 1  d[i, j] := minimum(d[i−1, j−1] + cost, //substitution   d[i, j−1] + 1, //insertion   d[i−1, j ] + 1, //deletion   d[k−1, ℓ−1] + (i−k−1) + 1 + (j-ℓ−1)) //transposition da[a[i]] := i return d[length(a), length(b)] 

To devise a proper algorithm to calculate unrestricted Damerau–Levenshtein distance, note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. (This holds as long as the cost of a transposition,  , is at least the average of the cost of an insertion and deletion, i.e.,  .[9]) Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity:  , where M and N are string lengths. Using the ideas of Lowrance and Wagner,[9] this naive algorithm can be improved to be   in the worst case, which is what the above pseudocode does.

It is interesting that the bitap algorithm can be modified to process transposition. See the information retrieval section of[1] for an example of such an adaptation.

Applications edit

Damerau–Levenshtein distance plays an important role in natural language processing. In natural languages, strings are short and the number of errors (misspellings) rarely exceeds 2. In such circumstances, restricted and real edit distance differ very rarely. Oommen and Loke[8] even mitigated the limitation of the restricted edit distance by introducing generalized transpositions. Nevertheless, one must remember that the restricted edit distance usually does not satisfy the triangle inequality, and thus cannot be used with metric trees.

DNA edit

Since DNA frequently undergoes insertions, deletions, substitutions, and transpositions, and each of these operations occurs on approximately the same timescale, the Damerau–Levenshtein distance is an appropriate metric of the variation between two strands of DNA.[6] More common in DNA, protein, and other bioinformatics related alignment tasks is the use of closely related algorithms such as Needleman–Wunsch algorithm or Smith–Waterman algorithm.[citation needed]

Fraud detection edit

The algorithm can be used with any set of words, like vendor names. Since entry is manual by nature, there is a risk of entering a false vendor. A fraudster employee may enter one real vendor such as "Rich Heir Estate Services" versus a false vendor "Rich Hier State Services". The fraudster would then create a false bank account and have the company route checks to the real vendor and false vendor. The Damerau–Levenshtein algorithm will detect the transposed and dropped letter and bring attention of the items to a fraud examiner.

Export control edit

The U.S. Government uses the Damerau–Levenshtein distance with its Consolidated Screening List API.[10]

See also edit

  • Ispell – Suggests corrections that are based on a Damerau–Levenshtein distance of 1
  • Typosquatting – Form of cybersquatting which relies on mistakes when inputting a website address

References edit

  1. ^ Brill, Eric; Moore, Robert C. (2000). (PDF). Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. pp. 286–293. doi:10.3115/1075218.1075255. Archived from the original (PDF) on 2012-12-21.
  2. ^ a b Bard, Gregory V. (2007), "Spelling-error tolerant, order-independent pass-phrases via the Damerau–Levenshtein string-edit distance metric", Proceedings of the Fifth Australasian Symposium on ACSW Frontiers : 2007, Ballarat, Australia, January 30 - February 2, 2007, Conferences in Research and Practice in Information Technology, vol. 68, Darlinghurst, Australia: Australian Computer Society, Inc., pp. 117–124, ISBN 978-1-920682-49-1.
  3. ^ Li; et al. (2006). (PDF). Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. pp. 1025–1032. doi:10.3115/1220175.1220304. Archived from the original (PDF) on 2010-04-01.
  4. ^ Levenshtein, Vladimir I. (February 1966), "Binary codes capable of correcting deletions, insertions, and reversals", Soviet Physics Doklady, 10 (8): 707–710, Bibcode:1966SPhD...10..707L
  5. ^ Damerau, Fred J. (March 1964), "A technique for computer detection and correction of spelling errors", Communications of the ACM, 7 (3): 171–176, doi:10.1145/363958.363994, S2CID 7713345
  6. ^ a b Majorek, Karolina A.; Dunin-Horkawicz, Stanisław; et al. (2013), "The RNase H-like superfamily: new members, comparative structural analysis and evolutionary classification", Nucleic Acids Research, 42 (7): 4160–4179, doi:10.1093/nar/gkt1414, PMC 3985635, PMID 24464998
  7. ^ a b c Boytsov, Leonid (May 2011). "Indexing methods for approximate dictionary searching". Journal of Experimental Algorithmics. 16: 1. doi:10.1145/1963190.1963191. S2CID 15635688.
  8. ^ a b Oommen, B. J.; Loke, R. K. S. (1997). "Pattern recognition of strings with substitutions, insertions, deletions and generalized transpositions". Pattern Recognition. 30 (5): 789–800. Bibcode:1997PatRe..30..789O. CiteSeerX 10.1.1.50.1459. doi:10.1016/S0031-3203(96)00101-X.
  9. ^ a b c Lowrance, Roy; Wagner, Robert A. (April 1975), "An Extension of the String-to-String Correction Problem", J ACM, 22 (2): 177–183, doi:10.1145/321879.321880, S2CID 18892193
  10. ^ . Trade.gov Developer Portal. International Trade Administration, U.S. Department of Commerce. Archived from the original on 2019-10-30.

Further reading edit

  • Navarro, Gonzalo (March 2001), "A guided tour to approximate string matching", ACM Computing Surveys, 33 (1): 31–88, doi:10.1145/375360.375365, S2CID 207551224

damerau, levenshtein, distance, information, theory, computer, science, named, after, frederick, damerau, vladimir, levenshtein, string, metric, measuring, edit, distance, between, sequences, informally, between, words, minimum, number, operations, consisting,. In information theory and computer science the Damerau Levenshtein distance named after Frederick J Damerau and Vladimir I Levenshtein 1 2 3 is a string metric for measuring the edit distance between two sequences Informally the Damerau Levenshtein distance between two words is the minimum number of operations consisting of insertions deletions or substitutions of a single character or transposition of two adjacent characters required to change one word into the other The Damerau Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single character edit operations insertions deletions and substitutions 4 2 In his seminal paper 5 Damerau stated that in an investigation of spelling errors for an information retrieval system more than 80 were a result of a single error of one of the four types Damerau s paper considered only misspellings that could be corrected with at most one edit operation While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers Damerau Levenshtein distance has also seen uses in biology to measure the variation between protein sequences 6 Contents 1 Definition 2 Algorithm 2 1 Optimal string alignment distance 2 2 Distance with adjacent transpositions 3 Applications 3 1 DNA 3 2 Fraud detection 3 3 Export control 4 See also 5 References 6 Further readingDefinition editTo express the Damerau Levenshtein distance between two strings a displaystyle a nbsp and b displaystyle b nbsp a function d a b i j displaystyle d a b i j nbsp is defined whose value is a distance between an i displaystyle i nbsp symbol prefix initial substring of string a displaystyle a nbsp and a j displaystyle j nbsp symbol prefix of b displaystyle b nbsp The restricted distance function is defined recursively as 7 A 11 d a b i j min 0 if i j 0 d a b i 1 j 1 if i gt 0 d a b i j 1 1 if j gt 0 d a b i 1 j 1 1 a i b j if i j gt 0 d a b i 2 j 2 1 a i b j if i j gt 1 and a i b j 1 and a i 1 b j displaystyle d a b i j min begin cases 0 amp text if i j 0 d a b i 1 j 1 amp text if i gt 0 d a b i j 1 1 amp text if j gt 0 d a b i 1 j 1 1 a i neq b j amp text if i j gt 0 d a b i 2 j 2 1 a i neq b j amp text if i j gt 1 text and a i b j 1 text and a i 1 b j end cases nbsp where 1 a i b j displaystyle 1 a i neq b j nbsp is the indicator function equal to 0 when a i b j displaystyle a i b j nbsp and equal to 1 otherwise Each recursive call matches one of the cases covered by the Damerau Levenshtein distance d a b i 1 j 1 displaystyle d a b i 1 j 1 nbsp corresponds to a deletion from a to b d a b i j 1 1 displaystyle d a b i j 1 1 nbsp corresponds to an insertion from a to b d a b i 1 j 1 1 a i b j displaystyle d a b i 1 j 1 1 a i neq b j nbsp corresponds to a match or mismatch depending on whether the respective symbols are the same d a b i 2 j 2 1 a i b j displaystyle d a b i 2 j 2 1 a i neq b j nbsp corresponds to a transposition between two successive symbols The Damerau Levenshtein distance between a and b is then given by the function value for full strings d a b a b displaystyle d a b big a b big nbsp where i a displaystyle i a nbsp denotes the length of string a and j b displaystyle j b nbsp is the length of b Algorithm editPresented here are two algorithms the first 8 simpler one computes what is known as the optimal string alignment distance or restricted edit distance 7 while the second one 9 computes the Damerau Levenshtein distance with adjacent transpositions Adding transpositions adds significant complexity The difference between the two algorithms consists in that the optimal string alignment algorithm computes the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once whereas the second one presents no such restriction Take for example the edit distance between CA and ABC The Damerau Levenshtein distance LD CA ABC 2 because CA AC ABC but the optimal string alignment distance OSA CA ABC 3 because if the operation CA AC is used it is not possible to use AC ABC because that would require the substring to be edited more than once which is not allowed in OSA and therefore the shortest sequence of operations is CA A AB ABC Note that for the optimal string alignment distance the triangle inequality does not hold OSA CA AC OSA AC ABC lt OSA CA ABC and so it is not a true metric Optimal string alignment distance edit Optimal string alignment distance can be computed using a straightforward extension of the Wagner Fischer dynamic programming algorithm that computes Levenshtein distance In pseudocode algorithm OSA distance is input strings a 1 length a b 1 length b output distance integer let d 0 length a 0 length b be a 2 d array of integers dimensions length a 1 length b 1 note that d is zero indexed while a and b are one indexed for i 0 to length a inclusive do d i 0 i for j 0 to length b inclusive do d 0 j j for i 1 to length a inclusive do for j 1 to length b inclusive do if a i b j then cost 0 else cost 1 d i j minimum d i 1 j 1 deletion d i j 1 1 insertion d i 1 j 1 cost substitution if i gt 1 and j gt 1 and a i b j 1 and a i 1 b j then d i j minimum d i j d i 2 j 2 1 transposition return d length a length b The difference from the algorithm for Levenshtein distance is the addition of one recurrence if i gt 1 and j gt 1 and a i b j 1 and a i 1 b j then d i j minimum d i j d i 2 j 2 1 transposition Distance with adjacent transpositions edit The following algorithm computes the true Damerau Levenshtein distance with adjacent transpositions this algorithm requires as an additional parameter the size of the alphabet S so that all entries of the arrays are in 0 S 7 A 93 algorithm DL distance is input strings a 1 length a b 1 length b output distance integer da new array of S integers for i 1 to S inclusive do da i 0 let d 1 length a 1 length b be a 2 d array of integers dimensions length a 2 length b 2 note that d has indices starting at 1 while a b and da are one indexed maxdist length a length b d 1 1 maxdist for i 0 to length a inclusive do d i 1 maxdist d i 0 i for j 0 to length b inclusive do d 1 j maxdist d 0 j j for i 1 to length a inclusive do db 0 for j 1 to length b inclusive do k da b j ℓ db if a i b j then cost 0 db j else cost 1 d i j minimum d i 1 j 1 cost substitution d i j 1 1 insertion d i 1 j 1 deletion d k 1 ℓ 1 i k 1 1 j ℓ 1 transposition da a i i return d length a length b To devise a proper algorithm to calculate unrestricted Damerau Levenshtein distance note that there always exists an optimal sequence of edit operations where once transposed letters are never modified afterwards This holds as long as the cost of a transposition W T displaystyle W T nbsp is at least the average of the cost of an insertion and deletion i e 2 W T W I W D displaystyle 2W T geq W I W D nbsp 9 Thus we need to consider only two symmetric ways of modifying a substring more than once 1 transpose letters and insert an arbitrary number of characters between them or 2 delete a sequence of characters and transpose letters that become adjacent after deletion The straightforward implementation of this idea gives an algorithm of cubic complexity O M N max M N displaystyle O big M cdot N cdot max M N big nbsp where M and N are string lengths Using the ideas of Lowrance and Wagner 9 this naive algorithm can be improved to be O M N displaystyle O M cdot N nbsp in the worst case which is what the above pseudocode does It is interesting that the bitap algorithm can be modified to process transposition See the information retrieval section of 1 for an example of such an adaptation Applications editDamerau Levenshtein distance plays an important role in natural language processing In natural languages strings are short and the number of errors misspellings rarely exceeds 2 In such circumstances restricted and real edit distance differ very rarely Oommen and Loke 8 even mitigated the limitation of the restricted edit distance by introducing generalized transpositions Nevertheless one must remember that the restricted edit distance usually does not satisfy the triangle inequality and thus cannot be used with metric trees DNA edit Since DNA frequently undergoes insertions deletions substitutions and transpositions and each of these operations occurs on approximately the same timescale the Damerau Levenshtein distance is an appropriate metric of the variation between two strands of DNA 6 More common in DNA protein and other bioinformatics related alignment tasks is the use of closely related algorithms such as Needleman Wunsch algorithm or Smith Waterman algorithm citation needed Fraud detection edit The algorithm can be used with any set of words like vendor names Since entry is manual by nature there is a risk of entering a false vendor A fraudster employee may enter one real vendor such as Rich Heir Estate Services versus a false vendor Rich Hier State Services The fraudster would then create a false bank account and have the company route checks to the real vendor and false vendor The Damerau Levenshtein algorithm will detect the transposed and dropped letter and bring attention of the items to a fraud examiner Export control edit The U S Government uses the Damerau Levenshtein distance with its Consolidated Screening List API 10 See also editIspell Suggests corrections that are based on a Damerau Levenshtein distance of 1 Typosquatting Form of cybersquatting which relies on mistakes when inputting a website addressReferences edit Brill Eric Moore Robert C 2000 An Improved Error Model for Noisy Channel Spelling Correction PDF Proceedings of the 38th Annual Meeting on Association for Computational Linguistics pp 286 293 doi 10 3115 1075218 1075255 Archived from the original PDF on 2012 12 21 a b Bard Gregory V 2007 Spelling error tolerant order independent pass phrases via the Damerau Levenshtein string edit distance metric Proceedings of the Fifth Australasian Symposium on ACSW Frontiers 2007 Ballarat Australia January 30 February 2 2007 Conferences in Research and Practice in Information Technology vol 68 Darlinghurst Australia Australian Computer Society Inc pp 117 124 ISBN 978 1 920682 49 1 Li et al 2006 Exploring distributional similarity based models for query spelling correction PDF Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics pp 1025 1032 doi 10 3115 1220175 1220304 Archived from the original PDF on 2010 04 01 Levenshtein Vladimir I February 1966 Binary codes capable of correcting deletions insertions and reversals Soviet Physics Doklady 10 8 707 710 Bibcode 1966SPhD 10 707L Damerau Fred J March 1964 A technique for computer detection and correction of spelling errors Communications of the ACM 7 3 171 176 doi 10 1145 363958 363994 S2CID 7713345 a b Majorek Karolina A Dunin Horkawicz Stanislaw et al 2013 The RNase H like superfamily new members comparative structural analysis and evolutionary classification Nucleic Acids Research 42 7 4160 4179 doi 10 1093 nar gkt1414 PMC 3985635 PMID 24464998 a b c Boytsov Leonid May 2011 Indexing methods for approximate dictionary searching Journal of Experimental Algorithmics 16 1 doi 10 1145 1963190 1963191 S2CID 15635688 a b Oommen B J Loke R K S 1997 Pattern recognition of strings with substitutions insertions deletions and generalized transpositions Pattern Recognition 30 5 789 800 Bibcode 1997PatRe 30 789O CiteSeerX 10 1 1 50 1459 doi 10 1016 S0031 3203 96 00101 X a b c Lowrance Roy Wagner Robert A April 1975 An Extension of the String to String Correction Problem J ACM 22 2 177 183 doi 10 1145 321879 321880 S2CID 18892193 Consolidated Screening List API Trade gov Developer Portal International Trade Administration U S Department of Commerce Archived from the original on 2019 10 30 Further reading editNavarro Gonzalo March 2001 A guided tour to approximate string matching ACM Computing Surveys 33 1 31 88 doi 10 1145 375360 375365 S2CID 207551224 Retrieved from https en wikipedia org w index php title Damerau Levenshtein distance amp oldid 1209413830, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.