Levenshtein automaton

In computer science, a Levenshtein automaton for a string w and a number n is a finite-state automaton that can recognize the set of all strings whose Levenshtein distance from w is at most n. That is, a string x is in the formal language recognized by the Levenshtein automaton if and only if x can be transformed into w by at most n single-character insertions, deletions, and substitutions.[1]
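The definition can be illustrated with a minimal sketch that simulates a nondeterministic Levenshtein automaton directly. It assumes the common textbook encoding in which a state is a pair (i, e): i characters of w have been matched and e edits have been spent. The function name `levenshtein_nfa_accepts` and this state encoding are choices made for this illustration, not something taken from the cited paper.

```python
def levenshtein_nfa_accepts(w: str, n: int, x: str) -> bool:
    """Return True if x is within Levenshtein distance n of w."""

    def closure(states):
        # Epsilon moves: skip a character of w at the cost of one edit.
        seen = set(states)
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < len(w) and e < n:
                nxt = (i + 1, e + 1)
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    current = closure({(0, 0)})
    for ch in x:
        nxt = set()
        for i, e in current:
            if i < len(w) and w[i] == ch:
                nxt.add((i + 1, e))          # ch matches w[i], no cost
            if e < n:
                nxt.add((i, e + 1))          # ch is an extra character in x
                if i < len(w):
                    nxt.add((i + 1, e + 1))  # ch replaces w[i]
        current = closure(nxt)
        if not current:
            return False                     # no live state left
    # Accept if some state has consumed all of w within the edit budget.
    return any(i == len(w) for i, _ in current)


print(levenshtein_nfa_accepts("food", 1, "fod"))    # one deletion
print(levenshtein_nfa_accepts("food", 1, "goods"))  # needs two edits
```

Tracking a set of (i, e) pairs is a simulation of the automaton rather than a precomputed deterministic one, so each input character costs time proportional to the number of live states.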

Applications

Levenshtein automata may be used for spelling correction, by finding words in a given dictionary that are close to a misspelled word. In this application, once a word is identified as being misspelled, its Levenshtein automaton may be constructed, and then applied to all of the words in the dictionary to determine which ones are close to the misspelled word. If the dictionary is stored in compressed form as a trie, the time for this algorithm (after the automaton has been constructed) is proportional to the number of nodes in the trie, significantly faster than using dynamic programming to compute the Levenshtein distance separately for each dictionary word.[1]
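The dictionary search described above can be sketched as a walk over a trie that carries the automaton state along each edge and abandons a subtree as soon as the automaton has no live state. This is an illustrative sketch only: the nested-dict trie, the `END` marker, and the helper names are assumptions, and the automaton is simulated as a set of (position, edits) pairs rather than precompiled, so it does not reach the running time quoted above.

```python
END = ""  # hypothetical marker key meaning "a dictionary word ends here"


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def start_states(w, n):
    # (i, i): the first i characters of w were skipped, costing i edits.
    return frozenset((i, i) for i in range(min(len(w), n) + 1))


def step(w, n, states, ch):
    nxt = set()
    for i, e in states:
        if i < len(w) and w[i] == ch:
            nxt.add((i + 1, e))          # match
        if e < n:
            nxt.add((i, e + 1))          # extra character in the candidate
            if i < len(w):
                nxt.add((i + 1, e + 1))  # substitution
    for i, e in list(nxt):               # skipped characters of w
        k = 1
        while i + k <= len(w) and e + k <= n:
            nxt.add((i + k, e + k))
            k += 1
    return frozenset(nxt)


def fuzzy_search(trie, w, n):
    """Return all trie words within Levenshtein distance n of w."""
    results = []
    stack = [(trie, "", start_states(w, n))]
    while stack:
        node, prefix, states = stack.pop()
        if END in node and any(i == len(w) for i, _ in states):
            results.append(prefix)
        for ch, child in node.items():
            if ch == END:
                continue
            nxt = step(w, n, states, ch)
            if nxt:                      # prune subtrees with no live state
                stack.append((child, prefix + ch, nxt))
    return sorted(results)


trie = build_trie(["food", "fod", "good", "foods", "mood", "fd", "goods", "bar"])
print(fuzzy_search(trie, "food", 1))
```

Because shared prefixes are stored once in the trie, the automaton processes each prefix once instead of once per dictionary word, which is where the saving over per-word dynamic programming comes from.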

It is also possible to find words in a regular language, rather than a finite dictionary, that are close to a given target word, by computing the Levenshtein automaton for the word, and then using a Cartesian product construction to combine it with an automaton for the regular language, giving an automaton for the intersection language. Alternatively, rather than using the product construction, both the Levenshtein automaton and the automaton for the given regular language may be traversed simultaneously using a backtracking algorithm.[1]
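A rough sketch of the simultaneous traversal follows. It assumes the regular language is given as a deterministic automaton in a hypothetical dictionary form, `dfa_delta[(state, char)] = next_state`, and it represents the Levenshtein automaton's state as a row of edit distances capped at n + 1. That representation is a convenient stand-in chosen for this example, not the construction from the cited paper.

```python
def find_close_words(w, n, dfa_start, dfa_delta, dfa_accepting):
    """List words accepted by the DFA that are within distance n of w."""
    cap = n + 1  # distances above n are all treated as "too far"
    start_row = tuple(min(j, cap) for j in range(len(w) + 1))

    def step(row, ch):
        # One dynamic-programming row update, with every entry capped.
        new = [min(row[0] + 1, cap)]
        for j in range(1, len(w) + 1):
            cost = 0 if w[j - 1] == ch else 1
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + cost, cap))
        return tuple(new)

    results = []
    # Each stack entry is one state of the product automaton plus the
    # string that led to it.
    stack = [(dfa_start, start_row, "")]
    while stack:
        q, row, prefix = stack.pop()
        if q in dfa_accepting and row[-1] <= n:
            results.append(prefix)
        for (src, ch), dst in dfa_delta.items():
            if src != q:
                continue
            nxt = step(row, ch)
            if min(nxt) <= n:  # otherwise no extension can get close to w
                stack.append((dst, nxt, prefix + ch))
    return sorted(results)


# Example: the regular language (ab)* as a two-state DFA.
delta = {(0, "a"): 1, (1, "b"): 0}
print(find_close_words("abb", 1, 0, delta, {0}))
```

The search terminates even though (ab)* is infinite, because only strings of length at most |w| + n can be within distance n of w, and longer prefixes are pruned.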

Levenshtein automata are used in Lucene for full-text searches that can return relevant documents even if the query is misspelled.[2]

Construction

For any fixed constant n, the Levenshtein automaton for w and n may be constructed in time O(|w|).[1]
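For illustration, a deterministic automaton for w and n can also be built by an ordinary subset construction over (position, edits) pairs. The sketch below does this and is not the linear-time construction of Schulz and Mihov; it makes no attempt to meet the O(|w|) bound. The `OTHER` symbol, standing for any character that does not occur in w, and the function names are assumptions of this example.

```python
from collections import deque

OTHER = None  # hypothetical symbol for "any character not occurring in w"


def build_levenshtein_dfa(w, n):
    def closure(states):
        out = set(states)
        for i, e in states:
            k = 1
            while i + k <= len(w) and e + k <= n:
                out.add((i + k, e + k))   # skip k characters of w
                k += 1
        return frozenset(out)

    def step(states, ch):
        nxt = set()
        for i, e in states:
            if i < len(w) and w[i] == ch:
                nxt.add((i + 1, e))          # match
            if e < n:
                nxt.add((i, e + 1))          # extra input character
                if i < len(w):
                    nxt.add((i + 1, e + 1))  # substitution
        return closure(nxt)

    alphabet = sorted(set(w)) + [OTHER]
    start = closure({(0, 0)})
    ids = {start: 0}                  # subset of NFA states -> DFA state id
    transitions, accepting = {}, set()
    queue = deque([start])
    while queue:
        states = queue.popleft()
        sid = ids[states]
        if any(i == len(w) for i, _ in states):
            accepting.add(sid)
        for ch in alphabet:
            nxt = step(states, ch)
            if not nxt:
                continue              # a missing transition means rejection
            if nxt not in ids:
                ids[nxt] = len(ids)
                queue.append(nxt)
            transitions[(sid, ch)] = ids[nxt]
    return transitions, accepting, set(w)


def dfa_accepts(dfa, x):
    transitions, accepting, known = dfa
    state = 0
    for ch in x:
        key = (state, ch if ch in known else OTHER)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in accepting


dfa = build_levenshtein_dfa("food", 1)
print(dfa_accepts(dfa, "fxod"), dfa_accepts(dfa, "fd"))
```

Once the table is built, checking a candidate string takes one dictionary lookup per character, independent of n.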

Mitankin studies a variant of this construction called the universal Levenshtein automaton, determined only by a numeric parameter n, that can recognize pairs of words (encoded in a certain way by bitvectors) that are within Levenshtein distance n of each other.[3] Touzet proposed an effective algorithm to build this automaton.[4]

A third finite-automaton construction for Levenshtein (or Damerau–Levenshtein) distance is the Levenshtein transducer of Hassan et al., who build finite-state transducers implementing edit distance one and then compose them to implement edit distances up to some constant.[5]

See also

  • agrep, tool (implemented several times) for approximate regular expression matching
  • TRE, library for regular expression matching that is tolerant to Levenshtein-style edits

References

  1. Schulz, Klaus U.; Mihov, Stoyan (2002). "Fast String Correction with Levenshtein-Automata". International Journal of Document Analysis and Recognition. 5 (1): 67–85. CiteSeerX 10.1.1.16.652. doi:10.1007/s10032-002-0082-8. S2CID 207046453.
  2. McCandless, Michael (24 March 2011). "Lucene's FuzzyQuery is 100 times faster in 4.0". Changing Bits. Retrieved 2021-06-07.
  3. Mitankin, Petar N. (2005). Universal Levenshtein Automata. Building and Properties (PDF) (Thesis). Sofia University St. Kliment Ohridski.
  4. Touzet, H. (2016). "On the Levenshtein Automaton and the Size of the Neighbourhood of a Word" (PDF). Language and Automata Theory and Applications. Lecture Notes in Computer Science. Vol. 9618. pp. 207–218. doi:10.1007/978-3-319-30000-9_16. ISBN 978-3-319-29999-0. S2CID 34821290.
  5. Hassan, Ahmed; Noeman, Sara; Hassan, Hany (2008). Language Independent Text Correction using Finite State Automata. IJCNLP.
