fbpx
Wikipedia

Matching wildcards

In computer science, an algorithm for matching wildcards (also known as globbing) is useful in comparing text strings that may contain wildcard syntax.[1] Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell[2] or Microsoft Windows command-line[3] or text editor or file manager, as well as the interfaces for some search engines[4] and databases.[5] Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.[6]

The problem edit

A wildcard matcher tests a wildcard pattern p against an input string s. It performs an anchored match, returns true only when p matches the entirety of s.

The pattern can be based on any common syntax (see globbing), but on Windows programmers tend to only discuss a simplified syntax supported by the native C runtime:[7][8]

  • No escape characters are defined
  • Wildcards: ? matches exactly one occurrence of any character. * matches arbitrary many (including zero) occurrences of any character.

This article mainly discusses the Windows formulation of the problem, unless otherwise stated.

Definition edit

Stated in zero-based indices, the wildcard-matching problem can be defined recursively as:

 

where mij is the result of matching the pattern p against the text t truncated at i and j characters respectively. This is the formulation used by Richter's algorithm and the Snippets algorithm found in Cantatore's collection.[9][10] This description is similar to the Levenshtein distance.

Related problems edit

Directly related problems in computer science include:

  • Pattern matching with don't cares or gaps, an unanchored string search with only the equivalent of ? defined.[11][12]
  • Pattern matching with wildcards, an unanchored string search with the equivalent of both wildcards defined. Has an exponential runtime unless a length-bound is given in the pattern matching with flexible wildcards variant.[13]

History edit

Early algorithms for matching wildcards often relied on recursion, but the technique was criticized on grounds of performance[10] and reliability[8] considerations. Non-recursive algorithms for matching wildcards have gained favor in light of these considerations.

Among both recursive and non-recursive algorithms, strategies for performing the pattern matching operation vary widely, as evidenced among the variety of example algorithms referenced below. Test case development and performance optimization techniques have been demonstrably brought to bear on certain algorithms, particularly those developed by critics of the recursive algorithms.

Recursive algorithms edit

The recursion generally happens on matching * when there is more suffix to match against. This is a form of backtracking, also done by some regular expression matchers.

The general form of these algorithms are the same. On recursion the algorithm slices the input into substrings, and considers a match to have happened when ONE of the substrings return a positive match. For dowild("*X", "abcX"), it would greedily call dowild("X", "abcX"), dowild("X", "bcX"), dowild("X", "cX") and dowild("X", "X"). They usually differ by less-important things like support for features and by more important factors such as minor but highly effective optimizations. Some of them include:

  • The ABORT signal against over-recursion (Lars Mathiesen 1991). While it is correct to naively recurse by the entire rest of the strings (pattern and text) on * and making sure that ONE of the substrings return a positive match, the running time becomes exponential for rejecting a match with many * in the text. Lars Mathiesen changes the return to three classes, match, no-match, and ABORT (no match possible at all for asterisk recursion.) The ABORT value is returned when the text is consumed too early or when another asterisk match has failed, guaranteeing a linear performance with respect to the number of asterisks. (The overall complexity is additionally quadratic to the number of characters left to match.)[14] Git/Rsync's wildmatch ABORT also covers invalid inputs.[21] The new INN uwildmat does the same.[22]
  • Asterisk advancement in recursion. This wildmatch tweak is relatively more minor. It applies to when the recursion wants to match "*X" on "abcX": when an asterisk is followed by a literal like "X", it is obvious that only the last comparison with equal lengths would have a chance of producing a match.[21] This is seen earlier in uwildmat in 2000[22] and more implicitly in van Rossum's fnmatch for FNM_PATHNAME.

Martin Richter's algorithm is an exception to this pattern, although the overall operation is equivalent. On * it recurses into increasing either of the indexes, following the dynamic programming formulation of the problem. The "ABORT" technique is applicable to it as well.[9] On typical patterns (as tested by Cantatore) it is slower than the greedy-call implementations.[10]

The recursive algorithms are in general easier to reason about, and with the ABORT modification they perform acceptably in terms of worst-case complexity. On strings without * they take linear-to-string-size time to match since there is a fixed one-to-one relation.

Non-recursive algorithms edit

The following are developed by critics of the recursive algorithms:

The following is not:

  • Jack Handy's incorrect algorithm[25] (fails MATCH("*?", "xx"))

The iterative functions above implement backtracking by saving an old set of pattern/text pointers, and reverting to it should a match fails. According to Kurt, since only one successful match is required, only one such set needs to be saved.[17]

In addition, the problem of wildcard matching can be converted into regular expression matching using a naive text-replacement approach. Although non-recursive regular expression matchers such as Thompson's construction are less used in practice due to lack of backreference support, wildcard matching in general does not come with a similarly rich set of features. (In fact, many of the algorithms above only has support for ? and *.) The Russ Cox implementation of Thompson NFA can be trivially modified for such.[26] Gustavo Navarro's BDM-based nrgrep algorithm provides a more streamlined implementation with emphasis on efficient suffixes.[27] See also regular expression § Implementations.

See also edit

References edit

  1. ^ "Wildcard characters". ScienceDirect. 2018.
  2. ^ Quigley, Ellie (2005). UNIX Shell Programming QuickStart. InformIT.com.
  3. ^ "MS-DOS and Windows Wildcard Characters". Microsoft Developer Network Library.
  4. ^ "Apache Lucene - Query Parser Syntax". Apache Lucene 2.9.4 Documentation. 2006.
  5. ^ "SQL Wildcards". W3Schools. 2018.
  6. ^ Goyvaerts, Jan (2018). "Welcome to Regular-Expressions.info". RegularExpressions.info.
  7. ^ "Wildcard Expansion". docs.microsoft.com.
  8. ^ a b c Krauss, Kirk (2008). "Matching Wildcards: An Algorithm". Dr. Dobb's Journal.
  9. ^ a b c Deadlock (2015). "Wildcard Matching Recursive Algorithm C++". Stack Overflow.
  10. ^ a b c d Cantatore, Alessandro (2003). "Wildcard matching algorithms".
  11. ^ Iliopoulos, Costas S.; Rahman, M. Sohel (2007). (PDF). SOFSEM 2007: Theory and Practice of Computer Science, 33rd Conference on Current Trends in Theory and Practice of Computer Science. Harrachov, Czech Republic. S2CID 14538871. Archived from the original (PDF) on 2019-12-17.
  12. ^ Clifford, Peter; Clifford, Raphaël (January 2007). "Simple deterministic wildcard matching". Information Processing Letters. 101 (2): 53–54. doi:10.1016/j.ipl.2006.08.002.
  13. ^ Wu, Xindong; Qiang, Ji-Peng; Xie, Fei (12 September 2014). "Pattern Matching with Flexible Wildcards". Journal of Computer Science and Technology. 29 (5): 740–750. doi:10.1007/s11390-014-1464-3. S2CID 16824910.
  14. ^ a b Salz, Rich (1991). "wildmat.c". GitHub.
  15. ^ Filip (2014). "Compare strings with wildcard". Stack Overflow.
  16. ^ Murugesan, Vignesh (2014). "WildCard Matching algorithm".
  17. ^ a b c Kurt, Dogan. "Wildcard Matching Methods".
  18. ^ van Rossum, Guido (20 November 2019). "freebsd/lib/libc/gen/fnmatch.c". GitHub. Retrieved 21 November 2019.
  19. ^ "fnmatch.c". opensource.apple.com. 1999.
  20. ^ "fnmatch_internal.c". Beren Minor's Mirrors. 21 November 2019.
  21. ^ a b "git/git: wildmatch.c". GitHub. 2020-01-20.
  22. ^ a b "uwildmat.c in trunk/lib – INN". inn.eyrie.org. Retrieved 27 November 2019.
  23. ^ Krauss, Kirk (2018). "Matching Wildcards: An Improved Algorithm for Big Data". Develop for Performance.
  24. ^ Siler (2013). "Recursive solutions for glob pattern matching". Stack Overflow.
  25. ^ Handy, Jack (2005). "Wildcard string compare (globbing)". Code Project.
  26. ^ Cox, Ross. "Regular Expression Matching Can Be Simple And Fast".
  27. ^ Navarro, Gonzalo (10 November 2001). "NR-grep: a fast and flexible pattern-matching tool" (PDF). Software: Practice and Experience. 31 (13): 1265–1312. doi:10.1002/spe.411. S2CID 3175806.

matching, wildcards, computer, science, algorithm, matching, wildcards, also, known, globbing, useful, comparing, text, strings, that, contain, wildcard, syntax, common, uses, these, algorithms, include, command, line, interfaces, bourne, shell, microsoft, win. In computer science an algorithm for matching wildcards also known as globbing is useful in comparing text strings that may contain wildcard syntax 1 Common uses of these algorithms include command line interfaces e g the Bourne shell 2 or Microsoft Windows command line 3 or text editor or file manager as well as the interfaces for some search engines 4 and databases 5 Wildcard matching is a subset of the problem of matching regular expressions and string matching in general 6 Contents 1 The problem 1 1 Definition 1 2 Related problems 2 History 3 Recursive algorithms 4 Non recursive algorithms 5 See also 6 ReferencesThe problem editA wildcard matcher tests a wildcard pattern p against an input string s It performs an anchored match returns true only when p matches the entirety of s The pattern can be based on any common syntax see globbing but on Windows programmers tend to only discuss a simplified syntax supported by the native C runtime 7 8 No escape characters are defined Wildcards matches exactly one occurrence of any character matches arbitrary many including zero occurrences of any character This article mainly discusses the Windows formulation of the problem unless otherwise stated Definition edit Stated in zero based indices the wildcard matching problem can be defined recursively as m00 p0 t0 m0j pj 1 m0 j 1mi0 falsemij mi 1 j 1forpi 1 tj 1 pi 1 mi j 1 mi 1 jforpi 1 falseforpi 1 tj 1for1 i p 1 j t displaystyle begin aligned m 00 amp p 0 t 0 m 0j amp p j 1 text land m 0 j 1 m i0 amp text false m ij amp begin cases m i 1 j 1 amp text for p i 1 t j 1 lor p i 1 text m i j 1 lor m i 1 j amp text for p i 1 text text false amp text for p i 1 neq t j 1 end cases amp amp quad text for 1 leq i leq p 1 leq j leq t end aligned nbsp where mij is the result of matching the pattern p against the text t truncated at i and j characters respectively This is the formulation used by Richter s algorithm and the Snippets algorithm found in Cantatore s collection 9 10 This description is similar to the Levenshtein distance Related problems edit Directly related problems in computer science include Pattern matching with don t cares or gaps an unanchored string search with only the equivalent of defined 11 12 Pattern matching with wildcards an unanchored string search with the equivalent of both wildcards defined Has an exponential runtime unless a length bound is given in the pattern matching with flexible wildcards variant 13 History editEarly algorithms for matching wildcards often relied on recursion but the technique was criticized on grounds of performance 10 and reliability 8 considerations Non recursive algorithms for matching wildcards have gained favor in light of these considerations Among both recursive and non recursive algorithms strategies for performing the pattern matching operation vary widely as evidenced among the variety of example algorithms referenced below Test case development and performance optimization techniques have been demonstrably brought to bear on certain algorithms particularly those developed by critics of the recursive algorithms Recursive algorithms editThe recursion generally happens on matching when there is more suffix to match against This is a form of backtracking also done by some regular expression matchers Rich Salz wildmat algorithm sh like syntax 14 Filip s algorithm 15 and Vignesh Murugesan s algorithm 16 Martin Richter s algorithm 9 identical to Snippets and related to the 7 zip algorithm 17 C library fnmatch implementations supports and multibyte character sets Guido van Rossum s BSD libc fnmatch 18 also part of Apple libc 19 Glibc fnmatch 20 The general form of these algorithms are the same On recursion the algorithm slices the input into substrings and considers a match to have happened when ONE of the substrings return a positive match For dowild X abcX it would greedily call dowild X abcX dowild X bcX dowild X cX and dowild X X They usually differ by less important things like support for features and by more important factors such as minor but highly effective optimizations Some of them include The ABORT signal against over recursion Lars Mathiesen 1991 While it is correct to naively recurse by the entire rest of the strings pattern and text on and making sure that ONE of the substrings return a positive match the running time becomes exponential for rejecting a match with many in the text Lars Mathiesen changes the return to three classes match no match and ABORT no match possible at all for asterisk recursion The ABORT value is returned when the text is consumed too early or when another asterisk match has failed guaranteeing a linear performance with respect to the number of asterisks The overall complexity is additionally quadratic to the number of characters left to match 14 Git Rsync s wildmatch ABORT also covers invalid inputs 21 The new INN uwildmat does the same 22 Asterisk advancement in recursion This wildmatch tweak is relatively more minor It applies to when the recursion wants to match X on abcX when an asterisk is followed by a literal like X it is obvious that only the last comparison with equal lengths would have a chance of producing a match 21 This is seen earlier in uwildmat in 2000 22 and more implicitly in van Rossum s fnmatch for FNM PATHNAME Martin Richter s algorithm is an exception to this pattern although the overall operation is equivalent On it recurses into increasing either of the indexes following the dynamic programming formulation of the problem The ABORT technique is applicable to it as well 9 On typical patterns as tested by Cantatore it is slower than the greedy call implementations 10 The recursive algorithms are in general easier to reason about and with the ABORT modification they perform acceptably in terms of worst case complexity On strings without they take linear to string size time to match since there is a fixed one to one relation Non recursive algorithms editThe following are developed by critics of the recursive algorithms Kirk J Krauss s wildcard matching algorithm used by IBM 8 23 Alessandro Cantatore s collection of wildcard matching algorithms 10 Dogan Kurt s iterative matcher and slower NFA matcher 17 Siler s incorrect algorithm fails MATCH da da da daaadabadmanda 24 The following is not Jack Handy s incorrect algorithm 25 fails MATCH xx The iterative functions above implement backtracking by saving an old set of pattern text pointers and reverting to it should a match fails According to Kurt since only one successful match is required only one such set needs to be saved 17 In addition the problem of wildcard matching can be converted into regular expression matching using a naive text replacement approach Although non recursive regular expression matchers such as Thompson s construction are less used in practice due to lack of backreference support wildcard matching in general does not come with a similarly rich set of features In fact many of the algorithms above only has support for and The Russ Cox implementation of Thompson NFA can be trivially modified for such 26 Gustavo Navarro s BDM based nrgrep algorithm provides a more streamlined implementation with emphasis on efficient suffixes 27 See also regular expression Implementations See also editPattern matching Pattern calculus Glob programming Wildcard character List of algorithmsReferences edit Wildcard characters ScienceDirect 2018 Quigley Ellie 2005 UNIX Shell Programming QuickStart InformIT com MS DOS and Windows Wildcard Characters Microsoft Developer Network Library Apache Lucene Query Parser Syntax Apache Lucene 2 9 4 Documentation 2006 SQL Wildcards W3Schools 2018 Goyvaerts Jan 2018 Welcome to Regular Expressions info RegularExpressions info Wildcard Expansion docs microsoft com a b c Krauss Kirk 2008 Matching Wildcards An Algorithm Dr Dobb s Journal a b c Deadlock 2015 Wildcard Matching Recursive Algorithm C Stack Overflow a b c d Cantatore Alessandro 2003 Wildcard matching algorithms Iliopoulos Costas S Rahman M Sohel 2007 Pattern Matching Algorithms with Don t Cares PDF SOFSEM 2007 Theory and Practice of Computer Science 33rd Conference on Current Trends in Theory and Practice of Computer Science Harrachov Czech Republic S2CID 14538871 Archived from the original PDF on 2019 12 17 Clifford Peter Clifford Raphael January 2007 Simple deterministic wildcard matching Information Processing Letters 101 2 53 54 doi 10 1016 j ipl 2006 08 002 Wu Xindong Qiang Ji Peng Xie Fei 12 September 2014 Pattern Matching with Flexible Wildcards Journal of Computer Science and Technology 29 5 740 750 doi 10 1007 s11390 014 1464 3 S2CID 16824910 a b Salz Rich 1991 wildmat c GitHub Filip 2014 Compare strings with wildcard Stack Overflow Murugesan Vignesh 2014 WildCard Matching algorithm a b c Kurt Dogan Wildcard Matching Methods van Rossum Guido 20 November 2019 freebsd lib libc gen fnmatch c GitHub Retrieved 21 November 2019 fnmatch c opensource apple com 1999 fnmatch internal c Beren Minor s Mirrors 21 November 2019 a b git git wildmatch c GitHub 2020 01 20 a b uwildmat c in trunk lib INN inn eyrie org Retrieved 27 November 2019 Krauss Kirk 2018 Matching Wildcards An Improved Algorithm for Big Data Develop for Performance Siler 2013 Recursive solutions for glob pattern matching Stack Overflow Handy Jack 2005 Wildcard string compare globbing Code Project Cox Ross Regular Expression Matching Can Be Simple And Fast Navarro Gonzalo 10 November 2001 NR grep a fast and flexible pattern matching tool PDF Software Practice and Experience 31 13 1265 1312 doi 10 1002 spe 411 S2CID 3175806 Retrieved from https en wikipedia org w index php title Matching wildcards amp oldid 1216424174, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.