fbpx
Wikipedia

re2c

re2c is a free and open-source lexer generator for C, C++, Go, and Rust. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper,[1] re2c was put in public domain and has been since maintained by volunteers.[3] It is the lexer generator adopted by projects such as PHP,[4] SpamAssassin,[5] Ninja build system[6] and others. Together with the Lemon parser generator, re2c is used in BRL-CAD.[7] This combination is also used with STEPcode, an implementation of ISO 10303 standard.[8]

re2c
Original author(s)Peter Bumbulis
Initial releasearound 1994; 30 years ago (1994)[1]
Stable release
3.1[2]  / 19 July 2023; 7 months ago (19 July 2023)
Repositorygithub.com/skvadrik/re2c
TypeLexical analyzer generator
LicensePublic domain
Websitere2c.org

Philosophy edit

The main goal of re2c is generating fast lexers:[1] at least as fast as reasonably optimized C lexers coded by hand. Instead of using traditional table-driven approach, re2c encodes the generated finite state machine directly in the form of conditional jumps and comparisons. The resulting program is faster than its table-driven counterpart[1] and much easier to debug and understand. Moreover, this approach often results in smaller lexers,[1] as re2c applies a number of optimizations such as DFA minimization and the construction of tunnel automaton.[9] Another distinctive feature of re2c is its flexible interface: instead of assuming a fixed program template, re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment. The main idea is that re2c should be a zero-cost abstraction for the programmer: using it should never result in a slower program than the corresponding hand-coded implementation.

Features edit

  • Submatch extraction:[10] re2c supports both POSIX-compliant capturing groups and standalone tags [11] (with leftmost greedy disambiguation and optional handling of repeated submatch). The implementation is based on the lookahead-TDFA algorithm.[12][13][14]
  • Encoding support:[15] re2c supports ASCII, UTF-8, UTF-16, UTF-32, UCS-2 and EBCDIC.
  • Flexible user interface:[16] the generated code uses a few primitive operations in order to interface with the environment (read input characters, advance to the next input position, etc.); users can redefine these primitives to whatever they need.
  • Storable state:[17] re2c supports both pull-model lexers (when lexer runs without interrupts and pulls more input as necessary) and push-model lexers (when lexer is periodically stopped and resumed to parse new chunks of input).
  • Start conditions:[18] re2c can generate multiple interrelated lexers, where each lexer is triggered by a certain condition in program.
  • Self-validation:[19] re2c has a special mode in which it ignores all used-defined interface code and generates a self-contained skeleton program. Additionally, re2c generates two files: one with the input strings derived from the regular grammar, and one with compressed match results that are used to verify lexer behavior on all inputs. Input strings are generated so that they extensively cover DFA transitions and paths. Data generation happens right after DFA construction and prior to any optimizations, but the lexer itself is fully optimized, so skeleton programs are capable of revealing any errors in optimizations and code generation.
  • Warnings:[20] re2c performs static analysis of the program and warns its users about possible deficiencies or bugs, such as undefined control flow, unreachable code, ill-formed escape symbols and potential misuse of the interface primitives.
  • Debugging. Besides generating human-readable lexers, re2c has a number of options that dump various intermediate representations of the generated lexer, such as NFA, multiple stages of DFA and the resulting program graph in DOT format.[21]

Syntax edit

re2c program can contain any number of /*!re2c ... */ blocks. Each block consists of a sequence of rules, definitions and configurations (they can be intermixed, but it is generally better to put configurations first, then definitions and then rules). Rules have the form REGEXP { CODE } or REGEXP := CODE; where REGEXP is a regular expression and CODE is a block of C code. When REGEXP matches the input string, control flow is transferred to the associated CODE. There is one special rule: the default rule with * instead of REGEXP; it is triggered if no other rules matches. re2c has greedy matching semantics: if multiple rules match, the rule that matches longer prefix is preferred; if the conflicting rules match the same prefix, the earlier rule has priority. Definitions have the form NAME = REGEXP; (and also NAME { REGEXP } in Flex compatibility mode). Configurations have the form re2c:CONFIG = VALUE; where CONFIG is the name of the particular configuration and VALUE is a number or a string. For more advanced usage see the official re2c manual.[22]

Regular expressions edit

re2c uses the following syntax for regular expressions:

  • "foo" case-sensitive string literal
  • 'foo' case-insensitive string literal
  • [a-xyz], [^a-xyz] character class (possibly negated)
  • . any character except newline
  • R \ S difference of character classes
  • R* zero or more occurrences of R
  • R+ one or more occurrences of R
  • R? optional R
  • R{n} repetition of R exactly n times
  • R{n,} repetition of R at least n times
  • R{n,m} repetition of R from n to m times
  • (R) just R; parentheses are used to override precedence or for POSIX-style submatch
  • R S concatenation: R followed by S
  • R | S alternative: R or S
  • R / S lookahead: R followed by S, but S is not consumed
  • name the regular expression defined as name (except in Flex compatibility mode)
  • @stag an s-tag: saves the last input position at which @stag matches in a variable named stag
  • #mtag an m-tag: saves all input positions at which #mtag matches in a variable named mtag

Character classes and string literals may contain the following escape sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexadecimal escapes \xhh, \uhhhh and \Uhhhhhhhh.

Example edit

Here is a very simple program in re2c (example.re). It checks that all input arguments are hexadecimal numbers. The code for re2c is enclosed in comments /*!re2c ... */, all the rest is plain C code. See the official re2c website for more complex examples.[23]

#include <stdio.h> static int lex(const char *YYCURSOR) {  const char *YYMARKER;  /*!re2c  re2c:define:YYCTYPE = char;  re2c:yyfill:enable = 0;  end = "\x00";  hex = "0x" [0-9a-fA-F]+;  * { printf("err\n"); return 1; }  hex end { printf("hex\n"); return 0; }  */ } int main(int argc, char **argv) {  for (int i = 1; i < argc; ++i) {  lex(argv[i]);  }  return 0; } 

Given that, re2c -is -o example.c example.re generates the code below (example.c). The contents of the comment /*!re2c ... */ are substituted with a deterministic finite automaton encoded in the form of conditional jumps and comparisons; the rest of the program is copied verbatim into the output file. There are several code generation options; normally re2c uses switch statements, but it can use nested if statements (as in this example with -s option), or generate bitmaps and jump tables. Which option is better depends on the C compiler; re2c users are encouraged to experiment.

/* Generated by re2c 1.2.1 on Fri Aug 23 21:59:00 2019 */ #include <stdio.h> static int lex(const char *YYCURSOR) {  const char *YYMARKER;   {  char yych;  yych = *YYCURSOR;  if (yych == '0') goto yy4;  ++YYCURSOR; yy3:  { printf("err\n"); return 1; } yy4:  yych = *(YYMARKER = ++YYCURSOR);  if (yych != 'x') goto yy3;  yych = *++YYCURSOR;  if (yych >= 0x01) goto yy8; yy6:  YYCURSOR = YYMARKER;  goto yy3; yy7:  yych = *++YYCURSOR; yy8:  if (yych <= '@') {  if (yych <= 0x00) goto yy9;  if (yych <= '/') goto yy6;  if (yych <= '9') goto yy7;  goto yy6;  } else {  if (yych <= 'F') goto yy7;  if (yych <= '`') goto yy6;  if (yych <= 'f') goto yy7;  goto yy6;  } yy9:  ++YYCURSOR;  { printf("hex\n"); return 0; } } } int main(int argc, char **argv) {  for (int i = 1; i < argc; ++i) {  lex(argv[i]);  }  return 0; } 

See also edit

References edit

  1. ^ a b c d e Bumbulis, Peter; Donald D., Cowan (March–December 1993). "RE2C: a more versatile scanner generator". ACM Letters on Programming Languages and Systems. 2 (1–4): 70–84. doi:10.1145/176454.176487. S2CID 14814637.
  2. ^ "Release 3.1". 19 July 2023. Retrieved 3 August 2023.
  3. ^ "Authors, re2c documentation".
  4. ^ "Building PHP". PHP Internals Book. Retrieved 2020-07-20.
  5. ^ "SpamAssassin (sa-compile)".
  6. ^ "Ninja: build.ninja". Ninja. Retrieved 2020-07-20.
  7. ^ "BRL-CAD (tools: re2c)".
  8. ^ "Build Process".
  9. ^ Joseph, Grosch (1989). "Efficient Generation of Table-Driven Scanners". Software: Practice and Experience 19: 1089–1103.
  10. ^ "Submatch extraction, re2c documentation".
  11. ^ Ville, Laurikari (2000). "NFAs with tagged transitions, their conversion to deterministic automata and application to regular expressions" (PDF). Seventh International Symposium on String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings.
  12. ^ Ulya, Trofimovich (2017). "Tagged Deterministic Finite Automata with Lookahead". arXiv:1907.08837 [cs.FL].
  13. ^ Ulya, Trofimovich (2020). "RE2C -- a lexer generator based on lookahead TDFA". Software Impacts. 6: 100027. doi:10.1016/j.simpa.2020.100027.
  14. ^ Ulya, Trofimovich (2021). "Lookahead TDFA in pictures (slides)" (PDF).
  15. ^ "Encodings, re2c documentation".
  16. ^ "Program interface, re2c documentation".
  17. ^ "Storable state, re2c documentation".
  18. ^ "Start conditions, re2c documentation".
  19. ^ "Skeleton, re2c documentation".
  20. ^ "Warnings, re2c documentation".
  21. ^ "Visualization, re2c documentation".
  22. ^ "User manual (C), re2c documentation".
  23. ^ "Official webcite".

External links edit

  • Official website

re2c, free, open, source, lexer, generator, rust, compiles, declarative, regular, expression, specifications, deterministic, finite, automata, originally, written, peter, bumbulis, described, paper, public, domain, been, since, maintained, volunteers, lexer, g. re2c is a free and open source lexer generator for C C Go and Rust It compiles declarative regular expression specifications to deterministic finite automata Originally written by Peter Bumbulis and described in his paper 1 re2c was put in public domain and has been since maintained by volunteers 3 It is the lexer generator adopted by projects such as PHP 4 SpamAssassin 5 Ninja build system 6 and others Together with the Lemon parser generator re2c is used in BRL CAD 7 This combination is also used with STEPcode an implementation of ISO 10303 standard 8 re2cOriginal author s Peter BumbulisInitial releasearound 1994 30 years ago 1994 1 Stable release3 1 2 19 July 2023 7 months ago 19 July 2023 Repositorygithub wbr com wbr skvadrik wbr re2cTypeLexical analyzer generatorLicensePublic domainWebsitere2c wbr org Contents 1 Philosophy 2 Features 3 Syntax 4 Regular expressions 5 Example 6 See also 7 References 8 External linksPhilosophy editThe main goal of re2c is generating fast lexers 1 at least as fast as reasonably optimized C lexers coded by hand Instead of using traditional table driven approach re2c encodes the generated finite state machine directly in the form of conditional jumps and comparisons The resulting program is faster than its table driven counterpart 1 and much easier to debug and understand Moreover this approach often results in smaller lexers 1 as re2c applies a number of optimizations such as DFA minimization and the construction of tunnel automaton 9 Another distinctive feature of re2c is its flexible interface instead of assuming a fixed program template re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment The main idea is that re2c should be a zero cost abstraction for the programmer using it should never result in a slower program than the corresponding hand coded implementation Features editSubmatch extraction 10 re2c supports both POSIX compliant capturing groups and standalone tags 11 with leftmost greedy disambiguation and optional handling of repeated submatch The implementation is based on the lookahead TDFA algorithm 12 13 14 Encoding support 15 re2c supports ASCII UTF 8 UTF 16 UTF 32 UCS 2 and EBCDIC Flexible user interface 16 the generated code uses a few primitive operations in order to interface with the environment read input characters advance to the next input position etc users can redefine these primitives to whatever they need Storable state 17 re2c supports both pull model lexers when lexer runs without interrupts and pulls more input as necessary and push model lexers when lexer is periodically stopped and resumed to parse new chunks of input Start conditions 18 re2c can generate multiple interrelated lexers where each lexer is triggered by a certain condition in program Self validation 19 re2c has a special mode in which it ignores all used defined interface code and generates a self contained skeleton program Additionally re2c generates two files one with the input strings derived from the regular grammar and one with compressed match results that are used to verify lexer behavior on all inputs Input strings are generated so that they extensively cover DFA transitions and paths Data generation happens right after DFA construction and prior to any optimizations but the lexer itself is fully optimized so skeleton programs are capable of revealing any errors in optimizations and code generation Warnings 20 re2c performs static analysis of the program and warns its users about possible deficiencies or bugs such as undefined control flow unreachable code ill formed escape symbols and potential misuse of the interface primitives Debugging Besides generating human readable lexers re2c has a number of options that dump various intermediate representations of the generated lexer such as NFA multiple stages of DFA and the resulting program graph in DOT format 21 Syntax editre2c program can contain any number of re2c blocks Each block consists of a sequence of rules definitions and configurations they can be intermixed but it is generally better to put configurations first then definitions and then rules Rules have the form REGEXP CODE or REGEXP CODE where REGEXP is a regular expression and CODE is a block of C code When REGEXP matches the input string control flow is transferred to the associated CODE There is one special rule the default rule with instead of REGEXP it is triggered if no other rules matches re2c has greedy matching semantics if multiple rules match the rule that matches longer prefix is preferred if the conflicting rules match the same prefix the earlier rule has priority Definitions have the form NAME REGEXP and also NAME REGEXP in Flex compatibility mode Configurations have the form re2c CONFIG VALUE where CONFIG is the name of the particular configuration and VALUE is a number or a string For more advanced usage see the official re2c manual 22 Regular expressions editre2c uses the following syntax for regular expressions foo case sensitive string literal foo case insensitive string literal a xyz a xyz character class possibly negated any character except newline R S difference of character classes R zero or more occurrences of R R one or more occurrences of R R optional R R n repetition of R exactly n times R n repetition of R at least n times R n m repetition of R from n to m times R just R parentheses are used to override precedence or for POSIX style submatch R S concatenation R followed by S R S alternative R or S R S lookahead R followed by S but S is not consumed name the regular expression defined as name except in Flex compatibility mode stag an s tag saves the last input position at which stag matches in a variable named stag mtag an m tag saves all input positions at which mtag matches in a variable named mtagCharacter classes and string literals may contain the following escape sequences a b f n r t v octal escapes ooo and hexadecimal escapes xhh uhhhh and Uhhhhhhhh Example editHere is a very simple program in re2c example re It checks that all input arguments are hexadecimal numbers The code for re2c is enclosed in comments re2c all the rest is plain C code See the official re2c website for more complex examples 23 include lt stdio h gt static int lex const char YYCURSOR const char YYMARKER re2c re2c define YYCTYPE char re2c yyfill enable 0 end x00 hex 0x 0 9a fA F printf err n return 1 hex end printf hex n return 0 int main int argc char argv for int i 1 i lt argc i lex argv i return 0 Given that re2c is o example c example re generates the code below example c The contents of the comment re2c are substituted with a deterministic finite automaton encoded in the form of conditional jumps and comparisons the rest of the program is copied verbatim into the output file There are several code generation options normally re2c uses switch statements but it can use nested if statements as in this example with s option or generate bitmaps and jump tables Which option is better depends on the C compiler re2c users are encouraged to experiment Generated by re2c 1 2 1 on Fri Aug 23 21 59 00 2019 include lt stdio h gt static int lex const char YYCURSOR const char YYMARKER char yych yych YYCURSOR if yych 0 goto yy4 YYCURSOR yy3 printf err n return 1 yy4 yych YYMARKER YYCURSOR if yych x goto yy3 yych YYCURSOR if yych gt 0x01 goto yy8 yy6 YYCURSOR YYMARKER goto yy3 yy7 yych YYCURSOR yy8 if yych lt if yych lt 0x00 goto yy9 if yych lt goto yy6 if yych lt 9 goto yy7 goto yy6 else if yych lt F goto yy7 if yych lt goto yy6 if yych lt f goto yy7 goto yy6 yy9 YYCURSOR printf hex n return 0 int main int argc char argv for int i 1 i lt argc i lex argv i return 0 See also editComparison of parser generatorsReferences edit a b c d e Bumbulis Peter Donald D Cowan March December 1993 RE2C a more versatile scanner generator ACM Letters on Programming Languages and Systems 2 1 4 70 84 doi 10 1145 176454 176487 S2CID 14814637 Release 3 1 19 July 2023 Retrieved 3 August 2023 Authors re2c documentation Building PHP PHP Internals Book Retrieved 2020 07 20 SpamAssassin sa compile Ninja build ninja Ninja Retrieved 2020 07 20 BRL CAD tools re2c Build Process Joseph Grosch 1989 Efficient Generation of Table Driven Scanners Software Practice and Experience 19 1089 1103 Submatch extraction re2c documentation Ville Laurikari 2000 NFAs with tagged transitions their conversion to deterministic automata and application to regular expressions PDF Seventh International Symposium on String Processing and Information Retrieval 2000 SPIRE 2000 Proceedings Ulya Trofimovich 2017 Tagged Deterministic Finite Automata with Lookahead arXiv 1907 08837 cs FL Ulya Trofimovich 2020 RE2C a lexer generator based on lookahead TDFA Software Impacts 6 100027 doi 10 1016 j simpa 2020 100027 Ulya Trofimovich 2021 Lookahead TDFA in pictures slides PDF Encodings re2c documentation Program interface re2c documentation Storable state re2c documentation Start conditions re2c documentation Skeleton re2c documentation Warnings re2c documentation Visualization re2c documentation User manual C re2c documentation Official webcite External links editOfficial website Retrieved from https en wikipedia org w index php title Re2c amp oldid 1210875000, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.