
Lexical analysis

Lexical tokenization is the conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a "lexer" program. In the case of a natural language, those categories include nouns, verbs, adjectives, punctuation, etc. In the case of a programming language, the categories include identifiers, operators, grouping symbols, and data types. Lexical tokenization is not the same process as the probabilistic tokenization used for large language models' data preprocessing, which encodes text into numerical tokens using byte pair encoding.

Rule-based programs

A rule-based program that performs lexical tokenization is called a tokenizer,[1] or scanner, although scanner is also a term for the first stage of a lexer. A lexer forms the first phase of a compiler frontend in processing. Analysis generally occurs in one pass. Lexers and parsers are most often used for compilers, but can be used for other computer language tools, such as prettyprinters or linters. Lexing can be divided into two stages: scanning, which segments the input string into syntactic units called lexemes and categorizes these into token classes; and evaluating, which converts lexemes into processed values.

Lexers are generally quite simple, with most of the complexity deferred to the parser or semantic analysis phases, and can often be generated by a lexer generator, notably lex or derivatives. However, lexers can sometimes include some complexity, such as phrase structure processing to make input easier and simplify the parser, and may be written partly or fully by hand, either to support more features or for performance.

Disambiguation of "lexeme" edit

What is called "lexeme" in rule-based natural language processing is not equal to what is called lexeme in linguistics. What is called "lexeme" in rule-based natural language processing can be equal to the linguistic equivalent only in analytic languages, such as English, but not in highly synthetic languages, such as fusional languages. What is called a lexeme in rule-based natural language processing is more similar to what is called a word in linguistics (not to be confused with a word in computer architecture), although in some cases it may be more similar to a morpheme.

Lexical token and lexical tokenization

A lexical token is a string with an assigned and thus identified meaning, in contrast to the probabilistic token used in large language models. A lexical token consists of a token name and an optional token value. The token name is a category of a rule-based lexical unit.[2]

Examples of common tokens
Token name Explanation Sample token values
identifier Names assigned by the programmer. x, color, UP
keyword Reserved words of the language. if, while, return
separator/punctuator Punctuation characters and paired delimiters. }, (, ;
operator Symbols that operate on arguments and produce results. +, <, =
literal Numeric, logical, textual, and reference literals. true, 6.02e23, "music"
comment Line or block comments. Usually discarded. /* Retrieves user data */, // must be negative
whitespace Groups of non-printable characters. Usually discarded.

Consider this expression in the C programming language:

x = a + b * 2;

The lexical analysis of this expression yields the following sequence of tokens:

[(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator, ;)]
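A minimal sketch in Python of how such a token sequence could be produced with a few regular expressions. The token names and patterns here are simplified assumptions for illustration, not the rules of an actual C lexer.

import re

# A minimal rule-based tokenizer for the expression above (illustrative only).
TOKEN_SPEC = [
    ("literal",    r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator",   r"[=+\-*/]"),
    ("separator",  r";"),
    ("whitespace", r"\s+"),          # recognized, then discarded below
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "whitespace":
            yield (m.lastgroup, m.group())

print(list(tokenize("x = a + b * 2;")))
# [('identifier', 'x'), ('operator', '='), ('identifier', 'a'), ('operator', '+'),
#  ('identifier', 'b'), ('operator', '*'), ('literal', '2'), ('separator', ';')]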

A token name is what might be termed a part of speech in linguistics.

Lexical tokenization is the conversion of raw text into (semantically or syntactically) meaningful lexical tokens, belonging to categories defined by a "lexer" program, such as identifiers, operators, grouping symbols, and data types. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

For example, in the text string:

The quick brown fox jumps over the lazy dog

the string isn't implicitly segmented on spaces, as a natural language speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string " " or regular expression /\s{1}/).
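The split itself is mechanical once the delimiter is fixed; a short Python sketch, assuming only the single-space delimiter described above:

import re

text = "The quick brown fox jumps over the lazy dog"
tokens = re.split(r"\s", text)   # equivalently text.split(" ") for the literal " " delimiter
print(len(text), len(tokens))    # 43 9
print(tokens)                    # ['The', 'quick', 'brown', ..., 'dog']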

When a token class represents more than one possible lexeme, the lexer often saves enough information to reproduce the original lexeme, so that it can be used in semantic analysis. The parser typically retrieves this information from the lexer and stores it in the abstract syntax tree. This is necessary in order to avoid information loss in the case where numbers may also be valid identifiers.

Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens include: regular expressions, specific sequences of characters termed a flag, specific separating characters called delimiters, and explicit definition by a dictionary. Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages. A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each "(" is matched with a ")".

When a lexer feeds tokens to the parser, the representation used is typically an enumerated type, a list of number representations. For example, "Identifier" is represented with 0, "Assignment operator" with 1, "Addition operator" with 2, etc.
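A sketch of what such an enumeration might look like in Python; the names and numbering are hypothetical, chosen only to match the example above.

from enum import IntEnum

class TokenKind(IntEnum):          # hypothetical numbering for illustration
    IDENTIFIER = 0
    ASSIGNMENT_OPERATOR = 1
    ADDITION_OPERATOR = 2

# A token as the parser might receive it: a numeric kind plus the lexeme.
token = (TokenKind.IDENTIFIER, "x")
print(int(token[0]), token[1])     # 0 x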

Tokens are often defined by regular expressions, which are understood by a lexical analyzer generator such as lex, or by handcoded equivalent finite-state automata. The lexical analyzer (generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is termed tokenizing. If the lexer finds an invalid token, it will report an error.

Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.

Lexical grammar

The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions; they define the set of possible character sequences (lexemes) of a token. A lexer recognizes strings, and for each kind of string found the lexical program takes an action, most simply producing a token.

Two important common lexical categories are white space and comments. These are also defined in the grammar and processed by the lexer, but may be discarded (not producing any tokens) and considered non-significant, at most separating two tokens (as in if x instead of ifx). There are two important exceptions to this. First, in off-side rule languages that delimit blocks with indenting, initial whitespace is significant, as it determines block structure, and is generally handled at the lexer level; see phrase structure, below. Secondly, in some uses of lexers, comments and whitespace must be preserved – for example, a prettyprinter also needs to output the comments and some debugging tools may provide messages to the programmer showing the original source code. In the 1960s, notably for ALGOL, whitespace and comments were eliminated as part of the line reconstruction phase (the initial phase of the compiler frontend), but that separate phase has since been dropped and these are now handled by the lexer.

Details

Scanner

The first stage, the scanner, is usually based on a finite-state machine (FSM). It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are termed lexemes). For example, an integer lexeme may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is termed the maximal munch, or longest match, rule). In some languages, the lexeme creation rules are more complex and may involve backtracking over previously read characters. For example, in C, one 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal.
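A sketch of the maximal munch rule for an integer lexeme, written as a simple character loop rather than a generated state machine:

def scan_integer(source: str, pos: int):
    """Consume digit characters until a non-digit is reached (maximal munch).

    Returns the lexeme starting at `pos` and the position just past it.
    Illustrative only; production scanners are usually table-driven FSMs.
    """
    start = pos
    while pos < len(source) and source[pos].isdigit():
        pos += 1
    return source[start:pos], pos

print(scan_integer("1234+5", 0))   # ('1234', 4)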

Evaluator

A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing: only the type is needed. Similarly, sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments. The evaluators for identifiers are usually simple (literally representing the identifier), but may include some unstropping. The evaluators for integer literals may pass the string on (deferring evaluation to the semantic analysis phase), or may perform evaluation themselves, which can be involved for different bases or floating point numbers. For a simple quoted string literal, the evaluator needs to remove only the quotes, but the evaluator for an escaped string literal incorporates a lexer, which unescapes the escape sequences.
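A sketch of an evaluator along these lines; the category names and conversion rules are assumptions for illustration, not any particular compiler's behaviour.

def evaluate(kind: str, lexeme: str):
    """Turn a lexeme into a token value (or suppress it entirely)."""
    if kind == "integer":
        return int(lexeme, 0)       # base prefixes such as 0x / 0o handled here
    if kind == "string":
        return lexeme[1:-1]         # simple quoted string: strip the quotes only
    if kind in ("whitespace", "comment"):
        return None                 # suppressed: never reaches the parser
    return lexeme                   # identifiers, operators: value is the text

print(evaluate("integer", "0x1A"))    # 26
print(evaluate("string", '"music"'))  # music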

For example, in the source code of a computer program, the string

net_worth_future = (assets - liabilities);

might be converted into the following lexical token stream; whitespace is suppressed and special characters have no value:

IDENTIFIER net_worth_future EQUALS OPEN_PARENTHESIS IDENTIFIER assets MINUS IDENTIFIER liabilities CLOSE_PARENTHESIS SEMICOLON 

Lexers may be written by hand. This is practical if the list of tokens is small, but lexers are more often generated by automated tooling as part of a compiler-compiler toolchain. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production rule in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state transition table for a finite-state machine (which is plugged into template code for compiling and executing).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, an IDENTIFIER token might be any English alphabetic character or an underscore, followed by any number of instances of ASCII alphanumeric characters and/or underscores. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".
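Applying that same pattern directly, as a quick Python check:

import re

IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

print(bool(IDENTIFIER.fullmatch("net_worth_future")))  # True
print(bool(IDENTIFIER.fullmatch("9lives")))            # False: cannot start with a digit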

Regular expressions and the finite-state machines they generate are not powerful enough to handle recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." They are unable to keep count, and verify that n is the same on both sides, unless a finite set of permissible values exists for n. It takes a full parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end (see example[3] in the Structure and Interpretation of Computer Programs book).
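A sketch of the stack idea, kept separate from any real parser:

def parentheses_balanced(tokens):
    """Check nesting with a stack, which a finite automaton cannot do for
    unbounded depth. A real parser folds this into its grammar instead."""
    stack = []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            if not stack:
                return False
            stack.pop()
    return not stack

print(parentheses_balanced(list("((x))")))  # True
print(parentheses_balanced(list("(()")))    # False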

Obstacles

Typically, lexical tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:

  • Punctuation and whitespace may or may not be included in the resulting list of tokens.
  • All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
  • Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

In languages that use inter-word spaces (such as most that use the Latin alphabet, and most programming languages), this approach is fairly straightforward. However, even here there are many edge cases such as contractions, hyphenated words, emoticons, and larger constructs such as URIs (which for some purposes may count as single tokens). A classic example is "New York-based", which a naive tokenizer may break at the space even though the better break is (arguably) at the hyphen.

Tokenization is particularly difficult for languages written in scriptio continua which exhibit no word boundaries such as Ancient Greek, Chinese,[4] or Thai. Agglutinative languages, such as Korean, also make tokenization tasks complicated.

Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special-cases, or fitting the tokens to a language model that identifies collocations in a later processing step.

Lexer generator

Lexers are often generated by a lexer generator, analogous to parser generators, and such tools often come together. The most established is lex, paired with the yacc parser generator, or rather some of their many reimplementations, like flex (often paired with GNU Bison). These generators are a form of domain-specific language, taking in a lexical specification – generally regular expressions with some markup – and emitting a lexer.

These tools yield very fast development, which is very important in early development, both to get a working lexer and because a language specification may change often. Further, they often provide advanced features, such as pre- and post-conditions which are hard to program by hand. However, an automatically generated lexer may lack flexibility, and thus may require some manual modification, or an all-manually written lexer.

Lexer performance is a concern, and optimizing is worthwhile, more so in stable languages where the lexer is run very often (such as C or HTML). lex/flex-generated lexers are reasonably fast, but improvements of two to three times are possible using more tuned generators. Hand-written lexers are sometimes used, but modern lexer generators produce faster lexers than most hand-coded ones. The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach.[dubious ] With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c[5] have proven to produce engines that are between two and three times faster than flex produced engines.[citation needed] It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools.

Phrase structure

Lexical analysis mainly segments the input stream of characters into tokens, simply grouping the characters into pieces and categorizing them. However, the lexing may be significantly more complex; most simply, lexers may omit tokens or insert added tokens. Omitting tokens, notably whitespace and comments, is very common, when these are not needed by the compiler. Less commonly, added tokens may be inserted. This is done mainly to group tokens into statements, or statements into blocks, to simplify the parser.

Line continuation

Line continuation is a feature of some languages where a newline is normally a statement terminator. Most often, ending a line with a backslash (immediately followed by a newline) results in the line being continued – the following line is joined to the prior line. This is generally done in the lexer: the backslash and newline are discarded, rather than the newline being tokenized. Examples include bash,[6] other shell scripts and Python.[7]
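A sketch of the idea, assuming only the backslash-newline convention; real lexers (bash's or Python's, for instance) also have to respect string literals and comments, which this ignores.

def join_continued_lines(source: str) -> str:
    """Discard backslash-newline pairs so a continued line reaches later
    stages as a single logical line."""
    return source.replace("\\\n", "")

print(join_continued_lines("total = 1 + \\\n        2\n"))
# total = 1 +         2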

Semicolon insertion

Many languages use the semicolon as a statement terminator. Most often this is mandatory, but in some languages the semicolon is optional in many contexts. This is mainly done at the lexer level, where the lexer outputs a semicolon into the token stream, despite one not being present in the input character stream, and is termed semicolon insertion or automatic semicolon insertion. In these cases, semicolons are part of the formal phrase grammar of the language, but may not be found in input text, as they can be inserted by the lexer. Optional semicolons or other terminators or separators are also sometimes handled at the parser level, notably in the case of trailing commas or semicolons.

Semicolon insertion is a feature of BCPL and its distant descendant Go,[8] though it is absent in B or C.[9] Semicolon insertion is present in JavaScript, though the rules are somewhat complex and much-criticized; to avoid bugs, some recommend always using semicolons, while others use initial semicolons, termed defensive semicolons, at the start of potentially ambiguous statements.
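A rough sketch of a Go-style rule: emit a semicolon token at a newline whenever the last token on the line could plausibly end a statement. The token classes below are illustrative assumptions, not Go's full specification.

STATEMENT_ENDERS = {"identifier", "literal", "closing_delimiter"}

def insert_semicolons(lines_of_tokens):
    """Each element of `lines_of_tokens` is the token list for one source line."""
    output = []
    for line in lines_of_tokens:
        output.extend(line)
        if line and line[-1][0] in STATEMENT_ENDERS:
            output.append(("semicolon", ";"))
    return output

print(insert_semicolons([[("identifier", "x"), ("operator", "="), ("literal", "1")]]))
# [('identifier', 'x'), ('operator', '='), ('literal', '1'), ('semicolon', ';')]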

Semicolon insertion (in languages with semicolon-terminated statements) and line continuation (in languages with newline-terminated statements) can be seen as complementary: semicolon insertion adds a token, even though newlines generally do not generate tokens, while line continuation prevents a token from being generated, even though newlines generally do generate tokens.

Off-side rule

The off-side rule (blocks determined by indenting) can be implemented in the lexer, as in Python, where increasing the indenting results in the lexer emitting an INDENT token, and decreasing the indenting results in the lexer emitting one or more DEDENT tokens.[10] These tokens correspond to the opening brace { and closing brace } in languages that use braces for blocks, which means that the phrase grammar does not depend on whether braces or indenting are used. This requires that the lexer hold state, namely a stack of indent levels, so that it can detect when the indenting changes; as a result, the lexical grammar is not context-free: INDENT and DEDENT depend on the contextual information of prior indent levels.
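A sketch of that stack-of-indents bookkeeping; Python's real rules also handle tabs, blank lines, and continuation lines, which this ignores.

def offside_tokens(lines):
    """Emit INDENT/DEDENT markers from leading spaces, using a stack of levels."""
    stack, tokens = [0], []
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        if indent > stack[-1]:
            stack.append(indent)
            tokens.append("INDENT")
        while indent < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        tokens.append(("LINE", line.strip()))
    while len(stack) > 1:              # close any blocks still open at end of input
        stack.pop()
        tokens.append("DEDENT")
    return tokens

print(offside_tokens(["if x:", "    y = 1", "z = 2"]))
# [('LINE', 'if x:'), 'INDENT', ('LINE', 'y = 1'), 'DEDENT', ('LINE', 'z = 2')]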

Context-sensitive lexing

Generally lexical grammars are context-free, or almost so, and thus require no looking back or ahead, or backtracking, which allows a simple, clean, and efficient implementation. This also allows simple one-way communication from lexer to parser, without needing any information flowing back to the lexer.

There are exceptions, however. Simple examples include: semicolon insertion in Go, which requires looking back one token; concatenation of consecutive string literals in Python,[7] which requires holding one token in a buffer before emitting it (to see if the next token is another string literal); and the off-side rule in Python, which requires maintaining a count of indent level (indeed, a stack of each indent level). These examples all only require lexical context, and while they complicate a lexer somewhat, they are invisible to the parser and later phases.
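For instance, the string-concatenation case can be handled by holding the most recently emitted token and merging when the next one is also a string literal; a sketch, with hypothetical token tuples:

def concat_adjacent_strings(tokens):
    """Merge consecutive string-literal tokens before they reach the parser."""
    output = []
    for kind, value in tokens:
        if kind == "string" and output and output[-1][0] == "string":
            output[-1] = ("string", output[-1][1] + value)   # extend the buffered token
        else:
            output.append((kind, value))
    return output

print(concat_adjacent_strings([("string", "foo"), ("string", "bar"), ("number", "1")]))
# [('string', 'foobar'), ('number', '1')]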

A more complex example is the lexer hack in C, where the token class of a sequence of characters cannot be determined until the semantic analysis phase, since typedef names and variable names are lexically identical but constitute different token classes. Thus in the hack, the lexer calls the semantic analyzer (say, symbol table) and checks if the sequence requires a typedef name. In this case, information must flow back not from the parser only, but from the semantic analyzer back to the lexer, which complicates design.

See also

  • Lexicalization
  • Lexical semantics
  • List of parser generators

References

  1. ^ "Anatomy of a Compiler and The Tokenizer". www.cs.man.ac.uk.
  2. ^ page 111, "Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, as quoted in https://stackoverflow.com/questions/14954721/what-is-the-difference-between-token-and-lexeme
  3. ^ "Structure and Interpretation of Computer Programs". mitpress.mit.edu. Archived from the original on 2012-10-30. Retrieved 2009-03-07.
  4. ^ Huang, C., Simon, P., Hsieh, S., & Prevot, L. (2007) Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Word break Identification
  5. ^ Bumbulis, P.; Cowan, D. D. (Mar–Dec 1993). "RE2C: A more versatile scanner generator". ACM Letters on Programming Languages and Systems. 2 (1–4): 70–84. doi:10.1145/176454.176487. S2CID 14814637.
  6. ^ Bash Reference Manual, 3.1.2.1 Escape Character
  7. ^ a b "3.6.4 Documentation". docs.python.org.
  8. ^ Effective Go, "Semicolons"
  9. ^ "Semicolons in Go", golang-nuts, Rob 'Commander' Pike, 12/10/09
  10. ^ "Lexical analysis > Indentation". The Python Language Reference. Retrieved 21 June 2023.

Sources

  • Compiling with C# and Java, Pat Terry, 2005, ISBN 032126360X
  • Algorithms + Data Structures = Programs, Niklaus Wirth, 1975, ISBN 0-13-022418-9
  • Compiler Construction, Niklaus Wirth, 1996, ISBN 0-201-40353-6
  • Sebesta, R. W. (2006). Concepts of programming languages (7th ed.), p. 177. Boston: Pearson/Addison-Wesley.

External links

  • Yang, W.; Tsay, Chey-Woei; Chan, Jien-Tsai (2002). "On the applicability of the longest-match rule in lexical analysis". Computer Languages, Systems & Structures. 28 (3): 273–288. doi:10.1016/S0096-0551(02)00014-0. NSC 86-2213-E-009-021 and NSC 86-2213-E-009-079.
  • Trim, Craig (Jan 23, 2013). "The Art of Tokenization". Developer Works. IBM. Archived from the original on 2019-05-30.
  • Word Mention Segmentation Task, an analysis
