alpag.net manual
Introduction / Lexer / Lexical analysis process
< Regular grammar | Lexer algorithm >

Lexical analysis process

The lexer does not define a single regular grammar for the entire file. Instead, several alternative patterns (regular expressions) are given for the various character sequences that can appear in the input stream. Each of these patterns is itself a small regular grammar. When analyzing input, the lexer tries to match one of the defined patterns against the current position in the file. After finding a match, the lexer returns that match as a recognized token, advances in the input stream to the end of the just-matched text, and performs a new search at that position.

There is no relationship between consecutive searches performed by the lexer: each search-match-and-return operation is independent. This procedure continues until the end of the file is reached. It is the responsibility of whoever reads the lexer output (usually the parser) to handle and analyze the sequence of consecutive tokens returned by the lexer.
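The search-match-advance loop can be sketched as follows (a minimal illustration only, not Alpag's generated code; the rule names and the scan() helper are invented for this sketch, and the rules do not overlap, so the first matching pattern suffices here — disambiguation between overlapping patterns is described later in this section):

```python
import re

# Illustrative rule table: (token name, compiled pattern).
RULES = [("WORD", re.compile(r"[A-Za-z]+")),
         ("SPACE", re.compile(r"[ \t]+"))]

def scan(text):
    pos = 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)   # try a match at the current position
            if m:
                yield (name, m.group())    # report the match as a token
                pos = m.end()              # advance past the matched text
                break                      # each search is independent: start over
        else:
            raise ValueError(f"no pattern matches at position {pos}")

list(scan("AB cd"))
# -> [('WORD', 'AB'), ('SPACE', ' '), ('WORD', 'cd')]
```

Note that the loop keeps no state between iterations, which is exactly the independence of consecutive searches described above.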

Each user-defined pattern can include an action: a piece of code to execute when that pattern is matched. A regular pattern together with attached action code makes up a lexer rule.

The user must define a complete set of patterns (regular expressions) covering all possible character sequences that can appear in the input stream. If the lexer cannot find any pattern matching the character sequence found at some point in the input stream, an error occurs.

When a pattern is matched, the user code for that pattern is executed. This code can perform any action, but usually it reports the match. The reported match is usually passed directly to the parser as an input token.

The lexer always tries to match the longest possible sequence of characters that can be matched by any of the patterns.

Patterns can overlap: an input sequence can match more than one pattern. In that case, the pattern that gives the longer match wins. If multiple patterns give matches of equal length, the pattern defined earlier in the grammar wins.
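These two tie-breaking rules can be sketched as a helper that tries every rule at one position (best_match() is a hypothetical helper invented for this sketch, not part of Alpag's generated code; the token names anticipate the example rules shown below):

```python
import re

def best_match(rules, text, pos):
    best = None
    for index, (token, pattern) in enumerate(rules):
        m = re.compile(pattern).match(text, pos)
        # strict '>' keeps the earlier rule when match lengths are equal
        if m and (best is None or m.end() > best[2]):
            best = (index, token, m.end())
    return best

rules = [("AN_UPPERCASE_WORD", r"[A-Z]+"),   # rule (1)
         ("A_WORD", r"[A-Za-z]+")]           # rule (2)

best_match(rules, "A Sequence", 0)  # -> (0, 'AN_UPPERCASE_WORD', 1): tie, earlier rule wins
best_match(rules, "Sequence", 0)    # -> (1, 'A_WORD', 8): longer match wins
```

The strict comparison is the whole trick: a later rule can displace an earlier one only by matching strictly more characters.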

Lexer rules are defined by placing a regular expression pattern (like [A-Z]+) at the start of a line. Optional code to execute when the pattern is matched can be placed after the regular expression, separated by whitespace.

Example lexer rules are shown in Fig. 1 (refer to the Regular expression syntax chapter for details).

Fig. 1 Example lexer rules for matching words and skipping space
// (1) rule for recognizing all-caps words
// returns a predefined token 'AN_UPPERCASE_WORD'
[A-Z]+ { return AN_UPPERCASE_WORD; }
// (2) rule for recognizing any word, returns a token 'A_WORD'
[A-Za-z]+ { return A_WORD; }
// (3) rule for matching space '\ ' or tab '\t',
// (backslash '\' is an escape character)
// no action to be executed, so whitespace will be ignored
[\ \t]+ /* ignore space */

In the example, three rules are defined: rule (1) matches all-caps words, rule (2) matches any word, and rule (3) silently skips spaces and tabs.

The lexer starts analysis at the beginning of the input text and tries to match the longest possible character sequence using one of the user-defined patterns. When a match is found and reported, the lexer skips to the end of the matched sequence and tries to find the next match. This procedure is repeated until the end of the input text is reached.

Fig. 2 shows how a sample text is split into tokens using the rules defined above.

The text being analyzed is "A Sequence of Words".

Fig. 2 Example of lexer processing
Input:
A Sequence of Words
Recognized tokens:
A          matched by (1) [A-Z]+      code: return AN_UPPERCASE_WORD
(space)    matched by (3) [\ \t]+     code: /* ignore space */
Sequence   matched by (2) [A-Za-z]+   code: return A_WORD
(space)    matched by (3) [\ \t]+     code: /* ignore space */
of         matched by (2) [A-Za-z]+   code: return A_WORD
(space)    matched by (3) [\ \t]+     code: /* ignore space */
Words      matched by (2) [A-Za-z]+   code: return A_WORD

The lexer starts at the beginning of the text, before the letter A, and tries to find the longest match. At this point two patterns match the input. Rule (1) [A-Z]+ matches the single uppercase letter A. Rule (2) [A-Za-z]+ also matches the letter A, since it is also an any-case letter. Both matches are of equal length, so rule (1) wins, being defined earlier in the file, and the action for rule (1) is fired.

Next, the lexer moves to the position after the letter A and tries to find a match. Only one rule matches at this point: rule (3) [\ \t]+ for whitespace. No action code is defined for this rule, so nothing is reported.

The lexer moves to the position after the matched space character and tries to find the next match. The letter S found at this position can be matched by both rules (1) and (2), but rule (2) can match more characters. Since the lexer always looks for the longest possible match, rule (2) wins with the match Sequence.

The whole procedure is repeated until the end of the input stream is reached.

The user is free to attach any code to lexer rules. In the example above, user-defined token identifiers are returned whenever a syntactically relevant string is matched. This produces the following stream of tokens:

AN_UPPERCASE_WORD( A ) A_WORD( Sequence ) A_WORD( of ) A_WORD( Words )

Returning a token from the lexer on each match is known as pull operation. An alternative way of organizing the process is to invoke some external code on each match and keep analyzing without exiting the lexer. This is known as push operation.
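The push variant can be sketched by handing the lexer a callback to invoke on each match (lex_push and the two pattern names are illustrative, not Alpag's API; the loop itself never returns a token):

```python
import re

WORD = re.compile(r"[A-Za-z]+")
SPACE = re.compile(r"[ \t]+")

def lex_push(text, on_token):
    pos = 0
    while pos < len(text):
        m = WORD.match(text, pos) or SPACE.match(text, pos)
        if m is None:
            raise ValueError(f"no rule matches at position {pos}")
        if m.re is WORD:
            on_token(m.group())   # invoke external code, then keep scanning
        pos = m.end()

collected = []
lex_push("A Sequence of Words", collected.append)
# collected == ['A', 'Sequence', 'of', 'Words']
```

In pull mode the parser drives the lexer; in push mode the lexer drives the callback, which is convenient when the whole input should be scanned in one pass.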
