Lexer grammar

To start writing a lexer grammar, create an empty text file with the extension .alpag or .alp. If the file contains only a lexer definition, the extension .alpl can be used to emphasize this. Alpag accepts input files with any extension, but using a standard one is a good idea. For backward compatibility, the extension .lex is also recognized (for files containing a lexer and nothing else).

The examples below assume the file name myLexer.alp.

Inside the file you can place C++ style comments using /*..*/ or // syntax.

/* myLexer – my first lexer */

The contents of the file are divided into sections. A section starts with a %% (double percent) token followed by the name of the section. The types of sections used determine whether the file contains a lexer, a parser, or both.

Place %%lrules at the start of a new line:

/* myLexer – my first lexer */

%%lrules
// everything below %%lrules is lexer rules until next section

The %%lrules section must contain at least one valid rule.

/* myLexer – my first lexer */

%%lrules
[A-Z]+ { return 23; }

The rule above matches an uppercase word: one or more characters in the A-Z range (refer to Regular expression syntax for details).
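To get a feel for what this pattern accepts, it can be tried in an ordinary regular expression engine. The sketch below uses Python's re module purely as a stand-in. It is not Alpag code, the helper name is_upper_word is made up for illustration, and Alpag's regular expression dialect may differ in details.

```python
import re

# Stand-in for the Alpag pattern [A-Z]+ (Python's re module, not Alpag itself).
UPPER_WORD = re.compile(r"[A-Z]+")

def is_upper_word(text):
    """Return True when the whole text is one or more characters in A-Z."""
    return UPPER_WORD.fullmatch(text) is not None

print(is_upper_word("HELLO"))  # True
print(is_upper_word("Hello"))  # False
print(is_upper_word(""))       # False
```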

The user-defined constant (here 23) is returned to the invoking code (usually a parser) whenever the rule is matched. Any positive integer can be used as a user-defined code. Negative integers are reserved for errors.

You can use the %%ldefs section to declare named subexpressions that can later be used in lexer rules. To define a named subexpression, put its name at the start of a line, followed by whitespace and a regular expression. You can then reference the subexpression by its name in curly braces:

/* myLexer – my first lexer */

%%ldefs
// named subexpression: identifier followed by regex
LETTER [A-Z]
%%lrules
// the same as [A-Z]+
{LETTER}+ { return 23; }
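Conceptually, a reference such as {LETTER} is replaced by the expression it names. The following Python sketch illustrates that idea with plain text substitution. It is an assumption about the general technique, not a description of how Alpag implements it; the function name expand and the choice to wrap each definition in a non-capturing group are hypothetical.

```python
import re

def expand(pattern, defs):
    """Replace each {NAME} reference with its definition, wrapped in a group."""
    def substitute(match):
        name = match.group(1)
        # Grouping keeps a following operator (such as +) applied to the whole definition.
        return "(?:" + defs[name] + ")"
    # Names start with a letter or underscore, so {2,3} style quantifiers are left alone.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)

defs = {"LETTER": "[A-Z]"}
expanded = expand("{LETTER}+", defs)
print(expanded)  # (?:[A-Z])+
```

The expanded pattern accepts the same input as [A-Z]+, which is why the two rules are equivalent.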

The initial part of an Alpag file, before any named section, is reserved for setting global options. Options are declared using the %option keyword followed by the option name and a value. Alpag has many options controlling lexer calculation and code generation (the Options section contains a complete reference).

When setting options, it is best to use the highlighting extension for Visual Studio. It provides a list of available options along with spellchecking of their names and basic help.

Here is an option controlling name of generated lexer class:

/* myLexer – my first lexer */

// option: lexer / code generation / lexer class / name of class
%option Lexer.Code.Lexer.ClassName "MyLexer"

%%ldefs
LETTER [A-Z]
%%lrules
{LETTER}+ { return 23; }

The lexer must provide rules for all characters that can appear in the input, even those that are not relevant and will be skipped in processing.

We shall add one more rule for matching spaces in the input. Below is a lexer grammar with two rules:

/* myLexer – my first lexer */

%option Lexer.Code.Lexer.ClassName "MyLexer"
%%ldefs
LETTER [A-Z]
%%lrules
{LETTER}+ { return 23; }
\ + /* skip spaces (note space after ‘\’ and before ‘+’) */

The added rule ‘\ +’ matches one or more spaces. The space must be escaped with a backslash character.

The added rule has no action code associated with it. When the lexer matches it, no code is executed and the spaces are effectively skipped.

The file above is sufficient to generate a valid lexer. This lexer detects uppercase words separated by spaces. Note that any other input character (e.g. a newline or a lowercase letter) that is not covered by the defined rules will cause the lexer to fail.
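The behaviour described above can be imitated with a small hand-written scanner. The Python sketch below is only a simulation of the two rules, not the code Alpag generates: the names RULES and tokenize are hypothetical, and raising an exception on unmatched input is an assumption standing in for whatever error reporting the real lexer uses.

```python
import re

WORD = 23  # user-defined code returned by the {LETTER}+ rule

# (pattern, code) pairs; a code of None means "no action, skip the match".
RULES = [
    (re.compile(r"[A-Z]+"), WORD),
    (re.compile(r" +"), None),
]

def tokenize(text):
    """Return (code, lexeme) pairs; raise ValueError when no rule matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        for pattern, code in RULES:
            match = pattern.match(text, pos)
            if match:
                if code is not None:
                    tokens.append((code, match.group()))
                pos = match.end()
                break
        else:
            # No rule covers this character, so scanning cannot continue.
            raise ValueError(f"no rule matches {text[pos]!r} at position {pos}")
    return tokens

print(tokenize("HELLO  WORLD"))  # [(23, 'HELLO'), (23, 'WORLD')]
```

The two rules never match at the same position, so trying them in order is enough here. Calling tokenize("Hello") raises ValueError at the lowercase 'e', mirroring the failure on uncovered input.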

Save the file and run from the command line:

alpag myLexer.alp

If there were no errors, you should see the generated lexer code file myLexerLexer.cs in the same directory as the input file.

By default, a single code file is generated containing all components of the lexer. Depending on the selected options, Alpag can generate multiple files containing individual lexer components. By default these files are named after the input file (here myLexer) with an additional suffix describing the contents of the particular file (here Lexer).

Inspect the generated file. Search for the string 'return 23' to see where your custom code is located.


Alpag Manual