Lexer definitions

The lexer definitions section is declared using %%ldefs or %%ld.

When this section is present in the input file, the file is assumed to contain a lexer definition.

Lexer modes

By default, all lexer rules are always active. It is sometimes useful to selectively disable some rules, which can be achieved using modes.

The user can declare multiple modes and assign lexer rules to them. At any moment exactly one mode is 'current', and only rules assigned to that mode are enabled. A single rule can be assigned to multiple modes. Modes are switched programmatically while the lexer runs.

The default lexer mode is called INITIAL. When the lexer starts, this default mode is current. All lexer rules, unless specified otherwise, are assigned to the INITIAL mode.

The user can declare two kinds of modes:

shared, also known as inclusive (declared using %s). This kind of mode includes all rules explicitly tagged with the mode, plus all rules that are not tagged with any mode.
exclusive (declared using %x). This kind of mode includes only rules explicitly tagged with it.

Modes are also known as 'start conditions'.
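
The rule selection described above can be modelled in a few lines of Python. This is only a sketch of the semantics, not Alpag's implementation: the names Rule, ModeTable, STR and COMMENT are hypothetical, and the sketch assumes that untagged rules are active in INITIAL and in every shared mode, as the text states.

```python
from dataclasses import dataclass, field

INITIAL = "INITIAL"  # name of the default mode, as given in the text


@dataclass
class Rule:
    pattern: str
    modes: frozenset = frozenset()  # empty: rule is not tagged with any mode


@dataclass
class ModeTable:
    shared: set = field(default_factory=set)     # modes declared with %s
    exclusive: set = field(default_factory=set)  # modes declared with %x

    def is_active(self, rule, current):
        if current in rule.modes:
            return True   # rule is explicitly tagged with the current mode
        if rule.modes:
            return False  # rule is tagged, but only with other modes
        # Untagged rules belong to INITIAL and to every shared mode,
        # but never to an exclusive mode.
        return current == INITIAL or current in self.shared

    def active_rules(self, rules, current):
        return [r.pattern for r in rules if self.is_active(r, current)]


# Hypothetical modes and rules, for illustration only.
table = ModeTable(shared={"STR"}, exclusive={"COMMENT"})
rules = [
    Rule("[0-9]+"),                        # untagged
    Rule('"', frozenset({"STR"})),         # tagged with shared mode STR
    Rule(r"\*/", frozenset({"COMMENT"})),  # tagged with exclusive mode COMMENT
]

for mode in (INITIAL, "STR", "COMMENT"):
    print(mode, table.active_rules(rules, mode))
```

In this model the untagged rule is visible in INITIAL and in the shared mode STR, while the exclusive mode COMMENT sees only the rule tagged with it.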

Named regular expressions

Sometimes multiple regular expressions contain the same element or subpart. Such an element can be declared once, as a named regular expression, and then referenced elsewhere by its name in curly braces:

{name}

A named regular expression is defined by placing its name (an identifier) at the beginning of a line, followed by at least one space and the regular expression:

identifier regularExpression

Example

%%ldefs
DIGIT [0-9] // named subexpression for a single digit
NUMBER {DIGIT}+ // named subexpression for one or more digits, uses DIGIT
%%lrules
{NUMBER}\.{NUMBER} { ... } // rule matching real number
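
To illustrate how {name} references resolve, the following Python sketch expands the definitions from the example into a single ordinary regular expression. It is a model of the idea, not Alpag's actual algorithm: the function expand is hypothetical, and wrapping each expansion in a non-capturing group is an assumption made here so that a following operator such as + applies to the whole named expression.

```python
import re


def expand(pattern, definitions):
    """Replace each {name} reference with its recursively expanded definition."""
    def substitute(match):
        name = match.group(1)
        if name not in definitions:
            return match.group(0)  # unknown name: leave the text untouched
        # Assumption: group the expansion so trailing operators bind to all of it.
        return "(?:" + expand(definitions[name], definitions) + ")"

    # Only identifiers in braces are treated as references.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)


# Definitions taken from the example above.
definitions = {
    "DIGIT": "[0-9]",
    "NUMBER": "{DIGIT}+",
}

real = expand(r"{NUMBER}\.{NUMBER}", definitions)
print(real)
print(bool(re.fullmatch(real, "3.14")))
print(bool(re.fullmatch(real, "3.")))
```

The sketch does not guard against definitions that reference themselves; a real tool would have to detect such cycles.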

Summary of lexer definitions

exclusive mode

%x identifier

Declares an exclusive lexer mode

shared mode

%s identifier

Declares a shared (inclusive) lexer mode

retcode

Declares a return code which can be assigned to multiple lexer rules

%retcode name <valueType> { code }
%retcode <valueType> { code }

where:

valueType is the name of the value type
name is an optional name used to differentiate multiple return codes for the same value type
code is the user code for reporting a token. It can include the $$ placeholder, which is replaced by the identifier of the returned token.

The %retcode command declares a return code which can be referenced from lexer rules. Return codes are declared separately for each value type. A single value type can have only one unnamed return code. If necessary, additional return codes with custom names can be defined.

Return codes are referenced from lexer rules using the %return command. The %return command specifies the value type and an optional name, which together are used to look up the return code. It also specifies the identifier of the token reported by the particular rule. The $$ placeholder in the return code is replaced with that token identifier.
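
The lookup and substitution described above can be sketched in Python as follows. This models the behaviour only: the class ReturnCodes, the value type "int", the name "hex", the token NUMBER and the code strings are all hypothetical, and the way Alpag stores return codes internally is not specified in this section.

```python
class ReturnCodes:
    """Registry of return codes keyed by (value type, optional name)."""

    def __init__(self):
        self._codes = {}  # (value_type, name or None) -> code template

    def declare(self, value_type, code, name=None):
        # Models %retcode: one unnamed code per value type, plus named extras.
        key = (value_type, name)
        if key in self._codes:
            raise ValueError(f"duplicate return code for {key}")
        self._codes[key] = code

    def render(self, value_type, token, name=None):
        # Models %return: look up the code, then replace $$ with the token.
        return self._codes[(value_type, name)].replace("$$", token)


codes = ReturnCodes()
# Hypothetical user code; only the $$ placeholder comes from the text.
codes.declare("int", "value = ParseDecimal(text); return $$;")
codes.declare("int", "value = ParseHex(text); return $$;", name="hex")

print(codes.render("int", "NUMBER"))
print(codes.render("int", "NUMBER", name="hex"))
```

Keying the registry by the pair of value type and name reflects the rule that each value type has at most one unnamed return code, while named ones can be added freely.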

Alpag Manual