Parser rules section

The parser rules section is declared using %prules or %pr.

Rule format

A parser rule starts with a nonterminal, which must be placed at the beginning of a line, followed by a colon ':'. After the colon come the right-hand side elements of the production: terminals or nonterminals. Multiple right-hand sides (and thus multiple grammar productions) can be declared by separating them with a pipe '|'. A grammar rule is terminated with a semicolon ';'.

Each production can contain a code block that is executed when the production has been fully recognized by the parser and is about to be reduced. This code block is placed at the end of the production, after the last right-hand side element.

A production can also contain mid-rule code blocks, which are placed between right-hand side elements. A code block cannot be placed at the beginning of a production, before any elements.

Each production can have a number of options, which are placed after all elements but before the final (reduction) code block. Individual elements of a production can also have their own modifiers.

Below is an outline of the most general form of a grammar production (some of the elements shown are optional). The individual elements are explained in detail later in this chapter.

lhsNt <valueType> [name] : // rule start, all elements on the same line
// single production given as:
%empty
// or sequence of zero or more elements (terminals or nonterminals):
eltBeingTerminalOrNonterminal [name] ^place1 ^place2...
// with possible mid-rule-code between elements
{ mid-rule-code } <valType> [name] ^place1 ^place2...
// after elements, optional lookahead context (after slash)
/ lahTerm1 , lahTerm2 ...
// zero or more options (see below)
option1 option2...
// reduction code for entire production (optional)
{ code-fired-on-reduction }
// definition of single production ends here, terminated with…
| … // pipe followed by next production for the same lhsNt, or
; // semicolon after last production

The right-hand side of a production can be of zero length. To improve readability, the %empty token can be placed instead of an empty sequence of elements.

Nonterminals (left and right hand side) are referenced using their identifier, like:

SOME_TOKEN

Terminals can be referenced by identifier, user-defined string name, or character (which was used to assign code to terminal):

WHILE_KEYWORD
"until"
'c'

When referencing terminals, either single or double quotes can be used, regardless of whether a user-defined name or a character code is referenced.

Rule definition alternatives

Typical form of production is:

lhsNt: elt1 elt2 … { code-fired-on-reduction };

It specifies the left-hand side nonterminal, the sequence of right-hand side elements, and the code executed when the parser reduces the entire production.

Occasionally mid-rule code is placed between right-hand side elements:

lhsNt: elt1 { mid-rule-code } elt2 elt3 ... { code-fired-on-reduction };

Mid-rule code is executed when the parser reaches its location within the production.

When mid-rule code fires, the remaining right-hand side elements of the production have not yet been read or recognized, and there is no guarantee that they will be matched. This means that mid-rule code can fire even if the entire production is never matched.
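
A minimal sketch of this consequence, assuming hypothetical terminals FUNC and NAME, a hypothetical nonterminal body, and hypothetical target-language helpers EnterScope and LeaveScope (none of these names come from Alpag):

funcDef: FUNC NAME { EnterScope(); } body { LeaveScope(); };

Here EnterScope() runs as soon as the parser reaches the point after NAME. If body is then not matched, the reduction code never runs and LeaveScope() is never called, so mid-rule code with side effects may need to be undone elsewhere.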

Multiple productions for the same left-hand-side nonterminal can be declared within single rule, separated with pipes '|':

lhsNt:
elt11 elt12 ... { code1 } |
elt21 elt22 ... { code2 } |
...
eltN1 eltN2 ... { codeN } ;

Productions for the same nonterminal can be declared using separate rules as well.

The left hand side nonterminal can be declared with optional name and optional value type:

lhsNt [<valueType>] [[name]]: ...

name can be used to reference the value of the nonterminal in code blocks.

valueType declares the value type of the nonterminal. This way of declaring the value type can be used interchangeably with the %type command. If both appear, they must not conflict.

The right hand side elements can also have modifiers:

rhsTermOrNt [[name]] [[^place1] ^place2]...
{ mid-rule-code } [<valueType>] [[name]] [[^place1] ^place2]...

where:

name is an optional name that can be used to reference the production element from code.
place is the identifier of a place (point) inside the production, which can be used to trace that point after the grammar is transformed into an automaton. Multiple named places can be declared for a single point in a production.
valueType is an optional value type that can be assigned to mid-rule code. Since mid-rule code is not declared elsewhere in the grammar, its value type can be specified only this way.
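
As an illustrative sketch, a production using element names and a named place might look like the line below. The symbols IF, THEN, expr and stmt, the place name afterCond, and the helper MakeIf are all hypothetical:

ifStmt: IF expr[cond] ^afterCond THEN stmt[body] { $$ = MakeIf($cond, $body); };

The names cond and body are local to this production and are read in the reduction code as $cond and $body. The place afterCond only labels a point in the production; it can be referred to from elsewhere, for example by the %igsr ^place option.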

Start nonterminal

A single nonterminal must be selected as the start nonterminal of the grammar. This nonterminal becomes the topmost construct of the grammar and corresponds to the entire input stream. By default, the left-hand side nonterminal of the first production declared in the grammar becomes the start nonterminal.

Start nonterminal can be explicitly declared using command:

%start nonterminalName

This command must be placed in parser definitions %%pdefs section.

Lookahead context

Each production can be declared with an explicit lookahead context. This context determines which terminal symbols are allowed to come next in the input stream for the production to be matched.

Lookahead context is specified as one or more comma-separated terminals placed at the end of production after slash '/':

lhsNt: elt1 elt2 ... eltN / lahTerm1, lahTerm2... { code } ;

The production above is matched only if the next input terminal is one of the listed lookahead terminals (lahTerm1, lahTerm2, and so on).

A grammar can contain multiple productions with the same left-hand and right-hand sides, provided that their lookahead contexts are disjoint. Such a group can include one production with no lookahead context; it matches all lookahead symbols not covered explicitly by the other productions.

Normally Alpag calculates the possible lookahead symbols itself. An explicit declaration makes it possible to fine-tune parser behavior.
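
A sketch of two productions with identical left-hand and right-hand sides and disjoint lookahead contexts. The terminal NAME, the nonterminal ref, and the helpers MakeCallee and MakeVariable are hypothetical:

ref: NAME / '(' { $$ = MakeCallee($1); } |
NAME { $$ = MakeVariable($1); } ;

The first production is matched only when the next input terminal is '('. The second production has no lookahead context, so it covers all remaining lookahead symbols.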

Options

Below is a list of options that can be placed at the end of production:

%prec terminalSymbol
%prec !precClass

Specifies precedence of this production, used to resolve shift-reduce conflicts.

With the first syntax, the precedence of the production becomes the same as the precedence of the referenced terminal symbol. With the second syntax, the production takes the precedence of the named precedence class.

Using this option overrides any precedence that could be derived from elements of production.
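
A typical use is unary minus. The sketch below assumes a nonterminal expr and a hypothetical terminal UMINUS that has been given a suitably high precedence elsewhere in the grammar (how that is declared is not shown here):

expr: expr '-' expr { $$ = $1 - $3; } |
'-' expr %prec UMINUS { $$ = -$2; } ;

Without the option, the second production would use whatever precedence is derived from its elements. With %prec UMINUS it uses the precedence of UMINUS instead.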

%warnoff warningCode

Disables a single warning. Warning code can be decimal or hexadecimal. Multiple %warnoff options can be specified for single production.

%nolah or %stdef

Both options have the same meaning: they require that the automaton state at the end of the production be defaulted. When a state is defaulted, the production is reduced without analyzing the lookahead token.

Parser generation will fail if end-of-production state cannot be defaulted.

%nostdef

This option disables defaulting of automaton state at the end of production, even if defaulting is possible.

%igsr

This option ignores (disables) all shift-reduce conflicts for the production.

%igsr terminalSymbol

This option ignores (disables) shift-reduce conflicts under given terminal symbol. Option can be specified multiple times for different symbols.

%igsr ^place

This option ignores (disables) shift-reduce conflict against given place (declared in other production). Option can be specified multiple times for different place identifiers.

%igsr count

This option ignores (disables) a given number (count) of shift-reduce conflicts for this production, regardless of their character. When used in combination with other %igsr options it covers only conflicts not resolved by other means.
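
The dangling else ambiguity is one place where such an option might be used. The sketch below assumes hypothetical terminals IF, THEN and ELSE, nonterminals expr and stmt, and helpers MakeIf and MakeIfElse; it also assumes that the conflict is attributed to the shorter production:

stmt: IF expr THEN stmt %igsr ELSE { $$ = MakeIf($2, $4); } |
IF expr THEN stmt ELSE stmt { $$ = MakeIfElse($2, $4, $6); } ;

Under these assumptions, only the shift-reduce conflict under ELSE is ignored for the first production. Conflicts under any other terminal are still reported.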

Reduction code

Code associated with parser rules is executed when production is reduced. The code usually calculates value of left-hand-side nonterminal (being result of reduction) using values of right-hand-side production elements. To perform this task code must have access to all these values.

Placeholders

The following placeholders can be used inside code handling a reduction:

$$ - value of the left-hand side nonterminal, which becomes the result of the reduction. It is usually assigned to.
$1, $2, $3… - values of the right-hand side elements of the production being reduced.
$NAME - value of the element NAME.

Name is matched against:

identifiers of terminals and nonterminals
user-assigned names of terminals and nonterminals, local to production

To use the $NAME syntax, the referenced name must be unique within the production. If the identifier of a terminal or nonterminal is used, there can be only one such element in the production. If a user-assigned name is used, it must likewise be unique within the production.

Both integer and name arguments can be wrapped in square brackets:

$[1], $[2], $[3], $[NAME]

Examples

(usage of operators in code is purely illustrative):

PROD1: FIRST SECOND THIRD { $$ = $1 + $2 + $3; } // result of production is sum of elements
SUM: NUMBER '+' NUMBER { $$ = $1 + $3; } // result calculated from values of particular elements
DIV: DIVIDENT '/' DIVISOR { $DIV = $DIVIDENT / $DIVISOR; } // accessing elements by id
SUM[RES]: NUMBER[A] '+' NUMBER[B] { $RES = $A + $B; } // accessing elements by given name

File locations

Parser can be configured to process information about file locations.

Information about location of individual elements can be accessed using following placeholders:

@$ - location of the entire production (the span covering all of its elements)
@1, @2, @3… - locations of subsequent right-hand side elements of the production
@NAME - location of the element NAME

Format of variables storing file location information depends on configuration.
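
A sketch of location placeholders in reduction code, assuming location processing is enabled. The symbols NAME and args and the helper MakeCall are hypothetical, and MakeCall is assumed to accept whatever location format the configuration produces:

call: NAME '(' args ')' { $$ = MakeCall($NAME, $args, @NAME, @$); };

Here @NAME is the location of the NAME element alone, while @$ is the location of the entire production.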

Value type casting

The value type of the expressions $$, $1, $2... is taken from the declaration of the respective terminal, nonterminal or mid-rule code. This default behavior can be overridden using type casting:

$<valueType>$, $<valueType>1, $<valueType>2...

The consequences of casting a value depend on how value types are handled. By default, alternative value types are stored in dedicated fields of a value data record, so forcing a value type simply results in referencing another field of this record.

Value types can also be declared using a custom code template for both field declaration and access. In that case, the consequences of casting depend on how the underlying fields are stored.

It is also possible to access entire value data record using asterisk '*' like so:

$<*>$, $<*>1, $<*>2

With this syntax all fields of value data record can be accessed using syntax like $<*>$.MyField.
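
A sketch of a cast in reduction code, assuming a hypothetical value type IntVal stored in the default way (in its own field of the value data record) and a hypothetical terminal NUMBER:

value: NUMBER { $<IntVal>$ = $<IntVal>1; };

This reads the IntVal field of the first element and stores it in the IntVal field of the result, whatever value types were declared for value and NUMBER. If the field carries the same name as the value type, the same assignment could presumably be written through the whole record as $<*>$.IntVal = $<*>1.IntVal.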

Setters

By default expression like $$ denotes a simple field which can be used for both getting and assignment. Sometimes it may be necessary to use different syntax for setting and getting the value.

Alpag provides special syntax for assignment:

$${ assignedExpression }

The value type being assigned to must have a %set code template defined:

%valtype identifier %set {code}

The %set code must contain:

the $$ placeholder, which is replaced by the location of the assigned-to value data
the $1 placeholder, which is replaced by assignedExpression

Note that $$ stands for entire record with value data, not just single field holding value for that single type.

This mechanism can be used both with a value field generated by Alpag and with completely custom code.

Example (with automatically generated field):

%valtype MyType %type { string } %set { $$.MyType = "(" + $1 + ")" }

Field named MyType is added automatically to value data structure (via %type declaration).

Getting value of this field is done in usual way. Setting is done via provided setter template.

If reduction code is:

$${ $3 + ","+ $4 }

Generated code will be:

dataR.MyType = "(" + data3.MyType + "," + data4.MyType + ")"

where dataR, data3 and data4 stand for the actual locations of the value data records.

Example (with custom field declared by user):

%valtype MyType %decl { string UserField; } %get { $$.UserField } %set { $$.UserField = "(" + $1 + ")" }

Above declaration adds field named UserField to the value data structure.

If reduction code is:

$${ $3 + ","+ $4 }

Generated code will be:

dataR.UserField = "(" + data3.UserField + "," + data4.UserField + ")"

The assignment syntax can be used even if no %set code template was declared. In that case the assignment:

$${ expression }

is replaced with:

$$ = expression

Macros

A number of predefined macros can be used within reduction code. These are:

ACCEPT – ends parsing session and returns a success
ABORT – ends parsing session and returns an error
ERROR – generates an error and starts error recovery procedure
RECOVERING – boolean variable set when parser is currently in error recovery
LAH_CLEAR – clears and discards current lookahead symbol.
LAH_IS_SET – boolean value set when lookahead symbol is prefetched and present
LAH_IS_EMPTY – boolean value set when no lookahead symbol is present
LAH_SYMBOL – code of lookahead terminal symbol
LAH_IS_EOF – true when lookahead symbol is EOF
LAH_VALUE – gives access to value associated with lookahead symbol
LAH_LOCATION – contains location of lookahead symbol
