Parser definitions

The parser definitions section is declared using %pdefs or %pd.

When this section is present in the input file, the file is assumed to contain a parser definition.

Basic concepts

Terminal symbol identifiers

Parser input is a sequence of terminal symbols coming from the lexer. Whenever the lexer detects a meaningful sequence of input characters, it reports a symbol.

Symbol codes are integers. To exchange them, the lexer and the parser must agree on a common set of codes, with a unique code for each possible input terminal symbol.

By default, Alpag assigns symbol codes to declared terminals as consecutive integers. When necessary, the user can also assign symbol codes manually. Both approaches can be mixed.

For performance reasons it is better to use codes assigned by Alpag.

When terminal symbol codes are assigned by Alpag, the source of terminal symbols (the lexer) must use them when reporting matches. Sometimes the source of tokens uses its own fixed token identifiers, which for some reason cannot be changed to the parser-assigned ones. In that case the parser must be configured to use these identifiers, which is done by declaring the parser terminals with manually assigned codes.

Alpag can use any positive integers as codes for input terminal symbols. If custom codes are used, it is better if they are small integers, preferably in a continuous range starting from 1.
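
The two ways of assigning codes can be sketched with the %token syntax described later in this section. The token names and code values below are illustrative assumptions and are not taken from any real lexer:

```
%token IDENT NUMBER                     // codes assigned automatically by Alpag
%token COMMA 1 SEMICOLON 2 COLON 0x03   // codes fixed by hand to match an existing lexer
```

The second line is the form to use when the lexer's token identifiers cannot be changed.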

Value types

Symbols passed to the parser can have an associated value. Nonterminal symbols resolved during parsing can also have an associated value. The value type can be the same for all grammar elements, or it can be declared for each terminal and nonterminal separately.

If the parser uses multiple value types, these must be declared explicitly using the %valtype keyword. A declared value type can then be referenced when declaring terminals and nonterminals by putting its name in angle brackets, like <valtypeName>.

Terminals and nonterminals can also be declared with no value type at all.
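
As a minimal sketch, the declarations below combine the %valtype, %token and %type keywords summarized later in this section. The value type names (STR, INT), the symbol names, and the target-language type int are assumptions made for illustration:

```
%valtype STR %type { string } %isdef   // default value type
%valtype INT %type { int }
%token <INT> NUMBER                    // NUMBER carries an INT value
%token IDENT                           // no explicit type, so the default (STR) applies
%token <-> SEMICOLON                   // explicitly declared with no value type
%type <INT> expression                 // nonterminal with an INT value
```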

Associativity and precedence

One way of resolving ambiguity in a grammar is to use information about the associativity of individual terminal symbols. Terminal symbols can be declared as right-, left- or non-associative.

Another way to resolve ambiguity is to group terminals into precedence classes. These can be used to resolve shift-reduce conflicts by choosing the grammar construct with the higher-precedence terminal. By default all terminals have the same precedence.

The traditional way of declaring terminals with precedence is to use the %precedence, %left or %right keyword. Each of these keywords, in a single operation, introduces a new precedence class and declares the terminal symbols that belong to it.
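
A short sketch of the traditional style, using hypothetical operator tokens. Each line introduces its own anonymous precedence class. The relative priority noted in the comments assumes that classes declared later rank higher, which is the rule this manual states for %prclass declarations:

```
%left ADD SUB    // new precedence class, left-associative (lowest of the three)
%left MUL DIV    // another new class, left-associative
%right POW       // a third class, right-associative (highest of the three)
```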

The alternative is to declare a named precedence class using the %prclass keyword. This named class can then be referenced from a terminal symbol declaration, which puts all terminals declared there in that class.

To reference a precedence class, use its name preceded by an exclamation mark:

!precedenceClassIdentifier

Terminal symbol declaration elements

Terminal symbols are declared using the %token, %left, %right, %nonassoc and %precedence keywords (described further in this section). The following kinds of information can be specified when declaring terminals with one of these keywords:

name of the symbol, used when defining the grammar. Possible options are:

user-specified identifier (like MY_TOKEN)
alternative double-quoted string which can be used interchangeably with the identifier (like "while")

integer code assigned to the terminal. By default it is assigned automatically. It can also be declared explicitly by using:

integer value (like 305 or 0xAA)
ASCII or Unicode value of a single-quoted character (like 'C'). This single character can also be used when referencing the symbol, instead of its identifier.

type of value associated with the terminal. The value type is specified using <valtypeName> syntax.
associativity of the terminal. It is specified by using the appropriate keyword for declaring symbols (like %left, %right or %nonassoc).
precedence of the terminal, which is specified either by declaring the symbol using the %precedence, %left or %right keywords (which introduce a new precedence class), or by referencing a named precedence class (like !precClassName).
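
The elements listed above can be combined in a single declaration. The following sketch uses hypothetical names (STR, PREC_ADD, IDENT, PLUS, ASSIGN) and an arbitrary code value:

```
%valtype STR %type { string }
%prclass PREC_ADD
%token <STR> IDENT 300 "identifier"   // value type, identifier, explicit integer code, string alias
%token !PREC_ADD PLUS '+' "plus"      // named precedence class, code taken from the character '+'
%right ASSIGN "="                     // right-associative, new anonymous precedence class, automatic code
```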

Summary of parser definitions

valtype

Declares a named value type which can be used when declaring terminals or nonterminals.

%valtype identifier options...

Available options are:

%type { type-name }

name of programming language type

%decl { code }

code declaring variable.

%access { code }

code template for accessing variable. Code must contain $$ placeholder for variable.

%get { code }

code template for getting value of variable. Code must contain $$ placeholder for variable. Effectively the same as %access code.

%set { code }

code template for setting value of variable. Code must contain $$ placeholder for variable and $1 placeholder for assigned value.

%isdef

specifies that this type is a default type. Default type is used for all terminals that do not have value type specified explicitly.

%warnoff warning

disables warning with given integer code

Examples

%valtype MY_STRING %type { string } %isdef
%valtype COMPLEX %type { SomeClass }
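
A further sketch shows the %get and %set templates. It assumes that the variable substituted for $$ is an object with a hypothetical Node member and that SyntaxNode is a type defined in user code. Neither name comes from Alpag itself:

```
%valtype NODE %type { SyntaxNode } %get { $$.Node } %set { $$.Node = $1 }
```

Here $$ stands for the variable and $1 for the value being assigned, as described for the %set option.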

Terminals can also be declared with no value type.

In yacc, the functionality of declaring multiple value types is available via the %union keyword.

prclass

Declares a named precedence class.

%prclass identifier

The order of %prclass declarations also determines the order of their respective precedence classes. Precedence classes declared further in the file have higher priority.

The %prclass keyword introduces a named class which can then be used to declare symbols anywhere in the file. Note that the %precedence, %left and %right declarations used to declare tokens also introduce their own anonymous precedence classes, which participate in the global order of precedence classes.

A named precedence class can later be referenced using its name prefixed with an exclamation mark:

!identifier
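
For illustration, two named classes and the tokens assigned to them might be declared as follows. The class and token names are hypothetical:

```
%prclass PREC_ADDSUB                          // declared first: lower priority
%prclass PREC_MULDIV                          // declared later: higher priority
%token !PREC_ADDSUB ADD "+" SUB "-"
%token !PREC_MULDIV MUL "*" DIV "/" REM "%"
```

Because the classes are named, the %token lines could appear anywhere in the file without changing the relative priority fixed by the %prclass order.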

token, left, right, nonassoc, precedence

A terminal symbol declaration can be introduced using one of the following keywords:

%token – declares ordinary tokens with no explicit precedence or associativity
%precedence – declares tokens with new precedence class
%left – declares left-associative tokens with new precedence class
%right – declares right-associative tokens with new precedence class
%nonassoc – declares non-associative tokens with no explicit precedence

Each keyword can be used to declare multiple tokens. All tokens declared on the same line have the same precedence class and associativity.

When a token declaration starts with a !precClass reference, all tokens declared on the same line are added to this precedence class.

A value type can be referenced anywhere in the line using <valtype> syntax. That value type is assigned to all tokens that follow it (until the end of the line or the next value type reference).

The general format of a declaration is as follows (using %token as an example; square brackets denote optional elements):

%token [!precClass] [<valtype>] tokenDeclaration [[<valtype>]tokenDeclaration2]...

where:

precClass is the name of a precedence class (declared earlier with %prclass).
valtype is the name of a value type (declared earlier with %valtype). A value type of minus (that is, <->) can be used to declare tokens with no explicit value type.
tokenDeclaration declares a single terminal symbol with its identifier, optional string name and optional code.

A token declaration can have one of the following forms (square brackets denote optional elements):

identifier [integerId] ["string"]
identifier 'character' ["string"]
'character' ["string"]

where:

identifier can be used to reference the terminal from grammar productions
"string" is an optional user-given string which can be used interchangeably with the identifier. The string declared here is not guaranteed to be the same as the text matched by the lexer.
integerId is an explicitly specified code for the terminal symbol. If given, the parser will expect this very code to appear on its input. It can be decimal or hexadecimal (with the 0x prefix).
'character' is an alternate way of specifying the code for the terminal symbol. The ASCII or Unicode value of the character is used as the code of the terminal. This single character can also be used (quoted) to reference the terminal from grammar productions.

Either integerId or 'character' can be specified, but not both. If neither is given, the code for the terminal is assigned automatically.

Examples:

%token AAA BBB CCC // three tokens declared
%token LETTERA 65 LETTERB 66 // explicitly assigned token codes
%token VOID_KEYWORD "void" // can be referenced from grammar either way
%token 'C' 'D' 'E' // three tokens declared, letters become both names and source of codes
%token <STR> STRING TEXT <INT> INTEGER <-> FOR WHILE UNTIL // with value types
%token !PREC_MULDIV MUL "*" DIV "/" REM "%" // tokens in named precedence class
%right MULTIPLY // right associative token in new anonymous precedence class
%precedence ADD "+" SUB "-" // tokens in new anonymous precedence class

Do not use single-quoted characters in declarations (like 'c') if you do not need symbol codes to match the character codes of these characters. If you want to reference tokens from the grammar using strings, declare the tokens using double quotes (like "c"), which does not interfere with symbol code assignment.
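
The difference can be sketched with two alternative declarations of the same hypothetical token SEMI (only one of them would be used in a given grammar):

```
%token SEMI ';'   // code of SEMI is the character code of ';' (59 in ASCII); ';' can be used in productions
%token SEMI ";"   // code assigned automatically; ";" is only an alternative name for SEMI
```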

To maintain backward compatibility, Alpag uses the same set of keywords for declaring terminal symbols as yacc, that is: %token, %left, %right, %nonassoc and %precedence. The syntax of these commands has been extended in Alpag, but their default behavior is the same as in yacc. This means that a yacc grammar can be copied verbatim to Alpag and will behave the same way.

type

Declares nonterminals with optional value type.

%type [<valtype>] nonterminal [[<valtype2>]nonterminal2]...

where:

valtype is the name of a value type declared earlier with %valtype. A value type of minus (that is, <->) can be used to declare nonterminals with no explicit value type, which overrides the default value type specified with the %isdef switch.
nonterminal is the identifier of a grammar nonterminal

Grammar nonterminals do not have to be declared explicitly. Each identifier that appears in grammar productions and was not declared as a terminal is considered a valid nonterminal.

Examples:

%type FOR_STATEMENT WHILE_STATEMENT
%type <STRARR> STRING_ARRAY
%type <INT> DIGITS <-> NO_VALUE

parserror

Declares a named error token.

%parserror identifier [options…]

where options are:

%on_begin_recovery { code }

Code executed when the parser enters error recovery mode.

%on_skip_input { code }

Code executed in error recovery mode for each input token that is not matched and is skipped.

%on_recovered { code }

Code executed when error recovery is over.

%eat_lah

With this option, the cached lookahead token is discarded when error recovery starts.

%leave_lah

With this option, the cached lookahead token is left for analysis when error recovery starts.

By default the parser contains only one predefined error-handling token, named error.

If the required error-recovery behavior differs from location to location, the user can define multiple error tokens, each providing specific behavior and custom error-handling code.

User-defined error tokens can be used anywhere the built-in error token can be used.
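
As a sketch, two error tokens with different recovery behavior might be declared as below. The token names, the LogError function and the skippedTokens variable are hypothetical user code. The sketch also assumes that options can be listed in any order on the declaration line:

```
%parserror stmt_error %eat_lah %on_begin_recovery { LogError("invalid statement"); }
%parserror expr_error %leave_lah %on_skip_input { skippedTokens++; }
```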
