Integration with code

The parser grammar defined in the previous sections does not contain any custom user code. Such code must be provided to report recognized grammar constructs to the application. In addition, the tokens fed to the parser as input usually carry values, and these values need to be processed as well.

Whenever the parser recognizes a grammar construct, it performs a reduction, replacing multiple grammar elements with a single nonterminal. User code executed during the reduction must do whatever processing is needed to preserve the data attached to the reduced symbols.

Upon reduction, user code can do two things:

Invoke code located outside the parser to report the grammar construct that was just recognized. Reporting results to the application is usually the ultimate goal of the entire parsing operation.
Collect the values of the reduced terminals and nonterminals and wrap them in a single value. This value can then be stored on the parser stack together with the result of the reduction.

Often a mixture of both approaches is used.
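
The two approaches, and their mixture, can be sketched in plain C# outside of any generated parser. This is only an illustration: the class, the method names and the Reported list are hypothetical stand-ins for application code and are not part of Alpag.

```csharp
using System;
using System.Collections.Generic;

public static class ReductionStyles
{
    // Hypothetical application-side sink that receives reported constructs.
    public static readonly List<string> Reported = new List<string>();

    // Approach 1: tell code outside the parser what was just recognized.
    public static void ReportCommand(string command)
    {
        Reported.Add(command);
    }

    // Approach 2: wrap the values of the reduced symbols in a single value.
    public static string WrapCommand(string keyword, string number)
    {
        return "CMD<" + keyword + ":" + number + ">";
    }

    // Mixture: build the value, report it, and return it so that it can
    // be stored as the value of the resulting nonterminal.
    public static string ReduceCommand(string keyword, string number)
    {
        string value = WrapCommand(keyword, number);
        ReportCommand(value);
        return value;
    }

    public static void Main()
    {
        string value = ReduceCommand("k1", "123");
        Console.WriteLine(value);          // prints CMD<k1:123>
        Console.WriteLine(Reported.Count); // prints 1
    }
}
```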

The parser can store user-provided values on its stack, assigned to terminals and nonterminals. The type of the stored value can be chosen by the user. It is also possible to declare a record with multiple value types, which can differ between particular terminals and nonterminals.

Values for nonterminals are calculated during reduction. Values for terminals must be assigned when the parser reads the input, and usually come from the lexer.

In the example below a single value of type string is used for all terminals and nonterminals. The test parser is not connected to a real lexer. Instead, the source of tokens (terminal symbols) is simulated with a predefined array of tokens and an array of values for these tokens, both embedded directly in the parser.

Below is a complete parser input file that can be passed to Alpag.

User code to be placed inside the parser class is contained in %code blocks.

Code executed during reduction has been added to the individual productions.

/* myParser – my first parser */

// declaration of value type used by parser
%option Parser.Value.FieldType string

// code placed inside parser class
%code parser_body {
    // position in simulated input stream
    int inputOffset = 0;
    // input tokens
    int[] TOKENS = new int[]{
        KEYWORD, NUMBER, KEYWORD, KEYWORD, NUMBER
    };
    // values for input tokens
    string[] VALUES = new string[]{
        "k1", "123", "k2", "k3", "456"
    };

    void Log( string msg )
    {
        Console.WriteLine( msg );
    }
}

// code placed inside procedure reading next input token for parser
%code parser_next_token {
    // arguments available here:
    // valueData.Value – place to return token value
    // return: identifier of token or EOF
    if( inputOffset >= TOKENS.Length )
        return EOF;
    // next token and its value to return
    int token = TOKENS[inputOffset];
    valueData.Value = VALUES[inputOffset];
    inputOffset++;
    // log moment when parser fetches next symbol
    Log( "Token: " + token + " Value: " + valueData.Value );
    return token;
}

%%pdefs
%token KEYWORD
%token NUMBER

%%prules
// when FILE is matched, print File along with value of COMMANDS
FILE: COMMANDS { Log( "File: !" + $1 + "!" ); };
COMMANDS:
    CMD {
        // result value of COMMANDS ($$) = value of CMD ($1)
        $$ = "[" + $1 + "]";
        // Also log that reduction occurred
        Log( "CMNDS:CMD " + $$ );
    } |
    COMMANDS CMD {
        // result value of COMMANDS ($$)
        // is calculated concatenating element values
        $$ = "(" + $1 + ", " + $2 + ")";
        // log reduction
        Log( "CMNDS:CMNDS,CMD " + $$ );
    }
    ;
// value of CMD ($$) is assigned from value of KEYWORD
CMD: KEYWORD {
        $$ = "CMD<" + $1 + ">";
        Log( "CMD:KEYW " + $$ );
    };
// value of CMD ($$) is concatenated from KEYWORD ($1) and NUMBER ($2) values
CMD: KEYWORD NUMBER {
        $$ = "CMD<" + $1 + ":" + $2 + ">";
        Log( "CMD:KEYW,NUM " + $$ );
    };

The parser_next_token code is executed each time the parser needs the next input token from the lexer. In the example, parser_next_token prints a line containing the identifier of the fetched token and its associated value.

The code for each production logs the reduction along with its result value, which makes it possible to trace the parser's operation.

A reduction occurs when all right-hand-side elements of a production have been recognized and the parser has determined that they match exactly that one production. During the reduction all right-hand-side elements are replaced by the single left-hand-side nonterminal. Both the reduced elements and the resulting nonterminal can have associated values. User code can calculate the result value and assign it to the $$ variable. The values of the right-hand-side elements are available in the $1, $2, $3… variables and can be used in the calculation.
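
The effect of a reduction on the stored values can be sketched in plain C#. This is a simplified model written for illustration, not the code Alpag generates: the Reduce helper and its parameters are hypothetical, and a real parser also keeps a state for each stack entry. The array passed to the action plays the role of $1, $2, …, and the returned string plays the role of $$.

```csharp
using System;
using System.Collections.Generic;

public static class ReductionSketch
{
    // Removes the top 'count' values from the stack, passes them to
    // 'action' in left-to-right order (rhs[0] is $1, rhs[1] is $2, ...),
    // and pushes the value returned by the action (the $$ value).
    public static string Reduce(List<string> stack, int count,
                                Func<string[], string> action)
    {
        int start = stack.Count - count;
        string[] rhs = stack.GetRange(start, count).ToArray();
        stack.RemoveRange(start, count);
        string result = action(rhs);
        stack.Add(result);
        return result;
    }

    public static void Main()
    {
        // Values of KEYWORD and NUMBER as they sit on the stack.
        var stack = new List<string> { "k1", "123" };

        // Counterpart of: CMD: KEYWORD NUMBER { $$ = "CMD<" + $1 + ":" + $2 + ">"; }
        Reduce(stack, 2, rhs => "CMD<" + rhs[0] + ":" + rhs[1] + ">");

        Console.WriteLine(stack.Count); // prints 1
        Console.WriteLine(stack[0]);    // prints CMD<k1:123>
    }
}
```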

Since string is used for all values, the result of each reduction can be calculated by concatenating the values of all right-hand-side elements. The results are additionally wrapped in easily distinguishable delimiters to emphasize the order of reductions.

Executing the above code should reveal the sequence of all operations, including fetching input tokens and reducing productions.

Rebuild the parser by invoking:

alpag myParser.alp

The generated parser can now be added to the target application.

Create a console application and place the following code inside its main method:

// default namespace for generated code
using Common;
// ...

MyParser parser = new MyParser();
int status = parser.Parse();
if( status != 0 )
    Console.WriteLine( "Error: " + status );

Note that Parse() does not return until the entire input has been read.

When launched, the above code should print:

Token: 1 Value: k1
Token: 2 Value: 123
Token: 1 Value: k2
CMD:KEYW,NUM CMD<k1:123>
CMNDS:CMD [CMD<k1:123>]
Token: 1 Value: k3
CMD:KEYW CMD<k2>
CMNDS:CMNDS,CMD ([CMD<k1:123>], CMD<k2>)
Token: 2 Value: 456
CMD:KEYW,NUM CMD<k3:456>
CMNDS:CMNDS,CMD (([CMD<k1:123>], CMD<k2>), CMD<k3:456>)
File: !(([CMD<k1:123>], CMD<k2>), CMD<k3:456>)!

The last line represents the entire recognized input. It corresponds to FILE, the top-level nonterminal of the grammar. The brackets and parentheses reveal the order of reductions.
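
The nesting in the last line follows from the left-recursive COMMANDS rule: the first CMD is wrapped in square brackets, and every later CMD adds one more pair of parentheses around everything collected so far. A minimal sketch of that accumulation in plain C# is shown below. The FoldCommands helper is hypothetical and only mirrors the two COMMANDS actions; it is not part of the generated parser.

```csharp
using System;
using System.Collections.Generic;

public static class NestingSketch
{
    // Applies the COMMANDS actions to a list of already reduced CMD values.
    public static string FoldCommands(IReadOnlyList<string> commands)
    {
        if (commands.Count == 0)
            throw new ArgumentException("at least one CMD is required");

        // COMMANDS: CMD            { $$ = "[" + $1 + "]"; }
        string result = "[" + commands[0] + "]";

        // COMMANDS: COMMANDS CMD   { $$ = "(" + $1 + ", " + $2 + ")"; }
        for (int i = 1; i < commands.Count; i++)
            result = "(" + result + ", " + commands[i] + ")";

        return result;
    }

    public static void Main()
    {
        var commands = new[] { "CMD<k1:123>", "CMD<k2>", "CMD<k3:456>" };
        // prints File: !(([CMD<k1:123>], CMD<k2>), CMD<k3:456>)!
        Console.WriteLine("File: !" + FoldCommands(commands) + "!");
    }
}
```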

Note that at the beginning the parser read three symbols: the two symbols k1 and 123 needed to perform the first reduction CMD: KEYWORD NUMBER, and also a third symbol, k2. If you inspect the grammar carefully, you will notice that reading the first two symbols is completely sufficient to perform the first reduction. In other words, these first two symbols could not be anything other than the production CMD: KEYWORD NUMBER. Why, then, did the parser decide to read one more symbol?

Parsers built by Alpag can be configured to read a lookahead symbol always, or only when necessary. If the mode is not set explicitly, Alpag chooses it automatically. For the rather minimal grammar above, Alpag decided to build a parser that always reads a lookahead symbol, whether it is needed or not.

Add one more line to the head part of the input file:

%option Parser.Reduction.Defaulting On

This option forces defaulting, that is, performing default reductions without reading a lookahead symbol.

Build the parser once again, recompile the test code, and run the program.

This time the output will be:

Token: 1 Value: k1
Token: 2 Value: 123
CMD:KEYW,NUM CMD<k1:123>
CMNDS:CMD [CMD<k1:123>]
Token: 1 Value: k2
Token: 1 Value: k3
CMD:KEYW CMD<k2>
CMNDS:CMNDS,CMD ([CMD<k1:123>], CMD<k2>)
Token: 2 Value: 456
CMD:KEYW,NUM CMD<k3:456>
CMNDS:CMNDS,CMD (([CMD<k1:123>], CMD<k2>), CMD<k3:456>)
File: !(([CMD<k1:123>], CMD<k2>), CMD<k3:456>)!

At the beginning the parser reads two symbols, immediately decides that the only possible reduction at this point is CMD: KEYWORD NUMBER, and performs it right away. The final result, the string in the last line, is the same as before, but the exact sequence of actions taken by the parser is different.
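
The difference between the two modes can be reproduced with a small hand-written shift-reduce loop for this grammar. The sketch below is not the code Alpag generates and makes several assumptions: the state numbers are chosen by hand, EOF is assumed to have the value 0, and the defaulting flag only models the idea of skipping the lookahead fetch in states where a reduction is the sole possible action.

```csharp
using System;
using System.Collections.Generic;

public static class LookaheadSketch
{
    // KEYWORD = 1 and NUMBER = 2 match the logged token codes; EOF = 0 is an assumption.
    const int EOF = 0, KEYWORD = 1, NUMBER = 2;
    static readonly int[] Tokens = { KEYWORD, NUMBER, KEYWORD, KEYWORD, NUMBER };
    static readonly string[] Values = { "k1", "123", "k2", "k3", "456" };

    // Hand-numbered states (hypothetical, not Alpag's own numbering):
    // 0 start, 1 after COMMANDS, 2 after CMD, 3 after KEYWORD,
    // 4 after COMMANDS CMD, 5 after KEYWORD NUMBER.
    public static List<string> Trace(bool defaulting)
    {
        var log = new List<string>();
        var states = new Stack<int>();
        var values = new Stack<string>();
        states.Push(0);
        int pos = 0;
        int look = -1;            // -1 means no lookahead has been fetched
        string lookValue = null;

        while (true)
        {
            int s = states.Peek();
            // In states 2, 4 and 5 the only possible action is a reduction.
            bool onlyReduce = s == 2 || s == 4 || s == 5;
            if (look < 0 && !(defaulting && onlyReduce))
            {
                if (pos < Tokens.Length)
                {
                    look = Tokens[pos];
                    lookValue = Values[pos];
                    pos++;
                    log.Add("Token: " + look + " Value: " + lookValue);
                }
                else
                {
                    look = EOF; // end of input is not logged, as in parser_next_token
                }
            }

            string result;
            switch (s)
            {
                case 0:
                case 1:
                    if (look == KEYWORD) { states.Push(3); values.Push(lookValue); look = -1; }
                    else if (s == 1 && look == EOF)
                    {
                        log.Add("File: !" + values.Peek() + "!"); // FILE: COMMANDS
                        return log;
                    }
                    else throw new InvalidOperationException("syntax error");
                    break;
                case 3:
                    if (look == NUMBER) { states.Push(5); values.Push(lookValue); look = -1; break; }
                    states.Pop();                                  // CMD: KEYWORD
                    result = "CMD<" + values.Pop() + ">";
                    log.Add("CMD:KEYW " + result);
                    values.Push(result);
                    states.Push(states.Peek() == 0 ? 2 : 4);       // goto on CMD
                    break;
                case 5:
                    states.Pop(); states.Pop();                    // CMD: KEYWORD NUMBER
                    string number = values.Pop(), keyword = values.Pop();
                    result = "CMD<" + keyword + ":" + number + ">";
                    log.Add("CMD:KEYW,NUM " + result);
                    values.Push(result);
                    states.Push(states.Peek() == 0 ? 2 : 4);       // goto on CMD
                    break;
                case 2:
                    states.Pop();                                  // COMMANDS: CMD
                    result = "[" + values.Pop() + "]";
                    log.Add("CMNDS:CMD " + result);
                    values.Push(result);
                    states.Push(1);                                // goto on COMMANDS
                    break;
                case 4:
                    states.Pop(); states.Pop();                    // COMMANDS: COMMANDS CMD
                    string cmd = values.Pop(), list = values.Pop();
                    result = "(" + list + ", " + cmd + ")";
                    log.Add("CMNDS:CMNDS,CMD " + result);
                    values.Push(result);
                    states.Push(1);                                // goto on COMMANDS
                    break;
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine("Lookahead always:");
        foreach (string line in Trace(false)) Console.WriteLine(line);
        Console.WriteLine("Defaulting on:");
        foreach (string line in Trace(true)) Console.WriteLine(line);
    }
}
```

With defaulting set to false the third logged line is the fetch of k2, and with defaulting set to true it is the reduction of CMD<k1:123>, which matches the two listings above.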

The behavior of parsers generated by Alpag depends not only on the input grammar but also on the options controlling parser generation. Understanding these options is necessary to generate a parser that behaves exactly as expected.

A complete discussion of parser generation issues can be found in the Parser section.

Alpag Manual