alpag.net manual
Integration / Lexer-parser integration / Passing value
< Token identifiers | Error codes >

Passing value

Each terminal and nonterminal in the parser grammar can have a specified type of accompanying value. Nonterminal values are calculated during reduction and usually summarize information extracted from the elements of the reduced production. Terminal values are passed to the parser by an external source, usually the lexer.

The native format of data processed and matched by the lexer is text. If another value type is to be returned, a conversion must occur. The interpretation of the value may depend both on the actual terminal symbol being reported and on the context in which it appears.
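For illustration, here is a minimal hand-written sketch in C# of such a conversion step; the token constants, the ConvertValue() helper, and the use of object as a common value type are hypothetical and not part of any generated code:

```csharp
using System;

public class TokenConversionDemo
{
    // Hypothetical token identifiers, as a parser class might declare them.
    public const int INT = 1;
    public const int SOME_STR = 2;

    // Convert the matched text into the value expected for the given terminal.
    public static object ConvertValue(int token, string matchedText)
    {
        switch (token)
        {
            case INT:
                return int.Parse(matchedText); // INTEGER-typed terminal
            case SOME_STR:
                return matchedText;            // STRING-typed terminal
            default:
                return null;                   // terminal carries no value
        }
    }

    public static void Main()
    {
        Console.WriteLine(ConvertValue(INT, "42"));       // prints 42
        Console.WriteLine(ConvertValue(SOME_STR, "abc")); // prints abc
    }
}
```

In generated code no such common helper exists; instead the value is stored in the ValueData field that matches the terminal's declared value type, as the variants below show.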

Interface

The ValueData record used to store the alternative value types is declared based on %valtype declarations. Given the declarations:

%pdefs
%valtype INTEGER %type { int }
%valtype STRING %type { string }
%valtype OTHER_TYPE %type { someType }
//...

The resulting ValueData record usually looks like this:

struct ValueData {
int INTEGER;
string STRING;
someType OTHER_TYPE;
//...
}

The record is declared in the parser class.

The prototype of the parser's NextToken() method for fetching successive tokens is:

class ParserClass {
int NextToken( ref ValueData valueData )
{
// user code goes here
}
}

The method takes a ValueData argument. The user implementation of the method is responsible for filling this argument with the token's value.

The prototype of the lexer's NextToken() method is:

class LexerClass {
int NextToken()
}

By default this method takes no arguments and leaves no room for returning a token value.

Filling the gap between the parser's NextToken() and the lexer's NextToken() method can be done in more than one way. Possible solutions include:

- storing token values in custom fields of the lexer,
- passing the parser's ValueData record directly to the lexer,
- combining the lexer and parser in a single class.

Any of these approaches can be combined with predefined %retcode declarations, which replace manually typed code.

The examples given below illustrate each of these alternatives.

Variant 1: custom fields

Values associated with a reported token can be stored in variables defined inside the lexer's body and later picked up by the parser. This is the least sophisticated solution possible. It provides strong separation of the lexer and parser and can be used when both analyzers are built separately.

Lexer declarations:

%code lexer_top { // code to place in lexer
// lexer fields for all value types
public int tokenValueInt;
public string tokenValueString;
// ...
}

%%lrules
[0-9]+ {
// save parsed value in local field
// be sure to return value matching type declared in parser
tokenValueInt = ConvertStrToInt(TokenValueGetString());
return MyParser.INT;
}

Parser declarations:

%valtype INTEGER %type { int }
%valtype STRING %type { string }
%token <INTEGER> INT

%code parser_top { // code to place in parser
MyLexer myLexer; // reference to lexer from parser
}
%code parser_next_token { // body for parser's NextToken()
int token = myLexer.NextToken();
// copy all values from lexer to valueData
valueData.INTEGER = myLexer.tokenValueInt;
valueData.STRING = myLexer.tokenValueString;
// cleanup just in case
myLexer.tokenValueInt = 0;
myLexer.tokenValueString = null;
return token;
}

Variant 2: passing value data

Instead of declaring local variables in the lexer, it is better to pass the ValueData record directly to the lexer. This requires extending the parameters of the lexer's NextToken() method by setting the option:

Lexer.Code.NextTokenFuncArgs

to ParserDefault or Custom.

When the ParserDefault setting is used, the lexer's NextToken() arguments exactly match the signature of the parser's NextToken(). This option should be used for combined lexers-parsers.

When the Custom setting is used, the arguments for NextToken() must be provided in:

Lexer.Code.NextTokenFuncArgsCustom

This variant is better suited for integrating separate lexers and parsers.

Lexer declarations:

// use custom NextToken() arguments
%option Lexer.Code.NextTokenFuncArgs Custom
// pass parser's ValueData data to be filled
%option Lexer.Code.NextTokenFuncArgsCustom {
ref MyParser.ValueData valueDataToBeFilled;
}

%%lrules
[0-9]+ {
// save value in field of provided valueData
valueDataToBeFilled.INTEGER = ConvertStrToInt(TokenValueGetString());
return MyParser.INT;
}

Parser declarations:

%valtype INTEGER %type { int }
%valtype STRING %type { string }
%token <INTEGER> INT

%code parser_top { // code to place in parser
MyLexer myLexer; // reference to lexer from parser
}
%code parser_next_token { // body for parser's NextToken()
// value data will be filled directly in lexer
int token = myLexer.NextToken( ref valueData );
return token;
}

It is assumed here that the lexer has access to the definition of ValueData in the parser (i.e. MyParser.ValueData).

The parser takes care of cleaning up the filled ValueData, so all the lexer has to do is provide the value for the token.

Variant 3: combined lexer-parser

When the lexer and parser are combined in the same class, passing value data can be further simplified by eliminating the intermediate NextToken() procedure.

// parser declarations
%valtype INTEGER %type { int }
%valtype STRING %type { string }
%token <INTEGER> INT

// lexer's NextToken() signature same as expected by parser
%option Lexer.Code.NextTokenFuncArgs ParserDefault

// parser does not generate its own NextToken() method.
// it invokes lexer's NextToken() directly
%option Parser.Code.NextTokenImpl UserOwn

%%lrules
[0-9]+ {
// save value in field of provided valueData
valueData.INTEGER = ConvertStrToInt(TokenValueGetString());
return MyParser.INT;
}

Variant 4: Using retcodes

Writing lexer rule code by hand can lead to errors. The user must make sure that the value type returned by the lexer is the same as the value type defined in the parser grammar.

With Alpag it is possible to define a number of code templates for returning a value in accordance with the value type declared in the parser for a particular terminal. To use this feature the lexer and parser must be defined in the same input file, that is, the lexer must have access to the terminal definitions provided in the parser grammar (see the end of this chapter for exceptions).

Here is an outline of the mechanism:

- a default %retcode template is declared for each value type used by terminals,
- lexer rules report tokens with a %return statement instead of hand-written code,
- each %return statement is expanded using the template matching the value type declared for the returned terminal, with the symbol $$ replaced by the terminal identifier.

The following example illustrates the use of this mechanism:

// parser declarations
%valtype INTEGER %type { int }
%valtype STRING %type { string }
%token <INTEGER> INT
%token <STRING> SOME_STR

// lexer declarations with code for each value type.
// Symbol $$ stands for returned terminal identifier.
%retcode <INTEGER> { // default template for INTEGER
valueData.INTEGER = ConvertStrToInt(TokenValueGetString());
return $$;
}
%retcode <STRING> { // default template for STRING
valueData.STRING = TokenValueGetString();
return $$;
}
%retcode MySpecialCase <STRING> { // custom named template for STRING
// do something special
valueData.STRING = TokenValueGetString();
return $$;
}

// Attach lexer to parser directly
// (other integration variants can be used as well)
%option Lexer.Code.NextTokenFuncArgs ParserDefault
%option Parser.Code.NextTokenImpl UserOwn

%%lrules
[0-9]+ %return INT // uses default template for INTEGER
[A-Z]+ %return SOME_STR // uses default template for STRING
[a-z]+ %return MySpecialCase SOME_STR // uses custom template

If additional processing is necessary for some rules, a custom named %retcode can also be defined (like MySpecialCase in the example above).

The symbol $$ in %retcode templates is replaced with the actual identifier of the terminal. In the above example the statement %return INT is replaced by the template for INTEGER and becomes:

valueData.INTEGER = ConvertStrToInt(TokenValueGetString());
return INT;

By default all %return statements are verified against the parser grammar, which must therefore also be present in the file. It is also possible to use the %return feature without a parser grammar.

By setting Lexer.ReturnIdSource to Custom the user states that both value types and token identifiers are custom and not declared anywhere. When this setting applies, the lexer cannot deduce the value type of a terminal, and value types must be explicitly specified in each %return statement like this:

someRule %return <INTEGER> INT;
someRule %return MySpecialCase <INTEGER> INT;

The identifier of the returned token is then passed through without any verification.
