alpag.net manual
Introduction / Parser / Grammar definition
< Grammars and algorithms | Parser algorithm >

Grammar definition

Context free grammar is defined by a finite set of productions. Each production specifies how higher-level grammar construct (a nonterminal) can be composed from lower-level terminals or nonterminals.

Input to the parser has form of stream of terminals (i.e. terminal symbols or tokens). This stream is usually provided by lexer. Groups of consecutive tokens are matched against productions of the grammar. When a match is found sequence of lower-level elements (terminals or nonterminals) is replaced by single higher-order nonterminal in the process known as reduction. This process continues until all input tokens are reduced to a single topmost-level nonterminal representing entire correctly parsed input.

Production is traditionally written with nonterminal being defined on the left and list of its component elements on the right:

DEFINED_NONTERMINAL : NONTERMINAL_1 Terminal_2 NONTERMINAL_3 ...;

There can be more than one production for the same nonterminal. Alternative productions for the same nonterminal can be written in separate lines or in one row, separated with '|' pipe symbol:

NT : NT1 Te2 NT3 | NT4 NT5 | Te3;

A grammar has exactly one top-level nonterminal which corresponds to entire input being parsed. Production for this topmost nonterminal is traditionally placed first.

Example CFG grammar is shown on Fig. 3.

The grammar must contain explicit declaration of all possible terminals that can appear on input. This declaration is preceded by keyword %token. Declaration of possible terminals (tokens) is a part of contract between lexer and parser: the lexer report tokens only from this predefined set recognized by parser.

Fig. 3 Example CFG grammar
// declaration of tokens (terminal symbols that can appear on parser input)
%token A Quick Brown Fox Jumps And Over Dog
// (1) main production
SENTENCE: A ADJECTIVES Fox MANYJUMPS Over A Dog;
// (2),(3),(4) three alternative productions for nonterminal ADJECTIVES
ADJECTIVES: Quick | Brown | Quick Brown;
// (5),(6) terminal production and left-recursive production for MANYJUMPS
MANYJUMPS: Jumps | MANYJUMPS And Jumps;

Nonterminals don’t have to explicitly declared. It is enough that these appear on left hand side of any production.

In the example above the nonterminal defined in first production is SENTENCE. Since it appears first, it is treated as main (topmost) nonterminal of entire grammar corresponding to entire input text to be matched.

The ADJECTIVES nonterminal has three alternative productions (2),(3),(4) separated by pipes. This means that whenever nonterminal ADJECTIVES appears in the grammar it can be replaced by right-hand-side of either of this three productions ( being Quick or Brown or Quick Brown ). These three alternatives could be written in separate lines as well.

Finally MANYJUMPS nonterminal has two productions (5) and (6). Second of these productions has MANYJUMPS also on its right side. This means that nonterminal MANYJUMPS is recursive. By recursively substituting occurrence of MANYJUMPS in production (6) with the same production, one can generate sequences of tokens from production (6) of any length as shown in Fig. 4.

Fig. 4 Application of recursive production
// Production (6)
MANYJUMPS: MANYJUMPS And Jumps;
// Production (6) with nonterminal recursively replaced by production (6)
MANYJUMPS: ( MANYJUMPS And Jumps ) And Jumps;
// Production (6) with nonterminal replaced by production (6) twice
MANYJUMPS: (( MANYJUMPS And Jumps ) And Jumps ) And Jumps;
// Innermost occurrence of nonterminal replaced by production (5). This stops recursion
MANYJUMPS: ((( Jumps ) And Jumps ) And Jumps ) And Jumps;
// Final form
MANYJUMPS: Jumps And Jumps And Jumps And Jumps;
< Grammars and algorithms | Parser algorithm >
Alpag Manual