Grammars and algorithms

This chapter presents information on grammar types supported by Alpag. Readers not interested in theoretical background may skip to next chapter which illustrates process of defining a grammar using some practical example.

Context Free Grammar

Parser must be able to cover entire complexity of analyzed language. Thanks to lexer tokenizing the input stream, the parser does not have to handle individual characters but rather tokens corresponding to whole words. Still, parser must recognize and process all valid combinations of tokens corresponding to syntactic constructions of the language.

Languages are specified using grammars. There are many types of grammars varying in complexity and expressive power. A single given language can be often described using more than one type of grammar, so multiple grammars may exist for the same language.

Defining a grammar (of any type) which well describes a complex language is a nontrivial task. It is very hard to translate grammar of one type into equivalent grammar of another type. That’s why, if a grammar for the language was found, one should learn to handle it, rather than trying to find more simplistic or convenient grammar.

Context Free Grammars (CFG) is a very general class of grammars often used to define computer languages. A context free grammar is described by a finite set of definitions known as rules or productions. Productions specify decomposition of high-level grammar constructs (like sentences) into low-level elements (like subsentences or words).

Several algorithms for processing context free grammars exist. Some algorithms are capable of parsing any CFG, but these can be prohibitively costly for large input files. Algorithms with linear cost exist as well, but are limited to particular subsets of CFG grammars.

There are two major types of linear cost CFG algorithms: LL and LR. These algorithms also define subsets of entire CFG space they can cover. Both work by processing input text linearly from left to right. LL algorithms make decision based on just initial part (prefix) of grammar production. LR algorithms read production to the very end before making final decision. LR algorithms are considered more general than LL because can resolve cases that are nod decidable based on observing just the initial part of a production like in the case of LL algorithm.

Writing grammars for LL parsers is considered easier. Parsers working in LL class have simple and intuitive construction which is easy to understand and debug. Stronger LR grammars, lead to more complex parsers with obscured and non-intuitive principles of operation.

Algorithm used by Alpag

Alpag is using LR parsing algorithm. Linear cost of LR method guarantees stability of generated parser in terms of time and space requirements. At the same time LR is the most powerful method among linear-cost CFG methods and is sufficient for most languages.

LR grammars have subclasses which differ in number of tokens fetched by parser before making the decision. LR(k) grammars after reading tokens that correspond to entire production being recognized perform also a lookahead: look past the end of production by k-symbols. Using lookahead enables resolving certain ambiguous cases.

Alpag can generate LR(1) class parsers. Single lookahead token is usually sufficient to resolve most common types of ambiguities found in computer languages.

Definition of LR parser is stored in tables. Parsers with lookahead, like LR(1), although more powerful, require bigger tables than parsers with no lookahead, like LR(0). For less demanding languages it is often sufficient to use weaker grammar that does not use lookahead.

Parser in LR(1) class can be calculated in more than one way. More general methods give more detailed lookahead contexts and can handle more complicated languages at the cost of bigger parser tables. Simplified methods give less detailed lookahead contexts and can cover fewer languages but result in smaller parser tables. This means that even within LR(1) class there are subclasses of grammars and languages covered depending on parser table calculation algorithm used.

The most general is canonical LR(1) giving full LR(1) class parser. A step down from canonical LR(1) is LALR, which uses simplified calculation of lookahead symbols and gives significantly smaller tables at the cost of some loss of power. Even simpler and less powerful is SLR (Simple LR) method. The LR(0) class (LR parser with no lookahead) is weaker than all variants of LR(1).

All kinds of LR(1) method (canonical, LALR and SLR) differ only in precision of lookahead symbols calculation. The principle of parser operation for all of them is the same. Choosing one of these methods is simply a matter of tradeoff between parser table size and range of covered languages.

Alpag can generate parser LR(1) tables using multiple methods. Usually the method that fully covers the language and produces smallest tables is preferred.

Alpag has one more mode of LR(1) table calculation called Optimized LR(1). This hybrid method gives parsers with same power as canonical LR(1) but usually smaller parser tables. When input grammar is of LALR class Optimized LR should give tables of equal size to LALR method. Optimized LR(1) can be thus considered a kind of self-adjusting option lying between canonical LR(1) and LALR giving best smallest size of parser tables for given grammar.

Alpag can be switched between canonical LR(1), Optimized LR(1) or LALR, with Optimized LR(1) being the default setting.

< Parser | Grammar definition >

Alpag Manual