General concepts

This section explains the rationale for using automatic lexer and parser generators and the motivation behind the two-step process of lexing and parsing.

Role and purpose of a parser generator

Parsing text is a common computing task. The code performing this task, a parser, can be written manually or generated by a dedicated tool: a parser generator. A parser generator produces code capable of parsing text that conforms to a grammar provided by the user. The generated code can be included as part of any application.

When a language has a simple grammar, a parser for it can usually be written by hand, often resulting in code that is faster and easier to understand than automatically generated code. If a language has a complex grammar, writing a parser entirely by hand is not easy. In such cases an automatic parser generator is preferred.
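As an illustration of the hand-written approach, the sketch below parses a very simple grammar: a list of name=value pairs separated by semicolons. It is written in Python purely for illustration. The grammar and the function name parse_pairs are assumptions made for this example and are not part of Alpag.

```python
def parse_pairs(text):
    """Parse 'name=value;name=value' into a dict (hypothetical example grammar).

    Grammar:  list := pair (';' pair)*
              pair := name '=' value
    """
    result = {}
    for pair in text.split(";"):
        # partition() returns an empty separator when '=' is missing
        name, sep, value = pair.partition("=")
        name, value = name.strip(), value.strip()
        if not sep or not name:
            raise ValueError(f"malformed pair: {pair!r}")
        result[name] = value
    return result


print(parse_pairs("width=10; height=20"))
```

For a grammar this small the whole parser fits in a dozen lines and is easy to read. The same style becomes hard to maintain once the grammar grows nested or recursive constructs.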

Using an automatic parser generator has several advantages over the manual approach:

the cost of generating a parser is significantly lower than the cost of writing it by hand
if the grammar changes, an updated parser can be regenerated very quickly
writing the formal grammar of the language, which the parser generator requires as input, results in a better understanding of that language
if a grammar for the language is already available (e.g. as part of a formal specification), it can simply be used as input to the parser generator

There are also disadvantages:

writing a new grammar from scratch can be challenging and time-consuming
the grammar may fall outside the class of grammars supported by the parser generator, in which case parser generation is not possible (at least with that particular tool)
the cost of learning the tool can outweigh the benefits of automatic parser generation
automatically generated parser code can be hard to understand and debug

For big and complex grammars it is usually impractical to write parsers by hand, so using an automatic parser generator is practically unavoidable. What remains is the decision of which tool to choose.

Alpag deals with the above disadvantages in a number of ways:

The format of Alpag input files is similar to the format used by lex and yacc, the standard lexer and parser generators in the C language community. Grammars defined for these tools can simply be used as input for Alpag. Many books and tutorials use the lex/yacc format in their examples, so learning Alpag should be easier than learning tools that use a custom input format.
Alpag generates parsers for grammars in the LR(1) class. This class is powerful, covers the majority of computer languages in use today, and is considered standard in many applications. Other grammar classes that are more intuitive and easier to use exist as well, but they are also less powerful. Choosing LR(1) as the grammar class for Alpag decreases the risk of "hitting the wall", that is, being unable to cover a particular language.
The complexity of LR grammars makes them hard to write and hard to debug. Alpag addresses this problem by generating code with a rich debugging interface. Additional debug methods and data structures enable run-time analysis and tracing of parser operation.

Approach to parsing

When reading an input stream, one must recognize language elements such as words or sentences. Some of these constructs, like elementary words, can be distinguished using just their textual properties, without analyzing the entire text from start to end. Such analysis can therefore be performed locally. Higher-level language constructs, like entire sentences, usually require considering context (such as the surrounding words) to properly identify the role of each language element. Such analysis cannot be done locally; it must consider the entire phrase.

The difference in requirements between low-level and high-level analysis paves the way for using two analyzers: one for discovering low-level lexical elements and another for performing high-level, text-wide analysis. Alpag follows this two-level approach and can generate two analyzers:

Lexer, a low-level analyzer, performs lexical analysis of the input text (a stream of bytes or characters) by identifying easily distinguishable elements, like words, based solely on their textual properties. These elements, once recognized, become atomic symbols, or tokens, and are passed to the high-level analyzer.
Parser, a high-level analyzer, reads the stream of tokens returned by the lexer, looking for high-level grammar constructs. The parser output, the recognized grammar constructs, is returned to the user.
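The division of labor between the two analyzers can be sketched in a few lines of Python. This is only an illustration of the two-level idea, not code generated by Alpag: the token names, the small arithmetic grammar, and the names lex, Parser and evaluate are all assumptions made for this example, and the parser is a hand-written recursive-descent parser rather than the LR(1) kind that Alpag generates.

```python
import re

# Lexer rules (assumed for this example): each token is recognized from its
# textual form alone, without looking at the surrounding context.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<OP>[+*()]))")


def lex(text):
    """Low-level analyzer: turn characters into (kind, value) tokens."""
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        pos = match.end()
        if match.lastgroup == "NUMBER":
            yield ("NUMBER", int(match.group("NUMBER")))
        else:
            yield (match.group("OP"), None)
    yield ("END", None)


class Parser:
    """High-level analyzer: consumes tokens, never raw characters.

    Grammar:  expr := term ('+' term)*
              term := atom ('*' atom)*
              atom := NUMBER | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = iter(tokens)
        self.kind, self.value = next(self.tokens)

    def advance(self):
        self.kind, self.value = next(self.tokens)

    def expr(self):
        total = self.term()
        while self.kind == "+":
            self.advance()
            total += self.term()
        return total

    def term(self):
        product = self.atom()
        while self.kind == "*":
            self.advance()
            product *= self.atom()
        return product

    def atom(self):
        if self.kind == "NUMBER":
            value = self.value
            self.advance()
            return value
        if self.kind == "(":
            self.advance()
            value = self.expr()
            if self.kind != ")":
                raise SyntaxError("expected ')'")
            self.advance()
            return value
        raise SyntaxError(f"unexpected token {self.kind!r}")


def evaluate(text):
    """Run both stages: the lexer feeds tokens to the parser."""
    parser = Parser(lex(text))
    result = parser.expr()
    if parser.kind != "END":
        raise SyntaxError(f"unexpected token {parser.kind!r}")
    return result


print(evaluate("2 + 3 * (4 + 1)"))  # prints 17
```

The parser only ever compares token kinds. Whitespace handling and the spelling of numbers are entirely the lexer's concern, which is what allows either stage to be replaced independently.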

It is possible to build a monolithic parser capable of analyzing the input stream from individual characters up to the level of entire sentences. The two-stage approach is far more flexible, though. The combined power of two analyzers can be greater than the power of a single monolithic parser. Moreover, intermediate code can be placed between the lexer and the parser, giving even more flexibility.
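A minimal sketch of such intermediate code, assuming tokens are passed as (kind, text) pairs: a filter that drops comment tokens and reclassifies identifiers that happen to be keywords. The token kinds, the keyword set and the name token_filter are hypothetical and do not reflect any actual Alpag interface.

```python
KEYWORDS = {"if", "while", "return"}  # hypothetical keyword set


def token_filter(tokens):
    """Intermediate stage that sits between a lexer and a parser.

    Drops COMMENT tokens and reclassifies identifiers that are keywords,
    so neither the lexer nor the parser has to know about these rules.
    """
    for kind, text in tokens:
        if kind == "COMMENT":
            continue
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"
        yield (kind, text)


# Stand-in for lexer output; a real lexer would produce this stream.
lexer_output = [
    ("IDENT", "return"),
    ("COMMENT", "# result"),
    ("IDENT", "x"),
]

for token in token_filter(lexer_output):
    print(token)
```

Because the filter consumes and produces the same kind of token stream, it can be added or removed without changing either analyzer.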

Alpag can generate a lexer alone, a parser alone, or both. There is no obligation to use a lexer and a parser that are both generated by Alpag. The user is free to combine either analyzer with code written by hand or generated by third-party tools.

This manual presents lexer generation and parser generation in separate chapters. Still, it should be understood that the parser and the lexer must cooperate in the task of analyzing input. In practice, lexer and parser grammars are written in parallel and adjusted against each other.

Alpag Manual