Error recovery and lookahead

Sometimes it is necessary to decide what happens to the lookahead symbol when an error is detected. This chapter describes the options for controlling the lookahead symbol during error recovery.

Problem outline

When an error occurs, the offending symbol is also the current lookahead. The first step of error recovery is shifting the error token onto the stack. While this happens, the erroneous lookahead symbol remains in place. Consequently, when the search for the nearest matching terminal symbol starts, the first candidate tested is the very symbol that caused the error. Trying this symbol first is not unreasonable: with certain grammars it does recover the parser from the error. With other grammars, using the symbol that caused the error as the first recovery candidate can be a terrible idea.

The examples below illustrate both scenarios.

Quick recovery example

Consider the following grammar:

SEQUENCE: SEQUENCE CMD | CMD;
CMD:
error ";" |
do something ";" ;

and input:

do ";"

After do has been read, a semicolon ";" appears, which is invalid in this context. The parser steps back to the state before do, which is also the beginning of the error-recovery production, shifts the error token and attempts recovery at position:

CMD: error ^ ";"

Since the semicolon ";" is still the current lookahead symbol, it is matched and recovery ends. The impact of the error is limited to a single erroneous CMD.

If the input sequence is:

do something ";" do ";" do something ";" ";" do something ";"

both erroneous entries, that is the lone do and the empty entry between two semicolons, are recovered in place.
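The in-place recovery described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: it is a hand-written toy recognizer for the grammar of this example, not code generated by Alpag, and the names parse_commands and EXPECTED are hypothetical. It mimics the behaviour in which the offending lookahead is left in place and is the first candidate tested against ";".

```python
EXPECTED = ["do", "something", ";"]  # the only well-formed CMD in this example


def parse_commands(tokens):
    """Toy model of  CMD: error ";" | do something ";"  (not Alpag output)."""
    results = []
    pos = 0
    while pos < len(tokens):
        # Match one well-formed command, token by token.
        matched = 0
        while (matched < len(EXPECTED) and pos + matched < len(tokens)
               and tokens[pos + matched] == EXPECTED[matched]):
            matched += 1
        pos += matched
        if matched == len(EXPECTED):
            results.append("ok")
            continue
        # Error: tokens[pos] is the offending lookahead and is NOT discarded.
        # Recovery resumes at  CMD: error ^ ";"  and tests the lookahead first.
        skipped = 0
        while pos < len(tokens) and tokens[pos] != ";":
            pos += 1
            skipped += 1
        if pos == len(tokens):
            results.append("unrecovered")  # input ended before any ";"
            break
        pos += 1  # shift ";" to finish the error-recovery production
        results.append(f"error(skipped={skipped})")
    return results


sample = "do something ; do ; do something ; ; do something ;".split()
print(parse_commands(sample))
```

In this model both errors report skipped=0, because the offending semicolon itself completes the recovery production and no input has to be discarded.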

Recovery failure example

Using the offending terminal as the first candidate for error recovery can sometimes have catastrophic consequences. Consider the grammar:

SEQUENCE: SEQUENCE CMD | CMD;
CMD:
error |
do something;

Note that in the grammar above the error token is the last symbol of its production.

Consider the input:

something

The symbol something is not expected at the beginning of input, so an error is raised right away. Since the error symbol is acceptable in the same state, the parser shifts it and moves to the end of the error-recovery production:

CMD: error ^

What the parser does next depends on whether defaulting is used in the parser tables:

If the state at the end of the production is not defaulted, the action table for this state allows reduction only if the next symbol is do. The current input symbol something does not recover the parser from the error and is skipped. The parser moves forward.
If the state at the end of the production is defaulted, the action table for this state performs the reduction regardless of the next input symbol. The reduction is executed even though the current symbol is something. After the reduction the parser returns to the initial state. Since the symbol something was not consumed, it causes another error, and the parser enters an endless loop.
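The difference between the two cases can be sketched with a toy Python model. It is an illustration only, not Alpag's generated parser: the function run, its defaulted flag and the max_errors guard are hypothetical names introduced here, and the guard exists solely so that the looping case terminates in the demonstration.

```python
def run(tokens, defaulted, max_errors=5):
    """Toy model of  CMD: error | do something  with the lookahead left in place."""
    pos, errors = 0, 0
    while pos < len(tokens):
        if tokens[pos] == "do":
            pos += 1                      # shift "do"
            if pos < len(tokens) and tokens[pos] == "something":
                pos += 1                  # shift "something", reduce CMD
                continue
        # Error: tokens[pos] (or end of input) is the offending lookahead.
        # The error token is shifted and the parser sits at  CMD: error ^
        errors += 1
        if errors > max_errors:
            return ("stuck", pos)         # hypothetical guard against the endless loop
        if defaulted:
            continue                      # reduce without inspecting the lookahead
        # Not defaulted: reduce only when "do" (or end of input) comes next,
        # so every other token is discarded first.
        while pos < len(tokens) and tokens[pos] != "do":
            pos += 1
    return ("finished", errors)


print(run(["something"], defaulted=False))
print(run(["something"], defaulted=True))
```

In this sketch the non-defaulted run finishes after one error, while the defaulted run never advances past position 0 and is stopped only by the guard.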

Placing the error token at the end of a production is not illegal, but it is not very safe either. The action that recovers the parser in this case is a reduction, not a shift. The conditions under which the reduction occurs depend on the surrounding context, which is very hard to control (it is a context-free grammar, after all). A nonterminal with an error-catching production can be used in many places throughout the grammar. All these places contribute to the total set of terminals that can follow the reduced production, and any of them will recover the parser from the error.

Note that Alpag allows productions with explicit lookahead contexts. This mechanism can be combined with error recovery to narrow down the recovery context:

NONTERMINAL: error / T1 T2; // recover from error providing next terminal is…

Configuration

Alpag can be configured to automatically consume the input symbol that caused the error. This guarantees that by the time error recovery is over, the parser has moved forward by at least one symbol. Setting this option may, however, slow down certain recovery scenarios, such as those in which the offending symbol itself would have completed the recovery.

The option Parser.Errors.ErrorSymbolLookaheadAction controls consumption of the lookahead symbol on error. Possible settings are:

Leave – erroneous input symbol is left to be reanalyzed
Eat – erroneous input symbol is discarded before recovery is attempted
Mixed – consumption of the input symbol can be controlled locally in the grammar (by default it is not consumed)

When the Mixed option is chosen, the user can declare error tokens that explicitly leave or consume the lookahead token, like this:

%parserror MY_ERROR1 %eat_lah
%parserror MY_ERROR2 %leave_lah
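The effect of the three settings can be sketched with a toy Python model of the defaulted parser from the recovery failure example. Everything here is an assumption made for illustration: LookaheadAction, discards_lookahead, simulate, the token_eats parameter and the max_errors guard are hypothetical names, not part of Alpag's interface. Only the setting names and the rule that Mixed does not consume by default come from the text above.

```python
from enum import Enum


class LookaheadAction(Enum):
    """Mirrors the three documented settings (hypothetical Python names)."""
    LEAVE = "Leave"
    EAT = "Eat"
    MIXED = "Mixed"


def discards_lookahead(action, token_eats=None):
    """token_eats models a per-token %eat_lah (True) or %leave_lah (False)."""
    if action is LookaheadAction.EAT:
        return True
    if action is LookaheadAction.LEAVE:
        return False
    return bool(token_eats)  # Mixed: not consumed unless the error token says so


def simulate(tokens, action, token_eats=None, max_errors=5):
    """Defaulted  CMD: error | do something  parser with a lookahead policy."""
    pos, errors = 0, 0
    while pos < len(tokens):
        if tokens[pos] == "do":
            pos += 1                      # shift "do"
            if pos < len(tokens) and tokens[pos] == "something":
                pos += 1                  # shift "something", reduce CMD
                continue
        errors += 1
        if errors > max_errors:
            return "stuck"                # guard standing in for the endless loop
        if discards_lookahead(action, token_eats):
            pos += 1                      # offending symbol is consumed
    return "finished"


for action in LookaheadAction:
    print(action.value, simulate(["something"], action))
```

In this model Leave and the Mixed default get stuck on the input something, while Eat, or Mixed with an error token that eats the lookahead, always advances by at least one symbol per error.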

The lookahead symbol can also be cleared programmatically by invoking LookaheadSymbolClear() from within the code declared in the %on_begin_recovery block of a custom error.


Alpag Manual