Parser operation with lookaheads
The important element of LR(1) parser operation scheme is reading one input symbol ahead, and using it to resolve ambiguities. This will be explained using following recursive grammar:
@1 COMMAND_ARRAY: COMMAND_ARRAY COMMAND;
@2 COMMAND_ARRAY: COMMAND;
@3 COMMAND: CMD;
@4 COMMAND: CMD Int;
@5 COMMAND: CMD Str;
Grammar reads a sequence of CMD commands which can be followed by optional Int or St.
Example input might look like this:
CMD CMD Int CMD CMD Str CMD Int CMD Str CMD CMD
Each COMMAND starts with CMD terminal. If it is standalone CMD (production @3) reduction should happen next. If however next input terminal is Int or Str parser should continue reading to match productions @4 or @5. Available alternatives are:
CMD ^ if next symbol is Int, shift
CMD ^ if next symbol is Str, shift
CMD ^ if next symbol is neither Int nor Str, reduce last CMD to COMMAND
Decision should depend on next input symbol.
To resolve cases like this LR(1) parser reads one symbol ahead just for preview. This lookahead symbol is used to resolve current situation but is not 'consumed' yet. Lookahead symbol is eventually removed from input and put to stack when parser performs a shift.
Example
Below is a fragment of parser automaton generated for above grammar:
state #1
@6[0] $accept: . COMMAND_ARRAY
actions
CMD shift to #4
COMMAND_ARRAY goto #2
COMMAND goto #3
state #2
@6[1] $accept: COMMAND_ARRAY .
@1[1] COMMAND_ARRAY: COMMAND_ARRAY . COMMAND
actions
CMD shift to #4
$end reduce by @6 to $accept
COMMAND goto #7
state #3
@2[1] COMMAND_ARRAY: COMMAND .
actions
$end reduce by @2 to COMMAND_ARRAY
CMD reduce by @2 to COMMAND_ARRAY
state #4
@3[1] COMMAND: CMD .
@4[1] COMMAND: CMD . Int
@5[1] COMMAND: CMD . Str
actions
Int shift to #5
Str shift to #6
$end reduce by @3 to COMMAND
CMD reduce by @3 to COMMAND
state #5
@4[2] COMMAND: CMD Int .
actions
$end reduce by @4 to COMMAND
CMD reduce by @4 to COMMAND
state #6
@5[2] COMMAND: CMD Str .
actions
$end reduce by @5 to COMMAND
CMD reduce by @5 to COMMAND
state #7
@1[2] COMMAND_ARRAY: COMMAND_ARRAY COMMAND .
actions
$end reduce by @1 to COMMAND_ARRAY
CMD reduce by @1 to COMMAND_ARRAY
Parser operation with lookahead symbols is illustrated below. Some irrelevant steps were skipped:
State #1
stack: #1
lookahead: <empty>
pending input: CMD CMD Int CMD
In state #1 parser reads first symbol (providing it is CMD) and shifts it to the stack:
State #4
stack: #1 CMD #4
lookahead: <empty>
pending input: CMD Int CMD
To decide between three available options parser will read next symbol and keep it as a lookahead:
Still in state #4
stack: #1 CMD #4
lookahead: CMD
pending input: Int CMD
When lookahead is known parser can decide between three actions available in state #4. Since lookahead is CMD parser will reduce by production @3 to COMMAND:
State #3
stack: #1 COMMAND #3
lookahead: CMD
pending input: Int CMD
then to:
State #2
stack: #1 COMMAND_ARRAY #2
lookahead: CMD
pending input: Int CMD
In current state action for next symbol (i.e. lookahead is to shift it to stack). Parser shifts current lookahead to stack:
State #4
stack: #1 COMMAND_ARRAY #2 CMD #4
lookahead: <empty>
pending input: Int CMD
Parser is again in state #4 where three actions are available. To choose between them next input symbol mast be fetched and used as lookahead:
Still state #4
stack: #1 COMMAND_ARRAY #2 CMD #4
lookahead: Int
pending input: CMD
This time parser will not perform reduction. Since next symbol is Int parser shifts it and goes to state #5.
State #5
stack: #1 COMMAND_ARRAY #2 CMD #4 Int #5
lookahead: <empty>
pending input: CMD
Next parser fetches CMD as lookahead:
Still state #5
stack: #1 COMMAND_ARRAY #2 CMD #4 Int #5
lookahead: CMD
pending input: <empty>
Then parser reduces by @4 to COMMAND:
State #7
stack: #1 COMMAND_ARRAY #2 COMMAND #7
lookahead: CMD
pending input: <empty>
And finally reduces to COMMAND_ARRAY, still under the same lookahead CMD:
State #2
stack: #1 COMMAND_ARRAY #2
lookahead: CMD
pending input: <empty>
Conclusions
Analyzing lookahead mechanism several observations can be made:
- Lookahead symbol is simply a first symbol of input stream that was prefetched in order to choose between available actions in some state. Next time parser needs to shift, it shifts lookahead symbol is one is prefetched.
- There are states when lookahead symbol is absolutely necessary to make next decision (like state #4 above). There are also states when inspecting lookahead symbol is not really necessary since only one action is possible: a reduction (state #5 above)
- A lookahead symbol can be preloaded by parser or can be empty (if it wasn't necessary to complete recent steps). In terms of current position of parser and current position of input stream this means that sometimes input stream is one token ahead of parser (when lookahead preloaded).
Author of grammar can usually ignore lookahead mechanism. There are however few situations when lookahead prefetch must be taken into account. If activity of parser must be in some way synchronized with activity of lexer, the potential shift by one symbol between them may be relevant. Examples of this situation include:
- code executed from within parser actions (during reductions) is used to control the lexer (e.g. switch modes)
- code attached to parser actions checks lexer state for some conditions like reaching end of file.
- some actions are taken for lexer based on results of parsing (e.g. new input file is open for lexer based using filename detected when parsing)
- lexer state and parser state are compared during joint parser-lexer debug session
Lookahead processing must be taken into account when designing error recovery. Parser reports an error when invalid (i.e. unexpected) symbol appears on input. This invalid symbol is preloaded as lookahead when error fires. Sometimes it may be necessary to explicitly discard erroneous lookahead symbol before attempting error recovery.