To handle errors, grammar must be extended with additional productions. Purpose of such productions is to 'match errors', recover from them and resume normal parsing for rest of input data.
Special token error can be used when defining grammar. This token behaves like a terminal symbol, and can be placed on the right hand side of any production side-by-side with ordinary terminals and nonterminals. Purpose of the symbol is to 'match errors'.
When error occurs, that is when there is no action defined for next input symbol in current state, parser performs search for production that has an action for this error.
Search for error action is performed as follows:
Providing that error action was found in one of states saved on stack, parser attempts error recovery.
During error recovery parser reads and skips following input symbols trying to resynchronize input with current position in grammar. The procedure is performed as follows:
Note that error token can 'consume' any number of input symbols, until recovery occurs. It is also possible, that recovery happens for initial token that caused the error. By default offending token remains as lookahead symbol, and becomes the first symbol tested for match during recovery.
Below is a simple example of error recovery.
Grammar is:
Above grammar defines a sequence of CMD commands. Each command includes several terminals and ends with semicolon.
In line with ordinary productions, a single error recovery production was defined. Note that this error recovery production is defined for nonterminal CMD. This means that when error occurs, is handled by this production and recovered, it will be reported as another matched CMD occurrence. This illustrates the general idea behind error recovery mechanism: add production that catches error and reports it up the tree as if the erroneous symbol sequence was valid part of grammar. This way entire grammar at the topmost level remains 'valid' and can be further parsed.
Fragment of automaton with states relevant to the scenario is:
Erroneous input sequence analyzed is:
Parser steps are:
Shift mow and go to state #6. So far everything is fine…
Shift the and go to state #11. Still fine…
In state #11 the only action is for lawn. There is no action for poop. Error occurs.
Search for error handling action begins:
Find in current state (#11) action for error. None is found.
Pop the stack discarding top-of-stack element
Find in current state action for error. None found.
Pop stack again
Find in current state action for error. Action found.
Shift error symbol, go to state #4 and begin error recovery.
Next input symbol is poop. Try to find action for it. None is found, since the only action in state #4 is for semicolon ";".
Discard symbol poop and try next input symbol.
Try to recover. Find action for symbol make. None found.
Discard symbol make and read next symbol.
Find action for symbol tea. None found. Discard symbol tea and read next symbol.
Try to recover again. Find action for symbol ";". This time there is an action for semicolon.
Stop error recovery and resume normal operation.
Normal action in current state looks as follows:
Action for next symbol ";" is shift. Perform shift and go to state #16
Default action in state #16 is to reduce by error handling production @3.
CMD nonterminal produced as a result of error recovery eventually becomes part of SEQUENCE defined at grammar root. User must handle the error condition, preferably by providing custom code for error recovery production.
Once error occurs parser starts popping the stack until state accepting error is found. To catch an error it is necessary to have such state up the stack. This means that grammar designer must think in terms of possible parser stack configurations.
Above example features a very simple technique: error-catching production is defined side-by-side with productions that are possible sources of errors. It takes advantage of one fact: productions for the same nonterminal always start in the same automaton states. Placing error as first element of one of these productions guarantees that errors in 'parallel' productions will be caught. In production grammars more sophisticated error-catching techniques may be necessary.
Even more important than catching is the strategy of recovery. Grammar designer must predict what kinds of errors can occur, and, even more importantly, what future terminal symbol can be used for re-syncing grammar phrase with input symbols.
In example studied above CMD statements always end with semicolon ";", so it is safe to assume that next semicolon in the stream is end of some phrase and start of another phrase. That's why error-catching production could be defined as 'error ";"' which reads as: catch error and skip following symbols until semicolon is found; consume that semicolon and assume we are back in main SEQUENCE of CMD. In real grammars that do not feature obvious phrase terminators, finding a good anchor for recovery is not always that easy.