Some regular expression engines expose functionality known as capturing. With capturing, parts of a regular expression can be defined as capturing groups. Once the regular expression is matched, the parts of the match that correspond to capturing groups can be accessed as well. This makes it possible to match an entire expression and parse its subfields at the same time.
Parts of a regular expression that are to be captured should be given names using the syntax:
The name of a captured block is not local to a single regular expression. It is defined globally for the entire lexer.
Multiple distinct grammar rules may use the same name for capturing their inner blocks. All these blocks are considered parts of one single capture. The lexer makes no distinction between a named block in one rule and a block with the same name in a different rule. Since only one rule is eventually matched, the named capture matches the block from that rule.
After a particular rule has been matched, each named capture defined in the lexer can be in one of two states:
Each named capture is assigned an integer identifier. This identifier is available inside the lexer class as a constant and can be used to fetch the matched text.
The methods for accessing captured values are:
Returns true if the given capture was found in the most recent match.
Returns the length of the captured text.
Returns a copy of the captured text. Separate methods exist depending on the output format.
Copies the captured value to the provided buffer at the given offset. The buffer must have enough room to accommodate the text. Separate methods exist depending on the output format.
Returns the captured value as a string. This method is available only when the output format is chars.
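The five operations described above can be pictured with a small Python sketch. It is illustrative only: the class CaptureStore and the method names has_capture, capture_length, capture_copy, capture_copy_to and capture_string are hypothetical and are not the names generated by Alpag. The sketch assumes that a capture is stored as a (start, end) span keyed by its integer identifier.

```python
class CaptureStore:
    """Toy store of capture spans for the most recent match (hypothetical API)."""

    def __init__(self, text, spans):
        # spans maps an integer capture identifier to a (start, end) pair,
        # or to None when the capture did not take part in the match.
        self._text = text
        self._spans = spans

    def has_capture(self, cap_id):
        # True if the capture was found in the most recent match.
        return self._spans.get(cap_id) is not None

    def capture_length(self, cap_id):
        start, end = self._spans[cap_id]
        return end - start

    def capture_copy(self, cap_id):
        # Returns a fresh list of characters.
        start, end = self._spans[cap_id]
        return list(self._text[start:end])

    def capture_copy_to(self, cap_id, buffer, offset):
        # Copies into a caller-provided list; the caller must supply enough room.
        start, end = self._spans[cap_id]
        length = end - start
        if offset + length > len(buffer):
            raise ValueError("buffer too small")
        buffer[offset:offset + length] = self._text[start:end]
        return length

    def capture_string(self, cap_id):
        start, end = self._spans[cap_id]
        return self._text[start:end]


# Integer identifiers standing in for the constants of a lexer class.
INTEG, FRACT = 0, 1
store = CaptureStore("42", {INTEG: (0, 2), FRACT: None})
```

In this sketch a caller would test has_capture(FRACT) before asking for its length or text, since a capture that was not found has no span.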
Alpag uses a DFA-based regular expression engine, which generally does not allow capturing. Under certain conditions it can provide limited support for this functionality. There are, however, many limitations to capturing in Alpag.
Capturing is not guaranteed to work in Alpag for every expression and every grammar. Alpag may issue a warning that a particular capture group cannot be processed successfully given the current definition of the grammar. This usually means that one of the above rules was violated.
Capturing that once worked may fail after minor changes to the grammar. Users should be prepared for that and be ready to provide their own code to replace the capture functionality.
Even if Alpag cannot handle the current configuration of named capture blocks, it can still generate a valid lexer. The only disrupted feature of such a lexer is capturing. When the option Lexer.Partitions.NoErrorOnCollision is set to true, Alpag issues a warning and builds the lexer anyway. When this option is false, a fatal error is reported.
Alpag handles capturing by partitioning the automaton graph. The lexer detects when partition boundaries are crossed and records the positions of these events as the starts and ends of captured text.
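The idea of recording positions at partition boundaries can be sketched with a hand-built automaton. The Python code below is one possible reading of the paragraph above, not Alpag's actual implementation: the four-state automaton, the PARTITIONS table and the function names are all assumptions made for illustration. The automaton accepts a number with an optional fractional part and matches the whole input rather than the longest prefix.

```python
def is_digit(ch):
    return "0" <= ch <= "9"


def step(state, ch):
    # Hand-written transition function: 0 = start, 1 = integer digits,
    # 2 = just after the dot, 3 = fractional digits.
    if state in (0, 1) and is_digit(ch):
        return 1
    if state == 1 and ch == ".":
        return 2
    if state in (2, 3) and is_digit(ch):
        return 3
    return None


# Each state is tagged with the partitions (named parts) it belongs to.
PARTITIONS = {0: set(), 1: {"INTEG"}, 2: set(), 3: {"FRACT"}}
ACCEPTING = {1, 3}


def scan(text):
    state = 0
    starts, ends = {}, {}
    for pos, ch in enumerate(text):
        nxt = step(state, ch)
        if nxt is None:
            return None
        # Entering a partition marks the start of a capture.
        for name in PARTITIONS[nxt] - PARTITIONS[state]:
            starts[name] = pos
        # Leaving a partition marks the end of a capture.
        for name in PARTITIONS[state] - PARTITIONS[nxt]:
            ends[name] = pos
        state = nxt
    if state not in ACCEPTING:
        return None
    # Partitions still open at the end of input are closed there.
    for name in PARTITIONS[state]:
        ends[name] = len(text)
    return {name: text[starts[name]:ends[name]] for name in starts}
```

For the input "12.5" the scan enters INTEG at position 0, leaves it at position 2, and enters FRACT at position 3, so the recorded spans are "12" and "5".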
Alpag can use more than one method to mark partition boundaries. The method can be selected with the Lexer.Partitions.Mode option.
The available options (in order of increasing cost and complexity) are:
In most cases it is best to use Auto and let Alpag decide.
Alpag can sometimes use a hybrid mode if the Lexer.Partitions.HybridModeAllowed option is enabled.
An important performance-limiting factor is the number of different (unrelated) partitions that overlap somewhere in the automaton.
Alpag limits the number of overlapping partitions with the following options:
Lexer.Partitions.MaxPartitionSetSize - maximum number of overlapping partitions.
Lexer.Partitions.MaxPartitionSetCount - maximum number of different combinations of overlapping partitions.
These limits can be raised if Alpag issues a warning about exceeding one of them.
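To make the two quantities concrete, the following Python sketch computes them for a toy automaton. It rests on an assumption about what the options measure: that the set size is the largest number of partitions sharing a single state, and that the set count is the number of distinct non-empty partition combinations found across all states. The function name and the sample data are hypothetical.

```python
def partition_set_stats(state_partitions):
    # Collect every distinct non-empty combination of partitions
    # that occurs on at least one automaton state.
    distinct = {frozenset(p) for p in state_partitions.values() if p}
    # Largest number of partitions overlapping on a single state.
    max_size = max((len(s) for s in distinct), default=0)
    return max_size, len(distinct)


# Toy automaton: state number -> partitions covering that state.
sample = {
    0: set(),
    1: {"A"},
    2: {"A", "B"},
    3: {"B"},
    4: {"A", "B"},
}
max_set_size, set_count = partition_set_stats(sample)
```

Here partitions A and B overlap on states 2 and 4, which gives a largest set of size 2 and three distinct combinations: {A}, {B} and {A, B}.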
The following expression captures a number with an optional fractional part:
The INTEG part is always nonempty. The presence of the FRACT part should be checked with the call:
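As an analogy outside Alpag, the same pattern can be tried with named groups in Python's re module. This is not Alpag syntax and not the Alpag call mentioned above. It only shows the behaviour to expect: INTEG is always present after a successful match, while FRACT must be checked before use. The function name parse_number is made up for the sketch.

```python
import re

# Named groups INTEG and FRACT; the fractional part is optional as a whole.
NUMBER = re.compile(r"(?P<INTEG>[0-9]+)(?:\.(?P<FRACT>[0-9]+))?")


def parse_number(text):
    m = NUMBER.fullmatch(text)
    if m is None:
        return None
    integ = m.group("INTEG")  # always a nonempty string here
    fract = m.group("FRACT")  # None when the fractional part is absent
    return integ, fract
```

In Python, a group that did not take part in the match yields None, which plays the role of the presence check described above.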
Consider the following grammar. It features two similar rules:
The above rules have different prefixes ($ and #). As a result, the fragments of the lexer automaton covering these two rules are disjoint. Although both rules feature the same named part IDENT, this part maps to two non-overlapping sets of automaton states, one for each rule. That is why there is no penalty for using the same part in both rules.
The above grammar could be rewritten to use two distinct named parts:
There is some performance penalty for using multiple named parts if they overlap somewhere in the automaton. In this case, however, the automaton states corresponding to IDENT_A and IDENT_B are disjoint, so using a dedicated named part in each rule is fine.
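The effect of sharing one capture name between rules with different prefixes can be imitated in Python. The sketch below is an analogy rather than Alpag grammar: the rule names DOLLAR_VAR and HASH_VAR, the identifier pattern and the function match_token are assumptions made for illustration. Each rule is a separate pattern, and both use the group name IDENT.

```python
import re

# Two rules with distinct prefixes that share the capture name IDENT.
RULES = [
    ("DOLLAR_VAR", re.compile(r"\$(?P<IDENT>[A-Za-z_]+)")),
    ("HASH_VAR", re.compile(r"#(?P<IDENT>[A-Za-z_]+)")),
]


def match_token(text, pos=0):
    # Only one rule can match at a given position, so IDENT always
    # refers to the block of whichever rule matched.
    for rule_name, pattern in RULES:
        m = pattern.match(text, pos)
        if m:
            return rule_name, m.group("IDENT")
    return None
```

Because the prefixes differ, at most one rule matches at any position, so the shared name IDENT is never ambiguous.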
Below is a different grammar with two rules that differ only by suffix:
The above rules have the same prefix. Since the initial part of both expressions is the same, it is covered by the same automaton states. It is best to use the same named part (here IDENT) to capture the same subexpression in both rules.