Capturing

Some regular expression engines expose functionality known as capturing. With capturing, parts of a regular expression can be defined as capturing groups. Once the regular expression is matched, the parts of the match that correspond to capturing groups can be accessed as well. This makes it possible to match an entire expression and parse its subfields at the same time.

Parts of a regular expression that are to be captured are given names using the syntax:

(?<name>...)

The name of a captured block is not local to a single regular expression. It is defined globally for the entire lexer.

Multiple distinct grammar rules may use the same name for capturing their inner blocks. All these blocks are considered parts of one single capture. The lexer makes no distinction between a named block in one rule and a block with the same name in a different rule. Since only one rule is eventually matched, the named capture holds the block from that rule.
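The effect of sharing one name across rules can be imitated outside Alpag. The sketch below is an illustration only: it uses Python's `re` module (whose named-group syntax is `(?P<name>...)`, not Alpag's), and the rule table `RULES` and the helper `match_rule` are hypothetical names introduced for this example. Each rule is tried in turn, and whichever rule matches supplies the value of the single shared capture `IDENT`.

```python
import re

# Two hypothetical rules that both capture their letters under the name IDENT.
RULES = [
    ("DOLLAR_ID", re.compile(r"\$(?P<IDENT>[A-Z]+)")),
    ("HASH_ID", re.compile(r"#(?P<IDENT>[A-Z]+)")),
]

def match_rule(text):
    """Return (rule_name, captures) for the first rule matching all of text."""
    for name, pattern in RULES:
        m = pattern.fullmatch(text)
        if m:
            # Only the matched rule contributes to the shared capture.
            return name, m.groupdict()
    return None, {}

print(match_rule("$ABC"))  # ('DOLLAR_ID', {'IDENT': 'ABC'})
print(match_rule("#XYZ"))  # ('HASH_ID', {'IDENT': 'XYZ'})
```

Code that consumes the capture does not need to know which rule produced it; it only asks for `IDENT`.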

After a particular rule has been matched, each named capture defined in the lexer is in one of two states:

it is contained in the rule that was matched and has itself been matched (gave a nonempty match). Its value can be retrieved.
it is either not present in the matched rule or gave an empty match. In this case its value cannot be retrieved.

Interface

Each named capture is assigned an integer identifier. This identifier is available inside the lexer class as a constant and can be used to fetch the matched text.

The methods for accessing captured values are:

bool CaptureOn( int captureId )

Returns true if the given capture was found in the most recent match.

int CaptureGetLength( int captureId )

Returns the length of the captured text.

byte[] CaptureGetValueByte ( int captureId )
char[] CaptureGetValueChar ( int captureId )

Return a copy of the captured text. There are separate methods for each output format.

bool CaptureValueByteCopyTo ( int captureId, byte[] dstArr, int dstOffset )
bool CaptureValueCharCopyTo ( int captureId, char[] dstArr, int dstOffset )

Copy the captured value to the provided buffer at the given offset. The buffer must have enough room to accommodate the text. There are separate methods for each output format.

string CaptureGetString ( int captureId )

Returns the capture value as a string. This method is available only when the output format is chars.

Limitations

Alpag uses a DFA-based regular expression engine, which generally does not allow capturing. Under certain conditions limited support for this functionality can be provided. There are, however, many limitations to capturing in Alpag:

Boundaries of captured groups should be unambiguous. In practice this means that the characters at the boundary of the captured block's regular expression should differ from the characters in the surrounding expression.
Only a single occurrence of a captured group can be reported. For this reason captured blocks should not be placed inside repeating constructs such as star x* or plus x+. If a named block is placed inside such a construct, the behavior of matching is in general unpredictable.
A single regular expression may contain multiple blocks with the same name, provided that at most one of these blocks will ever be matched.
A named block can appear inside an optional construct (a block which is matched zero or one times).
If several distinct rules contain blocks with the same name, these rules should have similar regular expressions (e.g. share the same prefix), and the captured blocks should match similar (preferably the same) subparts of these rules. Using the same block name for unrelated syntactic elements should be avoided.
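The single-occurrence restriction has a rough analogy in other engines. The sketch below assumes Python's backtracking `re` module rather than Alpag's DFA, and the group name `DIGIT` is made up for the example. A named group inside a repeated construct retains only one occurrence (in Python, the last one). Alpag makes no such promise about which occurrence, if any, is reported.

```python
import re

# A named group placed inside a repeated construct: digits separated by commas.
pattern = re.compile(r"(?:(?P<DIGIT>[0-9]),?)+")

m = pattern.fullmatch("1,2,3")
# Python's re keeps only the final iteration of the group; the earlier
# occurrences "1" and "2" are not available through the capture.
last_digit = m.group("DIGIT")
print(last_digit)  # 3
```

When every occurrence is needed, the usual workaround is to capture the whole repeated region as one block and split it in user code.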

Capturing is not guaranteed to work in Alpag for every expression and every grammar. Alpag may issue a warning that a particular capture group cannot be processed successfully given the current definition of the grammar. This usually means that one of the above rules was violated.

Capturing that once worked may fail after a minor change to the grammar. The user should be prepared for this and be ready to provide custom code that replaces the capture functionality.

Even if Alpag cannot handle the current configuration of named capture blocks, it can still generate a valid lexer. The only disrupted feature of such a lexer is capturing. When the option Lexer.Partitions.NoErrorOnCollision is set to true, Alpag issues a warning and builds the lexer anyway. When this option is false, a fatal error is reported.

Configuration

Alpag handles capturing by partitioning the automaton graph. The lexer detects when partition boundaries are crossed and records the positions of these events as the starts and ends of captured text.
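The idea of recording positions at partition boundaries can be sketched with a small hand-built automaton. This is a conceptual illustration in Python, not Alpag's generated code: the transition table, the `PARTITION` map and the function `run` are all assumptions made for the example. The automaton recognizes a number with an optional fractional part; state 1 belongs to a partition named INTEG and state 3 to a partition named FRACT.

```python
# Hand-built DFA for: digits, optionally followed by '.' and more digits.
TRANSITIONS = {
    (0, "digit"): 1,
    (1, "digit"): 1,
    (1, "dot"): 2,
    (2, "digit"): 3,
    (3, "digit"): 3,
}
ACCEPTING = {1, 3}
# Hypothetical partition assignment: which states belong to which capture.
PARTITION = {1: "INTEG", 3: "FRACT"}

def run(text):
    """Return captured substrings, or None if text is not accepted."""
    state = 0
    spans = {}  # capture name -> [start, end]
    for pos, ch in enumerate(text):
        kind = "digit" if ch in "0123456789" else "dot" if ch == "." else None
        nxt = TRANSITIONS.get((state, kind))
        if nxt is None:
            return None
        old, new = PARTITION.get(state), PARTITION.get(nxt)
        if old != new:
            if old is not None:
                spans[old][1] = pos       # leaving a partition: record end
            if new is not None:
                spans[new] = [pos, None]  # entering a partition: record start
        state = nxt
    if state not in ACCEPTING:
        return None
    last = PARTITION.get(state)
    if last is not None:
        spans[last][1] = len(text)        # close the partition still open
    return {name: text[s:e] for name, (s, e) in spans.items()}

print(run("12.5"))  # {'INTEG': '12', 'FRACT': '5'}
print(run("42"))    # {'INTEG': '42'}
```

The sketch works only because each partition maps to its own states, which is the same reason the boundaries of a captured block need to be unambiguous.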

Alpag can use more than one method to mark partition boundaries. The method can be selected with the Lexer.Partitions.Mode option.

The available modes (in order of growing cost and complexity) are:

Auto – the best option for the current grammar is chosen
Borders – crossings of partition boundaries are recorded
States – visits to all states of partitions are recorded
Transitions – entry and exit transitions of partitions are recorded

In most cases it is best to use Auto and let Alpag decide.

Alpag can sometimes use a hybrid mode if the Lexer.Partitions.HybridModeAllowed option is enabled.

An important factor limiting performance is the number of different (unrelated) partitions that overlap somewhere in the automaton.

Alpag limits the number of overlapping partitions with the following options.

Lexer.Partitions.MaxPartitionSetSize - maximum number of overlapping partitions

Lexer.Partitions.MaxPartitionSetCount - maximum number of different combinations of overlapping partitions

These limits can be raised if Alpag issues a warning about one of them being exceeded.

Examples

The following expression captures a number with an optional fractional part:

(?<INTEG>:[0-9]+)(\.(?<FRACT>:[0-9]+))?

The INTEG part is always nonempty. The presence of the FRACT part should be checked with the call:

if( CaptureOn( FRACT ) ) { ... }
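The same check can be tried out in a general-purpose regex engine. The sketch below is an analogy only, assuming Python's `re` module and its `(?P<name>...)` syntax. The helper `capture_on` is a hypothetical stand-in that mimics the documented behavior of CaptureOn by treating both a missing group and an empty match as "not captured".

```python
import re

# Python rendering of the number expression, with named groups INTEG and FRACT.
NUMBER = re.compile(r"(?P<INTEG>[0-9]+)(\.(?P<FRACT>[0-9]+))?")

def capture_on(match, name):
    """True only if the named group took part in the match and is nonempty."""
    return bool(match.group(name))

for text in ("12.5", "42"):
    m = NUMBER.fullmatch(text)
    if capture_on(m, "FRACT"):
        print(text, "->", m.group("INTEG"), "and", m.group("FRACT"))
    else:
        print(text, "->", m.group("INTEG"), "only")
```

For the input 42 the optional block is skipped, so only the INTEG value is available.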

Consider the following grammar. It features two similar rules:

$(?<IDENT>:[A-Z]+)
#(?<IDENT>:[A-Z]+)

The rules above have different prefixes ($ and #), so the fragments of the lexer automaton covering them are disjoint. Although both rules feature the same named part IDENT, this part maps to two non-overlapping sets of automaton states, one for each rule. That is why there is no penalty for using the same name in both rules.

The grammar above could be rewritten to use two distinct named parts:

$(?<IDENT_A>:[A-Z]+)
#(?<IDENT_B>:[A-Z]+)

There is some performance penalty for using multiple named parts if they overlap somewhere in the automaton. In this case, however, the automaton states corresponding to IDENT_A and IDENT_B are disjoint, so using a dedicated named part in each rule is just fine.

Below is a different grammar with two rules that differ only in their suffix:

(?<IDENT>:[A-Z]+)$
(?<IDENT>:[A-Z]+)#

The rules above have the same prefix. Since the initial part of both expressions is the same, it is covered by the same automaton states. It is best to use the same named part (here IDENT) to capture the same subexpression in both rules.

Alpag Manual