Regular expression syntax

This chapter describes the regular expression language used to declare patterns for matching input.

Basic operators

a - matches a single letter a
bc - concatenation: b followed by c
a|b - alternative: either a or b
a* - repetition: zero or more occurrences of a
(ab) - grouping: wrapping elements of a regular expression in parentheses makes them behave as one element

The elements above are the elementary operators of the regular expression language. All regular expressions can be written using them. The other operators available in the language are added for convenience.

Examples

foo|bar // matches either foo or bar
any(way|how) // matches anyway or anyhow
(yes)* // matches: <empty string>, yes, yesyes, yesyesyes etc.
1(23)*4 // matches: 14, 1234, 123234, 12323234 etc.

Default precedence of basic operators is as follows:

Repetition a* has higher precedence than concatenation ab.
That is, ab*c is treated like a(b)*c (the asterisk applies only to b). To change this behavior use grouping parentheses, like (ab)*.
Concatenation ab has higher precedence than alternative b|c.
That is, ab|cd is treated like (ab)|(cd). To change this behavior use grouping parentheses, like a(b|c)d.
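
The precedence rules can be tried out in any engine that shares them. The sketch below uses Python's re module rather than Alpag itself; it assumes only that the basic operators bind in the same order in both, as described above. The helper name matches is made up for this illustration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# Repetition binds tighter than concatenation: ab*c is a(b)*c.
assert matches(r"ab*c", "abbbc")
assert not matches(r"ab*c", "ababc")
# Grouping makes the asterisk apply to the whole of ab.
assert matches(r"(ab)*c", "ababc")

# Concatenation binds tighter than alternative: ab|cd is (ab)|(cd).
assert matches(r"ab|cd", "ab") and matches(r"ab|cd", "cd")
assert not matches(r"ab|cd", "abd")
# Grouping limits the alternative to b or c.
assert matches(r"a(b|c)d", "abd") and matches(r"a(b|c)d", "acd")
```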

Repetition operators

Operators for repeating an element multiple times are as follows:

a* - zero or more occurrences of a
a+ - one or more occurrences of a
a? - zero or one occurrence of a, that is, an optional occurrence of a
a{n} - exactly n occurrences of a (example: (ab){3} is equal to ababab)
a{n,} - n or more occurrences of a
a{,m} - from zero to m occurrences of a (inclusive)
a{n,m} - from n to m occurrences of a (inclusive)

Examples

(CD)+ // the same as CD(CD)*
X{,1} // the same as X?
(ab){2,4} // the same as abab(ab(ab)?)?
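
The bounded repetition operators are conveniences that can be rewritten with concatenation, grouping and ?. The sketch below shows one such rewriting and checks it with Python's re module (not Alpag), assuming {n,m} has the same meaning in both. The function expand_bounded is a hypothetical helper, not part of Alpag.

```python
import re

def expand_bounded(elem: str, n: int, m: int) -> str:
    """Rewrite elem{n,m} using only concatenation, grouping and '?'.

    elem must already be a single element (one character or a group).
    """
    required = elem * n
    optional = ""
    # Build the nested optional tail from the inside out.
    for _ in range(m - n):
        optional = f"({elem}{optional})?"
    return required + optional

expanded = expand_bounded("(ab)", 2, 4)

# Both forms accept the same strings among "ab" repeated 0..6 times.
for k in range(7):
    text = "ab" * k
    short_form = re.fullmatch(r"(ab){2,4}", text) is not None
    long_form = re.fullmatch(expanded, text) is not None
    assert short_form == long_form == (2 <= k <= 4)
```

The generated pattern carries some redundant parentheses, but it is built on the same nesting as abab(ab(ab)?)?.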

Single characters

Most characters can be placed in a regular expression verbatim. However, some characters, like ? or *, have special meaning in the regular expression language and must be escaped. Unicode characters which are outside the currently used input encoding must be escaped as well.

Here is a summary of expressions matching a single character (D stands for a hexadecimal digit):

. dot matches any single character. By default it does not match newline. This behavior changes when dot is used in single line mode, that is, within a block with option 's': (?s:...)
c a character which is not one of the reserved characters matches itself
\c a reserved character must be escaped with a backslash (like \? to get a literal ?). It is legal to escape non-reserved characters.
\xDD single character with hexadecimal code in range 00 to FF (0 to 255 decimal)
\uDDDD single character with hexadecimal code in range 0000 to FFFF (0 to 65535 decimal)
\x{DD...} or \u{DD...} single character with hexadecimal code. The number of hexadecimal digits in curly braces can be from 1 to 8, which covers a 4-byte (32-bit) range.

Examples

a.c // matches aac, abc, acc, adc...
ok\? // matches ok?
2\*5 // matches 2*5
x\y\z // matches xyz (none of these letters is a special character)
A\x42C // matches ABC (42 hex = 66 dec is character code of B)
\u0410 // matches Cyrillic A
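
The arithmetic behind the hexadecimal escapes can be checked with a few lines of Python. This only illustrates the character codes involved, using the standard chr and unicodedata facilities; it says nothing about how Alpag parses escapes.

```python
import unicodedata

# Hexadecimal escapes name a character by its code.
assert int("42", 16) == 66               # 42 hex is 66 decimal
assert chr(0x42) == "B"                  # so \x42 stands for B
assert "A" + chr(0x42) + "C" == "ABC"    # A\x42C reads as ABC

# \u0410 is the Cyrillic capital A, which only looks like the Latin A.
cyrillic_a = chr(0x0410)
assert unicodedata.name(cyrillic_a) == "CYRILLIC CAPITAL LETTER A"
assert cyrillic_a != "A"

# Up to 8 hexadecimal digits fit in 32 bits.
assert int("FFFFFFFF", 16) == 2**32 - 1
```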

Reserved characters

Several characters are reserved and have special meaning in regular expressions. To get their literal value they must be preceded by a backslash '\' (like \? to get a literal ?).

Input file related

Because of the Alpag input file syntax, the following characters must be escaped:

when first character in line: % < /
anywhere within expression: a space character

Global scope

Characters reserved in global scope of regular expression are:

anywhere: . \ ( ) [ ] { } * + ?
first occurrence of: /
as first character: ^
as last character: $

Character class

In character classes (inside [ ]) reserved characters are:

anywhere: [ ] . -
as first character after the opening square bracket: ^ :

Special escape sequences

Several characters, when escaped, have special meaning:

\! matches any OOR (out-of-range) character
\a alert/beep (0x07)
\b backspace (0x08)
\f form feed (0x0C)
\n newline LF (0x0A) (see below for details)
\N any newline of all defined newlines (see below for details)
\r carriage return CR (0x0D)
\t tabulation (0x09)
\v vertical tabulation (0x0B)
\0 zero character (0)

Newlines

Newline characters can be matched using the \n or \N escape sequences. The meaning of these escapes depends on the current configuration.

Handling of newline characters is controlled by the Lexer.Eol.Mode option.

When EOL mode is set to Symbol, newline character sequences are converted to a built-in symbol EOL. The user can define several character sequences to be recognized as newlines. All of them are converted to the same single symbol EOL and become indistinguishable. If the input sequence is, say, CR LF (codes 0xD, 0xA), then character codes CR and LF will not appear on lexer input and will be replaced by EOL. In this mode both \n and \N match the same EOL symbol.

When EOL mode is set to Chars, newline sequences are not converted to EOL but passed directly to the lexer. Character sequences like CR LF appear on lexer input as separate characters.
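
The difference between the two modes can be pictured with a small Python sketch. It is illustrative only and not Alpag's implementation: the function lexer_input, the "<EOL>" stand-in for the built-in symbol, and the choice to try longer sequences first are all assumptions made for this example.

```python
EOL = "<EOL>"  # hypothetical stand-in for the built-in EOL symbol

def lexer_input(text, eol_sequences, mode):
    """Return the list of symbols a lexer would see (illustrative only)."""
    if mode == "Chars":
        return list(text)  # newline characters pass through untouched
    # Symbol mode. Assumption: longer sequences are tried first,
    # so CR LF is taken as one newline rather than CR followed by LF.
    ordered = sorted(eol_sequences, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for seq in ordered:
            if text.startswith(seq, i):
                out.append(EOL)
                i += len(seq)
                break
        else:
            out.append(text[i])
            i += 1
    return out

seqs = ["\n", "\r\n"]
assert lexer_input("a\r\nb\n", seqs, "Symbol") == ["a", EOL, "b", EOL]
assert lexer_input("a\r\nb\n", seqs, "Chars") == ["a", "\r", "\n", "b", "\n"]
```

In the Symbol result the two newline styles can no longer be told apart, whereas in the Chars result CR and LF remain visible as ordinary characters.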

To match end-of-line in either Symbol or Chars mode two escapes are available: \n or \N.

Interpretation of \n is controlled by the Lexer.Regex.LineFeedMode setting. When it is set to EOL, \n matches the EOL symbol (in Symbol mode) or all defined EOL sequences (in Chars mode). When LineFeedMode is set to LF, \n matches only an explicit line feed (i.e. character 0x0A). When EOL mode is set to None, \n always matches line feed.

Alpag also supports a convenience escape \N which stands for all predefined EOL character sequences. In Symbol mode it matches the EOL symbol. In Chars mode it matches all defined EOL sequences. That is, \N matches all EOLs regardless of other settings.

Note that in Chars mode \n and \N can match multi-byte EOL sequences. If the defined EOL sequences are, say, LF and CRLF, then the \N escape will be treated as if it was (\x0A|\x0D\x0A). This approach has several implications. For instance, the behavior of newline escapes in negated character classes like [^\N] is different in Chars mode from the behavior in Symbol mode. A character class basically stands for a single character. A negated class [^…] should match a single character that is not present on the list. It is not possible to provide such behavior for [^\N] (not-a-newline) when a newline is a multi-byte sequence. In such a case Alpag will simply not match characters that constitute the predefined EOL sequences, which is not the same as not matching a two-byte CRLF sequence.
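
The following Python sketch mimics the Chars mode behavior just described, using Python's re module as a stand-in engine. The names hex_escape, any_eol and not_eol_char are hypothetical, and how Alpag performs this expansion internally is not specified here.

```python
import re

def hex_escape(text: str) -> str:
    """Spell each character of text as a \\xDD escape."""
    return "".join(f"\\x{ord(ch):02X}" for ch in text)

eol_sequences = ["\n", "\r\n"]  # LF and CRLF, as in the example above

# \N behaves like an alternation of the defined sequences.
any_eol = "(" + "|".join(hex_escape(s) for s in eol_sequences) + ")"
assert any_eol == r"(\x0A|\x0D\x0A)"

# [^\N] can exclude single characters only: every character that
# occurs in some EOL sequence is left out of the class.
eol_chars = sorted({ch for seq in eol_sequences for ch in seq})
not_eol_char = "[^" + "".join(hex_escape(ch) for ch in eol_chars) + "]"

assert re.fullmatch(any_eol, "\n")
assert re.fullmatch(any_eol, "\r\n")
# A lone CR is not a defined EOL sequence...
assert re.fullmatch(any_eol, "\r") is None
# ...yet the negated class still refuses it.
assert re.fullmatch(not_eol_char, "\r") is None
assert re.fullmatch(not_eol_char, "x")
```

The last three assertions show the asymmetry: a lone CR is neither a newline nor a not-a-newline character under this scheme.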

In general, Alpag behavior in Chars mode is different from Symbol mode. In fact, several mechanisms do not work, or do not work as expected, in Chars mode.

The only extra functionality of Chars mode not available in Symbol mode is the ability to provide specific handling of individual EOL sequences at the regular expression level (like having different rules for LF and CRLF). If having such functionality is not required, Symbol mode should be used.

For more details refer to description of options: Lexer.Regex.LineFeedMode, Lexer.Regex.AllEolsEscapeEnabled, Lexer.Eol.Mode, Lexer.Eol.Codes.

Character classes

It is often necessary to match a single character from a certain range. This can be easily achieved using a character class. A character class is declared using square brackets:

[characterList] - matches any character from the list
[^characterList] - negated class; matches any character which is not on the list.

Elements of character list can be:

a single characters
a-z character ranges. Includes all characters from a to z
\c escaped chars
\xDD, \uDDDD, \x{D..}, \u{D..} hexadecimal character codes
range subclasses, described below

Examples

[abc] // a or b or c
[A-Z0-9] // uppercase letters and digits
[\ \?] // space or question mark
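
These simple classes happen to be written the same way in Python's re module, so they can be tried out there. This is a sketch in a different engine, not Alpag; nested subclasses and named ranges described later are not covered by it. The helper accepts is made up for the example.

```python
import re

def accepts(char_class: str, text: str) -> bool:
    """Return True if the class matches the whole text."""
    return re.fullmatch(char_class, text) is not None

assert accepts(r"[abc]", "b") and not accepts(r"[abc]", "d")
assert accepts(r"[A-Z0-9]", "Q") and accepts(r"[A-Z0-9]", "7")
assert not accepts(r"[A-Z0-9]", "q")
assert accepts(r"[\ \?]", " ") and accepts(r"[\ \?]", "?")
# Negated class: any character that is not on the list.
assert accepts(r"[^abc]", "d") and not accepts(r"[^abc]", "a")
# A class always stands for exactly one character.
assert not accepts(r"[abc]", "ab")
```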

Character range subclasses

A character class can contain nested subclasses given in square brackets (nested within the brackets of the character class itself).

There are two kinds of character subclasses:

ordinary character list defined by enumerating all characters on the list, like:

[[characterList]]

predefined character classes. These can be specified either by referencing a well-known named range, or a range given as the value of a certain property.

[[:name:]]
[[:key=value:]]

Both types of ranges can be negated using ^ character:

[[^characterList]] characters that are not on the list
[[:^name:]] characters that are not in the named range

Two predefined sets of character class names are available: POSIX and Unicode. Active set is controlled using Lexer.Regex.NamedCharRanges option.

POSIX named ranges are listed in Fig. 8.

Fig. 8 POSIX named character classes

name	code range
alnum	A-Za-z0-9
alpha	A-Za-z
ascii	\x00-\x7f
blank	\x20\x09
cntrl	\x00-\x1f\x7f
digit	0-9
graph	\x21-\x7e
lower	a-z
punct	\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e
space	\x09-\x0d\x20
upper	A-Z
xdigit	0-9A-Fa-f

The Unicode standard provides information on character ranges that fulfill certain criteria, e.g. belong to a particular language or are uppercase. Categories of character ranges are derived from properties of individual characters.

Unicode ranges often have two names: full and short. Both names can be used in Alpag. Ranges that are analogous to POSIX ranges can also be referenced by their POSIX name.

Alpag can handle the following types of Unicode named ranges:

General Category – describes the general type of a character, like letter, digit or punctuation.
Accessed using [[:category:]] syntax.
Property – related to characteristics or the role of a character, like being diacritic or uppercase. Property ranges are similar in concept to general categories but cover different aspects.
Accessed using [[:property:]] syntax.
Block – continuous (from-to) ranges of codepoints usually related to a particular language. Not all codepoints in the range have to be assigned, but the range fully covers the given language.
Accessed using [[:block=blockName:]] syntax.
Script – list of assigned codepoints related to a particular language.
Accessed using [[:script=scriptName:]] syntax.

Examples

// General Category
[[:L:]] or [[:Letter:]] // a letter (of any language), short and long name
[[:Lu:]] or [[:Uppercase_Letter:]] or [[:upper:]] // uppercase letter, by short, long and POSIX name
[[:Sm:]] or [[:Math_Symbol:]]
// Property
[[:Hex:]] or [[:Hex_Digit:]] // hexadecimal digit
[[:Cased:]] // characters that are uppercase, lowercase or titlecase
// Block
[[:Block=Arabic:]]
[[:Block=Hebrew:]]
// Script
[[:Script=Arabic:]]
[[:Script=Hebrew:]]

Unicode character ranges can be simple from-to ranges as well as complex enumerated lists of codepoints. Using complex ranges results in poor compression of the symbol maps generated for the lexer, especially if it is targeted at the full Unicode range.

Complex ranges are usually those that describe a certain character property across all alphabets. Examples are [[:Diacritic:]] or [[:Uppercase:]]. Avoid using these ranges unless you really have to distinguish uppercase from lowercase in all languages including, for instance, Glagolitic (which is a dead script, by the way). Consider building a subset of such ranges by performing a logical and with the scripts of interest, like [[:Uppercase:]&&[:Script=Cyrillic:]].

Remember that a single large from-to character range compresses very well. The problematic ranges are those that enumerate many small disjoint subranges.
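
To get a feeling for how fragmented a property range can be, the Python sketch below counts the maximal from-to runs in a set of codepoints. It relies on the unicodedata module shipped with Python, whose Unicode version may differ from the one used by Alpag. Two assumptions are made for the illustration: general category Lu stands in for "uppercase", and the Cyrillic block is taken as U+0400 to U+04FF. The helper count_runs is hypothetical.

```python
import unicodedata

def count_runs(codepoints):
    """Count maximal from-to runs in a collection of codepoints."""
    runs, previous = 0, None
    for cp in sorted(codepoints):
        if previous is None or cp != previous + 1:
            runs += 1
        previous = cp
    return runs

BMP = range(0x10000)
# Assumption: general category Lu stands in for "uppercase" here.
uppercase = {cp for cp in BMP if unicodedata.category(chr(cp)) == "Lu"}
# Assumption: the Cyrillic block is U+0400..U+04FF.
cyrillic_block = set(range(0x0400, 0x0500))

print(count_runs(cyrillic_block))              # 1: a single from-to range
print(count_runs(uppercase))                   # many small subranges
print(count_runs(uppercase & cyrillic_block))  # only the Cyrillic part
```

Intersecting with a single block can never increase the number of runs, which is why restricting a property range to the scripts of interest helps.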

Exact definition of Unicode character ranges changes with each new version of the standard. For a complete set of names and their corresponding codepoints refer to the standard.

To check which version of Unicode is supported by your copy of Alpag, inspect the greeting message displayed when Alpag is run.

Subrange operators

Set operators can be used on nested subclasses. Available set operators are:

[charList1]||[charList2]

the double pipe operator is the sum (union) of two sets of characters

[charList1]--[charList2]

the double minus operator is the difference: charList1 minus elements of charList2

[charList1]&&[charList2]

the double-and operator is the common part (intersection): elements that are found in both charList1 and charList2

[charList1]~~[charList2]

the double tilde operator is exclusive or: elements that are either in charList1 or in charList2, but not in both.

Examples

[[A-Z]--[D-F]] // effectively covers characters in A-C and G-Z ranges
[[:Block=Cyrillic:]&&[:Uppercase:]] // uppercase Cyrillic characters
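
The four operators correspond to ordinary set operations, which can be sketched with Python sets. This models only the meaning of the operators, not Alpag's syntax or implementation; the helper chars is made up for the example.

```python
def chars(first: str, last: str) -> set:
    """All characters from first to last, inclusive."""
    return {chr(c) for c in range(ord(first), ord(last) + 1)}

a_to_z = chars("A", "Z")
d_to_f = chars("D", "F")

union = a_to_z | chars("0", "9")             # ||  sum of two sets
difference = a_to_z - d_to_f                 # --  first minus second
common = chars("A", "M") & chars("H", "Z")   # &&  elements in both
either = chars("A", "M") ^ chars("H", "Z")   # ~~  in one but not both

# [[A-Z]--[D-F]] covers A-C and G-Z.
assert difference == chars("A", "C") | chars("G", "Z")
assert common == chars("H", "M")
assert either == chars("A", "G") | chars("N", "Z")
```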

Blocks

Regular expression elements can be grouped in blocks. The main reason for using blocks is to override operator precedence. Blocks can also be used to locally change certain aspects of regular expression processing. Named blocks can be used with the Capturing feature.

(contents) // simple block for grouping items
(?<name>:contents) // named block
(?#comment) // comment block
(?options:contents) // block with options.

Options are single character switches. Available options are:

i – ignore case
n – when set, the newline escape \n matches a literal LF (0xA)
s – single line mode. In this mode the . dot operator also matches newline characters.
x – ignore comments and whitespace in patterns
o – out of range. In this mode the . dot operator also matches out-of-range input characters

All the above options have default global settings (either on or off). These can be changed (switched on or off) locally in the block. Options enumerated in the header of the block are enabled (they can optionally be preceded by '+'). Options preceded by '-' are disabled.

Examples

(abc) // ordinary group of tokens
(?<MYGRP>:ab|cd) // named group
(?i:abc) // option 'i' becomes enabled
(?i-o:abc) // option 'i' enabled, option 'o' disabled
(?-o+i:abc) // option 'i' enabled, option 'o' disabled
a(?i:bc(?-i:de)f)g // option 'i' enabled in outer block but disabled in inner block.
(?# any comment allowed here, but without closing parenthesis )
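
The nesting of option blocks can be tried out in Python's re module, which supports the same (?i:...) and (?-i:...) forms for the i option. Python is used here only as a stand-in: it has no o or n options, writes named groups differently, and is case-sensitive by default, which is assumed to correspond to the i option being globally off.

```python
import re

# Option 'i' is on in the outer block and off again in the inner one.
pattern = re.compile(r"a(?i:bc(?-i:de)f)g")

assert pattern.fullmatch("abcdefg")
assert pattern.fullmatch("aBCdeFg")      # bc and f ignore case
assert not pattern.fullmatch("aBCDEFg")  # de stays case-sensitive
assert not pattern.fullmatch("AbcdefG")  # a and g are outside the block
```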

Context matching

A lexer rule pattern can be specified with additional symbols limiting the possible surrounding context of the match. These symbols are:

^regularExpression

a regular expression starting with the circumflex (caret) symbol is matched only if it appears at the beginning of a line

regularExpression$

a regular expression ending with the dollar symbol is matched only if the end of the matched text appears at the end of a line

Both symbols can be specified together.

The end of line (EOL) matched by either of the above symbols is not included in the matched text.

Symbols ^ and $ can be used only if the lexer is generated with EOL processing enabled.

A regular expression pattern can also be specified with an additional right context which is not included in the matched text. This right context is specified after a slash '/' and is a regular expression itself.

regularExpressionReturningMatch/regularExpressionRequiredToRight

Examples

^BEGIN // matches word 'BEGIN' only if it appears at the start of line
end$ // matches word 'end' only if it appears at end of line
^=+$ // matches entire line full of equal '=' signs
[a-zA-Z]+/\ *= // matches a word provided it is followed by optional spaces and an equals sign
END/\.$ // matches 'END' providing next char is a dot '.' which is last character in line
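
For readers used to other engines, the context examples can be approximated in Python's re module. This is an analogy, not Alpag syntax: Python needs the MULTILINE flag for line anchors and has no '/' operator, so the right context is written here as a lookahead (?=...). A lookahead may differ from Alpag's right context in how the longest match is chosen, so treat the sketch as a rough equivalent.

```python
import re

text = "BEGIN\nwidth  = 10\n====\nEND.\n"

# ^BEGIN and ^=+$ : line anchors need MULTILINE in Python.
assert re.search(r"^BEGIN", text, re.MULTILINE)
assert re.search(r"^=+$", text, re.MULTILINE).group() == "===="

# [a-zA-Z]+/\ *= : right context approximated by a lookahead.
m = re.search(r"^[a-zA-Z]+(?=\ *=)", text, re.MULTILINE)
assert m.group() == "width"  # the spaces and '=' are not in the match

# END/\.$ : the dot must be the last character in the line.
m = re.search(r"END(?=\.$)", text, re.MULTILINE)
assert m.group() == "END"
assert re.search(r"END(?=\.$)", "END.x", re.MULTILINE) is None
```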

Alpag Manual