A lexer operates on characters and requires character-oriented input. Typical sources of data for a lexer are byte-oriented streams or files. Character data in such streams is stored using particular encodings, which differ in the supported character range and in the method of storing characters in the underlying bytes. To use such byte-oriented data as lexer input, it must first be decoded.
Decoding can be done using standard library methods for handling input streams. In that case the input is converted to a stream of characters and can be fed directly to the lexer.
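As a minimal sketch of this approach (plain Python, not Alpag-generated code), the standard library can wrap a byte-oriented stream in a decoding reader, producing characters that could then be fed to a lexer directly:

```python
import io

# Byte-oriented source, e.g. a file or network stream.
raw = io.BytesIO("héllo".encode("utf-8"))

# The standard library decodes bytes to characters on the fly.
reader = io.TextIOWrapper(raw, encoding="utf-8")
text = reader.read()
assert text == "héllo"
```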
Handling character-oriented data is easy in modern languages like C#, which have built-in data types for storing characters. The underlying storage for such character types has the form of two-byte words. This storage format dates back to the early days of the Unicode standard, which used characters from a two-byte (16-bit) range. Since the extension of the Unicode standard to 20 bits, a single character no longer fits into a two-byte word. If the character storage is two-byte oriented, characters that fall outside the 16-bit range must be encoded using a variable-length encoding. This means that reading characters from a byte-oriented stream and saving them in two-byte words does not really decode the characters; it merely recodes them to another standard. Analyzing the characters contained in such two-byte words requires decoding them again, which is inefficient.
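The following illustration (plain Python, not Alpag code) shows the effect described above: a character outside the 16-bit range does not fit in one two-byte word, so UTF-16 stores it as a surrogate pair of two 16-bit code units, which must be decoded again before the actual code point can be analyzed:

```python
# Code point 0x1F600 lies above 0xFFFF, outside the two-byte range.
ch = "\U0001F600"
units = ch.encode("utf-16-le")
assert len(units) == 4                  # two 16-bit words, not one

high = int.from_bytes(units[0:2], "little")
low = int.from_bytes(units[2:4], "little")
assert 0xD800 <= high <= 0xDBFF         # high surrogate word
assert 0xDC00 <= low <= 0xDFFF          # low surrogate word
```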
Lexers generated by Alpag can be configured to read input in either of these ways:
Internally the lexer stores characters using four-byte double words. Decoding both byte-oriented and two-byte-oriented data to this internal format introduces some overhead.
Lexers generated by Alpag, when reading byte-oriented input, can process the following input encodings:
| name | storage | character range | description |
| --- | --- | --- | --- |
| ASCII | 7 bits (1 byte) | 7-bit (0..127) | standard ASCII |
| ASCIIRAW (ASCII using 8 bits) | 8 bits (1 byte) | 7-bit (0..127); the 128..255 range can be transferred transparently | ASCII extended to the 0..255 range |
| ASCIICP (ASCII with code page) | 8 bits (1 byte) | depends on the code page | one-byte encoded national character set |
| UTF-8 | 20 bits (1 to 4 bytes) | full Unicode | byte-oriented variable-length encoding |
| UCS-2 | 16 bits (2 bytes) | Unicode BMP (0..65535 range); surrogate codes can be transferred transparently | traditional fixed two-byte Unicode encoding supporting the Basic Multilingual Plane |
| UTF-16 | 20 bits (2 or 4 bytes) | full Unicode | modern variable-length two-byte Unicode encoding |
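A quick check (plain Python) of the variable-length storage listed in the table: UTF-8 uses 1 to 4 bytes per character, while UTF-16 uses 2 or 4 bytes:

```python
# (character, UTF-8 byte count, UTF-16 byte count)
samples = [
    ("A", 1, 2),             # ASCII
    ("é", 2, 2),             # Latin-1 supplement
    ("€", 3, 2),             # BMP, above 0x7FF
    ("\U0001F600", 4, 4),    # outside the BMP: surrogate pair in UTF-16
]
for ch, n8, n16 in samples:
    assert len(ch.encode("utf-8")) == n8
    assert len(ch.encode("utf-16-le")) == n16
```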
Alpag can be configured to generate lexers that:
Normally the lexer is configured to support all encodings that can appear on the input, but sometimes a simplified approach can be taken. Choosing to support a particular encoding means that all characters in this encoding will be properly converted to full-scale (32-bit) codes. If the lexer has no specific rules for all those characters, there may be no point in analyzing them.
UTF-8 is an example of a byte-oriented encoding that is backward compatible with ASCII. Characters outside ASCII appear as bytes in the 128..255 range. It is possible to transfer such bytes transparently, without interpreting them as Unicode characters, by using the ASCIIRAW format. This approach requires some caution though: the user must be sure that no lexer rule cuts a multibyte UTF-8 character 'in half'.
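The following illustrates the caveat (plain Python, not Alpag code): 'é' is one character but two UTF-8 bytes, and a rule boundary falling between the lead byte and its continuation byte leaves fragments that no longer decode:

```python
data = "é".encode("utf-8")
assert data == b"\xc3\xa9"      # one character, two bytes

# Splitting between 0xC3 and 0xA9 cuts the character 'in half':
# the lead byte alone is not valid UTF-8.
try:
    data[:1].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
assert split_ok is False
```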
A similar optimization is possible for the UCS-2 / UTF-16 pair.
The format and encoding of lexer output can also be configured. Usually lexer output is set to the native string format used throughout the application, and so can be made fixed (not switchable). However, if the application is tunneling data from input to another output stream, the ability to use a custom output encoding may come in handy.
Options controlling input encoding are in the Lexer.In group.
Options controlling output encoding are in the Lexer.Out group.
One of the key lexer configuration parameters is the character range. The character range specifies the maximum character code the lexer can handle.
Lexer character range calculation is based on the following assumptions:
The user can explicitly specify a lexer character range narrower than the range of the input encodings. It is not possible, though, to make it narrower than the range of characters used in lexer rules.
When the lexer range is narrower than the range of the input encodings, some characters read from the input may fall outside the lexer range. Such characters are called out-of-range characters (OOR characters). Before being passed to the lexer, out-of-range characters are converted to a special OOR symbol. Lexer rules can contain the special symbol \! (escaped exclamation mark), which matches this OOR symbol (and thus any out-of-range character).
Note that conversion to the OOR symbol is done only for the purpose of processing inside the lexer engine. The original characters read from the input remain intact. When the lexer reports a match, the matched text contains these original characters. All characters can be transferred verbatim from input to output, even if the lexer automaton does not explicitly recognize them.
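As a conceptual sketch only (plain Python, not Alpag's actual implementation): characters above the lexer range are mapped to a single OOR sentinel for the automaton, while the original text is kept untouched for reporting matches. The range limit and sentinel value here are hypothetical:

```python
LEXER_RANGE_MAX = 0xFF    # hypothetical 8-bit lexer range
OOR = 0x100               # hypothetical sentinel just outside that range

def to_lexer_symbols(text):
    """Map each character to its code, or to OOR if out of range."""
    return [ord(c) if ord(c) <= LEXER_RANGE_MAX else OOR for c in text]

original = "aé€"                       # '€' (0x20AC) is out of range
assert to_lexer_symbols(original) == [0x61, 0xE9, OOR]
assert original == "aé€"               # the input text remains intact
```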
An illustration of the character ranges and their relationships is shown in Fig. 7. The presented example shows a lexer character range that was manually configured to be narrower than the sum of the input ranges, resulting in a nonempty OOR range.
Lexer ranges calculated by Alpag are, by default, rounded up to the nearest well-known range limit such as 8-bit, 16-bit, or 20-bit. You can manually set the lexer range to any explicit value.
When lexer rules contain named character ranges like [[:Letter:]], the characters in these ranges add up to the summary range of characters used in the grammar. When a named range extends to the entire Unicode range, it may be impossible to narrow down the lexer range. In such a case the named character range should be trimmed, using set operators, to the required subrange.
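For instance, assuming a character-class intersection operator is available (hypothetical syntax; consult the Alpag regular expression reference for the exact set operators), the named range can be restricted to an 8-bit subrange like this:

```
[[:Letter:]&&[\u0000-\u00FF]]
```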
Options controlling the lexer input range can be found in the Lexer.Range group.
Some options related to OOR characters are also present in the Lexer.Regex group.
Certain regular expression elements refer to the concept of 'all characters'. These are:
The behavior of these elements depends on several options.
By default, the . (dot) symbol does not match EOL symbols.
To match any character including EOL symbols, the expression [.\n] must be used.
A negated class like [^a-z] matches any character not in the a-z range, including EOL.
This default behavior can be modified using the 's' (single line) option:
Inside a block with the 's' option turned on, any occurrence of . also matches EOLs.
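An analogy in Python's re module (not Alpag syntax) shows the same convention: by default '.' does not match a newline, the DOTALL flag plays the role of the 's' option, and a negated class matches EOL even without it:

```python
import re

assert re.match(r".", "\n") is None                # '.' skips EOL by default
assert re.match(r".", "\n", re.S) is not None      # 's'-style flag: '.' matches EOL
assert re.match(r"[^a-z]", "\n") is not None       # negated class includes EOL
```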
By default, neither the dot symbol . nor a negated class [^...] matches out-of-range (OOR) characters. This behavior can be modified using the 'x' (out-of-range) option:
Inside a block with the 'x' option on, any occurrence of . also matches OORs.
Moreover, if the option Lexer.Regex.RangeIncludesOOR is set to true, the dot symbol . and negated classes [^...] match OOR characters by default.
OOR characters can also be explicitly matched using the \! escape. The expression [.\!] also matches OORs, regardless of other settings.
Range handling depends on the EOL handling mode set for the lexer. Additional explanation of its principles is provided below. The examples given here assume that:
The lexer will process EOL symbols when the option Lexer.Eol.Mode is set to either Symbol or Chars.
When the EOL mode is set to Symbol, newline sequences (CR LF and LF) are converted to a special EOL symbol. In that case:
When the EOL mode is set to Chars, the CR and LF codes are passed to the lexer untouched. Occurrences of the \n escape are replaced by the regular expression (\n|\r\n).
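A sketch of this Chars-mode rewrite using Python's re module (not Alpag itself): the \n escape behaves like the alternation (\n|\r\n), so both Unix and Windows line endings are matched:

```python
import re

# The alternation that \n expands to in Chars mode.
eol = re.compile(r"(\n|\r\n)")
assert eol.fullmatch("\n") is not None      # Unix line ending
assert eol.fullmatch("\r\n") is not None    # Windows line ending
```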
Handling of the 'entire lexer range' is simplified in this mode: codes that appear anywhere in EOL sequences are simply removed from the default lexer range, without paying attention to the exact format of the EOL sequences. This results in the following behavior: