The patterns at the heart of every flex scanner use a rich regular expression language.
A regular expression is a pattern description using a metalanguage, a language that you
use to describe what you want the pattern to match. Flex’s regular expression language
is essentially POSIX-extended regular expressions (which is not surprising considering
their shared Unix heritage). The metalanguage uses standard text characters, some of
which represent themselves and others of which represent patterns. All characters other
than the ones listed below, including all letters and digits, match themselves.
The characters with special meaning in regular expressions are:
- . Matches any single character except the newline character ( \n ).
- [] A character class that matches any character within the brackets. If the first char- acter is a circumflex ( ^ ), it changes the meaning to match any character except the ones within the brackets. A dash inside the square brackets indicates a character range; for example, [0-9] means the same thing as [0123456789] and [a-z] means any lowercase letter. A - or ] as the first character after the [ is interpreted literally to let you include dashes and square brackets in character classes. POSIX intro- duced other special square bracket constructs that are useful when handling non- English alphabets, described later in this chapter. Other metacharacters do not have any special meaning within square brackets except that C escape sequences starting with \ are recognized. Character ranges are interpreted relative to the character coding in use, so the range [A-z] with ASCII character coding would match all uppercase and lowercase letters, as well as six punctuation characters whose codes fall between the code for Z and the code for a . In practice, useful ranges are ranges of digits, of uppercase letters, or of lowercase letters.
- [a-z]{-}[jv] A differenced character class, with the characters in the first class omitting the characters in the second class (only in recent versions of flex).
- ^ Matches the beginning of a line as the first character of a regular expression. Also used for negation within square brackets.
- $ Matches the end of a line as the last character of a regular expression.
- {} If the braces contain one or two numbers, indicate the minimum and maximum number of times the previous pattern can match. For example, A{1,3} matches one to three occurrences of the letter A, and 0{5} matches 00000. If the braces contain a name, they refer to a named pattern by that name.
- \ Used to escape metacharacters and as part of the usual C escape sequences; for example, \n is a newline character, while * is a literal asterisk.
- Matches zero or more copies of the preceding expression. For example, [ \t]* is a common pattern to match optional spaces and tabs, that is, whitespace, which matches “ ”, “
”, or an empty string.
- Matches zero or more copies of the preceding expression. For example, [ \t]* is a common pattern to match optional spaces and tabs, that is, whitespace, which matches “ ”, “
- Matches one or more occurrences of the preceding regular expression. For example, [0-9]+ matches strings of digits such as 1 , 111 , or 123456 but not an empty string.
- ? Matches zero or one occurrence of the preceding regular expression. For example, -?[0-9]+ matches a signed number including an optional leading minus sign.
- | The alternation operator; matches either the preceding regular expression or the following regular expression. For example, faith|hope|charity matches any of the three virtues.
- “…” Anything within the quotation marks is treated literally. Metacharacters other than C escape sequences lose their meaning. As a matter of style, it’s good practice to quote any punctuation characters intended to be matched literally.
- () Groups a series of regular expressions together into a new regular expression. For example, (01) matches the character sequence 01, and a(bc|de) matches abc or ade. Parentheses are useful when building up complex patterns with *, +, ?, and |.
- / Trailing context, which means to match the regular expression preceding the slash but only if followed by the regular expression after the slash. For example, 0/1 matches 0 in the string 01 but would not match anything in the string 0 or 02 . The material matched by the pattern following the slash is not “consumed” and remains to be turned into subsequent tokens. Only one slash is permitted per pattern. The repetition operators affect the smallest preceding expression, so abc+ matches ab followed by one or more c’s. Use parentheses freely to be sure your expressions match what you want, such as (abc)+ to match one or more repetitions of abc.