flex - Generates a C Language lexical analyzer
flex [-bcdfinpstvFILT8] -C[efmF] [-Sskeleton] [file...]
-b
    Generates backtracking information to lex.backtrack. This
    is a list of scanner states that require backtracking and
    the input characters on which they do so. By adding rules
    you can remove backtracking states. If all backtracking
    states are eliminated and -f or -F is used, the generated
    scanner will run faster.
-d
    Makes the generated scanner run in debug mode. Whenever a
    pattern is recognized and the global yy_flex_debug is
    nonzero (which is the default), the scanner writes to
    stderr a line of the form:

        --accepting rule at line 53 ("the matched text")

    The line number refers to the location of the rule in the
    file defining the scanner (the input to flex). Messages
    are also generated when the scanner backtracks, accepts
    the default rule, reaches the end of its input buffer (or
    encounters a NULL), or reaches an End-of-File.
-f
    Specifies full table (no table compression is done). The
    result is large but fast. This option is equivalent to
    -Cf.
-i
    Instructs flex to generate a case-insensitive scanner. The
    case of letters given in the flex input patterns will be
    ignored, and tokens in the input will be matched
    regardless of case. The matched text given in yytext will
    have the original case (as read by the scanner).
-p
    Generates a performance report to stderr. This identifies
    features of the flex input file that will cause a loss of
    performance in the resulting scanner.
-s
    Causes the default rule (that unmatched scanner input is
    echoed to stdout) to be suppressed. If the scanner
    encounters input that does not match any of its rules, it
    aborts with an error.
-t
    Instructs flex to write the scanner it generates to
    standard output instead of lex.yy.c.
-v
    Specifies that flex should write to stderr a summary of
    statistics regarding the scanner it generates.
-F
    Specifies that the fast scanner table representation
    should be used. This representation is about as fast as
    the full table representation (-f), and for some sets of
    patterns will be considerably smaller (and for others,
    larger). This option is equivalent to -CF.
-I
    Instructs flex to generate an interactive scanner; that
    is, a scanner that stops immediately rather than looking
    ahead if it knows that the currently scanned text cannot
    be part of a longer rule's match. Note that -I cannot be
    used in conjunction with full or fast tables; that is,
    the -f, -F, -Cf, or -CF options.
-L
    Instructs flex not to generate #line directives in
    lex.yy.c. The default is to generate such directives so
    error messages in the actions will be correctly located
    with respect to the original flex input file.
-T
    Makes flex run in trace mode. It will generate a lot of
    messages to stdout concerning the form of the input and
    the resultant nondeterministic and deterministic finite
    automata. This option is mostly for use in maintaining
    flex.
-8
    Instructs flex to generate an 8-bit scanner (which is the
    default).
-C[efmF]
    Controls the degree of table compression. The default
    setting is -Cem, which provides the highest degree of
    table compression. Faster-executing scanners can be
    obtained at the cost of larger tables, with the following
    generally being true:

        Slowest and smallest
            -Cem -Cm -Ce -C -C{f,F}e -C{f,F}
        Fastest and largest

    The -C options are not cumulative; whenever the option is
    encountered, the previous -C settings are forgotten. The
    -f or -F and -Cm options do not make sense together; there
    is no opportunity for meta-equivalence classes if the
    table is not being compressed. Otherwise, the options may
    be freely mixed.

    -C
        A lone -C specifies that the scanner tables should be
        compressed and neither equivalence classes nor
        meta-equivalence classes should be used.
    -Ce
        Directs flex to construct equivalence classes; that
        is, sets of characters that have identical lexical
        properties. Equivalence classes usually give dramatic
        reductions in the final table/object file sizes
        (typically a factor of 2 to 5) and are inexpensive
        performance-wise (one array look-up per character
        scanned).
    -Cm
        Directs flex to construct meta-equivalence classes,
        which are sets of equivalence classes (or characters,
        if equivalence classes are not being used) that are
        commonly used together. Meta-equivalence classes are
        often a big win when using compressed tables, but they
        have a moderate performance impact (one or two "if"
        tests and one array look-up per character scanned).
    -Cf
        Specifies that the full scanner tables should be
        generated; flex should not compress the tables by
        taking advantage of similar transition functions for
        different states.
    -CF
        Specifies that the alternative fast scanner
        representation should be used.
-Sskeleton
    Overrides the default skeleton file from which flex
    constructs its scanners. This is useful for flex
    maintenance or development.
-c
    Specifies table-compression options. (Obsolescent)
-n
    Suppresses the statistics summaries that the -v option
    typically generates. (Obsolete)
The flex command is a tool for generating scanners: programs
which recognize lexical patterns in text. The flex
command reads the given input files, or its standard input
if no filenames are given or if a file operand is - (dash)
for a description of a scanner to generate. The description
is in the form of pairs of regular expressions and C
code, called rules. The flex command generates as output
a C source file, lex.yy.c, which defines a routine
yylex(). This file is compiled and linked with the -ll
library to produce an executable. When the executable is
run, it scans its input, matching it against the regular
expressions in its rules and looking for the best match
(the longest input). When it has selected a rule, it
executes the associated C code, which has access to the
matched input sequence (commonly referred to as a token).
This process then repeats until input is exhausted.
The flex command treats multiple input files as one.
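As a rough illustration of the workflow just described, the following sketch is a complete input file that counts the lines and characters on standard input. The file name count.l and the counters num_lines and num_chars are invented for this example, and it assumes that the default yywrap() is supplied by linking with -ll.

```lex
%{
/* count.l (hypothetical name): count lines and characters on stdin. */
#include <stdio.h>

static int num_lines = 0;   /* illustrative counters, not part of flex */
static int num_chars = 0;
%}
%%
\n      { ++num_lines; ++num_chars; }
.       { ++num_chars; }
%%
/* A user-supplied main() is used instead of the library default. */
int main(void)
{
    yylex();    /* returns 0 (zero) once the input is exhausted */
    printf("lines=%d chars=%d\n", num_lines, num_chars);
    return 0;
}
```

Under those assumptions it could be built with flex count.l followed by cc -o count lex.yy.c -ll.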
Syntax for Input
This section contains a description of the flex input
file, which is normally named with a .l suffix. The section
also lists the special values, macros, and functions
recognized by flex.
The flex input file consists of three sections, separated
by a line with just %% in it:

    [ definitions ]
    %%
    [ rules ]
    [ %%
    [ user functions ]]

definitions
    Contains declarations to simplify the scanner
    specification, and declarations of start states, which
    are explained below.
rules
    Describes what the scanner is to do.
user functions
    Contains user-supplied functions that are copied straight
    through to lex.yy.c.

With the exception of the first %% sequence, all sections
are optional. The minimal scanner, %%, copies its input to
standard output.
Each line in the definitions section can be:

name regexp
    Defines name to expand to regexp. name is a word
    beginning with a letter or an underscore (_) followed by
    zero or more letters, digits, underscores, or dashes (-).
    In the regular-expression parts of the rules section,
    flex substitutes regexp wherever you refer to {name}
    (name within braces).
%s state [state...]
%x state [state...]
    Defines names for states used in the rules section. A
    rule may be made conditionally active based on the
    current scanner state. Multiple lines defining states can
    appear, and each can contain multiple state names,
    separated by white space. The name of a state follows the
    same syntax as that of regexp names except that dashes
    (-) are not permitted. Unlike regexp names, state names
    share the C #define namespace. In the rules section,
    states are recognized as <state> (state within angle
    brackets).

    The %x directive names exclusive states. When a scanner
    is in an exclusive state, only rules prefixed with that
    state are active. Inclusive states are named with the %s
    directive.
%{
%}
    When placed on lines by themselves, these symbols enclose
    C code to be passed verbatim into the global definitions
    of the output file. Such lines commonly include
    preprocessor directives and declarations of external
    variables and functions.
space or tab
    Lines beginning with a space or tab in the definitions
    section are passed directly into the lex.yy.c output
    file, as part of the initial global definitions.
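The kinds of definition lines described above might be combined as in the following sketch. The names DIGIT, ID, and COMMENT are made up for illustration, and the rules are only there to show the definitions in use; main() and yywrap() are assumed to come from the -ll library.

```lex
%{
/* Copied verbatim into the global definitions of lex.yy.c. */
#include <stdio.h>
%}
DIGIT   [0-9]
ID      [A-Za-z_][A-Za-z0-9_]*
%x COMMENT
%%
"/*"            BEGIN(COMMENT);     /* enter the exclusive state */
<COMMENT>"*/"   BEGIN(INITIAL);     /* return to the initial state */
<COMMENT>.|\n   ;                   /* discard comment text */
{DIGIT}+        printf("number: %s\n", yytext);
{ID}            printf("identifier: %s\n", yytext);
.|\n            ;                   /* discard everything else */
```

Because COMMENT is declared with %x, the number and identifier rules are inactive while the scanner is inside a comment; declaring it with %s instead would leave them active.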
The rules section follows the definitions, separated by a
line consisting of %%. The rules section contains rules
for matching input and taking actions, in the following
format: pattern [action]
The pattern starts in the first column of the line and
extends until the first non-escaped white space character.
The flex command attempts to find the pattern that matches
the longest input sequence and execute the associated
action. If two or more patterns match the same longest
input sequence, the one that appears first in the rules
section is chosen. If no action exists, the matched input
is discarded. If no pattern matches the input, the default
is to copy it to standard output.
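A small sketch of how these selection rules interact, using a hypothetical keyword rule placed ahead of a general identifier rule:

```lex
%{
#include <stdio.h>
%}
%%
if          printf("keyword if\n");
[a-z]+      printf("identifier %s\n", yytext);
[ \t\n]+    ;   /* white space is matched and discarded */
```

For the input if, both of the first two patterns match two characters, so the rule listed first is chosen. For the input iffy, the second pattern matches four characters and wins because it is the longest match. A character that no rule matches, such as a digit, is copied to standard output by the default rule.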
All action code is placed in the yylex() function. Text (C
code or declarations) placed at the beginning of the rules
section is copied to the beginning of the yylex() function
and may be used in actions. This text must begin with a
space or a tab (to distinguish it from rules). In addition,
any input (beginning with a space or within %{ and
%} delimiter lines) appearing at the beginning of the
rules section before any rules are specified will be written
to lex.yy.c after the declarations of variables for
the yylex() function and before the first line of code in
yylex().
Elements of each rule are:

state
    A pattern may begin with a comma-separated list of state
    names enclosed by angle brackets (<state[,state...]>).
    These states are entered via the BEGIN statement. If a
    pattern begins with a state, the scanner can only
    recognize it when in that state. The initial state is 0
    (zero).
regexp
    A regular expression to match against the input stream.
    The regular expressions in flex provide a rich character
    matching syntax. The following characters, shown in order
    of decreasing precedence, have special meanings:

    x
        Matches the character x.
    " "
        Enclose characters and treat them as literal strings.
        For example, "*+" is treated as the asterisk
        character followed by the plus character.
    \str
        If str is one of the characters a, b, f, n, r, t, or
        v, then the ANSI C interpretation is adopted (for
        example, \n is a newline). If str is a string of
        octal digits, it is interpreted as a character with
        octal value str. If str is a string of hexadecimal
        digits with a leading x, it is interpreted as a
        character with that value. Otherwise, it is
        interpreted literally with no special meaning. For
        example, x\*yz represents the four characters x*yz.
    [ ]
        Represents a character class in the enclosed range
        ([.-.]) or the enclosed list ([...]). The dash
        character is used to define a range of characters
        from the ASCII value or the 8-bit class of the
        character that comes before it to the ASCII value or
        the 8-bit class of the character that follows it. For
        example, [abcx-z] matches a, b, c, x, y, or z.
    ^
        The circumflex, when it appears as the first
        character in a character class, indicates the
        complement of the set of characters within that
        class. For example, [^abc] matches any character
        except a, b, or c, including special characters like
        newline.
    ( )
        Groups regular expressions. For example, (ab) will be
        considered as a single regular expression.
    { }
        When enclosing numbers, indicates a number of
        consecutive occurrences of the expression that comes
        before it. For example, (ab){1,5} indicates a match
        for from 1 to 5 occurrences of the string ab.

        When enclosing a name, the name represents a regular
        expression defined in the definitions section. For
        example, {digit} is replaced by the defined regular
        expression for digit. Note that the expansion takes
        place as if the definition were enclosed in
        parentheses.
    .
        Matches any single character except newline.
    ?
        Matches zero or one occurrence of the preceding
        expression. For example, ab?c matches both ac and
        abc.
    *
        Matches zero or more occurrences of the preceding
        expression. For example, a* is zero or more
        consecutive a characters. The utility of matching
        zero occurrences is more obvious in complicated
        expressions. For example, the expression
        [A-Za-z][A-Za-z0-9]* indicates all alphanumeric
        strings with a leading alphabetic character,
        including strings that are only one alphabetic
        character.
    +
        Matches one or more occurrences of the preceding
        expression. For example, [a-z]+ is all strings of
        lowercase letters.
    xy
        Matches the expression x followed by the expression
        y.
    |
        Matches either the preceding expression or the
        following expression. For example, ab|cd matches
        either ab or cd.
    x/y
        Matches expression x only if expression y (trailing
        context) immediately follows it. For example, ab/cd
        matches the string ab but only if followed by cd.
        Only one trailing context is permitted per pattern.
    ^
        When it appears at the beginning of the pattern,
        matches the beginning of a line. For example, ^abc
        will match the string abc if it is found at the
        beginning of a line.
    $
        When it appears at the end of a pattern, matches the
        end of a line. It is equivalent to /\n. For example,
        abc$ will match the string abc if it is found at the
        end of a line.
    <<EOF>>
        Matches an End-of-File.
    <state>
        Identifies a state name (see above) and may only
        appear at the beginning of a pattern. For example,
        <done><<EOF>> matches an End-of-File, but only if it
        is in state done.

    In addition, the following rules apply for bracket
    expressions:

    Equivalence class expressions
        These represent the set of collating elements in an
        equivalence class and are enclosed within
        bracket-equal delimiters ([= =]). An equivalence
        class generally is designed to deal with
        primary-secondary sorting; that is, for languages
        like French that define groups of characters as
        sorting to the same primary location, and then have a
        tie-breaking, secondary sort. For example, if a, à,
        and â belong to the same equivalence class, then
        [[=a=]b], [[=à=]b], and [[=â=]b] are each equivalent
        to [aàâb].
    Character class expressions
        These represent the set of characters in the current
        locale belonging to the named ctype class. These are
        expressed as a ctype class name enclosed in
        bracket-colon delimiters ([: :]).

        In the C or POSIX locale, this operating system
        supports the following character class expressions:

            [:alpha:], [:upper:], [:lower:], [:digit:],
            [:alnum:], [:xdigit:], [:space:], [:print:],
            [:punct:], [:graph:], [:cntrl:].

        Other locales may define additional character
        classes.

    Letters and digits never have special meanings. A
    character such as ^ or -, which has a special meaning in
    particular contexts, refers simply to itself when found
    outside that context. Spaces and tabs must be escaped to
    appear in a regular expression; otherwise they indicate
    the end of the expression.
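Several of the operators above are combined in the following sketch. The definition name DIGIT and the px unit suffix are invented for the example; main() and yywrap() are assumed to come from the -ll library.

```lex
%{
#include <stdio.h>
%}
DIGIT   [0-9]
%%
^#.*                ;   /* '#' at the start of a line: skip the line */
{DIGIT}+/px         printf("pixels: %s\n", yytext);
{DIGIT}+            printf("number: %s\n", yytext);
[[:alpha:]_]+       printf("word: %s\n", yytext);
[ \t]+$             ;   /* blanks at the end of a line are dropped */
[ \t]+              putchar(' ');
.|\n                ECHO;
```

In the second rule, px is trailing context: it must follow the digits for the rule to match, but it is not part of yytext and is scanned again afterward, where it would be picked up by the word rule.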
action
    Each pattern in a rule has a corresponding action, which
    can be any arbitrary C statement. The pattern ends at the
    first non-escaped white space character; the remainder of
    the line is its action. If the action is empty, then when
    the pattern is matched, the input that matched it is
    discarded. If the action contains a {, then the action
    spans until the balancing } is found, and the action may
    cross multiple lines. Using a return statement in an
    action returns from yylex().

    An action consisting solely of a vertical bar (|) means
    the same as the action for the next rule.
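The action forms described above could look like the following sketch; the patterns are arbitrary and chosen only to show a shared action, a braced multi-line action, and an empty action.

```lex
%{
#include <stdio.h>
#include <stdlib.h>
%}
%%
"<="        |
">="        |
"=="        printf("relational operator: %s\n", yytext);
[0-9]+      {
                /* Braces let the action span several lines. */
                int value = atoi(yytext);
                printf("integer: %d\n", value);
            }
[ \t\n]+    ;   /* empty statement: the matched text is discarded */
```

The first two rules use the vertical bar, so all three operator patterns share the printf() action attached to the third.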
The flex variables that can be used within actions are:

yytext
    A string (char *) containing the current matched input.
    It cannot be modified.
yyleng
    The length (int) of the current matched input. It cannot
    be modified.
yyin
    A stream (FILE *) that flex reads from (stdin by
    default). It may be changed, but because of the buffering
    flex uses, this makes sense only before scanning begins.
    Once scanning terminates because an End-of-File was seen,
    void yyrestart(FILE *new_file) may be called to point
    yyin at a new input file. Alternatively, yyin may be
    changed whenever a new or different buffer is selected
    (see yy_switch_to_buffer()).
yyout
    A stream (FILE *) to which ECHO output is written (stdout
    by default). It can be changed by the user.
YY_CURRENT_BUFFER
    Returns the current buffer (YY_BUFFER_STATE) used for
    scanner input.
The flex command macros and functions that may be used
within actions are:

ECHO
    Copies yytext to the scanner's output.
BEGIN state
    Changes the scanner state to be state. This affects which
    rules are active. The state must be defined in a %s or %x
    definition. The initial state of the scanner is INITIAL
    or 0 (zero).
REJECT
    Directs the scanner to proceed immediately to the next
    best pattern that matches the input (which may be a
    prefix of the current match). yytext and yyleng are reset
    appropriately. Note that REJECT is a particularly
    expensive feature in terms of scanner performance; if it
    is used in any of the scanner's actions, it will slow
    down all of the scanner's pattern matching operations.
    REJECT cannot be used if flex is invoked with either the
    -f or -F option.
yymore()
    Indicates that the next matched text should be appended
    to the currently matched text in yytext (rather than
    replace it).
yyless(n)
    Returns all but the first n characters of the current
    token back to the input stream, where they will be
    rescanned when the scanner looks for the next match.
    yytext and yyleng are adjusted accordingly.
yywrap()
    Returns 0 (zero) if there is more input to scan or 1 if
    there is not. The default yywrap() always returns 1.
    Currently it is implemented as a macro; however, in
    future implementations it may become a function.
yyterminate()
    Can be used in lieu of a return statement in an action.
    It terminates the scanner and returns a 0 (zero) to the
    scanner's caller.

    yyterminate() is automatically called when an End-of-File
    is encountered. It is a macro and may be redefined.
yy_create_buffer(file, size)
    Returns a YY_BUFFER_STATE handle to a new input buffer
    large enough to accommodate size characters and
    associated with the given file. When in doubt, use
    YY_BUF_SIZE for the size.
yy_switch_to_buffer(buffer)
    Switches the scanner's processing to scan for tokens from
    the given buffer, which must be a YY_BUFFER_STATE.
yy_delete_buffer(buffer)
    Deletes the given buffer.
YY_NEW_FILE
    Enables scanning to continue after yyin has been pointed
    at a new file to process.
YY_DECL
    Controls how the scanning function, yylex(), is declared.
    By default, it is int yylex(), or, if prototypes are
    being used, int yylex(void). This definition may be
    changed by redefining the YY_DECL macro. This macro is
    expanded immediately before the {...} (braces) that
    delimit the scanner function body.
YY_INPUT
    Controls scanner input. By default, YY_INPUT reads from
    the file-pointer yyin. Its action is to place up to
    max_size characters in the character array buf and return
    in the integer variable result either the number of
    characters read or the constant YY_NULL to indicate EOF.
    Following is a sample redefinition of YY_INPUT, in the
    definitions section of the input file:

        %{
        #undef YY_INPUT
        #define YY_INPUT(buf,result,max_size) \
            { \
            int c = getchar(); \
            result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
            }
        %}

    When the scanner receives an End-of-File indication from
    YY_INPUT, it checks the yywrap() function. If yywrap()
    returns zero, it is assumed that yyin has been set up to
    point to another input file, and scanning continues. If
    it returns nonzero, then the scanner terminates,
    returning zero to its caller.
YY_USER_ACTION
    Redefinable to provide an action that is always executed
    prior to the matched pattern's action.
YY_USER_INIT
    Redefinable to provide an action that is always executed
    before the first scan.
YY_BREAK
    Is used in the scanner to separate different actions. By
    default, it is simply a break, but may be redefined if
    necessary.
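As one possible use of a redefinable macro from the list above, the following sketch redefines YY_USER_ACTION to keep a running count of the characters matched so far. The variable offset is hypothetical and not part of flex; the sketch assumes that a definition placed in the %{ %} block takes precedence over the scanner's own default.

```lex
%{
#include <stdio.h>

static long offset = 0;     /* hypothetical: characters matched so far */

/* Executed before the action of every matched rule. */
#define YY_USER_ACTION  offset += yyleng;
%}
%%
[A-Za-z]+   printf("%s ends at offset %ld\n", yytext, offset);
.|\n        ;
```

Because the macro runs before the rule's action, offset already includes the current match by the time printf() is called.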
The user functions section consists of complete C functions,
which are passed directly into the lex.yy.c output
file (the effect is similar to defining the functions in
separate files and linking them with lex.yy.c). This section
is separated from the rules section by the %% delimiter.
Comments, in C syntax, can appear anywhere in the user
functions or definitions sections. In the rules section,
comments can be embedded within actions. Empty lines or
lines consisting of white space are ignored.
The following macros are not normally called explicitly
within an action, but are used internally by flex to handle
the input and output streams.

input()
    Reads the next character from the input stream. You
    cannot redefine input().
output()
    Writes the next character to the output stream.
unput(c)
    Puts the character c back onto the input stream. It will
    be the next character scanned. You cannot redefine
    unput().
The libl.a library contains default functions to support
testing or quick use of a flex program without yacc; these
functions can be linked in through -ll. They can also be
provided by the user.

main()
    A simple wrapper that calls setlocale() and then calls
    the yylex() function.
yywrap()
    The function called when the scanner reaches the end of
    an input stream. The default definition simply returns 1,
    which causes the scanner in turn to return 0 (zero).
Some trailing context patterns cannot be properly matched
and generate the following warning message:

    Dangerous trailing context

These are patterns where the ending of the first part of
the rule matches the beginning of the second part, such as
zx*/xy*, where the x* matches the x at the beginning of the
trailing context.

For some trailing context rules, parts that are actually
fixed length are not recognized as such, leading to the
previously mentioned performance loss. In particular,
patterns using {n} (such as test{3}) are always considered
variable length.

Combining trailing context with the special | (vertical
bar) action can result in fixed trailing context being
turned into the more expensive variable trailing context.
This happens in the following example:

    %%
    abc      |
    xyz/def

Use of unput() invalidates the contents of yytext and
yyleng within the current flex action.

Use of unput() to push back more text than was matched can
result in the pushed-back text matching a beginning-of-line
(^) rule even though it did not come at the beginning of
the line.

Pattern matching of NULLs is substantially slower than
matching other characters.

The flex command does not generate correct #line directives
for code internal to the scanner; thus, bugs in flex.skel
yield invalid line numbers.

Due to both buffering of input and read-ahead, you cannot
intermix calls to <stdio.h> routines, such as getchar(),
with flex rules and expect it to work. Call input()
instead.

The total table entries listed by the -v option exclude
the number of table entries needed to determine what rule
was matched. The number of entries is equal to the number
of deterministic finite-state automaton (DFA) states if the
scanner does not use REJECT, and somewhat greater than the
number of states if it does.

REJECT cannot be used with the -f or -F options.
The following command processes the file lexcommands to
produce the scanner file lex.yy.c:

    flex lexcommands

This is then compiled and linked by the command:

    cc -o scanner lex.yy.c -ll

This produces a program named scanner. The scanner program
converts uppercase to lowercase letters, removes spaces at
the end of a line, and replaces multiple spaces with single
spaces. The lexcommands file contains:

    %{
    #include <ctype.h>
    %}
    %%
    [A-Z]   putchar(tolower(yytext[0]));
    [ ]+$
    [ ]+    putchar(' ');
flex.skel
    Skeleton scanner.
lex.yy.c
    Generated scanner C source.
lex.backtrack
    Backtracking information generated from -b option.
Commands: yacc(1), sed(1), awk(1)
Files: locale(4)
flex(1)