
lex(1)

NAME

       lex - Generates programs for lexical tasks

SYNOPSIS

       lex [-ct] [-n | -v] [file...]

       [Tru64 UNIX]  The following syntax applies when the CMD_ENV
       environment variable is set to svr4:

       lex [-crt] [-n | -v] [-V] [-Qy | -Qn] [file...]

STANDARDS

       Interfaces documented on this reference page conform to
       industry standards as follows:

       lex:  XPG4, XPG4-UNIX

       Refer to the standards(5) reference page for more information
       about industry standards and associated tags.

OPTIONS

       -c     Writes C code to the file lex.yy.c. This is the default.

       -n     Suppresses the statistics summary. When you set your own
              table sizes for the finite state machine, lex automatically
              produces this summary unless you select this option.

       -r     [Tru64 UNIX]  Writes RATFOR code to the file lex.yy.r.
              (There is no RATFOR compiler for Tru64 UNIX.)

       -t     Writes to standard output instead of writing to a file.

       -v     Provides a summary of the generated finite state machine
              statistics.

       -V     [Tru64 UNIX]  Outputs the lex version number to standard
              error. Requires the environment variable CMD_ENV to be
              set to svr4.

       -Qy | -Qn
              [Tru64 UNIX]  Determines whether the lex version number
              is written to the output file. The -Qn option does not
              write it and is the default. Requires the environment
              variable CMD_ENV to be set to svr4.

DESCRIPTION

       The lex command uses the rules and actions contained in
       file to generate a program, lex.yy.c, which can be compiled
       with the cc command. That program can then receive
       input, break the input into the logical pieces defined by
       the rules in file, and run program fragments contained in
       the actions in file.

       The generated program is a C language function called
       yylex(). The lex command stores yylex() in a file named
       lex.yy.c. You can use yylex() alone to recognize simple,
       1-word input, or you can use it with other C language programs
       to perform more difficult input analysis functions.
       For example, you can use lex to generate a program that
       tokenizes an input stream before sending it to a parser
       program generated by the yacc command.

       The yylex() function analyzes the  input  stream  using  a
       program  structure  called  a  finite  state machine. This
       structure allows the program to exist in  only  one  state
       (or  condition)  at a time.  A finite number of states are
       allowed. The rules in file determine how the program moves
       from  one  state  to another in response to the input that
       the program receives.
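
       The state-machine idea described above can be sketched by hand in
       C. This is only an illustration and is not the code that lex
       generates: the state names and the step() and is_integer()
       functions are hypothetical, and the machine recognizes nothing
       more than an optionally signed decimal integer.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical states for a machine that accepts [+-]?[0-9]+ */
enum state { START, SIGN, DIGITS, FAIL };

/* Move from one state to the next in response to one input character. */
static enum state step(enum state s, int c)
{
    switch (s) {
    case START:
        if (c == '+' || c == '-') return SIGN;
        if (isdigit(c)) return DIGITS;
        return FAIL;
    case SIGN:
    case DIGITS:
        return isdigit(c) ? DIGITS : FAIL;
    default:
        return FAIL;            /* once failed, the machine stays failed */
    }
}

/* Return 1 if the whole string is an optionally signed integer. */
int is_integer(const char *s)
{
    enum state st = START;
    for (size_t i = 0; s[i] != '\0'; i++)
        st = step(st, (unsigned char)s[i]);
    return st == DIGITS;        /* DIGITS is the only accepting state */
}
```

       The machine is in exactly one state at any moment, and each input
       character selects the next state. A generated scanner works on the
       same principle, but with tables built from the rules in file.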

       The lex command reads its skeleton  finite  state  machine
       from the file /usr/ccs/lib/ncpform or /usr/ccs/lib/ncform.
       Use the environment  variable  LEXER  to  specify  another
       location for lex to read from.

       If you do not specify a file, lex reads standard input. It
       treats multiple files as a single file.

   Input File Format
       The input file can contain three sections: definitions,
       rules, and user subroutines. Each section must be separated
       from the others by a line containing only the delimiter,
       %%. The format is as follows:

       definitions
       %%
       rules
       %%
       user_subroutines

       The purpose and format of each of these sections are
       described under the headings that follow.

   Definitions Section
       If you want to use variables in rules, you must define
       them in the definitions section. The variables make up the
       left column, and their definitions make up the right column.
       For example, to define D as a numerical digit, enter:

       D    [0-9]

       You can use a defined variable in the rules section by
       enclosing the variable name in braces, {D}.

       In the definitions section, you can set either of the following
       two mutually exclusive declarations:

       %array     Declare the type of yytext to be a null-terminated
                  character array.
       %pointer   Declare the type of yytext to be a pointer to a
                  null-terminated character string. Use of the %pointer
                  definition selects the /usr/ccs/lib/ncpform skeleton.

       In the definitions section, you can also set table sizes
       for the resulting finite state machine. The default sizes
       are large enough for small programs. You may want to set
       larger sizes for more complex programs:

       %p number   Number of positions is number (default 5000)
       %n number   Number of states is number (default 2500)
       %e number   Number of parse tree nodes is number (default 2000)
       %a number   Number of transitions is number (default 5000)
       %k number   Number of packed character classes is number
                   (default 2000)
       %o number   Number of output slots is number (default 5000)

       If extended characters appear in regular expression
       strings, you may need to reset the output array size with
       the %o parameter (possibly to array sizes in the range
       10,000 to 20,000). This reset reflects the much larger
       number of extended characters relative to the number of
       ASCII characters.

   Rules Section
       The rules section is required, and it must be preceded by
       the %% delimiter, even if you do not have a definitions
       section. The lex command does not recognize rules without
       the delimiter.

       In this section, the left column contains the pattern to
       be recognized in an input file to yylex(). The right column
       contains the C program fragment executed when that
       pattern is recognized.

       Patterns can include extended characters with one exception:
       extended characters may not appear in range
       specifications within character class expressions surrounded
       by brackets.

       The columns are separated by a tab. For example, to search
       files for the word LEAD and replace it with GOLD, perform
       the following steps:

       1.  Create a file called transmute.l containing the lines:

              %%
              (LEAD)  printf("GOLD");

       2.  Issue the following commands to the shell:

              lex transmute.l
              cc -o transmute lex.yy.c -ll

       3.  Test the resulting program with the command:

              transmute <transmute.l

       This command echoes the contents of transmute.l, with the
       occurrences of LEAD changed to GOLD.

       Each pattern may have a corresponding action, that is, a
       fragment of C source code to execute when the pattern is
       matched. Each statement must end with a ; (semicolon).
       If you use more than one statement in an action, you must
       enclose all of them in {} (braces). A second delimiter,
       %%, must follow the rules section if you have a user
       subroutine section.

       When yylex() matches a string in the input stream, it
       copies the matched text to an external character array,
       yytext, before it executes any actions in the rules section.


       You can use the following operators to form patterns that
       you want to match:

       x, y, z, ...
              Matches the characters written.

       [ ]    Matches any one character in the enclosed range ([.-.])
              or the enclosed list ([...]). [abcx-z] matches a, b, c,
              x, y, or z.

       " "    Matches the enclosed character or string even if it is
              an operator. "$" prevents lex from interpreting the $
              character as an operator.

       \      Acts the same as double quotes. \$ prevents lex from
              interpreting the $ character as an operator.

       *      Matches zero or more occurrences of the single-character
              regular expression immediately preceding it. x* matches
              zero or more repeated literal characters x.

       +      Matches one or more occurrences of the single-character
              regular expression immediately preceding it.

       ?      Matches either zero or one occurrence of the
              single-character regular expression immediately
              preceding it.

       ^      Matches the character only at the beginning of a line.
              ^x matches an x at the beginning of a line.

       [^ ]   Matches any character except for the characters
              following the ^. [^xyz] matches any character but x, y,
              or z.

       .      Matches any character except the newline character.

       $      Matches the end of a line.

       |      Matches either of two characters. x|y matches either x
              or y.

       /      Matches one extended regular expression (ERE) only when
              followed by a second ERE. It reads only the first token
              into yytext. Given the regular expression a*b/cc and the
              input aaabcc, yytext would contain the string aaab on
              this match.

       ( )    Matches the pattern in the ( ) (parentheses). This is
              used for grouping. It reads the whole pattern into
              yytext. A group in parentheses can be used in place of
              any single character in any other pattern. (xyz123)
              matches the pattern xyz123 and reads the whole string
              into yytext.

       { }    Matches the character as defined in the definitions
              section. If D is defined as numeric digits, {D} matches
              all numeric digits.

       {m,n}  Matches m-to-n occurrences of the specified character.
              x{2,4} matches 2, 3, or 4 occurrences of x.

       If  a  line begins with only a space, lex copies it to the
       lex.yy.c output file. If the line is  in  the  definitions
       section of file, lex copies it to the declarations section
       of lex.yy.c. If the line is  in  the  rules  section,  lex
       copies it to the program code section of lex.yy.c.

   User Subroutines Section
       The lex library has three subroutines defined as macros
       that you can use in the rules:

       input()     Reads a character from yyin.
       unput(c)    Replaces a character after it is read.
       output(c)   Writes a character to yyout.

       You can override these three macros by writing your own
       code for these routines in the user subroutines section.
       If you write your own routines, you must undefine
       these macros in the definitions section as follows:

       %{
       #undef input
       #undef unput
       #undef output
       %}

       When you are using lex as a simple transformer/recognizer
       for stdin to stdout piping, you can avoid writing the
       framework by using libl.a (the lex library). It has a main
       routine that calls yylex() for you.

       External names generated by lex all begin with the prefix
       yy, as in yyin, yyout, yylex, and yytext.

   Putting Spaces in an Expression
       Normally,  spaces  or  tabs end a rule and, therefore, the
       expression that defines a rule.  However, you can  enclose
       the  spaces  or  tab  characters  in "" (double quotes) to
       include them in the  expression.  Use  quotes  around  all
       spaces  in expressions that are not already within sets of
       [ ] (brackets).

   Other Special Characters
       The lex program recognizes many of the normal C language
       special characters. These character sequences are as follows:

       Sequence   Meaning
       \n         Newline
       \t         Tab
       \b         Backspace
       \\         Backslash
       \digits    The character whose encoding is represented
                  by the three-digit octal number
       \xdigits   The character whose encoding is represented
                  by the hexadecimal integer

       Do not use the actual newline character in an  expression.

       When  using these special characters in an expression, you
       do not need to enclose them in quotes.   Every  character,
       except   these   special  characters  and  the  previously
       described operator symbols, is always a text character.

   Matching Rules
       When more than one expression can match the current input,
       lex chooses the longest match first. Among rules that
       match the same number of characters, the rule that occurs
       first is chosen. For example:

       integer   keyword action...;
       [a-z]+    identifier action...;

       If the preceding rules are given in that order and
       integers is the input word, lex matches the input as an
       identifier because [a-z]+ matches eight characters, while
       integer matches only seven. However, if the input is
       integer, both rules match seven characters. The keyword
       rule is selected because it occurs first. A shorter input,
       such as int, does not match the expression rule integer
       and causes lex to select the rule identifier.
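
       The selection policy described above (longest match, then earliest
       rule) can be sketched in plain C. This is an illustration under
       stated assumptions, not lex output: match_keyword(),
       match_identifier(), and pick_rule() are hypothetical helpers that
       stand in for the two rules in the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length matched by the literal pattern "integer" at the start of s, or 0. */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length matched by [a-z]+ at the start of s, or 0. */
static size_t match_identifier(const char *s)
{
    size_t n = 0;
    while (islower((unsigned char)s[n]))
        n++;
    return n;
}

/*
 * Return the index of the winning rule (0 = keyword, 1 = identifier),
 * or -1 if neither matches. Rules are tried in file order, and a later
 * rule wins only if its match is strictly longer.
 */
int pick_rule(const char *s)
{
    size_t (*rules[])(const char *) = { match_keyword, match_identifier };
    size_t best_len = 0;
    int best = -1;
    for (int i = 0; i < 2; i++) {
        size_t len = rules[i](s);
        if (len > best_len) {
            best_len = len;
            best = i;
        }
    }
    return best;
}
```

       With these helpers, pick_rule("integers") selects the identifier
       rule, pick_rule("integer") selects the keyword rule because it
       comes first, and pick_rule("int") falls through to the identifier
       rule.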

   Matching a String with Wildcard Characters
       Because lex chooses the longest match first, do not use
       rules containing expressions like .* (dot star). For example:

       '.*'

       The preceding rule might seem like a good way to recognize
       a string in single quotes. However, the lexical analyzer
       reads far ahead, looking for a distant single quote to
       complete the long match. If a lexical analyzer with such
       a rule gets the following input, it matches everything from
       the first single quote through the last one on the line:

       'first' quoted string here, 'second' here

       To find the smaller strings, first and second, use the
       following rule:

       '[^'\n]*'

       This rule stops after matching 'first'.

       Errors of this type are not far-reaching because the .
       (dot) operator does not match a newline character. Therefore,
       expressions like .* stop on the current line. Do not
       try to defeat this with expressions like [.\n]+. The lexical
       analyzer tries to read the entire input file, and an
       internal buffer overflow occurs.
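
       The difference between the two quoted-string rules above can be
       checked with a small hand-written C sketch. The functions
       match_greedy() and match_bounded() are hypothetical stand-ins for
       the patterns '.*' and '[^'\n]*'; they only compute how many
       characters each pattern would consume at the start of a line.

```c
#include <stddef.h>

/* Longest match of '.*' at the start of s: runs to the last quote on the line. */
size_t match_greedy(const char *s)
{
    size_t last = 0;
    if (s[0] != '\'')
        return 0;
    for (size_t i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            last = i + 1;       /* remember the most distant closing quote */
    return last;
}

/* Match of '[^'\n]*' at the start of s: stops at the first closing quote. */
size_t match_bounded(const char *s)
{
    size_t i = 1;
    if (s[0] != '\'')
        return 0;
    while (s[i] != '\0' && s[i] != '\n' && s[i] != '\'')
        i++;
    return s[i] == '\'' ? i + 1 : 0;
}
```

       For the sample input line shown earlier, match_bounded() consumes
       only the seven characters of 'first', while match_greedy() runs on
       through the closing quote of 'second'.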

   Finding Strings within Strings
       The lex program partitions the input stream and does not
       search for all possible matches of each expression. Each
       character is accounted for once and only once. For example,
       to count occurrences of both she and he in an input
       text, try the following rules:

       she   s++;
       he    h++;
       \n    |
       .     ;

       The last two rules ignore everything besides he and she.
       However, because she includes he, lex does not recognize
       the instances of he that are included in she.

       To override this choice, use the REJECT action. This
       directive tells lex to go to the next rule. The lex command
       then adjusts the position of the input pointer to
       where it was before the first rule was executed, and executes
       the second choice rule. For example, to count the
       included instances of he, use the following rules:

       she   {s++; REJECT;}
       he    {h++; REJECT;}
       \n    |
       .     ;

       After counting the occurrences of she, lex rejects the
       input stream and then counts the occurrences of he. In
       this case, you can omit the REJECT action on he because
       she includes he but not vice versa. In other cases, it may
       be difficult to determine which input characters are in
       both classes.

       In general, REJECT is useful whenever the purpose of lex
       is not to partition the input stream but to detect all
       examples of some items in the input, and the instances of
       these items may overlap or include each other.
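
       The two counting behaviors discussed above can be compared with a
       hand-written C sketch. It does not use lex or REJECT; the struct
       counts type and the functions count_partitioned() and
       count_overlapping() are hypothetical and only reproduce the
       expected totals for the she/he example.

```c
#include <string.h>

/* Hypothetical counters mirroring s and h in the rules above. */
struct counts { int she; int he; };

/* Count the way the rules without REJECT do: each character is used once. */
struct counts count_partitioned(const char *text)
{
    struct counts c = { 0, 0 };
    const char *p = text;
    while (*p != '\0') {
        if (strncmp(p, "she", 3) == 0) {
            c.she++;
            p += 3;             /* the "he" inside "she" is consumed */
        } else if (strncmp(p, "he", 2) == 0) {
            c.he++;
            p += 2;
        } else {
            p++;                /* ignored, like the \n and . rules */
        }
    }
    return c;
}

/*
 * Count every occurrence of "she" and "he", including each "he" inside
 * a "she". This is the effect the REJECT rules aim for.
 */
struct counts count_overlapping(const char *text)
{
    struct counts c = { 0, 0 };
    for (const char *p = text; *p != '\0'; p++) {
        if (strncmp(p, "she", 3) == 0)
            c.she++;
        if (strncmp(p, "he", 2) == 0)
            c.he++;
    }
    return c;
}
```

       For the text "she sells the shell", the partitioned count finds
       two she and one he, while the overlapping count finds two she and
       three he.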

NOTES

       Because lex uses fixed names for intermediate and output
       files, you can have only one lex-generated program in a
       given directory. If the -t option is not specified,
       informational, error, and warning messages are written to
       stdout. If the -t option is specified, informational, error,
       and warning messages are written to stderr.

       [Tru64 UNIX]  The yytext array has a default dimension of
       200, controlled by the constant YYLMAX. If the programmer
       needs to allow a larger array, the YYLMAX constant may be
       redefined as follows from within the lex command file:

       %{
       #undef YYLMAX
       #define YYLMAX 8192
       %}

       Two other arrays, yysubf and yylstate, also use YYLMAX.

       The lex program can be compiled as a C program with -std0,
       -std, or -std1 mode. It can also be compiled as a C++ program.
       If YY_NOPROTO is defined on the compilation command
       line, function prototypes are not generated.

EXAMPLES

       The following command draws lex instructions from the file
       lexcommands and places the output in lex.yy.c:

       lex lexcommands

       The file lexcommands contains an example of a lex program that
       would be put into a lex command file. The following program
       converts uppercase to lowercase, removes spaces at the end of a
       line, and replaces multiple spaces with single spaces:

              %%
              [A-Z]   putchar(tolower(yytext[0]));
              [ ]+$   ;
              [ ]+    putchar(' ');
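
       To make the intended effect of the three rules above concrete,
       the following hand-written C sketch applies the same
       transformation to a string. It is not generated by lex, and
       squeeze_line() is a hypothetical name. The sketch assumes that $
       matches only immediately before a newline, so a run of spaces at
       the very end of input with no newline is reduced to one space
       rather than removed.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Apply the three rules to src and write the result to dst, which must
 * hold at least strlen(src) + 1 bytes. Returns the output length.
 */
size_t squeeze_line(const char *src, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; ) {
        if (src[i] == ' ') {
            size_t j = i;
            while (src[j] == ' ')
                j++;
            /* [ ]+$ : a run of spaces before a newline is dropped */
            if (src[j] != '\n')
                dst[out++] = ' ';   /* [ ]+ : other runs become one space */
            i = j;
        } else {
            /* [A-Z] is lowered; everything else is echoed unchanged */
            dst[out++] = (char)tolower((unsigned char)src[i]);
            i++;
        }
    }
    dst[out] = '\0';
    return out;
}
```

       Given the input "Hello   World  " followed by a newline, the
       function writes "hello world" followed by a newline.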

ENVIRONMENT VARIABLES

       The following environment variables affect the behavior of
       lex:

       LANG          Provides a default value for the locale category
                     variables that are not set or null.
       LC_ALL        If set, overrides the values of all other locale
                     variables.
       LC_COLLATE    Determines the order in which output is sorted
                     for the -x option.
       LC_CTYPE      Determines the locale for the interpretation of
                     byte sequences as characters (single-byte or
                     multi-byte) in input parameters and files.
       LC_MESSAGES   Determines the locale used to affect the format
                     and contents of diagnostic messages displayed by
                     the command.
       NLSPATH       Determines the location of message catalogs for
                     the processing of LC_MESSAGES.

FILES

       libl.a                 Run-time library.
       /usr/ccs/lib/ncform    Default C language skeleton finite state
                              machine for lex.
       /usr/ccs/lib/ncpform   Default C language skeleton finite state
                              machine for lex, implemented with the
                              pointer definition of yytext.

       A default RATFOR language skeleton finite state machine for lex
       is also listed among the files.
