*nix Documentation Project
·  Home
 +   man pages
·  Linux HOWTOs
·  FreeBSD Tips
·  *niX Forums

  man pages->OpenBSD man pages -> regcomp (3)              
Title
Content
Arch
Section
 

REGEX(3)

Contents


NAME    [Toc]    [Back]

     regcomp, regexec, regerror,  regfree  -  regular  expression
routines

SYNOPSIS    [Toc]    [Back]

     #include <sys/types.h>
     #include <regex.h>

     int
     regcomp(regex_t *preg, const char *pattern, int cflags);

     int
     regexec(const  regex_t  *preg,  const  char  *string, size_t
nmatch,
             regmatch_t pmatch[], int eflags);

     size_t
     regerror(int errcode, const regex_t *preg, char *errbuf,
             size_t errbuf_size);

     void
     regfree(regex_t *preg);

DESCRIPTION    [Toc]    [Back]

     These routines implement IEEE Std 1003.2 (``POSIX.2'') regular expressions
 (``REs''); see re_format(7).  regcomp() compiles an RE
written as a
     string into an internal form, regexec() matches that  internal form
     against  a string and reports results, regerror() transforms
error codes
     from either  into  human-readable  messages,  and  regfree()
frees any dynamically
  allocated storage used by the internal form of an RE.

     The header <regex.h> declares two structure  types,  regex_t
and
     regmatch_t,  the  former for compiled internal forms and the
latter for
     match reporting.  It also declares  the  four  functions,  a
type regoff_t,
     and a number of constants with names starting with REG_.

     regcomp()  compiles  the regular expression contained in the
pattern
     string, subject to the flags in cflags, and places  the  results in the
     regex_t structure pointed to by preg.  cflags is the bitwise
OR of zero
     or more of the following flags:

     REG_EXTENDED    Compile modern  (``extended'')  REs,  rather
than the obsolete
 (``basic'') REs that are the default.

     REG_BASIC       This is a synonym for 0, provided as a counterpart to
                     REG_EXTENDED to improve readability.

     REG_NOSPEC      Compile  with  recognition  of  all  special
characters turned
                     off.  All characters are thus considered ordinary, so the
                     RE is a literal string.  This is  an  extension, compatible
                     with  but  not  specified by IEEE Std 1003.2
(``POSIX.2''),
                     and should be used with caution in  software
intended to
                     be  portable to other systems.  REG_EXTENDED
and
                     REG_NOSPEC may not be used in the same  call
to regcomp().

     REG_ICASE       Compile for matching that ignores upper/lower case distinctions.
  See re_format(7).

     REG_NOSUB       Compile for matching that need  only  report
success or
                     failure, not what was matched.

     REG_NEWLINE      Compile for newline-sensitive matching.  By
default, newline
 is a completely ordinary character with
no special
                     meaning in either REs or strings.  With this
flag, `[^'
                     bracket expressions and `.' never match newline, a `^'
                     anchor  matches  the  null  string after any
newline in the
                     string in addition to its  normal  function,
and the `$'
                     anchor  matches  the  null string before any
newline in the
                     string in addition to its normal function.

     REG_PEND        The regular  expression  ends,  not  at  the
first NUL, but
                     just  before the character pointed to by the
re_endp member
 of the structure  pointed  to  by  preg.
The re_endp
                     member  is  of type const char *.  This flag
permits inclusion
 of NULs in the RE; they are  considered
ordinary
                     characters.   This is an extension, compatible with but
                     not   specified   by   IEEE    Std    1003.2
(``POSIX.2''), and
                     should  be used with caution in software intended to be
                     portable to other systems.

     When successful, regcomp() returns 0 and fills in the structure pointed
     to  by  preg.   One  member  of  that  structure (other than
re_endp) is publicized:
 re_nsub, of  type  size_t,  contains  the  number  of
parenthesized
     subexpressions  within the RE (except that the value of this
member is undefined
 if the  REG_NOSUB  flag  was  used).   If  regcomp()
fails, it returns
     a non-zero error code; see DIAGNOSTICS.

     regexec() matches the compiled RE pointed to by preg against
the string,
     subject to the flags in eflags, and  reports  results  using
nmatch, pmatch,
     and the returned value.  The RE must have been compiled by a
previous invocation
 of regcomp().  The compiled  form  is  not  altered
during execution
     of regexec(), so a single compiled RE can be used simultaneously by multiple
 threads.

     By default, the null-terminated string pointed to by  string
is considered
     to be the text of an entire line, minus any terminating newline.  The
     eflags argument is the bitwise OR of zero  or  more  of  the
following flags:

     REG_NOTBOL      The first character of the string is not the
beginning of
                     a line, so the `^' anchor should  not  match
before it.
                     This  does  not  affect the behavior of newlines under
                     REG_NEWLINE.

     REG_NOTEOL      The NUL terminating the string does not  end
a line, so
                     the  `$'  anchor should not match before it.
This does not
                     affect  the  behavior  of   newlines   under
REG_NEWLINE.

     REG_STARTEND     The string is considered to start at string
+
                     pmatch[0].rm_so and to  have  a  terminating
NUL located at
                     string + pmatch[0].rm_eo (there need not actually be a
                     NUL at that  location),  regardless  of  the
value of nmatch.
                     See  below  for the definition of pmatch and
nmatch.  This
                     is an extension,  compatible  with  but  not
specified by
                     IEEE Std 1003.2 (``POSIX.2''), and should be
used with
                     caution in software intended to be  portable
to other systems.
   Note  that a non-zero rm_so does not
imply
                     REG_NOTBOL; REG_STARTEND  affects  only  the
location of the
                     string, not how it is matched.

     See re_format(7) for a discussion of what is matched in situations where
     an RE or a portion thereof could match any of  several  substrings of
     string.

     Normally,  regexec()  returns 0 for success and the non-zero
code
     REG_NOMATCH for failure.  Other non-zero error codes may  be
returned in
     exceptional situations; see DIAGNOSTICS.

     If  REG_NOSUB was specified in the compilation of the RE, or
if nmatch is
     0, regexec() ignores the pmatch argument (but see below  for
the case
     where  REG_STARTEND is specified).  Otherwise, pmatch points
to an array
     of nmatch structures of type regmatch_t.  Such  a  structure
has at least
     the members rm_so and rm_eo, both of type regoff_t (a signed
arithmetic
     type at least as large as an off_t and a ssize_t),  containing respectively
  the offset of the first character of a substring and the
offset of the
     first character after the end of the substring.  Offsets are
measured
     from   the   beginning  of  the  string  argument  given  to
regexec().  An empty
     substring is denoted by equal offsets, both  indicating  the
character following
 the empty substring.

     The  0th member of the pmatch array is filled in to indicate
what substring
 of string was matched by the  entire  RE.   Remaining
members report
     what  substring  was matched by parenthesized subexpressions
within the RE;
     member i reports subexpression i, with subexpressions counted (starting
     at  1)  by the order of their opening parentheses in the RE,
left to right.
     Unused entries in the array--corresponding either to  subexpressions that
     did  not  participate  in the match at all, or to subexpressions that do not
     exist in the RE (that  is,  i  >  preg->re_nsub)--have  both
rm_so and rm_eo
     set  to  -1.   If  a subexpression participated in the match
several times,
     the reported substring is the last one it  matched.   (Note,
as an example
     in  particular,  that when the RE ``(b*)+'' matches ``bbb'',
the parenthesized
 subexpression matches each of the three `bs' and  then
an infinite
     number  of  empty strings following the last `b', so the reported substring
     is one of the empties.)

     If REG_STARTEND is specified, pmatch must point to at  least
one
     regmatch_t (even if nmatch is 0 or REG_NOSUB was specified),
to hold the
     input offsets for REG_STARTEND.  Use for output is still entirely controlled
  by  nmatch;  if nmatch is 0 or REG_NOSUB was specified, the value
     of pmatch[0] will not be changed by a successful  regexec().

     regerror()  maps a non-zero errcode from either regcomp() or
regexec() to
     a human-readable, printable message.  If preg  is  non-NULL,
the error code
     should  have  arisen  from  use of the regex_t pointed to by
preg, and if the
     error code came from regcomp(), it should have been the  result from the
     most  recent  regcomp() using that regex_t.  (regerror() may
be able to
     supply a more detailed message using  information  from  the
regex_t.)
     regerror()  places  the  null-terminated  message  into  the
buffer pointed to
     by errbuf, limiting the length (including  the  NUL)  to  at
most errbuf_size
     bytes.   If  the  whole  message won't fit, as much of it as
will fit before
     the terminating NUL is supplied.  In any case, the  returned
value is the
     size  of  buffer needed to hold the whole message (including
the terminating
 NUL).  If errbuf_size is 0, errbuf is  ignored  but  the
return value is
     still correct.

     If  the  errcode  given  to  regerror()  is first OR'ed with
REG_ITOA, the
     ``message'' that results is the printable name of the  error
code, e.g.,
     ``REG_NOMATCH'',  rather  than  an  explanation thereof.  If
errcode is
     REG_ATOI, then preg shall be non-null and the re_endp member
of the
     structure  it  points to must point to the printable name of
an error code;
     in this case, the result in errbuf is the decimal digits  of
the numeric
     value  of  the error code (0 if the name is not recognized).
REG_ITOA and
     REG_ATOI are intended  primarily  as  debugging  facilities;
they are extensions,
  compatible with but not specified by IEEE Std 1003.2
(``POSIX.2'')
     and should be used with caution in software intended  to  be
portable to
     other  systems.  Be warned also that they are considered experimental and
     changes are possible.

     regfree() frees any dynamically allocated storage associated
with the
     compiled RE pointed to by preg.  The remaining regex_t is no
longer a
     valid  compiled  RE  and  the  effect  of  supplying  it  to
regexec() or
     regerror() is undefined.

     None  of  these functions references global variables except
for tables of
     constants; all are safe for use from multiple threads if the
arguments
     are safe.

IMPLEMENTATION CHOICES    [Toc]    [Back]

     There  are  a  number  of  decisions  that  IEEE  Std 1003.2
(``POSIX.2'') leaves
     up to the implementor, either by explicitly  saying  ``undefined'' or by
     virtue  of them being forbidden by the RE grammar.  This implementation
     treats them as follows.

     See re_format(7) for a discussion of the definition of caseindependent
     matching.

     There  is  no  particular limit on the length of REs, except
insofar as memory
 is limited.  Memory usage is approximately linear in  RE
size, and
     largely  insensitive  to  RE  complexity, except for bounded
repetitions.
     See BUGS for one short RE using them that  will  run  almost
any system out
     of memory.

     A  backslashed character other than one specifically given a
magic meaning
     by IEEE Std 1003.2 (``POSIX.2'') (such magic meanings  occur
only in obsolete
 REs) is taken as an ordinary character.

     Any unmatched `[' is a REG_EBRACK error.

     Equivalence  classes  cannot begin or end bracket-expression
ranges.  The
     endpoint of one range cannot begin another.

     RE_DUP_MAX, the limit on repetition counts in bounded  repetitions, is
     255.

     A repetition operator (?, *, +, or bounds) cannot follow another repetition
 operator.  A repetition operator cannot  begin  an  expression or
     subexpression or follow `^' or `|'.

     A  `|'  cannot appear first or last in a (sub)expression, or
after another
     `|', i.e., an operand of `|' cannot be an  empty  subexpression.  An empty
     parenthesized  subexpression,  `()', is legal and matches an
empty
     (sub)string.  An empty string is not a legal RE.

     A `{' followed by a digit is  considered  the  beginning  of
bounds for a
     bounded  repetition,  which  must then follow the syntax for
bounds.  A `{'
     not followed by a digit is considered an ordinary character.

     `^'  and `$' beginning and ending subexpressions in obsolete
(``basic'')
     REs are anchors, not ordinary characters.

DIAGNOSTICS    [Toc]    [Back]

     Non-zero error codes from regcomp()  and  regexec()  include
the following:

     REG_NOMATCH     regexec() failed to match
     REG_BADPAT      invalid regular expression
     REG_ECOLLATE    invalid collating element
     REG_ECTYPE      invalid character class
     REG_EESCAPE      applied to unescapable character
     REG_ESUBREG     invalid backreference number
     REG_EBRACK      brackets [ ] not balanced
     REG_EPAREN      parentheses ( ) not balanced
     REG_EBRACE      braces { } not balanced
     REG_BADBR       invalid repetition count(s) in { }
     REG_ERANGE      invalid character range in [ ]
     REG_ESPACE      ran out of memory
     REG_BADRPT      ?, *, or + operand invalid
     REG_EMPTY       empty (sub)expression
     REG_ASSERT      ``can't happen'' --you found a bug
     REG_INVARG        invalid  argument,  e.g.,  negative-length
string

SEE ALSO    [Toc]    [Back]

      
      
     grep(1), re_format(7)

     IEEE Std 1003.2 (``POSIX.2''), sections 2.8 (Regular Expression Notation)
     and B.5 (C Binding for Regular Expression Matching).

HISTORY    [Toc]    [Back]

     Originally  written by Henry Spencer.  Altered for inclusion
in the 4.4BSD
     distribution.

BUGS    [Toc]    [Back]

     This is an alpha release with known defects.  Please  report
problems.

     There is one known functionality bug.  The implementation of
internationalization
 is incomplete: the locale is always assumed to  be
the default
     one of IEEE Std 1003.2 (``POSIX.2''), and only the collating
elements
     etc. of that locale are available.

     The back-reference code is subtle and  doubts  linger  about
its correctness
     in complex cases.

     regexec() performance is poor.  This will improve with later
releases.
     nmatch exceeding 0  is  expensive;  nmatch  exceeding  1  is
worse.  regexec()
     is  largely  insensitive  to  RE complexity except that back
references are
     massively expensive.  RE length does matter; in  particular,
there is a
     strong  speed  bonus  for  keeping  RE length under about 30
characters, with
     most special characters counting roughly double.

     regcomp() implements bounded repetitions by macro expansion,
which is
     costly in time and space if counts are large or bounded repetitions are
     nested.           A          RE          like,          say,
``((((a{1,100}){1,100}){1,100}){1,100}){1,100}''
     will  (eventually)  run  almost  any existing machine out of
swap space.

     There are suspected problems with response to obscure  error
conditions.
     Notably,  certain  kinds of internal overflow, produced only
by truly enormous
 REs or by  multiply  nested  bounded  repetitions,  are
probably not handled
 well.

     Due  to  a  mistake in IEEE Std 1003.2 (``POSIX.2''), things
like `a)b' are
     legal REs because `)' is a special  character  only  in  the
presence of a
     previous  unmatched `('.  This can't be fixed until the spec
is fixed.

     The standard's definition of back references is vague.   For
example, does
     ``ab*2*d''  match  ``abbbd''?   Until the standard is clarified,
     behavior in such cases should not be relied on.

     The implementation of word-boundary matching is a bit  of  a
kludge, and
     bugs  may lurk in combinations of word-boundary matching and
anchoring.

OpenBSD     3.6                          March      20,      1994
[ Back ]
 Similar pages
Name OS Title
regexp OpenBSD obsolete regular expression routines
step Tru64 Regular expression compile and match routines
regexpr IRIX regular expression compile and match routines
compile Tru64 Regular expression compile and match routines
compile_r Tru64 Regular expression compile and match routines
regexp Tru64 Regular expression compile and match routines
advance_r Tru64 Regular expression compile and match routines
advance Tru64 Regular expression compile and match routines
step_r Tru64 Regular expression compile and match routines
regexp IRIX regular expression compile and match routines
Copyright © 2004-2005 DeniX Solutions SRL
newsletter delivery service