regcomp - IRIX

REGCOMP(5)							    REGCOMP(5)

NAME

     regcomp - X/Open regular expressions definition and interface

DESCRIPTION

     Note:  Two	versions of regular expressions	are supported:

     o	  the historical Simple	Regular	Expressions, which provide backward
	  compatibility, but which will	be withdrawn from a future issue of
	  this document	set

     o	  the improved internationalised version that complies with the
	  ISO/IEC 9945-2: 1993 standard.

     The first (historical) version is described as part of the	regexp
     function in the regexp(5) man page. The second (improved) version is
     described in this man page.

     Regular Expressions (REs) provide a mechanism to select specific strings
     from a set	of character strings.

     Regular expressions are a context-independent syntax that can represent a
     wide variety of character sets and	character set orderings, where these
     character sets are	interpreted according to the current locale. While
     many regular expressions can be interpreted differently depending on the
     current locale, many features, such as character class expressions,
     provide for contextual invariance across locales.

     The Basic Regular Expression (BRE) notation and construction rules in
     Basic Regular Expressions below apply to most utilities supporting
     regular expressions.  Some utilities, instead, support the Extended
     Regular Expressions (ERE) described in Extended Regular Expressions
     below; any exceptions for both cases are noted in the descriptions of
     the specific utilities using regular expressions. Both BREs and EREs are
     supported by the Regular Expression Matching interface in the regcomp(),
     regexec() and related functions.
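
     The choice between the two notations can be sketched with the POSIX
     <regex.h> interface. This is a minimal illustration, assuming a
     conforming regcomp() and regexec(); the helper names re_matches,
     bre_matches and ere_matches are hypothetical and are not part of any
     library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains a match for `pattern`,
 * 0 if it does not, and -1 if regcomp() rejects the pattern. */
static int re_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Without REG_EXTENDED, regcomp() compiles the pattern as a BRE. */
static int bre_matches(const char *pattern, const char *text)
{
    return re_matches(pattern, 0, text);
}

/* REG_EXTENDED selects the ERE notation. */
static int ere_matches(const char *pattern, const char *text)
{
    return re_matches(pattern, REG_EXTENDED, text);
}
```

     For instance, bre_matches("a\\{2\\}", "caab") and
     ere_matches("a{2}", "caab") express the same interval in the two
     notations, while the vertical-line in a|b acts as alternation only
     in the ERE form.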

Regular	Expression Definitions
     For the purposes of this section, the following definitions apply:

     entire regular expression

          The concatenated set of one or more BREs or EREs that make up the
          pattern specified for string selection.

     matched

	  A sequence of	zero or	more characters	is said	to be matched by a BRE
	  or ERE when the characters in	the sequence correspond	to a sequence
	  of characters	defined	by the pattern.

	  Matching is based on the bit pattern used for	encoding the
	  character, not on the	graphic	representation of the character. This
          means that if a character set contains two or more encodings for a
          graphic symbol, or if the strings searched contain text encoded in
	  more than one	codeset, no attempt is made to search for any other
	  representation of the	encoded	symbol.	If that	is required, the user
	  can specify equivalence classes containing all variations of the
	  desired graphic symbol.

	  The search for a matching sequence starts at the beginning of	a
	  string and stops when	the first sequence matching the	expression is
	  found, where first is	defined	to mean	``begins earliest in the
	  string''. If the pattern permits a variable number of	matching
	  characters and thus there is more than one such sequence starting at
	  that point, the longest such sequence	will be	matched. For example:
	  the BRE bb* matches the second to fourth characters of abbbc,	and
	  the ERE (wee|week)(knights|night) matches all	ten characters of
	  weeknights.
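
          The leftmost-longest rule can be observed through the offsets
          that regexec() reports. The following is a sketch only, assuming
          the POSIX <regex.h> interface; match_span is a hypothetical
          helper, and it reports 0-based offsets with an exclusive end,
          whereas the text above counts characters from one.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: searches `text` for `pattern` and stores the 0-based
 * [start, end) offsets of the overall match when both pointers are non-NULL.
 * Returns 1 on a match, 0 on no match, -1 if the pattern does not compile. */
static int match_span(const char *pattern, int cflags, const char *text,
                      int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int found;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    found = (regexec(&re, text, 1, m, 0) == 0);
    if (found && start != NULL && end != NULL) {
        *start = (int)m[0].rm_so;
        *end = (int)m[0].rm_eo;
    }
    regfree(&re);
    return found;
}
```

          With this helper, the BRE bb* against abbbc yields offsets 1 and
          4 (the second to fourth characters), and the ERE
          (wee|week)(knights|night) against weeknights yields 0 and 10.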

	  Consistent with the whole match being	the longest of the leftmost
	  matches, each	subpattern, from left to right,	matches	the longest
	  possible string. For this purpose, a null string is considered to be
	  longer than no match at all. For example, matching the BRE \(.*\).*
	  against abcdef, the subexpression (\1) is abcdef, and	matching the
	  BRE \(a*\)* against bc, the subexpression (\1) is the	null string.

	  It is	possible to determine what strings correspond to
	  subexpressions by recursively	applying the leftmost longest rule to
	  each subexpression, but only with the	proviso	that the overall match
	  is leftmost longest. For example, matching \(ac*\)c*d[ac]*\1 against
	  acdacaaa matches acdacaaa (with \1=a); simply	matching the longest
	  match	for \(ac*\) would yield	\1=ac, but the overall match would be
	  smaller (acdac). Conceptually, the implementation must examine every
	  possible match and among those that yield the	leftmost longest total
	  matches, pick	the one	that does the longest match for	the leftmost
	  subexpression	and so on. Note	that this means	that matching by
	  subexpressions is context-dependent: a subexpression within a	larger
	  RE may match a different string from the one it would	match as an
	  independent RE, and two instances of the same	subexpression within
	  the same larger RE may match different lengths even in similar
	  sequences of characters. For example,	in the ERE (a.*b)(a.*b), the
	  two identical	subexpressions would match four	and six	characters,
	  respectively,	of accbaccccb.
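
          Subexpression offsets are returned in the regmatch_t array that
          is passed to regexec(). The sketch below assumes the POSIX
          <regex.h> interface; group_span and MAX_GROUPS are hypothetical
          names chosen for illustration, and implementations are known to
          differ in how strictly they apply the subexpression rules in the
          more intricate cases described above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10   /* entry 0 is the whole match, 1..9 the subexpressions */

/* Hypothetical helper: stores the 0-based [start, end) offsets of
 * subexpression `group` (0 = entire match). Returns 1 if that group took
 * part in a match, 0 otherwise, -1 if the pattern does not compile. */
static int group_span(const char *pattern, int cflags, const char *text,
                      size_t group, int *start, int *end)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    int found;

    assert(group < MAX_GROUPS && start != NULL && end != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    found = (regexec(&re, text, MAX_GROUPS, m, 0) == 0) &&
            m[group].rm_so != -1;
    if (found) {
        *start = (int)m[group].rm_so;
        *end = (int)m[group].rm_eo;
    }
    regfree(&re);
    return found;
}
```

          For the ERE (a.*b)(a.*b) against accbaccccb, group 1 spans
          offsets 0 to 4 and group 2 spans offsets 4 to 10, which are the
          four and six characters mentioned above.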

	  When a multi-character collating element in a	bracket	expression is
	  involved, the	longest	sequence will be measured in characters
	  consumed from	the string to be matched; that is, the collating
	  element counts not as	one element, but as the	number of characters
	  it matches.

     BRE (ERE) matching	a single character

	  A BRE	or ERE that matches either a single character or a single
          collating element.

	  Only a BRE or	ERE of this type that includes a bracket expression
	  can match a collating	element.

	  The definition of single character has been expanded to include also
	  collating elements consisting	of two or more characters; this
	  expansion is applicable only when a bracket expression is included
	  in the BRE or	ERE.  An example of such a collating element may be
          the Dutch ij, which collates as a y.  In some encodings, a ligature
          ``i with j'' exists as a character and would represent a
          single-character collating element. In another encoding, no such
          ligature exists, and the two-character sequence ij is defined as a
          multi-character collating element. Outside brackets, the ij is
          treated as a two-character RE and matches the same characters in a
          string.
	  Historically,	a bracket expression only matched a single character.
	  If, however, the bracket expression defines, for example, a range
	  that includes	ij, then this particular bracket expression will also
	  match	a sequence of the two characters i and j in the	string.

     BRE (ERE) matching	multiple characters

	  A BRE	or ERE that matches a concatenation of single characters or
	  collating elements.

     invalid

	  This section uses the	term invalid for certain constructs or
	  conditions.  Invalid REs will	cause the utility or function using
	  the RE to generate an	error condition. When invalid is not used,
	  violations of	the specified syntax or	semantics for REs produce
	  undefined results: this may entail an	error, enabling	an extended
	  syntax for that RE, or using the construct in	error as literal
	  characters to	be matched. For	example, the BRE construct \{1,2,3\}
	  does not comply with the grammar. A portable application cannot rely
	  on it	producing an error nor matching	the literal characters
	  \{1,2,3\}.

Regular	Expression General Requirements
     The requirements in this section apply to both basic and extended regular
     expressions.

     The use of	regular	expressions is generally associated with text
     processing.  REs (BREs and	EREs) operate on text strings; that is,	zero
     or	more characters	followed by an end-of-string delimiter (typically
     NUL).  Some utilities employing regular expressions limit the processing
     to	lines; that is,	zero or	more characters	followed by a newline
     character.	 In the	regular	expression processing described	in this
     document, the newline character is	regarded as an ordinary	character and
     both a period and a non-matching list can match one. The individual man
     pages specify within the individual descriptions of those standard
     utilities employing regular expressions whether they permit matching of
     newline characters; if not	stated otherwise, the use of literal newline
     characters or any escape sequence equivalent produces undefined results.

     Those utilities (like grep) that do not allow newline characters to match
     are responsible for eliminating any newline character from	strings	before
     matching against the RE. The regcomp() function (see regcomp(3G)),
     however, can provide support for such processing without violating	the
     rules of this section.
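
     The line-oriented processing mentioned above is available through the
     REG_NEWLINE flag of regcomp(). The following sketch assumes the POSIX
     <regex.h> interface; re_matches is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains a match for the BRE
 * `pattern`, 0 if not, -1 if regcomp() rejects the pattern. */
static int re_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Default rules: newline is an ordinary character, so a period or a
 * non-matching list can match it. */
static int matches_default(const char *pattern, const char *text)
{
    return re_matches(pattern, 0, text);
}

/* REG_NEWLINE: a period and a non-matching list no longer match newline,
 * and ^ and $ also anchor immediately after and before a newline. */
static int matches_by_line(const char *pattern, const char *text)
{
    return re_matches(pattern, REG_NEWLINE, text);
}
```

     For example, a.b matches the three-character string "a\nb" under
     the default rules but not with REG_NEWLINE, while ^b matches that
     string only with REG_NEWLINE.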

     The interfaces specified in this document set do not permit the inclusion
     of	a NUL character	in an RE or in the string to be	matched.  If during
     the operation of a	standard utility a NUL is included in the text
     designated	to be matched, that NUL	may designate the end of the text
     string for	the purposes of	matching.

     When a standard utility or	function that uses regular expressions
     specifies that pattern matching will be performed without regard to the
     case (upper- or lower-) of	either data or patterns, then when each
     character in the string is	matched	against	the pattern, not only the
     character,	but also its case counterpart (if any),	will be	matched. This
     definition	of case-insensitive processing is intended to allow matching
     of	multi-character	collating elements as well as characters. For
     instance, as each character in the	string is matched using	both its
     cases, the	RE [[.Ch.]] when matched against the string char, is in
     reality matched against ch, Ch, cH	and CH.
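
     Case-insensitive matching is requested from regcomp() with the
     REG_ICASE flag. A minimal sketch, assuming the POSIX <regex.h>
     interface and the default C locale; icase_matches is a hypothetical
     helper, and the multi-character collating element case above is
     locale-dependent and is not exercised here.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains a match for the BRE
 * `pattern`, 0 if not, -1 on a compile error. When `ignore_case` is
 * non-zero the pattern is compiled with REG_ICASE. */
static int icase_matches(const char *pattern, const char *text,
                         int ignore_case)
{
    regex_t re;
    int cflags = REG_NOSUB | (ignore_case ? REG_ICASE : 0);
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

     Here icase_matches("week", "WeEk nights", 1) reports a match, while
     the same call with ignore_case set to 0 does not.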

     The implementation	will support any regular expression that does not
     exceed 256	bytes in length.

Basic Regular Expressions

     BREs Matching a Single Character or Collating Element

	  A BRE	ordinary character, a special character	preceded by a
	  backslash or a period	matches	a single character. A bracket
	  expression matches a single character	or a single collating element.

     BRE Ordinary Characters

          An ordinary character is a BRE that matches itself: any character in
          the supported character set, except for the BRE special characters
          listed in BRE Special Characters below.

	  The interpretation of	an ordinary character preceded by a backslash
	  (\) is undefined, except for:

	  1.  the characters ),	(, { and }

	  2.  the digits 1 to 9	inclusive

	  3.  a	character inside a bracket expression.

     BRE Special Characters

	  A BRE	special	character has special properties in certain contexts.
	  Outside those	contexts, or when preceded by a	backslash, such	a
	  character will be a BRE that matches the special character itself.
	  The BRE special characters and the contexts in which they have their
	  special meaning are:

     .[\  The period, left-bracket and backslash are special except when used
          in a bracket expression. An expression containing a [ that is not
          preceded by a backslash and is not part of a bracket expression
          produces undefined results.

     *	  The asterisk is special except when used:

	  o   in a bracket expression

	  o   as the first character of	an entire BRE (after an	initial	^,
	      if any)

	  o   as the first character of	a subexpression	(after an initial ^,
	      if any).

     ^	  The circumflex is special when used:

	  o   as an anchor

	  o   as the first character of	a bracket expression.

     $	  The dollar sign is special when used as an anchor.
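
     The context rules above can be tried out with a few BREs. This is a
     sketch under the assumption of a POSIX <regex.h> implementation;
     bre_matches is a hypothetical helper, and only cases whose behaviour
     is defined by the rules above are shown.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains a match for the BRE
 * `pattern`, 0 if not, -1 if regcomp() rejects the pattern. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)   /* no REG_EXTENDED: BRE */
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Examples of the contexts described above:
 *   a.c    period is special: matches "abc"
 *   a\.c   escaped period matches only a literal '.'
 *   *a     asterisk first in the BRE matches a literal '*'
 *   [.*]   period and asterisk are ordinary inside brackets
 *   a$b    dollar sign is not an anchor in the middle of a BRE
 */
```

     Note that each backslash in these patterns has to be doubled when
     the pattern is written as a C string literal.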

     Periods in	BREs

	  A period (.),	when used outside a bracket expression,	is a BRE that
	  matches any character	in the supported character set except NUL.


     RE	Bracket	Expression

     A bracket expression (an expression enclosed in square brackets, [	]) is
     an	RE that	matches	a single collating element contained in	the non-empty
     set of collating elements represented by the bracket expression.

     The following rules and definitions apply to bracket expressions:


     1.   A bracket expression is either a matching list expression or a
          non-matching list expression. It consists of one or more
          expressions: collating elements, collating symbols, equivalence
          classes, character classes or range expressions. Portable
          applications must not use range expressions, even though all
          implementations support them. The right-bracket (]) loses its
          special meaning and represents itself in a bracket expression if it
          occurs first in the list (after an initial circumflex (^), if any).
          Otherwise, it terminates the bracket expression, unless it appears
          in a collating symbol (such as
	  [.].]) or is the ending right-bracket	for a collating	symbol,
	  equivalence class or character class.	The special characters:

	  . * [	\

	  (period, asterisk, left-bracket and backslash, respectively) lose
	  their	special	meaning	within a bracket expression.

	  The character	sequences:

	  [. [=	[:

	  (left-bracket	followed by a period, equals-sign or colon) are
	  special inside a bracket expression and are used to delimit
	  collating symbols, equivalence class expressions and character class
	  expressions. These symbols must be followed by a valid expression
	  and the matching terminating sequence	.], =] or :], as described in
	  the following	items.

     2.	  A matching list expression specifies a list that matches any one of
	  the expressions represented in the list. The first character in the
	  list must not	be the circumflex. For example,	[abc] is an RE that
	  matches any of the characters	a, b or	c.

     3.	  A non-matching list expression begins	with a circumflex (^), and
	  specifies a list that	matches	any character or collating element
	  except for the expressions represented in the	list after the leading
	  circumflex. For example, [^abc] is an	RE that	matches	any character
	  or collating element except the characters a,	b or c.	The circumflex
	  will have this special meaning only when it occurs first in the
	  list,	immediately following the left-bracket.

     4.   A collating symbol is a collating element enclosed within
          bracket-period ([. .]) delimiters. Collating elements are defined as
          described in colltbl(1M). Multi-character collating elements must be
	  represented as collating symbols when	it is necessary	to distinguish
	  them from a list of the individual characters	that make up the
	  multi-character collating element.  For example, if the string ch is
	  a collating element in the current collation sequence	with the
	  associated collating symbol <ch>, the	expression [[.ch.]] will be
	  treated as an	RE matching the	character sequence ch, while [ch] will
	  be treated as	an RE matching c or h.	Collating symbols will be
	  recognised only inside bracket expressions. This implies that	the RE
	  [[.ch.]]*c matches the first to fifth	character in the string
	  chchch. If the string	is not a collating element in the current
	  collating sequence definition, or if the collating element has no
	  characters associated	with it	(for example, see the symbol <HIGH> in
	  the example collation	definition shown in colltbl(1M)), the symbol
          will be treated as an invalid expression.

     5.	  An equivalence class expression represents the set of	collating
	  elements belonging to	an equivalence class, as described in
	  colltbl(1M).	Only primary equivalence classes will be recognised.
	  The class is expressed by enclosing any one of the collating
	  elements in the equivalence class within bracket-equal ([= =])
          delimiters.  For example, if a, agrave and acircumflex belong to the
          same equivalence class, then [[=a=]b], [[=agrave=]b] and
          [[=acircumflex=]b] will each be equivalent to [aagraveacircumflexb].
	  If the collating element does	not belong to an equivalence class,
	  the equivalence class	expression will	be treated as a	collating
	  symbol.

     6.	  A character class expression represents the set of characters
	  belonging to a character class, as defined in	the LC_CTYPE category
	  in the current locale. All character classes specified in the
	  current locale will be recognised. A character class expression is
	  expressed as a character class name enclosed within bracket-colon
	  ([: :]) delimiters.

          The following character class expressions are supported in all
          locales:

	       [:alnum:]     [:cntrl:]	   [:lower:]	 [:space:]
	       [:alpha:]     [:digit:]	   [:print:]	 [:upper:]
	       [:blank:]     [:graph:]	   [:punct:]	 [:xdigit:]

	  In addition, character class expressions of the form:

	       [:name:]

	  are recognised in those locales where	the name keyword has been
	  given	a charclass  definition	in the LC_CTYPE	category.

     7.	  A range expression represents	the set	of collating elements that
	  fall between two elements in the current collation sequence,
	  inclusively. It is expressed as the starting point and the ending
	  point	separated by a hyphen (-).

	  Range	expressions must not be	used in	portable applications because
	  their	behaviour is dependent on the collating	sequence. Ranges will
	  be treated according to the current collating	sequence, and include
	  such characters that fall within the range based on that collating
	  sequence, regardless of character values. This, however, means that
	  the interpretation will differ depending on collating	sequence. If,
          for instance, one collating sequence defines aumlaut as a variant of
          a, while another defines it as a letter following z, then the
          expression [aumlaut-z] is valid in the first language and invalid in
          the second.

	  In the following, all	examples assume	the collation sequence
	  specified for	the POSIX locale, unless another collation sequence is
	  specifically defined.

	  The starting range point and the ending range	point must be a
	  collating element or collating symbol. An equivalence	class
	  expression used as a starting	or ending point	of a range expression
	  produces unspecified results.	An equivalence class can be used
	  portably within a bracket expression,	but only outside the range.
	  For example, the unspecified expression [[=e=]-f] should be given as
	  [[=e=]e-f]. The ending range point must collate equal	to or higher
	  than the starting range point; otherwise, the	expression will	be
	  treated as invalid. The order	used is	the order in which the
	  collating elements are specified in the current collation
	  definition. One-to-many mappings (see	the description	of LC_COLLATE
	  in locale(1))	will not be performed. For example, assuming that the
          character eszet is placed in the collation sequence after r and
          s, but before t and that it maps to the sequence ss for collation
          purposes, then the expression [r-s] matches only r and s, but the
          expression [s-t] matches s, eszet or t.

	  The interpretation of	range expressions where	the ending range point
	  is also the starting range point of a	subsequent range expression
	  (for instance	[a-m-o]) is undefined.

          The hyphen character will be treated as itself if it occurs first
          (after an initial ^, if any) or last in the list, or as an ending
          range point in a range expression. As examples, the expressions
          [-ac] and [ac-] are equivalent and match any of the characters a, c
          or -; [^-ac] and [^ac-] are equivalent and match any characters
          except a, c or -; the expression [%--] matches any of the characters
          between % and - inclusive; the expression [--@] matches any of the
          characters between - and @ inclusive; and the expression [a--@] is
          invalid, because the letter a follows the symbol - in the POSIX
          locale. To use a hyphen as the starting range point, it must either
          come first in the bracket expression or be specified as a collating
          symbol, for example: [][.-.]-0], which matches either a right
          bracket or any character or collating element that collates between
          hyphen and 0, inclusive.

	  If a bracket expression must specify both - and ], the ] must	be
	  placed first (after the ^, if	any) and the - last within the bracket
	  expression.
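
          Several of the bracket expression rules above can be checked
          with short patterns. The sketch below assumes the POSIX
          <regex.h> interface and the default C (POSIX) locale;
          bracket_matches is a hypothetical helper. Collating symbols,
          equivalence classes and ranges are left out because their
          results depend on the locale's collation definition.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the whole of `text` is matched by the
 * BRE `pattern` (callers anchor the pattern with ^ and $), 0 if not,
 * -1 if regcomp() rejects the pattern. */
static int bracket_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Patterns illustrating the rules:
 *   ^[abc]$                     matching list: a, b or c
 *   ^[^abc]$                    non-matching list: anything but a, b, c
 *   ^[[:digit:]][[:digit:]]*$   character class expression
 *   ^[]a-]$                     ] placed first and - placed last are literal
 *   ^[-ac]$  and  ^[ac-]$       hyphen first or last is literal
 */
```

          For example, bracket_matches("^[]a-]$", "]") and
          bracket_matches("^[]a-]$", "-") both report a match, while
          bracket_matches("^[]a-]$", "b") does not.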

     BREs Matching Multiple Characters

          The following rules can be used to construct BREs matching multiple
          characters from BREs matching a single character:

     1.	  The concatenation of BREs matches the	concatenation of the strings
	  matched by each component of the BRE.

     2.   A subexpression can be defined within a BRE by enclosing it between
          the character pairs \( and \). Such a subexpression matches
	  whatever it would have matched without the \(	and \),	except that
	  anchoring within subexpressions is optional behaviour.
	  Subexpressions can be	arbitrarily nested.

     3.   The back-reference expression \n matches the same (possibly empty)
          string of characters as was matched by a subexpression enclosed
          between \( and \) preceding the \n. The character n must be a digit
          from 1 to 9 inclusive, specifying the nth subexpression (the one
          that begins with the nth \( and ends with the corresponding paired
          \)).  The expression is invalid if fewer than n subexpressions
          precede the \n.
	  For example, the expression ^\(.*\)\1$ matches a line	consisting of
	  two adjacent appearances of the same string, and the expression
	  \(a\)*\1 fails to match a. The limit of nine back-references to
	  subexpressions in the	RE is based on the use of a single digit
	  identifier.  This does not imply that	only nine subexpressions are
	  allowed in REs. The following	is a valid BRE with ten
	  subexpressions:

     \(\(\(ab\)*c\)*d\)\(ef\)*\(gh\)\{2\}\(ij\)*\(kl\)*\(mn\)*\(op\)*\(qr\)*


     4.   When a BRE matching a single character, a subexpression or a
          back-reference is followed by the special character asterisk (*),
          together with that asterisk it matches what zero or more consecutive
          occurrences of the BRE would match. For example, [ab]* and [ab][ab]
          are equivalent when matching the string ab.

     5.   When a BRE matching a single character, a subexpression or a
          back-reference is followed by an interval expression of the format
          \{m\}, \{m,\} or \{m,n\}, together with that interval expression it
          matches what repeated consecutive occurrences of the BRE would
          match. The values of m and n will be decimal integers in the range
          0 <= m <= n <= RE_DUP_MAX, where m specifies the exact or minimum
          number of occurrences and n specifies the maximum number of
          occurrences. The expression \{m\} matches exactly m occurrences of
          the preceding BRE, \{m,\} matches at least m occurrences and \{m,n\}
          matches any number of occurrences between m and n, inclusive.

	  For example, in the string abababccccccd the BRE c\{3\} is matched
	  by characters	seven to nine, the BRE \(ab\)\{4,\} is not matched at
	  all and the BRE c\{1,3\}d is matched by characters ten to thirteen.

     The behaviour of multiple adjacent	duplication symbols (* and intervals)
     produces undefined	results.
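
     The interval and back-reference examples from the rules above can be
     reproduced as follows. This is a sketch that assumes the POSIX
     <regex.h> interface; bre_span is a hypothetical helper that reports
     0-based offsets with an exclusive end, so characters seven to nine
     correspond to offsets 6 and 9.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: searches `text` for the BRE `pattern` and stores the
 * 0-based [start, end) offsets of the overall match when both pointers are
 * non-NULL. Returns 1 on a match, 0 on no match, -1 on a compile error. */
static int bre_span(const char *pattern, const char *text,
                    int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int found;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, 0) != 0)           /* cflags 0: BRE syntax */
        return -1;
    found = (regexec(&re, text, 1, m, 0) == 0);
    if (found && start != NULL && end != NULL) {
        *start = (int)m[0].rm_so;
        *end = (int)m[0].rm_eo;
    }
    regfree(&re);
    return found;
}
```

     Against abababccccccd, c\{3\} gives offsets 6 and 9, \(ab\)\{4,\}
     gives no match, and c\{1,3\}d gives offsets 9 and 13. The
     back-reference BRE ^\(.*\)\1$ matches abcabc but not abcabd.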

     BRE Precedence

	  The order of precedence is as	shown in the following table:

	  BRE Precedence (from high to low)

	       collation-related bracket symbols   [= =]  [: :]	 [. .]

	       escaped characters		   \<special character>

	       bracket expression		   []

	       subexpressions/back-references	   \(\)\n

	       single-character-BRE duplication	   *\{m,n\}

	       concatenation

	       anchoring			   ^ $

     BRE Expression Anchoring

	  A BRE	can be limited to matching strings that	begin or end a line;
	  this is called anchoring.  The circumflex and	dollar sign special
	  characters will be considered	BRE anchors in the following contexts:


     1.	  A circumflex (^) is an anchor	when used as the first character of an
	  entire BRE. The implementation may treat circumflex as an anchor
	  when used as the first character of a	subexpression. The circumflex
	  will anchor the expression (or optionally subexpression) to the
	  beginning of a string; only sequences	starting at the	first
	  character of a string	will be	matched	by the BRE. For	example, the
	  BRE ^ab matches ab in	the string abcdef, but fails to	match in the
	  string cdefab. The BRE \(^ab\) may match the former string.  A
	  portable BRE must escape a leading circumflex	in a subexpression to
	  match	a literal circumflex.

     2.	  A dollar sign	($) is an anchor when used as the last character of an
	  entire BRE. The implementation may treat a dollar sign as an anchor
	  when used as the last	character of a subexpression.  The dollar sign
	  will anchor the expression (or optionally subexpression) to the end
	  of the string	being matched; the dollar sign can be said to match
	  the end-of-string following the last character.

     3.	  A BRE	anchored by both ^ and $ matches only an entire	string.	For
	  example, the BRE ^abcdef$ matches strings consisting only of abcdef.
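
     The anchoring rules for BREs can be illustrated as below. The sketch
     assumes the POSIX <regex.h> interface; bre_span is a hypothetical
     helper. Anchors inside subexpressions are not shown because the text
     above leaves that behaviour to the implementation.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: searches `text` for the BRE `pattern` and stores the
 * 0-based [start, end) offsets of the match when both pointers are non-NULL.
 * Returns 1 on a match, 0 on no match, -1 on a compile error. */
static int bre_span(const char *pattern, const char *text,
                    int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int found;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    found = (regexec(&re, text, 1, m, 0) == 0);
    if (found && start != NULL && end != NULL) {
        *start = (int)m[0].rm_so;
        *end = (int)m[0].rm_eo;
    }
    regfree(&re);
    return found;
}
```

     With this helper, ^ab matches at offsets 0 to 2 of abcdef and fails
     on cdefab, ^abcdef$ matches abcdef but not abcdefg, and the escaped
     form \(\^ab\) matches a literal circumflex followed by ab.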

Extended Regular Expressions

	  The extended regular expression (ERE)	notation and construction
	  rules	will apply to utilities	defined	as using extended regular
	  expressions; any exceptions to the following rules are noted in the
          descriptions of the specific utilities using EREs.

     EREs Matching a Single Character or Collating Element

	  An ERE ordinary character, a special character preceded by a
	  backslash or a period	matches	a single character.  A bracket
	  expression matches a single character	or a single collating element.
	  An ERE matching a single character enclosed in parentheses matches
	  the same as the ERE without parentheses would	have matched.

     ERE Ordinary Characters

          An ordinary character is an ERE that matches itself.  An ordinary
          character is any character in the supported character set, except
          for the ERE special characters listed in ERE Special Characters
          below.  The interpretation of an ordinary character preceded by a
          backslash (\) is undefined.

     ERE Special Characters

	  An ERE special character has special properties in certain contexts.
	  Outside those	contexts, or when preceded by a	backslash, such	a
	  character is an ERE that matches the special character itself.  The
	  extended regular expression special characters and the contexts in
	  which	they have their	special	meaning	are:


     . [ \ (
          The period, left-bracket, backslash and left-parenthesis are special
          except when used in a bracket expression.  Outside a bracket
          expression, a left-parenthesis immediately followed by a
          right-parenthesis produces undefined results.

     )	  The right-parenthesis	is special when	matched	with a preceding
	  left-parenthesis, both outside a bracket expression.

     * + ? {
	  The asterisk,	plus-sign, question-mark and left-brace	are special
	  except when used in a	bracket	expression.  Any of the	following uses
	  produce undefined results:

	       if these	characters appear first	in an ERE, or immediately
	       following a vertical-line, circumflex or	left-parenthesis

	       if a left-brace is not part of a	valid interval expression.

     |	  The vertical-line is special except when used	in a bracket
	  expression.  A vertical-line appearing first or last in an ERE, or
	  immediately following	a vertical-line	or a left-parenthesis, or
	  immediately preceding	a right-parenthesis, produces undefined
          results.

     ^	  The circumflex is special when used:

	       as an anchor

	       as the first character of a bracket expression.

     $	  The dollar sign is special when used as an anchor.

     Periods in	EREs

	  A period (.),	when used outside a bracket expression,	is an ERE that
	  matches any character	in the supported character set except NUL.

     EREs Matching Multiple Characters

	  The following	rules will be used to construct	EREs matching multiple
	  characters from EREs matching	a single character:

     1.	  A concatenation of EREs matches the concatenation of the character
	  sequences matched by each component of the ERE. A concatenation of
	  EREs enclosed	in parentheses matches whatever	the concatenation
	  without the parentheses matches. For example,	both the ERE cd	and
	  the ERE (cd) are matched by the third	and fourth character of	the
	  string abcdefabcdef.

     2.	  When an ERE matching a single	character or an	ERE enclosed in
	  parentheses is followed by the special character plus-sign (+),
	  together with	that plus-sign it matches what one or more consecutive
	  occurrences of the ERE would match. For example, the ERE b+(bc)
	  matches the fourth to	seventh	characters in the string acabbbcde.
	  And, [ab]+ and [ab][ab]* are equivalent.

     3.	  When an ERE matching a single	character or an	ERE enclosed in
	  parentheses is followed by the special character asterisk (*),
	  together with	that asterisk it matches what zero or more consecutive
	  occurrences of the ERE would match.  For example, the	ERE b*c
	  matches the first character in the string cabbbcde, and the ERE b*cd
	  matches the third to seventh characters in the string
	  cabbbcdebbbbbbcdbc. And, [ab]* and [ab][ab] are equivalent when
	  matching the string ab.

     4.	  When an ERE matching a single	character or an	ERE enclosed in
	  parentheses is followed by the special character question-mark (?),
	  together with	that question-mark it matches what zero	or one
	  consecutive occurrences of the ERE would match. For example, the ERE
	  b?c matches the second character in the string acabbbcde.

     5.	  When an ERE matching a single	character or an	ERE enclosed in
	  parentheses is followed by an	interval expression of the format {m},
	  {m,} or {m,n}, together with that interval expression	it matches
	  what repeated	consecutive occurrences	of the ERE would match.	 The
          values of m and n will be decimal integers in the range 0 <= m <= n
          <= RE_DUP_MAX, where m specifies the exact or minimum number of
	  occurrences and n specifies the maximum number of occurrences. The
	  expression {m} matches exactly m occurrences of the preceding	ERE,
	  {m,} matches at least	m occurrences and {m,n}	matches	any number of
	  occurrences between m	and n, inclusive.  For example,	in the string
	  abababccccccd	the ERE	c{3} is	matched	by characters seven to nine
	  and the ERE (ab){2,} is matched by characters	one to six.

     The behaviour of multiple adjacent	duplication symbols (+,	*, ? and
     intervals)	produces undefined results.
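
     The ERE duplication examples above can be reproduced with regcomp()
     and the REG_EXTENDED flag. This is a minimal sketch, assuming the
     POSIX <regex.h> interface; ere_span is a hypothetical helper that
     reports 0-based offsets with an exclusive end, so the fourth to
     seventh characters correspond to offsets 3 and 7.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: searches `text` for the ERE `pattern` and stores the
 * 0-based [start, end) offsets of the overall match when both pointers are
 * non-NULL. Returns 1 on a match, 0 on no match, -1 on a compile error. */
static int ere_span(const char *pattern, const char *text,
                    int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int found;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    found = (regexec(&re, text, 1, m, 0) == 0);
    if (found && start != NULL && end != NULL) {
        *start = (int)m[0].rm_so;
        *end = (int)m[0].rm_eo;
    }
    regfree(&re);
    return found;
}
```

     For example, ere_span("b+(bc)", "acabbbcde", &s, &e) stores 3 and 7,
     and ere_span("(ab){2,}", "abababccccccd", &s, &e) stores 0 and 6.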

     ERE Alternation

	  Two EREs separated by	the special character vertical-line (|)	match
	  a string that	is matched by either. For example, the ERE a((bc)|d)
	  matches the string abc and the string	ad. Single characters, or
	  expressions matching single characters, separated by the vertical
	  bar and enclosed in parentheses, will	be treated as an ERE matching
	  a single character.

     ERE Precedence

	  The order of precedence is as	shown in the following table:

          ERE Precedence (from high to low)

	       collation-related bracket symbols   [= =]  [: :]	 [. .]

	       escaped characters		   \<special character>

	       bracket expression		   []

	       grouping				   ()

	       single-character-ERE duplication	   *+?{m,n}

	       concatenation

	       anchoring			   ^ $

	       alternation			   |

     For example, the ERE abba|cde matches either the string abba or the
     string cde	(rather	than the string	abbade or abbcde, because
     concatenation has a higher	order of precedence than alternation).

     ERE Expression Anchoring

	  An ERE can be	limited	to matching strings that begin or end a	line;
	  this is called anchoring.  The circumflex and	dollar sign special
	  characters are considered ERE	anchors	when used anywhere outside a
	  bracket expression. This has the following effects:

     1.	  A circumflex (^) outside a bracket expression	anchors	the expression
	  or subexpression it begins to	the beginning of a string; such	an
	  expression or	subexpression can match	only a sequence	starting at
	  the first character of a string. For example,	the EREs ^ab and (^ab)
	  match	ab in the string abcdef, but fail to match in the string
	  cdefab, and the ERE a^b is valid, but	can never match	because	the a
	  prevents the expression ^b from matching starting at the first
	  character.

     2.	  A dollar sign	($) outside a bracket expression anchors the
	  expression or	subexpression it ends to the end of a string; such an
	  expression or	subexpression can match	only a sequence	ending at the
	  last character of a string. For example, the EREs ef$	and (ef$)
	  match	ef in the string abcdef, but fail to match in the string
	  cdefab, and the ERE e$f is valid, but	can never match	because	the f
	  prevents the expression e$ from matching ending at the last
	  character.

     Regular Expression	Grammar

	  Grammars describing the syntax of both basic and extended regular
	  expressions are presented in this section. The grammar takes
	  precedence over the text.

     BRE/ERE Grammar Lexical Conventions

	  The lexical conventions for regular expressions are as described in
	  this section.

	  Except as noted, the longest possible	token or delimiter beginning
	  at a given point will	be recognised.

	  The following	tokens will be processed (in addition to those string
	  constants shown in the grammar):


     COLL_ELEM	    Any	single-character collating element, unless it is a
		    META_CHAR.

     BACKREF	    Applicable only to basic regular expressions. The
		    character string consisting	of \ followed by a
		    single-digit numeral, 1 to 9.

     DUP_COUNT	    Represents a numeric constant. It is an integer in the
		    range 0 <= DUP_COUNT <= RE_DUP_MAX.	This token will	only
		    be recognised when the context of the grammar requires it.
		    At all other times,	digits not preceded by \ will be
		    treated as ORD_CHAR.

     META_CHAR	    One	of the characters:

			 ^   when found	first in a bracket expression

			 -   when found	anywhere but first (after an initial
			     ^,	if any)	or last	in a bracket expression, or as
			     the ending range point in a range expression

			 ]   when found	anywhere but first (after an initial
			     ^,	if any)	in a bracket expression.

     L_ANCHOR	    Applicable only to basic regular expressions. The
		    character ^	when it	appears	as the first character of a
		    basic regular expression and when not QUOTED_CHAR. The ^
		    may	be recognised as an anchor elsewhere.

     ORD_CHAR	    A character, other than one	of the special characters in
		    SPEC_CHAR.

     QUOTED_CHAR    In a BRE, one of the character sequences:

		    \^	    \.	    \*	    \[	    \$	    \\

		    In an ERE, one of the character sequences:

		    \^	 \.   \[   \$	\(   \)	 \|  \*	  \+   \?   \{	 \\

     R_ANCHOR	    (Applicable	only to	basic regular expressions.) The
		    character $	when it	appears	as the last character of a
		    basic regular expression and when not QUOTED_CHAR. The $
		    may	be recognised as an anchor elsewhere.

     SPEC_CHAR	    For	basic regular expressions, will	be one of the
		    following special characters:


		    \  anywhere	outside	bracket	expressions

		    [	anywhere outside bracket expressions

		    ^	when used as an	anchor or when
			first in a bracket expression

		    $	when used as an	anchor

		    *	anywhere except: first in an entire RE;
			anywhere in a bracket expression; directly
			following \(; directly following an
			anchoring ^.

		    For	extended regular expressions, will be one of the
		    following special characters found anywhere	outside
		    bracket expressions:

		    ^	 .    [	   $	(    )	  |    *    +	 ?    {	   \

		    (The close-parenthesis is considered special in this
		    context only if matched with a preceding open-parenthesis.)

RE and Bracket Expression Grammar

     This section presents the grammar for basic regular expressions,
     including the bracket expression grammar that is common to	both BREs and
     EREs.

     %token ORD_CHAR QUOTED_CHAR DUP_COUNT
     %token BACKREF L_ANCHOR R_ANCHOR
     %token Back_open_paren  Back_close_paren
     /*	   '\('		     '\)'		      */
     %token Back_open_brace  Back_close_brace
     /*	   '\{'		     '\}'		      */
     /*	The following tokens are for the Bracket Expression
	grammar	common to both REs and EREs.		  */
     %token COLL_ELEM META_CHAR
     %token Open_equal Equal_close Open_dot Dot_close Open_colon Colon_close
     /*	   '[='		'=]'	  '[.'	  '.]'	     '[:'	 ':]'  */
     %token class_name
     /*	class_name is a	keyword	to the LC_CTYPE	locale category	*/
     /*	(representing a	character class) in the	current	locale	*/
     /*	and is only recognised between [: and :]		*/
     %start basic_reg_exp
     %%
     /*		    --------------------------------------------
		    Basic Regular Expression
		    --------------------------------------------
     */
     basic_reg_exp :	      RE_expression
		   | L_ANCHOR
		   |			    R_ANCHOR
		   | L_ANCHOR		    R_ANCHOR
		   | L_ANCHOR RE_expression
		   |	      RE_expression R_ANCHOR
		   | L_ANCHOR RE_expression R_ANCHOR
		   ;

     RE_expression :		   simple_RE
		   | RE_expression simple_RE
		   ;

     simple_RE	   : nondupl_RE
		   | nondupl_RE	RE_dupl_symbol
		   ;

     nondupl_RE	   : one_character_RE
		   | Back_open_paren RE_expression Back_close_paren
		   | Back_open_paren Back_close_paren
		   | BACKREF
		   ;

     one_character_RE :	ORD_CHAR
		      |	QUOTED_CHAR
		      |	'.'
		      |	bracket_expression
		      ;

     RE_dupl_symbol : '*'
		    | Back_open_brace DUP_COUNT		      Back_close_brace
		    | Back_open_brace DUP_COUNT	','	      Back_close_brace
		    | Back_open_brace DUP_COUNT	',' DUP_COUNT Back_close_brace
		    ;

     /*			--------------------------------------------
			Bracket	Expression
			-------------------------------------------
     */
     bracket_expression	: '[' matching_list    ']'
			| '[' nonmatching_list ']'
			;

     matching_list	: bracket_list
			;

     nonmatching_list	: '^' bracket_list
			;

     bracket_list	: follow_list
			| follow_list '-'
			;

     follow_list	:	      expression_term
			| follow_list expression_term
			;

     expression_term	: single_expression
			| range_expression
			;

     single_expression	: end_range
			| character_class
			| equivalence_class
			;

     range_expression	: start_range end_range
			| start_range '-'
			;

     start_range	: end_range '-'
			;

     end_range		: COLL_ELEM
			| collating_symbol
			;

     collating_symbol	: Open_dot COLL_ELEM Dot_close
			| Open_dot META_CHAR Dot_close
			;

     equivalence_class	: Open_equal COLL_ELEM Equal_close
			;

     character_class	: Open_colon class_name	Colon_close
			;

     The BRE grammar does not permit L_ANCHOR or R_ANCHOR inside \( and	\)
     (which implies that ^ and $ are ordinary characters).

ERE Grammar

     This section presents the grammar for extended regular expressions,
     excluding the bracket expression grammar.

     Note:  The	bracket	expression grammar and the associated %token
	    lines are identical	between	BREs and EREs. It has been omitted
	    from the ERE section to avoid unnecessary editorial	duplication.

     %token ORD_CHAR QUOTED_CHAR DUP_COUNT
     %start extended_reg_exp
     %%
     /*		      --------------------------------------------
		      Extended Regular Expression
		      --------------------------------------------
     */

     extended_reg_exp :			     ERE_branch
		      |	extended_reg_exp '|' ERE_branch
		      ;

     ERE_branch	      :		   ERE_expression
		      |	ERE_branch ERE_expression
		      ;

     ERE_expression   :	one_character_ERE
		      |	'^'
		      |	'$'
		      |	'(' extended_reg_exp ')'
		      |	ERE_expression ERE_dupl_symbol
		      ;

     one_character_ERE : ORD_CHAR
		       | QUOTED_CHAR
		       | '.'
		       | bracket_expression
		       ;

     ERE_dupl_symbol  :	'*'
		      |	'+'
		      |	'?'
		      |	'{' DUP_COUNT		    '}'
		      |	'{' DUP_COUNT ','	    '}'
		      |	'{' DUP_COUNT ',' DUP_COUNT '}'
		      ;

     The ERE grammar does not permit several constructs	that previous sections
     specify as	having undefined results:

     o	 ORD_CHAR preceded by \

     o	 one or	more ERE_dupl_symbols appearing	first in an ERE,
	 or immediately	following |, ^ or (

     o	 { not part of a valid ERE_dupl_symbol

     o	 | appearing first or last in an ERE,
	 or immediately	following | or
	 (, or immediately preceding ).

     Implementations are permitted to extend the language to allow these.
     Portable applications cannot use such constructs.

