REGCOMP(5) REGCOMP(5)
regcomp - X/Open regular expressions definition and interface
Note: Two versions of regular expressions are supported:
o the historical Simple Regular Expressions, which provide backward
compatibility, but which will be withdrawn from a future issue of
this document set
o the improved internationalised version that complies with the
ISO/IEC 9945-2: 1993 standard.
The first (historical) version is described as part of the regexp
function in the regexp(5) man page. The second (improved) version is
described in this man page.
Regular Expressions (REs) provide a mechanism to select specific strings
from a set of character strings.
Regular expressions are a context-independent syntax that can represent a
wide variety of character sets and character set orderings, where these
character sets are interpreted according to the current locale. While
many regular expressions can be interpreted differently depending on the
current locale, many features, such as character class expressions,
provide for contextual invariance across locales.
The Basic Regular Expression (BRE) notation and construction rules in bre
apply to most utilities supporting regular expressions. Some utilities,
instead, support the Extended Regular Expressions (ERE) described in ere;
any exceptions for both cases are noted in the descriptions of the
specific utilities using regular expressions. Both BREs and EREs are
supported by the Regular Expression Matching interface in the regcmp(),
regexec() and related functions.
Regular Expression Definitions
For the purposes of this section, the following definitions apply:
entire regular expression
The concatenated set of one or more BREs or EREs that make up the pattern
specified for string selection.
matched
A sequence of zero or more characters is said to be matched by a BRE
or ERE when the characters in the sequence correspond to a sequence
of characters defined by the pattern.
Matching is based on the bit pattern used for encoding the
character, not on the graphic representation of the character. This
means that if a character set contains two or more encodings for a
Page 1
REGCOMP(5) REGCOMP(5)
graphic symbol, or if the strings searched contain text encoded in
more than one codeset, no attempt is made to search for any other
representation of the encoded symbol. If that is required, the user
can specify equivalence classes containing all variations of the
desired graphic symbol.
The search for a matching sequence starts at the beginning of a
string and stops when the first sequence matching the expression is
found, where first is defined to mean ``begins earliest in the
string''. If the pattern permits a variable number of matching
characters and thus there is more than one such sequence starting at
that point, the longest such sequence will be matched. For example:
the BRE bb* matches the second to fourth characters of abbbc, and
the ERE (wee|week)(knights|night) matches all ten characters of
weeknights.
Consistent with the whole match being the longest of the leftmost
matches, each subpattern, from left to right, matches the longest
possible string. For this purpose, a null string is considered to be
longer than no match at all. For example, matching the BRE \(.*\).*
against abcdef, the subexpression (\1) is abcdef, and matching the
BRE \(a*\)* against bc, the subexpression (\1) is the null string.
It is possible to determine what strings correspond to
subexpressions by recursively applying the leftmost longest rule to
each subexpression, but only with the proviso that the overall match
is leftmost longest. For example, matching \(ac*\)c*d[ac]*\1 against
acdacaaa matches acdacaaa (with \1=a); simply matching the longest
match for \(ac*\) would yield \1=ac, but the overall match would be
smaller (acdac). Conceptually, the implementation must examine every
possible match and among those that yield the leftmost longest total
matches, pick the one that does the longest match for the leftmost
subexpression and so on. Note that this means that matching by
subexpressions is context-dependent: a subexpression within a larger
RE may match a different string from the one it would match as an
independent RE, and two instances of the same subexpression within
the same larger RE may match different lengths even in similar
sequences of characters. For example, in the ERE (a.*b)(a.*b), the
two identical subexpressions would match four and six characters,
respectively, of accbaccccb.
When a multi-character collating element in a bracket expression is
involved, the longest sequence will be measured in characters
consumed from the string to be matched; that is, the collating
element counts not as one element, but as the number of characters
it matches.
BRE (ERE) matching a single character
A BRE or ERE that matches either a single character or a single
collating element.
Page 2
REGCOMP(5) REGCOMP(5)
Only a BRE or ERE of this type that includes a bracket expression
can match a collating element.
The definition of single character has been expanded to include also
collating elements consisting of two or more characters; this
expansion is applicable only when a bracket expression is included
in the BRE or ERE. An example of such a collating element may be
the Dutch ij, which collates as a y. In some encodings, a ligature
``i with j'' exists as a character and would represent a singlecharacter
collating element. In another encoding, no such ligature
exists, and the two-character sequence ij is defined as a multicharacter
collating element. Outside brackets, the ij is treated as
a two-character RE and matches the same characters in a string.
Historically, a bracket expression only matched a single character.
If, however, the bracket expression defines, for example, a range
that includes ij, then this particular bracket expression will also
match a sequence of the two characters i and j in the string.
BRE (ERE) matching multiple characters
A BRE or ERE that matches a concatenation of single characters or
collating elements.
invalid
This section uses the term invalid for certain constructs or
conditions. Invalid REs will cause the utility or function using
the RE to generate an error condition. When invalid is not used,
violations of the specified syntax or semantics for REs produce
undefined results: this may entail an error, enabling an extended
syntax for that RE, or using the construct in error as literal
characters to be matched. For example, the BRE construct \{1,2,3\}
does not comply with the grammar. A portable application cannot rely
on it producing an error nor matching the literal characters
\{1,2,3\}.
Regular Expression General Requirements
The requirements in this section apply to both basic and extended regular
expressions.
The use of regular expressions is generally associated with text
processing. REs (BREs and EREs) operate on text strings; that is, zero
or more characters followed by an end-of-string delimiter (typically
NUL). Some utilities employing regular expressions limit the processing
to lines; that is, zero or more characters followed by a newline
character. In the regular expression processing described in this
document, the newline character is regarded as an ordinary character and
both a period and a non-matching list can match one. The individual man
pages specify within the individual descriptions of those standard
utilities employing regular expressions whether they permit matching of
newline characters; if not stated otherwise, the use of literal newline
characters or any escape sequence equivalent produces undefined results.
Page 3
REGCOMP(5) REGCOMP(5)
Those utilities (like grep) that do not allow newline characters to match
are responsible for eliminating any newline character from strings before
matching against the RE. The regcomp() function (see regcomp(3G)),
however, can provide support for such processing without violating the
rules of this section.
The interfaces specified in this document set do not permit the inclusion
of a NUL character in an RE or in the string to be matched. If during
the operation of a standard utility a NUL is included in the text
designated to be matched, that NUL may designate the end of the text
string for the purposes of matching.
When a standard utility or function that uses regular expressions
specifies that pattern matching will be performed without regard to the
case (upper- or lower-) of either data or patterns, then when each
character in the string is matched against the pattern, not only the
character, but also its case counterpart (if any), will be matched. This
definition of case-insensitive processing is intended to allow matching
of multi-character collating elements as well as characters. For
instance, as each character in the string is matched using both its
cases, the RE [[.Ch.]] when matched against the string char, is in
reality matched against ch, Ch, cH and CH.
The implementation will support any regular expression that does not
exceed 256 bytes in length.
Basic Regular Expressions [Toc] [Back] BREs Matching a Single Character or Collating Element
A BRE ordinary character, a special character preceded by a
backslash or a period matches a single character. A bracket
expression matches a single character or a single collating element.
BRE Ordinary Characters
An ordinary character is a BRE that matches itself: any character in
the supported character set, except for the BRE special characters
listed in brespec.
The interpretation of an ordinary character preceded by a backslash
(\) is undefined, except for:
1. the characters ), (, { and }
2. the digits 1 to 9 inclusive
3. a character inside a bracket expression.
BRE Special Characters
Page 4
REGCOMP(5) REGCOMP(5)
A BRE special character has special properties in certain contexts.
Outside those contexts, or when preceded by a backslash, such a
character will be a BRE that matches the special character itself.
The BRE special characters and the contexts in which they have their
special meaning are:
.[\ The period, left-bracket and backslash is special except when used
in a bracket expression. An expression containing a [ that is not
preceded by a backslash and is not part of a bracket expression
produces undefined results.
* The asterisk is special except when used:
o in a bracket expression
o as the first character of an entire BRE (after an initial ^,
if any)
o as the first character of a subexpression (after an initial ^,
if any).
^ The circumflex is special when used:
o as an anchor
o as the first character of a bracket expression.
$ The dollar sign is special when used as an anchor.
Periods in BREs
A period (.), when used outside a bracket expression, is a BRE that
matches any character in the supported character set except NUL.
RE Bracket Expression
A bracket expression (an expression enclosed in square brackets, [ ]) is
an RE that matches a single collating element contained in the non-empty
set of collating elements represented by the bracket expression.
The following rules and definitions apply to bracket expressions:
1. A bracket expression is either a matching list expression or a nonmatching
list expression. It consists of one or more expressions:
collating elements, collating symbols, equivalence classes,
character classes or range expressions. Portable applications must
not use range expressions, even though all implementations support
them. The right-bracket (]) loses its special meaning and represents
itself in a bracket expression if it occurs first in the list (after
an initial circumflex (^), if any). Otherwise, it terminates the
Page 5
REGCOMP(5) REGCOMP(5)
bracket expression, unless it appears in a collating symbol (such as
[.].]) or is the ending right-bracket for a collating symbol,
equivalence class or character class. The special characters:
. * [ \
(period, asterisk, left-bracket and backslash, respectively) lose
their special meaning within a bracket expression.
The character sequences:
[. [= [:
(left-bracket followed by a period, equals-sign or colon) are
special inside a bracket expression and are used to delimit
collating symbols, equivalence class expressions and character class
expressions. These symbols must be followed by a valid expression
and the matching terminating sequence .], =] or :], as described in
the following items.
2. A matching list expression specifies a list that matches any one of
the expressions represented in the list. The first character in the
list must not be the circumflex. For example, [abc] is an RE that
matches any of the characters a, b or c.
3. A non-matching list expression begins with a circumflex (^), and
specifies a list that matches any character or collating element
except for the expressions represented in the list after the leading
circumflex. For example, [^abc] is an RE that matches any character
or collating element except the characters a, b or c. The circumflex
will have this special meaning only when it occurs first in the
list, immediately following the left-bracket.
4. A collating symbol is a collating element enclosed within bracketperiod
([. .]) delimiters. Collating elements are defined as
described in colltbl(1M). Multi-character collating elements must be
represented as collating symbols when it is necessary to distinguish
them from a list of the individual characters that make up the
multi-character collating element. For example, if the string ch is
a collating element in the current collation sequence with the
associated collating symbol <ch>, the expression [[.ch.]] will be
treated as an RE matching the character sequence ch, while [ch] will
be treated as an RE matching c or h. Collating symbols will be
recognised only inside bracket expressions. This implies that the RE
[[.ch.]]*c matches the first to fifth character in the string
chchch. If the string is not a collating element in the current
collating sequence definition, or if the collating element has no
characters associated with it (for example, see the symbol <HIGH> in
the example collation definition shown in colltbl(1M)), the symbol
will be treated as an invalid expression.
Page 6
REGCOMP(5) REGCOMP(5)
5. An equivalence class expression represents the set of collating
elements belonging to an equivalence class, as described in
colltbl(1M). Only primary equivalence classes will be recognised.
The class is expressed by enclosing any one of the collating
elements in the equivalence class within bracket-equal ([= =])
delimiters. For example, if a, agrave and acircumflex belong to the
same equivalence class, then [=a=]b], [[=agrave=]b] and
[[=acircumflex=]b] will each be equivalent to [aagraveacircumflexb].
If the collating element does not belong to an equivalence class,
the equivalence class expression will be treated as a collating
symbol.
6. A character class expression represents the set of characters
belonging to a character class, as defined in the LC_CTYPE category
in the current locale. All character classes specified in the
current locale will be recognised. A character class expression is
expressed as a character class name enclosed within bracket-colon
([: :]) delimiters.
The following character class expressions are supported in all
locales:
The following character class expressions are supported in all
locales:
[:alnum:] [:cntrl:] [:lower:] [:space:]
[:alpha:] [:digit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:] [:xdigit:]
In addition, character class expressions of the form:
[:name:]
are recognised in those locales where the name keyword has been
given a charclass definition in the LC_CTYPE category.
7. A range expression represents the set of collating elements that
fall between two elements in the current collation sequence,
inclusively. It is expressed as the starting point and the ending
point separated by a hyphen (-).
Range expressions must not be used in portable applications because
their behaviour is dependent on the collating sequence. Ranges will
be treated according to the current collating sequence, and include
such characters that fall within the range based on that collating
sequence, regardless of character values. This, however, means that
the interpretation will differ depending on collating sequence. If,
for instance, one collating sequence defines aumlat as a variant of
a, while another defines it as a letter following z, then the
expression [aumlat-z] is valid in the first language and invalid in
Page 7
REGCOMP(5) REGCOMP(5)
the second.
In the following, all examples assume the collation sequence
specified for the POSIX locale, unless another collation sequence is
specifically defined.
The starting range point and the ending range point must be a
collating element or collating symbol. An equivalence class
expression used as a starting or ending point of a range expression
produces unspecified results. An equivalence class can be used
portably within a bracket expression, but only outside the range.
For example, the unspecified expression [[=e=]-f] should be given as
[[=e=]e-f]. The ending range point must collate equal to or higher
than the starting range point; otherwise, the expression will be
treated as invalid. The order used is the order in which the
collating elements are specified in the current collation
definition. One-to-many mappings (see the description of LC_COLLATE
in locale(1)) will not be performed. For example, assuming that the
character eszet is is placed in the collation sequence after r and
s, but before t and that it maps to the sequence ss for collation
purposes, then the expression [r-s] matches only r and s, but the
expression [s-t] matches s, eszet ot t.
The interpretation of range expressions where the ending range point
is also the starting range point of a subsequent range expression
(for instance [a-m-o]) is undefined.
The hyphen character will be treated as itself if it occurs first
(after an initial ^, if any) or last in the list, or as an ending
range point in a range expression. As examples, the expressions [-
ac] and [ac-] are equivalent and match any of the characters a, c or
-; [^-ac] and [^ac-] are equivalent and match any characters except
a, c or -; the expression [%- -] matches any of the characters
between % and - inclusive; the expression [- -@] matches any of the
characters between - and @ inclusive; and the expression [a- -@] is
invalid, because the letter a follows the symbol - in the POSIX
locale. To use a hyphen as the starting range point, it must either
come first in the bracket expression or be specified as a collating
symbol, for example: [][.-.]-0], which matches either a right
bracket or any character or collating element that collates between
hyphen and 0, inclusive.
If a bracket expression must specify both - and ], the ] must be
placed first (after the ^, if any) and the - last within the bracket
expression.
BREs Matching Multiple Characters
The following rules can be used to construct BREs matching multiple
characters from BREs matching a single character:
Page 8
REGCOMP(5) REGCOMP(5)
1. The concatenation of BREs matches the concatenation of the strings
matched by each component of the BRE.
2. A subexpression can be defined within a BRE by enclosing it between
the character pairs \( and \) . Such a subexpression matches
whatever it would have matched without the \( and \), except that
anchoring within subexpressions is optional behaviour.
Subexpressions can be arbitrarily nested.
3. The back-reference expression \n matches the same (possibly empty)
string of characters as was matched by a subexpression enclosed
between \( and \) preceding the \n. The character n must be a digit
from 1 to 9 inclusive, nth subexpression (the one that begins with
the nth \( and ends with the corresponding paired \)). The
expression is invalid if less than n subexpressions precede the \n.
For example, the expression ^\(.*\)\1$ matches a line consisting of
two adjacent appearances of the same string, and the expression
\(a\)*\1 fails to match a. The limit of nine back-references to
subexpressions in the RE is based on the use of a single digit
identifier. This does not imply that only nine subexpressions are
allowed in REs. The following is a valid BRE with ten
subexpressions:
\(\(\(ab\)*c\)*d\)\(ef\)*\(gh\)\{2\}\(ij\)*\(kl\)*\(mn\)*\(op\)*\(qr\)*
4. When a BRE matching a single character, a subexpression or a backreference
is followed by the special character asterisk (*),
together with that asterisk it matches what zero or more consecutive
occurrences of the BRE would match. For example, [ab]* and [ab][ab]
are equivalent when matching the string ab.
5. When a BRE matching a single character, a subexpression or a backreference
is followed by an interval expression of the format \{m\},
\{m,\} or \{m,n\}, together with that interval expression it matches
what repeated consecutive occurrences of the BRE would match. The
values of m and n will be decimal integers in the range 0 <= m <= n
<= RE_DUP_MAX, where m specifies the exact or minimum number of
occurrences and n specifies the maximum number of occurrences. The
expression \{m\} matches exactly m occurrences of the preceding BRE,
\{m,\} matches at least m occurrences and \{m,n\} matches any number
of occurrences between m and n, inclusive.
For example, in the string abababccccccd the BRE c\{3\} is matched
by characters seven to nine, the BRE \(ab\)\{4,\} is not matched at
all and the BRE c\{1,3\}d is matched by characters ten to thirteen.
The behaviour of multiple adjacent duplication symbols (* and intervals)
produces undefined results.
BRE Precedence
Page 9
REGCOMP(5) REGCOMP(5)
The order of precedence is as shown in the following table:
BRE Precedence (from high to low)
collation-related bracket symbols [= =] [: :] [. .]
escaped characters \<special character>
bracket expression []
subexpressions/back-references \(\)\n
single-character-BRE duplication *\{m,n\}
concatenation
anchoring ^ $
BRE Expression Anchoring
A BRE can be limited to matching strings that begin or end a line;
this is called anchoring. The circumflex and dollar sign special
characters will be considered BRE anchors in the following contexts:
1. A circumflex (^) is an anchor when used as the first character of an
entire BRE. The implementation may treat circumflex as an anchor
when used as the first character of a subexpression. The circumflex
will anchor the expression (or optionally subexpression) to the
beginning of a string; only sequences starting at the first
character of a string will be matched by the BRE. For example, the
BRE ^ab matches ab in the string abcdef, but fails to match in the
string cdefab. The BRE \(^ab\) may match the former string. A
portable BRE must escape a leading circumflex in a subexpression to
match a literal circumflex.
2. A dollar sign ($) is an anchor when used as the last character of an
entire BRE. The implementation may treat a dollar sign as an anchor
when used as the last character of a subexpression. The dollar sign
will anchor the expression (or optionally subexpression) to the end
of the string being matched; the dollar sign can be said to match
the end-of-string following the last character.
3. A BRE anchored by both ^ and $ matches only an entire string. For
example, the BRE ^abcdef$ matches strings consisting only of abcdef.
Extended Regular Expressions
The extended regular expression (ERE) notation and construction
rules will apply to utilities defined as using extended regular
expressions; any exceptions to the following rules are noted in the
descriptions of the specific utilities using EREs.
Page 10
REGCOMP(5) REGCOMP(5)
EREs Matching a Single Character or Collating Element
An ERE ordinary character, a special character preceded by a
backslash or a period matches a single character. A bracket
expression matches a single character or a single collating element.
An ERE matching a single character enclosed in parentheses matches
the same as the ERE without parentheses would have matched.
ERE Ordinary Characters
An ordinary character is an ERE that matches itself. An ordinary
character is any character in the supported character set, except
for the ERE special characters listed in erespec. The
interpretation of an ordinary character preceded by a backslash (\)
is undefined.
ERE Special Characters
An ERE special character has special properties in certain contexts.
Outside those contexts, or when preceded by a backslash, such a
character is an ERE that matches the special character itself. The
extended regular expression special characters and the contexts in
which they have their special meaning are:
. [ \ (
The period, left-bracket, backslash and left-parenthesis are special
except when used in a bracket expression. Outside a bracket
expression, a left-parenthesis immediately followed by a rightparenthesis
produces undefined results.
) The right-parenthesis is special when matched with a preceding
left-parenthesis, both outside a bracket expression.
* + ? {
The asterisk, plus-sign, question-mark and left-brace are special
except when used in a bracket expression. Any of the following uses
produce undefined results:
if these characters appear first in an ERE, or immediately
following a vertical-line, circumflex or left-parenthesis
if a left-brace is not part of a valid interval expression.
| The vertical-line is special except when used in a bracket
expression. A vertical-line appearing first or last in an ERE, or
immediately following a vertical-line or a left-parenthesis, or
immediately preceding a right-parenthesis, produces undefined
results.
Page 11
REGCOMP(5) REGCOMP(5)
^ The circumflex is special when used:
as an anchor
as the first character of a bracket expression.
$ The dollar sign is special when used as an anchor.
Periods in EREs
A period (.), when used outside a bracket expression, is an ERE that
matches any character in the supported character set except NUL.
EREs Matching Multiple Characters
The following rules will be used to construct EREs matching multiple
characters from EREs matching a single character:
1. A concatenation of EREs matches the concatenation of the character
sequences matched by each component of the ERE. A concatenation of
EREs enclosed in parentheses matches whatever the concatenation
without the parentheses matches. For example, both the ERE cd and
the ERE (cd) are matched by the third and fourth character of the
string abcdefabcdef.
2. When an ERE matching a single character or an ERE enclosed in
parentheses is followed by the special character plus-sign (+),
together with that plus-sign it matches what one or more consecutive
occurrences of the ERE would match. For example, the ERE b+(bc)
matches the fourth to seventh characters in the string acabbbcde.
And, [ab]+ and [ab][ab]* are equivalent.
3. When an ERE matching a single character or an ERE enclosed in
parentheses is followed by the special character asterisk (*),
together with that asterisk it matches what zero or more consecutive
occurrences of the ERE would match. For example, the ERE b*c
matches the first character in the string cabbbcde, and the ERE b*cd
matches the third to seventh characters in the string
cabbbcdebbbbbbcdbc. And, [ab]* and [ab][ab] are equivalent when
matching the string ab.
4. When an ERE matching a single character or an ERE enclosed in
parentheses is followed by the special character question-mark (?),
together with that question-mark it matches what zero or one
consecutive occurrences of the ERE would match. For example, the ERE
b?c matches the second character in the string acabbbcde.
5. When an ERE matching a single character or an ERE enclosed in
parentheses is followed by an interval expression of the format {m},
{m,} or {m,n}, together with that interval expression it matches
what repeated consecutive occurrences of the ERE would match. The
values of m and n will be decimal integers in the range 0 <= m <= n
Page 12
REGCOMP(5) REGCOMP(5)
<= RE_DUP_MAX, where m specifies the exact or minimum number of
occurrences and n specifies the maximum number of occurrences. The
expression {m} matches exactly m occurrences of the preceding ERE,
{m,} matches at least m occurrences and {m,n} matches any number of
occurrences between m and n, inclusive. For example, in the string
abababccccccd the ERE c{3} is matched by characters seven to nine
and the ERE (ab){2,} is matched by characters one to six.
The behaviour of multiple adjacent duplication symbols (+, *, ? and
intervals) produces undefined results.
ERE Alternation
Two EREs separated by the special character vertical-line (|) match
a string that is matched by either. For example, the ERE a((bc)|d)
matches the string abc and the string ad. Single characters, or
expressions matching single characters, separated by the vertical
bar and enclosed in parentheses, will be treated as an ERE matching
a single character.
ERE Precedence
The order of precedence is as shown in the following table:
BRE Precedence (from high to low)
collation-related bracket symbols [= =] [: :] [. .]
escaped characters \<special character>
bracket expression []
grouping ()
single-character-ERE duplication *+?{m,n}
concatenation
anchoring ^ $
alteration |
For example, the ERE abba | cde matches either the string abba or the
string cde (rather than the string abbade or abbcde, because
concatenation has a higher order of precedence than alternation).
ERE Expression Anchoring
An ERE can be limited to matching strings that begin or end a line;
this is called anchoring. The circumflex and dollar sign special
characters are considered ERE anchors when used anywhere outside a
bracket expression. This has the following effects:
Page 13
REGCOMP(5) REGCOMP(5)
1. A circumflex (^) outside a bracket expression anchors the expression
or subexpression it begins to the beginning of a string; such an
expression or subexpression can match only a sequence starting at
the first character of a string. For example, the EREs ^ab and (^ab)
match ab in the string abcdef, but fail to match in the string
cdefab, and the ERE a^b is valid, but can never match because the a
prevents the expression ^b from matching starting at the first
character.
2. A dollar sign ($) outside a bracket expression anchors the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the
last character of a string. For example, the EREs ef$ and (ef$)
match ef in the string abcdef, but fail to match in the string
cdefab, and the ERE e$f is valid, but can never match because the f
prevents the expression e$ from matching ending at the last
character.
Regular Expression Grammar
Grammars describing the syntax of both basic and extended regular
expressions are presented in this section. The grammar takes
precedence over the text.
BRE/ERE Grammar Lexical Conventions
The lexical conventions for regular expressions are as described in
this section.
Except as noted, the longest possible token or delimiter beginning
at a given point will be recognised.
The following tokens will be processed (in addition to those string
constants shown in the grammar):
COLL_ELEM Any single-character collating element, unless it is a
META_CHAR.
BACKREF Applicable only to basic regular expressions. The
character string consisting of \ followed by a singledigit
numeral, 1 to 9.
DUP_COUNT Represents a numeric constant. It is an integer in the
range 0 <= DUP_COUNT <= RE_DUP_MAX. This token will only
be recognised when the context of the grammar requires it.
At all other times, digits not preceded by \ will be
treated as ORD_CHAR.
META_CHAR One of the characters:
^ when found first in a bracket expression
Page 14
REGCOMP(5) REGCOMP(5)
- when found anywhere but first (after an initial
^, if any) or last in a bracket expression, or as
the ending
range point in a range expression
] when found anywhere but first (after an initial
^, if any) in a bracket expression.
L_ANCHOR Applicable only to basic regular expressions. The
character ^ when it appears as the first character of a
basic regular expression and when not QUOTED_CHAR. The ^
may be recognised as an anchor elsewhere.
ORD_CHAR A character, other than one of the special characters in
SPEC_CHAR.
QUOTED_CHAR In a BRE, one of the character sequences:
\^ \. \* \[ \$ \\
In an ERE, one of the character sequences:
\^ \. \[ \$ \( \) \| \* \+ \? \{ \\
R_ANCHOR (Applicable only to basic regular expressions.) The
character $ when it appears as the last character of a
basic regular expression and when not QUOTED_CHAR. The $
may be recognised as an anchor elsewhere.
SPEC_CHAR For basic regular expressions, will be one of the
following special characters:
\ anywhere outside bracket expressions
[ anywhere outside bracket expressions
^ when used as an anchor or when
first in a bracket expression
$ when used as an anchor
* anywhere except: first in an entire RE;
anywhere in a bracket expression; directly
following \(; directly following an
anchoring ^.
For extended regular expressions, will be one of the
following special characters found anywhere outside
bracket expressions:
^ . [ $ ( ) | * + ? { \
Page 15
REGCOMP(5) REGCOMP(5)
(The close-parenthesis is considered special in this
context only if matched with a preceding openparenthesis.)
RE and Bracket Expression Grammar [Toc] [Back] This section presents the grammar for basic regular expressions,
including the bracket expression grammar that is common to both BREs and
EREs.
%token ORD_CHAR QUOTED_CHAR DUP_COUNT
%token BACKREF L_ANCHOR R_ANCHOR
%token Back_open_paren Back_close_paren
/* '\(' '\)' */
%token Back_open_brace Back_close_brace
/* '\{' '\}' */
/* The following tokens are for the Bracket Expression
grammar common to both REs and EREs. */
%token COLL_ELEM META_CHAR
%token Open_equal Equal_close Open_dot Dot_close Open_colon Colon_close
/* '[=' '=]' '[.' '.]' '[:' ':]' */
%token class_name
/* class_name is a keyword to the LC_CTYPE locale category */
/* (representing a character class) in the current locale */
/* and is only recognised between [: and :] */
%start basic_reg_exp
%%
/* --------------------------------------------
Basic Regular Expression
--------------------------------------------
*/
basic_reg_exp : RE_expression
| L_ANCHOR
| R_ANCHOR
| L_ANCHOR R_ANCHOR
| L_ANCHOR RE_expression
| RE_expression R_ANCHOR
| L_ANCHOR RE_expression R_ANCHOR
;
RE_expression : simple_RE
| RE_expression simple_RE
;
simple_RE : nondupl_RE
| nondupl_RE RE_dupl_symbol
;
nondupl_RE : one_character_RE
| Back_open_paren RE_expression Back_close_paren
| Back_open_paren Back_close_paren
| BACKREF
Page 16
REGCOMP(5) REGCOMP(5)
;
one_character_RE : ORD_CHAR
| QUOTED_CHAR
| '.'
| bracket_expression
;
RE_dupl_symbol : '*'
| Back_open_brace DUP_COUNT Back_close_brace
| Back_open_brace DUP_COUNT ',' Back_close_brace
| Back_open_brace DUP_COUNT ',' DUP_COUNT Back_close_brace
;
/* --------------------------------------------
Bracket Expression
-------------------------------------------
*/
bracket_expression : '[' matching_list ']'
| '[' nonmatching_list ']'
;
matching_list : bracket_list
;
nonmatching_list : '^' bracket_list
;
bracket_list : follow_list
| follow_list '-'
;
follow_list : expression_term
| follow_list expression_term
;
expression_term : single_expression
| range_expression
;
single_expression : end_range
| character_class
| equivalence_class
;
range_expression : start_range end_range
| start_range '-'
;
start_range : end_range '-'
;
Page 17
REGCOMP(5) REGCOMP(5)
end_range : COLL_ELEM
| collating_symbol
;
collating_symbol : Open_dot COLL_ELEM Dot_close
| Open_dot META_CHAR Dot_close
;
equivalence_class : Open_equal COLL_ELEM Equal_close
;
character_class : Open_colon class_name Colon_close
;
The BRE grammar does not permit L_ANCHOR or R_ANCHOR inside \( and \)
(which implies that ^ and $ are ordinary characters).
This section presents the grammar for extended regular expressions,
excluding the bracket expression grammar.
Note: The bracket expression grammar and the associated %token
lines are identical between BREs and EREs. It has been omitted
from the ERE section to avoid unnecessary editorial duplication.
%token ORD_CHAR QUOTED_CHAR DUP_COUNT
%start extended_reg_exp
%%
/* --------------------------------------------
Extended Regular Expression
--------------------------------------------
*/
extended_reg_exp : ERE_branch
| extended_reg_exp ' | ' ERE_branch
;
ERE_branch : ERE_expression
| ERE_branch ERE_expression
;
ERE_expression : one_character_ERE
| '^'
| '$'
| '(' extended_reg_exp ')'
| ERE_expression ERE_dupl_symbol
;
one_character_ERE : ORD_CHAR
| QUOTED_CHAR
| '.'
Page 18
REGCOMP(5) REGCOMP(5)
| bracket_expression
;
ERE_dupl_symbol : '*'
| '+'
| '?'
| '{' DUP_COUNT '}'
| '{' DUP_COUNT ',' '}'
| '{' DUP_COUNT ',' DUP_COUNT '}'
;
The ERE grammar does not permit several constructs that previous sections
specify as having undefined results:
o ORD_CHAR preceded by \
o one or more ERE_dupl_symbols appearing first in an ERE,
or immediately following |, ^ or (
o { not part of a valid ERE_dupl_symbol
o | appearing first or last in an ERE,
or immediately following | or
(, or immediately preceding ).
Implementations are permitted to extend the language to allow these.
Portable applications cannot use such constructs.
PPPPaaaaggggeeee 11119999 [ Back ]
|