awk - Pattern scanning and processing language
awk [-F ERE] [-f program_file]... [-v var=val]... [argument]...
awk [-F ERE] [-v var=val]... ['program_text'] [argument]...
Interfaces documented on this reference page conform to
industry standards as follows:
awk: XCU5.0
Refer to the standards(5) reference page for more information
about industry standards and associated tags.
-F ERE
    Defines ERE (extended regular expression) as the value of
    the input field separator before any input is read. Using
    this option is comparable to assigning a value to the
    built-in variable FS.
-f program_file
    Specifies the pathname (program_file) of a file containing
    an awk program. If multiple instances of this option are
    specified, the concatenation of the files specified as
    program_file, in the order specified, is the awk program.
    The awk program can alternatively be specified on the
    command line as the single argument program_text.
-v var=val
    The var=val argument is an assignment operand that
    specifies a value (val) for a variable (var). The
    specified variable assignment occurs prior to executing
    the awk program, including the actions associated with
    BEGIN patterns (if any are in the program). Multiple
    occurrences of the -v option can be specified on the awk
    command line.
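As a rough illustration of how these options combine, the sketch below sets a colon field separator with -F and passes a threshold with -v. The sample input, the variable name min, and the value 500 are made up for the example.

```shell
# Hypothetical passwd-like input: name:password:uid
out=$(printf 'root:x:0\nalice:x:1000\n' |
    awk -F: -v min=500 '$3 >= min { print $1 }')
echo "$out"
```

Because min is assigned with -v, it would also be visible inside a BEGIN action.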
If -f program_file is not specified, the first parameter
to awk is program_text, delimited by single quotation (')
characters.
See the DESCRIPTION section for the processing of
this parameter. The following two types of argument
can be intermixed:
input_file
    A pathname of a file that contains the input to be read,
    which is matched against the set of patterns in the
    program. If no input_file operands are specified, or if
    the input_file argument is -, standard input is used.
var=val
    The characters before the = represent the name of an awk
    variable. If that name is an awk reserved word, the
    behavior is undefined. The characters following the = are
    interpreted as if they appeared in the awk program
    preceded and followed by a double quotation (")
    character, in other words, as a string value. If the
    value is considered a numeric string, the variable is
    assigned a numeric value. Each such variable assignment
    occurs just prior to the processing of the following
    input_file, if any. Thus, an assignment before the first
    input_file argument is executed after the BEGIN actions
    (if any), while an assignment after the last input_file
    argument occurs before the END actions (if any). If there
    are no input_file arguments, assignments are executed
    before processing the standard input.
The awk command executes programs written in the awk programming
language, which is designed for pattern matching and
textual data manipulation. An awk program is a sequence
of patterns and corresponding actions; an action is carried
out when its pattern matches an input record. The awk
command is a more powerful tool for text manipulation than
either sed or grep.
The awk command:
  *  Performs convenient numeric processing
  *  Allows variables within actions
  *  Allows general selection of patterns
  *  Allows control flow in the actions
  *  Does not require any compiling of programs
The pattern-matching and action statements of the awk language
can be specified either on the command line or in a
program file. In either case, the awk command first reads
all program statements.
If -f program_file is not specified, the first operand to
awk is program_text, delimited by single quotation (')
characters.
Execution of an awk program starts by executing the
actions associated with all BEGIN patterns in the order
they occur in the program. Then, each input_file operand
(or standard input if an input file is not specified) is
processed in turn by:
  1. Reading input data until a record separator is seen (a
     newline character by default)
  2. Splitting the current record into fields using the
     current value of FS
  3. Evaluating each pattern in the program in the order of
     occurrence
  4. Executing the action associated with each pattern that
     matches the current record
The action for a matching pattern is executed before
evaluating subsequent patterns. After all input has been
processed, the actions associated with all END patterns
are executed in program order.
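The processing order described above can be sketched with a three-rule program. The two-line input is invented for the example.

```shell
out=$(printf 'a b\nc d\n' | awk '
    BEGIN { print "start" }            # runs before any input is read
          { print NR ": " $2 }         # runs once for every record
    END   { print "records:", NR }     # runs after the last record
')
echo "$out"
```

This should print start, then one line per record (1: b and 2: d), then records: 2.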
Refer to the EXAMPLES section for an example that demonstrates
the results of specifying a variable assignment as
a flag argument or command argument in different positions
on the awk command line.
The awk command reads input data in the order stated on
the command line. If you specify input_file as a - (dash)
or do not specify a filename, awk reads standard input.
The awk command reads input data from any of the following
sources:
  *  Any input_file operands or their equivalents, which can
     be affected by modifying the awk variables ARGV and ARGC
  *  Standard input, in the absence of any input_file
     operands
  *  Arguments to the getline function
Input files must be text files. When the built-in variable
RS is set to a value other than a newline character,
awk supports records terminated with the specified separator
up to LINE_MAX bytes.
Pattern-action statements on the command line are enclosed
in ' (single quote characters) to protect them from interpretation
by the shell. Consecutive pattern-action statements
on the same command line are separated by a ; (semicolon),
within one set of quote delimiters.
By default, the awk command treats each input line as a
record and splits it into fields separated by spaces, tabs,
or a field separator you set with the FS variable. (When a
space character is the field separator, any run of spaces
and tabs is recognized as a single separator.) Fields are
referenced as $1, $2, and so on. The reference $0 specifies
the entire record (by default, a line).
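A small sketch of default field splitting follows. The input line, which mixes several spaces and a tab, is invented for the example.

```shell
# Three fields separated by a run of spaces and by a tab
out=$(printf 'one   two\tthree\n' | awk '{ print NF; print $2; print $NF }')
echo "$out"
```

Here NF is 3, $2 is two, and $NF (the last field) is three.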
Program Structure
An awk program is composed of pairs of the form:
    pattern { action }
Either the pattern or the action (including the enclosing
brace characters) can be omitted.
If pattern lacks a corresponding action, awk writes the
entire record that contains the pattern to standard output.
If action lacks a corresponding pattern, awk applies
the action to every record.
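The two defaults can be seen side by side in the sketch below: a pattern with no action, then an action with no pattern. The fruit list is sample data, not part of the original page.

```shell
fruit='apple
banana
cherry'
# Pattern only: matching records are written unchanged
matched=$(printf '%s\n' "$fruit" | awk '/an/')
# Action only: the action runs for every record
lengths=$(printf '%s\n' "$fruit" | awk '{ print length($0) }')
echo "$matched"
echo "$lengths"
```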
Actions
An action is a sequence of statements that follow C language
syntax. Any single statement can be replaced by a
statement list enclosed in braces. When statement is a
list of statements, they must be separated by newline
characters or semicolons, and are executed sequentially in
order of appearance. Statements in the awk language
include:
    break
    continue
    delete array[expression]
    exit [expression]
    for (expression;expression;expression) statement
    for (variable in array) statement
    if (expression) statement [else statement]
    next
    print [expression_list][>file|>>file][| command]
    printf format[,expression_list][>file|>>file][| command]
    while (expression) statement
    variable=expression
Statements can end with a semicolon, a newline character,
or the right brace enclosing the action:
{ [ statement ... ] }
Expressions can have string or numeric values and are
built using the operators +, -, *, /, %, a space for
string concatenation, and the C operators ++, --, +=, -=,
*=, /=, =, ^=, ?:, >, >=, <, <=, ==, $, (), ~, !~, in, ||,
&&, !, and !=.
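A few of these statements and operators are combined in the sketch below. The variable names n, s, size, and total are arbitrary.

```shell
out=$(awk 'BEGIN {
    n = 7
    print n % 3, 2 ^ 10                      # remainder and exponentiation
    s = "ab" "cd"                            # juxtaposition concatenates
    size = (n > 5) ? "big" : "small"         # conditional expression
    print s, size
    for (i = 1; i <= 3; i++) total += i      # C-style for loop
    n += 3; n++
    print n, total
}')
echo "$out"
```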
Because the actions process fields, input white space is
not preserved in the output.
The file and command arguments in awk statements can be
literal names or expressions enclosed in double quotation
(") characters. Identical string values in different
statements refer to the same open file.
The print statement writes its arguments to standard output
(or to a file if > file or >> file is present), separated
by the current output field separator and terminated
by the current output record separator.
The printf statement formats its expression list according
to the format of the printf subroutine and writes the
result to standard output. Unlike print, printf does not
insert the output field separator or append the output
record separator; the format string determines the output
exactly. You can redirect the output into a file using the
print ... > file or printf ... > file statements.
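The sketch below contrasts the two statements: printf writes only what its format specifies, while print joins its arguments with OFS and ends with ORS. The sample record and the separator values are made up for the example.

```shell
# printf: column widths and the trailing newline come from the format
line=$(printf 'widget 3 2.5\n' |
    awk '{ printf "%-8s %3d %6.2f\n", $1, $2, $2 * $3 }')
# print: arguments joined by OFS, record ended by ORS
joined=$(awk 'BEGIN { OFS = "-"; ORS = "!\n"; print "a", "b" }')
echo "$line"
echo "$joined"
```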
Variables
Variables can be scalars, array elements (denoted x[i]),
or fields. With the exception of function parameters,
variables are not explicitly declared.
Variable names can consist of uppercase and lowercase
alphabetic letters, the underscore character, the digits
(0 to 9), and extended characters. Variable names cannot
begin with a digit. Field variables are designated by $
(dollar sign), followed by a number or numerical expression.
The effect of the field number expression evaluating
to anything other than a non-negative integer is
unspecified.
Variables are initialized to the null string. Array subscripts
can be any string; they do not have to be numeric.
This allows for a form of associative memory. Enclose
string constants in expressions in double quotation (")
characters.
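String subscripts make it easy to count occurrences, as in the sketch below. The array name count and the color names are invented for the example.

```shell
out=$(printf 'red\nblue\nred\n' | awk '
    { count[$1]++ }                          # the subscript is the field text
    END { print count["red"], count["blue"] }
')
echo "$out"
```

No declaration or initialization is needed: each element starts out empty and behaves as zero when incremented.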
There are several variables with special meaning to awk.
They include:
ARGC
    The number of elements in the ARGV array.
ARGV
    An array of command line arguments, excluding options and
    the program_file arguments, numbered from zero to ARGC-1.
    The arguments in ARGV can be modified or added to; ARGC
    can be altered. As each input file ends, awk treats the
    next non-null element of ARGV, up to and including the
    current value of ARGC-1, as the name of the next input
    file. Therefore, setting an element of ARGV to null means
    that it is not treated as an input file. When the element
    is the character -, standard input is specified. When the
    element matches the format for an assignment
    (variable=value), the element is treated as an assignment
    rather than as the name of an awk input file.
CONVFMT
    The printf format for converting numbers to strings
    (except for output statements, where OFMT is used); %.6g
    by default.
ENVIRON
    An array representing the value of the environment. The
    indexes of the array are strings consisting of the names
    of the environment variables, and the value of each array
    element is a string consisting of the value of that
    variable.
FILENAME
    The name of the current input file. Inside a BEGIN
    action, the FILENAME value is undefined. Inside an END
    action, the value is the name of the last input file
    processed.
FNR
    The ordinal number of the current input line (record) in
    the current file. Inside a BEGIN action, the value is
    zero. Inside an END action, the value is the number of
    the last record processed in the last file processed.
FS
    Input field separator (default is a space). If it is a
    space, then any number of spaces and tabs can separate
    fields.
NF
    The number of fields in the current input line (record)
    with a limit of 199.
NR
    The number of the current input line (record), counted
    from the start of input.
OFS
    The print statement output field separator (default is a
    space).
ORS
    The print statement output record separator (default is a
    newline character).
OFMT
    The printf format for converting numbers to strings in
    output statements (default is %.6g).
RLENGTH
    The length of the string matched by the match function.
RS
    Input record separator (default is a newline character).
RSTART
    The starting position of the string matched by the match
    function, numbering from 1. This is always equivalent to
    the return value of the match function.
SUBSEP
    The subscript separator string for multi-dimensional
    arrays.
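A minimal sketch using several of these variables follows. The input lines and the array name m are invented; the second command assumes a single array element so that the for-in order does not matter.

```shell
# NR, NF, and OFS: record number, field count, last field
out=$(printf 'a b c\nd e\n' | awk 'BEGIN { OFS = ":" } { print NR, NF, $NF }')
# SUBSEP joins the parts of a multi-dimensional subscript
parts=$(awk 'BEGIN {
    m[1, 2] = "x"
    for (k in m) { split(k, p, SUBSEP); print p[1], p[2] }
}')
echo "$out"
echo "$parts"
```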
Functions
There are a variety of built-in functions that can be used
in awk actions.
Arithmetic Functions
The arithmetic functions, except for int, are based on the
ISO C standard. The behavior is undefined in cases where
the ISO C standard specifies that an error be returned or
that the behavior is undefined.
atan2(y,x)
    Returns the arctangent of y/x.
cos(x)
    Returns the cosine of x, where x is in radians.
sin(x)
    Returns the sine of x, where x is in radians.
exp(x)
    Returns the exponential function of x.
log(x)
    Returns the natural logarithm of x.
sqrt(x)
    Returns the square root of x.
int(x)
    Truncates its argument to an integer. It is truncated
    toward 0 when x > 0.
rand()
    Returns a random number n, such that 0 <= n < 1.
srand([expr])
    Sets the seed value for rand to expr or uses the time of
    day if expr is omitted. The previous seed value is
    returned.
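The sketch below exercises a few of the arithmetic functions. The seed value 42 is arbitrary, and only the range of rand is checked because the sequence it produces differs between awk implementations.

```shell
out=$(awk 'BEGIN {
    print int(3.9), int(-3.9)        # fractional part is discarded
    print sqrt(16), exp(0), log(1)
    print atan2(0, -1)               # pi, printed using OFMT (%.6g)
    srand(42); r = rand()
    in_range = (r >= 0 && r < 1)     # 1 when 0 <= r < 1
    print in_range
}')
echo "$out"
```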
String Functions
gsub(ere, repl[, in])
    Behaves like sub (see below), except that it replaces all
    occurrences of the regular expression (like the ed
    utility global substitute) in $0 or in the in argument,
    when specified.
index(s, t)
    Returns the position, in characters, numbering from 1, in
    string s where string t first occurs, or zero if it does
    not occur at all.
length[([s])]
    Returns the length, in characters, of its argument taken
    as a string, or of the whole record, $0, if there is no
    argument.
match(s, ere)
    Returns the position, in characters, numbering from 1, in
    string s where the extended regular expression ere
    occurs, or zero if it does not occur at all. RSTART is
    set to the starting position, zero if no match is found;
    RLENGTH is set to the length of the matched string, -1 if
    no match is found.
split(s, a[, fs])
    Splits the string s into array elements a[1], a[2], ...
    a[n], and returns n. The separation is done with the
    extended regular expression fs or with the field
    separator FS if fs is not given. Each array element has a
    string value when created. If the string assigned to any
    array element, with any occurrence of the decimal point
    character from the current locale changed to a period
    character, would be considered a numeric string, the
    array element also has the numeric value of the numeric
    string. The effect of a null string as the value of fs is
    unspecified.
sprintf(fmt, expr, expr, ...)
    Formats the expressions according to the printf format
    given by fmt and returns the resulting string.
sub(ere, repl[, in])
    Substitutes the string repl in place of the first
    instance of the extended regular expression ere in string
    in and returns the number of substitutions. An ampersand
    (&) appearing in the string repl is replaced by the
    string from in that matches the regular expression. For
    each occurrence of backslash (\) encountered when
    scanning the string repl from beginning to end, the next
    character is taken literally and loses its special
    meaning (for example, \& is interpreted as a literal
    ampersand character). Except for & and \, it is
    unspecified what the special meaning of any such
    character is. If in is specified and it is not an lvalue,
    the behavior is undefined. If in is omitted, awk
    substitutes in the current record ($0).
substr(s, m[, n])
    Returns the at most n character substring of s that
    begins at position m, numbering from 1. If n is missing,
    the length of the substring is limited by the length of
    the string s.
tolower(s)
    Returns a string based on the string s. Each character in
    s that is an uppercase letter specified to have a tolower
    mapping by the LC_CTYPE category of the current locale is
    replaced in the returned string by the lowercase letter
    specified by the mapping. Other characters in s are
    unchanged in the returned string.
toupper(s)
    Returns a string based on the string s. Each character in
    s that is a lowercase letter specified to have a toupper
    mapping by the LC_CTYPE category of the current locale is
    replaced in the returned string by the uppercase letter
    specified by the mapping. Other characters in s are
    unchanged in the returned string.
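The string functions can be tried out in a BEGIN action, as in the sketch below. The test string and the variable names are arbitrary; copies of s are made so that sub and gsub operate on an lvalue without altering the original.

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    print index(s, "o"), length(s), substr(s, 7, 3)
    pos = match(s, /o+/)                 # also sets RSTART and RLENGTH
    print pos, RSTART, RLENGTH
    n = split("a:b:c", parts, ":")       # n is the number of elements
    print n, parts[2]
    t = s; k = gsub(/o/, "0", t)         # k counts the replacements
    print k, t
    u = s; sub(/world/, "[&]", u)        # & stands for the matched text
    print u
    print toupper(s)
}')
echo "$out"
```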
Input/Output and General Functions
close(expression)
    Closes the file or pipe opened by a print or printf
    statement or a call to getline with the same
    string-valued expression. If the close was successful,
    the function returns zero; otherwise, it returns
    non-zero.
expression | getline [var]
    Reads a record of input from a stream piped from the
    output of a command. The stream is created if no stream
    is currently open with the value of expression as its
    command name. The stream created is equivalent to one
    created by a call to the popen function with the value of
    expression as the command argument and a value of r as
    the mode argument. As long as the stream remains open,
    subsequent calls in which expression evaluates to the
    same string read subsequent records from the stream. The
    stream remains open until the close function is called
    with an expression that evaluates to the same string
    value. At that time, the stream is closed as if by a call
    to the pclose function. If var is missing, $0 and NF are
    set; otherwise, var is set.
getline
    Sets $0 to the next input record from the current input
    file. This form of getline sets the NF, NR, and FNR
    variables.
getline var
    Sets variable var to the next input record from the
    current input file. This form of getline sets the FNR and
    NR variables.
getline [var] < expression
    Reads the next record of input from a named file. The
    expression is evaluated to produce a string that is used
    as a full pathname. If the file of that name is not
    currently open, it is opened. As long as the stream
    remains open, subsequent calls in which expression
    evaluates to the same string value read subsequent
    records from the file. The file remains open until the
    close function is called with an expression that
    evaluates to the same string value. If var is missing, $0
    and NF are set; otherwise, var is set.
system(expression)
    Executes the command given by expression in a manner
    equivalent to the system function and returns the exit
    status of the command.
All forms of getline return 1 for successful input, zero
for end of file, and -1 for an error.
In summary, getline sets $0 to the next input record from
the current input file; getline < file sets $0 to the next
record from file. The forms getline x and getline x < file
set the variable x instead. Finally, command | getline
pipes the output of command into getline; each call of
getline returns the next line of output from command.
Where strings are used as the name of a file or pipeline,
the strings must be textually identical. The terminology
"same string value" implies that "equivalent strings",
even those that differ only by space characters, represent
different files.
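The sketch below reads the output of a command with getline and then closes the stream using the same string. The command string and the variable names cmd, line, last, and n are invented for the example.

```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    # getline returns 1 for each record read and 0 at end of output
    while ((cmd | getline line) > 0) {
        n++
        last = line
    }
    close(cmd)        # must be the same string value that opened the stream
    print n, last
}')
echo "$out"
```

Keeping the command in a variable is one way to make sure the string passed to close is textually identical to the one used with getline.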
User-defined Functions
The awk language also provides user-defined functions.
Such functions can be defined as:
    function name(args,...) { statements }
A function can be referred to anywhere in an awk program;
in particular, the function's use can precede the function
definition. The scope of a function is global.
Function arguments can be either scalars or arrays; the
behavior is undefined if an array name is passed as an
argument that the function uses as a scalar, or if a
scalar expression is passed as an argument that the function
uses as an array. Function arguments are passed by
value if scalar and by reference if array name. Argument
names are local to the function; all other variable names
are global. The same name must not be used as both an argument
name and as the name of a function or special awk variable.
The same name must not be used both as a variable
name with global scope and as the name of a function. The
same name must not be used within the same scope both as a
scalar variable and as an array.
The number of parameters in the function definition need
not match the number of parameters in the function call.
Excess formal parameters can be used as local variables.
If fewer arguments are supplied in a function call than
are in the function definition, the extra parameters that
are used in the function body as scalars are initialized
with a string value of the null string and a numeric value
of zero, and the extra parameters that are used in the
function body as arrays are initialized as empty arrays.
If more arguments are supplied in a function call than are
in the function definition, the behavior is undefined.
When invoking a function, no white space can be placed
between the function name and the opening parenthesis.
Function calls can be nested and recursive calls can be
made upon functions. Upon return from any nested or
recursive function call, the values of all the calling
function's parameters are unchanged, except for array
parameters passed by reference. The return statement can
be used to return a value.
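A short sketch of these rules follows: a recursive function, an array passed by reference, and an excess formal parameter used as a local variable. The function names fact and fill and the array name sq are hypothetical.

```shell
out=$(awk '
    function fact(n) { return (n <= 1) ? 1 : n * fact(n - 1) }
    # i is an excess formal parameter, so it is local to fill
    function fill(arr, n,    i) { for (i = 1; i <= n; i++) arr[i] = i * i }
    BEGIN {
        fill(sq, 3)               # the array is modified in place
        print fact(5), sq[3]
    }
')
echo "$out"
```

The extra spaces before i in the parameter list are only a common convention for marking locals; awk does not require them.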
Patterns
Patterns are arbitrary Boolean combinations of regular
expressions and relational expressions (the !, ||, and &&
operators and parentheses for grouping). You must start and
end regular expressions with slashes. You can use regular
expressions as described for grep, including the following
special characters:
    +      One or more occurrences of the pattern.
    ?      Zero or one occurrence of the pattern.
    |      Either of two patterns.
    ( )    Grouping of expressions.
Isolated regular expressions in a pattern apply to the
entire line. Regular expressions can occur in relational
expressions. Any string (constant or variable) can be used
as a regular expression, except in the position of an isolated
regular expression in a pattern.
If two patterns are separated by a comma, the action is
performed on all lines between an occurrence of the first
pattern and the next occurrence of the second.
There are two types of relational expressions that you can
use. The first type has the form:
    expression match_operator pattern
where match_operator is either ~ (for contains) or !~
(for does not contain).
The second type has the form:
    expression relational_operator expression
where relational_operator is any of the six C relational
operators: <, >, <=, >=, ==, and !=. An expression can be
an arithmetic expression, a relational expression, or a
Boolean combination of these.
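The sketch below shows a range pattern and a Boolean combination of a relational expression with a match expression. Both inputs are invented for the example.

```shell
# Range pattern: from a line matching start through the next match of stop
range=$(printf 'a\nstart\nb\nstop\nc\n' | awk '/start/,/stop/')
# Relational expression combined with a match expression
picked=$(printf 'x 5\ny 12\nyz 3\n' | awk '$2 > 10 && $1 ~ /^y/ { print $1 }')
echo "$range"
echo "$picked"
```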
Special Patterns
You can use the BEGIN and END special patterns to capture
control before the first and after the last input line is
read, respectively. BEGIN and END patterns need not be the
first and last patterns in the program.
Each BEGIN pattern is matched once and its associated
action executed before the first record of input is read
and before any command line assignment operand is processed
(assignments made with the -v option are already in
effect). Each END pattern is matched once and its
associated action executed after the last record of input
has been read. These two patterns must have associated
actions.
BEGIN and END do not combine with other patterns. Multiple
BEGIN and END patterns are allowed. The actions associated
with the BEGIN patterns are executed in the order
specified in the program, as are the END actions. An END
pattern can precede a BEGIN pattern in a program.
You have two ways to designate an extended regular expression
other than white space to separate fields. You can
use the -Fere option on the command line, or you can
assign a string with the expression to the built-in variable
FS. Either action changes the field separator to
ere.
There are no explicit conversions between numbers and
strings. To force an expression to be treated as a number,
add 0 to it. To force it to be treated as a string,
append a null string ("").
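The effect of forcing a numeric comparison can be sketched as follows. The values and variable names are arbitrary; x and y are assigned from string constants, so they compare as strings unless 0 is added.

```shell
out=$(awk 'BEGIN {
    x = "10"; y = "9"                 # string constants
    as_strings = (x < y)              # "10" sorts before "9" as text
    as_numbers = (x + 0 < y + 0)      # 10 is not less than 9
    print as_strings, as_numbers
}')
echo "$out"
```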
Comment Delimiter
In the awk language, a comment starts with the sharp sign
character, #, and continues to the end of the line. The #
does not have to be the first character on the line. The
awk language ignores the rest of the line following a
sharp sign. For example:
    # This program prints a nice friendly message. It helps
    # keep novice users from being afraid of the computer.
The purpose of a comment is to help you or another person
understand the program at a later time.
The following exit values are returned:
    0     Successful completion.
    >0    An error occurred.
To display the file lines that are longer than 72 bytes,
enter:
    % awk 'length > 72' chapter1
This command selects each line of the file chapter1
that is longer than 72 bytes. The command then
writes these lines to standard output because no
action is specified.
To display all lines from a line containing the word start
through the next line containing the word stop, enter:
    % awk '/start/,/stop/' chapter1
To run an awk program (sum2.awk) that processes a file
(chapter1), enter:
    % awk -f sum2.awk chapter1
The following awk program computes the sum and average of
the numbers in the second column of the input file:
    {
        sum += $2
    }
    END {
        print "Sum: ", sum;
        print "Average:", sum/NR;
    }
The first action adds the value of the second field
of each line to the sum variable. Uninitialized
variables such as sum have the numeric value 0 (zero),
so no explicit initialization is needed. The keyword END
before the second action causes awk to perform that action
after all of the input file is read. The NR variable, which
is used to calculate the average, is a special
variable containing the number of records (lines)
that were read.
To print the names of the users who have the C shell as
the initial shell, enter:
    % awk -F: '$7 ~ /csh/ {print $1}' /etc/passwd
To print the first two fields in reversed order, enter:
    % awk '{ print $2, $1 }'
The following awk program prints the first two fields of
the input file in reversed order, with input fields
separated by a comma, then adds up the first column and
prints the sum and average:
    BEGIN { FS = "," }
    { print $2, $1 }
    { s += $1 }
    END { print "sum is", s, "average is", s/NR }
The following example shows how command line assignments
synchronize with awk program statements.
Consider the following set of awk statements that
make up a program named test_program:
    BEGIN { if (RS == ":")
                print "Assignment in effect for BEGIN statements"
    }
    { if (RS == ":")
          print "Assignment in effect for middle statements"
    }
    END { if (RS == ":")
              print "Assignment in effect for END statements"
    }
Notice the different results that are produced by
different ways of assigning a value to RS on the
awk command line. The file text_file contains the
line "Hello, Hello".
    % awk -f test_program -v RS=: text_file
    Assignment in effect for BEGIN statements
    Assignment in effect for middle statements
    Assignment in effect for END statements
    % awk -f test_program RS=: text_file
    Assignment in effect for middle statements
    Assignment in effect for END statements
    % awk -f test_program text_file RS=:
    Assignment in effect for END statements
ENVIRONMENT VARIABLES
The following environment variables affect the execution
of awk:
LANG
    Provides a default value for the internationalization
    variables that are unset or null. If LANG is unset or
    null, the corresponding value from the default locale is
    used. If any of the internationalization variables
    contain an invalid setting, the utility behaves as if
    none of the variables had been defined.
LC_ALL
    If set to a non-empty string value, overrides the values
    of all the other internationalization variables.
LC_CTYPE
    Determines the locale for the interpretation of sequences
    of bytes of text data as characters (for example,
    single-byte as opposed to multi-byte characters in
    arguments).
LC_MESSAGES
    Determines the locale for the format and contents of
    diagnostic messages written to standard error.
NLSPATH
    Determines the location of message catalogs for the
    processing of LC_MESSAGES.
Commands: grep(1), lex(1), sed(1)
Routines: printf(3)
Programming Support Tools