PERFEX(1) PERFEX(1)
NAME
perfex - Command line interface to processor event counters
SYNOPSIS
perfex [-a | -e event0 [-e event1]] [-mp | -s | -p] [-x] [-k] [-y] [-t]
[-T] [-o file] [-c file] command
DESCRIPTION
The given command is executed; after it is complete, perfex prints the
values of various hardware performance counters. The counts returned are
aggregated over all processes that are descendants of the target command,
as long as their parent process controls the child through wait (see
wait(2)).
The R10000 event counters are different from R12000 event counters. See
the r10k_counters(5) man page for differences. For R10000 CPUs, the
integers event0 and event1 index the following table:
0 = Cycles
1 = Issued instructions
2 = Issued loads
3 = Issued stores
4 = Issued store conditionals
5 = Failed store conditionals
6 = Decoded branches. (This changes meaning in 3.x
versions of R10000. It becomes resolved branches).
7 = Quadwords written back from secondary cache
8 = Correctable secondary cache data array ECC errors
9 = Primary (L1) instruction cache misses
10 = Secondary (L2) instruction cache misses
11 = Instruction misprediction from secondary cache way prediction table
12 = External interventions
13 = External invalidations
14 = Virtual coherency conditions. (This changes meaning in 3.x
versions of R10000. It becomes ALU/FPU forward progress
cycles. On the R12000, this counter is always 0).
15 = Graduated instructions
16 = Cycles
17 = Graduated instructions
18 = Graduated loads
19 = Graduated stores
20 = Graduated store conditionals
21 = Graduated floating point instructions
22 = Quadwords written back from primary data cache
23 = TLB misses
24 = Mispredicted branches
25 = Primary (L1) data cache misses
26 = Secondary (L2) data cache misses
27 = Data misprediction from secondary cache way prediction table
28 = External intervention hits in secondary cache (L2)
29 = External invalidation hits in secondary cache
30 = Store/prefetch exclusive to clean block in secondary cache
31 = Store/prefetch exclusive to shared block in secondary cache
For R12000 CPUs, the integers event0 and event1 index the following
table:
0 = Cycles
1 = Decoded instructions
2 = Decoded loads
3 = Decoded stores
4 = Miss handling table occupancy
5 = Failed store conditionals
6 = Resolved conditional branches
7 = Quadwords written back from secondary cache
8 = Correctable secondary cache data array ECC errors
9 = Primary (L1) instruction cache misses
10 = Secondary (L2) instruction cache misses
11 = Instruction misprediction from secondary cache way prediction table
12 = External interventions
13 = External invalidations
14 = ALU/FPU progress cycles. (This counter in current versions of R12000
is always 0).
15 = Graduated instructions
16 = Executed prefetch instructions
17 = Prefetch primary data cache misses
18 = Graduated loads
19 = Graduated stores
20 = Graduated store conditionals
21 = Graduated floating-point instructions
22 = Quadwords written back from primary data cache
23 = TLB misses
24 = Mispredicted branches
25 = Primary data cache misses
26 = Secondary data cache misses
27 = Data misprediction from secondary cache way prediction table
28 = State of intervention hits in secondary cache (L2)
29 = State of invalidation hits in secondary cache
30 = Store/prefetch exclusive to clean block in secondary cache
31 = Store/prefetch exclusive to shared block in secondary cache
OPTIONS
-e event Specify an event to be counted.
Zero, one, or two event specifiers may be given; by
default, perfex counts cycles. Events may also be
specified by setting one or both of the environment
variables T5_EVENT0 and T5_EVENT1. Command line event
specifiers, if present, override the environment
variables. The order of events specified is not
important. The counts, together with an event
description, are written to stderr unless redirected
with the -o option. Two events that must be counted on
the same hardware counter (see r10k_counters(5)) will
cause a conflicting counters error.
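The conflicting counters rule can be sketched as follows. The sketch
assumes the R10000 layout in which events 0-15 are counted on hardware
counter 0 and events 16-31 on counter 1. That layout is not stated in
this page, so consult r10k_counters(5) before relying on it. The
function names are hypothetical and are not part of perfex.

```python
# Assumed R10000 layout (see r10k_counters(5)): events 0-15 use counter 0,
# events 16-31 use counter 1. Function names here are illustrative only.

def counter_for(event: int) -> int:
    """Return the hardware counter (0 or 1) assumed to count this event."""
    if not 0 <= event <= 31:
        raise ValueError(f"event must be in 0..31, got {event}")
    return event // 16

def conflicts(event0: int, event1: int) -> bool:
    """True if both events would need the same hardware counter."""
    return counter_for(event0) == counter_for(event1)

print(conflicts(26, 10))  # one event per counter
print(conflicts(9, 10))   # both events need counter 0
```

Under this assumption, a pair such as -e 26 -e 10 is accepted, while
-e 9 -e 10 would be reported as a conflict.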
-a Multiplexes over all events, projecting totals. Ignores
event specifiers.
The option -a produces counts for all events by
multiplexing over 16 events per counter. The OS does
the switching round robin at clock interrupt
boundaries. The resulting counts are normalized by
multiplying by 16 to give an estimate of the values
they would have had for exclusive counting. Due to the
equal-time nature of the multiplexing, events present
in large enough numbers to contribute significantly to
the execution time will be fairly represented. Events
concentrated in a few short regions (for instance,
instruction cache misses) may not be projected very
accurately.
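The projection applied by -a amounts to a single multiplication. The
following minimal sketch illustrates it; the function name and the
sample count are hypothetical, and only the factor of 16 comes from the
description above.

```python
# Each counter is time-shared among 16 events, so any one event is
# observed for roughly 1/16 of the run.
EVENTS_PER_COUNTER = 16

def project_count(sampled_count: int) -> int:
    """Estimate the count an event would have had under exclusive counting."""
    return sampled_count * EVENTS_PER_COUNTER

# Hypothetical sample: 1250 TLB misses seen while the event was scheduled.
print(project_count(1250))  # 20000
```

The estimate is only as good as the assumption that the event occurs
evenly over the run, which is why bursty events project poorly.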
-mp Report per-thread counts for multiprocessing programs
as well as (default) totals.
By default, perfex aggregates the counts of all the
child threads and reports this number for each selected
event. The -mp option causes the counters for each
thread to be collected at thread exit time and printed
out; the counts aggregated across all threads are
printed next. The per-thread counts are labeled by
process ID (pid).
-o file Redirects perfex output to the specified file.
In the -mp case, the file name includes the pid of the
sproc child thread.
-s Starts (or stops) counting when a SIGUSR1 (or SIGUSR2)
signal is received by a perfex process.
-p period Profiles (samples) the counters with the given period.
The -s option causes perfex to wait until it (i.e., the
perfex process) receives a SIGUSR1 before it starts
counting (for the child process, the target). It will
stop counting if it receives a SIGUSR2. Repeated cycles
of this will aggregate counts. If no SIGUSR2 is
received (the usual case), the counting will continue
until the child exits. Note that counting for
descendants of the child will not be affected, meaning
counting for mp programs cannot be controlled with this
option.
-x Counts at exception level (as well as the default user
level).
Exception level includes time spent on behalf of the
user during, for example, TLB refill exceptions. Other
counting modes (kernel, supervisor) are available
through the OS ioctl interface (see r10k_counters(5)).
-k Counts at kernel level (as well as user and exception
level, if set). Requires superuser privileges.
EXAMPLE
To collect instruction and data secondary cache miss counts for a program
normally executed by
% bar < bar.in > bar.out
run
% perfex -e 26 -e 10 bar < bar.in > bar.out
COST ESTIMATE OPTIONS
-y Report statistics and ranges of estimated times per event.
Without the -y option, perfex reports the counts recorded by the
event counters for the events requested. Since they are simply raw
counts, it is difficult to know by inspection which events are
responsible for significant portions of the job's run time. The -y
option associates time cost with some of the event counts.
The reported times are approximate. Due to the superscalar nature
of the R10000 and R12000 CPUs, and their ability to hide latency,
stating a precise cost for a single occurrence of many of the events
is not possible. Cache misses, for example, can be overlapped with
other operations, so there is a wide range of times possible for any
cache miss.
To account for the fact that the cost of many events cannot be known
precisely, perfex -y reports a range of time costs for each event.
"Maximum," "minimum," and "typical" time costs are reported. Each is
obtained by consulting an internal table that holds the maximum,
minimum, and typical costs for each event, and multiplying this cost
by the count for the event. Event costs are usually measured in
terms of machine cycles, and so the cost of an event generally
depends on the clock speed of the processor, which is also reported
in the output.
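The calculation described above can be sketched in a few lines. This is
an illustration only: the EventCost type, the function name, and every
number in the example are hypothetical and are not taken from the cost
table that perfex actually uses.

```python
from dataclasses import dataclass

@dataclass
class EventCost:
    """Per-occurrence cost of one event, in machine cycles."""
    minimum: float
    typical: float
    maximum: float

def estimate_seconds(count: int, cost: EventCost, clock_mhz: float):
    """Return (minimum, typical, maximum) estimated time in seconds."""
    seconds_per_cycle = 1.0 / (clock_mhz * 1e6)
    return tuple(
        count * cycles * seconds_per_cycle
        for cycles in (cost.minimum, cost.typical, cost.maximum)
    )

# Hypothetical inputs: one million events on a 200 MHz processor.
made_up_cost = EventCost(minimum=2.0, typical=10.0, maximum=40.0)
low, typical, high = estimate_seconds(1_000_000, made_up_cost, clock_mhz=200.0)
print(f"min {low:.3f} s, typical {typical:.3f} s, max {high:.3f} s")
```

Because the costs are expressed in cycles, the same count yields
shorter estimated times on a faster clock.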
The maximum value contained in the table corresponds to the worst
case cost of a single occurrence of the event. Sometimes this can be
a very pessimistic estimate. For example, the maximum cost for
graduated floating-point instructions assumes that all such
instructions are double precision reciprocal square roots, since
that is the most costly floating-point instruction.
Due to the latency-hiding capabilities of the CPUs, the minimum cost
of virtually any event could be zero, since most events can be
overlapped with other operations. To avoid simply reporting minimum
costs of 0, which would be of no practical use, the minimum time
reported by perfex -y corresponds to the "best case" cost of a
single occurrence of the event. The best case cost is obtained by
running the maximum number of simultaneous occurrences of that event
and averaging the cost. For example, two floating-point instructions
can complete per cycle, so the best case cost on the R10000 is 0.5
cycles per floating-point instruction.
The typical cost falls somewhere between minimum and maximum and is
meant to correspond to the cost one would expect to see in average
programs. For example, to measure the typical cost of a cache miss,
stride-1 accesses to an array too big to fit in cache were timed,
and the number of cache misses generated was counted. The same
number of stride-1 accesses to an in-cache array were then timed.
The difference in times corresponds to the cost of the cache misses,
and this was used to calculate the average cost of a cache miss.
This typical cost is lower than the worst case in which each cache
miss cannot be overlapped, and it is higher than the best case, in
which several independent, and hence, overlapping, cache misses are
generated. (Note that on Origin systems, this methodology yields
the time for secondary cache misses to local memory only.)
Naturally, these typical costs are somewhat arbitrary. If they do
not seem right for the application being measured by perfex, they
can be replaced by user-supplied values. See the -c option below.
perfex -y prints the event counts and associated cost estimates
sorted from most costly to least costly. While resembling a
profiling output, it is not a true profile. The event costs reported
are only estimates. Furthermore, since events do overlap with each
other, the sum of the estimated times will usually exceed the
program's run time. This output should only be used to identify
which events are responsible for significant portions of the
program's run time and to get a rough idea of what those costs might
be.
With this in mind, the built-in cost table does not make an attempt
to provide detailed costs for all events. Some events provide
summary or redundant information. These events are assigned minimum
and typical costs of 0, so that they sort to the bottom of the
output. The maximum costs are set to 1 cycle, so that you can get
an indication of the time corresponding to these events. Issued
instructions and graduated instructions are examples of such events.
In addition to these summary or redundant events, detailed cost
information has not been provided for a few other events, such as
external interventions and external invalidations, since it is
difficult to assign costs to these asynchronous events. The built-in
cost values may be overridden by user-supplied values using the -c
option.
In addition to the event counts and cost estimates, perfex -y also
reports a number of statistics derived from the typical costs. The
meaning of many of the statistics is self-evident (for example,
graduated instructions/cycle). The following are statistics whose
definitions require more explanation. These are available with both
R10000 and R12000 CPUs.
Data mispredict/Data secondary cache hits
This is the ratio of the counts for data misprediction from
secondary cache way prediction table and secondary data cache
misses.
Instruction mispredict/Instruction secondary cache hits
This is the ratio of the counts for instruction misprediction from
secondary cache way prediction table and secondary instruction cache
misses.
Primary cache line reuse
This is the number of times, on average, that a primary data cache
line is used after it has been moved into the cache. It is
calculated as graduated loads plus graduated stores minus primary
data cache misses, all divided by primary data cache misses.
Secondary Cache Line Reuse
This is the number of times, on average, that a secondary data cache
line is used after it has been moved into the cache. It is
calculated as primary data cache misses minus secondary data cache
misses, all divided by secondary data cache misses.
Primary Data Cache Hit Rate
This is the fraction of data accesses that are satisfied from a
cache line already resident in the primary data cache. It is
calculated as 1.0 - (primary data cache misses divided by the sum of
graduated loads and graduated stores).
Secondary Data Cache Hit Rate
This is the fraction of data accesses that are satisfied from a
cache line already resident in the secondary data cache. It is
calculated as 1.0 - (secondary data cache misses divided by primary
data cache misses).
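The cache line reuse and hit rate statistics defined above follow
directly from four event counts. The sketch below restates those
formulas; the function names and the sample counts are hypothetical.

```python
def primary_line_reuse(loads: int, stores: int, l1d_misses: int) -> float:
    """(graduated loads + graduated stores - L1 data misses) / L1 data misses."""
    return (loads + stores - l1d_misses) / l1d_misses

def secondary_line_reuse(l1d_misses: int, l2d_misses: int) -> float:
    """(L1 data misses - L2 data misses) / L2 data misses."""
    return (l1d_misses - l2d_misses) / l2d_misses

def primary_hit_rate(loads: int, stores: int, l1d_misses: int) -> float:
    """1.0 - L1 data misses / (graduated loads + graduated stores)."""
    return 1.0 - l1d_misses / (loads + stores)

def secondary_hit_rate(l1d_misses: int, l2d_misses: int) -> float:
    """1.0 - L2 data misses / L1 data misses."""
    return 1.0 - l2d_misses / l1d_misses

# Hypothetical counts for events 18, 19, 25, and 26.
loads, stores, l1d_misses, l2d_misses = 800, 200, 50, 10

print(primary_line_reuse(loads, stores, l1d_misses))
print(secondary_line_reuse(l1d_misses, l2d_misses))
print(round(primary_hit_rate(loads, stores, l1d_misses), 4))
print(round(secondary_hit_rate(l1d_misses, l2d_misses), 4))
```

The two views are related: with these definitions a hit rate equals
reuse / (reuse + 1), so a line reuse of 19 corresponds to a 95% hit
rate. The functions divide by a miss count, so they are meaningful only
when that count is nonzero.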
Time accessing memory/Total time
This is the sum of the typical costs of graduated loads, graduated
stores, primary data cache misses, secondary data cache misses, and
TLB misses, divided by the total program run time. The total program
run time is calculated by multiplying cycles by the time per cycle
(the inverse of the processor's clock speed).
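A minimal sketch of this ratio follows. It assumes the typical costs
have already been converted to seconds; the function name, the
dictionary keys, and all numbers are hypothetical.

```python
# The five events whose typical costs make up "time accessing memory".
MEMORY_EVENTS = (
    "graduated loads",
    "graduated stores",
    "primary data cache misses",
    "secondary data cache misses",
    "TLB misses",
)

def memory_time_fraction(typical_seconds: dict, cycles: int, clock_mhz: float) -> float:
    """Sum of typical memory-event times divided by total run time."""
    total_run_seconds = cycles / (clock_mhz * 1e6)  # cycles * time per cycle
    memory_seconds = sum(typical_seconds[name] for name in MEMORY_EVENTS)
    return memory_seconds / total_run_seconds

# Hypothetical typical times (seconds) and a 10-second run at 200 MHz.
sample = {
    "graduated loads": 1.0,
    "graduated stores": 0.5,
    "primary data cache misses": 1.5,
    "secondary data cache misses": 2.0,
    "TLB misses": 0.5,
}
print(round(memory_time_fraction(sample, cycles=2_000_000_000, clock_mhz=200.0), 3))
```

Since the typical times are estimates and events can overlap, the
result is a rough indicator rather than a measured fraction.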
Primary-to-secondary bandwidth used (MB/s, average per process)
This is the amount of data moved between the primary and secondary
data caches, divided by the total program run time. The amount of
data moved is calculated as the sum of the number of primary data
cache misses multiplied by the primary cache line size and the
number of quadwords written back from primary data cache multiplied
by the size of a quadword (16 bytes). For multiprocess programs,
the resulting figure is a per-process average, since the counts
measured by perfex are aggregates of the counts for all the threads.
You must multiply by the number of threads to get the total program
bandwidth.
Memory bandwidth used (MB/s, average per process)
This is the amount of data moved between the secondary data cache
and main memory, divided by the total program run time. The amount
of data moved is calculated as the sum of the number of secondary
data cache misses multiplied by the secondary cache line size and
the number of quadwords written back from secondary data cache
multiplied by the size of a quadword (16 bytes). For multiprocess
programs, the resulting figure is a per-process average, since the
counts measured by perfex are aggregates of the counts for all the
threads. You must multiply by the number of threads to get the total
program bandwidth.
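Both bandwidth figures use the same arithmetic and differ only in the
counts and the cache line size. The sketch below makes several
assumptions that are not stated in this page: the line sizes (32 bytes
primary, 128 bytes secondary) are example values that vary by system,
1 MB is taken as 10^6 bytes, and all names and counts are hypothetical.

```python
QUADWORD_BYTES = 16  # size of a quadword, as stated in the text

def bandwidth_mb_per_s(misses: int, line_bytes: int,
                       quadwords_written_back: int, run_seconds: float) -> float:
    """Per-process average bandwidth in MB/s (1 MB assumed to be 1e6 bytes)."""
    bytes_moved = misses * line_bytes + quadwords_written_back * QUADWORD_BYTES
    return bytes_moved / run_seconds / 1e6

# Assumed line sizes; check the actual values for the system being measured.
L1_LINE_BYTES = 32
L2_LINE_BYTES = 128

# Hypothetical counts over a 2-second run.
l1_to_l2 = bandwidth_mb_per_s(1_000_000, L1_LINE_BYTES, 500_000, 2.0)
l2_to_mem = bandwidth_mb_per_s(100_000, L2_LINE_BYTES, 200_000, 2.0)
print(l1_to_l2, l2_to_mem)

# For a multiprocess program, scale by the thread count for the total.
threads = 4
print(l1_to_l2 * threads, l2_to_mem * threads)
```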
MFLOPS (average per process)
This is the ratio of the graduated floating-point instructions and
the total program run time. Note that while a multiply-add carries
out two floating-point operations, it only counts as one
instruction, so this statistic may underestimate the number of
floating-point operations per second. For multiprocess programs, the
resulting figure is a per-process average, since the counts measured
by perfex are aggregates of the counts for all the threads. You must
multiply by the number of threads to get the total program rate.
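As a rough illustration of this statistic, the sketch below divides the
graduated floating-point instruction count by the run time derived from
the cycle count. The function name and the sample values are
hypothetical.

```python
def mflops_per_process(graduated_fp_instructions: int, cycles: int,
                       clock_mhz: float) -> float:
    """Millions of graduated floating-point instructions per second."""
    run_seconds = cycles / (clock_mhz * 1e6)
    return graduated_fp_instructions / run_seconds / 1e6

# Hypothetical run: 400 million FP instructions in 2 seconds at 200 MHz.
per_process = mflops_per_process(400_000_000, cycles=400_000_000, clock_mhz=200.0)
print(per_process)

# Total rate for a hypothetical 4-thread program.
print(per_process * 4)
```

A multiply-add counts once here, so a code dominated by multiply-adds
may perform up to twice as many floating-point operations as this
figure suggests.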
The following statistics are computed only on R12000 CPUs:
Cache misses in flight per cycle (average)
This is the count of event 4 (Miss Handling Table (MHT) population)
divided by cycles. It can range between 0 and 5 and represents the
average number of cache misses of any kind that are outstanding per
cycle.
Prefetch miss rate
This is the count of event 17 (prefetch primary data cache misses)
divided by the count of event 16 (executed prefetch instructions).
A high prefetch miss rate (close to 1) is desirable, since a prefetch
that hits in the cache does no useful work and wastes instruction
bandwidth.
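The two R12000-only statistics are simple ratios of event counts. A
minimal sketch, with hypothetical function names and sample counts:

```python
def misses_in_flight_per_cycle(mht_occupancy: int, cycles: int) -> float:
    """Event 4 (miss handling table occupancy) divided by cycles."""
    return mht_occupancy / cycles

def prefetch_miss_rate(prefetch_l1d_misses: int, executed_prefetches: int) -> float:
    """Event 17 divided by event 16."""
    return prefetch_l1d_misses / executed_prefetches

# Hypothetical counts.
print(misses_in_flight_per_cycle(3_000_000, 2_000_000))
print(prefetch_miss_rate(900, 1000))
```

With these sample counts, an average of 1.5 misses are outstanding per
cycle and 90% of executed prefetches miss the primary data cache.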
A statistic is only printed if counts for the events which define it have
been gathered.
-c file
Load a cost table from file (requires that -y is specified).
This option allows you to override the internal event costs used by
the -y option. file contains the list of event costs that are to be
overridden. This file must be in the same format as the output
produced by the -t option. Costs may be specified in units of "clks"
(machine cycles) or "nsec" (nanoseconds). You can override all or
only a subset of the default costs.
You can also use the file /etc/perfex.costs to override event costs.
If this file exists, any costs listed in it will override those
built into perfex. Costs supplied with the -c option will override
those provided by the /etc/perfex.costs file.
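The override order (built-in costs, then /etc/perfex.costs, then the -c
file) can be pictured as successive dictionary updates. This sketch
does not parse the real cost file format, which is whatever -t prints;
the event names, the (minimum, typical, maximum) tuples, and the
function name are all hypothetical.

```python
def effective_costs(built_in: dict, etc_costs: dict, dash_c_costs: dict) -> dict:
    """Merge cost tables; later sources override earlier ones per event."""
    merged = dict(built_in)       # start from the internal table
    merged.update(etc_costs)      # /etc/perfex.costs overrides built-ins
    merged.update(dash_c_costs)   # the -c file overrides both
    return merged

# Made-up (minimum, typical, maximum) costs in cycles.
built_in = {"TLB misses": (10, 20, 30), "Mispredicted branches": (1, 2, 5)}
etc_costs = {"TLB misses": (12, 25, 40)}
dash_c_costs = {"TLB misses": (15, 30, 60)}

print(effective_costs(built_in, etc_costs, dash_c_costs))
```

Events that appear in neither file keep their built-in costs, which is
why a partial cost file is sufficient.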
-t Print the cost table used for perfex -y cost estimates to stdout.
These internal costs can be overridden by specifying different
values in the file /etc/perfex.costs or by using the -c file option.
Both file and /etc/perfex.costs must use the format as provided by
the -t option. It is recommended that you capture this output to a
file and edit it to create a suitable file for /etc/perfex.costs or
the -c option. You do not have to specify costs for every event,
however. Lines corresponding to events with values you do not wish
to override may simply be deleted from the file.
The following is an option for systems with both R10000 and R12000 CPUs.
-T Allows experienced users to use perfex on a system of mixed CPUs.
Although perfex cannot verify it, the specification of this option means
that you have used either dplace(1) or some other means to ensure that
the program is using either all R10000 CPUs or all R12000 CPUs.
When used with this option, the -y option will not produce cost estimates
due to the fact that the cost estimation cannot know which type of CPU is
actually targeted. Nothing prevents you, however, from loading a cost
table with -c. This cost table could be directly dumped from a pure-R10000
or pure-R12000 system, depending on which CPU flavor the program
is running on.
CHANGE IN BEHAVIOR OF DEFAULT EVENTS
Because of limitations of ABI/API compliance with Irix version 6.5/R10000
in the operating system counter interface, it is only possible to count
cycles and graduated instructions on counter 0. Accordingly, when the
R12000 user specifies an event in the range 0-15 to perfex, either
through a -e argument or environment variables, cycles cannot be counted
simultaneously with that event as they can on the R10000. (perfex only
multiplexes events for the -a option, never for individually specified
events). In these cases perfex will count event 16 (executed prefetch
instructions) as the second event.
For similar reasons, perfex no longer remaps events 0, 15, 16, and 17 to
fit them on two (R10000) counters, since that would induce a different
behavior for identical arguments on R10000 and R12000 systems. It would
create problems when mixed-CPU systems are supported. To be specific,
prior to 6.5.3 a user could specify:
% perfex -e 0 -e 15 a.out
This would execute as if the user had specified:
% perfex -e 0 -e 17 a.out
or
% perfex -e 15 -e 16 a.out
After Irix version 6.5.3, this argument combination is an error, and the
user must decide which of the equivalent (for R10000 only) forms to use.
It is the lack of equivalence for R12000 that makes this regression
necessary.
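The equivalent forms mentioned above can be enumerated mechanically.
This sketch relies on two facts from the R10000 event table (events 0
and 16 both count cycles, events 15 and 17 both count graduated
instructions) and on the assumption that events 0-15 use counter 0 and
events 16-31 use counter 1. It illustrates the choice now left to the
user and is not a reconstruction of the old perfex remapping code; the
names are hypothetical.

```python
# R10000 only: pairs of event numbers that count the same thing.
R10000_EQUIVALENT = {0: 16, 16: 0, 15: 17, 17: 15}

def equivalent_forms(event0: int, event1: int) -> list:
    """List equivalent event pairs that land on different counters."""
    forms = []
    for a in (event0, R10000_EQUIVALENT.get(event0)):
        for b in (event1, R10000_EQUIVALENT.get(event1)):
            if a is None or b is None:
                continue
            if a // 16 != b // 16:  # assumed layout: one event per counter
                forms.append((a, b))
    return forms

print(equivalent_forms(0, 15))  # [(0, 17), (16, 15)]
```

The two results correspond to the -e 0 -e 17 and -e 15 -e 16 forms
shown above, since the order of event specifiers is not important.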
FILES
/etc/perfex.costs
DEPENDENCIES
perfex only works on an R10000 or R12000 system. Programs running on
mixed R10000 and R12000 CPUs are not supported, although specifying the -T
option lets you run perfex once you have ensured that only CPUs of the
same type are being used. Usually, perfex prints an informative message and fails on
mixed CPU systems.
LIMITATIONS
For the -mp option, only binaries linked-shared are currently supported;
this is due to a dependency on libperfex.so. The options -s and -mp are
currently mutually exclusive.
The signal control interface (-s) can control only the immediate target
process, not any of its descendants. This makes it unusable with
multiprocess targets in their parallel regions.
SEE ALSO
r10k_counters(5), libperfex(3C), time(1), timex(1)