uprofile, kprofile - Profile a program (uprofile) or kernel
(kprofile) with Alpha on-chip performance counters
uprofile [-v] [-quiet] [-dirname path] [-[no]pids]
[-all | -each | -one] [-stride n] [-average] [-pixie]
[-display | prof-option...] [statistic...] program [argument...]
kprofile [-v] [-quiet] [-dirname path] [-[no]pids]
[-all | -each | -one] [-stride n] [-average] [-pixie]
[-display | prof-option...] [-k kernel_name] [-t] [-ra]
[statistic...] [program [argument...]]
See prof_intro(1) for an introduction to the application
performance tuning tools provided with Tru64 UNIX.
The uprofile command uses the Alpha on-chip performance
counters to produce a finely-grained program-counter profile
of a user program. The command runs the program you
specify with the arguments you specify, collecting the
selected statistics on the program's process and its
descendants. It writes the profile data to the umon.out
file, by default. If the program calls shared libraries,
those libraries are not profiled.
The kprofile command uses the Alpha on-chip performance
counters to produce a detailed program-counter profile of
the kernel. If you specify a program, kprofile runs the
program with the arguments you specify, and it collects
the selected statistics on the kernel for the duration of
the program's execution. If you do not specify a program,
kprofile collects the selected statistics on the kernel
until you enter Ctrl/C or the kprofile process receives a
SIGTERM signal. Note that if SIGINT (usually generated by
entering a Ctrl/C at the controlling terminal) is currently
being ignored, it will continue to be ignored and
SIGTERM must be used to terminate data collection. kprofile
writes the profile data to the kmon.out file, by
default.
If you specify -display or any of the prof-options, the
uprofile and kprofile commands display the profile by
running the prof tool (with any specified prof-options).
You can also run the prof command separately, to help analyze
the data in the umon.out or kmon.out file. The following
examples show how to invoke the prof command to
analyze data in the respective files:
    % prof a.out umon.out
    % prof /vmunix kmon.out
The CPU-time profile displayed by prof will not be accurate
if the processors that executed the application do not all
run at the same speed, as in certain multiprocessor
systems containing EV67 or later processors. The inaccuracy
may be avoided by using the hiprof (sampling) or cc
-p/-pg profilers, or by running the application on a subset
of the processors:
Select a single processor using the runon command.
Check the processor speeds using the psrinfo -v command and
run the application in a processor set comprising only
processors that run at the same speed (see processor_sets(4)).
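As a rough illustration of the second approach, the following Python sketch groups processors by clock speed and picks the largest group whose members all run at the same speed. The speeds are hypothetical values typed in by hand; the sketch does not parse psrinfo -v output, whose format is not shown here, and the function names are assumptions made for this example.

```python
from collections import defaultdict

def cpus_by_speed(speeds_mhz):
    """Group CPU numbers by clock speed (MHz)."""
    groups = defaultdict(list)
    for cpu, mhz in sorted(speeds_mhz.items()):
        groups[mhz].append(cpu)
    return dict(groups)

def largest_uniform_set(speeds_mhz):
    """Return the largest list of CPUs that all run at one speed."""
    groups = cpus_by_speed(speeds_mhz)
    return max(groups.values(), key=len)

# Hypothetical speeds, as they might be read off psrinfo -v by hand.
speeds = {0: 667, 1: 667, 2: 833, 3: 667}
candidates = largest_uniform_set(speeds)
```

The resulting list is the set of processors that could be placed in one processor set before running the application under uprofile.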
statistic
    The name of an event that your particular Alpha hardware
    can profile, as detailed in the STATISTICS section, below.
    If no statistic is named, machine cycles are counted, giving
    a CPU-time profile. One statistic can be specified for
    each of the hardware counters on your machine.
program
    The name of the executable to run while profiling operations
    are being performed.
argument
    An argument to pass to the program that is run. Multiple
    arguments can be specified, as needed by the program.
Options can be abbreviated to three characters, except the
prof-options, which can be abbreviated (usually to one
character) as in a prof command. For example, -qui is
interpreted as -quiet, but -q is interpreted as -quit. (See
the -display option for the supported prof-options.)
For options that specify a procedure name (proc), C++ procedures
can omit the argument type list, though this will
match all overloaded procedures with that name. To select
a specific procedure, specify the full symbol name (as
printed by the nm command). Symbol names containing
spaces, *, and so on must be quoted.
-v
    Engages verbose mode, which prints some useful information
    about the program being profiled.
-quiet
    Prevents informational and progress messages from being
    printed.
-dirname path
    Specifies the directory path in which the profiling data
    file or files are created.
-[no]pids
    [Disables] or enables the addition of the process-id number
    to the name of the profiling data file or files.
-all | -each | -one
    Specifies which mode to use for profiling on multiprocessor
    machines. Using the -all option (the default) aggregates
    the data for all CPUs into one umon.out file. Using
    the -each option collects separate profiles for each CPU
    and writes the output into a set of files named
    umon.out.n, where n is the CPU number. Using the -one
    option profiles only the current CPU. For the -one option
    to work, the uprofile or kprofile program must be run
    using the runon command.
-stride n
    Sets the granularity of the sample counts, where n is the
    number of consecutive instructions grouped together for
    each sample count. The default is -stride 4. The -asm,
    -heavy, and -lines prof-options need a separate sample
    count for each instruction (for their reports to be precise
    enough), so these options imply -stride 1. This makes the
    output file four times bigger than the default size. The
    -stride argument must be a power of two (for example, 1, 2,
    4, 8).
-average
    Attempts to average samples within basic blocks so that
    each instruction within a basic block will show the same
    number of samples. Ensures fine-grained profiles by setting
    stride to 1.
-pixie
    Produces files similar to those produced by running
    an executable instrumented with pixie (see pixie(1)).
    Uses cycles0 statistic (freq on EV67) by default. Ensures
    fine-grained profiles by setting stride to 1.
-k kernel_name
    Overrides the name of the kernel to profile. (The default
    is the booted kernel.)
-t
    Enables triggered mode for kprofile. This option sets up
    all required information for running the performance
    counters, but does not invoke them. See the STATISTICS
    section for additional information.
-ra
    Enables PCNTCALLER mode for kprofile. Collects profiling
    data on the caller of certain kernel utility routines (for
    example, bcopy, bzero, simple_lock), instead of the routine
    itself.
-display
    Runs prof on the resulting profile data file(s). The
    following prof options are supported:
-asm
    Reports the profile as an annotated disassembly.
-exclude proc
    Excludes procedure proc from the profile but includes its
    CPU time or other statistic in the total.
-Exclude proc
    Excludes procedure proc from the profile and from the
    total.
-heavy
    Profiles source lines, printing those with the highest CPU
    time or other statistic first.
-lines
    Reports the profile per source line within each procedure.
-merge file
    Merges all profile data files into file.
-numbers
    Prints each procedure's starting line number.
-only proc
    Includes only procedure proc in the profile, but totals all
    procedures.
-Only proc
    Includes only procedure proc in the profile and in the
    total.
-procedures
    Profiles procedures, printing those with the highest CPU
    time or other statistic first.
-quit n
    Truncates the reports after n lines or after (cumulative)
    n percent of the whole.
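The effect of the -stride option on the number of sample counts can be illustrated with a short Python sketch. It assumes only that every Alpha instruction is 4 bytes long and that each sample count covers n consecutive instructions; the function names and the text_start parameter are hypothetical, and the sketch does not describe the actual layout of the umon.out or kmon.out file.

```python
ALPHA_INSN_BYTES = 4  # every Alpha instruction is 4 bytes long

def is_valid_stride(n):
    """-stride accepts only powers of two: 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0

def bucket_index(pc, text_start, stride=4):
    """Index of the sample count that address pc falls into.

    text_start is a hypothetical base address of the profiled code.
    """
    if not is_valid_stride(stride):
        raise ValueError("stride must be a power of two")
    return (pc - text_start) // (ALPHA_INSN_BYTES * stride)

def bucket_count(text_bytes, stride=4):
    """Number of sample counts needed to cover text_bytes of code."""
    span = ALPHA_INSN_BYTES * stride
    return (text_bytes + span - 1) // span

# With -stride 1 the same code needs four times as many sample
# counts as with the default -stride 4.
default_counts = bucket_count(4096, stride=4)
fine_counts = bucket_count(4096, stride=1)
```

Under these assumptions, four adjacent instructions share one count at the default stride, which is why the per-instruction reports imply -stride 1.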
You specify the statistics that you want to collect for
the program being profiled in one or more statistic
operands.
If you specify multiple statistics, uprofile and kprofile
accumulate their results. You cannot then view the results
of any single statistic separately. Because collected data
is merged into a single buffer, interpretation of multiply
collected statistics may be difficult.
The Alpha architecture implemented on your machine determines
which statistics can be collected and the number of
counters available for collecting multiple statistics at
the same time. The implementation is indicated by the
Alpha chip number, which can be displayed with the show
config console command before booting Tru64 UNIX, or,
after booting, by using the psrinfo -v command, or by
calling getsysinfo (GSI_PROC_TYPE). Also, if the uprofile
command is run without arguments, it will show how many
counters and what statistics are available on your
machine.
All of the chips in the EV4 family (21064 [EV4], 21064A
[EV45], 21066/21068 [LCA4]) have two performance counter
registers, each of which can be separately programmed. The
statistics that each counter can collect are shown in the
following table:
------------------------------
Counter0Stats Counter1Stats
------------------------------
0disabled 1disabled
issues dcache
pipedry icache
loads dualissues
pipefrozen mispredicts
branches floatops
cycles intops
PALcycles stores
nonissues novictims
victims
------------------------------
All of the chips in the EV5 family (21164 [EV5], 21164A
[EV56], and 21164PC [PCA56]) have three performance
counter registers, each of which can be separately programmed.
Some of the counters are common to all EV5 implementations,
some are specific to EV5 and EV56, and some
are specific to PCA56.
The statistics that each of the common EV5 counters can
collect are shown in the following table:
--------------------------------------------------
Counter0Stats   Counter1Stats   Counter2Stats
--------------------------------------------------
0disabled       1disabled       2disabled
cycles0         nonissues       longstalls
issues          splitissue      pcmispredicts
                pipedry         branchmispredicts
                replay          icachemisses
                singleissues    itbmisses
                dualissues      dcacheldmisses
                tripleissues    dtbmisses
                quadissues      ldsmerged
                flowchanges     ldureplays
                intops          fullreplays
                floatops        externalinput
                loads           cycles2
                stores          memorybarriers
                icacheacc       lockedloads
                dcacheacc
--------------------------------------------------
The statistics that each of the EV5- and EV56-specific
counters can collect are shown in the following table:
-----------------------------------
Counter1Stats Counter2Stats
-----------------------------------
scacheacc scachemisses
scachereads scachereadmisses
scachewrites1 scachewritemisses
scachevictim scachesharedwrites
bcacheref scachewrites2
bcachevictim bcachemisses
sysreqs systeminvalidates
systemreadrequests
-----------------------------------
The statistics that each of the PCA56-specific counters
can collect are shown in the following table:
------------------------------------------
Counter1Stats Counter2Stats
------------------------------------------
bcachereads bcachedreads
bcachedreadhits bcachereadhits
bcachedreadfills bcachereadfills
bcachewrites bcachewritehits
bcachecleanwritehits bcachewritefills
bcachevictims sysreadflushhits
readmisstwo sysreadflushmisses
readmissthree
------------------------------------------
The EV6 chip has two performance counter registers, each
of which can be separately programmed. The statistics that
each of the EV6-specific counters can collect are shown in
the following table:
------------------------------
Counter0Stats Counter1Stats
------------------------------
0disabled 1disabled
cycles0 cycles1
retinst retcondbranch
retdtb1miss
retdtb2miss
retitbmiss
retunaltrap
replay
------------------------------
The default is to gather cycle statistics in the 0th
counter and to disable other counters.
The EV67 chip has two kinds of performance counters: traditional
aggregate counters and profile-me counters. The
traditional aggregate statistics that each of the
EV67-specific counters can collect are shown in the following
table. Any one statistic or statistic combination
may be selected.
------------------------------
Counter0Stats Counter1Stats
------------------------------
0disabled 1disabled
cycles0 replay
retinst cycles1
retinst bcachemisses
------------------------------
If no aggregate statistics are selected, one profile-me
statistic may be selected:
-----------------------------------------------------------------------------
Profile-me Statistics
-----------------------------------------------------------------------------
2disabled abort abort_per_ret arith_trap
cbr_taken cbr_taken_per_ret cycles cycles_per_ret
delay delay_per_ret dstream_fault dtb_miss
dtb_miss_per_ret dtb_miss3 dtb_miss4 early_kill
early_kill_per_ret fp_disabled freq icache_miss
icache_miss_per_ret icache_parity inflt_bcache inflt_replays
inflt_retires interrupt istream_accvio itb_miss
ldst_order ldst_unalign map_stall map_stall_per_ret
mispredict mispredict_per_ret opcdec replay_trap
replay_trap_per_ret retire trap trap_per_ret
valid
-----------------------------------------------------------------------------
The default is to gather cycle statistics in the 0th
counter and to disable other counters.
For descriptions of the statistics for all EV4, EV5, and
EV6 implementations, refer to pfm(7).
You can disable any counter by specifying 0disabled, 1disabled,
or 2disabled as the counter statistic. You can use
this feature to isolate specific event types, such as
loads, without extraneous data being generated. You cannot
disable all counters at the same time, choose two statistics
for the same counter, or disable a counter once its
statistic is specified.
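The three restrictions above can be expressed as a small validity check. The following Python sketch uses a handful of statistic names from the EV6 table; the function name and the table variable are hypothetical, and the sketch illustrates the rules as stated here rather than the checks that uprofile and kprofile actually perform.

```python
# Counter assignment for a few EV6 statistics, taken from the EV6 table.
EV6_COUNTER = {
    "0disabled": 0, "cycles0": 0, "retinst": 0,
    "1disabled": 1, "cycles1": 1, "retcondbranch": 1,
}

def check_statistics(stats, counter_of=EV6_COUNTER):
    """Return {counter: statistic}, or raise ValueError if a rule is broken."""
    used = {}
    for name in stats:
        counter = counter_of[name]
        if counter in used:
            # Covers both "two statistics for the same counter" and
            # "disabling a counter whose statistic is already specified".
            raise ValueError(
                f"{name} and {used[counter]} both use counter {counter}")
        used[counter] = name
    disabled = {c for c, name in used.items() if name.endswith("disabled")}
    if disabled == set(counter_of.values()):
        raise ValueError("cannot disable all counters at the same time")
    return used

# Count retired instructions on counter 0 and keep counter 1 quiet.
selection = check_statistics(["retinst", "1disabled"])
```

A request such as cycles0 together with retinst is rejected because both statistics belong to counter 0.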
When you specify no counter statistics, uprofile and kprofile
count cycles on counter 0 by default, and display
(through prof) a profile in terms of seconds used by each
procedure in the program, except for any shared libraries.
For noncycle statistics, the displayed profile shows the
number of samples recorded, the sampling interval (events
per second), and the total number of events that this
implies. Most noncycle statistics of the EV5 family CPUs
are recorded about six cycles after the instruction that
triggered the sample. So, when using prof's -asm or
-lines option, the samples should be associated with one
of the few previously executed instructions or lines. The
icacheacc, icachemisses, and dtbmisses statistics are usually
attributed precisely.
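The relationship between samples and implied events for a noncycle statistic is simple arithmetic. The Python sketch below assumes that the sampling interval is the number of events counted between successive samples; the function names and all numbers are hypothetical and are not output from prof.

```python
def estimated_events(samples, events_per_sample):
    """Total events implied by a number of recorded samples."""
    return samples * events_per_sample

def sample_share(samples_in_procedure, total_samples):
    """Fraction of all samples attributed to one procedure."""
    return samples_in_procedure / total_samples

# Hypothetical run: 1500 samples, one sample every 4096 events.
total = estimated_events(1500, 4096)
share = sample_share(300, 1500)
```

Because each sample stands for many events, the per-procedure figures are estimates whose precision improves with the number of samples collected.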
To perform a detailed analysis of short sections of kernel
code, use the kprofile command with triggered mode
(invoked with the -t option). When you use this mode,
kprofile performs all of the required setup for enabling
the counters as normal, but does not invoke them. You can
insert counter start or stop commands into the kernel code
to be instrumented as follows:
Turn counters on: wrperfmon (PFOPT, 1)
Turn counters off: wrperfmon (0)
You can turn the counters on and off repeatedly to collect
data over many iterations or multiple sections of code.
The macro PFOPT is defined in <sys/pfcntr.h>.
The interrupt load that profiling places on the system may
affect performance, but usually the effect is insignificant.
The kernel in use must have the pfm pseudo-device configured
into it. To do this, use one of the following methods:
Add the following line to the kernel configuration
file, and rebuild the kernel. Do not use this method if
CPU hot-swap is supported by the system, because it does
not allow pfm to be easily unconfigured, as required for a
hot-swap; instead, use the sysconfig method below.
    pseudo-device pfm
Enter the following command from the root account. Do not
configure pfm if CPU hot-swap is anticipated.
    # sysconfig -c pfm
If pfm is configured, the CPU hot-swap procedure
requires that it be unconfigured, using the following
command, before any CPU is swapped:
    # sysconfig -u pfm
The autosysconfig program can be used to automatically
load the configurable pfm device at each system
startup.
The format of the data files produced by uprofile in Tru64
UNIX is different from the format produced in versions of
DIGITAL UNIX prior to Version 4.0. The Tru64 UNIX data
files include the names of selected statistics in profile
displays. To convert these data files to the industry-standard
format, at the expense of losing the names of the
statistics, use the pdtostd command.
The EV4 victims and novictims statistics rely on the external
performance counter pin connections as described in
the EV4 chip specification. The DEC 3000/400, /500, /600,
and /800 workstations have these connections. Attempts to
display either of these statistics on other platforms
(while allowed) will typically generate empty data.
The uprofile command is only supported on EV4 Pass 3 or
later processors. Attempts to use it on a Pass 2 processor
will gather PC samples for every process running on the
system.
Using kprofile to generate statistics for a single command
is only possible on EV4 Pass 3 or later processors.
Attempts to do this on a Pass 2 processor will gather
statistics for the entire system, as if no command had
been specified.
Using kprofile with triggered mode also requires an EV4
Pass 3 or later processor and cannot be performed with
per-process monitoring.
Only one tool can use the performance counters at a time.
A message similar to "the counter device is busy" indicates
that some other tool is using the performance counters
(or has used them but not cleaned up properly). If
you are sure no one else is using the performance counters,
running uprofile/kprofile with superuser privilege
will attempt to reset the busy status and proceed.
The performance counter device file. The statistics
file(s) generated by uprofile. The statistics file(s)
generated by kprofile. The statistics file(s) generated
with the -pids option. The default kernel to profile.
Introduction: prof_intro(1)
pdtostd(1), pfm(7), prof(1), runon(1), psrinfo(1), sysconfig(8), autosysconfig(8), processor_sets(4)
Programmer's Guide
uprofile(1)