PROF(1) PROF(1)
prof - analyze SpeedShop performance data
prof [options] [speedshop_data_file | pixie_counts_file] ...
prof [options] executable-name [speedshop_data_file | pixie_counts_file]
...
prof analyzes one or more data files generated by the SpeedShop
performance tools and produces a report. (Note that most reports are
formatted with long lines, and should be viewed in a window that is 135
characters wide, and printed in wide format.) The second form is used to
analyze data files generated by the SpeedShop performance tools if the
target program is not in the same directory as the data files (in which
case <executable-name> should be the path to the target program).
Multiple files can be included only if they are recorded from the same
executable with the same experiment type. The default listing for all
experiments lists functions in descending order of the appropriate
exclusive (meaning from within the function, rather than included from
calls it makes) performance metric. (See below, FUNCTION LIST.) Options
allow sorting by calls or by inclusive metrics, for those experiments
where the recorded data supports them. Where applicable, the
-b[utterfly] option also produces a listing of callers and callees for
each function, with attribution percentages and time or counts. (See
below, BUTTERFLY LIST.) Addition listings may also be generated; see
below.
The current implementation supports the following SpeedShop experiments:
usertime (callstack profiling, user+system time trigger) causes the
program to be interrupted every 30 milliseconds during its running time
(user or system mode, but not including any wait time), and to record the
callstack at each interrupt. It can show both inclusive and exclusive
time. usertime data is statistical in nature, and will show some
variance from run to run.
totaltime (callstack profiling, wall-clock time trigger) causes the
program to be interrupted every 30 milliseconds of wall-clock time during
the run, and to record the callstack at each interrupt. It can show both
inclusive and exclusive time. totaltime data is statistical in nature,
and will show some variance from run to run.
[f]pcsamp[x] asks the kernel to look at the user PC every 10
milliseconds, and record a histogram of the value of the program counter
at each clock tick, using 16-bit bins, one for each PC value. It can
only show exclusive data, that is data about where the program counter
was, not the callstack to get there. The [f] prefix specifies 1
millisecond profiling, instead of 10 milliseconds. The [x] suffix
specifies 32-bit count bins, instead of 16-bit. *pcsamp* data is
statistical in nature, and will show some variance from run to run.
Page 1
PROF(1) PROF(1)
*_hwc asks the kernel to look at the user PC every time the hardware
performance counter specified by the experiment overflows, and record a
histogram of the value of the program counter at overflow. These
experiments can only be run on R10000 machines; other machines do not
have the hardware performance counters. There are a number of these
experiments defined; see speedshop(1). They can only show exclusive
data, that is data about where the program counter was, not the callstack
to get there. The particular counter used and its overflow value are
specified in the experiment. Some of the Hardware Performance counter
sampling experiments are statistical in nature, and will show some
variance from run to run; others are exact, provided that the program
executes the exact same sequence of instructions.
Among the interesting counter prefixes to use are:
cy_ for cycle counting;
gi_ for graduated instructions;
ic_ for primary instruction cache misses;
sic_ for secondary instruction cache misses;
dc_ for primary data cache misses;
sc_ for secondary data cache misses;
tlb_ for TLB misses;
gfp_ for graduated floating-point instructions; and
fsc_ for failed store-conditional instructions.
*_hwctime (callstack profiling, R10K hardware counter overflow trigger)
causes the program to be interrupted at every N overflows of the
particular counter and to record the callstack at each interrupt. The
value of N depends on the particular counter chosen. It can take the
same set of prefixes as above, but since callstacks are recorded, it can
show both inclusive and exclusive time. *_hwctime data is statistical in
nature, and will show some variance from run to run.
ideal causes the code to be instrumented to count the number of times
each basic block is executed. (A basic block is a region of the program
that can be entered only at the beginning and exited only at the end.)
The data recorded also contains counts for all function pointer calls.
From this data, a machine model is used to compute the exclusive time
(cycles) spent in each function.
Inclusive time computations, performed when the -b[utterfly] flag is
specified to prof, calculates the exclusive time for each routine as
above, and then propagates the time to each caller of each routine in
proportion to the number of calls. For example, if sin(x) takes 1000
cycles, and its callers - procedures foo() and bar() - call it 25 and 75
times respectively, 250 cycles are attributed to foo() and 750 to bar().
By propagating cycles this way, __start() should end up with all the
cycles counted in the program.
Note that the propagation according to the number of calls may not be
reasonable for some routines, and may lead to misleading reports. For
example, if a matrix-multiply routine is substituted for sin(x) in the
above example, and bar's calls are for 2X2 matrices, while foo's calls
Page 2
PROF(1) PROF(1)
are for 100X100 matrices, the attribution distributes 3/4 of the time to
bar, whereas nearly all the time really should be attributed to foo.
fpe asks the floating-point exception library to trace all floating-point
exceptions, with their callstacks. These experiments show listings
similar to those for usertime experiments, except the data reported is a
count of floating-point exceptions, rather than projected CPU time.
io asks for a trace of all IO calls made by the program. These
experiments will show calls attributed to functions, and with
-b[utterfly], can show which functions made which calls.
prof writes an analysis of the performance data to stdout. The first
thing in the list is a summary of the experiment, and description of the
environment in which it was recorded. That is followed by a header that
summarized the particular data recorded. Following that is the function
list; if -b[utterfly] is given, and the data supports it, a butterfly
list will be presented.
If -h[eavy] or -l[ines] is given, for *pcsamp*, *hwc, and ideal
experiments, a report of data at the source line level is appended,
sorted by the performance metric computed on a line basis, or by
functions, and then by line numbers within a function, respectively. For
other experiments, these options are ignored.
If -basicblocks is given, for ideal experiments only, A report of data at
the basic-block level is appended. If -archinfo is given, also for ideal
experiments only, a summary report of register usage, instruction usage,
and various other statistics is appended. For other experiments, these
options are ignored.
If -dsolist is given, a list of the DSOs used by the program is appended.
If -usage is given, a summary of the resources used by the program is
appended.
The default output for any experiment is a function list, sorted in the
order of exclusive values of the primary metric--the performance cost of
the function, as computed from the data recorded. For many experiments,
this metric is a time, printed in the report in seconds; for others, it
is a count of events-FPEs, IO calls, or R10K counter overflow counts.
The report begins with a legend line, naming each of the columns of data.
Each line in the list has an index; if -b[utterfly] is not specified, the
index will be in numerical order; if it is specified, the index will be
in order of functions as sorted by the inclusive metric. The index
serves as a poor-person's hyperlink through the butterfly, and between
the butterfly and the function list.
Page 3
PROF(1) PROF(1)
To the left of the index, is the primary metric, followed by its
representation as a percentage of the whole program, followed by the
cumulative percentage up to that point in the list. The next two columns
are inclusive value of the metric., and its representation as a
percentage.
Following that may be additional columns of data, as appropriate to the
particular data recorded. Finally, rightmost on the line will be the
name of the function, and the DSO and source file basenames.
The butterfly list is a set of records that show the callers and callees
of each function. The list is sorted in the order of inclusive values of
the primary metric. For each function, its callers are shown above it,
and its callees below. The center line is for the function itself, and
shows the index at the beginning and end of the line. The second and
third columns are the inclusive percentages of the primary metric and its
absolute value. The next two columns of the center line are the
exclusive percentage and the exclusive value of the metric, followed by
the function identification.
Callees are shown below the function, with attribution percentages and
values lined up below the self percentages and values of the center
function. In absolute mode (the default), the percentages, including the
center node's self value, should add up to the inclusive percentage of
the center node; in relative mode, specified with -rel[ative] flag, the
percentages should add up to 100 %. In either mode, the attribution
values should add up to the inclusive time of the central function. The
attribution data for callees is followed by the inclusive value for that
caller and callsite, followed by the callsite identification, with an
address, a source file, and a line number. For callees, the source file
will be that of the central function.
The callers are shown above the central function, also with attribution
percentages and values, but these have a different meaning, and are
aligned with the central nodes inclusive values to so indicate. The
attribution percentage and value in a caller's line represents the
percentage and value, respectively, of the central function's metric that
was attributed to that callsite. In absolute mode, the percentages
should add up to the central function's percentage; in relative mode,
they should add up to 100%. The attribution values should add to the
central function's values. The attribution data for callers is followed
by the inclusive value for that caller and callsite, followed by the
callsite identification, with an address, a source file, and a line
number. For callers, the source file will be that of the calling
function.
When invoked with the -h[eavy] or -l[ines] arguments, a list of linelevel
data is produced. When -h[eavy] is used, the line list is sorted
by the primary metric associated with each line, with lines from the
various DSOs and source files intermixed. When -l[ines] is used, the
Page 4
PROF(1) PROF(1)
lines are sorted based on the function they came from, with all lines
with non-zero values from each function printed before any lines from the
next most important function.
For each line, the leftmost column contains the value for the primary
metric computed from the particular experiment, followed by its
representation as a percentage, followed by the cumulative percentage
over all lines printed thus far. These data are followed by any
additional metrics for the particular experiment, and then by the name of
the function, its DSO, file and line number.
When invoked with the -basicblocks argument (applicable ideal experiments
only), a list of all the basic blocks in the program that were executed
is generated. It is preceded by a header, and a column header line, and
then the list of basic blocks, in order of total cycles.
Each basic block is printed with its index, the number of cycles per
execution of the block, the count of executions, the total cycles, and
the total cycles represented as a percentage. Trailing that on each
line, is the function, and the address of the beginning of the line, with
the DSO, source file, and source line number.
ARCHITECTURAL INFORMATION REPORT [Toc] [Back] When invoked with the -archinfo argument (applicable ideal experiments
only), a list of various metrics concerning execution of the program is
printed. The report consists of a header, and a number of subsections,
each with the appropriate headers.
For integer registers, counts are printed of the number of times each
register was used, and its percentage, the number of times each register
was used as a base register, and its percentage, and the number of times
each register is used as a destination register, and its percentage. For
floating point registers, the same data, less the base count statistics,
is printed.
Following the register usage statistics, a number of counts of
instruction types or sequences are printed, each with a description,
followed by a list of all the different instructions used, sorted by the
number of times each was executed. For each instruction, the dynamic
count of executions, and its representation as a percentage is printed.
That is followed by the cumulative execution percentage, and the count of
the number of distinct instructions of that type that were executed one
or more times, and that count represented as a percentage.
When invoked with the -dsolist argument, a report of summary information
about the DSOs in the execution of the program is printed. For each DSO,
it gives the name, a count of instructions, functions, source-files, and
source-lines, the high and low addresses, and the full pathname to the
DSO. If the DSO is ignored in computation, either because it belongs to
the SpeedShop runtime, and the -showss flag is not provided, or because
Page 5
PROF(1) PROF(1)
the DSO was excluded by being listed as with a -xdso argument, or the DSO
was excluded by virtue of its not being listed with a -dso argument, the
DSO is flagged as ignored.
When invoked with the -usage argument, a report of summary usage data, as
measured by the kernel during the run, is produced. The data consists of
both per-process (per-file) and per-system metrics. Included are realtime
(wall clock time), user, system and wait times. The per-process
data is shown as a sum of the data from the experiment files; the realtime,
and the system-wide statistics are printed as maxima over the data
in each file. Included are the accounting timers, giving the time spent
in each of the various process states. The sum of the accounting timers
should be approximately equal to the elapsed real-time, since the process
must always be in one of the states. There is some skew in the reading
of the data, so some discrepancy should be expected. Other summary
statistics include bytes read and written, page faults, context switches,
system calls, and process-size statistics.
DISASSEMBLY AND SOURCE LISTS
When invoked with the -dis argument, an assembly listing of the program
is generated. If the -dis argument is given as well, the source code for
the disassmebly is intermixed with it.
-calipers [n1,]n2
flag causes prof to compute the data between caliper points n1 and
n2, rather than for the entire experiment. If n1 >= n2, an error is
reported, otherwise if n1 is negative it is set to the beginning of
the experiment, and if n2 is greater than the maximum recorded, it
is set to the maximum. If n1 is omitted, zero is assumed.
-b[utterfly]
or -gprof causes prof to print a report showing the callers and
callees of each function, with inclusive time attributed to each.
For ideal experiments, the attribution is based on a heuristic,
while for the various callstack sampling/tracing experiments the
attribution is precise (although usertime and totaltime, as well as
some *_hwctime experiments are statistical in nature). This flag is
ignored for experiments where the data does not support inclusive
calculations.
-p[threads] <pthread_id1>,<pthread_id2>,...,<pthread_idn>
For usertime, totaltime, *_hwctime, io, and fpe experiments on
applications that use pthreads (on Irix 6.5 or later), analyze data
only for the specified pthreads. This flag is ignored for other
experiments.
-u[sage]
Print a report on system statistics and timers.
Page 6
PROF(1) PROF(1)
-dis[assemble]
Disassemble and annotate the analyzed object code with cycle times
or number of PC samples. This option can be used when generating
reports for ideal, pcsamp, or prof_hwc experiments.
-S(-source)
Disassemble and annotate the analyzed object code with cycle times
(or PC samples) and interleave the source code. This option can be
used when generating reports for ideal, pcsamp, or *_hwc
experiments.
-h[eavy]
Reports the most heavily used lines in descending order of use.
This option can be used when generating reports for ideal, pcsamp,
or prof_hwc experiments. It is ignored for other experiments.
-l[ines]
Like -h[eavy], but group lines by procedure, with procedures sorted
in descending order of use. Within a procedure, lines are listed in
source order. This option can be used when generating reports for
ideal, pcsamp, or prof_hwc experiments. It is ignored for other
experiments.
-[no]cordfb
Disables or enables cord feedback file generation for the executable
only. Cord feedback is used to arrange procedures in the binary in
an optimal ordering, to improve both paging and instruction-cache
performance. Users can use cord(1) (or, in the near-term future,
ld(1)) to actually do the procedure-ordering.
-cordfball
Enables cord feedback for the executable and all DSOs.
-feedback
Produces files with information that can be used to (a) tell the
compiler how to optimize compilation of the program next time and
(b) arrange procedures in the binary in an optimal ordering. Users
can invoke "cc(1) -fb <cfb-filename>" to use the compilationoptimization
feedback file for subsequent compilations. To disable
cord feedback, while producing compiler feedback, use the options
-feedback -nocordfb together.
Procedures are normally ordered by their measured invocation counts;
if -gprof is also specified, procedures are ordered using call graph
counts (that capture caller-callee relationships as well), rather
than invocation counts. The cord feedback file is named <a.out>.fb
or <lib*so*>.fb. The -feedback option produces cord feedback for
the executable only. To get cord feedback for all the DSOs as well,
use the options -feedback -cordfball together.
This option also produces a file with information that the compiler
system can use to recompile, optimizing by using measured branch
frequencies, etc. The feedback file is produced for the executable
only. It is named <a.out>.cfb and is a binary file. It may be
Page 7
PROF(1) PROF(1)
dumped using the fbdump(1) routine. This option can only be used
with pixie (ideal-time) data files, and the data should be recorded
on a binary that was compiled -O0. Recording data on binaries with
higher optimization will generate a feedback file that does not have
the appropriate correspondence between source lines and machine
code.
-ws Generate a working-set file for the current caliper setting, for the
executable only.
-wsall
Generate a working-set file for the current caliper setting, for the
executable and all the non-ignored DSOs.
-showss
Enables display of functions from the SpeedShop runtime. Normally
such functions are suppressed from the reports and computations. In
addition, some statistics for prof's own memory usage will be
printed.
-dsolist
List all the DSO's in the program and their start and end text
addresses.
-rel[ative]
Show percentage attribution in a butterfly report relative to the
central function. The default is to show percentages as absolute
percentages over the whole run.
-inclusive
Sort function list by inclusive data, rather than by exclusive data.
This option can only be used when generating reports for those
experiments which have inclusive data; it is ignored for others.
-calls
Sort function list by procedure calls rather than by time. This
option can only be used when generating reports for ideal
experiments, or for basic block counting data obtained with pixie.
-q[uit] n
Truncates the -h[eavy], -l[ines] and -b[utterfly] listings after
the first n procedures or lines have been listed.
-q[uit] n%
Truncates the -h[eavy], -l[ines] and -b[utterfly] listings after
those procedures or lines up to the one which takes more than n
percent of the total.
-q[uit] ncum%
Truncates the -h[eavy], -l[ines] and -b[utterfly] listings after
those procedures or lines up to the one which brings the cumulative
Page 8
PROF(1) PROF(1)
total to n percent. (For -b[utterfly], it behaves the same as -quit
n%.) For example, "-q[uit] 15" truncates each part of the listing
after 15 lines of text, "-q[uit] 15%" truncates each part after the
first line that represents less than 15 percent of the whole, and
"-q[uit] 15cum%" truncates each part after the line that brought the
cumulative percentage above 15 percent.
-dislimit n
Disassemble only those basic blocks with frequency >= n.
-nh Suppress various header blocks from the output.
SELECTIVITY OPTIONS
-dso dso_name
Report only on the named DSO. Only the basename of the DSO need be
specified not the full pathname to the DSO; the .so suffix is
required. Multiple instances of the -dso flag can be given, and the
executable is considered a DSO, like any other. All the DSOs from
an experiment can be listed with the -dsolist flag.
-xdso dso_name
Exclude the named DSO from any reports. Only the basename of the
DSO need be specified not the full pathname to the DSO; the .so
suffix is required. Multiple instances of the -xdso flag can be
given.
-o[nly] procedure_name
If you use one or more -o[nly] options, the profile listing includes
only the named procedures, rather than the entire program. If any
option uses an uppercase "O" for -O[nly], prof uses only the named
procedures, rather than the entire program, as the base upon which
it calculates percentages.
-e[xclude] procedure_name
If you use one or more -e[xclude] options, the profiler omits the
specified procedure from the listing. If any option uses an
uppercase "E" for -E[xclude] , prof also omits that procedure from
the base upon which it calculates percentages.
CPU OPTIONS
prof normally uses the scheduling model for the processor on which it is
being run to perform the analysis. The user can override the default
with any of the following options:
{-r10000 | -r8000 | -r5000 | -r4000 | -r3000}
Note that these options are only meaningful for ideal time and
pixie-counts data.
Page 9
PROF(1) PROF(1)
-clock megahertz
Set the CPU clock speed to megahertz MHz. Alters the appropriate
parts of the listing to reflect the clock speed. The default value
is the clock speed of the machine on which the experiment was
performed.
-cycle nanosecond
Set cycle time to nanosecond ns. This is the same as -clock
1000/nanosecond.
-debug:debug_flags
Where debug_flags can be combinations of the following.
GPROF_FLAG 0x00000001
COUNTS_FLAG 0x00000002
SAMPLE_FLAG 0x00000004
MISS_FLAG 0x00000008
FEEDBACK_FLAG 0x00000010
CORD_FLAG 0x00000020
USERPC_FLAG 0x00000040
MDEBUG_FLAG 0x00000080
BEAD_FLAG 0x00000100
prof prints warnings and fatal errors on stderr. With inclusive cycle
counting, prof prints a list of functions at the end which are called but
not defined. Any functions starting with _rld listed here is normal
behavior. They appear because rld is not instrumented.
One way to sanity-check inclusive cycle counts is to look at the
percentage cycles for __start(). If the value is anything less than 98-
99%, the inclusive report is suspect. Look for other warnings which
indicate that prof didn't take into account certain procedures.
There are a number of known cases when prof fails to list cycles of a
procedure in the inclusive listing. The reasons can be one of the
following:
- init & fini sections, mips stubs are not part of any procedure.
- calls to procedures which don't use a jump and link are
not recognized as procedure calls.
Page 10
PROF(1) PROF(1)
- executions of global procedures with same name in different DSOs.
In this case, only one of them is listed.
All these exceptions are listed at the end of the -gprof listing, under a
separate section.
With both pc-sampling and basic block counting, prof may report times for
procedures named with a prefix of *DF*, for example *DF*_hello.init_2. DF
stands for Dummy Function, and indicates cycles spent in parts of text
which are not in any function, for example, init and fini sections, and
MIPS.stubs sections. For pc-sampling, and hardware counter overflow pcsampling,
a dummy function, *DF*_Other is used to report sampling hits in
an overflow bin; these hits include time spent in rld, and any other text
regions not listed in any DSO.
If any of the object files linked into the application have been stripped
of line-number information (with ld -x for example), prof warns about the
affected procedures. The instruction counts for such procedures are
shown as a procedure total, not on a per-basic-block basis. Where a line
number would normally appear in a report on a function without line
numbers question marks appear instead.
prof does not take into account interactions between basic blocks in
ideal time runs. prof computes the cycles for one execution of a basic
block, assuming all registers are free at entry to the block, and it
multiplies this count by the number of times that basic block is
executed. In real programs, a block may be entered with a result not yet
ready in a register, or with a function unit busy, so the cycle count
computed could be either higher or lower than the correct value.
Extending the computations to include inter-block state would be
prohibitively expensive.
When run on a program using shared libraries, prof sometimes combines the
times of real but anonymous procedures in a shared library into the
preceding (according the library memory layout) externally visible
function. The times reported for your procedures are not affected by
this attribution-error, which is normally minor.
prof cannot be run on programs which have been stripped.
Compiler optimization level 3 does procedure inlining. This can result
in extremely misleading profiles since the time spent in the inlined
procedure shows up in the profile as time spent in the procedure into
which it was inlined. It is generally better to use compiler
optimization level 2 or less when gathering an execution profile.
Fortran alternate entry point times are attributed to the main
function/subroutine, since there is no general way for prof to separate
the times for the alternate entries.
Page 11
PROF(1) PROF(1)
speedshop(1), ssrun(1), ssdump(1), sscord(1), ssorder(1), pixie(1),
fbdump(1)
PPPPaaaaggggeeee 11112222 [ Back ]
|