prof_intro - Introduction to application profilers, profiling,
optimization, and performance analysis
Tru64 UNIX supports four approaches to performance
improvement:

  Automatic and profile-directed optimizations. For example:

       pixie -update a.out data/*
       cc -non_shared -O3 -spike -feedback a.out *.c

  Manual design and code optimizations. For example:

       hiprof -all -display program data/* | more
       hiprof -flat -all -display program data/* | more
       uprofile -heavy program data/* | more

  Minimizing system-resource usage. For example:

       third -display program data/* | more

  Verifying significance of test cases. For example:

       pixie -testcoverage program data/* | more

One approach might be enough, but more might be beneficial
if no single approach addresses all aspects of a program's
performance. The following sections describe each approach
and the tools provided by Tru64 UNIX to support them.
AUTOMATIC AND PROFILE-DIRECTED OPTIMIZATIONS
Techniques
Automatic and profile-directed optimizations are the simplest
approaches to improving application performance.
Some degree of automatic optimization can be achieved by
using the compiler's and linker's optimization options.
These can help in the generation of minimal instruction
sequences that make best use of the CPU architecture and
cache memory.
However, the compiler and linker can improve their optimizations
if they are given information on which instructions
are executed most often when the program is run with
its normal input data and environment. While the default
optimizations give improved performance for most common
situations, the optimizers can do even better if they can
tune the program in favor of the heavily used instruction
sequences as determined from a sample run.
Tru64 UNIX helps you provide the optimizers with this
information on processing hot-spots by allowing a profiler's
results to be fed back into a recompilation. This
customized, profile-directed optimization can be used in
conjunction with automatic optimization.
Tools and Examples
The cc compiler command's automatic optimization options
are selected with -O, -fast, -inline, -spike, and other
related options. See cc(1) for details and Chapter 10 of
the Programmer's Guide for more information on the many
options and tradeoffs available.
For example, this command selects a high degree of optimization
in both the compiler and the linker:

     cc -non_shared -O3 -spike *.c

The pixie profiler provides profile information that the
cc command's -feedback and -spike options can use to tune
the generated instruction sequences to the demands placed
on the program by particular sets of input data.
The steps, shown in the following example, consist of (1)
preparing the program for profile-directed optimization,
(2) creating an instrumented version of the program and
running it to collect profiling statistics, and (3) feeding
that information back to the compiler and linker to
help them optimize the executable code:

     rm -f program
     cc -non_shared -feedback program -o program -O3 *.c
     pixie -update program
     cc -non_shared -feedback program -o program -O3 -spike *.c

To apply profile-directed optimizations to shared
libraries, generate profile data with an exerciser program,
and store it in the shared library prior to recompiling
with that feedback. For example:

     rm -f libexample.so
     cc -feedback libexample.so -o libexample.so -shared -O3 lib*.c
     cc -o exerciser exerciser.c -L. -lexample
     pixie -L. -incobj libexample.so -run exerciser
     prof -pixie -update libexample.so exerciser.Counts
     cc -spike -feedback libexample.so -o libexample.so -shared -O3 lib*.c

MANUAL DESIGN AND CODE OPTIMIZATIONS
Techniques
The effectiveness of the automatic optimizations described
previously is limited by the efficiency of the algorithms
that the program uses. A program's performance can be further
improved by manually optimizing its algorithms and
data structures. Such optimizations may include reducing
complexity from N-squared to log-N, avoiding copying of
data, and reducing the amount of data used. It may also
extend to tuning the algorithm to the architecture of the
particular machine it will be run on - for example, processing
large arrays in small blocks such that each block
remains in the data cache for all processing, instead of
the whole array being read into the cache for each processing
phase.
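
The blocking technique described above can be sketched in C as
follows. This is an illustration only: the function names, the
three processing phases, and the BLOCK_ELEMS value are hypothetical,
and a suitable block size depends on the data cache of the target
machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size (an assumption for illustration): small
 * enough that one block of doubles stays in the data cache. */
#define BLOCK_ELEMS 1024

/* Unblocked: each phase walks the whole array, so a large array is
 * read into the cache once per phase. */
double process_unblocked(double *a, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(a != NULL || n == 0);
    for (i = 0; i < n; i++)          /* phase 1: scale */
        a[i] *= 2.0;
    for (i = 0; i < n; i++)          /* phase 2: offset */
        a[i] += 1.0;
    for (i = 0; i < n; i++)          /* phase 3: sum */
        sum += a[i];
    return sum;
}

/* Blocked: all phases run on one block before moving to the next,
 * so each block is read into the cache once for all processing. */
double process_blocked(double *a, size_t n)
{
    double sum = 0.0;
    size_t start, end, i;

    assert(a != NULL || n == 0);
    for (start = 0; start < n; start += BLOCK_ELEMS) {
        end = (n - start > BLOCK_ELEMS) ? start + BLOCK_ELEMS : n;
        for (i = start; i < end; i++)    /* phase 1: scale */
            a[i] *= 2.0;
        for (i = start; i < end; i++)    /* phase 2: offset */
            a[i] += 1.0;
        for (i = start; i < end; i++)    /* phase 3: sum */
            sum += a[i];
    }
    return sum;
}
```

Both functions visit the elements in the same order and produce
the same result; only the memory access pattern differs. Whether
the blocked form is actually faster should be confirmed with a
profile of the real program.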
Tru64 UNIX supports manual optimization with its profiling
tools, which identify the parts of the application that
use most CPU resources - CPU cycles, cache misses, and so
on. By evaluating different profiles of a program, you can
identify which parts of the program use most CPU resources
and you can then redesign or recode algorithms in those
parts to use fewer resources. The profiles also make this
exercise more cost-effective by helping you to focus on
the most demanding code instead of the least demanding
code.
Tools and Examples
A call-graph profile shows how much CPU time is used by
each procedure, and how much is used by all of the other
procedures that it calls. This can show which phases or
subsystems in a program spend most of the total CPU time,
which can help in gaining a general understanding of the
program's performance.
The hiprof profiler instruments the program and records a
call graph while the instrumented program executes. The
hiprof profiler does not require that the program be compiled
in any particular way, but the names of local (for
example, static) procedures will be hidden if the cc command's
default -g0 option was used, and procedures will be
hidden if they are inlined. For example:

     cc -g1 -O2 -o program *.c
     hiprof -all -display program data/* | more

By default, hiprof uses a low-frequency sampling technique.
It can profile all of the code executed by the program,
including all selected libraries, though its call
graph excludes procedures in threads-related system
libraries. It can also provide detailed profiles at the
level of source lines or machine instructions.
For non-threaded programs, hiprof can alternatively count
the number of machine cycles used or page faults that
occur during program execution. In these modes, the CPU
time or page-faults count reported for the instrumented
routines includes that for the uninstrumented routines
that they call. This can summarize the costs and reduce
the run-time overhead, but note that the machine-cycle
counter wraps if no instrumented procedure is called at
least every few seconds.
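
The wrap interval of a cycle counter can be estimated with simple
arithmetic: it is the number of distinct counter values divided by
the clock rate. The sketch below assumes a 32-bit counter and a
500 MHz clock. Both figures are illustrative assumptions, not
values stated in this reference page, and wrap_seconds is a
hypothetical helper.

```c
#include <assert.h>

/* Seconds until a cycle counter of the given width wraps when it
 * advances once per clock cycle at clock_hz cycles per second. */
double wrap_seconds(unsigned bits, double clock_hz)
{
    double counts = 1.0;
    unsigned i;

    assert(clock_hz > 0.0);
    for (i = 0; i < bits; i++)       /* counts = 2^bits */
        counts *= 2.0;
    return counts / clock_hz;
}
```

Under these assumptions, wrap_seconds(32, 500e6) is roughly 8.6
seconds, which is consistent with the "every few seconds" guidance
above; a faster clock shortens the interval proportionally.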
The cc compiler's -pg option uses the same sampling technique
as hiprof. This technique is supported in a very
similar way on different vendors' UNIX systems. For example:

     cc -g1 -O2 -pg -o program *.c
     ./program data/*
     gprof program gmon.out | more

However, hiprof may be preferred because the -pg option
has some disadvantages:

  The program needs to be specially compiled with the -pg
  option.

  Only a few of the archive libraries that are provided with
  the operating system were compiled to generate a call-graph
  profile.

  Only the executable is profiled. Shared libraries are not.

The optional dxprof command provides a graphical display
of various call-graph profiles.
A good performance-improvement strategy may start with a
procedure-level profile of the whole program (perhaps with
a call graph too, to give the big picture), but it will
often progress to detailed profiling of individual source
lines and instructions.
The uprofile profiler uses a sampling technique to generate
a profile of the CPU time or events such as cache
misses associated with each procedure or source-line or
instruction. The sampling frequency depends on the processor
type and the statistic being sampled, but for CPU time
it is on the order of a millisecond. The profiler
achieves this without modifying the target program at all
by using hardware counters that are built into the Alpha
CPU. Running the uprofile command with no arguments
yields a list of all the kinds of events that a particular
machine can profile, depending on the nature of its architecture.
The default is to profile machine cycles, resulting
in a CPU-time profile. The following example shows how
to display a profile of the source lines that experienced
the top 90% of data cache misses on an EV56 Alpha:

     cc -g1 -O2 -o program *.c
     uprofile -h -q 90cum% dcacheldmisses program data/* | more

This technique has the advantage of very low run-time
overhead. Also, the detailed information it can provide on
the costs of executing individual instructions or source
lines is essential in identifying exactly which operation
in a procedure is slowing down the program.
The disadvantages of uprofile are that only executables
can be profiled, the results can be skewed unless all processors
have the same cycle speed, only one program can be
profiled with the hardware counters at one time, threads
cannot be profiled individually, and the Alpha EV6 architecture's
execution of instructions out of sequence can
significantly reduce the accuracy of fine-grained profiles.
If hiprof's -flat option is used, its default sampling
technique can provide the same fine-grain profiles (CPU
time only) and low intrusiveness as uprofile. Also, it is
accurate even with mixed processor cycle speeds, and it
can profile all of a program's shared libraries as well as
its individual threads. For example:

     hiprof -flat -h -all program data/* | more

The cc compiler's -p option uses the same low-frequency
sampling technique as hiprof. It is common to many UNIX
systems, and (on Tru64 UNIX) it is able to profile all the
shared libraries used by a program. The program needs to
be relinked with the -p option, but it does not need to be
recompiled from source, so long as the original compilation
used an acceptable debug level, such as the -g1 compiler
option. For example, to profile individual instructions
of a program:

     cc -p -o program *.o
     setenv PROFFLAGS '-all -stride 1'
     ./program data/*
     prof -all -asm -quit 5% program mon.out | more

The pixie tool can also profile source lines and instructions
(including shared libraries), but note that when it
displays counts of "Cycles", it is actually reporting
counts of instructions executed, not machine cycles. For
example:

     cc -g1 -O2 -o program *.c
     pixie -all -lines -quit 20 program data/* | more

The optional dxprof command provides a graphical display
of profiles collected by either pixie or the cc command's
-p option.
MINIMIZING SYSTEM RESOURCE USAGE
Techniques
The preceding techniques can improve an application's use
of just the CPU. Further performance improvements can be
made by improving the efficiency with which the application
uses the other components of the computer system:
heap memory, disk files, network connections, and so on.
As with CPU profiling, the first phase of a resource usage
improvement process is to monitor how much memory, data
I/O and disk space, elapsed time, and so on, is used. Then
the throughput of the computer can be increased or tuned
in ways that help the program, or the program's design can
be tuned to make better use of the computer resources that
are available. For example:

  Reduce the size of the data files that the program reads
  and writes.

  Use memory-mapped files instead of regular I/O.

  Allocate memory incrementally on demand instead of
  allocating at start-up the maximum that could be required.

  Fix heap leaks, and do not leave allocated memory unused.

See the System Configuration and Tuning manual for a broader
discussion of analyzing and tuning a Tru64 UNIX system.
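
As one illustration of allocating memory incrementally on demand,
the following C sketch grows a buffer only when an append needs
more room. The struct buf type, the buf_* function names, the
initial capacity of 64 bytes, and the doubling policy are all
hypothetical choices made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; nothing is allocated until
 * the first append. */
struct buf {
    char  *data;
    size_t len;    /* bytes in use */
    size_t cap;    /* bytes allocated */
};

void buf_init(struct buf *b)
{
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}

/* Append n bytes, growing the allocation on demand. Returns 0 on
 * success, or -1 on failure with the buffer left unchanged. */
int buf_append(struct buf *b, const void *src, size_t n)
{
    assert(b != NULL);
    if (n == 0)
        return 0;
    assert(src != NULL);
    if (n > (size_t)-1 - b->len)     /* len + n would overflow */
        return -1;
    if (n > b->cap - b->len) {
        size_t need = b->len + n;
        size_t cap = b->cap ? b->cap : 64;
        char *p;

        while (cap < need) {
            if (cap > (size_t)-1 / 2) {   /* doubling would overflow */
                cap = need;
                break;
            }
            cap *= 2;
        }
        p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Release the allocation so that no memory is left unused. */
void buf_free(struct buf *b)
{
    free(b->data);
    buf_init(b);
}
```

Doubling the capacity keeps the number of reallocations small
while still avoiding a maximum-size allocation at start-up; a
program with tighter memory limits might grow in smaller steps.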
Tools and Examples
The Tru64 UNIX base system commands ps u, swapon -s, and
vmstat 3 can show the currently active processes' usage of
system resources such as CPU time, physical and virtual
memory, swap space, page faults, and so on.
The optional pview command provides a graphical display of
similar information for the processes that comprise an
application.
The time commands provided by the Tru64 UNIX system and
command shells provide an easy way to measure the total
elapsed time and CPU time for a program and its descendants.
The collect tool is an optional, low-overhead system performance
monitor.
Many other related commands are described in the System
Configuration and Tuning manual.
The third command reports heap memory leaks in a program,
by instrumenting it with the Third Degree memory-usage
checker, running it, and displaying a log of leaks
detected at program exit. For example:

     third -display program data/* | more

If you are interested only in leaks occurring during the
normal operation of the program, not during startup or
shutdown, you can specify additional places to check for
previously unreported leaks. For example, the pre-shutdown
leak report will give this information:

     third -display -after startup -before shutdown program data/* | more

Third Degree can also detect various kinds of bugs that
may be affecting the correctness or performance of a program.
See the Programmer's Guide for further details on
debugging and leak-detection.
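
As an illustration of the kind of defect a heap leak checker is
meant to find, the following C sketch shows a function that
abandons an allocation on an early return, followed by a corrected
version. The functions and their behavior are hypothetical and
written for this example; they are not taken from any Tru64 UNIX
library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Leaky: counts the digits in s, or returns -1 if s contains '#'.
 * The early return abandons the working copy. */
int count_digits_leaky(const char *s)
{
    size_t n, i;
    int count = 0;
    char *tmp;

    assert(s != NULL);
    n = strlen(s);
    tmp = malloc(n + 1);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, s, n + 1);
    for (i = 0; i < n; i++) {
        if (tmp[i] == '#')
            return -1;               /* BUG: tmp is never freed */
        if (tmp[i] >= '0' && tmp[i] <= '9')
            count++;
    }
    free(tmp);
    return count;
}

/* Fixed: every path leaves through the single free() call. */
int count_digits_fixed(const char *s)
{
    size_t n, i;
    int count = 0;
    char *tmp;

    assert(s != NULL);
    n = strlen(s);
    tmp = malloc(n + 1);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, s, n + 1);
    for (i = 0; i < n; i++) {
        if (tmp[i] == '#') {
            count = -1;
            break;
        }
        if (tmp[i] >= '0' && tmp[i] <= '9')
            count++;
    }
    free(tmp);
    return count;
}
```

Both versions return the same values, so ordinary functional
testing does not reveal the difference; only the heap usage
differs when the error path is taken.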
The optional dxheap command provides a graphical display
of Third Degree's heap and bug reports.
The optional mview command provides a graphical analysis
of heap usage over time. This view of a program's heap can
clearly show the presence (if not the cause) of significant
leaks or other undesirable trends such as wasted
memory.
VERIFYING SIGNIFICANCE OF TEST CASES
Techniques
Most of the preceding profiling techniques are effective
only if you profile and optimize or tune the parts of the
program that are executed in the scenarios whose performance
is important. Careful selection of the data used for
the profiled test runs is often sufficient, but you may
want a quantitative analysis of which code was and was not
executed in a given set of tests.
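
The idea of a quantitative coverage analysis can be sketched with
hand-written counters in C. This is an illustration of the concept
only, with hypothetical names throughout; it is not how any of the
profilers described in this reference page are implemented.

```c
#include <assert.h>

/* One counter per path through sign_of(). */
enum { PATH_NEGATIVE, PATH_ZERO, PATH_POSITIVE, PATH_COUNT };

static unsigned long hits[PATH_COUNT];

/* Returns -1, 0, or 1 and records which path was executed. */
int sign_of(int x)
{
    if (x < 0) {
        hits[PATH_NEGATIVE]++;
        return -1;
    }
    if (x == 0) {
        hits[PATH_ZERO]++;
        return 0;
    }
    hits[PATH_POSITIVE]++;
    return 1;
}

/* Number of paths that no test input has executed so far. */
int unexecuted_paths(void)
{
    int i, missing = 0;

    for (i = 0; i < PATH_COUNT; i++)
        if (hits[i] == 0)
            missing++;
    return missing;
}
```

A test set containing only positive inputs leaves two of the
three paths unexecuted, so any profile taken from that test set
says nothing about the cost of the other two.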
Tools and Examples
The pixie command's -t[estcoverage] option reports lines
of code that were not executed in a given test run. For
example:

     pixie -t program data/* | more

Conversely, pixie's -p[rocedure], -h[eavy], and -a[sm]
options show which procedures, source lines, and instructions
were executed.
If multiple test runs are needed to build up a typical
scenario, the prof command can be run separately on a set
of profile data files:

     pixie -pids program
     ./program.pixie data1/*
     ./program.pixie data2/*
     prof -pixie -t program program.Counts.*

SEE ALSO
Optimizing: cc(1), spike(1)
Profiling: hiprof(1), pixie(1), third(1), uprofile(1)
System Monitoring: collect(8), ps(1), swapon(1),
vmstat(1)
Graphical tools, available from the Graphical Program
Analysis subset of the Tru64 UNIX Associated Products
installation media, or as part of the Enterprise Toolkit
for Windows/NT desktops with Microsoft's Visual Studio 97:
dxheap(1), dxprof(1), mview(1), pview(1)
Programmer's Guide
System Configuration and Tuning
prof_intro(1)