perfex - IRIX


PERFEX(1)							     PERFEX(1)

NAME

     perfex - Command line interface to	processor event	counters

SYNOPSIS

     perfex [-a	| -e event0 [-e	event1]] [-mp |	-s | -p] [-x] [-k] [-y]	[-t]
     [-T] [-o file] [-c	file] command

DESCRIPTION

     The given command is executed; after it is	complete, perfex prints	the
     values of various hardware	performance counters.  The counts returned are
     aggregated	over all processes that	are descendants	of the target command,
     as	long as	their parent process controls the child	through	wait (see
     wait(2)).

     The R10000	event counters are different from R12000 event counters.  See
     the r10k_counters(5) man page for differences.  For R10000	CPUs, the
     integers event0 and event1	index the following table:
	  0 = Cycles
	  1 = Issued instructions
	  2 = Issued loads
	  3 = Issued stores
	  4 = Issued store conditionals
	  5 = Failed store conditionals
	  6 = Decoded branches.	 (This changes meaning in 3.x
		 versions of R10000.  It becomes resolved branches).
	  7 = Quadwords	written	back from secondary cache
	  8 = Correctable secondary cache data array ECC errors
	  9 = Primary (L1) instruction cache misses
	  10 = Secondary (L2) instruction cache	misses
	  11 = Instruction misprediction from secondary	cache way prediction table
	  12 = External	interventions
	  13 = External	invalidations
	  14 = Virtual coherency conditions.  (This changes meaning in 3.x
		 versions of R10000.  It becomes ALU/FPU forward progress
		 cycles.  On the R12000, this counter is always	0).
	  15 = Graduated instructions
	  16 = Cycles
	  17 = Graduated instructions
	  18 = Graduated loads
	  19 = Graduated stores
	  20 = Graduated store conditionals
	  21 = Graduated floating point	instructions
	  22 = Quadwords written back from primary data	cache
	  23 = TLB misses
	  24 = Mispredicted branches
	  25 = Primary (L1) data cache misses
	  26 = Secondary (L2) data cache misses
	  27 = Data misprediction from secondary cache way prediction table
	  28 = External	intervention hits in secondary cache (L2)
	  29 = External	invalidation hits in secondary cache
	  30 = Store/prefetch exclusive to clean block in secondary cache
	  31 = Store/prefetch exclusive to shared block in secondary cache

     For R12000	CPUs, the integers event0 and event1 index the following
     table:
	  0 = Cycles
	  1 = Decoded instructions
	  2 = Decoded loads
	  3 = Decoded stores
	  4 = Miss handling table occupancy
	  5 = Failed store conditionals
	  6 = Resolved conditional branches
	  7 = Quadwords	written	back from secondary cache
	  8 = Correctable secondary cache data array ECC errors
	  9 = Primary (L1) instruction cache misses
	  10 = Secondary (L2) instruction cache	misses
	  11 = Instruction misprediction from secondary	cache way prediction table
	  12 = External	interventions
	  13 = External	invalidations
	  14 = ALU/FPU progress	cycles.	 (This counter in current versions of R12000
		 is always 0).
	  15 = Graduated instructions
	  16 = Executed	prefetch instructions
	  17 = Prefetch	primary	data cache misses
	  18 = Graduated loads
	  19 = Graduated stores
	  20 = Graduated store conditionals
	  21 = Graduated floating-point	instructions
	  22 = Quadwords written back from primary data	cache
	  23 = TLB misses
	  24 = Mispredicted branches
	  25 = Primary data cache misses
	  26 = Secondary data cache misses
	  27 = Data misprediction from secondary cache way prediction table
	  28 = State of	intervention hits in secondary cache (L2)
	  29 = State of	invalidation hits in secondary cache
	  30 = Store/prefetch exclusive	to clean block in secondary cache
	  31 = Store/prefetch exclusive	to shared block	in secondary cache

BASIC OPTIONS

     -e	event	       Specify an event	to be counted.

		       Zero, one, or two event specifiers may be given; by
		       default, cycles are counted.  Events may also be
		       specified by setting one	or both	of the environment
		       variables T5_EVENT0 and T5_EVENT1. Command line event
		       specifiers, if present, override	the environment
		       variables. The order of events specified	is not
		       important.  The counts, together	with an	event
		       description, are	written	to stderr unless redirected
		       with the	-o option. Two events that must	be counted on
		       the same	hardware counter (see r10k_counters(5))	will
		       cause a conflicting counters error.



     -a		       Multiplexes over	all events, projecting totals. Ignores
		       event specifiers.

		       The option -a produces counts for all events by
		       multiplexing over 16 events per counter.	The OS does
		       the switching round robin at clock interrupt
		       boundaries. The resulting counts	are normalized by
		       multiplying by 16 to give an estimate of	the values
		       they would have had for exclusive counting. Due to the
		       equal-time nature of the	multiplexing, events present
		       in large	enough numbers to contribute significantly to
		       the execution time will be fairly represented. Events
		       concentrated in a few short regions (for	instance,
		       instruction cache misses) may not be projected very
		       accurately.
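
		       The normalization described above can be sketched as
		       follows. This is only an illustration of the
		       multiply-by-16 projection, not perfex's own code; the
		       function name and the sample counts are hypothetical.

```python
EVENTS_PER_COUNTER = 16  # from the text: -a multiplexes over 16 events per counter


def project_counts(raw_counts):
    """Scale counts observed while multiplexing to estimated exclusive counts.

    raw_counts maps an event number to the count actually observed during
    the fraction of time that event was being counted.
    """
    return {event: count * EVENTS_PER_COUNTER for event, count in raw_counts.items()}


# Hypothetical raw counts for event 0 (cycles) and event 26 (L2 data cache misses).
raw = {0: 1_000_000, 26: 1_250}
projected = project_counts(raw)
print(projected)
```

		       Because each event is observed for only about 1/16 of
		       the run, a burst of events that falls outside its time
		       slices is missed entirely, which is why short, bursty
		       events project poorly.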


     -mp	       Report per-thread counts	for multiprocessing programs
		       as well as (default) totals.

		       By default, perfex aggregates the counts	of all the
		       child threads and reports this number for each selected
		       event. The -mp option causes the	counters for each
		       thread to be collected at thread	exit time and printed
		       out; the	counts aggregated across all threads are
		       printed next.  The per-thread counts are	labeled	by
		       process ID (pid).



     -o	file	       Redirects perfex	output to the specified	file.

		       In the -mp case,	the file name includes the pid of the
		       sproc child thread.


     -s		       Starts (or stops) counting when a SIGUSR1 (or SIGUSR2)
		       signal is received by a perfex process.

		       This option causes perfex to wait until it (that is,
		       the perfex process) receives a SIGUSR1 before it starts
		       counting for the child process (the target). It stops
		       counting if it receives a SIGUSR2. Repeated start/stop
		       cycles aggregate counts. If no SIGUSR2 is received (the
		       usual case), counting continues until the child exits.
		       Counting for descendants of the child is not affected,
		       so counting for mp programs cannot be controlled with
		       this option.


     -p period	       Profiles (samples) the counters with the given period.
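
		       The SIGUSR1/SIGUSR2 control used with -s can be driven
		       from a small script. The sketch below assumes you
		       already know the process ID of the waiting perfex
		       process; the helper names are hypothetical, and only
		       the standard os.kill call is used.

```python
import os
import signal


def start_counting(perfex_pid):
    """Send SIGUSR1 to the perfex process so that it starts counting."""
    os.kill(perfex_pid, signal.SIGUSR1)


def stop_counting(perfex_pid):
    """Send SIGUSR2 to the perfex process so that it stops counting."""
    os.kill(perfex_pid, signal.SIGUSR2)
```

		       The signals go to perfex itself, not to the target
		       command, so the pid passed in must be that of perfex.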



     -x		       Counts at exception level (as well as the default user
		       level).

		       Exception level includes	time spent on behalf of	the
		       user during, for	example, TLB refill exceptions.	 Other
		       counting	modes (kernel, supervisor) are available
		       through the OS ioctl interface (see r10k_counters(5) ).


     -k		       Counts at kernel level (as well as user and exception
		       level, if set).  Requires superuser privileges.

EXAMPLE

     To collect secondary instruction and data cache miss counts for a
     program normally executed as

	% bar < bar.in > bar.out

     run

	% perfex -e 26 -e 10 bar < bar.in > bar.out
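
     When perfex is launched from a script, the same command line can be
     assembled programmatically. The helper below is a hypothetical sketch:
     it only builds the argument list and applies the limits stated on this
     page (at most two events, numbered 0-31). The same-counter check
     assumes the R10000 layout, in which events 0-15 share one counter and
     events 16-31 the other; see r10k_counters(5) for the authoritative
     mapping, which differs on the R12000.

```python
def perfex_argv(events, command):
    """Return an argv list that runs `command` under perfex.

    events  -- zero, one, or two event numbers (0-31)
    command -- the target command and its arguments, as a list of strings
    """
    if len(events) > 2:
        raise ValueError("perfex accepts at most two -e event specifiers")
    for ev in events:
        if not 0 <= ev <= 31:
            raise ValueError(f"event {ev} is outside the documented range 0-31")
    # Assumption (R10000 only): events 0-15 use one counter, 16-31 the other.
    if len(events) == 2 and events[0] // 16 == events[1] // 16:
        raise ValueError("both events map to the same hardware counter")
    argv = ["perfex"]
    for ev in events:
        argv += ["-e", str(ev)]
    return argv + list(command)


print(perfex_argv([26, 10], ["bar"]))
```

     The resulting list can be passed to subprocess.run together with
     stdin and stdout file objects for bar.in and bar.out.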

COST ESTIMATE OPTIONS

     -y	  Report statistics and	ranges of estimated times per event.

	  Without the -y option, perfex	reports	the counts recorded by the
	  event	counters for the events	requested. Since they are simply raw
	  counts, it is	difficult to know by inspection	which events are
	  responsible for significant portions of the job's run	time. The -y
	  option associates time cost with some	of the event counts.

	  The reported times are approximate.  Due to the superscalar nature
	  of the R10000	and R12000 CPUs, and their ability to hide latency,
	  stating a precise cost for a single occurrence of many of the	events
	  is not possible. Cache misses, for example, can be overlapped	with
	  other	operations, so there is	a wide range of	times possible for any
	  cache	miss.

	  To account for the fact that the cost	of many	events cannot be known
	  precisely, perfex -y reports a range of time costs for each event.
	  "Maximum," "minimum,"	and "typical" time costs are reported. Each is
	  obtained by consulting an internal table that	holds the maximum,
	  minimum, and typical costs for each event, and multiplying this cost
	  by the count for the event. Event costs are usually measured in
	  terms	of machine cycles, and so the cost of an event generally
	  depends on the clock speed of	the processor, which is	also reported
	  in the output.
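
	  The calculation described above (per-event cost multiplied by
	  the event count, converted to time using the clock speed) can be
	  sketched as follows. The cost figures and clock speed are
	  hypothetical inputs chosen for illustration; they are not values
	  from perfex's internal table.

```python
def estimate_times(count, cost_cycles, clock_mhz):
    """Return (minimum, typical, maximum) estimated seconds for one event.

    count       -- number of occurrences reported for the event
    cost_cycles -- (minimum, typical, maximum) cost per occurrence, in cycles
    clock_mhz   -- processor clock speed in MHz
    """
    seconds_per_cycle = 1.0 / (clock_mhz * 1e6)
    return tuple(count * cost * seconds_per_cycle for cost in cost_cycles)


# Hypothetical event: 2,000,000 occurrences costing 1, 10, or 50 cycles each
# on a 200 MHz processor.
minimum, typical, maximum = estimate_times(2_000_000, (1.0, 10.0, 50.0), 200.0)
print(minimum, typical, maximum)
```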

	  The maximum value contained in the table corresponds to the worst
	  case cost of a single occurrence of the event. Sometimes this can be
	  a very pessimistic estimate. For example, the	maximum	cost for
	  graduated floating-point instructions	assumes	that all such
	  instructions are double precision reciprocal square roots, since
	  that is the most costly floating-point instruction.

	  Due to the latency-hiding capabilities of the	CPUs, the minimum cost
	  of virtually any event could be zero,	since most events can be
	  overlapped with other	operations. To avoid simply reporting minimum
	  costs	of 0, which would be of	no practical use, the minimum time
	  reported by perfex -y	corresponds to the "best case" cost of a
	  single occurrence of the event. The best case	cost is	obtained by
	  running the maximum number of	simultaneous occurrences of that event
	  and averaging	the cost. For example, two floating-point instructions
	  can complete per cycle, so the best case cost	on the R10000 is 0.5
	  cycles per floating-point instruction.

	  The typical cost falls somewhere between minimum and maximum and is
	  meant	to correspond to the cost one would expect to see in average
	  programs. For	example, to measure the	typical	cost of	a cache	miss,
	  stride-1 accesses to an array	too big	to fit in cache	were timed,
	  and the number of cache misses generated was counted.	The same
	  number of stride-1 accesses to an in-cache array were	then timed.
	  The difference in times corresponds to the cost of the cache misses,
	  and this was used to calculate the average cost of a cache miss.
	  This typical cost is lower than the worst case in which each cache
	  miss cannot be overlapped, and it is higher than the best case, in
	  which	several	independent, and hence,	overlapping, cache misses are
	  generated.  (Note that on Origin systems, this methodology yields
	  the time for secondary cache misses to local memory only.)
	  Naturally, these typical costs are somewhat arbitrary.  If they do
	  not seem right for the application being measured by perfex, they
	  can be replaced by user-supplied values. See the -c option below.

	  perfex -y prints the event counts and	associated cost	estimates
	  sorted from most costly to least costly. While resembling a
	  profiling output, it is not a	true profile. The event	costs reported
	  are only estimates. Furthermore, since events	do overlap with	each
	  other, the sum of the	estimated times	will usually exceed the
	  program's run	time.  This output should only be used to identify
	  which	events are responsible for significant portions	of the
	  program's run	time and to get	a rough	idea of	what those costs might
	  be.

	  With this in mind, the built-in cost table does not make an attempt
	  to provide detailed costs for	all events. Some events	provide
	  summary or redundant information. These events are assigned minimum
	  and typical costs of 0, so that they sort to the bottom of the
	  output.  The maximum costs are set to	1 cycle, so that you can get
	  an indication	of the time corresponding to these events.  Issued
	  instructions and graduated instructions are examples of such events.
	  In addition to these summary or redundant events, detailed cost
	  information has not been provided for a few other events, such as
	  external interventions and external invalidations, since it is
	  difficult to assign costs to these asynchronous events. The built-in
	  cost values may be overridden	by user-supplied values	using the -c
	  option.

	  In addition to the event counts and cost estimates, perfex -y also
	  reports a number of statistics derived from the typical costs. The
	  meaning of many of the statistics is self-evident (for example,
	  graduated instructions/cycle). The following are statistics whose
	  definitions require more explanation.	 These are available with both
	  R10000 and R12000 CPUs.


     Data mispredict/Data secondary cache hits

	  This is the ratio of the counts for data misprediction from
	  secondary cache way prediction table and secondary data cache
	  misses.


     Instruction mispredict/Instruction	secondary cache	hits

	  This is the ratio of the counts for instruction misprediction	from
	  secondary cache way prediction table and secondary instruction cache
	  misses.


     Primary cache line	reuse

	  This is the number of times, on average, that a primary data cache
	  line is used after it	has been moved into the	cache. It is
	  calculated as	graduated loads	plus graduated stores minus primary
	  data cache misses, all divided by primary data cache misses.


     Secondary Cache Line Reuse

	  This is the number of times, on average, that a secondary data cache
	  line is used after it	has been moved into the	cache. It is
	  calculated as	primary	data cache misses minus	secondary data cache
	  misses, all divided by secondary data	cache misses.
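
	  The two line-reuse formulas above translate directly into code.
	  This is a sketch with hypothetical function names; the arguments
	  are the raw counts for the corresponding events.

```python
def primary_line_reuse(graduated_loads, graduated_stores, l1_data_misses):
    """(graduated loads + graduated stores - L1 data misses) / L1 data misses."""
    return (graduated_loads + graduated_stores - l1_data_misses) / l1_data_misses


def secondary_line_reuse(l1_data_misses, l2_data_misses):
    """(L1 data misses - L2 data misses) / L2 data misses."""
    return (l1_data_misses - l2_data_misses) / l2_data_misses


# Hypothetical counts: 700 loads, 300 stores, 100 L1 misses, 20 L2 misses.
print(primary_line_reuse(700, 300, 100))
print(secondary_line_reuse(100, 20))
```

	  Both functions divide by a miss count, so they are undefined
	  when that count is zero.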

     Primary Data Cache	Hit Rate

	  This is the fraction of data accesses	that are satisfied from	a
	  cache	line already resident in the primary data cache. It is
	  calculated as	1.0 - (primary data cache misses divided by the	sum of
	  graduated loads and graduated stores).

     Secondary Data Cache Hit Rate

	  This is the fraction of data accesses	that are satisfied from	a
	  cache	line already resident in the secondary data cache. It is
	  calculated as	1.0 - (secondary data cache misses divided by primary
	  data cache misses).
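
	  The hit-rate definitions above can be written as the following
	  sketch. The function names are hypothetical and the sample counts
	  are made up for illustration.

```python
def primary_data_hit_rate(graduated_loads, graduated_stores, l1_data_misses):
    """1.0 - L1 data misses / (graduated loads + graduated stores)."""
    return 1.0 - l1_data_misses / (graduated_loads + graduated_stores)


def secondary_data_hit_rate(l1_data_misses, l2_data_misses):
    """1.0 - L2 data misses / L1 data misses."""
    return 1.0 - l2_data_misses / l1_data_misses


# Hypothetical counts: 700 loads, 300 stores, 100 L1 misses, 20 L2 misses.
print(primary_data_hit_rate(700, 300, 100))
print(secondary_data_hit_rate(100, 20))
```

	  Note that the secondary hit rate is relative to the accesses that
	  reach the secondary cache (primary misses), not to all loads and
	  stores.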

     Time accessing memory/Total time

	  This is the sum of the typical costs of graduated loads, graduated
	  stores, primary data cache misses, secondary data cache misses, and
	  TLB misses, divided by the total program run time. The total program
	  run time is calculated by multiplying	cycles by the time per cycle
	  (the inverse of the processor's clock	speed).
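
	  As a sketch of this ratio, assuming the typical time costs have
	  already been computed in seconds for each event (the dictionary
	  keys and the sample values below are hypothetical):

```python
MEMORY_EVENTS = (
    "graduated loads",
    "graduated stores",
    "primary data cache misses",
    "secondary data cache misses",
    "TLB misses",
)


def memory_time_fraction(typical_seconds, cycles, clock_mhz):
    """Sum of typical costs of the memory events divided by total run time.

    typical_seconds -- mapping from event name to typical time cost in seconds
    cycles          -- count for the cycles event
    clock_mhz       -- processor clock speed in MHz
    """
    total_time = cycles / (clock_mhz * 1e6)  # cycles * time per cycle
    memory_time = sum(typical_seconds[name] for name in MEMORY_EVENTS)
    return memory_time / total_time


# Hypothetical run: 200,000,000 cycles at 200 MHz, i.e. 1 second of run time.
costs = {name: 0.1 for name in MEMORY_EVENTS}
print(memory_time_fraction(costs, 200_000_000, 200.0))
```

	  Because event costs overlap, this fraction is an estimate and can
	  exceed 1.0.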

     Primary-to-secondary bandwidth used (MB/s,	average	per process)

	  This is the amount of	data moved between the primary and secondary
	  data caches, divided by the total program run	time. The amount of
	  data moved is	calculated as the sum of the number of primary data
	  cache	misses multiplied by the primary cache line size and the
	  number of quadwords written back from	primary	data cache multiplied
	  by the size of a quadword (16	bytes).	 For multiprocess programs,
	  the resulting	figure is a per-process	average, since the counts
	  measured by perfex are aggregates of the counts for all the threads.
	  You must multiply by the number of threads to	get the	total program
	  bandwidth.

     Memory bandwidth used (MB/s, average per process)

	  This is the amount of	data moved between the secondary data cache
	  and main memory, divided by the total	program	run time. The amount
	  of data moved	is calculated as the sum of the	number of secondary
	  data cache misses multiplied by the secondary	cache line size	and
	  the number of	quadwords written back from secondary data cache
	  multiplied by	the size of a quadword (16 bytes).  For	multiprocess
	  programs, the	resulting figure is a per-process average, since the
	  counts measured by perfex are	aggregates of the counts for all the
	  threads. You must multiply by	the number of threads to get the total
	  program bandwidth.
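
	  Both bandwidth statistics follow the same pattern, sketched
	  below. The cache line sizes used in the example (32 bytes for the
	  primary data cache, 128 bytes for the secondary cache) are
	  assumptions that depend on the system, and 1 MB is taken here to
	  be 10^6 bytes, which may not match perfex's own convention.

```python
QUADWORD_BYTES = 16  # stated in the text above


def bandwidth_mb_per_s(misses, line_bytes, quadwords_written_back, cycles, clock_mhz):
    """Bytes moved between two cache levels per second of run time, in MB/s.

    misses                 -- cache misses at the level closer to the CPU
    line_bytes             -- line size of that cache (system dependent)
    quadwords_written_back -- quadwords written back from that cache
    """
    run_time = cycles / (clock_mhz * 1e6)
    bytes_moved = misses * line_bytes + quadwords_written_back * QUADWORD_BYTES
    return bytes_moved / run_time / 1e6


# Hypothetical counts for a 1-second run (200,000,000 cycles at 200 MHz).
l1_to_l2 = bandwidth_mb_per_s(4_000_000, 32, 1_000_000, 200_000_000, 200.0)
l2_to_mem = bandwidth_mb_per_s(1_000_000, 128, 500_000, 200_000_000, 200.0)
print(l1_to_l2, l2_to_mem)
```

	  For a multiprocess program, multiply the result by the number of
	  threads to obtain the total program bandwidth, as noted above.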

     MFLOPS (average per process)

	  This is the ratio of the graduated floating-point instructions and
	  the total program run	time. Note that	while a	multiply-add carries
	  out two floating-point operations, it	only counts as one
	  instruction, so this statistic may underestimate the number of
	  floating-point operations per	second.	For multiprocess programs, the
	  resulting figure is a	per-process average, since the counts measured
	  by perfex are	aggregates of the counts for all the threads. You must
	  multiply by the number of threads to get the total program rate.
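
	  A minimal sketch of this rate, with a hypothetical function name
	  and made-up counts:

```python
def mflops(graduated_fp_instructions, cycles, clock_mhz):
    """Graduated floating-point instructions per second, in millions."""
    run_time = cycles / (clock_mhz * 1e6)
    return graduated_fp_instructions / run_time / 1e6


# Hypothetical run: 50,000,000 floating-point instructions in 200,000,000
# cycles at 200 MHz (1 second of run time).
per_process = mflops(50_000_000, 200_000_000, 200.0)
threads = 4  # hypothetical thread count for a multiprocess program
print(per_process, per_process * threads)
```

	  Since a multiply-add counts as one instruction, the true
	  operation rate lies between this figure and twice this figure.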




     The following statistics are computed only	on R12000 CPUs:

     Cache misses in flight per	cycle (average)
	  This is the count of event 4 (Miss Handling Table (MHT) population)
	  divided by cycles.  It can range between 0 and 5 and represents the
	  average number of cache misses of any	kind that are outstanding per
	  cycle.

     Prefetch miss rate
	  This is the count of event 17	(prefetch primary data cache misses)
	  divided by the count of event	16 (executed prefetch instructions).
	  A high prefetch miss rate (about 1) is desirable, since prefetch
	  hits are wasting instruction bandwidth.
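
	  The two R12000-only ratios can be sketched as follows; the
	  function names and sample counts are hypothetical.

```python
def misses_in_flight_per_cycle(mht_occupancy, cycles):
    """Event 4 (miss handling table occupancy) divided by cycles."""
    return mht_occupancy / cycles


def prefetch_miss_rate(prefetch_l1_data_misses, executed_prefetches):
    """Event 17 divided by event 16."""
    return prefetch_l1_data_misses / executed_prefetches


# Hypothetical counts.
print(misses_in_flight_per_cycle(300_000_000, 200_000_000))
print(prefetch_miss_rate(900_000, 1_000_000))
```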

     A statistic is only printed if counts for the events which	define it have
     been gathered.


     -c	file
	  Load a cost table from file (requires	that -y	is specified).

	  This option allows you to override the internal event	costs used by
	  the -y option. file contains the list	of event costs that are	to be
	  overridden. This file must be in the same format as the output
	  produced by the -t option. Costs may be specified in units of "clks"
	  (machine cycles) or "nsec" (nanoseconds). You	can override all or
	  only a subset	of the default costs.

	  You can also use the file /etc/perfex.costs to override event	costs.
	  If this file exists, any costs listed	in it will override those
	  built	into perfex. Costs supplied with the -c	option will override
	  those	provided by the	/etc/perfex.costs file.


     -t	  Print	the cost table used for	perfex -y cost estimates to stdout.

	  These	internal costs can be overridden by specifying different
	  values in the	file /etc/perfex.costs or by using the -c file option.
	  Both file and	/etc/perfex.costs must use the format as provided by
	  the -t option. It is recommended that	you capture this output	to a
	  file and edit	it to create a suitable	file for /etc/perfex.costs or
	  the -c option. You do	not have to specify costs for every event,
	  however.  Lines corresponding	to events with values you do not wish
	  to override may simply be deleted from the file.

MIXED CPU OPTION

     The following is an option for systems with both R10000 and R12000 CPUs.

     -T	  Allows experienced users to use perfex on a system of	mixed CPUs.

     Although perfex cannot verify it, the specification of this option	means
     that you have used	either dplace(1) or some other means to	ensure that
     the program is using either all R10000 CPUs or all	R12000 CPUs.

     When used with this option, the -y	option will not	produce	cost estimates
     due to the	fact that the cost estimation cannot know which	type of	CPU is
     actually targeted.	 Nothing prevents you, however,	from loading a cost
     table with -c.  This cost table could be directly dumped from a
     pure-R10000 or pure-R12000 system, depending on which CPU flavor the
     program is running on.

CHANGE IN BEHAVIOR OF DEFAULT EVENTS

     Because of limitations of ABI/API compliance with IRIX version 6.5/R10000
     in	the operating system counter interface,	it is only possible to count
     cycles and	graduated instructions on counter 0.  Accordingly, when	the
     R12000 user specifies an event in the range 0-15 to perfex, either
     through a -e argument or environment variables, cycles cannot be counted
     simultaneously with that event as they can	on the R10000.	(perfex	only
     multiplexes events	for the	-a option, never for individually specified
     events).  In these	cases perfex will count	event 16 (executed prefetch
     instructions) as the second event.

     For similar reasons, perfex no longer remaps events 0, 15,	16, and	17 to
     fit them on two (R10000) counters,	since that would induce	a different
     behavior for identical arguments on R10000	and R12000 systems. It would
     create problems when mixed-CPU systems are	supported.  To be specific,
     prior to 6.5.3 a user could specify:
     % perfex -e 0 -e 15 a.out

     This would	execute	as if the user had specified:
     % perfex -e 0 -e 17 a.out

     or
     % perfex -e 15 -e 16 a.out

     After IRIX version 6.5.3, this argument combination is an error, and the
     user must decide which of the equivalent (for R10000 only)	forms to use.
     It	is the lack of equivalence for R12000 that makes this regression
     necessary.

FILES

     /etc/perfex.costs

DEPENDENCIES

     perfex only works on an R10000 or R12000 system.  Programs	running	on
     mixed R10000 and R12000 CPUs are not supported, although specifying the -T
     option will permit	you to verify that only	CPUs of	the same type are
     being used.  Usually, perfex prints an informative message and fails on
     mixed CPU systems.

     For the -mp option, only binaries linked-shared are currently supported;
     this is due to a dependency on libperfex.so.  The options -s and -mp are
     currently mutually	exclusive.

LIMITATIONS

     The signal	control	interface (-s) can control only	the immediate target
     process, not any of its descendants.  This	makes it unusable with
     multiprocess targets in their parallel regions.
