NAME

sched_stat - Displays CPU usage and process-scheduling statistics for
SMP and NUMA platforms

SYNOPSIS

/usr/sbin/sched_stat [-l] [-s] [-f] [-u] [-R] [command [cmd_arg]...]
OPTIONS

-f  Prints the count of calls that are not multiprocessor safe and
therefore funneled to the master CPU. For example:

     Funnelling counts
         unix master calls      11174    resulting blocks      2876

The impact of funneled calls on the master CPU needs to be taken into
account when evaluating statistics for the master CPU.

-l  Prints scheduler load-balancing statistics. For example:
     Scheduler Load Balancing
                                                   |    5-second averages
              steal             idle     desired   | current  interrupt    RT
       cpu    trys    steals    steals   load      |  load        %         %
     -----+----------------------------------------+---------------------------
       0  |    288       3      20609    0.000     |  0.000     0.454     0.156
       1  |    615       6      21359    0.000     |  0.000     0.002     0.203
       2  |    996       4      20135    0.000     |  0.001     0.000     0.237
       3  |   1302       4      16195    0.000     |  0.001     0.000     0.330
       6  |      5       0       3029    0.000     |  0.000     0.000     0.034
       . . .
In the displayed table, each row contains per-CPU information as
follows:

cpu           The number identifier of the CPU.

steal trys    The number of attempts made to steal processes/threads
              from other CPUs when the CPU was not idle.

steals        The number of processes/threads actually stolen from
              other CPUs when the CPU was not idle.

idle steals   The number of processes/threads stolen from other CPUs
              when the CPU was idle.

desired load  The number of time slices that should be used on this
              CPU for running timeshare threads. This information is
              calculated by comparing the current load, interrupt %,
              and RT % statistics obtained for this CPU with those
              obtained for other CPUs in the same PAG. When current
              load is less than desired load, the scheduler attempts
              to migrate timeshare threads to this CPU in order to
              better balance the timeshare workload among CPUs in the
              same PAG. See DESCRIPTION for information about PAGs.

current load  Over the last five seconds, the average number of time
              slices used to run timeshare threads on this CPU.

interrupt %   Over the last five seconds, the average percentage of
              time slices that this CPU spent in interrupt context.

RT %          Over the last five seconds, the average percentage of
              time slices that this CPU used to run threads according
              to a FIFO or round-robin policy.
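For example, the following command reports only the load-balancing
statistics for a 60-second sample (the sleep command bounds the
sampling interval, as described later in this reference page):

     # /usr/sbin/sched_stat -l sleep 60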
-R  Prints information about CPU locality in two tables:

Radtab
    Shows the order-of-preference (in terms of memory affinity) that
    exists between a CPU and different RADs. Order-of-preference
    indicates, for a given home RAD, the ranking of other RADs in
    terms of increasing physical distance from that home RAD. If a
    process or thread needs more memory or needs to be scheduled on a
    RAD other than its home RAD, the kernel automatically searches
    RADs for additional memory or CPU cycles in the order of
    preference shown in this table.

Hoptab
    Shows the distance (number of hops) between different RADs and,
    by association, between CPUs. The information in this table is
    coarser-grained than in the preceding Radtab table and more
    relevant to NUMA programming choices. For example, the expression
    RAD_DIST_LOCAL + 2 indicates RADs that are no more than two hops
    from a thread's home RAD.

For example (a small, switchless mesh NUMA system):
     Radtab (rads in order of preference)
                           CPU #
       Preference     0     1     2     3
       ----------------------------------
           0          0     1     2     3
           1          1     0     3     2
           2          2     3     0     1
           3          3     2     1     0

     Hoptab (hops indexed by rad)
                           CPU #
       To rad #       0     1     2     3
       ----------------------------------
           0          0     1     1     2
           1          1     0     2     1
           2          1     2     0     1
           3          2     1     1     0
In these tables, the CPU identifiers are listed
across the top from left to right and the RAD identifiers
are listed on the left from top to bottom.
For example, if a process running on CPU 2 needs
additional memory, Radtab indicates that the kernel
will search for that memory first in RAD 2, then in
RAD 3, then in RAD 0, and last in RAD 1. Hoptab
shows the basis of this preference in that RAD 2 is
CPU 2's local RAD, RADs 0 and 3 are one hop away,
and RAD 1 is two hops away.
The -R option is useful only on NUMA platforms, such as GS1280 and
ES80 AlphaServer systems, in which memory latency times vary from one
RAD to another. The information in these tables is less useful for
GS80, GS160, and GS320 AlphaServer systems because both coarse and
finer-grained memory affinity is the same from any CPU in one RAD to
any CPU in another RAD; however, the displays can tell you which CPUs
are in which RAD.

Make sure that you both maximize the size of your terminal emulator
window and minimize the font size before using the -R option;
otherwise, line-wrapping will render the tables very difficult to read
on systems that have many CPUs.

-s  Prints scheduling-dispatch (processor-usage) statistics for each
CPU. For example:
     Scheduler Dispatch Statistics

     cpu 0        local    global        idle    remote |      total   percent
     --------------------------------------------------------------------------
     hot          60827     12868    19158991         0 |   19232686      91.6
     warm            78        21     1542019         0 |    1542118       7.3
     cold           315     27289      184784      7855 |     220243       1.0
     --------------------------------------------------------------------------
     total        61220     40178    20885794      7855 |   20995047
     percent        0.3       0.2        99.5       0.0 |

     cpu 1        local    global        idle    remote |      total   percent
     --------------------------------------------------------------------------
     hot          33760     11788    16412544         0 |   16458092      89.5
     warm            66        24     1707014         0 |    1707104       9.3
     cold           201     26191      203513         0 |     229905       1.2
     --------------------------------------------------------------------------
     . . .
These statistics show the count and percentage of thread context
switches (times that the kernel switches to a new thread) for the
following categories:

local   Threads scheduled from the CPU's Local Run Queue

global  Threads scheduled from the Global Run Queue of the PAG to
        which the CPU belongs

idle    Threads scheduled from the Idle CPU Queue of the PAG to which
        the CPU belongs

remote  Threads stolen from Global or Local Run Queues in another PAG
Note that these statistics do not count CPU time
slices that were used to re-run the same thread.
Each SMP unit (or RAD on a NUMA system) has a Processor
Affinity Group (PAG). Each PAG contains the
following queues:
  +  A Global Run Queue, from which processes or threads are
     scheduled on the first available CPU

  +  One or more Local Run Queues, from which processes or threads
     are scheduled on a specific CPU

  +  A queue that contains idle CPUs
A thread that is handed to an idle CPU goes
directly to that CPU without first being placed on
the other queues.
If there is insufficient work queued locally to
keep the PAG's CPUs busy, threads are stolen first
from the Global and then the Local Run Queues in a
remote PAG.
For each of these categories, statistics are
grouped into hot, warm, and cold subcategories.
The hot statistics show context switches to threads
that last ran on the CPU only a very short time
before. The warm statistics show context switches
to threads that last ran on the CPU a somewhat
longer time before. The cold statistics indicate
context switches to threads that never ran on the
CPU before. These statistics are a measure of how well cache affinity
is being maintained; that is, how likely it is that the data a thread
used when it last ran is still in the cache when the thread is
rescheduled. You cannot evaluate this information without knowledge of
the type of work being done on the system; maintenance of cache
affinity can be very important on systems (or processor sets) that are
dedicated to running certain applications (such as those doing
high-performance technical computing) but is less critical for systems
serving a variety of applications and users.

-u  Prints processor-usage statistics for each CPU. For example:
     Processor Usage

      cpu |  user   nice  system   idle  widle |   scalls      intr       csw   tbsync
     -----+--------------------------------------+-------------------------------------
       0  |   0.0    0.0     0.7   99.2    0.1  |  3327337  50861486  41885424   317108
       1  |   0.0    0.0     0.4   99.5    0.1  |  3514438         0  36710149   268667
       2  |   0.0    0.0     0.4   99.5    0.1  |  3182064         0  37384120   257749
       3  |   0.0    0.0     0.4   99.5    0.1  |  3528519         0  36468319   249492
       6  |   0.0    0.0     0.1   99.9    0.0  |   668892     11664  11793053   352294
       7  |   0.0    0.0     0.1   99.9    0.0  |   772821         0   9341527   352319
       8  |   0.0    0.0     0.0  100.0    0.0  |   529050     11724   5717059   347267
       9  |   0.0    0.0     0.0  100.0    0.0  |   492386         0   6603681   351509
       . . .
In this table:

cpu     The number identifier of the CPU.

user    The percentage of time slices spent running threads in user
        context.

nice    The percentage of time slices in which lower-priority threads
        were scheduled. These are user-context threads whose priority
        was explicitly lowered by using an interface such as the nice
        command or the class-scheduling software.

system  The percentage of time slices spent running threads in system
        context. This work includes servicing of interrupts and
        system calls that are made on behalf of user processes. An
        unusually high percentage in the system category might
        indicate a system bottleneck. Running kprofile and lockinfo
        provides more specific information about where system time is
        being spent. See uprofile(1) and lockinfo(8), respectively,
        for information about these utilities.

idle    The percentage of time slices in which no threads were
        scheduled.

widle   The percentage of time slices in which available threads were
        blocked by pending I/O and the CPU was idle. If this count is
        unusually high, it suggests that a bottleneck in an I/O
        channel might be causing suboptimal performance.

scalls  The count of system calls that were serviced.

intr    The count of interrupts that were serviced.

csw     The count of thread context switches (thread scheduling
        changes) that completed.

tbsync  The number of times that the translation buffer was
        synchronized.
OPERANDS

command
    The command to be executed by sched_stat.

cmd_arg
    Any arguments to the preceding command.

The command and cmd_arg operands are used to limit the length of time
in which sched_stat gathers statistics. Typically, sleep is specified
for command and some number of seconds is specified for cmd_arg.

If you do not specify a command to set a time interval for statistics
gathering, the statistics reflect what has occurred since the system
was last booted.
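For example, the first of the following commands reports statistics
gathered since the system was last booted; the second reports
statistics for a 30-second sample (the interval length is arbitrary):

     # /usr/sbin/sched_stat
     # /usr/sbin/sched_stat sleep 30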
DESCRIPTION

The sched_stat utility helps you determine how well the
system load is distributed among CPUs, what kinds of jobs
are getting (or not getting) sufficient cycles on each
CPU, and how well cache affinity is being maintained for
these jobs.
Answers to the following questions influence how a process and its
threads are scheduled. (An example of applying some of these controls
from the command line appears after this list of questions.)

Is the request to be serviced multiprocessor-safe?
If not, the kernel funnels the request to the master
CPU. The master CPU must reside in the default
processor set (which contains all system CPUs if
none were assigned to user-defined processor sets)
and is typically CPU 0; however, some platforms
permit CPUs other than CPU 0 to be the master CPU.
Few requests generated by software distributed with
the operating system need to be funneled to the
master CPU and most of these are associated with
certain device drivers. However, if the system runs
many third-party drivers, the number of requests
that must be funneled to the master CPU might be
higher.

What is the job priority?
Job priority influences how frequently a thread is
scheduled. Realtime requests and interrupts have
higher priority than time-share jobs, which include
the majority of user-mode threads. So, if a significant
number of CPU cycles are spent servicing
realtime requests and interrupts, there are fewer
cycles available for time-share jobs.
The default priority for time-share jobs can also be changed by using
the nice command, the runclass command, or the class-scheduling
software. On a
busy system, cache affinity is less likely to be
maintained for a thread from a time-share job whose
priority was lowered because more time is likely to
elapse between rescheduling operations for each
thread. Conversely, cache affinity is more likely
to be maintained for threads of a higher-priority
time-share job because less time elapses between
rescheduling operations. Note that the scheduler
always prioritizes the need for low response
latency (as demanded by interrupts and real-time
requests) higher than maintenance of cache affinity,
regardless of the priority assigned to a timeshare
job.

Are there user-defined restrictions that limit where a process may
run?
If so, the kernel must schedule all threads of that
process on CPUs in the restricted set. In some
cases, user-defined restrictions are explicit RAD
or CPU bindings specified either in an application
or by a command (such as runon) that was used to
launch the program or reassign one of its threads.
The set of CPUs where the kernel can schedule a
thread is also influenced by the presence of user-defined processor
sets. If the process was not explicitly started in or reassigned to a
user-defined processor set, the kernel must run it and all of its
threads only on CPUs in the default processor set.

Are any CPUs idle?
The scheduler is very aggressive in its attempts to
steal jobs from other CPUs to run on an idle CPU.
This means that the scheduler will migrate processes
or threads across RAD boundaries to give an
idle CPU work to do unless one of the preceding
restrictions is in place to prevent that. For example,
the scheduler does not cross processor set
boundaries when stealing work from another CPU,
even when a CPU is idle. In general, keeping CPUs
busy with work has higher priority than maintaining
memory or cache affinity during load-balancing
operations.
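For example, a command sequence such as the following starts one
application at a lowered timeshare priority with nice, restricts a
second application to a single CPU with runon, and then samples the
resulting scheduling behavior. The application names and the CPU
number are placeholders; see nice(1) and runon(1) for the exact
syntax supported on your system:

     # nice -n 10 batch_app &
     # runon 2 interactive_app &
     # /usr/sbin/sched_stat -l -u sleep 60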
Explicit memory-allocation advice provided in application
code influences scheduling only to the extent that the
preceding factors do not override that advice. However,
explicit memory-allocation advice does make a difference
(and thereby can improve performance) when CPUs in the
processor set where the program is running are kept busy
but are not overloaded.
To gather statistics with sched_stat, you typically follow these
steps:

 1.  Start up a system workload and wait for it to get to a steady
     state.

 2.  Start sched_stat with sleep as the specified command and some
     number of seconds as the specified cmd_arg. This causes
     sched_stat to gather statistics for the length of time it takes
     the sleep command to execute.

For example, the following command causes sched_stat to collect
statistics for 60 seconds and then print a report:

     # /usr/sbin/sched_stat sleep 60
If you include options on the command line, only statistics
for the specified options are reported.
If you run sched_stat without any options, all options except -R are
assumed. (See the descriptions of the -f, -l, -s, and -u options in
the OPTIONS section.)
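For example, the following command restricts the report to the
scheduler-dispatch and processor-usage statistics for a 30-second
sample:

     # /usr/sbin/sched_stat -s -u sleep 30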
Running the sched_stat command has minimal impact on system
performance.
The sched_stat utility is subject to change, without
advance notice, from one release to another. The utility
is intended mainly for use by other software applications
included in the operating system product, kernel developers,
and software support representatives. Therefore,
sched_stat should be used only interactively; any customer
scripts or programs written to depend on its output data
or display format might be broken by changes in future
versions of the utility or by patches that might be
applied to it.
EXIT STATUS

0    Success.

>0   An error occurred.
FILES

The pseudo driver that is opened by the sched_stat utility for
RAD-related statistics gathering.
SEE ALSO

Commands: iostat(1), netstat(1), nice(1), renice(1), runclass(1),
runon(1), uprofile(1), vmstat(1), advfsstat(8), collect(8),
lockinfo(8), nfsstat(8), sys_check(8)

Others: numa_intro(3), class_scheduling(4), processor_sets(4)