numa_intro - Introduction to NUMA support
NUMA, or Non-Uniform Memory Access, refers to a hardware
architectural feature in modern multiprocessor platforms
that attempts to address the growing disparity between the
speed and bandwidth requirements of processors and the
bandwidth that memory systems, including the interconnect
between processors and memory, can deliver. NUMA systems
address this problem by grouping resources--processors,
I/O buses, and memory--into building blocks that balance
an appropriate number of processors and I/O buses with a
local memory system that delivers the necessary bandwidth.
The local building blocks are combined into a larger system
by means of a system-level interconnect with a platform-specific
topology.
The local processor and I/O components on a particular
building block can access their own "local" memory with
the lowest possible latency for a particular system
design. The local building block can in turn access the
resources (processors, I/O, and memory) of remote building
blocks at the cost of increased access latency and
decreased global access bandwidth. The term "Non-Uniform
Memory Access" refers to the difference in latency between
"local" and "remote" memory accesses that can occur on a
NUMA platform.
Overall system throughput and individual application performance
are optimized on a NUMA platform by maximizing the
ratio of local resource accesses to remote accesses. This
is achieved by recognizing and preserving the "affinity"
that processes have for the various resources on the system
building blocks. For this reason, the building blocks
are called "Resource Affinity Domains" or RADs.
RADs are supported only on a class of platforms known as
Cache Coherent NUMA, or CC NUMA, where all memory is
accessible and cache coherent with respect to all processors
and I/O buses. The Tru64 UNIX operating system
includes enhancements to optimize system throughput and
application performance on CC NUMA platforms for legacy
applications as well as those that use NUMA-aware APIs.
System enhancements to support NUMA are discussed in the
following subsections. Along with system performance monitoring
and tuning facilities, these enhancements allow
the operating system to make a "best effort" to optimize
the performance of any given collection of applications or
application components on a CC-NUMA platform.
NUMA Enhancements to Basic UNIX Algorithms and Default Behaviors
For NUMA, modifications to basic UNIX algorithms (scheduling,
memory allocation, and so forth) and to default
behaviors maximize local accesses transparently to applications.
These modifications, which include the following,
directly benefit legacy and non-NUMA-aware applications
that were designed for uniprocessors or Uniform Memory
Access Symmetric Multiprocessors but run on CC NUMA platforms:
Topology-aware placement of data
The operating system attempts to allocate memory
for application (and kernel) data on the RAD closest
to where the data will be accessed; or, for
data that is globally accessed, the operating system
may allocate memory across the available RADs.
When there is insufficient free memory on optimal
RADs, the memory allocations for data may "overflow"
onto nearby RADs.
Replication of read-only code and data
The operating system will attempt to make a local
copy of read-only text, such as shared library and
program code. Kernel code and kernel read-only data
are replicated on all RADs at boot time. If insufficient
free local memory is available, the operating
system may choose to utilize a remote copy
rather than wait for free local memory.
Memory affinity-aware scheduling
The operating system scheduler takes "cache affinity"
into account when choosing a processor to run
a process thread on multiprocessor platforms. Cache
affinity assumes that a process thread builds a
"memory footprint" in a particular processor's
cache. On CC NUMA platforms, the scheduler also
takes into account the fact that processes will
have memory allocated on particular RADs, and will
attempt to keep processes running on processors
that are in the same RAD as their memory footprints.
Load balancing
To minimize the requirement for remote memory allocation
(overflow), the scheduler will take into
account memory availability on a RAD as well as the
processor load average for the RAD. Although these
two factors may at times conflict with one another,
the scheduler will attempt to balance the load so
that processes run where there are memory pages as
well as processor cycles available. This balancing
involves both the initial selection of a RAD at
process creation and migration of processes or
individual pages in response to changing loads as
processes come and go or their resource requirements
or access patterns change.
NUMA Enhancements to Application Programming Interfaces
Application programmers can use new or modified library
routines to further increase local accesses on CC NUMA
platforms. Using these APIs, programmers can write new
applications or modify old ones to provide additional
information to the operating system or to take explicit
control over process, thread, and memory object placement,
or some combination of these.
Following are tables that list the NUMA library routines
that deal with RADs and RAD sets, processes and threads,
memory management, CPUs and CPU sets, and NUMA Scheduling
Groups. Routines are listed alphabetically in each table,
and some routines are listed in more than one table.
For information about NUMA types, structures, and symbolic
values, see numa_types(4). For information about NUMA
Scheduling Groups, see numa_scheduling_groups(4).
RADs and RAD Sets
-----------------------------------------------------------------------
Function Purpose Library Reference Page
-----------------------------------------------------------------------
nloc() Returns the RAD libnuma nloc(3)
set that is a
specified distance
from a resource.
rad_attach_pid() Attaches a process libnuma rad_attach_pid(3)
to a RAD (assigns
a home RAD but
allows execution
on other RADs).
rad_bind_pid() Binds a process to libnuma rad_attach_pid(3)
a RAD (assigns a
home RAD and
restricts execution
to the home
RAD).
rad_foreach() Scans a RAD set libnuma rad_foreach(3)
for members and
returns the first
member found.
rad_get_current_home() Returns the libnuma rad_get_current_home(3)
caller's home RAD.
rad_get_cpus() Returns the set of libnuma rad_get_num(3)
CPUs that are in a
RAD.
rad_get_freemem() Returns a snapshot libnuma rad_get_num(3)
of the free memory
pages that are in
a RAD.
rad_get_info() Returns informa- libnuma rad_get_num(3)
tion about a RAD,
including its
state (online or
offline) and the
number of CPUs and
memory pages it
contains.
rad_get_max() Returns the number libnuma rad_get_num(3)
of RADs in the
system. **
rad_get_num() Returns the number libnuma rad_get_num(3)
of RADs in the
caller's partition.
**
rad_get_physmem() Returns the number libnuma rad_get_num(3)
of memory pages
assigned to a RAD.
rad_get_state() Reserved for libnuma rad_get_num(3)
future use. (Currently,
RAD state
is always set to
RAD_ONLINE.)
radaddset() Adds a RAD to a libnuma radsetops(3)
RAD set.
radandset() Performs a logical libnuma radsetops(3)
AND operation on
two RAD sets,
storing the result
in a RAD set.
radcopyset() Copies the con- libnuma radsetops(3)
tents of one RAD
set to another RAD
set.
radcountset() Returns the number libnuma radsetops(3)
of members in a RAD set.
raddelset() Removes a RAD from libnuma radsetops(3)
a RAD set.
raddiffset() Finds the logical libnuma radsetops(3)
difference between
two RAD sets,
storing the result
in another RAD
set.
rademptyset() Initializes a RAD libnuma radsetops(3)
set such that no
RADs are included.
radfillset() Initializes a RAD libnuma radsetops(3)
set such that it
includes all RADs.
radisemptyset() Tests whether a libnuma radsetops(3)
RAD set is empty.
radismember() Tests whether a libnuma radsetops(3)
RAD belongs to a
given RAD set.
radorset() Performs a logical libnuma radsetops(3)
OR operation on
two RAD sets,
storing the result
in another RAD
set.
radsetcreate() Allocates a RAD libnuma radsetops(3)
set and sets it to
empty.
radsetdestroy() Releases the mem- libnuma radsetops(3)
ory allocated for
a RAD set.
radxorset() Performs a logical libnuma radsetops(3)
XOR operation on
two RAD sets,
storing the result
in another RAD
set.
-----------------------------------------------------------------------
** On a partitioned system, the system and the partition
are equivalent. In this case, the operating system
returns information only for the partition in which it is
installed.
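As an illustration only, the following fragment sketches one way
the RAD-set routines above might be combined to give the calling
process a home RAD. The header name (numa.h), the exact prototypes,
and the use of 0 as the flags argument are assumptions, not verified
interfaces; see radsetops(3), rad_get_num(3), and rad_attach_pid(3),
and link with -lnuma.
    /*
     * Sketch only: assign RAD 0 as the caller's home RAD.
     * Prototypes, header name, and the flags value are assumptions;
     * consult radsetops(3) and rad_attach_pid(3) for the real interfaces.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <numa.h>                   /* radset_t, radid_t (assumed header) */

    int main(void)
    {
        radset_t radset;

        printf("partition has %d RAD(s)\n", rad_get_num());

        if (radsetcreate(&radset) != 0) {   /* allocate an empty RAD set */
            perror("radsetcreate");
            exit(1);
        }
        rademptyset(radset);                /* start with no members */
        radaddset(radset, 0);               /* add RAD 0 */

        /* Attach (not bind): RAD 0 becomes the home RAD, but the process
         * may still run elsewhere.  The flags argument is assumed to be 0. */
        if (rad_attach_pid(getpid(), radset, 0) != 0)
            perror("rad_attach_pid");

        radsetdestroy(&radset);             /* release the RAD set */
        return 0;
    }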
Processes and Threads
----------------------------------------------------------------------------------
Function Purpose Library Reference Page
----------------------------------------------------------------------------------
nfork() Creates a child pro- libnuma nfork(3)
cess that is an exact
copy of its parent
process. See also the
table entry for
rad_fork().
nmadvise() Tells the system what libnuma nmadvise(3)
behavior to expect
from a process with
respect to referencing
mapped files and
shared memory
regions.
nsg_attach_pid() Attaches a process to libnuma nsg_attach_pid(3)
a NUMA scheduling
group.
nsg_detach_pid() Detaches a process libnuma nsg_attach_pid(3)
from a NUMA scheduling
group.
pthread_nsg_attach() Attaches a thread to libpthread pthread_nsg_attach(3)
a NUMA scheduling
group.
pthread_nsg_detach() Detaches a thread libpthread pthread_nsg_detach(3)
from a NUMA scheduling
group.
pthread_rad_attach() Attaches a thread to libpthread pthread_rad_attach(3)
a RAD set.
pthread_rad_bind() Attaches a thread to libpthread pthread_rad_attach(3)
a RAD set and
restricts its execution
to the home RAD.
pthread_rad_detach() Detaches a thread libpthread pthread_rad_detach(3)
from a RAD set.
rad_attach_pid() Attaches a process to libnuma rad_attach_pid(3)
a RAD (assigns a home
RAD but allows execution
on other RADs).
rad_bind_pid() Binds a process to a libnuma rad_attach_pid(3)
RAD (assigns a home
RAD and restricts
execution to the home
RAD).
rad_fork() Creates a child pro- libnuma rad_fork(3)
cess on a RAD that
optionally does not
inherit the RAD
assignment of its
parent. See also the
table entry for
nfork().
----------------------------------------------------------------------------------
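Under the same assumptions about header names, prototypes, and flags
values, the following sketch shows how a multithreaded program might
attach one worker thread to each RAD with pthread_rad_attach(); see
pthread_rad_attach(3) for the actual interface and link with
-lnuma -lpthread.
    /*
     * Sketch only: start one worker per RAD and attach it to that RAD.
     * Prototypes and the flags argument are assumptions; see
     * pthread_rad_attach(3) and radsetops(3).
     */
    #include <stdio.h>
    #include <pthread.h>
    #include <numa.h>                   /* assumed header name */

    static void *worker(void *arg)
    {
        printf("worker intended for RAD %ld running\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        radset_t radset;
        pthread_t tid;
        long rad;

        radsetcreate(&radset);
        for (rad = 0; rad < rad_get_num(); rad++) {
            rademptyset(radset);
            radaddset(radset, (radid_t)rad);    /* one RAD per worker */

            pthread_create(&tid, NULL, worker, (void *)rad);
            /* Attach: assigns the RAD as the thread's home without
             * forbidding execution elsewhere; flags assumed 0. */
            pthread_rad_attach(tid, radset, 0);
            pthread_join(tid, NULL);
        }
        radsetdestroy(&radset);
        return 0;
    }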
Memory Management
----------------------------------------------------------------------
Function Purpose Library Reference Page
----------------------------------------------------------------------
memalloc_attr() Returns the memory libnuma memalloc_attr(3)
allocation policy for
a RAD set specified
by its virtual
address.
nacreate() Sets up an arena for libc amalloc(3)
memory allocation for
use with the amalloc()
function. An
arena is used in multithreaded
programs
when there is a need
for thread-specific
heap memory allocation.
nmadvise() Tells the system what libnuma nmadvise(3)
behavior to expect
from a process with
respect to referencing
mapped files and
shared memory
regions.
nmmap() Maps an open file (or libnuma nmmap(3)
anonymous memory)
onto the address
space for a process
by using a specified
memory allocation
policy.
nshmget() Returns or creates libnuma nshmget(3)
the ID for a shared
memory region.
----------------------------------------------------------------------
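The following fragment sketches the general shape of a
placement-aware mapping with nmmap(). The numa_attr_t field names
(nattr_policy, nattr_radset) and the MPOL_DIRECTED policy name used
here are illustrative placeholders only; the actual structure,
policy names, and prototype are documented in numa_types(4) and
nmmap(3).
    /*
     * Sketch only: map anonymous memory with a directed placement request.
     * The numa_attr_t fields and the policy constant below are placeholders;
     * see numa_types(4) and nmmap(3) for the real definitions.
     */
    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>
    #include <numa.h>                   /* assumed header name */

    int main(void)
    {
        radset_t radset;
        numa_attr_t attr;               /* defined in numa_types(4) */
        void *buf;

        radsetcreate(&radset);
        rademptyset(radset);
        radaddset(radset, 0);           /* request placement on RAD 0 */

        memset(&attr, 0, sizeof(attr));
        attr.nattr_policy = MPOL_DIRECTED;  /* placeholder policy name */
        attr.nattr_radset = radset;         /* placeholder field name */

        buf = nmmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0, &attr);
        if (buf == MAP_FAILED)
            perror("nmmap");

        radsetdestroy(&radset);
        return 0;
    }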
CPUs and CPU Sets
-----------------------------------------------------------------------
Function Purpose Library Reference Page
-----------------------------------------------------------------------
cpu_foreach() Enumerates the members libc cpu_foreach(3)
of a CPU set.
cpu_get_current() Returns the identifier libc cpu_get_current(3)
of the current CPU on
which the calling process
is running.
cpu_get_info() Returns CPU informa- libc cpu_get_info(3)
tion for the system.
**
cpu_get_max() Returns the number of libc cpu_get_info(3)
CPU slots available in
the caller's partition.
**
cpu_get_num() Returns the number of libc cpu_get_info(3)
available CPUs.
cpu_get_rad() Returns the RAD iden- libnuma cpu_get_rad(3)
tifier for a CPU.
cpuaddset() Adds a CPU to a CPU libc cpusetops(3)
set.
cpuandset() Performs a logical AND libc cpusetops(3)
operation on the contents
of two CPU sets,
storing the result in
a third CPU set.
cpucopyset() Copies the contents of libc cpusetops(3)
one CPU set to another
CPU set.
cpucountset() Returns the number of libc cpusetops(3)
CPUs in a CPU set.
cpudelset() Deletes a CPU from a libnuma cpusetops(3)
CPU set.
cpudiffset() Finds the logical dif- libnuma cpusetops(3)
ference between two
CPU sets, storing the
result in a third CPU
set.
cpuemptyset() Initializes a CPU set libnuma cpusetops(3)
such that it includes
no CPUs.
cpufillset() Initializes a CPU set libnuma cpusetops(3)
such that it includes
all CPUs.
cpuisemptyset() Tests whether a CPU libnuma cpusetops(3)
set is empty.
cpuismember() Tests whether a CPU is libnuma cpusetops(3)
a member of a particular
CPU set.
cpuorset() Performs a logical OR libnuma cpusetops(3)
operation on the contents
of two CPU sets,
storing the result in
a third CPU set.
cpusetcreate() Allocates a CPU set libnuma cpusetops(3)
and sets it to empty.
cpusetdestroy() Releases the memory libnuma cpusetops(3)
allocated to a CPU
set.
cpuxorset() Performs a logical XOR libnuma cpusetops(3)
operation on the contents
of two CPU sets,
storing the result in
a third CPU set.
-----------------------------------------------------------------------
** On a partitioned system, the system and the partition
are equivalent. In this case, the operating system
returns information only for the partition in which it is
installed.
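Again with assumed header names and prototypes, the following sketch
enumerates the CPU slots in the caller's partition and reports the
RAD to which each available CPU belongs; see cpusetops(3),
cpu_get_info(3), and cpu_get_rad(3) for the actual interfaces.
    /*
     * Sketch only: report the RAD membership of each CPU in a full CPU set.
     * Prototypes and header names are assumptions; see cpusetops(3) and
     * cpu_get_rad(3).
     */
    #include <stdio.h>
    #include <numa.h>                   /* assumed to declare the cpuset routines */

    int main(void)
    {
        cpuset_t cpuset;
        int cpu;

        cpusetcreate(&cpuset);
        cpufillset(cpuset);             /* include every CPU */

        for (cpu = 0; cpu < cpu_get_max(); cpu++) {
            if (!cpuismember(cpuset, cpu))
                continue;               /* slot empty or unavailable */
            /* cpu_get_rad() is assumed to map a CPU id to its RAD id. */
            printf("CPU %d is in RAD %d\n", cpu, (int)cpu_get_rad(cpu));
        }
        printf("%d CPU(s) in the set\n", cpucountset(cpuset));

        cpusetdestroy(&cpuset);
        return 0;
    }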
NUMA Scheduling Groups
---------------------------------------------------------------------------------
Function Purpose Library Reference Page
---------------------------------------------------------------------------------
nsg_attach_pid() Attaches a process libnuma nsg_attach_pid(3)
to a NUMA scheduling
group.
nsg_destroy() Removes a NUMA libnuma nsg_destroy(3)
scheduling group and
deallocates its
structures.
nsg_detach_pid() Detaches a process libnuma nsg_attach_pid(3)
from a NUMA scheduling
group.
pthread_nsg_attach() Attaches a thread to libpthread pthread_nsg_attach(3)
a NUMA scheduling
group.
pthread_nsg_detach() Detaches a thread libpthread pthread_nsg_detach(3)
from a NUMA scheduling
group.
nsg_get() Returns the status libnuma nsg_get(3)
of a NUMA scheduling
group.
nsg_get_nsgs() Returns a list of libnuma nsg_get_nsgs(3)
NUMA scheduling
groups that are
active.
nsg_get_pids() Returns a list of libnuma nsg_get_pids(3)
processes attached
to a NUMA scheduling
group.
nsg_init() Looks up (and possi- libnuma nsg_init(3)
bly creates) a NUMA
scheduling group.
nsg_set() Sets group ID, user libnuma nsg_set(3)
ID, and permissions
for a NUMA scheduling
group.
pthread_nsg_get() Returns a list of libpthread pthread_nsg_get(3)
threads attached to
a NUMA scheduling
group.
---------------------------------------------------------------------------------
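The following sketch outlines how two cooperating processes might be
placed in the same NUMA scheduling group so that the scheduler tries
to keep them together. The key, the permission argument to
nsg_init(), the nsgid_t type, and the argument order of
nsg_attach_pid() are all assumptions; the actual interfaces are
described in nsg_init(3), nsg_attach_pid(3), and
numa_scheduling_groups(4).
    /*
     * Sketch only: put a parent and child process in one NUMA scheduling group.
     * The key, permissions, types, and argument order are assumptions; see
     * nsg_init(3) and nsg_attach_pid(3).
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/ipc.h>                /* IPC_PRIVATE (assumed usable as a key) */
    #include <numa.h>                   /* assumed header name */

    int main(void)
    {
        nsgid_t nsg;                    /* assumed type; see numa_types(4) */
        pid_t child;

        nsg = nsg_init(IPC_PRIVATE, 0600);  /* look up or create a group */
        if (nsg < 0) {
            perror("nsg_init");
            return 1;
        }

        nsg_attach_pid(nsg, getpid());      /* argument order assumed */

        child = fork();
        if (child == 0) {
            nsg_attach_pid(nsg, getpid());  /* child joins the same group */
            _exit(0);
        }
        /* The scheduler can now try to co-locate both processes. */
        return 0;
    }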
NUMA Enhancements to System Utilities and Daemons
A number of system commands display RAD-specific information
or perform RAD-specific operations. The following
list briefly describes the NUMA options supported by system
utilities and daemons:
The runon -r command executes an application on a specific
RAD.
The vmstat -r command displays virtual memory statistics
for a specific RAD.
The netstat -R command displays network routing tables for
each RAD.
The ps -o RAD command includes RAD binding in the information
displayed about processes running on the system.
The hwmgr -view hier command displays the RAD location of
CPUs and devices. In this case, in place of a RAD identifier,
the command identifies the construct in hardware that
corresponds to a RAD. When run on a GS80, GS160, or GS320
AlphaServer platform, the command shows the hierarchy of
CPUs and devices within QBBs. When run on an ES80 or GS1280
AlphaServer platform, the command shows the hierarchy of
CPUs and devices within PIDs (processing unit IDs).
The sched_stat -R command also displays the RAD location of
system CPUs. In addition, this command shows the relative
distance (number of hops) between CPUs.
The -t and -u options on the nfsd command allow customization
of the number of TCP and UDP server threads, respectively,
that are spawned per RAD. This feature allows the NFS
server to automatically scale the number of TCP and UDP
server threads according to the size of the system.
The -r option on the inetd command allows customization of
the RAD locations on which to start Internet server child
daemons. By default, one child daemon is started on each
RAD.
The route -R command of the kdbx kernel debugger displays
network route tables for all RADs.
NUMA Overview
The NUMA Overview is a web-only document that includes a
complete NUMA programming example. Starting with Tru64
UNIX Version 5.1, this web-only document can be accessed
through the version-specific web pages for Tru64 UNIX documentation.
Links to documentation sets for different
product versions are available at the following URL:
http://www.Tru64UNIX.compaq.com/docs/pub_page/doc_list.html
numa_intro(3)