migration(5) migration(5)
migration - dynamic memory migration
This document describes the dynamic memory migration system available in
Origin systems.
Introduction
Dynamic page migration is a mechanism that provides adaptive memory
locality for applications running on a NUMA machine such as the Origin
systems. The Origin hardware implements a competitive algorithm based on
comparing remote memory access counters to a local memory access counter;
when the difference between the numbers of remote and local accesses goes
beyond a preset threshold, an interrupt is generated to inform the
operating system that a physical memory page is currently experiencing
excessive remote accesses.
Within the interrupt handler the operating system makes a final decision
whether to migrate the page or not. If it decides to migrate the page,
the migration is executed immediately. The system may decide not to
execute the migration due to enforcement of a migration control policy or
due to lack of resources.
Page migration can also be explicitly requested by users, and in
addition, it is used to assist the memory coalescing algorithms for
multiple page size support.
Migration Modules
The migration subsystem is composed of the following modules:
- Detection Module. This module monitors memory accesses issued by nodes
in the system to each physical memory page. In Origin systems this
module is mostly implemented in hardware. This detection module informs
the Migration Control Module that a page is experiencing excessive
remote accesses via an interrupt sent to the page's home node.
- Migration Engine Module. This module carries out data movement from a
current physical memory page to a new page in the node issuing the
remote accesses.
- Migration Control Module. This module decides whether the page should
be migrated or not, based on migration control policies, defined by
parameters such as migration threshold, bounce detection and
prevention, dampening factor, and others.
- Migration Control Periodic Operations Module. This module executes all
periodic operations needed for the Migration Control Module.
- Memory Management Control Interface Module (MMCI Module). This module
provides an interface for users to tune the migration policy associated
with an address space.
Migration Detection Module
The basic goal of memory migration is to minimize memory access latency.
In a NUMA system where local memory access latency is smaller than remote
memory access latency, we can achieve this latency minimization goal by
moving the data to the node where most memory references are going to be
issued from.
It would be great to be able to move data to the node where it is going
to be needed right before it is referenced. Unfortunately, we cannot
predict the future. However, common programs usually have some amount of
temporal and spatial locality, which allows us to heuristically predict
future behavior based on recent past behavior.
The usual procedure used to predict future memory accesses to a page is
to count the memory references to this page issued by each node in the
system. If the accumulated number of remote references becomes
considerably greater than the number of accumulated local references,
then it may be beneficial to migrate the page to the remote node issuing
the references, especially if this remote node will continue accessing
this same page for a long time.
Origin systems have counters that continuously monitor all memory
accesses issued by each node in the system to each physical memory page.
In a 64-node Origin (128 processors), we have 64 memory access counters
for every 4-KB low-level physical page (4 KB is the low-level physical
page size; software page sizes start at 16 KB on Origin systems). For
every memory access, the counter associated with the node
issuing the reference is incremented; at the same time, this counter is
compared to the counter that keeps track of local accesses, and if the
remote counter exceeds the local counter by a threshold, an interrupt is
generated advising the Operating System about the existence of a page
with excessive remote accesses.
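The count-and-compare step described above can be sketched in C as follows. This is a minimal software model, not the Origin hardware logic: the type page_counters, the function record_access, and the fixed NUM_NODES are hypothetical names chosen for illustration, and counter saturation is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES 64  /* one counter per node, as in a 64-node system */

/* Hypothetical model of the per-page reference counters. */
struct page_counters {
    uint32_t count[NUM_NODES]; /* accesses issued by each node */
    unsigned home;             /* node that holds the page (local node) */
};

/*
 * Record one access from `node` and report whether a migration
 * request interrupt would be raised: the accessing node is remote
 * and its counter exceeds the local counter by at least `threshold`.
 */
static bool record_access(struct page_counters *pc, unsigned node,
                          uint32_t threshold)
{
    assert(node < NUM_NODES);
    pc->count[node]++;
    if (node == pc->home)
        return false;  /* local accesses never trigger a request */
    uint32_t remote = pc->count[node];
    uint32_t local = pc->count[pc->home];
    return remote > local && remote - local >= threshold;
}
```

With a threshold of 3 and one earlier local access, the fourth consecutive access from the same remote node is the first to report a migration request in this model.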
Upon reception of the interrupt, the Migration Control Module in the
Operating System decides whether to migrate the page or not.
The threshold that determines how large the difference between remote and
local counters needs to be in order for the interrupt to be generated is
stored in a per-node hardware register, which is initialized by the
Migration Control Module. The default system threshold, defined in
/var/sysgen/mtune/numa by the tunable variables
numa_migr_default_threshold and numa_migr_threshold_reference (see
Page Migration Tunables below), and the threshold specified by users as
a parameter of a migration policy (mmci(5)) are not stored directly in
this register, because different pages on the same node may have
different migration thresholds. Instead, these thresholds are used to
initialize the reference counters when a page is initialized.
Migration Engine Module
This module transparently moves a page from one physical frame to
another. The migration engine first verifies the availability of all
resources needed to realize the migration of a page. If any resource is
unavailable, the operation is cancelled.
The data transfer operation may be done using a processor or a
specialized Block Transfer Engine. Translation lookaside buffer (TLB)
shootdowns may be done using inter-processor interrupts or special
hardware known as poison bits, available only as an option on special
Origin systems running IRIX 6.5 or later. TLB shootdowns are needed in
order to avoid the use of stale translations that may be pointing to the
physical memory page that contained the data before migration took place.
Normally, a TLB shootdown operation is performed by sending interrupts to
all processors in the system with a TLB that may have stale translation
entries. On systems with poison bits, this global TLB shootdown is not
needed: along with the data transfer operation, hardware bits are
automatically set to indicate that the page is now stale (poisonous); if
a processor tries to access this stale page via a stale translation, the
memory management hardware generates a special Bus Error which causes the
TLB with the stale translation to be updated. Effectively, poison bits
allow for the implementation of a lazy TLB shootdown algorithm.
The vehicle used for the data transfer operation may be selected by the
system administrator via a tunable variable in /var/sysgen/mtune/numa:
numa_migr_vehicle. Poison bit based TLB shootdowns are enabled whenever
the data transfer vehicle is the Block Transfer Engine and the hardware
is equipped with the optional poison bits.
Migration Control Module
This module decides whether a page should be migrated or not after
receiving a notification (via an interrupt) from the Migration Detection
Module alerting that a page is experiencing excessive remote accesses.
This decision is based on applicable migration control policies and
resource availability.
The basic idea behind controlling migration is that it is not always a
good idea to migrate a page when the memory reference counters are
telling us that a page is experiencing excessive remote accesses; the
page may be bouncing back and forth due to poor application behavior, the
counters may have accumulated too much past knowledge, making them unfit
to predict near future behavior, the destination node may have little
free memory, or the path needed to do the migration may be too busy.
The Migration Control Module applies a series of filters to a reference
counter notification or migration request, as enumerated below. All
tunables mentioned in this list are found in /var/sysgen/mtune/numa.
Node Distance Filter This filter rejects all migration requests
where the distance between the source and the
destination is less than
numa_migr_min_distance in
/var/sysgen/mtune/numa. All rejected requests
result in the page being frozen in order to
prevent this request from being re-issued too
soon.
Memory Pressure Filter This filter rejects migration requests to
nodes where physical memory is low. The
threshold for low memory is defined by the
tunable numa_migr_memory_low_threshold, which
defines the minimum percentage of physical
memory that needs to be available in order
for a page to be migrated there. This filter
can be enabled and disabled using the tunable
numa_migr_memory_low_enabled.
Traffic Control Filter Experimental filter intended to throttle
migration down when the Craylink Interconnect
traffic reaches peak levels. Experiments have
shown that this filter is unnecessary for
Origin 2000 systems.
Bounce Control Filter Sometimes pages may start bouncing due to
poor application behavior or simple page
level false sharing. This filter detects and
freezes bouncing pages. The detection is done
by keeping a count of the number of
migrations per page in a counter that is aged
(periodically decremented by a system
daemon). If the count ever goes above a
threshold, the page is considered to be
bouncing and is then frozen. Frozen pages
start melting immediately, so after a period
of time, they are unfrozen and migratable
again. Note that the melting procedure is
gradual, not instantaneous. The bounce
control filter
relies on operations executed periodically by
the Migration Control Periodic Operations
Module described below, for a) aging of the
migration counters and b) melting of frozen
pages. The period of these bounce control
periodic operations is defined by the tunable
numa_migr_bounce_control_interval. The
default value for this tunable is 0, which
translates into a period such that 4 physical
pages are operated on per tick (10[ms]
interval). Freezing can be enabled and
disabled using the tunable
numa_migr_freeze_enabled, and the freezing
threshold can be set using the tunable
numa_migr_freeze_threshold. This threshold is
specified as a percentage of the maximum
effective freezing threshold value, which is
7 for Origin 2000 systems. Melting can be
enabled and disabled using the tunable
numa_migr_melt_enabled, and the melting
threshold can be set using the tunable
numa_migr_melt_threshold. The melting
threshold is expressed as a percentage of the
maximum effective melting threshold value,
which is 7 for Origin 2000 systems.
Migration Dampening Filter This filter minimizes the amount of migration
due to quick temporary remote memory
accesses, such as those that occur when
caches are loaded from a cold state, or when
they are reloaded with a new context. We
implement this dampening filter using a
per-page migration request counter that is
incremented every time we receive a migration
request interrupt, and aged (periodically
decremented) by the Migration Control
Periodic Operations Module. We migrate a page
only if the counter reaches a value greater
than some dampening threshold. This will
happen only for applications that
continuously generate remote accesses to the
same page during some interval of time. If
the application experiences just a short
transitory sequence of remote accesses, it is
very unlikely that the migration request
counter will reach the threshold value. This
filter can be enabled and disabled using the
tunable numa_migr_dampening_enabled, and the
migration request count threshold can be set
using the tunable numa_migr_dampening_factor.
The memory reference counters are re-initialized to their startup values
after every reference counter interrupt.
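The Migration Dampening Filter described above can be sketched as a small per-page counter. This is an illustrative model only: the names dampen_state, dampen_request, and dampen_age are hypothetical, the threshold is taken as an absolute count rather than the percentage used by numa_migr_dampening_factor, and saturating the counter at its maximum is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define DAMPEN_MAX_COUNT 3  /* maximum request count on Origin 2000 */

/* Hypothetical per-page dampening state. */
struct dampen_state {
    unsigned requests;  /* aged count of migration request interrupts */
};

/* Called for each migration request interrupt; true means "migrate". */
static bool dampen_request(struct dampen_state *s, unsigned threshold)
{
    assert(threshold <= DAMPEN_MAX_COUNT);
    if (s->requests < DAMPEN_MAX_COUNT)
        s->requests++;           /* assumed: saturate at the maximum */
    return s->requests > threshold;
}

/* Called periodically: age (decrement) the request counter. */
static void dampen_age(struct dampen_state *s)
{
    if (s->requests > 0)
        s->requests--;
}
```

In this model a short burst of requests followed by aging never crosses the threshold, while a sustained stream of requests does.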
Migration Control Periodic Operations Module
The Migration Control Module relies on several periodic operations. These
operations are listed below:
- Bounce Control Operations. Age migration counter for freezing and
melting.
- Unpegging. Reset memory reference counters that have reached a
saturation level.
- Queue Control Operations. Age queued outstanding migration requests.
Experimental, always disabled for production systems.
- Traffic Control Operations. Sample the state of the Craylink
interconnect and correspondingly adjust the per-node migration
threshold. Experimental, always disabled for production systems.
These operations are executed in a loop, triggered once every
mem_tick_base_period, a tunable that defines the period of the migration
control periodic operations in system ticks (a system tick is equivalent
to 10 [ms] on Origin systems running IRIX 6.5). This loop of operations may
be enabled and disabled using the tunable mem_tick_enabled. If migration
is enabled or users are allowed to use migration, this loop must be
enabled.
In order to minimize interference with user processes, we limit the
number of pages operated on in a loop to a few pages, trying to limit the
time used to less than 20 [us]. Administrators can adjust the time
dedicated to these periodic operations via the following tunables:
+ mem_tick_base_period
+ numa_migr_unpegging_control_interval
+ numa_migr_traffic_control_interval
+ numa_migr_bounce_control_interval
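As a rough aid for reasoning about these tunables, the following C sketch estimates how long one full pass over a node's pages takes when a fixed number of pages is processed per mem_tick. The function name full_pass_ms and its parameters are hypothetical; only the 10 [ms] system tick comes from the text above.

```c
#include <assert.h>

#define TICK_MS 10u  /* one system tick on Origin systems running IRIX 6.5 */

/*
 * Time in milliseconds for one full pass over `pages` pages when
 * `pages_per_mem_tick` pages are handled per mem_tick and a mem_tick
 * lasts `mem_tick_base_period` system ticks.
 */
static unsigned long full_pass_ms(unsigned long pages,
                                  unsigned pages_per_mem_tick,
                                  unsigned mem_tick_base_period)
{
    assert(pages_per_mem_tick > 0);
    /* Round up: a partial batch still costs one mem_tick. */
    unsigned long mem_ticks =
        (pages + pages_per_mem_tick - 1) / pages_per_mem_tick;
    return mem_ticks * mem_tick_base_period * TICK_MS;
}
```

For example, for an assumed node with 16384 pages, 4 pages per mem_tick, and a base period of 1 tick, full_pass_ms(16384, 4, 1) gives 40960 ms, or about 41 seconds per pass.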
Description of Periodic Operations
The following list describes the Bounce Control Periodic Operations in
detail:
Aging Migration Counters In order to detect bouncing we keep track of
the number of migrations per page using a
counter that is periodically decremented
(aged). When the counter goes beyond a
threshold, we consider the page to be
bouncing and freeze it.
Aging Migration Request Counters
In order to avoid excessive migration or
bouncing due to short, transitory remote
memory access sequences we have a migration
dampening filter that needs to count several
migration requests within a limited period of
time before it actually lets a real page
migration take place. The time factor is
introduced in the filter by aging the
migration request counters.
Melting Frozen Pages When a page is frozen we want to eventually
unfreeze it so that it becomes migratable
again. This behavior is desirable because the
events that cause a page to be frozen are
usually temporary. As part of the periodic
operations, we increment a counter per page
to keep track of how long the page has been
frozen. When the counter goes above a
threshold, meaning that the page has been
frozen for a sufficient time, we unfreeze the
page, thereby making it migratable again.
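The aging, freezing, and melting operations described in this list can be sketched together in C. This is a simplified model under stated assumptions: the names bounce_state, bounce_on_migration, and bounce_periodic are hypothetical, thresholds are absolute counts rather than percentages, and melting follows the description given in this list (a separate per-page counter of time spent frozen).

```c
#include <assert.h>
#include <stdbool.h>

#define BOUNCE_MAX_COUNT 7  /* maximum effective threshold, Origin 2000 */

/* Hypothetical per-page bounce control state. */
struct bounce_state {
    unsigned migrations;  /* aged count of migrations of this page */
    unsigned frozen_for;  /* periodic passes spent frozen */
    bool frozen;
};

/* Called after each migration of the page. */
static void bounce_on_migration(struct bounce_state *b, unsigned freeze_thr)
{
    assert(freeze_thr <= BOUNCE_MAX_COUNT);
    if (b->migrations < BOUNCE_MAX_COUNT)
        b->migrations++;
    if (b->migrations > freeze_thr && !b->frozen) {
        b->frozen = true;   /* page is bouncing: freeze it */
        b->frozen_for = 0;
    }
}

/* Called by the periodic operations: age, and melt frozen pages. */
static void bounce_periodic(struct bounce_state *b, unsigned melt_thr)
{
    if (b->migrations > 0)
        b->migrations--;                 /* aging */
    if (b->frozen && ++b->frozen_for > melt_thr)
        b->frozen = false;               /* melted: migratable again */
}
```

A caller would consult the frozen flag before accepting a migration request, and would invoke bounce_periodic once per pass of the periodic loop.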
The Unpegging Periodic Operation consists of scanning all the memory
reference counters looking for those counters that have pegged due to
reaching their maximum count. When a pegged counter is found, all
counters associated with that page are restarted.
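The unpegging scan for a single page can be sketched as below. The function unpeg_page and its parameters are hypothetical; in particular, the value the counters restart from is passed in as start_value because the text does not state it here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical unpegging pass over one page: if any of the page's
 * `n` reference counters has reached `peg_value`, restart all of
 * them from `start_value`. Returns true if the page was unpegged.
 */
static bool unpeg_page(uint32_t *counters, size_t n,
                       uint32_t peg_value, uint32_t start_value)
{
    assert(start_value < peg_value);
    for (size_t i = 0; i < n; i++) {
        if (counters[i] >= peg_value) {
            for (size_t j = 0; j < n; j++)
                counters[j] = start_value;
            return true;
        }
    }
    return false;
}
```

Restarting every counter of the page, not just the pegged one, keeps the relative counts comparable after the reset.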
The current implementation of the Migration Control module does not
execute Queue Control Periodic Operations or Traffic Control Periodic
Operations.
Page Migration Tunables
This is a list of all the memory migration tunables in
/var/sysgen/mtune/numa that define the default memory migration policy
used by the system.
* numa_migr_default_mode. This tunable defines the default migration
mode. It can take the following values:
0: MIGR_DEFMODE_DISABLED
Migration is completely disabled, users cannot use migration.
1: MIGR_DEFMODE_ENABLED
Migration is always enabled, users cannot disable migration.
2: MIGR_DEFMODE_NORMOFF
Migration is normally off, users can enable migration for
an application.
3: MIGR_DEFMODE_NORMON
Migration is normally on, users can disable migration for
an application.
4: MIGR_DEFMODE_LIMITED
Migration is normally off for machine configurations with
a maximum Craylink distance less than numa_migr_min_maxradius
(defined below). Migration is normally on otherwise. Users
can override this mode.
* numa_migr_default_threshold. This threshold defines the minimum
difference between the local and any remote counter needed to
generate a migration request interrupt.
     if ((remote_counter - local_counter) >=
         ((numa_migr_threshold_reference_value / 100) *
          numa_migr_default_threshold)) {
         send_migration_request_intr();
     }
* numa_migr_threshold_reference. This parameter defines the pegging
value for the memory reference counters. It is machine
configuration dependent. For Origin 2000 systems, it can take the
following values:
0: MIGR_THRESHREF_STANDARD = Threshold reference is 2048 (11-bit
counters). Maximum threshold allowed for systems with STANDARD
DIMMS. This is the default.
1: MIGR_THRESHREF_PREMIUM = Threshold reference is 524288 (19-bit
counters). Maximum threshold allowed for systems with *all*
PREMIUM DIMMS.
* numa_migr_vehicle. This tunable defines what device the system
should use to migrate a page. The value 0 selects the Block
Transfer Engine (BTE) and a value of 1 selects the processor. When
the BTE is selected, and the system is equipped with the optional
poison bits, the system automatically uses Lazy TLB Shootdown
Algorithms.
* numa_migr_min_maxradius. This tunable is used if
numa_migr_default_mode has been set to mode 4
(MIGR_DEFMODE_LIMITED). For this mode, migration is normally off for
machine configurations with a maximum Craylink distance less than
numa_migr_min_maxradius. Migration is normally on otherwise.
* numa_migr_auto_migr_mech. This tunable defines the migration
execution mode for memory reference counter triggered migrations: 0
for immediate and 1 for delayed. Only the Immediate Mode (0) is
currently available.
* numa_migr_user_migr_mech. This tunable defines the migration
execution mode for user requested migrations: 0 for immediate and 1
for delayed. Only the Immediate Mode (0) is currently available.
* numa_migr_coaldmigr_mech. This tunable defines the migration
execution mode for memory coalescing migrations: 0 for immediate and
1 for delayed. Only the Immediate Mode (0) is currently available.
* numa_refcnt_default_mode. Extended counters are used in application
profiling (see refcnt(5)) and to control automatic memory migration.
This tunable defines the default extended reference counter mode. It
can take the following values:
0: REFCNT_DEFMODE_DISABLED
Extended reference counters are disabled, users cannot access
the extended reference counters (refcnt(5)). In this case
automatic memory migration will not be performed regardless of
any other settings.
1: REFCNT_DEFMODE_ENABLED
Extended reference counters are always enabled, users cannot
disable them.
2: REFCNT_DEFMODE_NORMOFF
Extended reference counters are normally disabled, users can
disable or enable the counters for an application.
3: REFCNT_DEFMODE_NORMON
Extended reference counters are normally enabled, users can
disable or enable the counters for an application.
* numa_refcnt_overflow_threshold. This tunable defines the count at
which the hardware reference counters notify the operating system of
a counter overflow in order for the count to be transferred into the
(software) extended reference counters. It is expressed as a
percentage of the threshold reference value defined by
numa_migr_threshold_reference.
* numa_migr_min_distance. Minimum distance required by the Node
Distance Filter in order to accept a migration request.
* numa_migr_memory_low_enabled. Enable or disable the Memory Pressure
Filter.
* numa_migr_memory_low_threshold. Threshold at which the Memory
Pressure Filter starts rejecting migration requests to a node. This
threshold is expressed as a percentage of the total amount of
physical memory in a node.
* numa_migr_freeze_enabled. Enable or disable the freezing operation in
the Bounce Control Filter.
* numa_migr_freeze_threshold. Threshold at which a page is frozen. This
tunable is expressed as a percent of the maximum count supported by
the migration counters (7 for Origin 2000).
* numa_migr_melt_enabled. Enable or disable the melting operation in
the Bounce Control Filter.
* numa_migr_melt_threshold. When a migration counter goes below this
threshold a page is unfrozen. This tunable is expressed as a
percent of the maximum count supported by the migration counters (7
for Origin 2000).
* numa_migr_bounce_control_interval. This tunable defines the period
for the loop that ages the migration counters and the dampening
counters. It is expressed in terms of number of mem_ticks. The
mem_tick unit is defined by mem_tick_base_period below. If it is
set to 0, we process 4 pages per mem_tick. In this case, the actual
period depends on the amount of physical memory present in a node.
* numa_migr_dampening_enabled. Enable or disable migration dampening.
* numa_migr_dampening_factor. The number of migration requests needed
for a page before migration is actually executed. It is expressed as
a percentage of the maximum count supported by the migration-request
counters (3 for Origin 2000).
* mem_tick_enabled. Enable or disable the loop that executes the
Migration Control Periodic Operations.
* mem_tick_base_period. Number of 10[ms] system ticks in one mem_tick.
* numa_migr_unpegging_control_enabled. Enable or disable the unpegging
periodic operation.
* numa_migr_unpegging_control_interval. This tunable defines the period
for the loop that unpegs the hardware memory reference counters. It
is expressed in terms of number of mem_ticks. The mem_tick unit is
defined by mem_tick_base_period above. If it is set to 0, we process
8 pages per mem_tick. In this case, the actual period depends on the
amount of physical memory present in a node.
* numa_migr_unpegging_control_threshold. Hardware memory reference
counter value at which we consider the counter to be pegged. It is
expressed as a percent of the maximum count defined by
numa_migr_threshold_reference.
* numa_migr_traffic_control_enabled. Enable or disable the Traffic
Control Filter. This is an experimental module, and therefore it
should always be disabled.
* numa_migr_traffic_control_interval. Traffic control period.
Experimental module.
* numa_migr_traffic_control_threshold. Traffic control threshold for
kicking the batch migration of enqueued migration requests.
Experimental module.
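To illustrate how the numa_migr_default_mode values listed above combine with a per-application request, the following C sketch resolves whether migration is active. The enum constants mirror the values in the list; the user_pref type and the migration_active function are hypothetical names used only for this sketch, not a kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Values of numa_migr_default_mode, as listed above. */
enum migr_defmode {
    MIGR_DEFMODE_DISABLED = 0,
    MIGR_DEFMODE_ENABLED  = 1,
    MIGR_DEFMODE_NORMOFF  = 2,
    MIGR_DEFMODE_NORMON   = 3,
    MIGR_DEFMODE_LIMITED  = 4
};

/* Hypothetical per-application preference. */
enum user_pref { USER_UNSET, USER_OFF, USER_ON };

static bool migration_active(enum migr_defmode mode, enum user_pref pref,
                             unsigned max_distance, unsigned min_maxradius)
{
    bool normally_on;

    switch (mode) {
    case MIGR_DEFMODE_DISABLED: return false;  /* users cannot enable */
    case MIGR_DEFMODE_ENABLED:  return true;   /* users cannot disable */
    case MIGR_DEFMODE_NORMOFF:  normally_on = false; break;
    case MIGR_DEFMODE_NORMON:   normally_on = true;  break;
    case MIGR_DEFMODE_LIMITED:
        /* Off when the maximum Craylink distance is below the radius. */
        normally_on = max_distance >= min_maxradius;
        break;
    default:
        assert(!"unknown numa_migr_default_mode");
        return false;
    }
    if (pref == USER_UNSET)
        return normally_on;
    return pref == USER_ON;  /* user override in modes 2, 3 and 4 */
}
```

Modes 0 and 1 ignore the user preference entirely, which matches the "users cannot" wording in the list; the remaining modes only supply a default.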
/var/sysgen/mtune/numa
numa(5), replication(5), mtune(4), refcnt(5), mmci(5), nstats(1), sn(1).