MP(3C) MP(3C)
mp: mp_block, mp_blocktime, mp_create, mp_destroy, mp_my_threadnum,
mp_numthreads, mp_set_numthreads, mp_setup, mp_unblock, mp_setlock,
mp_suggested_numthreads, mp_unsetlock, mp_barrier, mp_in_doacross_loop,
mp_set_slave_stacksize - C multiprocessing utility functions
void mp_block()
void mp_unblock()
void mp_blocktime(iters)
int iters
void mp_setup()
void mp_create(num)
int num
void mp_destroy()
int mp_numthreads()
void mp_set_numthreads(num)
int num
int mp_my_threadnum()
int mp_is_master()
void mp_setlock()
void mp_unsetlock()
void mp_barrier()
int mp_in_doacross_loop()
void mp_set_slave_stacksize(size)
int size
unsigned int mp_suggested_numthreads(num)
unsigned int num
These routines give some measure of control over the parallelism used in
C programs. They should not be needed by most users, but will help to
tune specific applications.
mp_block puts all slave threads to sleep via blockproc(2). This frees
the processors for use by other jobs. This is useful if it is known that
the slaves will not be needed for some time, and the machine is being
shared by several users. Calls to mp_block may not be nested; a warning
is issued if an attempt to do so is made.
mp_unblock wakes up the slave threads that were previously blocked via
mp_block. It is an error to unblock threads that are not currently
blocked; a warning is issued if an attempt is made to do so.
It is not necessary to explicitly call mp_unblock. When a parallel
region is entered, a check is made, and if the slaves are currently
blocked, a call is made to mp_unblock automatically.
mp_blocktime controls the amount of time a slave thread waits for work
before giving up. When enough time has elapsed, the slave thread blocks
itself. This automatic blocking is independent of the user level
blocking provided by the mp_block/mp_unblock calls. Slave threads that
have blocked themselves will be automatically unblocked upon entering a
parallel region. The argument to mp_blocktime is the number of times to
spin in the wait loop. By default, it is set to 10,000,000, which takes
about 0.25 seconds on a 200 MHz processor. As a special case, an argument
of 0 disables the automatic blocking, and the slaves will spin wait
without limit. The environment variable MP_BLOCKTIME may be set to an
integer value. It acts like an implicit call to mp_blocktime during
program startup.
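The relationship between the spin count and the wait time can be sketched as follows. This is only a rough model: it assumes the wait scales linearly from the figure quoted above (10,000,000 spins in about 0.25 seconds at 200 MHz) and in proportion to the clock rate. The helper name blocktime_iters_for is hypothetical and is not one of the mp routines.

```c
#include <assert.h>

/* Reference point quoted in the text: 10,000,000 spins take about
 * 0.25 seconds on a 200 MHz processor. */
#define REF_ITERS   10000000.0
#define REF_SECONDS 0.25
#define REF_MHZ     200.0

/* Hypothetical helper (not an mp routine): estimate the spin count to
 * pass to mp_blocktime for a desired wait, ASSUMING the wait loop scales
 * linearly with the clock rate. A result of 0 would disable automatic
 * blocking, so callers should treat it specially. */
long blocktime_iters_for(double seconds, double clock_mhz)
{
    double iters;

    assert(seconds >= 0.0 && clock_mhz > 0.0);
    iters = REF_ITERS * (seconds / REF_SECONDS) * (clock_mhz / REF_MHZ);
    return (long)(iters + 0.5);   /* round to nearest */
}
```

The returned value could then be passed to mp_blocktime or exported through MP_BLOCKTIME. Measured times on a real system may differ from this estimate.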
mp_destroy deletes the slave threads. They are stopped by forcing them
to call exit(2). In general, doing this is discouraged. mp_block can be
used in most cases.
mp_create creates and initializes threads. It creates enough threads so
that the total number is equal to the argument. Since the calling thread
already counts as one, mp_create will create one less than its argument
in new slave threads.
mp_setup also creates and initializes threads. It takes no arguments.
It simply calls mp_create using the current default number of threads.
Normally the default number is equal to the number of CPUs currently on
the machine. If the user has not called either of the thread creation
routines already, then mp_setup is invoked automatically when the first
parallel region is entered. If the environment variable MP_SETUP is set,
then mp_setup is called during initialization, before any user code is
executed.
mp_numthreads returns the number of threads that would participate in an
immediately following parallel region. If the threads have already been
created, then it returns the current number of threads. If the threads
have not been created, then it returns the current default number of
threads. The count includes the master thread. Knowing this count can be
useful in optimizing certain kinds of parallel loops by hand, but this
function has the side-effect of freezing the number of threads to the
returned value. As a result, this routine should be used sparingly. To
determine the number of threads without this side-effect, see the
description of mp_suggested_numthreads below.
mp_set_numthreads sets the current default number of threads to the
specified value. Note that this call does not directly create the
threads; it only specifies the number that a subsequent mp_setup call
should use. If the environment variable MP_SET_NUMTHREADS is set, it
acts like an implicit call to mp_set_numthreads during program startup.
For convenience when operating among several machines with different
numbers of cpus, MP_SET_NUMTHREADS may be set to an expression involving
integer literals, the binary operators + and -, the binary functions min
and max, and the special symbolic value ALL which stands for "the total
number of available cpus on the current machine." Thus, something simple
like
setenv MP_SET_NUMTHREADS 7
would set the number of threads to seven. This may be a fine choice on
an 8 cpu machine, but would be very bad on a 4 cpu machine. Instead, use
something like
setenv MP_SET_NUMTHREADS "max(1,all-1)"
which sets the number of threads to be one less than the number of cpus
on the current machine (but always at least one). If your configuration
includes some machines with large numbers of cpus, setting an upper bound
is a good idea. Something like:
setenv MP_SET_NUMTHREADS "min(all,4)"
will request (no more than) 4 cpus.
For compatibility with earlier releases, NUM_THREADS is supported as a
synonym for MP_SET_NUMTHREADS.
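To make the expression forms above concrete, here is a minimal sketch of an evaluator for them. It is not the runtime's parser. It assumes that ALL is matched case-insensitively (the examples above use lowercase), that + and - associate left to right, and that white space is ignored. The name eval_numthreads and its signature are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

typedef struct {
    const char *p;   /* current position in the expression     */
    int all;         /* value substituted for the symbol ALL   */
    int ok;          /* cleared on the first syntax error      */
} Parser;

static int parse_expr(Parser *ps);

static void skip_ws(Parser *ps)
{
    while (isspace((unsigned char)*ps->p))
        ps->p++;
}

/* Consume `word` (lowercase, matched case-insensitively) if it is next. */
static int match(Parser *ps, const char *word)
{
    size_t i;

    skip_ws(ps);
    for (i = 0; word[i] != '\0'; i++)
        if (tolower((unsigned char)ps->p[i]) != word[i])
            return 0;
    ps->p += i;
    return 1;
}

/* term := integer | ALL | min(expr,expr) | max(expr,expr) */
static int parse_term(Parser *ps)
{
    int is_min = match(ps, "min");

    if (is_min || match(ps, "max")) {
        int a, b;
        ps->ok &= match(ps, "(");
        a = parse_expr(ps);
        ps->ok &= match(ps, ",");
        b = parse_expr(ps);
        ps->ok &= match(ps, ")");
        return is_min ? (a < b ? a : b) : (a > b ? a : b);
    }
    if (match(ps, "all"))
        return ps->all;
    if (isdigit((unsigned char)*ps->p)) {
        char *end;
        long v = strtol(ps->p, &end, 10);
        ps->p = end;
        return (int)v;
    }
    ps->ok = 0;      /* nothing recognizable here */
    return 0;
}

/* expr := term (('+' | '-') term)* */
static int parse_expr(Parser *ps)
{
    int v = parse_term(ps);

    for (;;) {
        if (match(ps, "+"))
            v += parse_term(ps);
        else if (match(ps, "-"))
            v -= parse_term(ps);
        else
            return v;
    }
}

/* Hypothetical entry point: evaluates `text` with ALL = ncpus.
 * Returns 1 and stores the result in *out, or returns 0 on a syntax error. */
int eval_numthreads(const char *text, int ncpus, int *out)
{
    Parser ps;
    int v;

    assert(text != NULL && out != NULL);
    ps.p = text;
    ps.all = ncpus;
    ps.ok = 1;
    v = parse_expr(&ps);
    skip_ws(&ps);
    if (!ps.ok || *ps.p != '\0')
        return 0;
    *out = v;
    return 1;
}
```

With this sketch, "max(1,all-1)" yields 7 on an 8-CPU machine and 1 on a single-CPU machine, and "min(all,4)" yields 4 on a 16-CPU machine, matching the descriptions above.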
mp_my_threadnum returns an integer between 0 and n-1 where n is the value
returned by mp_numthreads. The master process is always thread 0. This
is occasionally useful for optimizing certain kinds of loops by hand.
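One way a thread number can be used to divide a loop by hand is sketched below. The helper is hypothetical: me stands for the value returned by mp_my_threadnum and nthreads for the value returned by mp_numthreads, and the contiguous split shown is only one of several reasonable partitions.

```c
#include <assert.h>

/* Half-open range [lo, hi) of loop iterations assigned to one thread. */
typedef struct {
    int lo;
    int hi;
} Slice;

/* Hypothetical helper: split n iterations into nthreads contiguous
 * slices whose sizes differ by at most one. */
Slice thread_slice(int n, int nthreads, int me)
{
    Slice s;
    int base, extra;

    assert(n >= 0 && nthreads > 0 && me >= 0 && me < nthreads);
    base  = n / nthreads;   /* every thread gets at least this many    */
    extra = n % nthreads;   /* the first `extra` threads get one more  */
    s.lo = me * base + (me < extra ? me : extra);
    s.hi = s.lo + base + (me < extra ? 1 : 0);
    return s;
}
```

Each thread would then run its own loop from s.lo up to, but not including, s.hi. Keep in mind the side effect of mp_numthreads described above when obtaining the thread count.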
mp_is_master returns 1 if called by the master process, 0 otherwise.
mp_setlock provides convenient (though limited) access to the locking
routines. The convenience is that no set up need be done; it may be
called directly without any preliminaries. The limitation is that there
is only one lock. It is analogous to the ussetlock(3P) routine, but it
takes no arguments and does not return a value. This is useful for
serializing access to shared variables (e.g. counters) in a parallel
region. Note that it will frequently be necessary to declare those
variables as volatile to ensure that the optimizer does not assign them
to a register.
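The usage pattern can be sketched as follows. Because the mp routines exist only on systems that provide this library, the sketch substitutes stand-ins built on the C11 atomic_flag type: demo_setlock and demo_unsetlock are hypothetical names, not the library routines, and only mimic the "one lock, no arguments, no return value" interface described above.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for the single lock behind mp_setlock/mp_unsetlock.
 * These are NOT the library routines. */
static atomic_flag demo_lock = ATOMIC_FLAG_INIT;

static void demo_setlock(void)
{
    while (atomic_flag_test_and_set_explicit(&demo_lock,
                                             memory_order_acquire))
        ;   /* spin until the lock is free */
}

static void demo_unsetlock(void)
{
    atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}

/* Shared counter, declared volatile as the text recommends so that the
 * optimizer does not keep it in a register. */
static volatile long hits = 0;

/* Work that every thread of a parallel region might perform. */
void record_hit(void)
{
    demo_setlock();
    hits = hits + 1;   /* read-modify-write is safe only under the lock */
    demo_unsetlock();
}

/* Read the counter under the same lock. */
long hit_count(void)
{
    long n;

    demo_setlock();
    n = hits;
    demo_unsetlock();
    return n;
}
```

In a program linked with the mp routines, the two stand-ins would be replaced by calls to mp_setlock and mp_unsetlock. Since there is only one lock, unrelated counters protected this way also serialize against each other.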
mp_unsetlock is the companion routine for mp_setlock. It also takes no
arguments and does not return a value.
mp_barrier provides a simple interface to a single barrier(3P). It may
be used inside a parallel loop to force a barrier synchronization to
occur among the parallel threads. The routine takes no arguments,
returns no value, and does not require any initialization.
mp_in_doacross_loop answers the question "am I currently executing inside
a parallel loop?" This is needed in certain rare situations where you
have an external routine that can be called both from inside a parallel
loop and also from outside a parallel loop, and the routine must do
different things depending on whether it is being called in parallel or
not.
mp_set_slave_stacksize sets the stacksize (in bytes) to be used by the
slave processes when they are created (via sprocsp(2)). The default size
is 16MB. Note that slave processes allocate only their local data on
their stack; shared data (even if allocated on the master's stack) is not
counted.
mp_suggested_numthreads uses the supplied value as a hint about how many
threads to use in subsequent parallel regions, and returns the previous
value of the number of threads to be employed in parallel regions. It
does not affect currently executing parallel regions, if any. The
implementation may ignore this hint depending on factors such as overall
system load. This routine may also be called with the value 0, in which
case it simply returns the number of threads to be employed in parallel
regions without the side-effect present in mp_numthreads.
Pragmas or directives
The MIPSpro C (and C++) compiler allows you to apply the capabilities of
a Silicon Graphics multiprocessor computer to the execution of a single
job. By coding a few simple directives, the compiler splits the job into
concurrently executing pieces, thereby decreasing the wall-clock run time
of the job.
Directives enable, disable, or modify a feature of the compiler.
Essentially, directives are command line options specified within the
input file instead of on the command line. Unlike command line options,
directives have no default setting. To invoke a directive, you must
either toggle it on or set a desired value for its level. The following
directives can be used in C (and C++) programs when compiled with the -mp
option.
#pragma parallel
This pragma denotes the start of a parallel region. The syntax for
this pragma has a number of modifiers, but to run a single loop in
parallel, the only modifiers you usually use are shared and local.
These options tell the multiprocessing compiler which variables to
share between all threads of execution and which variables should be
treated as local.
In C, the code that comprises the parallel region is delimited by
curly braces ({ }) and immediately follows the parallel pragma and
its modifiers.
The syntax for this pragma is:
#pragma parallel shared (variables)
#pragma local (variables) optional modifiers
{code}
The parallel pragma has four modifiers: shared, local, if, and
numthreads.
Their definitions are:
shared ( variable_names )
Tells the multiprocessing C compiler the names of all the
variables that the threads must share.
local ( variable_names )
Tells the multiprocessing C compiler the names of all the
variables that must be private to each thread. (When PCA sets up
a parallel region, it does this for you.)
if ( integer_valued_expr )
Lets you set up a condition that is evaluated at run time to
determine whether to run the statement(s) serially or in
parallel. At compile time, it is not always possible to judge how
much work a parallel region does (for example, loop indices are
often calculated from data supplied at run time). Avoid running
trivial amounts of code in parallel, because the gain cannot make
up for the overhead associated with running code in parallel. PCA
will also generate this condition as appropriate. If the if
condition is false (equal to zero), the statement(s) run serially.
Otherwise, the statement(s) run in parallel.
numthreads(expr)
Tells the multiprocessing C compiler the number of available
threads to use when running this region in parallel. (The default
is all the available threads.)
In general, you should never have more threads of execution than
you have processors, and you should specify numthreads with the
MP_SET_NUMTHREADS environment variable at run time. If you want
to run a loop in parallel while you run some other code, you can
use this option to tell the multiprocessing C compiler to use
only some of the available threads.
The expression expr should evaluate to a positive integer.
For example, to start a parallel region in which to run the
following code in parallel:
for (idx=n; idx; idx--) {
    a[idx] = b[idx] + c[idx];
}
you must write:
#pragma parallel shared( a, b, c ) shared(n) local( idx )
or:
#pragma parallel
#pragma shared( a, b, c )
#pragma shared(n)
#pragma local(idx)
before the statement or compound statement (code in curly braces,
{ }) that comprises the parallel region.
Any code within a parallel region but not within any of the
explicit parallel constructs (pfor, independent, one processor,
and critical) is termed local code. Local code typically
modifies only local data and is run by all threads.
#pragma pfor
The pfor is contained within a parallel region. Use #pragma pfor to
run a for loop in parallel only if the loop meets all of these
conditions:
All the values of the index variable can be computed
independently of the iterations.
All iterations are independent of each other - that is, data used
in one iteration does not depend on data created by another
iteration. A quick test for independence: if the loop can be run
backwards, then chances are good the iterations are independent.
The loop control variable cannot be a field within a
class/struct/union or an array element.
The number of times the loop must be executed is determined once,
upon entry to the loop, and is based on the loop initialization,
loop test, and loop increment statements.
If the number of times the loop is actually executed is different
from what is computed above, the results are unpredictable. This
can happen if the loop test and increment change during the
execution of the loop, or if there is an early exit from within
the for loop. An early exit or a change to the loop test and
increment during execution may have serious performance
implications.
The test or the increment should not contain expressions with
side effects.
The chunksize, if specified, is computed before the loop is
executed, and the behavior is unpredictable if its value changes
within the loop.
If you are writing a pfor loop for the multiprocessing C++
compiler, the index variable i can be declared within the for
statement via
int i = 0;
The draft for the C++ standard states that the scope of the index
variable declared in a for statement extends to the end of the
for statement, as in this example:
#pragma pfor
for (int i = 0, ...)
The C++ compiler doesn't enforce this; in fact, with this
compiler the scope extends to the end of the enclosing block. Use
care when writing code so that a subsequent change in the scope
rules for i (in later compiler releases) does not affect your
code.
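The condition that the iteration count is determined once, on entry to the loop, can be illustrated with a small calculation. The sketch below covers only an ascending loop of the form for (i = start; i < end; i += step); the helper name trip_count is hypothetical, and the compiler's own computation may be organized differently.

```c
#include <assert.h>

/* Hypothetical helper: number of iterations of
 *     for (i = start; i < end; i += step)
 * computed once, before the loop body runs. step must be positive. */
long trip_count(long start, long end, long step)
{
    assert(step > 0);
    if (end <= start)
        return 0;
    return (end - start + step - 1) / step;   /* ceiling division */
}
```

If the loop body changes end or step, or leaves the loop early, the number of iterations actually executed no longer matches this precomputed value, which is why such loops give unpredictable results under pfor.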
If the code after a pfor is not dependent on the calculations made in
the pfor loop, there is no reason to synchronize the threads of
execution before they continue. So, if one thread from the pfor
finishes early, it can go on to execute the serial code without
waiting for the other threads to finish their part of the loop.
The #pragma pfor directive takes several modifiers; the only one that
is required is iterate. #pragma pfor tells the compiler that each
iteration of the loop is unique. It also partitions the iterations
among the threads for execution.
The syntax for #pragma pfor is:
#pragma pfor iterate ( ) optional_modifiers
for ...
{ code ... }
The pfor pragma has several modifiers. Their syntax is:
iterate (index variable=expr1; expr2; expr3 )
local(variable list)
lastlocal (variable list)
reduction (variable list)
affinity (variable) = thread (expression)
schedtype (type)
chunksize (expr)
Where:
iterate (index variable=expr1; expr2; expr3 )
Gives the multiprocessing C compiler the information it needs to
identify the unique iterations of the loop and partition them to
particular threads of execution.
index variable is the index variable of the for loop you want
to run in parallel.
expr1 is the starting value for the loop index.
expr2 is the number of iterations for the loop you want to
run in parallel.
expr3 is the increment of the for loop you want to run in
parallel.
local (variable list)
Specifies variables that are local to each process. If a variable
is declared as local, each iteration of the loop is given its own
uninitialized copy of the variable. You can declare a variable as
local if its value does not depend on any other iteration of the
loop and if its value is used only within a single iteration. In
effect the local variable is just temporary; a new copy can be
created in each loop iteration without changing the final answer.
lastlocal (variable list)
Specifies variables that are local to each process. Unlike with
the local clause, the compiler saves only the value of the
logically last iteration of the loop when it exits.
reduction (variable list)
Specifies variables involved in a reduction operation. In a
reduction operation, the compiler keeps local copies of the
variables and combines them when it exits the loop. An element of
the reduction list must be an individual variable (also called a
scalar variable) and cannot be an array or struct. However, it
can be an individual element of an array. When the reduction
modifier is used, it appears in the list with the correct
subscripts.
One element of an array can be used in a reduction operation,
while other elements of the array are used in other ways. To
allow for this, if an element of an array appears in the
reduction list, the entire array can also appear in the shared
list.
The two types of reductions supported are sum (+) and product (*).
The compiler confirms that the reduction expression is legal by
making some simple checks. The compiler does not, however, check
all statements in the for loop for illegal reductions. You must
ensure that the reduction variable is used correctly in a
reduction operation.
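What a sum reduction does can be modeled sequentially, as in the sketch below. It is an illustration under stated assumptions, not the code the compiler generates: the workers are simulated one after another, iterations are dealt to them round-robin, and the names model_sum_reduction and MAX_THREADS are hypothetical.

```c
#include <assert.h>

#define MAX_THREADS 64   /* arbitrary limit for this sketch */

/* Sequential model of a sum reduction over a[0..n-1] with nthreads
 * workers. Each worker accumulates into its own local copy; the copies
 * are combined once, after the loop. */
double model_sum_reduction(const double *a, int n, int nthreads)
{
    double partial[MAX_THREADS];
    double total = 0.0;
    int t, i;

    assert(a != 0 && n >= 0);
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    for (t = 0; t < nthreads; t++) {
        partial[t] = 0.0;                   /* identity element for + */
        for (i = t; i < n; i += nthreads)   /* this worker's iterations */
            partial[t] += a[i];
    }
    for (t = 0; t < nthreads; t++)          /* combine on loop exit */
        total += partial[t];
    return total;
}
```

Because the partial sums are added in a different order than in a serial loop, a floating-point reduction can differ from the serial result in its last bits.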
affinity (variable) = thread (expression)
The effect of thread-affinity is to execute iteration "i" on the
thread number given by the user-supplied expression (modulo the
number of threads). Since the threads may need to evaluate this
expression in each iteration of the loop, the variables used in
the expression (other than the loop induction variable) must be
declared shared and must not be modified during the execution of
the loop. Violating these rules may lead to incorrect results.
If the expression does not depend on the loop induction variable,
then all iterations will execute on the same thread, and will not
benefit from parallel execution.
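The mapping from an affinity expression to a thread can be sketched as below. The helpers are hypothetical, the expression value is assumed to be non-negative, and the blocked expression i / block is just one example of an expression that depends on the loop induction variable.

```c
#include <assert.h>

/* Hypothetical helper: thread on which an iteration runs when the
 * affinity expression evaluates to expr_value; the value is taken
 * modulo the number of threads. Assumes expr_value is non-negative. */
int affinity_thread(int expr_value, int nthreads)
{
    assert(expr_value >= 0 && nthreads > 0);
    return expr_value % nthreads;
}

/* Count how many of the iterations i = 0..n-1 land on `thread` when the
 * affinity expression is i / block (a blocked distribution). */
int iterations_on_thread(int n, int block, int nthreads, int thread)
{
    int i, count = 0;

    assert(n >= 0 && block > 0);
    for (i = 0; i < n; i++)
        if (affinity_thread(i / block, nthreads) == thread)
            count++;
    return count;
}
```

An expression that ignores i always produces the same thread number, which is the degenerate case described above in which every iteration runs on one thread.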
schedtype (type)
Tells the multiprocessing C compiler how to share the loop
iterations among the processors. The schedtype chosen depends on
the type of system you are using and the number of programs
executing. You can use the following valid types to modify
schedtype:
simple (the default)
Tells the run time scheduler to partition the iterations
evenly among all the available threads.
runtime
Tells the compiler that the real schedule type will be
specified at run time.
dynamic
Tells the run time scheduler to give each thread chunksize
iterations of the loop. chunksize should be smaller than
(number of total iterations)/(number of threads). The
advantage of dynamic over simple is that dynamic helps
distribute the work more evenly than simple.
Depending on the data, some iterations of a loop can take
longer to compute than others, so some threads may finish
long before the others. In this situation, if the iterations
are distributed by simple, a thread that finishes early waits
for the others. But if the iterations are distributed by
dynamic, the thread doesn't wait; it goes back to get another
chunksize iterations, until the threads of execution have run
all the iterations of the loop.
interleave
Tells the run time scheduler to give each thread chunksize
iterations (described below) of the loop, which are then
assigned to the threads in an interleaved way.
gss (guided self-scheduling)
Tells the run time scheduler to give each processor a varied
number of iterations of the loop. This is like dynamic, but
instead of a fixed chunksize, the pieces begin large and
shrink as the loop progresses.
If I iterations remain and P threads are working on them, the
piece size is roughly: I/(2P) + 1
Programs with triangular matrices should use gss.
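The gss piece-size formula and the interleave assignment can be worked through with a short sketch. The helpers are hypothetical; the text gives the gss formula only as a rough guide, and the round-robin dealing of chunks under interleave is an assumption about what "interleaved" means here.

```c
#include <assert.h>

/* Piece size handed out by gss when `remaining` iterations are left and
 * `nthreads` threads are working, using the rough formula I/(2P) + 1
 * from the text, capped at the iterations that remain. */
int gss_piece(int remaining, int nthreads)
{
    int piece;

    assert(remaining >= 0 && nthreads > 0);
    piece = remaining / (2 * nthreads) + 1;
    return piece < remaining ? piece : remaining;
}

/* Fill sizes[] with successive gss pieces for a loop of n iterations.
 * Returns the number of pieces written (at most max_pieces). */
int gss_schedule(int n, int nthreads, int *sizes, int max_pieces)
{
    int count = 0;

    while (n > 0 && count < max_pieces) {
        sizes[count] = gss_piece(n, nthreads);
        n -= sizes[count];
        count++;
    }
    return count;
}

/* Thread that owns iteration i under interleave scheduling, ASSUMING
 * chunks of `chunk` iterations are dealt out round-robin. */
int interleave_owner(int i, int chunk, int nthreads)
{
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return (i / chunk) % nthreads;
}
```

For 100 iterations on 4 threads, the first gss piece is 100/8 + 1 = 13 iterations, the next is 87/8 + 1 = 11, and the pieces keep shrinking until single iterations are handed out at the end.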
chunksize (expr)
Tells the multiprocessing C/C++ compiler how many iterations
to define as a chunk when you use the dynamic or interleave
modifier (described above).
expr should be a positive integer, and should evaluate to the
result of the following formula:
number of iterations / X
where X is between twice and ten times the number of threads.
Select twice the number of threads when iterations vary
slightly. Increase X (reducing the chunk size) as the variance
among the iterations increases. Performance gains may diminish
after increasing X to ten times the number of threads.
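The chunksize guideline amounts to a one-line calculation, sketched below. The helper name suggest_chunksize is hypothetical, and factor is the multiplier applied to the number of threads to obtain X (2 for nearly uniform iterations, up to 10 for highly variable ones).

```c
#include <assert.h>

/* Hypothetical helper: chunksize = iterations / X, where X is `factor`
 * times the number of threads and factor lies between 2 and 10. The
 * result is kept at least 1 because chunksize must be positive. */
int suggest_chunksize(int iterations, int nthreads, int factor)
{
    int chunk;

    assert(iterations > 0 && nthreads > 0);
    assert(factor >= 2 && factor <= 10);
    chunk = iterations / (factor * nthreads);
    return chunk > 0 ? chunk : 1;
}
```

For a loop of 1000 iterations on 4 threads, this gives a chunksize of 125 with a factor of 2 and 25 with a factor of 10.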
#pragma one processor
A #pragma one processor directive causes the statement that follows
it to be executed by exactly one thread.
The syntax of this pragma is:
#pragma one processor
{ code }
#pragma critical
Sometimes the bulk of the work done by a loop can be done in
parallel, but the entire loop cannot run in parallel because of a
single data-dependent statement. Often, you can move such a statement
out of the parallel region. When that is not possible, you can
sometimes use a lock on the statement to preserve the integrity of
the data.
In the multiprocessing C/C++ compiler, use the critical pragma to put
a lock on a critical statement (or compound statement using { }).
When you put a lock on a statement, only one thread at a time can
execute that statement. If one thread is already working on a
critical protected statement, any other thread that wants to execute
that statement must wait until that thread has finished executing it.
The syntax of the critical pragma is:
#pragma critical (lock_variable)
{ code }
The statement(s) after the critical pragma will be executed by all
threads, one at a time. The lock variable lock_variable is an
optional integer variable that must be initialized to zero. The
parentheses are required. If you don't specify a lock variable, the
compiler automatically supplies one. Multiple critical constructs
inside the same parallel region are considered to be independent of
each other unless they use the same explicit lock variable.
#pragma independent
Running a loop in parallel is a class of parallelism sometimes called
fine-grained parallelism or homogeneous parallelism. It is called
homogeneous because all the threads execute the same code on
different data. Another class of parallelism is called coarse-
grained parallelism or heterogeneous parallelism. As the name
suggests, the code in each thread of execution is different.
Ensuring data independence for heterogeneous code executed in
parallel is not always as easy as it is for homogeneous code executed
in parallel. (Even for homogeneous code, ensuring data independence is
not a trivial task.)
The independent pragma has no modifiers. Use this pragma to tell the
multiprocessing C/C++ compiler to run code in parallel with the rest
of the code in the parallel region.
The syntax for #pragma independent is:
#pragma independent
{ code }
Synchronization Directives
To account for data dependencies, it is sometimes necessary for threads
to wait for all other threads to complete executing an earlier section of
code. Two sets of directives implement this coordination: #pragma
synchronize and #pragma enter/exit gate.
#pragma synchronize
A #pragma synchronize tells the multiprocessing C/C++ compiler that
within a parallel region, no thread can execute the statements that
follow this pragma until all threads have reached it. This
directive is a classic barrier construct.
The syntax for this pragma is:
#pragma synchronize
#pragma enter gate
#pragma exit gate
You can use two additional pragmas to coordinate the processing of
code within a parallel region. These additional pragmas work as a
matched set. They are #pragma enter gate and #pragma exit gate.
A gate is a special barrier. No thread can exit the gate until all
threads have entered it. This construct gives you more flexibility
when managing dependencies between the work-sharing constructs
within a parallel region.
The syntax of the enter gate pragma is:
#pragma enter gate
For example, construct D may be dependent on construct A, and
construct F may be dependent on construct B. However, you do not
want to stop at construct D because all the threads have not cleared
B. By using enter/exit gate pairs, you can make subtle distinctions
about which construct is dependent on which other construct.
Put this pragma after the work-sharing construct that all threads
must clear before the #pragma exit gate of the same name.
The syntax of the exit gate pragma is:
#pragma exit gate
Put this pragma before the work-sharing construct that is dependent
on the preceding #pragma enter gate. No thread enters this work-sharing
construct until all threads have cleared the work-sharing
construct controlled by the corresponding #pragma enter gate.
#pragma page_place
The syntax of this pragma is:
#pragma page_place (addr, size, threadnum)
where addr is the starting address, size is the size in bytes, and
threadnum is the thread.
On a system with physically distributed shared memory (for example,
Origin2000), you can explicitly place all data pages spanned by the
virtual address range [addr, addr + size-1] in the physical memory
of the processor corresponding to the specified thread.
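Because placement works on whole pages, a range that is not page-aligned touches partial pages at its ends. The sketch below counts the pages spanned by [addr, addr + size-1]. The helper name pages_spanned is hypothetical, and the page size is a parameter because it depends on the system; the value used in any call is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of pages touched by the byte range
 * [addr, addr + size - 1] for a given page size in bytes. */
size_t pages_spanned(uintptr_t addr, size_t size, size_t page_size)
{
    uintptr_t first, last;

    assert(page_size > 0);
    if (size == 0)
        return 0;
    first = addr / page_size;                /* page holding the first byte */
    last  = (addr + size - 1) / page_size;   /* page holding the last byte  */
    return (size_t)(last - first + 1);
}
```

With an assumed page size of 16384 bytes, an 8-byte object starting at address 16380 spans two pages, so placing it moves both pages, including whatever other data they hold.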
cc(1), f77(1), mp(3f), sync(3c), sync(3f), MIPSpro Power C Programmer's
Guide, MIPSpro C Language Reference Manual, MIPSpro FORTRAN 77
Programmer's Guide
MP(3F) MP(3F)
mp: mp_block, mp_blocktime, mp_create, mp_destroy, mp_my_threadnum,
mp_numthreads, mp_set_numthreads, mp_setup, mp_unblock, mp_setlock,
mp_suggested_numthreads, mp_unsetlock, mp_barrier, mp_in_doacross_loop,
mp_set_slave_stacksize - FORTRAN multiprocessing utility routines
subroutine mp_block()
subroutine mp_unblock()
subroutine mp_blocktime(iters)
integer iters
subroutine mp_setup()
subroutine mp_create(num)
integer num
subroutine mp_destroy()
integer function mp_numthreads()
subroutine mp_set_numthreads(num)
integer num
integer function mp_my_threadnum()
integer function mp_is_master()
subroutine mp_setlock()
integer function mp_suggested_numthreads(num)
integer num
subroutine mp_unsetlock()
subroutine mp_barrier()
logical function mp_in_doacross_loop()
subroutine mp_set_slave_stacksize(size)
integer size
These routines give some measure of control over the parallelism used in
FORTRAN jobs. They should not be needed by most users, but will help to
tune specific applications.
mp_block puts all slave threads to sleep via blockproc(2). This frees
the processors for use by other jobs. This is useful if it is known that
the slaves will not be needed for some time, and the machine is being
shared by several users. Calls to mp_block may not be nested; a warning
is issued if an attempt to do so is made.
mp_unblock wakes up the slave threads that were previously blocked via
mp_block. It is an error to unblock threads that are not currently
blocked; a warning is issued if an attempt is made to do so.
It is not necessary to explicitly call mp_unblock. When a FORTRAN
parallel region is entered, a check is made, and if the slaves are
currently blocked, a call is made to mp_unblock automatically.
mp_blocktime controls the amount of time a slave thread waits for work
before giving up. When enough time has elapsed, the slave thread blocks
itself. This automatic blocking is independent of the user level
blocking provided by the mp_block/mp_unblock calls. Slave threads that
have blocked themselves will be automatically unblocked upon entering a
parallel region. The argument to mp_blocktime is the number of times to
spin in the wait loop. By default, it is set to 10,000,000, which takes
about 0.25 seconds on a 200 MHz processor. As a special case, an argument
of 0 disables the automatic blocking, and the slaves will spin wait
without limit. The environment variable MP_BLOCKTIME may be set to an
integer value. It acts like an implicit call to mp_blocktime during
program startup.
mp_destroy deletes the slave threads. They are stopped by forcing them
to call exit(2). In general, doing this is discouraged. mp_block can be
used in most cases.
mp_create creates and initializes threads. It creates enough threads so
that the total number is equal to the argument. Since the calling thread
already counts as one, mp_create will create one less than its argument
in new slave threads.
mp_setup also creates and initializes threads. It takes no arguments.
It simply calls mp_create using the current default number of threads.
Unless otherwise specified, the default number is equal to the number of
CPUs currently on the machine, or 8, whichever is less. If the user has
not called either of the thread creation routines already, then mp_setup
is invoked automatically when the first parallel region is entered. If
the environment variable MP_SETUP is set, then mp_setup is called during
FORTRAN initialization, before any user code is executed.
mp_numthreads returns the number of threads that would participate in an
immediately following parallel region. If the threads have already been
created, then it returns the current number of threads. If the threads
have not been created, then it returns the current default number of
threads. The count includes the master thread. Knowing this count can
be useful in optimizing certain kinds of parallel loops by hand, but this
function has the side-effect of freezing the number of threads to the
returned value. As a result, this routine should be used sparingly. To
determine the number of threads without this side-effect, see the
description of mp_suggested_numthreads below.
mp_set_numthreads sets the current default number of threads to the
specified value. Note that this call does not directly create the
threads; it only specifies the number that a subsequent mp_setup call
should use. If the environment variable MP_SET_NUMTHREADS is set, it
acts like an implicit call to mp_set_numthreads during program startup.
For convenience when operating among several machines with different
numbers of cpus, MP_SET_NUMTHREADS may be set to an expression involving
integer literals, the binary operators + and -, the binary functions min
and max, and the special symbolic value ALL which stands for "the total
number of available cpus on the current machine." Thus, something simple
like
setenv MP_SET_NUMTHREADS 7
would set the number of threads to seven. This may be a fine choice on
an 8 cpu machine, but would be very bad on a 4 cpu machine. Instead, use
something like
setenv MP_SET_NUMTHREADS "max(1,all-1)"
which sets the number of threads to be one less than the number of cpus
on the current machine (but always at least one). If your configuration
includes some machines with large numbers of cpus, setting an upper bound
is a good idea. Something like:
setenv MP_SET_NUMTHREADS "min(all,4)"
will request (no more than) 4 cpus.
For compatibility with earlier releases, NUM_THREADS is supported as a
synonym for MP_SET_NUMTHREADS.
mp_my_threadnum returns an integer between 0 and n-1 where n is the value
returned by mp_numthreads. The master process is always thread 0. This
is occasionally useful for optimizing certain kinds of loops by hand.
mp_is_master returns 1 if called by the master process, 0 otherwise.
mp_setlock provides convenient (though limited) access to the locking
routines. The convenience is that no set up need be done; it may be
called directly without any preliminaries. The limitation is that there
is only one lock. It is analogous to the ussetlock(3P) routine, but it
takes no arguments and does not return a value. This is useful for
serializing access to shared variables (e.g. counters) in a parallel
region. Note that it will frequently be necessary to declare those
variables as VOLATILE to ensure that the optimizer does not assign them
to a register.
mp_suggested_numthreads uses the supplied value as a hint about how many
threads to use in subsequent parallel regions, and returns the previous
value of the number of threads to be employed in parallel regions. It
does not affect currently executing parallel regions, if any. The
implementation may ignore this hint depending on factors such as overall
system load. This routine may also be called with the value 0, in which
case it simply returns the number of threads to be employed in parallel
regions without the side-effect present in mp_numthreads.
mp_unsetlock is the companion routine for mp_setlock. It also takes no
arguments and does not return a value.
mp_barrier provides a simple interface to a single barrier(3P). It may
be used inside a parallel loop to force a barrier synchronization to
occur among the parallel threads. The routine takes no arguments,
returns no value, and does not require any initialization.
mp_in_doacross_loop answers the question "am I currently executing inside
a parallel loop?" This is needed in certain rare situations where an
external routine can be called both from inside and from outside a
parallel loop, and the routine must do different things depending on
whether it is being called in parallel or not.
mp_set_slave_stacksize sets the stack size (in bytes) to be used by the
slave processes when they are created (via sprocsp(2)). The default size
is 16MB. Note that slave processes allocate only their local data on
their stack; shared data (even if allocated on the master's stack) is not
counted.
Directives
The MIPSpro Fortran 77 compiler allows you to apply the capabilities of a
Silicon Graphics multiprocessor computer to the execution of a single
job. When you code a few simple directives, the compiler splits the job into
concurrently executing pieces, thereby decreasing the wall-clock run time
of the job.
Directives enable, disable, or modify a feature of the compiler.
Essentially, directives are command line options specified within the
input file instead of on the command line. Unlike command line options,
directives have no default setting. To invoke a directive, you must
either toggle it on or set a desired value for its level.
Directives placed on the first line of the input file are called global
directives. The compiler interprets them as if they appeared at the top
of each program unit in the file. Use global directives to ensure that
the program is compiled with the correct command line options. Directives
appearing anywhere else in the file apply only until the end of the
current program unit. The compiler resets the value of the directive to
the global value at the start of the next program unit. (Set the global
value using a command line option or a global directive.)
Some command line options act like global directives. Other command line
options override directives. Many directives have corresponding command
line options. If you specify conflicting settings in the command line and
a directive, the compiler chooses the most restrictive setting. For
Boolean options, if either the directive or the command line has the
option turned off, it is considered off. For options that require a
numeric value, the compiler uses the minimum of the command line setting
and the directive setting.
The Fortran compiler accepts directives that cause it to generate code
that can be run in parallel. The compiler directives look like Fortran
comments: they begin with a C in column one. If multiprocessing is not
turned on, these statements are treated as comments. This allows the
identical source to be compiled with a single-processing compiler or by
Fortran without the multiprocessing option. The directives are
distinguished by having a $ as the second character. The following
directives are supported:
C$DOACROSS, C$&, C$, C$MP_SCHEDTYPE, C$CHUNK, and C$COPYIN.
C$DOACROSS
The essential compiler directive for multiprocessing is C$DOACROSS. This
directive directs the compiler to generate special code to run iterations
of a DO loop in parallel. The C$DOACROSS directive applies only to the
next statement (which must be a DO loop). The Fortran compiler does not
support direct nesting of C$DOACROSS loops. The C$DOACROSS directive has
the form
C$DOACROSS [clause [[,] clause] ...]
where valid values for the optional clause are
[IF (logical_expression)]
[{LOCAL | PRIVATE} (item[,item ...])]
[{SHARE | SHARED} (item[,item ...])]
[{LASTLOCAL | LAST LOCAL} (item[,item ...])]
[REDUCTION (item[,item ...])]
[MP_SCHEDTYPE=mode]
[CHUNK=integer_expression]
The preferred form of the directive uses the optional commas between
clauses. This section discusses the meaning of each clause.
IF
The IF clause determines whether the loop is actually executed in
parallel. If the logical expression is TRUE, the loop is executed in
parallel. If the expression is FALSE, the loop is executed serially.
LOCAL, SHARE, LASTLOCAL
These clauses specify lists of variables used within parallel loops.
A variable can appear in only one of these lists. To make the task of
writing these lists easier, there are several defaults. The
loop-iteration variable is LASTLOCAL by default. All other variables are
SHARE by default.
LOCAL Specifies variables that are local to each process. If a
variable is declared as LOCAL, each iteration of the loop is given
its own uninitialized copy of the variable. You can declare a
variable as LOCAL if its value does not depend on any other iteration
of the loop and if its value is used only within a single iteration.
In effect the LOCAL variable is just temporary; a new copy can be
created in each loop iteration without changing the final answer. The
name LOCAL is preferred over PRIVATE.
SHARE Specifies variables that are shared across all processes. If a
variable is declared as SHARE, all iterations of the loop use the
same copy of the variable. You can declare a variable as SHARE if it
is only read (not written) within the loop or if it is an array where
each iteration of the loop uses a different element of the array. The
name SHARE is preferred over SHARED.
LASTLOCAL Specifies variables that are local to each process. Unlike
with the LOCAL clause, the compiler saves the value from the
logically last iteration of the loop (and only that value) when the
loop exits. The name LASTLOCAL is preferred over LAST LOCAL.
LOCAL is a little faster than LASTLOCAL, so if you do not need the
final value, it is good practice to put the DO loop index variable
into the LOCAL list, although this is not required.
Only variables can appear in these lists. In particular, COMMON
blocks cannot appear in a LOCAL list. The SHARE, LOCAL, and
LASTLOCAL lists give only the names of the variables. If any member
of the list is an array, it is listed without any subscripts.
REDUCTION
The REDUCTION clause specifies variables involved in a reduction
operation. In a reduction operation, the compiler keeps local copies
of the variables and combines them when it exits the loop. An element
of the REDUCTION list must be an individual variable (also called a
scalar variable) and cannot be an array. However, it can be an
individual element of an array. In a REDUCTION clause, it would
appear in the list with the proper subscripts.
One element of an array can be used in a reduction operation, while
other elements of the array are used in other ways. To allow for
this, if an element of an array appears in the REDUCTION list, the
entire array can also appear in the SHARE list.
The four types of reductions supported are sum (+), product (*), min,
and max. Note that min and max reductions must use the MIN and MAX
intrinsic functions to be recognized correctly.
The compiler confirms that the reduction expression is legal by
making some simple checks. The compiler does not, however, check all
statements in the DO loop for illegal reductions. You must ensure
that the reduction variable is used correctly in a reduction
operation.
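The idea of keeping local copies and combining them on loop exit can be
modeled in plain C. The sketch below is not the code the compiler
generates; it simulates the processes one after another. The name
reduce_sum, the limit MAX_PROCS, and the contiguous slicing of the
iterations are all assumptions made for illustration.

```c
#include <assert.h>

#define MAX_PROCS 64   /* arbitrary limit chosen for this sketch */

/* Sum a[0..n-1] in the manner of a sum reduction: every simulated
 * process accumulates into its own local copy, and the copies are
 * combined once the loop has finished. */
double reduce_sum(const double *a, int n, int nprocs)
{
    assert(nprocs > 0 && nprocs <= MAX_PROCS && n >= 0);
    double partial[MAX_PROCS];

    for (int p = 0; p < nprocs; p++) {
        int lo = (int)((long)n * p / nprocs);        /* slice start */
        int hi = (int)((long)n * (p + 1) / nprocs);  /* slice end, exclusive */
        partial[p] = 0.0;                            /* the local copy */
        for (int i = lo; i < hi; i++)
            partial[p] += a[i];
    }

    double total = 0.0;                              /* combine on exit */
    for (int p = 0; p < nprocs; p++)
        total += partial[p];
    return total;
}
```

Combining partial results changes the order of the additions, so a
floating-point sum computed this way can differ in its last digits from
the same sum computed serially.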
CHUNK, MP_SCHEDTYPE
The CHUNK and MP_SCHEDTYPE clauses affect the way the compiler
schedules work among the participating tasks in a loop. These clauses
do not affect the correctness of the loop. They are useful for tuning
the performance of critical loops.
For the MP_SCHEDTYPE=mode clause, mode can be one of the following:
[SIMPLE | STATIC]
[DYNAMIC]
[INTERLEAVE | INTERLEAVED]
[GUIDED | GSS]
[RUNTIME]
You can use any or all of these modes in a single program. The CHUNK
clause is valid only with the DYNAMIC and INTERLEAVE modes. SIMPLE,
DYNAMIC, INTERLEAVE, GSS, and RUNTIME are the preferred names for
each mode.
The simple method (MP_SCHEDTYPE=SIMPLE) divides the iterations among
processes by dividing them into contiguous pieces and assigning one
piece to each process.
In dynamic scheduling (MP_SCHEDTYPE=DYNAMIC) the iterations are
broken into pieces the size of which is specified with the CHUNK
clause. As each process finishes a piece, it enters a critical
section to grab the next available piece. This gives good load
balancing at the price of higher overhead.
The interleave method (MP_SCHEDTYPE=INTERLEAVE) breaks the iterations
into pieces of the size specified by the CHUNK option, and execution
of those pieces is interleaved among the processes.
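One plausible reading of the interleave method is that pieces are dealt
out to the processes in round-robin order, starting with process 0. The
C sketch below shows that mapping. The round-robin order is an
assumption, since the text above does not spell it out, and the function
name interleave_owner is hypothetical.

```c
#include <assert.h>

/* Process that would run zero-based iteration `iter` when pieces of
 * `chunk` consecutive iterations are dealt out round-robin to
 * `nprocs` processes. */
int interleave_owner(int iter, int chunk, int nprocs)
{
    assert(iter >= 0 && chunk > 0 && nprocs > 0);
    return (iter / chunk) % nprocs;
}
```

With CHUNK=2 and three processes, iterations 0 and 1 would go to process
0, iterations 2 and 3 to process 1, iterations 4 and 5 to process 2, and
iterations 6 and 7 back to process 0.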
The fourth method is a variation of the guided self-scheduling
algorithm (MP_SCHEDTYPE=GSS). Here, the piece size is varied
depending on the number of iterations remaining. By parceling out
relatively large pieces to start with and relatively small pieces
toward the end, the system can achieve good load balancing while
reducing the number of entries into the critical section.
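To give a feel for how the piece size shrinks, the C sketch below
computes the piece sizes of textbook guided self-scheduling, in which
each piece is the number of remaining iterations divided by the number
of processes, rounded up. The run-time library implements a variation of
this algorithm, so its actual piece sizes may differ. The function name
gss_pieces is hypothetical.

```c
#include <assert.h>

/* Fill sizes[] with the successive piece sizes that textbook guided
 * self-scheduling hands out for n iterations on nprocs processes.
 * Returns the number of pieces written; the list is cut short if it
 * would exceed max_pieces entries. */
int gss_pieces(int n, int nprocs, int *sizes, int max_pieces)
{
    assert(n >= 0 && nprocs > 0);
    int count = 0;
    int remaining = n;
    while (remaining > 0 && count < max_pieces) {
        int piece = (remaining + nprocs - 1) / nprocs;  /* round up */
        sizes[count++] = piece;
        remaining -= piece;
    }
    return count;
}
```

For 20 iterations on 4 processes this yields pieces of 5, 4, 3, 2, 2, 1,
1, 1 and 1 iterations: nine entries into the critical section, where
dynamic scheduling with CHUNK=1 would need twenty.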
In addition to these four methods, you can specify the scheduling
method at run time (MP_SCHEDTYPE=RUNTIME). Here, the scheduling
routine examines values in your run-time environment and uses that
information to select one of the other four methods.
If both the MP_SCHEDTYPE and CHUNK clauses are omitted, SIMPLE
scheduling is assumed. If MP_SCHEDTYPE is set to INTERLEAVE or
DYNAMIC and the CHUNK clause is omitted, CHUNK=1 is assumed. If
MP_SCHEDTYPE is set to one of the other values, CHUNK is ignored. If
the MP_SCHEDTYPE clause is omitted, but CHUNK is set, then
MP_SCHEDTYPE=DYNAMIC is assumed.
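The default rules above can be written out as a small decision function.
The C sketch below is only a model of those rules for a single
C$DOACROSS directive; it does not account for any C$MP_SCHEDTYPE or
C$CHUNK directive in scope. All type, constant and function names in it
are hypothetical.

```c
#include <assert.h>

enum mode { MODE_UNSET, MODE_SIMPLE, MODE_DYNAMIC,
            MODE_INTERLEAVE, MODE_GSS, MODE_RUNTIME };

#define CHUNK_UNSET 0   /* marker for "no CHUNK clause was given" */

struct schedule {
    enum mode type;
    int chunk;          /* 0 when the mode does not use CHUNK */
};

/* Apply the defaulting rules to the clauses as written by the user. */
struct schedule effective_schedule(enum mode type, int chunk)
{
    assert(chunk >= 0);
    struct schedule s;

    if (type == MODE_UNSET)
        /* No MP_SCHEDTYPE: CHUNK alone implies DYNAMIC, else SIMPLE. */
        type = (chunk == CHUNK_UNSET) ? MODE_SIMPLE : MODE_DYNAMIC;
    s.type = type;

    if (type == MODE_DYNAMIC || type == MODE_INTERLEAVE)
        s.chunk = (chunk == CHUNK_UNSET) ? 1 : chunk;  /* default CHUNK=1 */
    else
        s.chunk = 0;                                   /* CHUNK is ignored */
    return s;
}
```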
C$&
Occasionally, the clauses in the C$DOACROSS directive are longer than one
line. Use the C$& directive to continue the directive onto multiple
lines.
For example:
C$DOACROSS share(ALPHA, BETA, GAMMA, DELTA,
C$& EPSILON, OMEGA), LASTLOCAL(I, J, K, L, M, N),
C$& LOCAL(XXX1, XXX2, XXX3, XXX4, XXX5, XXX6, XXX7,
C$& XXX8, XXX9)
C$
The C$ directive is treated as a comment line unless multiprocessing is
turned on. When it is, a line beginning with C$ is treated as a
conditionally compiled Fortran statement: the rest of the line contains a
standard Fortran statement, and the C and $ are treated as if they were
blanks. Such lines can be used to insert debugging statements, or an
experienced user can use them to insert arbitrary code into the
multiprocessed version.
C$MP_SCHEDTYPE
The C$MP_SCHEDTYPE=mode directive acts as an implicit MP_SCHEDTYPE clause
for all C$DOACROSS directives in scope. mode is any of the modes listed
under CHUNK and MP_SCHEDTYPE. A C$DOACROSS directive that does not have
an explicit MP_SCHEDTYPE clause is given the value specified in the last
directive prior to the loop, rather than the normal default. If the
C$DOACROSS does have an explicit clause, then the explicit value is used.
C$CHUNK
The C$CHUNK=integer_expression directive affects the CHUNK clause of a
C$DOACROSS in the same way that the C$MP_SCHEDTYPE directive affects the
MP_SCHEDTYPE clause for all C$DOACROSS directives in scope. Both
directives are in effect from the place they occur in the source until
another corresponding directive is encountered or the end of the
procedure is reached.
C$COPYIN
It is occasionally desirable to be able to copy values from the master
thread's version of the COMMON block into the slave thread's version.
The special directive C$COPYIN allows this. It has the form
C$COPYIN item [, item ...]
Each item must be a member of a local COMMON block. It can be a variable,
an array, an individual element of an array, or the entire COMMON block.
Note: The C$COPYIN directive cannot be executed from inside a parallel
region.
OpenMP Support
The -mp flag enables the processing of the parallel (MP) directives,
including the original SGI/PCF directives (described below) as well as
the OpenMP directives. To disable one or the other set use
-MP:old_mp=OFF or -MP:open_mp=OFF. See the -MP option control group.
For more information about OpenMP support in MIPSpro Fortran 77, please
refer to the MIPSpro Fortran 77 Programmer's Guide. For more information
about OpenMP support in MIPSpro Fortran 90, please refer to the MIPSpro 7
Fortran 90 Commands and Directives Reference Manual. For general
information about OpenMP please refer to the following web page:
http://www.openmp.org/
PCF Directives
In addition to the simple loop-level parallelism offered by C$DOACROSS
and the other directives described above, the compiler supports a more
general model of parallelism. This model is based on the work done by the
Parallel Computing Forum (PCF), which itself formed the basis for the
proposed ANSI-X3H5 standard. The compiler supports this model through
compiler directives, rather than extensions to the source language. For
more information about PCF, please refer to Chapter 5 of the MIPSpro
Fortran 77 Programmer's Guide.
The directives can be used in Fortran 77 programs when compiled with the
-mp option.
C$PAR BARRIER
Ensures that each process waits until all processes reach the
barrier before proceeding.
C$PAR [END] CRITICAL SECTION
Ensures that the enclosed block of code is executed by only one
process at a time by using a lock variable.
C$PAR [END] PARALLEL
Encloses a parallel region, which includes work-sharing
constructs and critical sections.
C$PAR PARALLEL DO
Precedes a single DO loop for which separate iterations are
executed by different processes. This directive is equivalent to
the C$DOACROSS directive.
C$PAR [END] PDO
Separate iterations of the enclosed loop are executed by
different processes. This directive must be inside a parallel
region.
C$PAR [END] PSECTION[S]
Parcels out each block of code in turn to a process.
C$PAR SECTION
Signifies a starting line for an individual section within a
parallel section.
C$PAR [END] SINGLE PROCESS
Ensures that the enclosed block of code is executed by exactly
one process.
C$PAR &
Continues a PCF directive onto multiple lines.
Parallel Region
A parallel region encloses any number of PCF constructs. It signifies
the boundary within which slave threads execute. A user program can
contain any number of parallel regions. The syntax of the parallel region
is:
C$PAR PARALLEL [clause [[,] clause]...]
code
C$PAR END PARALLEL
where valid clauses are:
[IF ( logical_expression )]
[{LOCAL | PRIVATE}(item [,item ...])]
[{SHARE | SHARED}(item [,item ...])]
The IF, LOCAL, and SHARED clauses have the same meaning as in the
C$DOACROSS directive.
The preferred form of the directive has no commas between the clauses.
The SHARED clause is preferred over SHARE and LOCAL is preferred over
PRIVATE.
PCF Constructs
The three types of PCF constructs are work-sharing constructs, critical
sections, and barriers. All master and slave threads synchronize at the
bottom of a work-sharing construct. None of the threads continue past the
end of the construct until they all have completed execution within that
construct.
The four work-sharing constructs are: parallel DO, PDO, sections and
single process.
If specified, these constructs (except for the parallel DO construct)
must appear inside of a parallel region. Specifying a parallel DO
construct inside of a parallel region produces a syntax error.
The critical section construct protects a block of code with a lock so
that it is executed by only one thread at a time. Threads do not
synchronize at the bottom of a critical section.
The barrier construct ensures that each process that is executing waits
until all others reach the barrier before proceeding.
Parallel DO
The parallel DO construct is the same as the C$DOACROSS directive and
conceptually the same as a parallel region containing exactly one PDO
construct and no other code. Each thread inside the enclosing parallel
region executes separate iterations of the loop within the parallel DO
construct. The syntax of the parallel DO construct is
C$PAR PARALLEL DO [clause [[,] clause]...]
where the valid clauses are the same as for C$DOACROSS.
For the C$PAR PARALLEL DO directive, MP_SCHEDTYPE= is optional; you can
just specify mode.
PDO
Each thread inside the enclosing parallel region executes a separate
iteration of the loop within the PDO construct. The syntax of the PDO
construct, which can only be specified within a parallel region, is:
C$PAR PDO [clause [[,] clause]...]
code
[C$PAR END PDO [NOWAIT]]
where valid values for clause are
[{LOCAL | PRIVATE} (item[,item ...])]
[{LASTLOCAL | LAST LOCAL} (item[,item ...])]
[(ORDERED)]
[ sched ]
[ chunk ]
LOCAL, LASTLOCAL, sched, and chunk have the same me
|