MP(3C)									MP(3C)


NAME

     mp: mp_block, mp_blocktime, mp_create, mp_destroy,	mp_my_threadnum,
     mp_numthreads, mp_set_numthreads, mp_setup, mp_unblock, mp_setlock,
     mp_suggested_numthreads, mp_unsetlock, mp_barrier,	mp_in_doacross_loop,
     mp_set_slave_stacksize - C	multiprocessing	utility	functions

SYNOPSIS

     void mp_block()

     void mp_unblock()

     void mp_blocktime(iters)
     int iters

     void mp_setup()

     void mp_create(num)
     int num

     void mp_destroy()

     int mp_numthreads()

     void mp_set_numthreads(num)
     int num

     int mp_my_threadnum()

     int mp_is_master()

     void mp_setlock()

     void mp_unsetlock()

     void mp_barrier()

     int mp_in_doacross_loop()

     void mp_set_slave_stacksize(size)
     int size

     unsigned int mp_suggested_numthreads(num)
     unsigned int num

DESCRIPTION

     These routines give some measure of control over the parallelism used in
     C programs.  They should not be needed by most users, but will help to
     tune specific applications.

     mp_block puts all slave threads to	sleep via blockproc(2).	 This frees
     the processors for	use by other jobs.  This is useful if it is known that
     the slaves	will not be needed for some time, and the machine is being
     shared by several users.  Calls to	mp_block may not be nested; a warning
     is	issued if an attempt to	do so is made.

     mp_unblock	wakes up the slave threads that	were previously	blocked	via
     mp_block.	It is an error to unblock threads that are not currently
     blocked; a	warning	is issued if an	attempt	is made	to do so.

     It	is not necessary to explicitly call mp_unblock.	 When a	parallel
     region is entered,	a check	is made, and if	the slaves are currently
     blocked, a	call is	made to	mp_unblock automatically.

     mp_blocktime controls the amount of time a	slave thread waits for work
     before giving up.	When enough time has elapsed, the slave	thread blocks
     itself.  This automatic blocking is independent of	the user level
     blocking provided by the mp_block/mp_unblock calls.  Slave	threads	that
     have blocked themselves will be automatically unblocked upon entering a
     parallel region.  The argument to mp_blocktime is the number of times to
     spin in the wait loop.  By	default, it is set to 10,000,000.  This	takes
     about 0.25 seconds on a 200MHz processor.  As a special case, an argument
     of	0 disables the automatic blocking, and the slaves will spin wait
     without limit.  The environment variable MP_BLOCKTIME may be set to an
     integer value.  It	acts like an implicit call to mp_blocktime during
     program startup.
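
     As a rough guide, the figures above can be turned into an estimate of
     the argument needed for a given wait.  The sketch below assumes the spin
     count scales linearly with time and with clock rate, which this page does
     not promise; blocktime_iters is a hypothetical helper, not a library
     routine.

```c
#include <assert.h>

/* Reference point taken from the text: 10,000,000 spins take about
 * 0.25 seconds (250 ms) on a 200MHz processor. */
#define REF_ITERS 10000000LL
#define REF_MS    250LL
#define REF_MHZ   200LL

/* Hypothetical helper: estimate an mp_blocktime argument for a wait of
 * `millis` milliseconds on a processor clocked at `mhz` MHz.
 * Linear scaling is an assumption, so treat the result as approximate. */
long long blocktime_iters(long long millis, long long mhz)
{
    assert(millis >= 0 && mhz > 0);
    return (REF_ITERS / REF_MS) * millis * mhz / REF_MHZ;
}
```

     On IRIX the result might then be passed along as
     mp_blocktime((int) blocktime_iters(500, 200)).  An argument of 0 has the
     special meaning described above, so it should not be produced by
     accident.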

     mp_destroy	deletes	the slave threads.  They are stopped by	forcing	them
     to	call exit(2).  In general, doing this is discouraged.  mp_block	can be
     used in most cases.

     mp_create creates and initializes threads.	 It creates enough threads so
     that the total number is equal to the argument.  Since the	calling	thread
     already counts as one, mp_create will create one less than	its argument
     in	new slave threads.

     mp_setup also creates and initializes threads.  It	takes no arguments.
     It	simply calls mp_create using the current default number	of threads.
     Normally the default number is equal to the number of cpus currently on
     the machine.  If the user has not called either of	the thread creation
     routines already, then mp_setup is	invoked	automatically when the first
     parallel region is	entered.  If the environment variable MP_SETUP is set,
     then mp_setup is called during initialization, before any user code is
     executed.

     mp_numthreads returns the number of threads that would participate	in an
     immediately following parallel region.  If	the threads have already been
     created, then it returns the current number of threads.  If the threads
     have not been created, then it returns the	current	default	number of
     threads.  The count includes the master thread. Knowing this count	can be
     useful in optimizing certain kinds	of parallel loops by hand, but this
     function has the side-effect of freezing the number of threads to the
     returned value.  As a result, this	routine	should be used sparingly. To
     determine the number of threads without this side-effect, see the
     description of mp_suggested_numthreads below.

     mp_set_numthreads sets the	current	default	number of threads to the
     specified value.  Note that this call does not directly create the
     threads; it only specifies the number that a subsequent mp_setup call
     should use.  If the environment variable MP_SET_NUMTHREADS	is set,	it
     acts like an implicit call	to mp_set_numthreads during program startup.
     For convenience when operating among several machines with	different
     numbers of	cpus, MP_SET_NUMTHREADS	may be set to an expression involving
     integer literals, the binary operators + and -, the binary	functions min
     and max, and the special symbolic value ALL which stands for "the total
     number of available cpus on the current machine."	Thus, something	simple
     like
		 setenv	MP_SET_NUMTHREADS 7
     would set the number of threads to	seven.	This may be a fine choice on
     an	8 cpu machine, but would be very bad on	a 4 cpu	machine.  Instead, use
     something like
		 setenv	MP_SET_NUMTHREADS "max(1,all-1)"
     which sets	the number of threads to be one	less than the number of	cpus
     on	the current machine (but always	at least one).	If your	configuration
     includes some machines with large numbers of cpus,	setting	an upper bound
     is	a good idea.  Something	like:
		 setenv	MP_SET_NUMTHREADS "min(all,4)"
     will request (no more than) 4 cpus.
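
     The two expressions above can be checked by hand for a few machine
     sizes.  The helper names below are invented for this illustration; only
     the arithmetic comes from the text.

```c
#include <assert.h>

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Value of "max(1,all-1)" when ALL stands for ncpus. */
int threads_leave_one_free(int ncpus)
{
    assert(ncpus > 0);
    return imax(1, ncpus - 1);
}

/* Value of "min(all,4)" when ALL stands for ncpus. */
int threads_capped_at_four(int ncpus)
{
    assert(ncpus > 0);
    return imin(ncpus, 4);
}
```

     With 1, 4, and 8 cpus, "max(1,all-1)" gives 1, 3, and 7 threads; with 2
     and 16 cpus, "min(all,4)" gives 2 and 4.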

     For compatibility with earlier releases, NUM_THREADS is supported as a
     synonym for MP_SET_NUMTHREADS.

     mp_my_threadnum returns an	integer	between	0 and n-1 where	n is the value
     returned by mp_numthreads.	 The master process is always thread 0.	 This
     is	occasionally useful for	optimizing certain kinds of loops by hand.

     mp_is_master returns 1 if called by the master process, 0 otherwise.

     mp_setlock	provides convenient (though limited) access to the locking
     routines.	The convenience	is that	no set up need be done;	it may be
     called directly without any preliminaries.	 The limitation	is that	there
     is	only one lock.	It is analogous	to the ussetlock(3P) routine, but it
     takes no arguments	and does not return a value.  This is useful for
     serializing access	to shared variables (e.g.  counters) in	a parallel
     region.  Note that	it will	frequently be necessary	to declare those
     variables as volatile to ensure that the optimizer	does not assign	them
     to	a register.

     mp_unsetlock is the companion routine for mp_setlock.  It also takes no
     arguments and does	not return a value.
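
     A minimal sketch of serializing updates to a shared counter with the
     single lock.  It assumes the __sgi macro identifies an IRIX compiler; on
     other systems it substitutes empty stand-ins so that it still builds,
     without any real locking.  record_hits, total_hits, and hits are
     hypothetical names.

```c
#include <assert.h>

#ifdef __sgi
/* On IRIX these are supplied by the multiprocessing library. */
extern void mp_setlock(void);
extern void mp_unsetlock(void);
#else
/* Stand-ins so the sketch builds elsewhere; they perform no locking. */
static void mp_setlock(void)   { }
static void mp_unsetlock(void) { }
#endif

/* Shared counter; volatile so the optimizer does not keep it in a register. */
static volatile long hits = 0;

/* Intended to be called by every thread inside a parallel region. */
void record_hits(long n)
{
    assert(n >= 0);
    mp_setlock();        /* one thread at a time beyond this point */
    hits = hits + n;     /* read-modify-write on the shared counter */
    mp_unsetlock();
}

/* Read the counter, for example after the parallel region has ended. */
long total_hits(void)
{
    return hits;
}
```

     Because there is only one lock, every caller of mp_setlock in the
     program contends for it, so the protected section should be kept short.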

     mp_barrier	provides a simple interface to a single	barrier(3P).  It may
     be	used inside a parallel loop to force a barrier synchronization to
     occur among the parallel threads.	The routine takes no arguments,
     returns no	value, and does	not require any	initialization.

     mp_in_doacross_loop answers the question "am I currently executing inside
     a parallel loop?"  This is needed in certain rare situations where you
     have an external routine that can be called both from inside a parallel
     loop and also from	outside	a parallel loop, and the routine must do
     different things depending	on whether it is being called in parallel or
     not.

     mp_set_slave_stacksize sets the stacksize (in bytes) to be	used by	the
     slave processes when they are created (via	sprocsp(2)).  The default size
     is 16MB.  Note that slave processes allocate only their local data on
     their stack; shared data (even if allocated on the master's stack) is not
     counted.

     mp_suggested_numthreads uses the supplied value as	a hint about how many
     threads to	use in subsequent parallel regions, and	returns	the previous
     value of the number of threads to be employed in parallel regions.	It
     does not affect currently executing parallel regions, if any. The
     implementation may	ignore this hint depending on factors such as overall
     system load.  This	routine	may also be called with	the value 0, in	which
     case it simply returns the	number of threads to be	employed in parallel
     regions without the side-effect present in	mp_numthreads.

     Pragmas or	directives

     The MIPSpro C (and	C++) compiler allows you to apply the capabilities of
     a Silicon Graphics	multiprocessor computer	to the execution of a single
     job. By coding a few simple directives, the compiler splits the job into
     concurrently executing pieces, thereby decreasing the wall-clock run time
     of	the job.

     Directives	enable,	disable, or modify a feature of	the compiler.
     Essentially, directives are command line options specified	within the
     input file	instead	of on the command line.	Unlike command line options,
     directives	have no	default	setting. To invoke a directive,	you must
     either toggle it on or set	a desired value	for its	level.	The following
     directives	can be used in C (and C++) programs when compiled with the -mp
     option.


     #pragma parallel

	 This pragma denotes the start of a parallel region. The syntax	for
	 this pragma has a number of modifiers,	but to run a single loop in
	 parallel, the only modifiers you usually use are shared and local.
	 These options tell the	multiprocessing	compiler which variables to
	 share between all threads of execution	and which variables should be
	 treated as local.

	 In C, the code	that comprises the parallel region is delimited	by
	 curly braces ({ }) and	immediately follows the	parallel pragma	and
	 its modifiers.

	 The syntax for	this pragma is:

	 #pragma parallel shared (variables)
	 #pragma local (variables) optional modifiers
	 {code}

	 The parallel pragma has four modifiers: shared, local,	if, and
	 numthreads.

	 Their definitions are:

	     shared ( variable_names )

	     Tells the multiprocessing C compiler the names of all the
	     variables that the	threads	must share.

	     local ( variable_names )

	     Tells the multiprocessing C compiler the names of all the
	     variables that must be private to each thread. (When PCA sets up
	     a parallel	region,	it does	this for you.)

	     if	( integer_valued_expr )

	     Lets you set up a condition that is evaluated at run time to
	     determine whether or not to run the statement(s) serially or in
	     parallel. At compile time,	it is not always possible to judge how
	     much work a parallel region does (for example, loop indices are
	     often calculated from data	supplied at run	time). Avoid running
	     trivial amounts of	code in	parallel because you cannot make up
	     the overhead associated with running code in parallel. PCA	will
	     also generate this	condition as appropriate.  If the if condition
	     is false (equal to zero), then the statement(s) run serially.
	     Otherwise,	the statement(s) run in	parallel.

	     numthreads(expr)

	     Tells the multiprocessing C compiler the number of	available
	     threads to	use when running this region in	parallel. (The default
	     is	all the	available threads.)

	     In	general, you should never have more threads of execution than
	     you have processors, and you should specify  numthreads with the
	     MP_SET_NUMTHREADS environment variable at run time.  If you want
	     to	run a loop in parallel while you run some other	code, you can
	     use this option to	tell the multiprocessing C compiler to use
	     only some of the available	threads.

	     The expression expr should	evaluate to a positive integer.

	     For example, to start a parallel region in	which to run the
	     following code in parallel:

	     for (idx=n; idx; idx--) {
		a[idx] = b[idx] + c[idx];
	     }

	     you must write:

	     #pragma parallel shared( a, b, c )	shared(n) local( idx )

	     or:

	     #pragma parallel
	     #pragma shared( a, b, c )
	     #pragma shared(n)
	     #pragma local(idx)

	     before the	statement or compound statement	(code in curly braces,
	     { }) that comprises the parallel region.

	     Any code within a parallel	region but not within any of the
	     explicit parallel constructs ( pfor, independent, one processor,
	     and critical ) is termed local code. Local	code typically
	     modifies only local data and is run by all	threads.


     #pragma pfor

	 The pfor is contained within a	parallel region.  Use #pragma pfor to
	 run a for loop	in parallel only if the	loop meets all of these
	 conditions:

	     All the values of the index variable can be computed
	     independently of the iterations.

	     All iterations are	independent of each other - that is, data used
	     in	one iteration does not depend on data created by another
	     iteration.	A quick	test for independence: if the loop can be run
	     backwards,	then chances are good the iterations are independent.

	     The loop control variable cannot be a field within	a
	     class/struct/union	or an array element.

	     The number	of times the loop must be executed is determined once,
	     upon entry	to the loop, and is based on the loop initialization,
	     loop test,	and loop increment statements.

	     If	the number of times the	loop is	actually executed is different
	     from what is computed above, the results are unpredictable. This
	     can happen	if the loop test and increment change during the
	     execution of the loop, or if there	is an early exit from within
	     the for loop. An early exit or a change to	the loop test and
	     increment during execution	may have serious performance
	     implications.

	     The test or the increment should not contain expressions with
	     side effects.

	     The chunksize, if specified, is computed before the loop is
	     executed, and the behavior	is unpredictable if its	value changes
	     within the	loop.

	     If	you are	writing	a pfor loop for	the multiprocessing C++
	     compiler, the index variable i can	be declared within the for
	     statement via

	     int i = 0;

	     The draft for the C++ standard states that	the scope of the index
	     variable declared in a for	statement extends to the end of	the
	     for statement, as in this example:

	     #pragma pfor for (int i = 0, ...)

	     The C++ compiler doesn't enforce this; in fact, with this
	     compiler the scope	extends	to the end of the enclosing block. Use
	     care when writing code so that the	subsequent change in scope
	     rules for i (in later compiler releases) do not affect the	user
	     code.

	 If the	code after a pfor is not dependent on the calculations made in
	 the pfor loop,	there is no reason to synchronize the threads of
	 execution before they continue. So, if	one thread from	the pfor
	 finishes early, it can	go on to execute the serial code without
	 waiting for the other threads to finish their part of the loop.

	 The #pragma pfor directive takes several modifiers; the only one that
	 is required is	iterate. #pragma pfor tells the	compiler that each
	 iteration of the loop is unique.  It also partitions the iterations
	 among the threads for execution.

	 The syntax for	#pragma	pfor is:

	 #pragma pfor iterate (	) optional_modifiers
	 for ...
	    { code ... }
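
	 As an illustration of the syntax above, the following sketch places a
	 pfor inside a parallel region to add two arrays.  The function name and
	 argument list are assumptions made for the example; iterate is given
	 the index variable, its starting value, the number of iterations, and
	 the increment.  It has not been checked against a MIPSpro compiler.

```c
#include <assert.h>

/* Adds b and c element by element into a.  A compiler that honors these
 * pragmas divides the loop iterations among the threads; other compilers
 * ignore the pragmas and run the loop serially with the same result. */
void vector_add(double *a, const double *b, const double *c, int n)
{
    int i;

    assert(n >= 0);
#pragma parallel shared(a, b, c, n) local(i)
    {
#pragma pfor iterate(i = 0; n; 1)
        for (i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }
}
```

	 The loop qualifies for pfor because each iteration writes a different
	 element of a and reads nothing written by another iteration.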

	 The pfor pragma has several modifiers.	Their syntax is:

	 iterate (index	variable=expr1;	expr2; expr3 )
	 local(variable	list)
	 lastlocal (variable list)
	 reduction (variable list)
	 affinity (variable) = thread (expression)
	 schedtype (type)
	 chunksize (expr)

	 Where:

	     iterate (index variable=expr1; expr2; expr3 )

	     Gives the multiprocessing C compiler the information it needs to
	     identify the unique iterations of the loop	and partition them to
	     particular	threads	of execution.

		 index variable	is the index variable of the for loop you want
		 to run	in parallel.

		 expr1 is the starting value for the loop index.

		 expr2 is the number of	iterations for the loop	you want to
		 run in	parallel.

		 expr3 is the increment	of the for loop	you want to run	in
		 parallel.

	     local (variable list)

	     Specifies variables that are local	to each	process. If a variable
	     is	declared as local, each	iteration of the loop is given its own
	     uninitialized copy	of the variable. You can declare a variable as
	     local if its value	does not depend	on any other iteration of the
	     loop and if its value is used only	within a single	iteration. In
	     effect the	local variable is just temporary; a new	copy can be
	     created in	each loop iteration without changing the final answer.

	     lastlocal (variable list)

	     Specifies variables that are local	to each	process. Unlike	with
	     the local clause, the compiler saves only the value of the
	     logically last iteration of the loop when it exits.

	     reduction (variable list)

	     Specifies variables involved in a reduction operation. In a
	     reduction operation, the compiler keeps local copies of the
	     variables and combines them when it exits the loop. An element of
	     the reduction list	must be	an individual variable (also called a
	     scalar variable) and cannot be an array or	struct.	However, it
	     can be an individual element of an	array. When the	reduction
	     modifier is used, it appears in the list with the correct
	     subscripts.

	     One element of an array can be used in a reduction	operation,
	     while other elements of the array are used	in other ways. To
	     allow for this, if	an element of an array appears in the
	     reduction list, the entire array can also appear in the shared
	     list.

	     The two types of reductions supported are sum(+) and product(*).

	     The compiler confirms that	the reduction expression is legal by
	     making some simple	checks.	The compiler does not, however,	check
	     all statements in the for loop for illegal reductions. You must
	     ensure that the reduction variable	is used	correctly in a
	     reduction operation.
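
	     The bookkeeping behind a sum reduction (one local copy per thread,
	     combined when the loop exits) can be imitated in plain, serial C.
	     The sketch below is only a model of that idea, not what the
	     compiler generates; reduce_sum and MAX_THREADS are names invented
	     for the illustration, and the even block partition is an
	     assumption.

```c
#include <assert.h>

#define MAX_THREADS 64

/* Model of a sum reduction: each "thread" t accumulates into its own
 * partial[t], and the partial results are combined after the loop. */
long reduce_sum(const int *data, int n, int nthreads)
{
    long partial[MAX_THREADS];
    long sum = 0;
    int t, i;

    assert(n >= 0 && nthreads > 0 && nthreads <= MAX_THREADS);

    for (t = 0; t < nthreads; t++) {
        /* Thread t handles iterations lo .. hi-1. */
        int lo = (int)((long)n * t / nthreads);
        int hi = (int)((long)n * (t + 1) / nthreads);

        partial[t] = 0;
        for (i = lo; i < hi; i++)
            partial[t] += data[i];
    }

    /* Combine step, performed once when the loop exits. */
    for (t = 0; t < nthreads; t++)
        sum += partial[t];
    return sum;
}
```

	     For integer sums the result is the same for any number of threads.
	     Floating-point sums may differ slightly because the order of the
	     additions changes.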

	     affinity (variable) = thread (expression)

	     The effect	of thread-affinity is to execute iteration "i" on the
	     thread number given by the	user-supplied expression (modulo the
	     number of threads). Since the threads may need to evaluate	this
	     expression	in each	iteration of the loop, the variables used in
	     the expression (other than	the loop induction variable) must be
	     declared shared and must not be modified during the execution of
	     the loop. Violating these rules may lead to incorrect results.

	     If	the expression does not	depend on the loop induction variable,
	     then all iterations will execute on the same thread, and will not
	     benefit from parallel execution.
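
	     The mapping from iteration to thread can be written out directly.
	     This is a sketch of the rule as stated above, restricted to
	     non-negative expression values because the text does not say how
	     negative values are treated; affinity_thread is a hypothetical
	     name.

```c
#include <assert.h>

/* Model of affinity(i) = thread(expr): the iteration runs on the thread
 * whose number is the expression value modulo the number of threads. */
int affinity_thread(int expr_value, int nthreads)
{
    assert(expr_value >= 0 && nthreads > 0);
    return expr_value % nthreads;
}
```

	     With four threads and the expression i/2, iterations 4 and 5 run
	     on thread 2, and iterations 8 and 9 wrap around to thread 0.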

	     schedtype (type)

	     Tells the multiprocessing C compiler how to share the loop
	     iterations	among the processors. The schedtype chosen depends on
	     the type of system	you are	using and the number of	programs
	     executing.	 You can use the following valid types to modify
	     schedtype:

		 simple	(the default)

		 Tells the run time scheduler to partition the iterations
		 evenly	among all the available	threads.

		 runtime

		 Tells the compiler that the real schedule type	will be
		 specified at run time.

		 dynamic

		 Tells the run time scheduler to give each thread chunksize
		 iterations of the loop. chunksize should be smaller than
		 (number of total iterations)/(number of threads). The
		 advantage of dynamic over simple is that dynamic helps
		 distribute the	work more evenly than simple.

		 Depending on the data,	some iterations	of a loop can take
		 longer	to compute than	others,	so some	threads	may finish
		 long before the others.  In this situation, if	the iterations
		 are distributed by simple, then the thread waits for the
		 others. But if	the iterations are distributed by dynamic, the
		 thread doesn't wait, but goes back to get another chunksize
		 iterations, until the threads of execution have run all the
		 iterations of the loop.

		 interleave

		 Tells the run time scheduler to give each thread chunksize
		 iterations (described below) of the loop, which are then
		 assigned to the threads in an interleaved way.

		 gss (guided self-scheduling)

		 Tells the run time scheduler to give each processor a varied
		 number of iterations of the loop. This is like dynamic, but
		 instead of a fixed chunksize, the pieces start out large and
		 become smaller as the loop progresses.

		 If I iterations remain	and P threads are working on them, the
		 piece size is roughly:	 I/(2P)	+ 1

		 Programs with triangular matrices should use gss.
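
	     The gss piece-size rule can be traced with a few lines of C.  The
	     sketch below applies the formula I/(2P) + 1 repeatedly, using
	     integer division; gss_piece and gss_schedule are hypothetical
	     names, and the real scheduler may round differently.

```c
#include <assert.h>

/* Piece size from the text: with `remaining` iterations left and
 * `nthreads` threads working, roughly remaining/(2*nthreads) + 1. */
int gss_piece(int remaining, int nthreads)
{
    assert(remaining > 0 && nthreads > 0);
    return remaining / (2 * nthreads) + 1;
}

/* Fill sizes[] with successive piece sizes until all `total` iterations
 * are handed out or max_pieces is reached; returns the piece count. */
int gss_schedule(int total, int nthreads, int *sizes, int max_pieces)
{
    int count = 0;

    assert(total >= 0 && nthreads > 0);
    while (total > 0 && count < max_pieces) {
        int piece = gss_piece(total, nthreads);

        if (piece > total)
            piece = total;      /* never hand out more than is left */
        sizes[count++] = piece;
        total -= piece;
    }
    return count;
}
```

	     With 100 iterations and 4 threads, the first pieces are 13, 11,
	     and 10 iterations, and the final pieces are single iterations.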

	     chunksize (expr)

	     Tells the multiprocessing C/C++ compiler how many iterations to
	     define as a chunk when you use the dynamic or interleave
	     schedtype (described above).

	     expr should be a positive integer, and should evaluate to
	     approximately:

		  number of iterations / X

	     where X is between twice and ten times the number of threads.
	     Select twice the number of threads when iterations vary
	     slightly. Reduce the chunk size to reflect the increasing
	     variance in the iterations.  Performance gains may diminish
	     after increasing X to ten times the number of threads.
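
	     The chunk-size guideline, and one plausible reading of how
	     interleave deals chunks out, can be sketched as follows.  Both
	     function names are hypothetical.  The interleave mapping (chunk k
	     goes to thread k modulo the number of threads, counting
	     iterations from zero) is an assumption, not something this page
	     states.

```c
#include <assert.h>

/* Chunk size from the text: iterations / X, where X is `factor` times
 * the number of threads and factor runs from 2 to 10. */
int suggested_chunksize(int iterations, int nthreads, int factor)
{
    int chunk;

    assert(iterations > 0 && nthreads > 0);
    assert(factor >= 2 && factor <= 10);
    chunk = iterations / (factor * nthreads);
    return chunk > 0 ? chunk : 1;   /* chunksize must stay positive */
}

/* Assumed model of schedtype(interleave): the thread that runs the
 * i-th iteration (counted from zero) for a given chunk size. */
int interleave_owner(int i, int chunk, int nthreads)
{
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return (i / chunk) % nthreads;
}
```

	     For 10000 iterations on 4 threads this gives chunk sizes from
	     1250 (X = 8) down to 250 (X = 40).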


     #pragma one processor

	 A #pragma one processor directive causes the statement	that follows
	 it to be executed by exactly one thread.

	 The syntax of this pragma is:

	 #pragma one processor

	 { code	}


     #pragma critical

	 Sometimes the bulk of the work	done by	a loop can be done in
	 parallel, but the entire loop cannot run in parallel because of a
	 single	data-dependent statement. Often, you can move such a statement
	 out of	the parallel region.  When that	is not possible, you can
	 sometimes use a lock on the statement to preserve the integrity of
	 the data.

	 In the	multiprocessing	C/C++ compiler,	use the	critical pragma	to put
	 a lock	on a critical statement	(or compound statement using { }).
	 When you put a	lock on	a statement, only one thread at	a time can
	 execute that statement.  If one thread	is already working on a
	 critical protected statement, any other thread	that wants to execute
	 that statement	must wait until	that thread has	finished executing it.

	 The syntax of the critical pragma is:

	 #pragma critical (lock_variable)

	 { code	}

	 The statement(s) after	the critical pragma will be executed by	all
	 threads, one at a time. The lock variable lock_variable is an
	 optional integer variable that	must be	initialized to zero. The
	 parentheses are required. If you don't	specify	a lock variable, the
	 compiler automatically	supplies one.  Multiple	critical constructs
	 inside	the same parallel region are considered	to be independent of
	 each other unless they	use the	same explicit lock variable.


     #pragma independent

	 Running a loop	in parallel is a class of parallelism sometimes	called
	 fine-grained parallelism or homogeneous parallelism. It is called
	 homogeneous because all the threads execute the same code on
	 different data.  Another class	of parallelism is called coarse-
	 grained parallelism or	heterogeneous parallelism. As the name
	 suggests, the code in each thread of execution	is different.

	 Ensuring data independence for	heterogeneous code executed in
	 parallel is not always	as easy	as it is for homogeneous code executed
	 in parallel.  (Ensuring data independence for homogeneous code	is not
	 a trivial task.)

	 The independent pragma	has no modifiers. Use this pragma to tell the
	 multiprocessing C/C++ compiler	to run code in parallel	with the rest
	 of the	code in	the parallel region.

	 The syntax for	#pragma	independent is:

	 #pragma independent

	 { code	}


     Synchronization Directives

     To	account	for data dependencies, it is sometimes necessary for threads
     to	wait for all other threads to complete executing an earlier section of
     code.  Two	sets of	directives implement this coordination:	#pragma
     synchronize and #pragma enter/exit	gate.


     #pragma synchronize

	  A #pragma synchronize	tells the multiprocessing C/C++	compiler that
	  within a parallel region, no thread can execute the statements that
	  follow this pragma until all threads have reached it. This
	  directive is a classic barrier construct.

	  The syntax for this pragma is:

	  #pragma synchronize



     #pragma enter gate

	  #pragma exit gate

	  You can use two additional pragmas to	coordinate the processing of
	  code within a	parallel region. These additional pragmas work as a
	  matched set.	They are #pragma enter gate and	#pragma	exit gate.

	  A gate is a special barrier. No thread can exit the gate until all
	  threads have entered it. This	construct gives	you more flexibility
	  when managing	dependencies between the work-sharing constructs
	  within a parallel region.

	  The syntax of	the enter gate pragma is:

	  #pragma enter	gate

	  For example, construct D may be dependent on construct A, and
	  construct F may be dependent on construct B. However,	you do not
	  want to stop at construct D because all the threads have not cleared
	  B. By	using enter/exit gate pairs, you can make subtle distinctions
	  about	which construct	is dependent on	which other construct.

	  Put this pragma after	the work-sharing construct that	all threads
	  must clear before the	#pragma	exit gate of the same name.

	  The syntax of	the exit gate pragma is:

	  #pragma exit gate

	  Put this pragma before the work-sharing construct that is dependent
	  on the preceding #pragma enter gate. No thread enters this
	  work-sharing construct until all threads have cleared the
	  work-sharing construct controlled by the corresponding #pragma enter
	  gate.


     #pragma page_place

	  The syntax of	this pragma is:

	  #pragma page_place (addr, size, threadnum)

	  where	addr is	the starting address, size is the size in bytes, and
	  threadnum is the thread.

	  On a system with physically distributed shared memory (for example,
	  Origin2000), you can explicitly place all data pages spanned by the
	  virtual address range	[addr, addr + size-1] in the physical memory
	  of the processor corresponding to the	specified thread.
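
	  Because placement works on whole pages, a range that is not page
	  aligned touches more pages than its size alone suggests.  The sketch
	  below counts the pages spanned by [addr, addr + size-1]; the page
	  size is passed in as a parameter because this page does not state
	  it, and pages_spanned is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by the byte range [addr, addr + size - 1]
 * for a caller-supplied page size. */
size_t pages_spanned(uintptr_t addr, size_t size, size_t pagesize)
{
    uintptr_t first, last;

    assert(size > 0 && pagesize > 0);
    first = addr / pagesize;                /* page holding the first byte */
    last  = (addr + size - 1) / pagesize;   /* page holding the last byte */
    return (size_t)(last - first + 1);
}
```

	  Assuming 16384-byte pages, a 16384-byte range starting at address 0
	  spans one page, while the same range starting at address 1 spans
	  two.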

SEE ALSO

      
      
     cc(1), f77(1), mp(3f), sync(3c), sync(3f),	MIPSpro	Power C	Programmer's
     Guide, MIPSpro C Language Reference Manual, MIPSpro FORTRAN 77
     Programmer's Guide

MP(3F)									MP(3F)


NAME

     mp: mp_block, mp_blocktime, mp_create, mp_destroy,	mp_my_threadnum,
     mp_numthreads, mp_set_numthreads, mp_setup, mp_unblock, mp_setlock,
     mp_suggested_numthreads, mp_unsetlock, mp_barrier, mp_in_doacross_loop,
     mp_set_slave_stacksize - FORTRAN multiprocessing utility routines

SYNOPSIS

     subroutine	mp_block()

     subroutine	mp_unblock()

     subroutine	mp_blocktime(iters)
     integer iters

     subroutine	mp_setup()

     subroutine	mp_create(num)
     integer num

     subroutine	mp_destroy()

     integer function mp_numthreads()

     subroutine	mp_set_numthreads(num)
     integer num

     integer function mp_my_threadnum()

     integer function mp_is_master()

     subroutine	mp_setlock()

     integer function mp_suggested_numthreads(num)
     integer num

     subroutine	mp_unsetlock()

     subroutine	mp_barrier()

     logical function mp_in_doacross_loop()

     subroutine	mp_set_slave_stacksize(size)
     integer size

DESCRIPTION

     These routines give some measure of control over the parallelism used in
     FORTRAN jobs.  They should	not be needed by most users, but will help to
     tune specific applications.

     mp_block puts all slave threads to	sleep via blockproc(2).	 This frees
     the processors for	use by other jobs.  This is useful if it is known that
     the slaves	will not be needed for some time, and the machine is being
     shared by several users.  Calls to	mp_block may not be nested; a warning
     is	issued if an attempt to	do so is made.

     mp_unblock	wakes up the slave threads that	were previously	blocked	via
     mp_block.	It is an error to unblock threads that are not currently
     blocked; a	warning	is issued if an	attempt	is made	to do so.

     It	is not necessary to explicitly call mp_unblock.	 When a	FORTRAN
     parallel region is	entered, a check is made, and if the slaves are
     currently blocked,	a call is made to mp_unblock automatically.

     mp_blocktime controls the amount of time a	slave thread waits for work
     before giving up.	When enough time has elapsed, the slave	thread blocks
     itself.  This automatic blocking is independent of	the user level
     blocking provided by the mp_block/mp_unblock calls.  Slave	threads	that
     have blocked themselves will be automatically unblocked upon entering a
     parallel region.  The argument to mp_blocktime is the number of times to
     spin in the wait loop.  By	default, it is set to 10,000,000.  This	takes
     about 0.25 seconds on a 200MHz processor.  As a special case, an argument
     of	0 disables the automatic blocking, and the slaves will spin wait
     without limit.  The environment variable MP_BLOCKTIME may be set to an
     integer value.  It	acts like an implicit call to mp_blocktime during
     program startup.

     mp_destroy	deletes	the slave threads.  They are stopped by	forcing	them
     to	call exit(2).  In general, doing this is discouraged.  mp_block	can be
     used in most cases.

     mp_create creates and initializes threads.	 It creates enough threads so
     that the total number is equal to the argument.  Since the	calling	thread
     already counts as one, mp_create will create one less than	its argument
     in	new slave threads.

     mp_setup also creates and initializes threads.  It	takes no arguments.
     It	simply calls mp_create using the current default number	of threads.
     Unless otherwise specified, the default number is equal to	the number of
     cpus currently on the machine, or 8, whichever is less.  If the user has
     not called	either of the thread creation routines already,	then mp_setup
     is	invoked	automatically when the first parallel region is	entered.  If
     the environment variable MP_SETUP is set, then mp_setup is	called during
     FORTRAN initialization, before any	user code is executed.

     mp_numthreads returns the number of threads that would participate	in an
     immediately following parallel region.  If	the threads have already been
     created, then it returns the current number of threads.  If the threads
     have not been created, then it returns the	current	default	number of
     threads.  The count includes the master thread.  Knowing this count can
     be	useful in optimizing certain kinds of parallel loops by	hand, but this
     function has the side-effect of freezing the number of threads to the
     returned value.  As a result, this routine should be used sparingly.  To
     determine the number of threads without this side-effect, see the
     description of mp_suggested_numthreads below.

     mp_set_numthreads sets the	current	default	number of threads to the
     specified value.  Note that this call does	not directly create the
     threads, it only specifies	the number that	a subsequent mp_setup call
     should use.  If the environment variable MP_SET_NUMTHREADS	is set,	it
     acts like an implicit call	to mp_set_numthreads during program startup.
     For convenience when operating among several machines with	different
     numbers of	cpus, MP_SET_NUMTHREADS	may be set to an expression involving
     integer literals, the binary operators + and -, the binary	functions min
     and max, and the special symbolic value ALL which stands for "the total
     number of available cpus on the current machine."	Thus, something	simple
     like
		 setenv	MP_SET_NUMTHREADS 7
     would set the number of threads to	seven.	This may be a fine choice on
     an	8 cpu machine, but would be very bad on	a 4 cpu	machine.  Instead, use
     something like
		 setenv	MP_SET_NUMTHREADS "max(1,all-1)"
     which sets	the number of threads to be one	less than the number of	cpus
     on	the current machine (but always	at least one).	If your	configuration
     includes some machines with large numbers of cpus,	setting	an upper bound
     is	a good idea.  Something	like:
		 setenv	MP_SET_NUMTHREADS "min(all,4)"
     will request (no more than) 4 cpus.
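
     The expression language described above is small enough to model with
     a recursive-descent evaluator.  The following is a minimal sketch in
     portable C, not the parser used by the runtime library: the grammar is
     inferred from the description, the function name eval_numthreads is
     hypothetical, and case-insensitive keywords are an assumption (the
     text writes ALL while the examples write all).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parser state: current position, the value substituted for ALL, and an
 * error flag. */
typedef struct { const char *p; long all; int err; } Parser;

static long parse_expr(Parser *ps);

static void skip_ws(Parser *ps) {
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

/* Case-insensitive match of a lowercase keyword at the current position. */
static int accept_word(Parser *ps, const char *word) {
    size_t n = strlen(word);
    skip_ws(ps);
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)ps->p[i]) != word[i]) return 0;
    if (isalnum((unsigned char)ps->p[n])) return 0;
    ps->p += n;
    return 1;
}

static int accept_char(Parser *ps, char c) {
    skip_ws(ps);
    if (*ps->p != c) return 0;
    ps->p++;
    return 1;
}

/* term := integer | ALL | min(expr, expr) | max(expr, expr) */
static long parse_term(Parser *ps) {
    skip_ws(ps);
    if (isdigit((unsigned char)*ps->p)) {
        char *end;
        long v = strtol(ps->p, &end, 10);
        ps->p = end;
        return v;
    }
    if (accept_word(ps, "all")) return ps->all;
    int is_min = accept_word(ps, "min");
    if (is_min || accept_word(ps, "max")) {
        long a, b;
        if (!accept_char(ps, '(')) { ps->err = 1; return 0; }
        a = parse_expr(ps);
        if (!accept_char(ps, ',')) { ps->err = 1; return 0; }
        b = parse_expr(ps);
        if (!accept_char(ps, ')')) { ps->err = 1; return 0; }
        if (is_min) return a < b ? a : b;
        return a > b ? a : b;
    }
    ps->err = 1;
    return 0;
}

/* expr := term { ('+' | '-') term } */
static long parse_expr(Parser *ps) {
    long v = parse_term(ps);
    while (!ps->err) {
        if (accept_char(ps, '+')) v += parse_term(ps);
        else if (accept_char(ps, '-')) v -= parse_term(ps);
        else break;
    }
    return v;
}

/* Evaluates text with ALL bound to ncpus.  Returns -1 when the text is
 * not a valid expression (a sketch-level convention; a valid expression
 * that evaluates to -1 is indistinguishable from an error). */
long eval_numthreads(const char *text, long ncpus) {
    assert(text != NULL);
    Parser ps = { text, ncpus, 0 };
    long v = parse_expr(&ps);
    skip_ws(&ps);
    if (ps.err || *ps.p != '\0') return -1;
    return v;
}
```

     With this sketch, "max(1,all-1)" evaluates to 7 on an 8 cpu machine
     and to 1 on a single cpu machine, and "min(all,4)" evaluates to 4 on a
     16 cpu machine.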

     For compatibility with earlier releases, NUM_THREADS is supported as a
     synonym for MP_SET_NUMTHREADS.

     mp_my_threadnum returns an	integer	between	0 and n-1 where	n is the value
     returned by mp_numthreads.	 The master process is always thread 0.	 This
     is	occasionally useful for	optimizing certain kinds of loops by hand.

     mp_is_master returns 1 if called by the master process, 0 otherwise.

     mp_setlock	provides convenient (though limited) access to the locking
     routines.	The convenience	is that	no set up need be done;	it may be
     called directly without any preliminaries.	 The limitation	is that	there
     is	only one lock.	It is analogous	to the ussetlock(3P) routine, but it
     takes no arguments	and does not return a value.  This is useful for
     serializing access	to shared variables (e.g.  counters) in	a parallel
     region.  Note that	it will	frequently be necessary	to declare those
     variables as VOLATILE to ensure that the optimizer	does not assign	them
     to	a register.

     mp_suggested_numthreads uses the supplied value as	a hint about how many
     threads to	use in subsequent parallel regions, and	returns	the previous
     value of the number of threads to be employed in parallel regions.	It
     does not affect currently executing parallel regions, if any. The
     implementation may	ignore this hint depending on factors such as overall
     system load.  This routine may also be called with the value 0, in which
     case it simply returns the number of threads to be employed in parallel
     regions without the side-effect present in	mp_numthreads.

     mp_unsetlock is the companion routine for mp_setlock.  It also takes no
     arguments and does	not return a value.

     mp_barrier	provides a simple interface to a single	barrier(3P).  It may
     be	used inside a parallel loop to force a barrier synchronization to
     occur among the parallel threads.	The routine takes no arguments,
     returns no	value, and does	not require any	initialization.

     mp_in_doacross_loop answers the question "am I currently executing inside
     a parallel loop?"  This is needed in certain rare situations where you
     have an external routine that can be called both from inside a parallel
     loop and also from	outside	a parallel loop, and the routine must do
     different things depending	on whether it is being called in parallel or
     not.

     mp_set_slave_stacksize sets the stacksize (in bytes) to be	used by	the
     slave processes when they are created (via	sprocsp(2)).  The default size
     is 16MB.  Note that slave processes allocate only their local data on
     their stack; shared data (even if allocated on the master's stack) is not
     counted.


     Directives

     The MIPSpro Fortran 77 compiler allows you	to apply the capabilities of a
     Silicon Graphics multiprocessor computer to the execution of a single
     job. By coding a few simple directives, the compiler splits the job into
     concurrently executing pieces, thereby decreasing the wall-clock run time
     of	the job.

     Directives	enable,	disable, or modify a feature of	the compiler.
     Essentially, directives are command line options specified	within the
     input file	instead	of on the command line.	Unlike command line options,
     directives	have no	default	setting. To invoke a directive,	you must
     either toggle it on or set	a desired value	for its	level.

     Directives	placed on the first line of the	input file are called global
     directives. The compiler interprets them as if they appeared at the top
     of	each program unit in the file. Use global directives to	ensure that
     the program is compiled with the correct command line options. Directives
     appearing anywhere	else in	the file apply only until the end of the
     current program unit. The compiler	resets the value of the	directive to
     the global	value at the start of the next program unit. (Set the global
     value using a command line	option or a global directive.)

     Some command line options act like	global directives. Other command line
     options override directives. Many directives have corresponding command
     line options. If you specify conflicting settings in the command line and
     a directive, the compiler chooses the most restrictive setting. For
     Boolean options, if either the directive or the command line has the
     option turned off,	it is considered off. For options that require a
     numeric value, the	compiler uses the minimum of the command line setting
     and the directive setting.
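
     The "most restrictive setting" rule can be written out in a few lines.
     This is only an illustration of the rule as stated; the function names
     are hypothetical and do not correspond to anything in the compiler.

```c
#include <assert.h>

/* Boolean option: on only if both the command line and the directive
 * have it on (either one turning it off wins). */
int effective_boolean(int cmdline_on, int directive_on)
{
    return cmdline_on && directive_on;
}

/* Numeric option: the smaller of the two settings is used. */
int effective_numeric(int cmdline_value, int directive_value)
{
    assert(cmdline_value >= 0 && directive_value >= 0);
    return cmdline_value < directive_value ? cmdline_value : directive_value;
}
```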

     The Fortran compiler accepts directives that cause	it to generate code
     that can be run in	parallel. The compiler directives look like Fortran
     comments: they begin with a C in column one. If multiprocessing is	not
     turned on,	these statements are treated as	comments. This allows the
     identical source to be compiled with a single-processing compiler or by
     Fortran without the multiprocessing option. The directives	are
     distinguished by having a $ as the	second character. The following
     directives	are supported:

     C$DOACROSS, C$&, C$, C$MP_SCHEDTYPE, C$CHUNK, and C$COPYIN.



     C$DOACROSS

     The essential compiler directive for multiprocessing is C$DOACROSS.  This
     directive directs the compiler to generate	special	code to	run iterations
     of	a DO loop in parallel. The C$DOACROSS directive	applies	only to	the
     next statement (which must	be a DO	loop). The Fortran compiler does not
     support direct nesting of C$DOACROSS loops. The C$DOACROSS	directive has
     the form

     C$DOACROSS [clause [[,] clause] ...]

     where valid values	for the	optional clause	are

	    [IF	(logical_expression)]

	    [{LOCAL | PRIVATE} (item[,item ...])]

	    [{SHARE | SHARED} (item[,item ...])]

	    [{LASTLOCAL	| LAST LOCAL} (item[,item ...])]

	    [REDUCTION (item[,item ...])]

	    [MP_SCHEDTYPE=mode ]

	    [CHUNK=integer_expression]

     The preferred form	of the directive uses the optional commas between
     clauses. This section discusses the meaning of each clause.


         IF

         The IF clause determines whether the loop is actually executed in
	 parallel. If the logical expression is	TRUE, the loop is executed in
	 parallel. If the expression is	FALSE, the loop	is executed serially.


	 LOCAL,	SHARE, LASTLOCAL

         These clauses specify lists of variables used within parallel loops.
         A variable can appear in only one of these lists. To make the task of
         writing these lists easier, there are several defaults. The
         loop-iteration variable is LASTLOCAL by default. All other variables
         are SHARE by default.

         LOCAL Specifies variables that are local to each process. If a
         variable is declared as LOCAL, each iteration of the loop is given
         its own uninitialized copy of the variable. You can declare a
         variable as LOCAL if its value does not depend on any other iteration
         of the loop and if its value is used only within a single iteration.
         In effect the LOCAL variable is just temporary; a new copy can be
         created in each loop iteration without changing the final answer. The
         name LOCAL is preferred over PRIVATE.

         SHARE Specifies variables that are shared across all processes. If a
         variable is declared as SHARE, all iterations of the loop use the
         same copy of the variable. You can declare a variable as SHARE if it
         is only read (not written) within the loop or if it is an array where
         each iteration of the loop uses a different element of the array. The
         name SHARE is preferred over SHARED.

         LASTLOCAL Specifies variables that are local to each process. Unlike
	 with the LOCAL	clause,	the compiler saves only	the value of the
	 logically last	iteration of the loop when it exits. The name
	 LASTLOCAL is preferred	over LAST LOCAL.

	 LOCAL is a little faster than LASTLOCAL, so if	you do not need	the
	 final value, it is good practice to put the DO	loop index variable
	 into the LOCAL	list, although this is not required.

	 Only variables	can appear in these lists. In particular, COMMON
	 blocks	cannot appear in a LOCAL list.	The SHARE, LOCAL, and
	 LASTLOCAL lists give only the names of	the variables. If any member
	 of the	list is	an array, it is	listed without any subscripts.
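
         The difference between LOCAL and LASTLOCAL is easiest to see in a
         model.  The sketch below is portable C that imitates the data flow
         only: the "processes" run one after another (deliberately in
         reverse order), each with its own private copy of t, and only the
         logically last iteration copies its value out.  The function name
         and the loop body (t = 2*i) are hypothetical.

```c
#include <assert.h>

/* Models "DO i = 1, n:  t = 2*i" with t declared LASTLOCAL.  Returns the
 * value of t visible after the loop, or -1 if the loop ran no iterations. */
int lastlocal_demo(int n, int nworkers)
{
    assert(n >= 0 && nworkers > 0);
    int t_after = -1;                       /* t as seen after the loop */
    /* Workers "finish" in reverse order to show that the saved value
     * depends on the logical iteration order, not on completion order. */
    for (int w = nworkers - 1; w >= 0; w--) {
        int lo = (int)((long long)n * w / nworkers);
        int hi = (int)((long long)n * (w + 1) / nworkers);
        int t;                              /* private, uninitialized copy */
        for (int i = lo; i < hi; i++) {
            t = 2 * (i + 1);
            if (i == n - 1)
                t_after = t;                /* copy-out of the last iteration */
        }
    }
    return t_after;
}
```

         With LOCAL instead of LASTLOCAL there would be no copy-out step, so
         the value of t after the loop could not be relied upon.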


	 REDUCTION

	 The REDUCTION clause specifies	variables involved in a	reduction
	 operation. In a reduction operation, the compiler keeps local copies
	 of the	variables and combines them when it exits the loop. An element
	 of the	REDUCTION list must be an individual variable (also called a
	 scalar	variable) and cannot be	an array. However, it can be an
	 individual element of an array. In a REDUCTION	clause,	it would
         appear in the list with the proper subscripts.

         One element of an array can be used in a reduction operation, while
	 other elements	of the array are used in other ways. To	allow for
	 this, if an element of	an array appears in the	REDUCTION list,	the
	 entire	array can also appear in the SHARE list.

         The four types of reductions supported are sum (+), product (*), min,
         and max. Note that min and max reductions must use the MIN and MAX
         intrinsic functions to be recognized correctly.

	 The compiler confirms that the	reduction expression is	legal by
	 making	some simple checks. The	compiler does not, however, check all
	 statements in the DO loop for illegal reductions. You must ensure
	 that the reduction variable is	used correctly in a reduction
	 operation.
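
         As an illustration of "local copies combined on exit", the
         following portable C sketch models a sum reduction.  It is not
         what the compiler generates: the workers run one after another and
         the contiguous split of iterations is an assumption made for the
         example.  The function name is hypothetical.

```c
#include <assert.h>

/* Sums a[0..n-1] in the manner of a sum reduction: each worker
 * accumulates into a private partial sum, and the partial sums are
 * combined into the shared result when the worker is done. */
double reduction_sum(const double *a, int n, int nworkers)
{
    assert(n >= 0 && nworkers > 0);
    double total = 0.0;                     /* the reduction variable */
    for (int w = 0; w < nworkers; w++) {
        int lo = (int)((long long)n * w / nworkers);
        int hi = (int)((long long)n * (w + 1) / nworkers);
        double local = 0.0;                 /* private copy, starts at 0 */
        for (int i = lo; i < hi; i++)
            local += a[i];
        total += local;                     /* combine step on loop exit */
    }
    return total;
}
```

         Because floating-point addition is not associative, a real parallel
         sum may differ in the last bits from the serial result; the values
         used to check this sketch are small integers, for which the sums
         are exact.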


	 CHUNK,	MP_SCHEDTYPE

	 The CHUNK and MP_SCHEDTYPE clauses affect the way the compiler
	 schedules work	among the participating	tasks in a loop. These clauses
	 do not	affect the correctness of the loop. They are useful for	tuning
	 the performance of critical loops.

	 For the MP_SCHEDTYPE=mode clause, mode	can be one of the following:

	 [SIMPLE | STATIC]

	 [DYNAMIC]

         [INTERLEAVE | INTERLEAVED]

         [GUIDED | GSS]

	 [RUNTIME]

	 You can use any or all	of these modes in a single program. The	CHUNK
	 clause	is valid only with the DYNAMIC and INTERLEAVE modes. SIMPLE,
	 DYNAMIC, INTERLEAVE, GSS, and RUNTIME are the preferred names for
	 each mode.

	 The simple method (MP_SCHEDTYPE=SIMPLE) divides the iterations	among
	 processes by dividing them into contiguous pieces and assigning one
	 piece to each process.
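
         A contiguous split can be sketched as follows in portable C.  The
         text does not say how an uneven iteration count is divided; giving
         the extra iterations to the lowest-numbered processes is an
         assumption of this example, and the names are hypothetical.

```c
#include <assert.h>

/* Half-open range [begin, end) of 0-based iterations for one process. */
typedef struct { int begin; int end; } Piece;

/* Contiguous piece of an n-iteration loop assigned to process proc
 * out of nprocs.  Remainder iterations go to the first (n % nprocs)
 * processes (an assumption, see above). */
Piece simple_piece(int n, int nprocs, int proc)
{
    assert(n >= 0 && nprocs > 0 && proc >= 0 && proc < nprocs);
    int base = n / nprocs;
    int extra = n % nprocs;
    Piece p;
    p.begin = proc * base + (proc < extra ? proc : extra);
    p.end = p.begin + base + (proc < extra ? 1 : 0);
    return p;
}
```

         For 10 iterations on 4 processes this yields the pieces [0,3),
         [3,6), [6,8) and [8,10).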

	 In dynamic scheduling (MP_SCHEDTYPE=DYNAMIC) the iterations are
	 broken	into pieces the	size of	which is specified with	the CHUNK
	 clause. As each process finishes a piece, it enters a critical
	 section to grab the next available piece. This	gives good load
	 balancing at the price	of higher overhead.
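
         The "grab the next available piece" step can be modelled with a
         shared cursor.  This portable C sketch shows only the bookkeeping;
         in a real runtime the body of dynamic_grab would execute inside
         the critical section mentioned above.  The type and function names
         are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Shared loop state: next unassigned iteration, total count, CHUNK. */
typedef struct { int next; int n; int chunk; } DynamicLoop;

/* Hands out the next piece as the half-open range [*begin, *end).
 * Returns 1 if a piece was assigned, 0 when no iterations remain.
 * The final piece may be shorter than chunk. */
int dynamic_grab(DynamicLoop *loop, int *begin, int *end)
{
    assert(loop != NULL && begin != NULL && end != NULL);
    assert(loop->chunk > 0);
    if (loop->next >= loop->n)
        return 0;
    *begin = loop->next;
    *end = loop->next + loop->chunk;
    if (*end > loop->n)
        *end = loop->n;
    loop->next = *end;
    return 1;
}
```

         A smaller CHUNK means more calls (more overhead) but finer-grained
         load balancing, which is the trade-off described above.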

	 The interleave	method (MP_SCHEDTYPE=INTERLEAVE) breaks	the iterations
         into pieces of the size specified by the CHUNK option, and execution
         of those pieces is interleaved among the processes.
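
         One plausible reading of "interleaved" is round-robin dealing of
         the pieces, starting with process 0.  The sketch below, in
         portable C, encodes that assumption; the function name is
         hypothetical.

```c
#include <assert.h>

/* Process that executes 0-based iteration i when pieces of `chunk`
 * iterations are dealt out round-robin among nprocs processes. */
int interleave_owner(int i, int chunk, int nprocs)
{
    assert(i >= 0 && chunk > 0 && nprocs > 0);
    return (i / chunk) % nprocs;
}
```

         With CHUNK=2 and 3 processes, iterations 0-1 go to process 0,
         2-3 to process 1, 4-5 to process 2, and 6-7 to process 0 again.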

	 The fourth method is a	variation of the guided	self-scheduling
	 algorithm (MP_SCHEDTYPE=GSS).	Here, the piece	size is	varied
	 depending on the number of iterations remaining. By parceling out
	 relatively large pieces to start with and relatively small pieces
	 toward	the end, the system can	achieve	good load balancing while
	 reducing the number of	entries	into the critical section.
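
         The text says only that this is a variation of guided
         self-scheduling, so the exact piece sizes used by the runtime are
         not known from this page.  The sketch below uses the textbook rule
         (each piece is the remaining iterations divided by the number of
         processes, rounded up) purely to show how the sizes shrink; the
         function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Writes successive piece sizes for an n-iteration loop into sizes[]
 * (at most max_pieces entries) and returns how many were written.
 * Rule used: size = ceil(remaining / nprocs), the classic GSS rule,
 * which is an assumption and not necessarily the SGI variation. */
int gss_piece_sizes(int n, int nprocs, int *sizes, int max_pieces)
{
    assert(n >= 0 && nprocs > 0 && sizes != NULL && max_pieces >= 0);
    int remaining = n;
    int count = 0;
    while (remaining > 0 && count < max_pieces) {
        int size = (remaining + nprocs - 1) / nprocs;
        sizes[count++] = size;
        remaining -= size;
    }
    return count;
}
```

         For 100 iterations and 4 processes this rule produces 14 pieces:
         25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1.  Dynamic scheduling
         with CHUNK=1 would need 100 entries into the critical section for
         the same loop.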

	 In addition to	these four methods, you	can specify the	scheduling
	 method	at run time (MP_SCHEDTYPE=RUNTIME).  Here, the scheduling
	 routine examines values in your run-time environment and uses that
	 information to	select one of the other	four methods.

	 If both the MP_SCHEDTYPE and CHUNK clauses are	omitted, SIMPLE
	 scheduling is assumed.	If MP_SCHEDTYPE	is set to INTERLEAVE or
         DYNAMIC and the CHUNK clause is omitted, CHUNK=1 is assumed. If
	 MP_SCHEDTYPE is set to	one of the other values, CHUNK is ignored. If
	 the MP_SCHEDTYPE clause is omitted, but CHUNK is set, then
	 MP_SCHEDTYPE=DYNAMIC is assumed.
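
         These defaulting rules can be collected into one small function.
         The sketch below, in portable C, models only the clause-level
         rules stated in this paragraph; it does not account for
         C$MP_SCHEDTYPE or C$CHUNK directives in scope.  All names are
         hypothetical.

```c
#include <assert.h>

typedef enum {
    MPS_UNSET,        /* MP_SCHEDTYPE clause omitted */
    MPS_SIMPLE,
    MPS_DYNAMIC,
    MPS_INTERLEAVE,
    MPS_GSS,
    MPS_RUNTIME
} SchedMode;

/* Effective schedule; chunk is 0 when CHUNK does not apply. */
typedef struct { SchedMode mode; int chunk; } Schedule;

/* mode is MPS_UNSET if the MP_SCHEDTYPE clause was omitted;
 * chunk is 0 if the CHUNK clause was omitted. */
Schedule resolve_schedule(SchedMode mode, int chunk)
{
    assert(chunk >= 0);
    Schedule s;
    if (mode == MPS_UNSET)
        mode = (chunk > 0) ? MPS_DYNAMIC : MPS_SIMPLE;
    s.mode = mode;
    if (mode == MPS_DYNAMIC || mode == MPS_INTERLEAVE)
        s.chunk = (chunk > 0) ? chunk : 1;   /* CHUNK defaults to 1 */
    else
        s.chunk = 0;                         /* CHUNK is ignored */
    return s;
}
```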


     C$&

     Occasionally, the clauses in the C$DOACROSS directive are longer than one
     line. Use the C$& directive to continue the directive onto	multiple
     lines.

     For example:

     C$DOACROSS	share(ALPHA, BETA, GAMMA, DELTA,
     C$&  EPSILON, OMEGA), LASTLOCAL(I,	J, K, L, M, N),
     C$&  LOCAL(XXX1, XXX2, XXX3, XXX4,	XXX5, XXX6, XXX7,
     C$&  XXX8,	XXX9)


     C$

     The C$ directive is considered a comment line except when
     multiprocessing. A	line beginning with C$ is treated as a conditionally
     compiled Fortran statement. The rest of the line contains a standard
     Fortran statement.	The statement is compiled only if multiprocessing is
     turned on.	In this	case, the C and	$ are treated as if they are blanks.
     They can be used to insert	debugging statements, or an experienced	user
     can use them to insert arbitrary code into	the multiprocessed version.


     C$MP_SCHEDTYPE

     The C$MP_SCHEDTYPE=mode directive acts as an implicit MP_SCHEDTYPE	clause
     for all C$DOACROSS	directives in scope. mode is any of the	modes listed
     under CHUNK and MP_SCHEDTYPE.  A C$DOACROSS directive that does not have
     an explicit MP_SCHEDTYPE clause is given the value specified in the last
     directive prior to the loop, rather than the normal default. If the
     C$DOACROSS	does have an explicit clause, then the explicit	value is used.


     C$CHUNK

     The C$CHUNK=integer_expression directive affects the CHUNK	clause of a
     C$DOACROSS	in the same way	that the C$MP_SCHEDTYPE	directive affects the
     MP_SCHEDTYPE clause for all C$DOACROSS directives in scope. Both
     directives	are in effect from the place they occur	in the source until
     another corresponding directive is	encountered or the end of the
     procedure is reached.



     C$COPYIN

     It	is occasionally	desirable to be	able to	copy values from the master
     thread's version of the COMMON block into the slave thread's version.
     The special directive C$COPYIN allows this. It has	the form

     C$COPYIN item [, item ...]

     Each item must be a member	of a local COMMON block. It can	be a variable,
     an	array, an individual element of	an array, or the entire	COMMON block.

     Note:  The	C$COPYIN directive cannot be executed from inside a parallel
     region.


     OpenMP Support

     The -mp flag enables the processing of the	parallel (MP) directives,
     including the original SGI/PCF directives (described below) as well as
     the OpenMP	directives.  To	disable	one or the other set use
     -MP:old_mp=OFF or -MP:open_mp=OFF.	 See the -MP option control group.
     For more information about	OpenMP support in MIPSpro Fortran 77, please
     refer to the MIPSpro Fortran 77 Programmer's Guide.  For more information
     about OpenMP support in MIPSpro Fortran 90, please refer to the MIPSpro 7
     Fortran 90	Commands and Directives	Reference Manual.  For general
     information about OpenMP please refer to the following web	page:

		  http://www.openmp.org/




     PCF Directives

     In addition to the simple loop-level parallelism offered by C$DOACROSS
     and the other directives described	above, the compiler supports a more
     general model of parallelism. This	model is based on the work done	by the
     Parallel Computing	Forum (PCF), which itself formed the basis for the
     proposed ANSI-X3H5	standard. The compiler supports	this model through
     compiler directives, rather than extensions to the	source language.  For
     more information about PCF, please	refer to Chapter 5 of the MIPSpro
     Fortran 77	Programmer's Guide.

     The directives can	be used	in Fortran 77 programs when compiled with the
     -mp option.

     C$PAR BARRIER
	     Ensures that each process waits until all processes reach the
	     barrier before proceeding.

     C$PAR [END] CRITICAL SECTION
	     Ensures that the enclosed block of	code is	executed by only one
	     process at	a time by using	a lock variable.

     C$PAR [END] PARALLEL
	     Encloses a	parallel region, which includes	work-sharing
	     constructs	and critical sections.

     C$PAR PARALLEL DO
	     Precedes a	single DO loop for which separate iterations are
	     executed by different processes. This directive is	equivalent to
	     the C$DOACROSS directive.

     C$PAR [END] PDO
	     Separate iterations of the	enclosed loop are executed by
	     different processes. This directive must be inside	a parallel
	     region.

     C$PAR [END] PSECTION[S]
	     Parcels out each block of code in turn to a process.

     C$PAR SECTION
	     Signifies a starting line for an individual section within	a
	     parallel section.

     C$PAR [END] SINGLE	PROCESS
	     Ensures that the enclosed block of	code is	executed by exactly
	     one process.

     C$PAR &
             Continues a PCF directive onto multiple lines.


     Parallel Region

     A parallel region encloses any number of PCF constructs.  It signifies
     the boundary within which slave threads execute. A	user program can
     contain any number	of parallel regions. The syntax	of the parallel	region
     is:

     C$PAR PARALLEL [clause [[,] clause]...]

		code

     C$PAR END PARALLEL

     where valid clauses are:

     [IF ( logical_expression )]

     [{LOCAL | PRIVATE}(item [,item ...])]

     [{SHARE | SHARED}(item [,item ...])]

     The IF, LOCAL, and	SHARED clauses have the	same meaning as	in the
     C$DOACROSS	directive.

     The preferred form	of the directive has no	commas between the clauses.
     The SHARED	clause is preferred over SHARE and LOCAL is preferred over
     PRIVATE.


     PCF Constructs

     The three types of	PCF constructs are work-sharing	constructs, critical
     sections, and barriers. All master	and slave threads synchronize at the
     bottom of a work-sharing construct. None of the threads continue past the
     end of the	construct until	they all have completed	execution within that
     construct.

     The four work-sharing constructs are:  parallel DO, PDO, sections and
     single process.

     If	specified, these constructs (except for	the parallel DO	construct)
     must appear inside	of a parallel region.  Specifying a parallel DO
     construct inside of a parallel region produces a syntax error.

     The critical section construct protects a block of	code with a lock so
     that it is	executed by only one thread at a time. Threads do not
     synchronize at the	bottom of a critical section.

     The barrier construct ensures that	each process that is executing waits
     until all others reach the barrier before proceeding.


     Parallel DO

     The parallel DO construct is the same as the C$DOACROSS directive and
     conceptually the same as a	parallel region	containing exactly one PDO
     construct and no other code. Each thread inside the enclosing parallel
     region executes separate iterations of the	loop within the	parallel DO
     construct.	The syntax of the parallel DO construct	is

     C$PAR PARALLEL DO [clause [[,] clause]...]

     where clause is defined as	the same as for	C$DOACROSS.

     For the C$PAR PARALLEL DO directive, MP_SCHEDTYPE=	is optional; you can
     just specify mode.


     PDO

     Each thread inside	the enclosing parallel region executes a separate
     iteration of the loop within the PDO construct. The syntax	of the PDO
     construct,	which can only be specified within a parallel region, is:

     C$PAR PDO [clause [[,] clause]...]

	     code

     [C$PAR END	PDO [NOWAIT]]

     where valid values	for clause are

     [{LOCAL | PRIVATE}	(item[,item ...])]

     [{LASTLOCAL | LAST	LOCAL} (item[,item ...])]

     [(ORDERED)]

     [ sched ]

     [ chunk ]

     LOCAL, LASTLOCAL, sched, and chunk	have the same me
