OpenMP

From csi702

Jump to: navigation, search


Contents

  1. Introduction to OpenMP
  2. OpenMP Threads
  3. Goals of OpenMP
  4. OpenMP Programming Model
    1. Explicit Parallelism
    2. Environment and Library Routines
    3. Not Message Passing
    4. General Syntax
    5. A Trivial Example
  5. Parallelizing Loops
  6. Reductions
    1. Numerical Integration
    2. A Simple Code to Calculate PI
    3. CPUs currently in our Lab
  7. Loop Splitting
    1. Using OpenMP in Loop Splitting
  8. Controlling Thread Execution
    1. Controlling Loops
    2. Gmice
    3. OpenPBS
  9. Examples
    1. Example 1
    2. Example 2
    3. Example 3
    4. Example 4
    5. Example 5
    6. Example 6
    7. Example 7
  10. Comparison to pthreads and MPI
  11. Assignment
  12. Links & References

1 Introduction to OpenMP

OpenMP was as a set of simple program additions to make codes run efficiently on shared memory computers. The formal API for OpenMP is only about 50 pages long, and consists of compiler directives, library functions, and environment variables.

http://www.llnl.gov/computing/tutorials/openMP/

2 OpenMP Threads

OpenMP uses threads for parallel programming

  • Forks and Joins are used for most of the internal programming.
  • Speedup is achieved by the operating system splitting the threads across multiple CPUs.
  • New threads are created explicitly by the program directives dynamically.

3 Goals of OpenMP

  • Standardization
  • Lean and Mean - only 3-4 directives
  • Ease of use
  • Portability - F77, F90, F95, C, C++

4 OpenMP Programming Model

  • Shared Memory, thread based
  • Explicit Parallelism
  • Fork-Join Model
  • Compiler Directives
  • Nested Parallelism Support - in most implementations
  • Dynamic Threads
  • Not tied to I/O

4.1 Explicit Parallelism

  • You must tell the computer what sections of code to parallelize using complier directives.
  • The compiler directives vary between languages, but are ignored when OpenMP flags are not set with the compiler.
  • Codes written with OpenMP can run easily on serial machines.

4.2 Environment and Library Routines

  • Some environmental variables are needed to make the code execute using the correct number of threads
  • Some library routines allow the programmer to set and access system variables

4.3 Not Message Passing

This is NOT a set of message passing routines. Instead, you give directives to the compiler of what parts of the code can be executed in parallel.
In some ways, OpenMP is a set of directives to tell the compiler how to more efficiently handle loops.

4.4 General Syntax

Fortran

!$OMP <directive>
do useful stuff
!$OMP end <directive>

C/C++

#pragma omp <directive-name> clause
{do useful stuff in a structured block}

4.5 A Trivial Example

Basic Code

program trival
print*,’hello world’
end program

OMP Additions

program trivial

!$omp parallel
print*,’hello world’
!$omp end parallel

end program trivial

Execution of the Trivial Example

(158) % ifort trival.f90
(159) % ./a.out
hello world
(160) %

What went wrong?

Execution of the Trivial Example

>bash-3.00$ ifort trivial.f90 -openmp
trivial.f90(3) : (col. 7) remark:
OpenMP DEFINED REGION WAS PARALLELIZED.
> ./a.out
hello world
hello world

> gfortran trivial.f90 -fopenmp
> ./a.out
hello world
hello world
> export OMP_NUM_THREADS=5
> ./a.out
hello world
hello world
hello world
hello world
hello world

Using ifort on the cds workstations

# on cdsxx
> module list
Currently Loaded Modulefiles:
1) intel_icc_10.1.017 3) intel_idb_10.1.017 5) intel_2) intel_ifort_10.1.017 4) intel_mkl_10.0.4.023
>export OMP_NUM_THREADS=2

Thread ID

program trival1
implicit none
integer :: OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS
integer :: tid, nthreads
!$OMP PARALLEL PRIVATE(NTHREADS, TID)
tid = OMP_GET_THREAD_NUM()
nthreads = OMP_GET_MAX_THREADS()
print*,’hello world from ’, tid,nthreads
!$OMP END PARALLEL
end program

Note the PRIVATE key word, indicating that all threads have their own copy of the variable.

Thread ID (2)

(172) % ifort -openmp trival1.f90
trival1.f90(5): (col. 7) remark:
OpenMP DEFINED REGION WAS PARALLELIZED.
(173) % a.out
hello world from 1 4
hello world from 0 4
hello world from 2 4
hello world from 3 4
(174) %

5 Parallelizing Loops

To parallelize a loop, you need to help the compiler figure out the most efficient way to use threads. There are simple defaults, but giving it more details can help efficiency.
The basic directives are:

!$OMP PARALLEL
!$OMP DO

some parallel loop

!$OMP END DO
!$OMP END PARALLEL

You do not have to have a separate directive on each line. For example,

!$OMP PARALLEL
!$OMP DO
!$OMP PRIVATE(NTHREADS, TID)

Becomes

!$OMP PARALLEL DO PRIVATE(NTHREADS, TID)

6 Reductions

Because the loops are executing separately, you may wish to combine the results from different threads to a final answer. You need to use reduction to make this work.

$!OMP PARALLEL PRIVATE(X) REDUCTION(+:SUM)
$!

6.1 Numerical Integration

Integrating

              \int^{1/2}_{-1/2}{\frac{4.0}{1+x^2} dx}               

We can approximate this integral using Simpson’s algorithms

  • input the number of partitions to be used
  • divide the domain into n partitions
  • evaluate the function at each partition
  • multiply the function evaluation times the width of the function to find a differential area
  • add the differential areas together
  • output the result

Parallel Integration
In parallel, the problem is nearly the same.

  • on processor zero, input the number of partitions
  • broadcast the user input information to all processors
  • determine the number of processors - m
  • divide the domain into n/m partitions on each processor
  • evaluate the function at each partition
  • multiply the function evaluation times the width of the function to find a differential area
  • add the differential areas together across all the processors
  • on processor zero, output the result

6.2 A Simple Code to Calculate PI

program reduce
! modified from Mattson, Sanders and McGill’s book
integer :: i, num_steps
double precision :: x, pi, step, sum
sum =0.0d0 ; nsteps = 10000
step = 1.0d0 / dble(nsteps)
do i = 1, nsteps
   x = (dble(i) + 0.5d0) * step
   sum = sum + 4.0d0 /(1.0d0 + x*x)
enddo
pi = step * sum
print *, "estimate of Pi with ",nsteps, " is ",pi
end program reduce

OpenMP Modifications

!$OMP PARALLEL DO PRIVATE(X) REDUCTION(+:SUM)
do i = 1, nsteps
   x = (dble(i) + 0.5d0) * step
   sum = sum + 4.0d0 /(1.0d0 + x*x)
enddo
!$OMP END PARALLEL DO

6.3 CPUs currently in our Lab

wk08
Total CPU’s: 2
Intel(R) Pentium(R) 4 CPU 3.80GHz 3791 MHZ
Cache Size : 1024 KB
Main Memory Size: 2046 MB
cpu MHz : 3791.076
32 bit

wk14
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz 1596 MHZ
Cache Size : 4096 KB
Main Memory Size: 2023 MB
cpu MHz : 1596.000

Skynet and Cameron

skynet

Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz
cache size : 6144 KB
Main Memory Size: 4051 MB
cpu MHz : 1998.000

cameron
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
stepping : 4
cpu MHz : 1600.000
cache size : 8192 KB

OpenMP Results - pi.f ( Old Lab Machines - 2007 )
Old machines in

ifort
22.265u 0.009s 0:11.22 198.3% 0+0k 0+0io 0pf+0w

ifort -openmp
10.884u 0.005s 0:10.89 99.9% 0+0k 0+0io 0pf+0w

OpenMP Results - pi.f ( skynet )

ifort
real 0m3.877s
user 0m3.872s
sys 0m0.000s

ifort -openmp
real 0m1.002s
user 0m3.812s
sys 0m0.004s

3.877 seconds TO 1.002 second with two lines of code!

OpenMP Results - pi example ( cameron )

gfortran
real 0m18.571s
user 0m18.565s
sys 0m0.000s
estimate of Pi with 1000000000 is 3.1415926515899706
gfortran -openmp
real 0m2.717s
user 0m19.357s
sys 0m0.004s
estimate of Pi with 1000000000 is 3.1415926515897685

That 18.571 to 2.717 seconds!!!!!! 6.85 times faster with two lines of code.

OpenMP Results - mandelbrot example ( cameron )

without openmp
real 0m0.653s
user 0m0.836s
sys 0m0.012s

with openmp
real 0m9.024s
user 0m8.989s
sys 0m0.012s

OpenMP Results - jacobi example ( cameron )

without openmp
Residual 9.99979529016497684E-008
Solution Error : 1.16990135380457038E-003
real 0m23.225s
user 0m23.217s
sys 0m0.000s

with openmp
Residual 9.99979529016516478E-008
Solution Error : 1.16990135380456712E-003
real 0m7.861s
user 0m51.067s
sys 0m0.548s

OpenMP Results - pi example ( cameron )
Anyone notice the problem??

gfortran
real 0m18.571s
user 0m18.565s
sys 0m0.000s
estimate of Pi with 1000000000 is 3.1415926515899706

gfortran -openmp
real 0m2.717s
user 0m19.357s
sys 0m0.004s
estimate of Pi with 1000000000 is 3.1415926515897685

That 18.571 to 2.717 seconds!!!!!! 6.85 times faster with two lines of code.

OpenMP Results - pi example ( cameron )
Anyone notice the problem??

estimate of Pi with 1000000000 is 3.1415926515899706
estimate of Pi with 1000000000 is 3.1415926515897685

What happened?

Initially, we might expect that the serial and parallel versions of the code will produce identical results, since they are performing identical computations for all of the terms in the series. In fact, each of the terms of the series computed are identical. However, deviations are introduced when the terms are summed. Since computer representations of floating point numbers have finite precision, the accuracy of an addition operation depends on the relative magnitudes of the two numbers. If we sum a series of numbers, the accuracy of the result depends not only on the numbers that are being summed, but also on the order in which they are summed.

Since the parallel version of the algorithm doesn't add the numbers in the exact same order as the serial algorithm, it is likely that the results will not be identical. Furthermore, if the order of execution of the threads is not identical from run to run, then the same parallel algorithm can produce different results each time it is run. Note that this is a general issue with parallel algorithm and is not specific to OpenMP. Therefore, we expect similar behavior with pthreads and MPI.


Molecular Dynamics Code Example

!$omp parallel do
!$omp& default(shared)
!$omp& private(i,j,k,rij,d)
!$omp& reduction(+ : pot, kin)
    do i=1,np
        ! compute potential energy and forces
        f(1:nd,i) = 0.0
        do j=1,np
           if (i .ne. j) then
             call dist(nd,box,pos(1,i),pos(1,j),rij,d)
             ! attribute half of the potential energy to particle
             pot = pot + 0.5*v(d)
             do k=1,nd
               f(k,i) = f(k,i) - rij(k)*dv(d)/d
             enddo
          endif
       enddo
      ! compute kinetic energy
      kin = kin + dotr8(nd,vel(1,i),vel(1,i))
   enddo
!$omp end parallel do

Molecular Dynamics Code Example: Timing Results - Old machines

pgi
104.648u 0.040s 1:44.69 99.9% 0+0k 0+0io 0pf+0w

pgi openmp
125.540u 0.050s 1:02.87 199.7% 0+0k 0+0io 0pf+0w

ifort
29.248u 0.031s 0:29.28 99.9% 0+0k 0+0io 0pf+0w

ifort -openmp
41.625u 0.066s 0:23.16 179.9% 0+0k 0+0io 0pf+0w

Molecular Dynamics Code Example: Timing Results - New machines

ifort
28.362u 0.008s 0:28.37 99.9% 0+0k 0+0io 0pf+0w

ifort -openmp
29.272u 0.017s 0:15.09 194.0% 0+0k 1960+0io 5pf+0w

Molecular Dynamics Code Example: Timing Results - skynet - quad core

ifort -O0
real 0m59.805s
user 0m59.776s
sys 0m0.032s

ifort
real 0m32.083s
user 0m32.078s
sys 0m0.008s

ifort -openmp
real 0m18.003s
user 0m57.372s
sys 0m1.304s

Automatic parallelization???

ifort -openmp
real 0m18.003s
user 0m57.372s
sys 0m1.304s

ifort -parallel
real 0m28.683s
user 1m26.273s
sys 0m3.476s

Molecular Dynamics Code Example:Timing Results - wk14 - 64 bit dual core

ifort -O0
real 1m15.493s
user 1m15.431s
sys 0m0.008s

ifort
real 0m29.570s
user 0m29.522s
sys 0m0.008s

ifort -openmp
real 0m14.046s
user 0m27.797s
sys 0m0.022s

Molecular Dynamics Code Example:Timing Results - wk08 - 32 bit dual core

ifort -O0
real 1m35.032s
user 1m34.674s
sys 0m0.041s

ifort
real 0m30.898s
user 0m30.837s
sys 0m0.015s

ifort -openmp
real 0m25.291s
user 0m43.954s
sys 0m0.051s

7 Loop Splitting

One of the key ideas to remember is that loops often contain several operations that can be split. Taking an example from the Patterns in Parallel Programming book, imagine we have a loop with two functions:

  • BIG COMPUTATION - a big computation the executes independently on each element in the loop
  • COMBINE - an element that cannot be parallelized and must execute in order
do i = 1, nsteps
  x = BIG_COMPUTATION(i)
  call COMBINE(x,answer)
enddo

can be split into

do i = 1, nsteps
  x(i) = BIG_COMPUTATION(i)
enddo
do i = 1, nsteps
  call COMBINE(x(i),answer)
enddo

7.1 Using OpenMP in Loop Splitting

!$OMP PARALLEL DO PRIVATE(I)
do i = 1, nsteps
  x(i) = BIG_COMPUTATION(i)
enddo
!$OMP END PARALLEL DO

do i = 1, nsteps
  call COMBINE(x(i),answer)
enddo

Elliptical PDE using the Jacobi Method

http://www.llnl.gov/computing/tutorials/openMP/exercise.html

Poisson Equation on a rectangular grid assuming Uniform discretization and Dirichlet boundary conditions.

\bigtriangledown^2 u = b

over the domain \{x, y\} = [−1,\ 1].

The example has an exact solution of

 u(x, y) = (1 -− x^2)\times (1 -− y^2)

The Approach

  • parallelize initialization routine
  • parallelize main iteration loop
  • use reduction for computing the residual and error check

Efficiency- harlie - dual AMD 2.6Ghz

(8) % time a.out < input
Total Number of Iterations 18556
Residual 9.9998760719761766E-008
Solution Error : 2.1683208415060635E-004
FORTRAN STOP
50.064u 0.066s 0:50.15 99.9% 0+0k 0+0io 0pf+0w
(9) % setenv OMP_NUM_THREADS 2
(10) % time a.out < input
Total Number of Iterations 18556
Residual 9.9998760719761819E-008
Solution Error : 2.1683208415060609E-004
FORTRAN STOP
68.157u 1.456s 0:34.88 199.5% 0+0k 0+0io 0pf+0w

Efficiency - hal - dual Pentium III 600 Mhz

hal [~/openmp] (59) % time a.out < input
Total Number of Iterations 10976
Residual 9.9988418447383420E-008
Solution Error : 1.2326073098302639E-003
20.080u 0.060s 0:20.12 100.0% 0+0k 0+0io 161pf+0w

hal [~/openmp] (63) % setenv OMP_NUM_THREADS=2
hal [~/openmp] (64) % time a.out < input
Total Number of Iterations 10976
Residual 9.9988418447383882E-008
Solution Error : 1.2326073098302657E-003
22.650u 0.020s 0:11.37 199.3% 0+0k 0+0io 185pf+0w

Processes While the Code Runs

11:26pm up 21 days, 20:32, 3 users, load average: 0.74, 0.39, 0.16
83 processes: 79 sleeping, 4 running, 0 zombie, 0 stopped
CPU0 states: 99.0% user, 0.5% system, 0.0% nice, 0.0% idle
CPU1 states: 99.2% user, 0.3% system, 0.0% nice, 0.0% idle
Mem: 383768K av, 351256K used, 32512K free, 0K shrd, 34620K buff
Swap: 922288K av, 9168K used, 913120K free 95560K cached

PID    USER   PRI NI SIZE RSS  SHARE STAT %CPU  %MEM TIME   COMMAND
24661 jwallin 16  0  960  960  560   R    99.0  0.2  0:10   a.out
24663 jwallin 15  0  960  960  560   R    97.6  0.2  0:10   a.out
24293 jwallin 10  0  9428 9268 7852  S    1.5   2.4  0:12   kdeinit
1021  root     9  0  160M 13M  1568  R    0.7   3.5  577:10 X
24650 jwallin 12  0  1068 1068 832   R    0.5   0.2  0:00   top

Efficiences of the Jacobi.f program ( skynet )

  • no mpi, no opt = 28.679s
  • no mpi, with opt = 3.719s
  • with mpi, with opt = 1.203s

23 times faster than the unoptimized no mpi code

Results

MachineCodeTime With OMPTime Without OMP Speedup
Harlie 2.6Ghz AMDPi5.19s9.97s1.92
skynet quadPi1.0023.8773.87
Hal 600Mhz PentiumJacobi11.37s20.12s1.77
Harlie 2.6Ghz AMDJacobi34.88s50.15s1.44
skynet quadJacobi1.2033.5762.98

What are the efficiencies of the parallel codes and why isn’t it 100%?

8 Controlling Thread Execution

There are many options for controlling the execution of threads.

!$OMP DO SCHEDULE(TYPE,integer)
  • schedule(static[,chunk]) - groups of size chunk statically assigned in a round-robin fashion
  • schedule(dynamic[,chunk]) - threads dynamically grab work as it is completed
  • schedule(guided[,chunk]) - chunk size is reduced automatically during iteration toward a minimum level of chunk
  • schedule(runtime) - checks the OMP SCHEDULE environmental variable
integer, parameter :: chunk = 10

!$OMP PARALLEL PRIVATE(i,j,z,c,it) DEFAULT(SHARED)
!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
 do i = 1, n
   do j = 1, n
...

8.1 Controlling Loops

setenv OMP_SCHEDULE static
11.477u 0.012s 0:08.24 139.3% 0+0k 0+0io 0pf+0w

setenv OMP_SCHEDULE dynamic
11.239u 0.006s 0:05.67 198.0% 0+0k 0+0io 0pf+0w

setenv OMP_SCHEDULE guided
11.453u 0.005s 0:06.52 175.6% 0+0k 0+0io 0pf+0w

setenv OMP_SCHEDULE static,20
11.439u 0.028s 0:05.89 194.3% 0+0k 0+0io 0pf+0w

no mpi
11.280u 0.004s 0:11.28 100.0% 0+0k 0+0io 0pf+0w

Mandelbrot Set - WITH an output file

ifort
11.980u 1.067s 0:15.54 83.9% 0+0k 0+35168io 0pf+0w

ifort -openmp
12.535u 1.146s 0:10.00 136.7% 0+0k 0+35168io 0pf+0w

What happened???

8.2 Gmice

  • Log into gmice.gmu.edu - you will need to have your private key in the appropriate location, and have your pass phrase
  • Learn how to use OpenPBS
  • Submit jobs to the “class” and “classtest” queues ONLY. They have CPU limits of 30 minutes and 2 minutes

respectively.

8.3 OpenPBS

Set up the PBS variables

#!/bin/csh
#PBS -j oe
#PBS -N WallinTESTRUN
#PBS -l select=1:ncpus=8:mpiprocs=1
#PBS -l walltime=48:00:00

Run the code!

setenv OMP_NUM_THREADS 8
/home/jwallin/pbs_openmp_example/mandel > mandel_out8
begin{verbatim}
setenv OMP_NUM_THREADS 4
/home/jwallin/pbs_openmp_example/mandel > mandel_out4
#PBS -o /home/jwallin/pbs_openmp_example/out_err

> qsub -q classtest filename.sh

> qstat

9 Examples

9.1 Example 1

program trivial
 
!$omp parallel
 print*,'hello world'
!$omp end parallel
 
end program trivial

9.2 Example 2

program trival1
  implicit none
  integer :: OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS
  integer :: tid, nthreads, sum
  sum =0
!$OMP PARALLEL PRIVATE(sum)
  tid = OMP_GET_THREAD_NUM()
  nthreads = OMP_GET_MAX_THREADS()
  print*,'hello world from ', tid,nthreads
 
  sum = sum + tid
!$OMP END PARALLEL
 
  print *, sum
end program

9.3 Example 3

program reduce
! modified from Mattson, Sanders and McGill's book
integer :: i, num_steps
double precision :: x, pi, step, sum
sum =0.0d0 ; nsteps = 1000000000 
step = 1.0d0 / dble(nsteps)
 
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)
do i = 1, nsteps
  x = (dble(i) + 0.5d0) * step
  sum = sum + 4.0d0 /(1.0d0 + x*x)
enddo
!$OMP END PARALLEL DO
pi = step * sum
print *, "estimate of Pi with ",nsteps, " is ",pi
end program reduce

9.4 Example 4

C******************************************************************************
C FILE: omp_workshare1.f
C DESCRIPTION:
C   OpenMP Example - Loop Work-sharing - Fortran Version
C   In this example, the iterations of a loop are scheduled dynamically
C   across the team of threads.  A thread will perform CHUNK iterations
C   at a time before being scheduled for the next CHUNK of work.
C AUTHOR: Blaise Barney  5/99
C LAST REVISED: 01/09/04
C******************************************************************************
 
      PROGRAM WORKSHARE1
 
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +  OMP_GET_THREAD_NUM, N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10) 
      REAL A(N), B(N), C(N)
 
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
 
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(I,TID)
 
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads =', NTHREADS
      END IF
      PRINT *, 'Thread',TID,' starting...'
 
!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
      DO I = 1, N
        C(I) = A(I) + B(I)
        WRITE(*,100) TID,I,C(I)
 100    FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
      ENDDO
!$OMP END DO NOWAIT
 
      PRINT *, 'Thread',TID,' done.'
 
!$OMP END PARALLEL
 
      END

9.5 Example 5

C******************************************************************************
C FILE: omp_mm.f
C DESCRIPTION:  
C   OpenMp Example - Matrix Multiply - Fortran Version 
C   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
C   according to a predefined chunk size.
C AUTHOR: Blaise Barney
C LAST REVISED: 1/5/04 Blaise Barney
C******************************************************************************
 
      PROGRAM MATMULT
 
      INTEGER  NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK,
     +         OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
C     number of rows in matrix A 
      PARAMETER (NRA=62)
C     number of columns in matrix A
      PARAMETER (NCA=15)
C     number of columns in matrix B
      PARAMETER (NCB=7)
 
      REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)
 
C     Set loop iteration chunk size 
      CHUNK = 10
 
C     Spawn a parallel region explicitly scoping all variables
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Starting matrix multiple example with', NTHREADS,
     +           'threads'
        PRINT *, 'Initializing matrices'
      END IF
 
C     Initialize matrices
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 30 I=1, NRA
        DO 30 J=1, NCA
          A(I,J) = (I-1)+(J-1)
  30  CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 40 I=1, NCA
        DO 40 J=1, NCB
          B(I,J) = (I-1)*(J-1)
  40  CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 50 I=1, NRA
        DO 50 J=1, NCB
          C(I,J) = 0
  50  CONTINUE
 
C     Do matrix multiply sharing iterations on outer loop
C     Display who does which iterations for demonstration purposes
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
      PRINT *, 'Thread', TID, 'did row', I
        DO 60 J=1, NCB
          DO 60 K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
  60  CONTINUE
 
C     End of parallel region 
!$OMP END PARALLEL
 
C     Print results
      PRINT *, '******************************************************'
      PRINT *, 'Result Matrix:'
      DO 90 I=1, NRA
        DO 80 J=1, NCB
          WRITE(*,70) C(I,J)
  70      FORMAT(2x,f8.2,$)
  80      CONTINUE
          PRINT *, ' '
  90      CONTINUE
      PRINT *, '******************************************************'
      PRINT *, 'Done.'
 
      END

9.6 Example 6

program mandel
 
  real :: xmin, xmax
  real :: ymin, ymax
 
  real :: dx, x
  real :: dy, y
  integer :: i, j
  integer :: it, itmax
  integer, parameter :: nsize = 700
  complex  :: c, z
 
! initialize variables
  itmax = 5000
  xmin = -2
  xmax = 2
  ymin = -2
  ymax = 2
 
! determine dx, dy for grid based on initial values
  n = nsize
  dx = (xmax - xmin)/(n-1)
  dy = (ymax - ymin)/(n-1)
 
! loop over the grid  
  do i = 1, n
    do j = 1, n
! find x,y and set the c to the location in the complex plane
      x = (i-1) * dx + xmin
      y = (j-1) * dy + ymin
      c = cmplx(x,y)
      z = c
 
! determine the number of iterations needed 
      it = 0
      do while (it < itmax .and. abs(z) < 2)
        it = it + 1
        z = z*z + c
      enddo
 
! print the result to a data file
      write(11,'(2i6, i5)') i,j,it
    enddo
  enddo
 
end program mandel

9.7 Example 7

program main 
************************************************************
* program to solve a finite difference 
* discretization of Helmholtz equation :  
* (d2/dx2)u + (d2/dy2)u - alpha u = f 
* using Jacobi iterative method. 
*
* Modified: Sanjiv Shah,       Kuck and Associates, Inc. (KAI), 1998
* Author:   Joseph Robicheaux, Kuck and Associates, Inc. (KAI), 1998
* 
* Directives are used in this code to achieve paralleism. 
* All do loops are parallized with default 'static' scheduling.
* 
* Input :  n - grid dimension in x direction 
*          m - grid dimension in y direction
*          alpha - Helmholtz constant (always greater than 0.0)
*          tol   - error tolerance for iterative solver
*          relax - Successice over relaxation parameter
*          mits  - Maximum iterations for iterative solver
*
* On output 
*       : u(n,m) - Dependent variable (solutions)
*       : f(n,m) - Right hand side function 
*************************************************************
      implicit none 
 
      integer n,m,mits 
      double precision tol,relax,alpha 
 
      common /idat/ n,m,mits
      common /fdat/tol,alpha,relax
* 
* Read info 
* 
      write(*,*) "Input n,m - grid dimension in x,y direction " 
      read(5,*) n,m 
      write(*,*) "Input alpha - Helmholts constant " 
      read(5,*) alpha
      write(*,*) "Input relax - Successive over-relaxation parameter"
      read(5,*) relax 
      write(*,*) "Input tol - error tolerance for iterative solver" 
      read(5,*) tol 
      write(*,*) "Input mits - Maximum iterations for solver" 
      read(5,*) mits
 
*
* Calls a driver routine 
* 
      call driver () 
 
      stop
      end 
 
      subroutine driver ( ) 
*************************************************************
* Subroutine driver () 
* This is where the arrays are allocated and initialzed. 
*
* Working varaibles/arrays 
*     dx  - grid spacing in x direction 
*     dy  - grid spacing in y direction 
*************************************************************
      implicit none 
 
      integer n,m,mits,mtemp 
      double precision tol,relax,alpha 
 
      common /idat/ n,m,mits,mtemp
      common /fdat/tol,alpha,relax
 
      double precision u(n,m),f(n,m),dx,dy
 
* Initialize data
 
      call initialize (n,m,alpha,dx,dy,u,f)
 
* Solve Helmholtz equation
 
      call jacobi (n,m,dx,dy,alpha,relax,u,f,tol,mits)
 
* Check error between exact solution
 
      call  error_check (n,m,alpha,dx,dy,u,f)
 
      return 
      end 
 
      subroutine initialize (n,m,alpha,dx,dy,u,f) 
******************************************************
* Initializes data 
* Assumes exact solution is u(x,y) = (1-x^2)*(1-y^2)
*
******************************************************
      implicit none 
 
      integer n,m
      double precision u(n,m),f(n,m),dx,dy,alpha
 
      integer i,j, xx,yy
      double precision PI 
      parameter (PI=3.1415926)
 
      dx = 2.0 / (n-1)
      dy = 2.0 / (m-1)
 
* Initilize initial condition and RHS
 
!$omp parallel do private(xx,yy)
      do j = 1,m
         do i = 1,n
            xx = -1.0 + dx * dble(i-1)        ! -1 < x < 1
            yy = -1.0 + dy * dble(j-1)        ! -1 < y < 1
            u(i,j) = 0.0 
            f(i,j) = -alpha *(1.0-xx*xx)*(1.0-yy*yy) 
     &           - 2.0*(1.0-xx*xx)-2.0*(1.0-yy*yy)
         enddo
      enddo
!$omp end parallel do
 
      return 
      end 
 
      subroutine jacobi (n,m,dx,dy,alpha,omega,u,f,tol,maxit)
******************************************************************
* Subroutine HelmholtzJ
* Solves poisson equation on rectangular grid assuming : 
* (1) Uniform discretization in each direction, and 
* (2) Dirichlect boundary conditions 
* 
* Jacobi method is used in this routine 
*
* Input : n,m   Number of grid points in the X/Y directions 
*         dx,dy Grid spacing in the X/Y directions 
*         alpha Helmholtz eqn. coefficient 
*         omega Relaxation factor 
*         f(n,m) Right hand side function 
*         u(n,m) Dependent variable/Solution
*         tol    Tolerance for iterative solver 
*         maxit  Maximum number of iterations 
*
* Output : u(n,m) - Solution 
*****************************************************************
      implicit none 
      integer n,m,maxit
      double precision dx,dy,f(n,m),u(n,m),alpha, tol,omega
*
* Local variables 
* 
      integer i,j,k,k_local 
      double precision error,resid,rsum,ax,ay,b
      double precision error_local, uold(n,m)
 
      real ta,tb,tc,td,te,ta1,ta2,tb1,tb2,tc1,tc2,td1,td2
      real te1,te2
      real second
      external second
*
* Initialize coefficients 
      ax = 1.0/(dx*dx) ! X-direction coef 
      ay = 1.0/(dy*dy) ! Y-direction coef
      b  = -2.0/(dx*dx)-2.0/(dy*dy) - alpha ! Central coeff  
 
      error = 10.0 * tol 
      k = 1
 
      do while (k.le.maxit .and. error.gt. tol) 
 
         error = 0.0    
 
* Copy new solution into old
!$omp parallel
 
!$omp do 
         do j=1,m
            do i=1,n
               uold(i,j) = u(i,j) 
            enddo
         enddo
 
* Compute stencil, residual, & update
 
!$omp do private(resid) reduction(+:error)
         do j = 2,m-1
            do i = 2,n-1 
*     Evaluate residual 
               resid = (ax*(uold(i-1,j) + uold(i+1,j)) 
     &                + ay*(uold(i,j-1) + uold(i,j+1))
     &                 + b * uold(i,j) - f(i,j))/b
* Update solution 
               u(i,j) = uold(i,j) - omega * resid
* Accumulate residual error
               error = error + resid*resid 
            end do
         enddo
!$omp enddo nowait
 
!$omp end parallel
 
* Error check 
 
         k = k + 1
 
         error = sqrt(error)/dble(n*m)
*
      enddo                     ! End iteration loop 
*
      print *, 'Total Number of Iterations ', k 
      print *, 'Residual                   ', error 
 
      return 
      end 
 
      subroutine error_check (n,m,alpha,dx,dy,u,f) 
      implicit none 
************************************************************
* Checks error between numerical and exact solution 
*
************************************************************ 
 
      integer n,m
      double precision u(n,m),f(n,m),dx,dy,alpha 
 
      integer i,j
      double precision xx,yy,temp,error 
 
      dx = 2.0 / (n-1)
      dy = 2.0 / (m-1)
      error = 0.0 
 
!$omp parallel do private(xx,yy,temp) reduction(+:error)
      do j = 1,m
         do i = 1,n
            xx = -1.0d0 + dx * dble(i-1)
            yy = -1.0d0 + dy * dble(j-1)
            temp  = u(i,j) - (1.0-xx*xx)*(1.0-yy*yy)
            error = error + temp*temp 
         enddo
      enddo
 
      error = sqrt(error)/dble(n*m)
 
      print *, 'Solution Error : ',error
 
      return 
      end

10 Comparison to pthreads and MPI

Since OpenMP uses a thread-based fork-and-join paradigm for parallelization, it is more similar to pthreads than MPI, which uses message passing between independent processes. pthreads exclusively uses library function calls to enable parallelization, whereas OpenMP also uses compiler directives and environment variables. Parallelization with pthreads tends to be more invasive (with respect to source code modification), due to the greater number of function calls required to create and synchronize threads. However, by requiring more explicit thread management, pthreads can support more complex parallelization strategies in addition to the fork-and-join strategy. On the other hand, while OpenMP is more aligned with the fork-and-join paradigm, it can enable parallelization and significant speedup of a program with just a single #pragma line added to a program's source code.

11 Assignment

Using OpenMP, parallelize the Mandelbrot set code on the website.
Run the code on :

  • The cdsXX workstations
  • gmice
  • The Amazon Cloud

Time the code with and without OMP additions, calculate the speed up and efficiency. Set OMP NUM THREADS between 1 and 16 for all tests, and graph the results for final cpu time. You may wish to use omp_get_wtime() to get the cpu time at the beginning and end of the code.

Hint: Make sure you consider loop separation.

12 Links & References

Personal tools