OpenMP
From csi702
Contents |
1 Introduction to OpenMP
OpenMP was as a set of simple program additions to make codes run efficiently on shared memory computers. The formal API for OpenMP is only about 50 pages long, and consists of compiler directives, library functions, and environment variables.
http://www.llnl.gov/computing/tutorials/openMP/
2 OpenMP Threads
OpenMP uses threads for parallel programming
- Forks and Joins are used for most of the internal programming.
- Speedup is achieved by the operating system splitting the threads across multiple CPUs.
- New threads are created explicitly by the program directives dynamically.
3 Goals of OpenMP
- Standardization
- Lean and Mean - only 3-4 directives
- Ease of use
- Portability - F77, F90, F95, C, C++
4 OpenMP Programming Model
- Shared Memory, thread based
- Explicit Parallelism
- Fork-Join Model
- Compiler Directives
- Nested Parallelism Support - in most implementations
- Dynamic Threads
- Not tied to I/O
4.1 Explicit Parallelism
- You must tell the computer what sections of code to parallelize using complier directives.
- The compiler directives vary between languages, but are ignored when OpenMP flags are not set with the compiler.
- Codes written with OpenMP can run easily on serial machines.
4.2 Environment and Library Routines
- Some environmental variables are needed to make the code execute using the correct number of threads
- Some library routines allow the programmer to set and access system variables
4.3 Not Message Passing
This is NOT a set of message passing routines. Instead, you give directives to the compiler of what parts of the code can be executed in parallel.
In some ways, OpenMP is a set of directives to tell the compiler how to more efficiently handle loops.
4.4 General Syntax
Fortran
!$OMP <directive> do useful stuff !$OMP end <directive>
C/C++
#pragma omp <directive-name> clause
{do useful stuff in a structured block}
4.5 A Trivial Example
Basic Code
program trival print*,’hello world’ end program
OMP Additions
program trivial !$omp parallel print*,’hello world’ !$omp end parallel end program trivial
Execution of the Trivial Example
(158) % ifort trival.f90 (159) % ./a.out hello world (160) %
What went wrong?
Execution of the Trivial Example
>bash-3.00$ ifort trivial.f90 -openmp trivial.f90(3) : (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED. > ./a.out hello world hello world > gfortran trivial.f90 -fopenmp > ./a.out hello world hello world > export OMP_NUM_THREADS=5 > ./a.out hello world hello world hello world hello world hello world
Using ifort on the cds workstations
# on cdsxx > module list Currently Loaded Modulefiles: 1) intel_icc_10.1.017 3) intel_idb_10.1.017 5) intel_2) intel_ifort_10.1.017 4) intel_mkl_10.0.4.023 >export OMP_NUM_THREADS=2
Thread ID
program trival1 implicit none integer :: OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS integer :: tid, nthreads !$OMP PARALLEL PRIVATE(NTHREADS, TID) tid = OMP_GET_THREAD_NUM() nthreads = OMP_GET_MAX_THREADS() print*,’hello world from ’, tid,nthreads !$OMP END PARALLEL end program
Note the PRIVATE key word, indicating that all threads have their own copy of the variable.
Thread ID (2)
(172) % ifort -openmp trival1.f90 trival1.f90(5): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED. (173) % a.out hello world from 1 4 hello world from 0 4 hello world from 2 4 hello world from 3 4 (174) %
5 Parallelizing Loops
To parallelize a loop, you need to help the compiler figure out the most efficient way to use threads. There are simple defaults, but giving it more details can help efficiency.
The basic directives are:
!$OMP PARALLEL !$OMP DO some parallel loop !$OMP END DO !$OMP END PARALLEL
You do not have to have a separate directive on each line. For example,
!$OMP PARALLEL !$OMP DO !$OMP PRIVATE(NTHREADS, TID)
Becomes
!$OMP PARALLEL DO PRIVATE(NTHREADS, TID)
6 Reductions
Because the loops are executing separately, you may wish to combine the results from different threads to a final answer. You need to use reduction to make this work.
$!OMP PARALLEL PRIVATE(X) REDUCTION(+:SUM) $!
6.1 Numerical Integration
Integrating
We can approximate this integral using Simpson’s algorithms
- input the number of partitions to be used
- divide the domain into n partitions
- evaluate the function at each partition
- multiply the function evaluation times the width of the function to find a differential area
- add the differential areas together
- output the result
Parallel Integration
In parallel, the problem is nearly the same.
- on processor zero, input the number of partitions
- broadcast the user input information to all processors
- determine the number of processors - m
- divide the domain into n/m partitions on each processor
- evaluate the function at each partition
- multiply the function evaluation times the width of the function to find a differential area
- add the differential areas together across all the processors
- on processor zero, output the result
6.2 A Simple Code to Calculate PI
program reduce ! modified from Mattson, Sanders and McGill’s book integer :: i, num_steps double precision :: x, pi, step, sum sum =0.0d0 ; nsteps = 10000 step = 1.0d0 / dble(nsteps) do i = 1, nsteps x = (dble(i) + 0.5d0) * step sum = sum + 4.0d0 /(1.0d0 + x*x) enddo pi = step * sum print *, "estimate of Pi with ",nsteps, " is ",pi end program reduce
OpenMP Modifications
!$OMP PARALLEL DO PRIVATE(X) REDUCTION(+:SUM) do i = 1, nsteps x = (dble(i) + 0.5d0) * step sum = sum + 4.0d0 /(1.0d0 + x*x) enddo !$OMP END PARALLEL DO
6.3 CPUs currently in our Lab
wk08 Total CPU’s: 2 Intel(R) Pentium(R) 4 CPU 3.80GHz 3791 MHZ Cache Size : 1024 KB Main Memory Size: 2046 MB cpu MHz : 3791.076 32 bit wk14 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz 1596 MHZ Cache Size : 4096 KB Main Memory Size: 2023 MB cpu MHz : 1596.000
Skynet and Cameron
skynet Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz cache size : 6144 KB Main Memory Size: 4051 MB cpu MHz : 1998.000 cameron model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz stepping : 4 cpu MHz : 1600.000 cache size : 8192 KB
OpenMP Results - pi.f ( Old Lab Machines - 2007 )
Old machines in
ifort 22.265u 0.009s 0:11.22 198.3% 0+0k 0+0io 0pf+0w ifort -openmp 10.884u 0.005s 0:10.89 99.9% 0+0k 0+0io 0pf+0w
OpenMP Results - pi.f ( skynet )
ifort real 0m3.877s user 0m3.872s sys 0m0.000s ifort -openmp real 0m1.002s user 0m3.812s sys 0m0.004s 3.877 seconds TO 1.002 second with two lines of code!
OpenMP Results - pi example ( cameron )
gfortran real 0m18.571s user 0m18.565s sys 0m0.000s estimate of Pi with 1000000000 is 3.1415926515899706 gfortran -openmp real 0m2.717s user 0m19.357s sys 0m0.004s estimate of Pi with 1000000000 is 3.1415926515897685 That 18.571 to 2.717 seconds!!!!!! 6.85 times faster with two lines of code.
OpenMP Results - mandelbrot example ( cameron )
without openmp real 0m0.653s user 0m0.836s sys 0m0.012s with openmp real 0m9.024s user 0m8.989s sys 0m0.012s
OpenMP Results - jacobi example ( cameron )
without openmp Residual 9.99979529016497684E-008 Solution Error : 1.16990135380457038E-003 real 0m23.225s user 0m23.217s sys 0m0.000s with openmp Residual 9.99979529016516478E-008 Solution Error : 1.16990135380456712E-003 real 0m7.861s user 0m51.067s sys 0m0.548s
OpenMP Results - pi example ( cameron )
Anyone notice the problem??
gfortran real 0m18.571s user 0m18.565s sys 0m0.000s estimate of Pi with 1000000000 is 3.1415926515899706 gfortran -openmp real 0m2.717s user 0m19.357s sys 0m0.004s estimate of Pi with 1000000000 is 3.1415926515897685 That 18.571 to 2.717 seconds!!!!!! 6.85 times faster with two lines of code.
OpenMP Results - pi example ( cameron )
Anyone notice the problem??
estimate of Pi with 1000000000 is 3.1415926515899706 estimate of Pi with 1000000000 is 3.1415926515897685
What happened?
Initially, we might expect that the serial and parallel versions of the code will produce identical results, since they are performing identical computations for all of the terms in the series. In fact, each of the terms of the series computed are identical. However, deviations are introduced when the terms are summed. Since computer representations of floating point numbers have finite precision, the accuracy of an addition operation depends on the relative magnitudes of the two numbers. If we sum a series of numbers, the accuracy of the result depends not only on the numbers that are being summed, but also on the order in which they are summed.
Since the parallel version of the algorithm doesn't add the numbers in the exact same order as the serial algorithm, it is likely that the results will not be identical. Furthermore, if the order of execution of the threads is not identical from run to run, then the same parallel algorithm can produce different results each time it is run. Note that this is a general issue with parallel algorithm and is not specific to OpenMP. Therefore, we expect similar behavior with pthreads and MPI.
Molecular Dynamics Code Example
!$omp parallel do
!$omp& default(shared)
!$omp& private(i,j,k,rij,d)
!$omp& reduction(+ : pot, kin)
do i=1,np
! compute potential energy and forces
f(1:nd,i) = 0.0
do j=1,np
if (i .ne. j) then
call dist(nd,box,pos(1,i),pos(1,j),rij,d)
! attribute half of the potential energy to particle
pot = pot + 0.5*v(d)
do k=1,nd
f(k,i) = f(k,i) - rij(k)*dv(d)/d
enddo
endif
enddo
! compute kinetic energy
kin = kin + dotr8(nd,vel(1,i),vel(1,i))
enddo
!$omp end parallel do
Molecular Dynamics Code Example: Timing Results - Old machines
pgi 104.648u 0.040s 1:44.69 99.9% 0+0k 0+0io 0pf+0w pgi openmp 125.540u 0.050s 1:02.87 199.7% 0+0k 0+0io 0pf+0w ifort 29.248u 0.031s 0:29.28 99.9% 0+0k 0+0io 0pf+0w ifort -openmp 41.625u 0.066s 0:23.16 179.9% 0+0k 0+0io 0pf+0w
Molecular Dynamics Code Example: Timing Results - New machines
ifort 28.362u 0.008s 0:28.37 99.9% 0+0k 0+0io 0pf+0w ifort -openmp 29.272u 0.017s 0:15.09 194.0% 0+0k 1960+0io 5pf+0w
Molecular Dynamics Code Example: Timing Results - skynet - quad core
ifort -O0 real 0m59.805s user 0m59.776s sys 0m0.032s ifort real 0m32.083s user 0m32.078s sys 0m0.008s ifort -openmp real 0m18.003s user 0m57.372s sys 0m1.304s
Automatic parallelization???
ifort -openmp real 0m18.003s user 0m57.372s sys 0m1.304s ifort -parallel real 0m28.683s user 1m26.273s sys 0m3.476s
Molecular Dynamics Code Example:Timing Results - wk14 - 64 bit dual core
ifort -O0 real 1m15.493s user 1m15.431s sys 0m0.008s ifort real 0m29.570s user 0m29.522s sys 0m0.008s ifort -openmp real 0m14.046s user 0m27.797s sys 0m0.022s
Molecular Dynamics Code Example:Timing Results - wk08 - 32 bit dual core
ifort -O0 real 1m35.032s user 1m34.674s sys 0m0.041s ifort real 0m30.898s user 0m30.837s sys 0m0.015s ifort -openmp real 0m25.291s user 0m43.954s sys 0m0.051s
7 Loop Splitting
One of the key ideas to remember is that loops often contain several operations that can be split. Taking an example from the Patterns in Parallel Programming book, imagine we have a loop with two functions:
- BIG COMPUTATION - a big computation the executes independently on each element in the loop
- COMBINE - an element that cannot be parallelized and must execute in order
do i = 1, nsteps x = BIG_COMPUTATION(i) call COMBINE(x,answer) enddo
can be split into
do i = 1, nsteps x(i) = BIG_COMPUTATION(i) enddo do i = 1, nsteps call COMBINE(x(i),answer) enddo
7.1 Using OpenMP in Loop Splitting
!$OMP PARALLEL DO PRIVATE(I) do i = 1, nsteps x(i) = BIG_COMPUTATION(i) enddo !$OMP END PARALLEL DO do i = 1, nsteps call COMBINE(x(i),answer) enddo
Elliptical PDE using the Jacobi Method
http://www.llnl.gov/computing/tutorials/openMP/exercise.html
Poisson Equation on a rectangular grid assuming Uniform discretization and Dirichlet boundary conditions.
over the domain
![\{x, y\} = [−1,\ 1].](/csi702/images/math/b/a/6/ba6242403414ee2b8cb944f9287db471.png)
The example has an exact solution of
The Approach
- parallelize initialization routine
- parallelize main iteration loop
- use reduction for computing the residual and error check
Efficiency- harlie - dual AMD 2.6Ghz
(8) % time a.out < input Total Number of Iterations 18556 Residual 9.9998760719761766E-008 Solution Error : 2.1683208415060635E-004 FORTRAN STOP 50.064u 0.066s 0:50.15 99.9% 0+0k 0+0io 0pf+0w (9) % setenv OMP_NUM_THREADS 2 (10) % time a.out < input Total Number of Iterations 18556 Residual 9.9998760719761819E-008 Solution Error : 2.1683208415060609E-004 FORTRAN STOP 68.157u 1.456s 0:34.88 199.5% 0+0k 0+0io 0pf+0w
Efficiency - hal - dual Pentium III 600 Mhz
hal [~/openmp] (59) % time a.out < input Total Number of Iterations 10976 Residual 9.9988418447383420E-008 Solution Error : 1.2326073098302639E-003 20.080u 0.060s 0:20.12 100.0% 0+0k 0+0io 161pf+0w hal [~/openmp] (63) % setenv OMP_NUM_THREADS=2 hal [~/openmp] (64) % time a.out < input Total Number of Iterations 10976 Residual 9.9988418447383882E-008 Solution Error : 1.2326073098302657E-003 22.650u 0.020s 0:11.37 199.3% 0+0k 0+0io 185pf+0w
Processes While the Code Runs
11:26pm up 21 days, 20:32, 3 users, load average: 0.74, 0.39, 0.16 83 processes: 79 sleeping, 4 running, 0 zombie, 0 stopped CPU0 states: 99.0% user, 0.5% system, 0.0% nice, 0.0% idle CPU1 states: 99.2% user, 0.3% system, 0.0% nice, 0.0% idle Mem: 383768K av, 351256K used, 32512K free, 0K shrd, 34620K buff Swap: 922288K av, 9168K used, 913120K free 95560K cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND 24661 jwallin 16 0 960 960 560 R 99.0 0.2 0:10 a.out 24663 jwallin 15 0 960 960 560 R 97.6 0.2 0:10 a.out 24293 jwallin 10 0 9428 9268 7852 S 1.5 2.4 0:12 kdeinit 1021 root 9 0 160M 13M 1568 R 0.7 3.5 577:10 X 24650 jwallin 12 0 1068 1068 832 R 0.5 0.2 0:00 top
Efficiences of the Jacobi.f program ( skynet )
- no mpi, no opt = 28.679s
- no mpi, with opt = 3.719s
- with mpi, with opt = 1.203s
23 times faster than the unoptimized no mpi code
Results
| Machine | Code | Time With OMP | Time Without OMP | Speedup |
|---|---|---|---|---|
| Harlie 2.6Ghz AMD | Pi | 5.19s | 9.97s | 1.92 |
| skynet quad | Pi | 1.002 | 3.877 | 3.87 |
| Hal 600Mhz Pentium | Jacobi | 11.37s | 20.12s | 1.77 |
| Harlie 2.6Ghz AMD | Jacobi | 34.88s | 50.15s | 1.44 |
| skynet quad | Jacobi | 1.203 | 3.576 | 2.98 |
What are the efficiencies of the parallel codes and why isn’t it 100%?
8 Controlling Thread Execution
There are many options for controlling the execution of threads.
!$OMP DO SCHEDULE(TYPE,integer)
- schedule(static[,chunk]) - groups of size chunk statically assigned in a round-robin fashion
- schedule(dynamic[,chunk]) - threads dynamically grab work as it is completed
- schedule(guided[,chunk]) - chunk size is reduced automatically during iteration toward a minimum level of chunk
- schedule(runtime) - checks the OMP SCHEDULE environmental variable
integer, parameter :: chunk = 10 !$OMP PARALLEL PRIVATE(i,j,z,c,it) DEFAULT(SHARED) !$OMP DO SCHEDULE(DYNAMIC,CHUNK) do i = 1, n do j = 1, n ...
8.1 Controlling Loops
setenv OMP_SCHEDULE static 11.477u 0.012s 0:08.24 139.3% 0+0k 0+0io 0pf+0w setenv OMP_SCHEDULE dynamic 11.239u 0.006s 0:05.67 198.0% 0+0k 0+0io 0pf+0w setenv OMP_SCHEDULE guided 11.453u 0.005s 0:06.52 175.6% 0+0k 0+0io 0pf+0w setenv OMP_SCHEDULE static,20 11.439u 0.028s 0:05.89 194.3% 0+0k 0+0io 0pf+0w no mpi 11.280u 0.004s 0:11.28 100.0% 0+0k 0+0io 0pf+0w
Mandelbrot Set - WITH an output file
ifort 11.980u 1.067s 0:15.54 83.9% 0+0k 0+35168io 0pf+0w ifort -openmp 12.535u 1.146s 0:10.00 136.7% 0+0k 0+35168io 0pf+0w
What happened???
8.2 Gmice
- Log into gmice.gmu.edu - you will need to have your private key in the appropriate location, and have your pass phrase
- Learn how to use OpenPBS
- Submit jobs to the “class” and “classtest” queues ONLY. They have CPU limits of 30 minutes and 2 minutes
respectively.
8.3 OpenPBS
Set up the PBS variables
#!/bin/csh #PBS -j oe #PBS -N WallinTESTRUN #PBS -l select=1:ncpus=8:mpiprocs=1 #PBS -l walltime=48:00:00
Run the code!
setenv OMP_NUM_THREADS 8
/home/jwallin/pbs_openmp_example/mandel > mandel_out8
begin{verbatim}
setenv OMP_NUM_THREADS 4
/home/jwallin/pbs_openmp_example/mandel > mandel_out4
#PBS -o /home/jwallin/pbs_openmp_example/out_err
> qsub -q classtest filename.sh
> qstat
9 Examples
9.1 Example 1
program trivial !$omp parallel print*,'hello world' !$omp end parallel end program trivial
9.2 Example 2
program trival1 implicit none integer :: OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS integer :: tid, nthreads, sum sum =0 !$OMP PARALLEL PRIVATE(sum) tid = OMP_GET_THREAD_NUM() nthreads = OMP_GET_MAX_THREADS() print*,'hello world from ', tid,nthreads sum = sum + tid !$OMP END PARALLEL print *, sum end program
9.3 Example 3
program reduce ! modified from Mattson, Sanders and McGill's book integer :: i, num_steps double precision :: x, pi, step, sum sum =0.0d0 ; nsteps = 1000000000 step = 1.0d0 / dble(nsteps) !$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum) do i = 1, nsteps x = (dble(i) + 0.5d0) * step sum = sum + 4.0d0 /(1.0d0 + x*x) enddo !$OMP END PARALLEL DO pi = step * sum print *, "estimate of Pi with ",nsteps, " is ",pi end program reduce
9.4 Example 4
C****************************************************************************** C FILE: omp_workshare1.f C DESCRIPTION: C OpenMP Example - Loop Work-sharing - Fortran Version C In this example, the iterations of a loop are scheduled dynamically C across the team of threads. A thread will perform CHUNK iterations C at a time before being scheduled for the next CHUNK of work. C AUTHOR: Blaise Barney 5/99 C LAST REVISED: 01/09/04 C****************************************************************************** PROGRAM WORKSHARE1 INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS, + OMP_GET_THREAD_NUM, N, CHUNKSIZE, CHUNK, I PARAMETER (N=100) PARAMETER (CHUNKSIZE=10) REAL A(N), B(N), C(N) ! Some initializations DO I = 1, N A(I) = I * 1.0 B(I) = A(I) ENDDO CHUNK = CHUNKSIZE !$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(I,TID) TID = OMP_GET_THREAD_NUM() IF (TID .EQ. 0) THEN NTHREADS = OMP_GET_NUM_THREADS() PRINT *, 'Number of threads =', NTHREADS END IF PRINT *, 'Thread',TID,' starting...' !$OMP DO SCHEDULE(DYNAMIC,CHUNK) DO I = 1, N C(I) = A(I) + B(I) WRITE(*,100) TID,I,C(I) 100 FORMAT(' Thread',I2,': C(',I3,')=',F8.2) ENDDO !$OMP END DO NOWAIT PRINT *, 'Thread',TID,' done.' !$OMP END PARALLEL END
9.5 Example 5
C****************************************************************************** C FILE: omp_mm.f C DESCRIPTION: C OpenMp Example - Matrix Multiply - Fortran Version C Demonstrates a matrix multiply using OpenMP. Threads share row iterations C according to a predefined chunk size. C AUTHOR: Blaise Barney C LAST REVISED: 1/5/04 Blaise Barney C****************************************************************************** PROGRAM MATMULT INTEGER NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK, + OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM C number of rows in matrix A PARAMETER (NRA=62) C number of columns in matrix A PARAMETER (NCA=15) C number of columns in matrix B PARAMETER (NCB=7) REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB) C Set loop iteration chunk size CHUNK = 10 C Spawn a parallel region explicitly scoping all variables !$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K) TID = OMP_GET_THREAD_NUM() IF (TID .EQ. 0) THEN NTHREADS = OMP_GET_NUM_THREADS() PRINT *, 'Starting matrix multiple example with', NTHREADS, + 'threads' PRINT *, 'Initializing matrices' END IF C Initialize matrices !$OMP DO SCHEDULE(STATIC, CHUNK) DO 30 I=1, NRA DO 30 J=1, NCA A(I,J) = (I-1)+(J-1) 30 CONTINUE !$OMP DO SCHEDULE(STATIC, CHUNK) DO 40 I=1, NCA DO 40 J=1, NCB B(I,J) = (I-1)*(J-1) 40 CONTINUE !$OMP DO SCHEDULE(STATIC, CHUNK) DO 50 I=1, NRA DO 50 J=1, NCB C(I,J) = 0 50 CONTINUE C Do matrix multiply sharing iterations on outer loop C Display who does which iterations for demonstration purposes PRINT *, 'Thread', TID, 'starting matrix multiply...' !$OMP DO SCHEDULE(STATIC, CHUNK) DO 60 I=1, NRA PRINT *, 'Thread', TID, 'did row', I DO 60 J=1, NCB DO 60 K=1, NCA C(I,J) = C(I,J) + A(I,K) * B(K,J) 60 CONTINUE C End of parallel region !$OMP END PARALLEL C Print results PRINT *, '******************************************************' PRINT *, 'Result Matrix:' DO 90 I=1, NRA DO 80 J=1, NCB WRITE(*,70) C(I,J) 70 FORMAT(2x,f8.2,$) 80 CONTINUE PRINT *, ' ' 90 CONTINUE PRINT *, '******************************************************' PRINT *, 'Done.' END
9.6 Example 6
program mandel real :: xmin, xmax real :: ymin, ymax real :: dx, x real :: dy, y integer :: i, j integer :: it, itmax integer, parameter :: nsize = 700 complex :: c, z ! initialize variables itmax = 5000 xmin = -2 xmax = 2 ymin = -2 ymax = 2 ! determine dx, dy for grid based on initial values n = nsize dx = (xmax - xmin)/(n-1) dy = (ymax - ymin)/(n-1) ! loop over the grid do i = 1, n do j = 1, n ! find x,y and set the c to the location in the complex plane x = (i-1) * dx + xmin y = (j-1) * dy + ymin c = cmplx(x,y) z = c ! determine the number of iterations needed it = 0 do while (it < itmax .and. abs(z) < 2) it = it + 1 z = z*z + c enddo ! print the result to a data file write(11,'(2i6, i5)') i,j,it enddo enddo end program mandel
9.7 Example 7
program main ************************************************************ * program to solve a finite difference * discretization of Helmholtz equation : * (d2/dx2)u + (d2/dy2)u - alpha u = f * using Jacobi iterative method. * * Modified: Sanjiv Shah, Kuck and Associates, Inc. (KAI), 1998 * Author: Joseph Robicheaux, Kuck and Associates, Inc. (KAI), 1998 * * Directives are used in this code to achieve paralleism. * All do loops are parallized with default 'static' scheduling. * * Input : n - grid dimension in x direction * m - grid dimension in y direction * alpha - Helmholtz constant (always greater than 0.0) * tol - error tolerance for iterative solver * relax - Successice over relaxation parameter * mits - Maximum iterations for iterative solver * * On output * : u(n,m) - Dependent variable (solutions) * : f(n,m) - Right hand side function ************************************************************* implicit none integer n,m,mits double precision tol,relax,alpha common /idat/ n,m,mits common /fdat/tol,alpha,relax * * Read info * write(*,*) "Input n,m - grid dimension in x,y direction " read(5,*) n,m write(*,*) "Input alpha - Helmholts constant " read(5,*) alpha write(*,*) "Input relax - Successive over-relaxation parameter" read(5,*) relax write(*,*) "Input tol - error tolerance for iterative solver" read(5,*) tol write(*,*) "Input mits - Maximum iterations for solver" read(5,*) mits * * Calls a driver routine * call driver () stop end subroutine driver ( ) ************************************************************* * Subroutine driver () * This is where the arrays are allocated and initialzed. * * Working varaibles/arrays * dx - grid spacing in x direction * dy - grid spacing in y direction ************************************************************* implicit none integer n,m,mits,mtemp double precision tol,relax,alpha common /idat/ n,m,mits,mtemp common /fdat/tol,alpha,relax double precision u(n,m),f(n,m),dx,dy * Initialize data call initialize (n,m,alpha,dx,dy,u,f) * Solve Helmholtz equation call jacobi (n,m,dx,dy,alpha,relax,u,f,tol,mits) * Check error between exact solution call error_check (n,m,alpha,dx,dy,u,f) return end subroutine initialize (n,m,alpha,dx,dy,u,f) ****************************************************** * Initializes data * Assumes exact solution is u(x,y) = (1-x^2)*(1-y^2) * ****************************************************** implicit none integer n,m double precision u(n,m),f(n,m),dx,dy,alpha integer i,j, xx,yy double precision PI parameter (PI=3.1415926) dx = 2.0 / (n-1) dy = 2.0 / (m-1) * Initilize initial condition and RHS !$omp parallel do private(xx,yy) do j = 1,m do i = 1,n xx = -1.0 + dx * dble(i-1) ! -1 < x < 1 yy = -1.0 + dy * dble(j-1) ! -1 < y < 1 u(i,j) = 0.0 f(i,j) = -alpha *(1.0-xx*xx)*(1.0-yy*yy) & - 2.0*(1.0-xx*xx)-2.0*(1.0-yy*yy) enddo enddo !$omp end parallel do return end subroutine jacobi (n,m,dx,dy,alpha,omega,u,f,tol,maxit) ****************************************************************** * Subroutine HelmholtzJ * Solves poisson equation on rectangular grid assuming : * (1) Uniform discretization in each direction, and * (2) Dirichlect boundary conditions * * Jacobi method is used in this routine * * Input : n,m Number of grid points in the X/Y directions * dx,dy Grid spacing in the X/Y directions * alpha Helmholtz eqn. coefficient * omega Relaxation factor * f(n,m) Right hand side function * u(n,m) Dependent variable/Solution * tol Tolerance for iterative solver * maxit Maximum number of iterations * * Output : u(n,m) - Solution ***************************************************************** implicit none integer n,m,maxit double precision dx,dy,f(n,m),u(n,m),alpha, tol,omega * * Local variables * integer i,j,k,k_local double precision error,resid,rsum,ax,ay,b double precision error_local, uold(n,m) real ta,tb,tc,td,te,ta1,ta2,tb1,tb2,tc1,tc2,td1,td2 real te1,te2 real second external second * * Initialize coefficients ax = 1.0/(dx*dx) ! X-direction coef ay = 1.0/(dy*dy) ! Y-direction coef b = -2.0/(dx*dx)-2.0/(dy*dy) - alpha ! Central coeff error = 10.0 * tol k = 1 do while (k.le.maxit .and. error.gt. tol) error = 0.0 * Copy new solution into old !$omp parallel !$omp do do j=1,m do i=1,n uold(i,j) = u(i,j) enddo enddo * Compute stencil, residual, & update !$omp do private(resid) reduction(+:error) do j = 2,m-1 do i = 2,n-1 * Evaluate residual resid = (ax*(uold(i-1,j) + uold(i+1,j)) & + ay*(uold(i,j-1) + uold(i,j+1)) & + b * uold(i,j) - f(i,j))/b * Update solution u(i,j) = uold(i,j) - omega * resid * Accumulate residual error error = error + resid*resid end do enddo !$omp enddo nowait !$omp end parallel * Error check k = k + 1 error = sqrt(error)/dble(n*m) * enddo ! End iteration loop * print *, 'Total Number of Iterations ', k print *, 'Residual ', error return end subroutine error_check (n,m,alpha,dx,dy,u,f) implicit none ************************************************************ * Checks error between numerical and exact solution * ************************************************************ integer n,m double precision u(n,m),f(n,m),dx,dy,alpha integer i,j double precision xx,yy,temp,error dx = 2.0 / (n-1) dy = 2.0 / (m-1) error = 0.0 !$omp parallel do private(xx,yy,temp) reduction(+:error) do j = 1,m do i = 1,n xx = -1.0d0 + dx * dble(i-1) yy = -1.0d0 + dy * dble(j-1) temp = u(i,j) - (1.0-xx*xx)*(1.0-yy*yy) error = error + temp*temp enddo enddo error = sqrt(error)/dble(n*m) print *, 'Solution Error : ',error return end
10 Comparison to pthreads and MPI
Since OpenMP uses a thread-based fork-and-join paradigm for parallelization, it is more similar to pthreads than MPI, which uses message passing between independent processes. pthreads exclusively uses library function calls to enable parallelization, whereas OpenMP also uses compiler directives and environment variables. Parallelization with pthreads tends to be more invasive (with respect to source code modification), due to the greater number of function calls required to create and synchronize threads. However, by requiring more explicit thread management, pthreads can support more complex parallelization strategies in addition to the fork-and-join strategy. On the other hand, while OpenMP is more aligned with the fork-and-join paradigm, it can enable parallelization and significant speedup of a program with just a single #pragma line added to a program's source code.
11 Assignment
Using OpenMP, parallelize the Mandelbrot set code on the website.
Run the code on :
- The cdsXX workstations
- gmice
- The Amazon Cloud
Time the code with and without OMP additions, calculate the speed up and efficiency. Set OMP NUM THREADS between 1 and 16 for all tests, and graph the results for final cpu time. You may wish to use omp_get_wtime() to get the cpu time at the beginning
and end of the code.
Hint: Make sure you consider loop separation.
