Advanced MPI
From csi702
Contents |
1 Advanced MPI
1.1 MPE Library
Note that the MPE libraries are still not up on the CDS clusters. -4/3/10
MPE is a set of MPI extensions that are most commonly used for profiling program performance.
Profiling is essential to writing efficient parallel codes. It is necessary to identify communications bottlenecks and unbalanced loads (ie find the nodes that aren't doing anything). Inefficiencies are most likely due to communications bottlenecks and unbalanced loads. MPE is the most universal way of identifying these issues in MPI codes.
Steps to profiling using MPE:
- include the header for the MPE library
- compile the code so it accesses the correct headers
- link to the MPE and LMPE libraries
Add the header to your code:
- include "mpe.h"
Compile with the -lmpe library.
It looks like Dr. Wallin suggests copying the header files mpe.h and mpe_log.h to your local compile directory but instead you can just add it to your LD_LIBRARY_PATH (?).
When mpe is loaded properly mpirun will tell you its writing the log file.
Once you've got your MPI code compiling and running with MPE you can start visualizing the log files and profiling your code.
1.1.1 MPE Profiling
The output of the profiling is a set of time marks associated with different events. Events can be MPI communications or user defined events. The most common visualization tool is “jumpshot”. Example output:
1.1.2 MPE Tools
Other MPE events:
- MPE_Log_get_state_eventIDs( &event2a, &event2b );
- MPE_Describe_state( event2a, event2b,"Send", "orange" );
- MPE_Log_get_solo_eventID( &event1 );
- MPE_Describe_event( event1,"Sync Start", "white" );
Marking Sections of the code:
- ierr = MPE_Log_event(event2a, 0, NULL);
- ierr = MPI_Send(message, message_size,MPI_INT, dest, tag, MPI_COMM_WORLD);
- ierr = MPE_Log_event(event2b, 0, NULL);
Synchronization of the Clocks:
- MPE_Log_sync_clocks();
- MPE_Finish_log( "logname" );
1.2 MPI Data Packing
1.2.1 Structures
typedef struct test1 { char c; double b; char cc; } t1; typedef struct test2 { char cc; double b; } t2; printf("size of structure 1 = %d \n",sizeof(t1)); printf("size of structure 2 = %d \n",sizeof(t2));
What we expected:
1 + 8 + 1 = 10 and 1 + 1 + 8 = 10
But in reality:
size of char 1 size of double 8 size of structure 1 = 16 size of structure 2 = 12
So 1 + 8 + 1 = 16 and 1 + 1 + 8 = 12??????
What happened?
1.2.2 Packing of data
For 32 bit Operation System, it has a 4 bytes boundary. If the data length exceeds boundary, it will use a new line.
For 64 bit Operation System, it has a 8 bytes boundary. If the data length exceeds boundary, it will use a new line.
1.2.3 MPI Packing data
- the displacement from the start of the data block
- the number of elements in the block
- the type of elements in the block
1.2.4 MPI Domain Decomposition
Two examples below are all for 32bit OS
The packing code is like:
type (new_type) :: my_type integer :: mpi_my_type integer, parameter :: new_type_length = 3 integer, parameter :: new_mpi_types(1:3) = & (/MPI_INTEGER,MPI_DOUBLE_PRECISION, MPI_INTEGER/) integer, parameter :: block_length(1:3) = (/1,3,1/) integer, dimension(3) :: address call MPI_ADDRESS(my_type%istart, address(1), ierr) call MPI_ADDRESS(my_type%qxx, address(2), ierr) call MPI_ADDRESS(my_type%iend, address(3), ierr) do i = 1, new_type_length displacements(i) = address(i) - address(1) enddo call MPI_TYPE_STRUCT(new_type_length, & block_length, displacements, & ptypes, mpi_my_type, ierr) call MPI_TYPE_COMMIT(mpi_my_type, ierr) ! optional call MPI_TYPE_SIZE(mpi_my_type, isize, ierr)
A data structure example code is like:
type ucquad_type integer :: inode real :: qxx real :: qxy real :: qxz real :: qyy real :: qyz real :: qzz end type ucquad_type integer :: mpi_ucquad_type integer, parameter :: ucquadblock_length = 2 integer, parameter :: ucquadblock_counts(1:2) = (/1,6/) integer, parameter :: ucquadblock_size(1:2) = (/1,1/) integer, parameter :: ucquad_mpi_type(1:2) = (/MPI_INTEGER, MPI_REAL/) integer :: mpi_ucquad_type integer, parameter :: ucquadblock_length = 2 integer, parameter :: ucquadblock_counts(1:2) = (/1,6/) integer, parameter :: ucquadblock_size(1:2) = (/1,1/) integer, parameter :: ucquad_mpi_type(1:2) = (/MPI_INTEGER,MPI_REAL/)
Code for starting addresses of the blocks
call MPI_ADDRESS(ucquad_tmp%inode, address(1), ierr) call MPI_ADDRESS(ucquad_tmp%qxx, address(2), ierr)
Code for calculate displacements and setting types
do i = 1, ucquadblock_length block_length(i) = ucquadblock_counts(i) * & ucquadblock_size(i) ptypes(i) = ucquad_mpi_type(i) displacements(i) = address(i) - address(1) enddo
Code for creating the structure
call MPI_TYPE_STRUCT(ucquadblock_length, & block_length, displacements, & ptypes, mpi_ucquad_type, ierr) call MPI_TYPE_COMMIT(mpi_ucquad_type, ierr) call MPI_TYPE_SIZE(mpi_ucquad_type, isize, ierr)
Code for using the new data type
call MPI_IRECV(ucquad_r(buffer_start), & msize, & mpi_ucquad_type, & p2, tag+1, MPI_COMM_WORLD, & quad_request(next_recv,queue_id), ierr) call MPI_GET_COUNT(qstatus, mpi_ucquad_type, qcount, ierr)
1.2.5 MPI Domain Decomposition
Code for creating two sets
do i = 1, median_proc + 1 ranks(i) = i-1 enddo left_group_size = median_proc + 1 right_group_size = group_size - left_group_size
Code for creating two new Comm Groups
if (left_group_size > 1 ) then call MPI_GROUP_INCL(full_group, & left_group_size, ranks, left_group, ierr) call MPI_GROUP_SIZE(left_group, group_size, ierr) call MPI_COMM_CREATE(my_comm, left_group, left_comm, ierr endif if (right_group_size > 1) then call MPI_GROUP_EXCL(full_group, & left_group_size, ranks, right_group, ierr) call MPI_GROUP_SIZE(right_group, group_size, ierr) call MPI_COMM_CREATE(my_comm, right_group, right_comm, ie endif
Code for applying the decomposition recursively
if (left_group_size > 1 .and. & local_id <= median_proc ) then call ORB_RECURSIVE(left_comm, initial_sample_rate, & final_sample_rate, level + 1) call MPI_COMM_FREE(left_comm, ierr) endif if (right_group_size > 1 .and. & local_id > median_proc) then call ORB_RECURSIVE(right_comm, initial_sample_rate, & final_sample_rate, level + 1) call MPI_COMM_FREE(right_comm, ierr) endif
1.2.6 Sends and Receives
- Blocking
The syntax for blocking sends is showed below. It is good for a synchronization code. Since the code won't proceed until send or receive are done.
int MPI Send( void *buf, int count, MPI Datatype datatype, int dest, int type, MPI Comm comm); int MPI Recv( void *buf, int count, MPI Datatype datatype, int source, int type, MPI Comm comm, MPI Status status);
- Non-Blocking
It can also be used for synchronization. But it required better designed code or we can add a MPI_Wait to wait for the receive or send end.
int MPI_Isend( void *buf, int count, MPI_Datatype datatype int dest, int type, MPI_Comm comm); int MPI_Irecv( void *buf, int count, MPI_Datatype datatype int source, int type, MPI_Comm comm, MPI_Status status); int MPI_Iprobe(int source, int type, MPI_Comm comm, MPI_Status status, boolean flag, int status[], int err);
References
[1] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Send.html
[2] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Recv.html
[3] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Isend.html
[4] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Irecv.html
[5] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Wait.html
[6] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Probe.html
1.2.7 Message Passing
- Message passing a difficult paradigm to use
- Two processors must execute simultaneous and different
commands to make the code work
- Non-blocking sends and receives help, as does wild cards... but
Making this work is difficult
References
There are a lot of interesting examples on wikipedia. Click the link below.
[1] http://en.wikipedia.org/wiki/Message_passing
1.3 Remote Memory Access
- Remote Memory Access allows you to do one-sided put's and get's on remote processor memory
- This model is similar to shared memory
- You need to be aware of synchronization issues
The Big Idea:
- One process specifies communication parameters.
- Separates communication and synchronization.
- User imposes right ordering of memory accesses.
- Origin: the process that performs the call.
- Target: the process in which memory is accessed.
1.3.1 Creating and Deleting Memory Windows
- Creating a window in memory that is accessible to other processors
MPI_WIN_CREATE(a, ndim, 4, MPI_INFO_NULL, MPI_COMM_WORLD, my_win, ierr)
- Closing same window in memory
MPI_WIN_FREE(my_win, ierr)
Fences
- Performing an MPI fence for synchronization on a MPI window
- The mutex variable of MPI
- The assert argument is used to indicate special conditions for the fence that an implementation may use to optimize the MPI_Win_fence operation.
MPI_WIN_FENCE(assert, my_win, ierr)
For most cases: assert = 0
1.3.2 MPI Puts and Gets
MPI_Puts
- Putting data into a memory window on a remote processor
MPI_PUT(source array, source nelements, source MPI_TYPE, target rank, target_displacement, target_length, target MPI_TYPE, my_win, ierr)
MPI_Gets
- Getting data from a memory window on a remote processor
MPI_GET(source array, source nelements, source MPI_TYPE, target rank, target_displacement, target_count, & target MPI_TYPE, my_win, ierr)
1.3.3 Example Put Code
program mpi_rma include "mpif.h" integer, parameter :: ndim = 100; integer, dimension(ndim) :: a, b integer :: ierr, nproc, my_id integer :: my_win, assert call MPI_INIT(ierr) call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr) call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
/* Set up the window */ a(1:ndim) = my_id b(1:ndim) = (my_id+1)*50 call MPI_WIN_CREATE(a, ndim, 4, MPI_INFO_NULL, MPI_COMM_WORLD, my_win, ierr)
/* Do the Transfer */ assert = 0 call MPI_WIN_FENCE(assert, my_win, ierr)
/* Examine the results */ if (my_id == 0) then call MPI_PUT(b, 10, MPI_INTEGER, 1, 20, 10, MPI_INTEGER, my_win, ierr) endif call MPI_WIN_FENCE(assert, my_win, ierr) if (my_id == 1) then print*, a endif call MPI_WIN_FREE(my_win, ierr) call MPI_FINALIZE(ierr) end program mpi_rma
Example Put Code Output
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 50 50 50 50
50 50 50 50 50 50
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1...etc.
Note: Here, MPI_Put has placed ten '50's' from array b, into array a starting at index 20, using put call:
MPI_PUT(b, 10, MPI_INTEGER, 1, 20, 10, MPI_INTEGER, my_win, ierr)
1.3.4 Counterpart Example Get Code
program mpi_rma include "mpif.h" integer, parameter :: ndim = 40; integer, dimension(ndim) :: a, b integer :: ierr, nproc, my_id integer :: my_win, assert, i
call MPI_INIT(ierr) call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr) call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
! initialize the a and b array do i = 1, ndim a(i) = i b(i) = -i enddo
! create the window
call MPI_WIN_CREATE(a, ndim, 4, MPI_INFO_NULL, &
MPI_COMM_WORLD, my_win, ierr)
! do the get with a fence around it - not the displacement of 20 on the ! target size and loading into element 10 assert = 0 call MPI_WIN_FENCE(assert, my_win, ierr) if (my_id == 0) then call MPI_GET(b(10), 10, MPI_INTEGER, 1, 20, 10, MPI_INTEGER, my_win, ierr) endif call MPI_WIN_FENCE(assert, my_win, ierr)
! print out the results if (my_id == 0) then print*, b endif
! clean up and exit call MPI_WIN_FREE(my_win, ierr) call MPI_FINALIZE(ierr)
end program mpi_rma
Example Get Code Output
-1 -2 -3 -4 -5 -6
-7 -8 -9 21 22 23
24 25 26 27 28 29
30 -20 -21 -22 -23 -24
-25 -26 -27 -28 -29 -30 etc.
Note: Here, MPI_Get has placed ten elements from array a, into array b starting at index 10, using get call:
MPI_GET(b(10), 10, MPI_INTEGER, 1, 20, 10, MPI_INTEGER, my_win, ierr)
1.3.5 RMA Synchronization
- Must synchronize the system both before and after the PUT and GET commands lest bad things happen....
- before - to make sure all sides can begin the transfer
- after - to make sure the transfer is complete
- the format for access is a bit simpler than doing two sides send/recv pairs
- communications can go on in the background, as long as you don't access the elements before the �nal synchronization
1.3.6 References
1.4 PGAS Languages
Partitioned Global Address Space, also known as distributed shared memory. With low latency networks reaching disk read speeds its possible to treat networked computers as a large shared memory. These models are typically modeled after threads.
- Default declaration of a variable is local
- Shared variables can be created that can be updated and accessed
- Shared variables have affinity to particular processors
- It is not necessarily known which node 'owns' the distributed variable -- very dangerous!
Example Languages:
1.4.1 Unified Parallel C (UPC)
Unified Parallel C was created by a consortium of computer scientists to help reduce the difficulties associated with MPI. The primary website is [1]. There are free versions of the documentation and compiler for linux and other platforms. This language is currently being moved into the international standards process. Example 1 (from UPC documentation):
#include <upc_relaxed.h>
#include <stdio.h>
void main(){
printf(Hello World from THREAD %d (of %d THREADS)\n,MYTHREAD, THREADS);
}
MYTHREAD and THREAD is a parameter associated with the UPC compiler. Similar to:
call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr) call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
Example2:
up_forall(i=0; i< N; i++; i) {
printf("THREAD %d of %d is performing iteration\n",
MYTHREAD, THREADS, i);
}
Similar to a for loop, but it contains a fourth parameter 'i', this can be an integer or a variable with a specific affinity. The fourth parameter determines how the data is distributed across the threads - equivalent to adding:
if (MYTHREAD = i%thread) {
}
Example 3:
#include <upc_relaxed.h>
#include <stdio.h>
#define N 10
shared [2] int arr[10];
int main(){
int i=0;
upc_forall (i=0; i<N; i++; &arr[i]){
printf("THREAD %d (of %d THREADS) performing iteration %d\
MYTHREAD, THREADS, i);
}
return 0;
}
The shared line that is divided into chucks of size 2 in a round-robin fashion. The fourth argument in the upcforall statement associates the affinity of the loop "if (MYTHEAD == upc_threadof( &arr[i]) {"
How to Calculate Pi using UPC:
void main(void)
{
float local_pi=0.0;
int i;
upc_forall(i=0; i<N; i++; i%THREADS)
local_pi+=(float)f((0.5+i)/N);
local_pi*=(float)4.0/N;
upc_lock(&l);
pi+=local_pi;
upc_unlock(&l);
upc_barrier 1; // Ensure all is done
if (MYTHREAD==0) printf("PI=%10.9f\n",pi);
}
1.4.2 Coarray Fortran (Fortran 08)
- Coarray fortran was developed in 1998 as a small addition to Fortran.
- It is likely to be approved as part of the next ISO Fortran standard next month.
1.4.2.1 Basic Ideas for Coarrays
- Each running program is called an image
- Global variables are declared explicitly
doubleprecision::foo[*] real,dimension(10,10)::a[*] integer,dimension(5,3)::ii[4,*] integer,dimension(-2:5,7:10)::u[2:5,6,*]
- A is a 10X10 that has a copy on each image
- ii is a 5X3 array that is distributed into a virtual processing array of 4Xk such that it fills all the available images
- For a 6 cpu system, the following images exist for ii
ii[1,1],ii[2,1],ii[3,1],ii[4,1],ii[2,1],ii[2,2]
1.4.2.2 Coarrays Model
1.4.2.3 MPI
IF(NFIRST.EQ.0)THEN C C PERSISTENT COMMUNICATION REQUESTS C NFIRST=1 CALL MPI_SEND_INIT( + B(1,1),N,MPI_REAL,IMG_N,9905, + MPI_COMM_WORLD,MPIREQ(1),MPIERR) CALL MPI_SEND_INIT( + B(1,2),N,MPI_REAL,IMG_S,9906, + MPI_COMM_WORLD,MPIREQ(2),MPIERR) CALL MPI_RECV_INIT( + B(1,3),N,MPI_REAL,IMG_S,9905, + MPI_COMM_WORLD,MPIREQ(3),MPIERR) CALL MPI_RECV_INIT( + B(1,4),N,MPI_REAL,IMG_N,9906, + MPI_COMM_WORLD,MPIREQ(4),MPIERR) ENDIF C CALL MPI_STARTALL(4,MPIREQ,MPIERR) CALL MPI_WAITALL(4,MPIREQ,MPISTAT,MPIERR)
1.4.2.4 Coarray Fortran
real B(N,4)[*] CALL SYNC_ALL(WAIT=(/IMG_S,IMG_N/)) B(:,3) = B(:,1)[IMG_S] B(:,4) = B(:,2)[IMG_N] CALL SYNC_ALL(WAIT=(/IMG_S,IMG_N/))
1.4.2.5 Coarray Synchronization
Calc starts on image1, then the other images start processing after the memory is sychronized.
NTEGER::ME,NE,STEP,NSTEPS NE = NUM_IMAGES() ME = THIS_IMAGE() !Initialcalculation SYN CALL DO STEP=1,NSTEPS IF(ME>1)SYNC IMAGES(ME-1) !Performcalculation IF(ME<NE)SYNC IMAGES(ME+1) END DO SYNC ALL
1.4.2.6 The Big Pictures
- MPI and message passing is NOT the final word in parallel programming
- PGAS languages are becoming standards and compilers are coming out of the experimental stage
- These are the languages to use for any future big projects





