Advanced MPI

From csi702

Jump to: navigation, search

Contents

  1. Advanced MPI
    1. MPE Library
      1. MPE Profiling
      2. MPE Tools
    2. MPI Data Packing
      1. Structures
      2. Packing of data
      3. MPI Packing data
      4. MPI Domain Decomposition
      5. MPI Domain Decomposition
      6. Sends and Receives
      7. Message Passing
    3. Remote Memory Access
      1. Creating and Deleting Memory Windows
      2. MPI Puts and Gets
      3. Example Put Code
      4. Counterpart Example Get Code
      5. RMA Synchronization
      6. References
    4. PGAS Languages
      1. Unified Parallel C (UPC)
      2. Coarray Fortran (Fortran 08)
        1. Basic Ideas for Coarrays
        2. Coarrays Model
        3. MPI
        4. Coarray Fortran
        5. Coarray Synchronization
        6. The Big Pictures

1 Advanced MPI

1.1 MPE Library

Note that the MPE libraries are still not up on the CDS clusters. -4/3/10

MPE is a set of MPI extensions that are most commonly used for profiling program performance.

Profiling is essential to writing efficient parallel codes. It is necessary to identify communications bottlenecks and unbalanced loads (ie find the nodes that aren't doing anything). Inefficiencies are most likely due to communications bottlenecks and unbalanced loads. MPE is the most universal way of identifying these issues in MPI codes.

Steps to profiling using MPE:

  • include the header for the MPE library
  • compile the code so it accesses the correct headers
  • link to the MPE and LMPE libraries

Add the header to your code:

  1. include "mpe.h"

Compile with the -lmpe library.

It looks like Dr. Wallin suggests copying the header files mpe.h and mpe_log.h to your local compile directory but instead you can just add it to your LD_LIBRARY_PATH (?).

When mpe is loaded properly mpirun will tell you its writing the log file.

Once you've got your MPI code compiling and running with MPE you can start visualizing the log files and profiling your code.

1.1.1 MPE Profiling

The output of the profiling is a set of time marks associated with different events. Events can be MPI communications or user defined events. The most common visualization tool is “jumpshot”. Example output:

1.1.2 MPE Tools

Other MPE events:

  • MPE_Log_get_state_eventIDs( &event2a, &event2b );
  • MPE_Describe_state( event2a, event2b,"Send", "orange" );
  • MPE_Log_get_solo_eventID( &event1 );
  • MPE_Describe_event( event1,"Sync Start", "white" );

Marking Sections of the code:

  • ierr = MPE_Log_event(event2a, 0, NULL);
  • ierr = MPI_Send(message, message_size,MPI_INT, dest, tag, MPI_COMM_WORLD);
  • ierr = MPE_Log_event(event2b, 0, NULL);

Synchronization of the Clocks:

  • MPE_Log_sync_clocks();
  • MPE_Finish_log( "logname" );

1.2 MPI Data Packing

1.2.1 Structures

typedef struct test1 {
  char c;
  double b;
  char cc;
}   t1;
typedef struct test2 {
  char cc;
  double b;
}   t2;
printf("size of structure 1 = %d \n",sizeof(t1));
printf("size of structure 2 = %d \n",sizeof(t2));

What we expected:

1 + 8 + 1 = 10 and 1 + 1 + 8 = 10

But in reality:

size  of char 1
size  of double 8
size  of structure 1 = 16
size  of structure 2 = 12

So 1 + 8 + 1 = 16 and 1 + 1 + 8 = 12??????

What happened?

1.2.2 Packing of data

For 32 bit Operation System, it has a 4 bytes boundary. If the data length exceeds boundary, it will use a new line.

For 64 bit Operation System, it has a 8 bytes boundary. If the data length exceeds boundary, it will use a new line.

1.2.3 MPI Packing data

  • the displacement from the start of the data block
  • the number of elements in the block
  • the type of elements in the block

1.2.4 MPI Domain Decomposition

Two examples below are all for 32bit OS

The packing code is like:

type (new_type) :: my_type
integer :: mpi_my_type
integer, parameter :: new_type_length = 3
integer, parameter :: new_mpi_types(1:3) = &
  (/MPI_INTEGER,MPI_DOUBLE_PRECISION, MPI_INTEGER/)
integer, parameter :: block_length(1:3) = (/1,3,1/)
integer, dimension(3) :: address
call MPI_ADDRESS(my_type%istart, address(1), ierr)
call MPI_ADDRESS(my_type%qxx,    address(2), ierr)
call MPI_ADDRESS(my_type%iend,   address(3), ierr)
do i = 1, new_type_length
  displacements(i) = address(i) - address(1)
enddo
call MPI_TYPE_STRUCT(new_type_length, &
     block_length, displacements, &
     ptypes, mpi_my_type, ierr)
call MPI_TYPE_COMMIT(mpi_my_type, ierr)
! optional
call MPI_TYPE_SIZE(mpi_my_type, isize, ierr)

A data structure example code is like:

type ucquad_type
 integer :: inode
 real :: qxx
 real :: qxy
 real :: qxz
 real :: qyy
 real :: qyz
 real :: qzz
end type ucquad_type
 
integer :: mpi_ucquad_type
integer, parameter :: ucquadblock_length = 2
integer, parameter :: ucquadblock_counts(1:2) = (/1,6/)
integer, parameter :: ucquadblock_size(1:2)   = (/1,1/)
integer, parameter :: ucquad_mpi_type(1:2) = (/MPI_INTEGER, MPI_REAL/)
 
integer :: mpi_ucquad_type
integer, parameter :: ucquadblock_length = 2
integer, parameter :: ucquadblock_counts(1:2) = (/1,6/)
integer, parameter :: ucquadblock_size(1:2)   = (/1,1/)
integer, parameter :: ucquad_mpi_type(1:2) = (/MPI_INTEGER,MPI_REAL/)


Code for starting addresses of the blocks

call MPI_ADDRESS(ucquad_tmp%inode, address(1), ierr)
call MPI_ADDRESS(ucquad_tmp%qxx, address(2), ierr)

Code for calculate displacements and setting types

do i = 1, ucquadblock_length
  block_length(i) = ucquadblock_counts(i) * &
          ucquadblock_size(i)
  ptypes(i) = ucquad_mpi_type(i)
  displacements(i) = address(i) - address(1)
enddo

Code for creating the structure

call MPI_TYPE_STRUCT(ucquadblock_length, &
  block_length, displacements, &
  ptypes, mpi_ucquad_type, ierr)
call MPI_TYPE_COMMIT(mpi_ucquad_type, ierr)
call MPI_TYPE_SIZE(mpi_ucquad_type, isize, ierr)

Code for using the new data type

call MPI_IRECV(ucquad_r(buffer_start), &
            msize, &
            mpi_ucquad_type, &
             p2, tag+1, MPI_COMM_WORLD, &
            quad_request(next_recv,queue_id), ierr)
call MPI_GET_COUNT(qstatus,  mpi_ucquad_type, qcount, ierr)

1.2.5 MPI Domain Decomposition

Code for creating two sets

do i = 1, median_proc + 1
  ranks(i) = i-1
enddo
 left_group_size = median_proc + 1
 right_group_size = group_size - left_group_size

Code for creating two new Comm Groups

if (left_group_size > 1 ) then
 call MPI_GROUP_INCL(full_group, &
     left_group_size, ranks, left_group, ierr)
 call MPI_GROUP_SIZE(left_group, group_size, ierr)
 call MPI_COMM_CREATE(my_comm, left_group, left_comm, ierr
endif
if (right_group_size > 1) then
 call MPI_GROUP_EXCL(full_group, &
      left_group_size, ranks, right_group, ierr)
 call MPI_GROUP_SIZE(right_group, group_size, ierr)
 call MPI_COMM_CREATE(my_comm, right_group, right_comm, ie
endif

Code for applying the decomposition recursively

if (left_group_size > 1 .and. &
        local_id <= median_proc ) then
  call ORB_RECURSIVE(left_comm, initial_sample_rate, &
     final_sample_rate, level + 1)
  call MPI_COMM_FREE(left_comm, ierr)
endif
if (right_group_size > 1 .and. &
       local_id > median_proc) then
  call ORB_RECURSIVE(right_comm, initial_sample_rate, &
     final_sample_rate, level + 1)
  call MPI_COMM_FREE(right_comm, ierr)
endif

1.2.6 Sends and Receives

  • Blocking

The syntax for blocking sends is showed below. It is good for a synchronization code. Since the code won't proceed until send or receive are done.

int MPI Send( void *buf, int count, MPI Datatype datatype, int
dest, int type, MPI Comm comm);
int MPI Recv( void *buf, int count, MPI Datatype datatype, int
source, int type, MPI Comm comm, MPI Status status);
  • Non-Blocking

It can also be used for synchronization. But it required better designed code or we can add a MPI_Wait to wait for the receive or send end.

int MPI_Isend( void *buf, int count, MPI_Datatype datatype
 int dest, int type, MPI_Comm comm);
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype
  int source, int type, MPI_Comm comm, MPI_Status status);
int MPI_Iprobe(int source, int type, MPI_Comm comm,
  MPI_Status status, boolean flag, int status[], int err);

References

[1] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Send.html

[2] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Recv.html

[3] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Isend.html

[4] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Irecv.html

[5] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Wait.html

[6] http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Probe.html

1.2.7 Message Passing

  • Message passing a difficult paradigm to use
  • Two processors must execute simultaneous and different

commands to make the code work

  • Non-blocking sends and receives help, as does wild cards... but

Making this work is difficult

References

There are a lot of interesting examples on wikipedia. Click the link below.

[1] http://en.wikipedia.org/wiki/Message_passing

1.3 Remote Memory Access

  • Remote Memory Access allows you to do one-sided put's and get's on remote processor memory
  • This model is similar to shared memory
  • You need to be aware of synchronization issues

The Big Idea:

  • One process specifies communication parameters.
  • Separates communication and synchronization.
  • User imposes right ordering of memory accesses.
  • Origin: the process that performs the call.
  • Target: the process in which memory is accessed.


frame Explicit diffusion problem in 2d
frame Explicit diffusion problem in 2d

[1]

1.3.1 Creating and Deleting Memory Windows

  • Creating a window in memory that is accessible to other processors

MPI_WIN_CREATE(a, ndim, 4, MPI_INFO_NULL, MPI_COMM_WORLD, my_win, ierr)

MPI_WIN_CREATE Documentation

  • Closing same window in memory

MPI_WIN_FREE(my_win, ierr)

Fences

  • Performing an MPI fence for synchronization on a MPI window
  • The mutex variable of MPI
  • The assert argument is used to indicate special conditions for the fence that an implementation may use to optimize the MPI_Win_fence operation.

MPI_WIN_FENCE(assert, my_win, ierr)

For most cases: assert = 0

MPI_WIN_FENCE Documentation

1.3.2 MPI Puts and Gets

MPI_Puts

  • Putting data into a memory window on a remote processor

MPI_PUT(source array, source nelements, source MPI_TYPE, target rank, target_displacement, target_length, target MPI_TYPE, my_win, ierr)

MPI_PUT Documentation

MPI_Gets

  • Getting data from a memory window on a remote processor

MPI_GET(source array, source nelements, source MPI_TYPE, target rank, target_displacement, target_count, & target MPI_TYPE, my_win, ierr)


MPI_GET Documentation

1.3.3 Example Put Code

program mpi_rma
 include "mpif.h"
 integer, parameter :: ndim = 100;
 integer, dimension(ndim) :: a, b
 integer :: ierr, nproc, my_id
 integer :: my_win, assert
 call MPI_INIT(ierr)
 call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
 call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
 /* Set up the window  */
 a(1:ndim) = my_id
 b(1:ndim) = (my_id+1)*50
 call MPI_WIN_CREATE(a, ndim, 4, MPI_INFO_NULL,  MPI_COMM_WORLD, my_win, ierr)
 /* Do the Transfer  */
 assert = 0
 call MPI_WIN_FENCE(assert, my_win, ierr)
 /* Examine the results  */
 if (my_id == 0) then
   call MPI_PUT(b, 10, MPI_INTEGER, 1, 20, 10, MPI_INTEGER, my_win, ierr)
 endif
 call MPI_WIN_FENCE(assert, my_win, ierr)
 if (my_id == 1) then
  print*, a
   endif
  call MPI_WIN_FREE(my_win, ierr)
 call MPI_FINALIZE(ierr)
end program mpi_rma


Example Put Code Output

1 1 1 1 1 1

1 1 1 1 1 1

1 1 1 1 1 1

1 1 50 50 50 50

50 50 50 50 50 50

1 1 1 1 1 1

1 1 1 1 1 1

1 1 1 1...etc.

Note: Here, MPI_Put has placed ten '50's' from array b, into array a starting at index 20, using put call:

MPI_PUT(b, 10, MPI_INTEGER, 1, 20, 10, MPI_INTEGER, my_win, ierr)

1.3.4 Counterpart Example Get Code

program mpi_rma
include "mpif.h" 
integer, parameter :: ndim = 40;
integer, dimension(ndim) :: a, b
integer :: ierr, nproc, my_id
integer :: my_win, assert, i
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr) 
! initialize the a and b array
do i = 1, ndim
  a(i) = i
  b(i) = -i
enddo
! create the window
call MPI_WIN_CREATE(a, ndim, 4, MPI_INFO_NULL, &
     MPI_COMM_WORLD, my_win, ierr)


! do the get with a fence around it - not the displacement of 20 on the 
! target size and loading into element 10
assert = 0
call MPI_WIN_FENCE(assert, my_win, ierr)
if (my_id == 0) then
 call MPI_GET(b(10), 10, MPI_INTEGER, 1, 20, 10, MPI_INTEGER, my_win, ierr)
endif
call MPI_WIN_FENCE(assert, my_win, ierr)
! print out the results
if (my_id == 0) then
   print*, b
 endif
! clean up and exit
call MPI_WIN_FREE(my_win, ierr)
call MPI_FINALIZE(ierr)
end program mpi_rma

Example Get Code Output

-1 -2 -3 -4 -5 -6

-7 -8 -9 21 22 23

24 25 26 27 28 29

30 -20 -21 -22 -23 -24

-25 -26 -27 -28 -29 -30 etc.


Note: Here, MPI_Get has placed ten elements from array a, into array b starting at index 10, using get call:

MPI_GET(b(10), 10, MPI_INTEGER, 1, 20, 10, MPI_INTEGER, my_win, ierr)


1.3.5 RMA Synchronization

  • Must synchronize the system both before and after the PUT and GET commands lest bad things happen....
    • before - to make sure all sides can begin the transfer
    • after - to make sure the transfer is complete
  • the format for access is a bit simpler than doing two sides send/recv pairs
  • communications can go on in the background, as long as you don't access the elements before the �nal synchronization

1.3.6 References



link title

1.4 PGAS Languages

Partitioned Global Address Space, also known as distributed shared memory. With low latency networks reaching disk read speeds its possible to treat networked computers as a large shared memory. These models are typically modeled after threads.

  • Default declaration of a variable is local
  • Shared variables can be created that can be updated and accessed
  • Shared variables have affinity to particular processors
  • It is not necessarily known which node 'owns' the distributed variable -- very dangerous!

Example Languages:

1.4.1 Unified Parallel C (UPC)

Unified Parallel C was created by a consortium of computer scientists to help reduce the difficulties associated with MPI. The primary website is [1]. There are free versions of the documentation and compiler for linux and other platforms. This language is currently being moved into the international standards process. Example 1 (from UPC documentation):

#include <upc_relaxed.h>
#include <stdio.h>
void main(){
  printf(Hello World from THREAD %d (of %d THREADS)\n,MYTHREAD, THREADS);
}

MYTHREAD and THREAD is a parameter associated with the UPC compiler. Similar to:

   call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

Example2:

up_forall(i=0; i< N; i++; i) {
  printf("THREAD %d of %d is performing iteration\n",
     MYTHREAD, THREADS, i);
}

Similar to a for loop, but it contains a fourth parameter 'i', this can be an integer or a variable with a specific affinity. The fourth parameter determines how the data is distributed across the threads - equivalent to adding:

if (MYTHREAD = i%thread) {
}

Example 3:

#include <upc_relaxed.h>
#include <stdio.h>
#define N 10
shared [2] int arr[10];
int main(){
int i=0;
upc_forall (i=0; i<N; i++; &arr[i]){
printf("THREAD %d (of %d THREADS) performing iteration %d\
    MYTHREAD, THREADS, i);
  }
return 0;
}

The shared line that is divided into chucks of size 2 in a round-robin fashion. The fourth argument in the upcforall statement associates the affinity of the loop "if (MYTHEAD == upc_threadof( &arr[i]) {"

How to Calculate Pi using UPC:

void main(void)
{
  float local_pi=0.0;
  int i;
  upc_forall(i=0; i<N; i++; i%THREADS)
    local_pi+=(float)f((0.5+i)/N);
  local_pi*=(float)4.0/N;
  upc_lock(&l);
  pi+=local_pi;
  upc_unlock(&l);
  upc_barrier 1;  // Ensure all is done
  if (MYTHREAD==0) printf("PI=%10.9f\n",pi);
}

1.4.2 Coarray Fortran (Fortran 08)

  • Coarray fortran was developed in 1998 as a small addition to Fortran.
  • It is likely to be approved as part of the next ISO Fortran standard next month.

1.4.2.1 Basic Ideas for Coarrays

  • Each running program is called an image
  • Global variables are declared explicitly
doubleprecision::foo[*]
real,dimension(10,10)::a[*]
integer,dimension(5,3)::ii[4,*]
integer,dimension(-2:5,7:10)::u[2:5,6,*]
  • A is a 10X10 that has a copy on each image
  • ii is a 5X3 array that is distributed into a virtual processing array of 4Xk such that it fills all the available images
    • For a 6 cpu system, the following images exist for ii
     ii[1,1],ii[2,1],ii[3,1],ii[4,1],ii[2,1],ii[2,2]

1.4.2.2 Coarrays Model

1.4.2.3 MPI

IF(NFIRST.EQ.0)THEN
C
C    PERSISTENT COMMUNICATION REQUESTS
C
     NFIRST=1
     CALL MPI_SEND_INIT(
   +     B(1,1),N,MPI_REAL,IMG_N,9905,
   +     MPI_COMM_WORLD,MPIREQ(1),MPIERR)
     CALL MPI_SEND_INIT(
   +     B(1,2),N,MPI_REAL,IMG_S,9906,
   +     MPI_COMM_WORLD,MPIREQ(2),MPIERR)
     CALL MPI_RECV_INIT(
   +     B(1,3),N,MPI_REAL,IMG_S,9905,
   +     MPI_COMM_WORLD,MPIREQ(3),MPIERR)
     CALL MPI_RECV_INIT(
   +     B(1,4),N,MPI_REAL,IMG_N,9906,
   +     MPI_COMM_WORLD,MPIREQ(4),MPIERR)
     ENDIF
C
     CALL MPI_STARTALL(4,MPIREQ,MPIERR)
     CALL MPI_WAITALL(4,MPIREQ,MPISTAT,MPIERR)

1.4.2.4 Coarray Fortran

real  B(N,4)[*]
CALL SYNC_ALL(WAIT=(/IMG_S,IMG_N/))
B(:,3) = B(:,1)[IMG_S]
B(:,4) = B(:,2)[IMG_N]
CALL SYNC_ALL(WAIT=(/IMG_S,IMG_N/))

1.4.2.5 Coarray Synchronization

Calc starts on image1, then the other images start processing after the memory is sychronized.

NTEGER::ME,NE,STEP,NSTEPS
NE = NUM_IMAGES()
ME = THIS_IMAGE()
   !Initialcalculation
SYN CALL
DO STEP=1,NSTEPS
IF(ME>1)SYNC IMAGES(ME-1)
   !Performcalculation
IF(ME<NE)SYNC IMAGES(ME+1)
END DO
SYNC ALL

1.4.2.6 The Big Pictures

  • MPI and message passing is NOT the final word in parallel programming
  • PGAS languages are becoming standards and compilers are coming out of the experimental stage
  • These are the languages to use for any future big projects
Personal tools