Problem 10

From csi702

Jump to: navigation, search

Contents

  1. Problem 10
    1. Description
    2. Expectation
    3. Code
    4. Comments
    5. Contributors

1 Problem 10

1.1 Description

Which of the following will run faster?

 real :: a(5,100), b(100,5)
 do j = 1, 100
   do i = 1, 5
     total = total + a(i,j)
   enddo
 enddo

or

 real :: a(5,100), b(100,5)
 do j = 1, 5
   do i = 1, 100
     total = total + b(i,j)
   enddo
 enddo

1.2 Expectation

The expectation is determined by the interpretation of the above code, and highlights some differences between C and Fortran. The invariant optimal solution would be to maximize the number of sequential operations that need to be read. Therefore at least in the case of C, that treats two dimensional arrays as arrays of arrays, we would want to address a single array containing the data elements one at a time.

Unfortunately with C neither of these cases provide sequential memory operations. So then the cost is associated with reading from memory locations. If either one is faster, it could be due to caching optimizations.

In Fortran, two dimensional arrays are stored as columns (Hargrove@sccm.stanford.edu, 1997) so one would expect the second loop version to run a bit faster due to the inner loop summing the column elements thus allowing memory to be accessed contiguously.

1.3 Code

This is the code in C for the first problem.

#include <stdio.h>

#define LOOP 10000000
	
	
/*   ## inline function to be called multiple times ##   */
void inline algorithm( void );


/*   ## MAIN FUNCTION ##   */
int main (int argc, char const *argv[])
{
	int j;
	
	
	for(j = 0; j < LOOP; j++) {
		algorithm();
	}
	
	return 0;
}
	
	
/*   ## inline function to be called multiple times ##   */
void inline algorithm( void ) {
	
	int	a[5][100];
	int	b[100][5];
	int i;
	int j;
	int total = 0;
	
	for (j = 0; j < 100; j++){
		for(i = 0; i < 5; i++){
			total = total + a[i][j];
		}
	}
}


This is the change for the second.

/*   ## inline function to be called multiple times ##   */
void inline algorithm( void ) {
	
	int	a[5][100];
	int	b[100][5];
	int i;
	int j;
	int total = 0;
	
	for (j = 0; j < 5; j++){
		for(i = 0; i < 100; i++){
			total = total + b[i][j];
		}
	}
}


In Fortran:

 program problem10
 integer k
 do k = 1, 1000000
   call algorithm1
   call algorithm2
 enddo
 end

Code where the sum is calculated across the rows

 subroutine algorithm1()
 integer i, j
 real total
 real :: a(5,100), b(100,5)
 do j = 1, 100
   do i = 1, 5
     total = total + a(i,j)
   enddo
 enddo
 end

Code where the sum is calculated down the columns

 subroutine algorithm2()
 integer i, j
 real total
 real :: a(5,100), b(100,5)
 do j = 1, 5
   do i = 1, 100
     total = total + b(i,j)
   enddo
 enddo
 end

1.4 Comments

Method 1 Method 2
No Optimization real 23.249 / user 21.537 / sys 0.073 real 22.693 / user 20.890 / sys 0.084
Optimization real 0.013 / user 0.001 / sys 0.003 real 0.008 / user 0.001 / sys 0.003


As can be seen here, the first method is slightly slower, this is caused by the additional cost of jumping in and out of the inner loop considerably more times than the second method. This may not be entirely obvious unless we use a detailed evaluation program such as shark under OSX with slightly more skewed numbers. This is best seen through the following two screen captures:

Method 1:

Image:Problem10a.png

Method 2:

Image:Problem10b.png

Here we can see that the first method is spending a huge amount of time proportionally to changing between both loops, while method two gets to chug along and work most of the time on the inner loop. As for the optimized timings the dramatic loss at -O3 is most likely the compiler determining the sum pre-compilation.

The GPROF times were not sufficiently large except for the No Optimization methods.

Method 1:

granularity: each sample hit covers 4 byte(s) for 0.06% of 16.42 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.07   16.36                 main [1]
               16.36    0.00 10000000/10000000     algorithm [2]
-----------------------------------------------
               16.36    0.00 10000000/10000000     main [1]
[2]     99.6   16.36    0.00 10000000         algorithm [2]
-----------------------------------------------

Method 2:

granularity: each sample hit covers 4 byte(s) for 0.07% of 13.86 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.04   13.82                 main [1]
               13.82    0.00 10000000/10000000     algorithm [2]
-----------------------------------------------
               13.82    0.00 10000000/10000000     main [1]
[2]     99.7   13.82    0.00 10000000         algorithm [2]
-----------------------------------------------

Here are some timings for the more extreme case I showed with the shark profiling tool:

Method 1:

real	0m0.889s 
user	0m0.872s
sys	0m0.008s

Method 2:

real	0m0.779s
user	0m0.768s
sys	0m0.004s


Note: The GPROF timings were done later on a separate run on a Linux virtual machine, the timing runs were done in OSX since GPROF does not provide timings in OSX.


Another try of the problem.

Both version are programed as loop1 and loop2, plus a third version loop3 for comparison. In loop3 the array is accessed sequentially.

 #include <stdlib.h>
 int loop1(double *a) {
   int i,j;
   double total = 0.;
   for (j = 0; j < 10000; j++){
     for(i = 0; i < 500; i++){
           total = total + a[i*10000+j];
     }
   }
   return 0;
 }
 int loop2(double *a) {
   int i,j;
   double total = 0.;
   for (j = 0; j < 500; j++){
     for(i = 0; i < 10000; i++){
           total = total + a[i*500+j];
     }
   }
   return 0;
 }
 int loop3(double *a) {
   int i,j;
   double total = 0.;
   for (i = 0; i < 500; i++){
     for(j = 0; j < 10000; j++){
           total = total + a[i*10000+j];
     }
   }
   return 0;
 }
 int main (int argc, char *argv[]){
   int i;
   int n = atoi(argv[1]);
   double *a;
   a = (double *)malloc(sizeof(double)*5000000);
   for(i = 0; i < 5000000; i++) a[i]=1.;
   for(i = 0; i < n; i++) loop1(a);
   for(i = 0; i < n; i++) loop2(a);
   for(i = 0; i < n; i++) loop3(a);
   return 0;
 }

Image:P10.png

The figure shows the times given by gprof for each subroutine for the following cases:

  1. Array of 100x5 elements accessed 10,000,000 times (total 5x10^9), compiled without optimization.
  2. Array of 10000x500 elements accessed 1000 times (total 5x10^9), compiled without optimization.
  3. Array of 100x5 elements accessed 10,000,000 times (total 5x10^9), compiled with optimization O2.
  4. Array of 10000x500 elements accessed 1000 times (total 5x10^9), compiled with optimization O2.

For the first case (red), the complete array most likely fits in the cache, so there are no significative differences in the timing between loops. In the second case (green), the array is bigger than the cache, and it can be seen that the times increased for loops 1 and 2 due to cache misses, while loop 3 has the same time than in the previous case, as it uses the cache optimally. It has yet to be explain why the time of loop 1, which jumps 500 elements per iteration, is smaller than for loop 2, which jumps of 10000 elements.

If optimization is used (cases 3 and 4, blue) the time drops to <0.01s.


Fortran Comments

Using gprof on the compiled code with no optimization yielded the following results

  • Trial 1: 57.51 sec for algorithm 1, 55.37 sec for algorithm 2
  • Trial 2: 56.53 sec for algorithm 1, 55.82 sec for algorithm 2
  • Trial 3: 56.68 sec for algorithm 1, 56.00 sec for algorithm 2

On average, algorithm 2 is 2% faster than algorithm 1. If optimization is turned on to -O2, both algorithms finish < .01 seconds.

1.5 Contributors

Jonathan Lisic

Marcelo Raschi

H. Le

Personal tools