Problem 10
From csi702
Contents |
1 Problem 10
1.1 Description
Which of the following will run faster?
real :: a(5,100), b(100,5)
do j = 1, 100
do i = 1, 5
total = total + a(i,j)
enddo
enddo
or
real :: a(5,100), b(100,5)
do j = 1, 5
do i = 1, 100
total = total + b(i,j)
enddo
enddo
1.2 Expectation
The expectation is determined by the interpretation of the above code, and highlights some differences between C and Fortran. The invariant optimal solution would be to maximize the number of sequential operations that need to be read. Therefore at least in the case of C, that treats two dimensional arrays as arrays of arrays, we would want to address a single array containing the data elements one at a time.
Unfortunately with C neither of these cases provide sequential memory operations. So then the cost is associated with reading from memory locations. If either one is faster, it could be due to caching optimizations.
In Fortran, two dimensional arrays are stored as columns (Hargrove@sccm.stanford.edu, 1997) so one would expect the second loop version to run a bit faster due to the inner loop summing the column elements thus allowing memory to be accessed contiguously.
1.3 Code
This is the code in C for the first problem.
#include <stdio.h>
#define LOOP 10000000
/* ## inline function to be called multiple times ## */
void inline algorithm( void );
/* ## MAIN FUNCTION ## */
int main (int argc, char const *argv[])
{
int j;
for(j = 0; j < LOOP; j++) {
algorithm();
}
return 0;
}
/* ## inline function to be called multiple times ## */
void inline algorithm( void ) {
int a[5][100];
int b[100][5];
int i;
int j;
int total = 0;
for (j = 0; j < 100; j++){
for(i = 0; i < 5; i++){
total = total + a[i][j];
}
}
}
This is the change for the second.
/* ## inline function to be called multiple times ## */
void inline algorithm( void ) {
int a[5][100];
int b[100][5];
int i;
int j;
int total = 0;
for (j = 0; j < 5; j++){
for(i = 0; i < 100; i++){
total = total + b[i][j];
}
}
}
In Fortran:
program problem10 integer k do k = 1, 1000000 call algorithm1 call algorithm2 enddo end
Code where the sum is calculated across the rows
subroutine algorithm1()
integer i, j
real total
real :: a(5,100), b(100,5)
do j = 1, 100
do i = 1, 5
total = total + a(i,j)
enddo
enddo
end
Code where the sum is calculated down the columns
subroutine algorithm2()
integer i, j
real total
real :: a(5,100), b(100,5)
do j = 1, 5
do i = 1, 100
total = total + b(i,j)
enddo
enddo
end
1.4 Comments
| Method 1 | Method 2 | |
|---|---|---|
| No Optimization | real 23.249 / user 21.537 / sys 0.073 | real 22.693 / user 20.890 / sys 0.084 |
| Optimization | real 0.013 / user 0.001 / sys 0.003 | real 0.008 / user 0.001 / sys 0.003 |
As can be seen here, the first method is slightly slower, this is caused by the additional cost of jumping in and out of the inner loop considerably more times than the second method. This may not be entirely obvious unless we use a detailed evaluation program such as shark under OSX with slightly more skewed numbers. This is best seen through the following two screen captures:
Method 1:
Method 2:
Here we can see that the first method is spending a huge amount of time proportionally to changing between both loops, while method two gets to chug along and work most of the time on the inner loop. As for the optimized timings the dramatic loss at -O3 is most likely the compiler determining the sum pre-compilation.
The GPROF times were not sufficiently large except for the No Optimization methods.
Method 1:
granularity: each sample hit covers 4 byte(s) for 0.06% of 16.42 seconds
index % time self children called name
<spontaneous>
[1] 100.0 0.07 16.36 main [1]
16.36 0.00 10000000/10000000 algorithm [2]
-----------------------------------------------
16.36 0.00 10000000/10000000 main [1]
[2] 99.6 16.36 0.00 10000000 algorithm [2]
-----------------------------------------------
Method 2:
granularity: each sample hit covers 4 byte(s) for 0.07% of 13.86 seconds
index % time self children called name
<spontaneous>
[1] 100.0 0.04 13.82 main [1]
13.82 0.00 10000000/10000000 algorithm [2]
-----------------------------------------------
13.82 0.00 10000000/10000000 main [1]
[2] 99.7 13.82 0.00 10000000 algorithm [2]
-----------------------------------------------
Here are some timings for the more extreme case I showed with the shark profiling tool:
Method 1:
real 0m0.889s user 0m0.872s sys 0m0.008s
Method 2:
real 0m0.779s user 0m0.768s sys 0m0.004s
Note: The GPROF timings were done later on a separate run on a Linux virtual machine, the timing runs were done in OSX since GPROF does not provide timings in OSX.
Another try of the problem.
Both version are programed as loop1 and loop2, plus a third version loop3 for comparison. In loop3 the array is accessed sequentially.
#include <stdlib.h>
int loop1(double *a) {
int i,j;
double total = 0.;
for (j = 0; j < 10000; j++){
for(i = 0; i < 500; i++){
total = total + a[i*10000+j];
}
}
return 0;
}
int loop2(double *a) {
int i,j;
double total = 0.;
for (j = 0; j < 500; j++){
for(i = 0; i < 10000; i++){
total = total + a[i*500+j];
}
}
return 0;
}
int loop3(double *a) {
int i,j;
double total = 0.;
for (i = 0; i < 500; i++){
for(j = 0; j < 10000; j++){
total = total + a[i*10000+j];
}
}
return 0;
}
int main (int argc, char *argv[]){
int i;
int n = atoi(argv[1]);
double *a;
a = (double *)malloc(sizeof(double)*5000000);
for(i = 0; i < 5000000; i++) a[i]=1.;
for(i = 0; i < n; i++) loop1(a);
for(i = 0; i < n; i++) loop2(a);
for(i = 0; i < n; i++) loop3(a);
return 0;
}
The figure shows the times given by gprof for each subroutine for the following cases:
- Array of 100x5 elements accessed 10,000,000 times (total 5x10^9), compiled without optimization.
- Array of 10000x500 elements accessed 1000 times (total 5x10^9), compiled without optimization.
- Array of 100x5 elements accessed 10,000,000 times (total 5x10^9), compiled with optimization O2.
- Array of 10000x500 elements accessed 1000 times (total 5x10^9), compiled with optimization O2.
For the first case (red), the complete array most likely fits in the cache, so there are no significative differences in the timing between loops. In the second case (green), the array is bigger than the cache, and it can be seen that the times increased for loops 1 and 2 due to cache misses, while loop 3 has the same time than in the previous case, as it uses the cache optimally. It has yet to be explain why the time of loop 1, which jumps 500 elements per iteration, is smaller than for loop 2, which jumps of 10000 elements.
If optimization is used (cases 3 and 4, blue) the time drops to <0.01s.
Fortran Comments
Using gprof on the compiled code with no optimization yielded the following results
- Trial 1: 57.51 sec for algorithm 1, 55.37 sec for algorithm 2
- Trial 2: 56.53 sec for algorithm 1, 55.82 sec for algorithm 2
- Trial 3: 56.68 sec for algorithm 1, 56.00 sec for algorithm 2
On average, algorithm 2 is 2% faster than algorithm 1. If optimization is turned on to -O2, both algorithms finish < .01 seconds.
1.5 Contributors
Marcelo Raschi
H. Le



