Optimization for Single Processor Machines

From csi702

Jump to: navigation, search


Contents

  1. Introduction to Optimization
    1. Philosophy of Code Optimization
      1. Don't Optimize Prematurely
      2. Have Performance Goals
      3. Profile To Find Bottlenecks
      4. Understand Theoretical Limits
  2. Hardware Considerations
    1. CPU Design
      1. CISC Processors
      2. RISC Processors
        1. Instruction Pipelining
        2. Uniform Instruction Length
        3. Simple Addressing Modes
      3. Post-RISC Processors
    2. Memory Design
      1. Cache
  3. Compilers
  4. Module and Routine Design
    1. Timing and Profiling
    2. Optimization By Hand
  5. References

1 Introduction to Optimization

Optimization is the process of making your application execute more efficiently. Efficiency can be measured either by execution speed or resource usage. In general, there are four main targets for optimization:

  • Hardware -- Adding additional resources to your computer. (i.e. ram, hard drive, processor)
  • Compiler -- Compiling using various optimization flags (i.e. -O1, -O2, -O3)
  • Module and Routine Design -- Choosing efficient algorithms and memory schemes (i.e. bubble sort vs quick sort)
  • Operating System Interactions -- Optimizing network and disk IO and taking advantage of threading

1.1 Philosophy of Code Optimization

Program optimization has existed since the the very first computer applications were written. As a result, there are a wide variety of opinions regarding how code should be optimized. However, there are useful general guidelines and widely accepted program optimization practices which can provide useful and practical guidance.

1.1.1 Don't Optimize Prematurely

While optimization is important, it is useless unless the program that has been developed works as designed. Only after the code has been completed should optimization begin. This is not to say that algorithm choice and resource usage should be ignored until the code has been written. However, performance tuning of an application makes the most sense once you know what portions of the code are executed most often. The Pareto Principle[1] or "80:20 rule" is as applicable to computer code as it is to real world situations. That is to say that your application will often spend 80% of its time in 20% of the code. As a result, optimizing that 20% can pay huge dividends in your application's efficiency.

1.1.2 Have Performance Goals

However, application efficiency is only important if you must meet particular performance benchmarks. For example, the speed of code which processes user input might be bound by a user's data entry speed regardless of the efficiency of the code. Similarly, slow and inefficient code which is only run nightly as a batch job may not be worth optimizing because the code already performs its task within the required time frame.

1.1.3 Profile To Find Bottlenecks

Performance bottlenecks, locations in a piece of code where the processor spends significant amounts of execution time, can occur for many reasons. Software which accesses large amounts of data from slow media (such as hard disks) may be IO bound. Similarly, software which accesses large amounts of remote data may be network bound. Finally, complicated or inefficient algorithms can cause execution speed to be bounded by the clock speed of the CPU. When single bottlenecks dominate the execution time of an application, relatively small optimizations can often dramatically improve performance. However, as obvious bottlenecks disappear, much more significant optimization effort is often required to achieve further efficiency gains.

1.1.4 Understand Theoretical Limits

Theoretical results may sometimes show that further optimization will likely be ineffective. For example, if an O(n log n) comparison based sorting algorithm has been implemented, you can't expect to receive a significant speed up of that portion of the code. As a result, it is important to know what portions of the code stand the best chance of being optimized.

2 Hardware Considerations

The speed and complexity of computer hardware has increased exponentially over the last 70 years. Computers today are millions to billions of times faster the first computers while at the same occupy a fraction of the space. Early computers such as Colossus and ENIAC were able to process between 5 and 100 operations per second.

A modern "commodity" microprocessor (as of 2007) can process billions of operations per second, and many of these operations are more complicated and useful than early computer operations. [1] However, even as computers have become more and more complex, the actual building blocks that make up a computer are relatively simple. In fact, an entire computer can be created using only Nand gates. Likewise, while computer languages have become more complex, the actual operations performed by microprocessors is relatively small. These basic operations can be grouped into the following categories:

  • Load - load data from memory into the CPU
  • Store - store data from the CPU into memory
  • Branch or Jump - alter order of instruction execution
  • Math and Logical Operations - internal operations within or between different words (shifts, adds, XOR, etc)

These simple operations are combined to perform more complex operations.

Over time, the largest advances have been in:

  • CPU design
  • Memory design

2.1 CPU Design

CPUs are the central processing component within every modern computer. The architecture of CPUs has changed drastically since the first computer was invented. These transformations occurred in response to both technological and philosophical shifts.

2.1.1 CISC Processors

Complex Instruction Set Computers have machine languages with many of the following characteristics:

  • Multiple types of operations (load, arithmetic, store) are performed using single instructions.[1]
  • Instructions operate on specific registers or specific memory locations.
  • Large number of possible machine instructions.

CISC architectures have included both early microprocessors such as : PDP-11, System/360, VAX, Motorola 68000 family, as well as relatively more modern chips including: Intel x86, Intel Pentium/Core, and AMD Athlon.[1]

In environments where memory is at a premium, CISC designs offered very compact binaries due to the large amount of work which could be performed by every instruction. The small size of CISC binaries also allowed less frequent fetching of instructions from main memory.

The expressiveness of CISC instruction sets also allowed for a close mapping between machine instructions and high level programming language statements. CISC assembly languages often included instructions which mimicked abstract constructs found in higher level programming languages such as array access and loops.[1]

In addition to simplifying writing software directly in assembly language, CISC architectures also simplified compiler design (the similarities in syntax and structure between assembly and higher level languages simply left compilers with less to do).[1] Unfortunately, this same advantage also caused difficulties as chip designs evolved. Tight coupling between assembly commands and machine hardware meant that when processor designs changed, the assembly language (and the compilers which targeted them), often had to be heavily modified.

2.1.2 RISC Processors

RISC processors developed from CISC processor designs as it was realized that by avoiding particularly large and slow machine instructions and building programs using only a subset of the machine instructions which CISC architectures provided, significant performance gains could be achieved.[1] This lead to the development of Reduced Instruction Set Computers with small instructions sets with:

  • instruction pipelining
  • uniform instruction length
  • simple addressing modes
  • load/store architecture
  • delayed branching
  • pipelining floating point numbers
2.1.2.1 Instruction Pipelining

Many of the above characteristics work together in powerful ways. For example, the similarity in structure between the operations (a simplified RISC processor might follow the same set of basic steps: Fetch, Decode, Load, Process, Save) allows pipelining of instructions. The load/store architecture also helps to enable pipelining by homogenizing machine instructions (most instructions follow the same pattern of: load data from registers or memory, perform operations, and save results back to memory). Finally, uniform instruction lengths and simple addressing modes allow operations like decoding instructions and accessing memory to occur in a single clock cycle. This is especially important in pipelined execution because noops (clock cycles where no new instruction is loaded)[1] caused by machine instructions waiting on the results of slower computations, must pass through the entire pipeline, causing parts of the hardware to sit idle.

Pipelined Execution Example
Pipelined Execution Example

Branch instructions in particular can cause particular problems for pipelined program execution. For example consider the following simple for loop written in c and the same loop written in MIPS assembly[1]

for ( int i = 0 ; i < 20 ; i++ ) { }
// int i = 0
// register $0 always contains 0, so this command adds the constant 0 to 0 and stores it in register $8
addi $8, $0, 0

// MIPS assembly allows naming of instructions as branch targets instead of referring to explicit memory addresses
loop:

// i++ 
addi $8, $8, 1

// i < 20
// store the result temporarily in register $9, if true, $9 will hold 1, else 0
slti $9, $8, 20

// return to the top of the loop if register $9 does not hold 0
bne $9, $0, loop

Here, the correct location of the program counter after execution of the bne (branch not equal) instruction depends on the value stored in register $9 by the slti (set on less than immediate) instruction, which in turn depends on the result stored in register $8 by the addi (add immediate) instruction. However, looking at the simplified pipeline image above, while the bne command is being fetched, the addi command two instructions before is just loading the value of register $8 and has not yet begun to add 1 to it, much less store the value back to the register. Further, when the bne command is in the "Load" step, the slti command above it is in "Process", the value in register $9 that bne is trying to load hasn't yet been updated by the slti command and won't be for another two clock cycles!

This problem could be solved by inserting noops into the above code, resulting in:

addi $8, $0, 0
loop:
addi $8, $8, 1
noop
noop
slti $9, $8, 20
noop
noop
bne $9, $0, loop

However, this is unsatisfactory because it eliminates much of the performance advantage gained by pipelining. Instead, RISC instruction set processors use branch prediction techniques to avoid waiting for prior computations to become available before continuing. RISC processors often use one of three strategies:

  • Assume the branch will fail and treat it as a noop.
  • Assume the branch succeeds and start processing instructions at the branch location.
  • Try to predict branch based on recent behavior.

If these branch prediction guesses later turn out to be wrong, RISC processors have special hardware to flush the pipeline, trashing the speculatively executed instructions and loading the correct instructions.

2.1.2.2 Uniform Instruction Length

Most modern RISC processors use a 32 or 64 bit instruction length. In addition to uniform length, machine language instructions often have similar structure, to simplify decoding the instruction. For example, MIPS assembly language has only three instruction formats: R-type ('r' for 'register'), I-type ('i' for 'immediate' or 'constant'), and J-type ('j' for 'jump'). Every MIPS command is exactly 32 bits and is decoded in one of the three possible ways. This greatly simplifies the hardware necessary to decode instructions.[1](pg. 118)

2.1.2.3 Simple Addressing Modes

An addressing mode is a means to derive a memory location from a machine instruction. Most RISC architectures have around five addressing modes whereas CISC architectures can have twelve or more.[1] MIPS assembly language is an example of such a language. MIPS supports the following addressing modes:[1](pg. 150)

  • Register Addressing - Memory location stored in a register
  • Displacement Addressing - Memory location stored in a register plus a constant offset
  • Immediate Addressing - Memory location is stored directly in the machine instruction
  • PC-relative Addressing - Memory location is a fixed offset from the current instruction address
  • Pseudodirect Addressing - Memory location is a bitwise concatenation of the high bits of the current instruction address and a constant in the machine instruction.

2.1.3 Post-RISC Processors

Super-scalar processors use multiple pipelines to simultaneously execute multiple independent instructions at the same time. Keeping multiple pipelines filled efficiently is a difficult problem, but compilers for modern RISC processors are able rearrange instructions to separate dependent instructions.

Super-pipelined architectures improve performance over standard pipelining by breaking each clock cycle into multiple operations.[1] For example, a super-pipelined version of the 5 step pipeline pictured above might break each operation into two. Each instruction would fetch two instructions each clock cycle, perform two memory fetches, etc... This technique speeds program execution by allowing more operations to be executing simultaneously.

Super-pipelined or Super-scalar architectures are sometimes combined with Very Long Instruction World (VLIW) assembly languages which allow many instructions which are safe to execute in parallel to be bundled together at compile time. This simplifies the specialized hardware needed by the processor.[1]

2.2 Memory Design

Computer speeds have been significantly increased through the use of high speed caches of memory, and large internal registers. Additionally, large jobs can be run using virtual memory, where the program actually uses storage on the disk and not in RAM.

Early versus current computer architectures

2.2.1 Cache

Cache, as its name implies, is a block of data copied from the main memory and stored on the CPU. Cached memory is used to avoid using a slow speed bus to access RAM. Cached items utilize memory locality. There are two types of locality:

  • spatial locality - main memory regions that are physically close together
  • temporal locality - main memory regions that are accessed close together in time

Modern processors employ various techniques to ensure that memory consistency is maintained. A common method of achieving this goal is to use bits on the memory that is placed in main memory or cache to determine its state. Below are flags commonly used to maintain memory consistency.

  • modified--Used to mark memory that has been updated, and needs to be written back to maintain consistency
  • owned--Notes that this memory is currently in use.
  • exclusive--Notes that this memory can not be altered
  • shared--Notes that this memory is available for use by other applications
  • invalid--Used to mark memory that is out of date and needs to be pulled again from the main memory or disk

Cache memory is temporary, and constantly being replaced. There are many algorithms which are used to decide how the blocks of cache memory should be cleared, such as LRU (Least Recently Used), FIFO (First In First Out), LFU (Least Frequently Used), and Random.

Cache, and memory, are organized into 32 byte blocks. Atomic units of storage in cache are often referred to as "slots", whereas main memory is organized into "blocks". There are a variety of options for how the cache memory interacts with main memory, such as:

Name Description Advantages
Direct Map Cache A slot of cache memory loads directly into a block of main memory. Simple to implement.
Associative Mapping Each slot of cache can load to multiple blocks of main memory. Very flexible memory management.
Set Associative Mapping Slot can load to multiple blocks, but each block only receives data from one slot Combines characteristics of both models.

Set Associative Mapped Cache is used in nearly all modern computers.

Cache is often organized into tiers, with small amounts of very fast cache and increasingly larger amounts of slower cache. This is partly due to cost: very fast caches are costly to produce, so they are kept small. There are also engineering difficulties. Small L1 caches are fast partly because of their proximity to the processor. The much larger L3 caches present on some chips are simply too large to fit directly on the silicone die and must communicate with the processor over a bus. This results in slower data access speeds.[1]

Testing cache speeds with varying amounts of data. Tiered speed decreases are due to tiered cache architecture.
Testing cache speeds with varying amounts of data. Tiered speed decreases are due to tiered cache architecture.

Besides tiered cache architectures, cache performance can also be improved through better compiler design. By intelligently rearranging program instructions to improve spatial and temporal locality, cache performance can be improved. Some assembly languages even support special machine instructions which preload or prefetch blocks of data into cache before they are needed. Compilers which can automatically analyze memory access patterns in code and intelligently insert prefetch instructions into compiled code can sometimes significantly increase cache performance[1](pg. 620).

3 Compilers

Compilers are essential for a CPU to achieve high performance. By understanding hardware details of the underlying processor, such as how many clock cycles various operations take to complete, compilers can often reorder instructions to prevent execution from stalling while previous operations complete. This provides a level of abstraction which frees the high level programmer from thinking about such low level implementation details.

In order to translate a program into machine code, the compiler goes through the following stages:[1]

  • pre-processing - adding definitions of variables, and including necessary files
  • lexical analysis - finding keywords, variables, constants and operators
  • parsing - moving the code into an intermediate representation
  • optimization - making simple code changes to improve efficiency
  • code generation - the creation of machine code.

With regards to optimization, compilers have become increasingly sophisticated, but the optimization changes remain relatively simple. These include:[1]

  • removal of inaccessible code - taking out code that will never be accessed by the program
  • removal of code that produces unused results
  • simplification of constants
  • constant folding - combining constants into a temporary constant so that operations only need to be performed once
  • common subexpression elimination - common subexpression are combined into a variable, again so calculations are only performed once
  • mathematical simplifications - replacing operations with mathematically equivalent yet more simple operations
  • removal of loop invariant code - moving operations from a loop which are not affected by the looping variable
  • simplification of inductive loops - such as using additions instead of multiplications in a loop, to use less CPU time

4 Module and Routine Design

4.1 Timing and Profiling

Understanding the resources used by your code is essential to increasing performance. A key element of this is being able to accurately measuer the amount of time needed to run a code. The "time" command within Unix can be used to show a variety of time measurements and system resources used.

Different methods are used in order to assess performance and identify bottlenecks within a program, but the one most commonly used is profiling. Profiling measures the amount of time spent in each section of your code. Types of measurements made within profiling include:

  • Program Counter Sampling - how often lines of code are used
  • Hardware Counter - program counter sampling taking advantage of special hardware support
  • CPU Time - amount of time cpu spent actually executing program instructions
  • Ideal - number of executions of each basic block and the ideal CPU time for each function.

The basic steps with profiling are:

  1. Create an application.
  2. Run a set of numerical experiments.
  3. Examine the results from the performance measurements.
  4. Modify the program to address the areas of concern, then repeat.

It is important to make sure profiling is done at the right level. Start with the routine level, then narrow to particular problem lines. Also, don't simply look at CPU time, make sure to consider memory access as well. UNIX commands such as top, vmstat, ps, and size can help with that.

4.2 Optimization By Hand

While compilers will often perform automatic optimization of code, there are simple things a programmer can do to make their code run more efficiently. These include:

  • Procedure In-Lining - there is a certain amount of overhead in calling a subroutine or function. This can be minimized by "in-lining" your subroutine/function calls within the main program. Do this by specifying the routines to be "in-lined" in the compiler line, put in-line directives into the code, or let the compiler handle it automatically.
  • Remove Branches in Loops - branches in loops break vector pipelines. If possible, remove conditionals from loops to avoid creating branches upon excution.
  • Minimize Page Faults and Cache Hits - large strides in matrix and vector operations forces the computer to store additional information in high-level cache.
  • Loop Unrolling - because of how some vector processors operate, partially unrolling a loop can lead to increased performance.
  • Eliminate Loops With Low Trip Counts - given the overhead involved with defining a loop, simply avoid using loops with a very small number of iterations.
  • Rearrange Loop Order - C and Fortran process their arrays in different orders. Simply reversing the order of the loops in a program could significantly improve speed.
  • Improve Your Algorithm - a better algorithm means better performance, even with today's vast processing machines.
  • Stop Testing When You Know The Answer - don't have multiple conditions testing for something that can be measured by one.
  • Organize Testing By Order of Frequency - minimize numbers of tests to be performed by placing the most stringent test first, most common one last.
  • Sentinel Value
  • Change Loop Order - consider the amount of runs to be done by reversing loop order.
  • Reduction of Multiplications in a Loop - look for ways to turn multiplication operations within a loop to addition (less complicated) operations.
  • Caching Answers - if you need to re-calculate the same answer over and over, caching the answer could increase efficiency.
  • Be Careful of System Routines
  • Pre-Compute Results - create a table of values and look them up instead of re-calculating values each time.
  • Compare Performance of Similar Logic Structures
  • Strength Reduction - try to replace high-order complexity operations with simpler ones.
  • Reduce the Dimension of Arrays
  • Minimize Array References
  • Exploit Algebraice Identities - to simplify expressions.

5 References

Personal tools