Introduction to HPC

From csi702



Contents

  1. Why scientists use computers
  2. Characteristics of supercomputers
  3. Applications of supercomputing
  4. Trends in supercomputers
  5. Supercomputing architectures
    1. Single Instruction, Single Data (SISD)
    2. Single Instruction, Multiple Data (SIMD)
    3. Multiple Instruction, Multiple Data (MIMD)
    4. Cluster Computing
    5. Beowulf Clusters
  6. Algorithms vs. Hardware
    1. Example: Sorting Lists (Quadratic time vs Linearithmic time)
  7. Drivers of High(er) Performance Computing
  8. Comparisons to brain function
  9. Supercomputing challenges
  10. Links & References

1 Why scientists use computers

The use of supercomputers allows scientists to explore a variety of problems that would be difficult to tackle by other means.

  • Experiments are impossible: Due to physical and logistical constraints, it may be impossible for a scientist to conduct an experiment through traditional means. Example: modeling the solar corona.
  • Experiments are expensive: Financial restrictions may reduce the amount of testing that can be conducted. Example: explosive testing.
  • Equations are too difficult to solve analytically:
  • Experiments don't provide enough accuracy:
  • Simulation would take too long with a conventional computer:
  • Data sets are too complex to analyze by hand: With the increase of data generation in modern experiments, trying to process and analyze data by hand would be a daunting task. Example: Large Hadron Collider experiments.

2 Characteristics of supercomputers

In our discussion, the terms "supercomputing" and "high performance computing" will be synonymous unless otherwise indicated. In common usage there are sometimes subtle distinctions: "supercomputer" in many contexts refers to a single physical computer, whereas "high performance computing" is often performed with clusters of many computers, which may be distributed over a network. However, a distributed network of computers may still be regarded as a "supercomputer", even if the configuration of the network is dynamic.

In discussing supercomputers, a fundamental question is "what makes a computer a supercomputer?" The terms "super" and "high performance" are relative and are not defined by absolute performance characteristics. Systems are characterized as supercomputers primarily because their performance is significantly higher than that of the vast majority of commercially available computers at a given time. Since the performance of commercial computers steadily increases over time, so does the performance that defines a supercomputer.

Since scientific computing typically involves floating point calculations, a commonly reported characteristic of supercomputers is the number of FLoating point OPerations per Second (FLOPS) that the system can perform. The term "FLOPS" can alternatively mean simply "FLoating point OPerationS" (not per second), but the meaning is usually obvious from context. While FLOPS indicates the theoretical number of floating point operations that can be performed per second, the floating point arguments must first be made available to the processor. If the arguments and results of the operations cannot be moved in and out at the theoretical FLOP rate of the system, then a lower FLOP rate will be realized in practice. Thus, a number of other factors, such as the speed and capacity of random access memory (RAM), disk (long term) storage, and algorithm efficiency, also affect the overall throughput and performance of the system.
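The gap between theoretical and realized FLOP rates can be illustrated with a small calculation. The sketch below uses a simple roofline-style bound (a technique introduced here for illustration, not taken from the text): the attainable rate is the smaller of the peak rate and the rate at which memory can deliver operands. The machine parameters (8 cores, 2.5 GHz, 4 floating point operations per cycle, 20 GB/s memory bandwidth) are hypothetical values chosen only for the example.

```python
def peak_flops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak: every core retires flops_per_cycle on every cycle."""
    return cores * clock_hz * flops_per_cycle


def attainable_flops(peak, mem_bandwidth_bytes_per_s, flops_per_byte):
    """Roofline-style bound: limited by compute or by data movement."""
    return min(peak, mem_bandwidth_bytes_per_s * flops_per_byte)


# Hypothetical machine (assumed values, not from the article).
peak = peak_flops(cores=8, clock_hz=2.5e9, flops_per_cycle=4)
bandwidth = 2.0e10  # bytes per second

# Double-precision vector add c[i] = a[i] + b[i]:
# 1 operation per 24 bytes moved (two 8-byte loads, one 8-byte store).
vector_add_rate = attainable_flops(peak, bandwidth, 1.0 / 24.0)

print(f"peak:       {peak:.3e} FLOPS")
print(f"vector add: {vector_add_rate:.3e} FLOPS")
```

Under these assumptions the vector addition is limited by memory bandwidth and reaches only a small fraction of the peak rate, which is the effect the paragraph above describes.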

3 Applications of supercomputing

One of the primary uses of supercomputers for scientific discovery is simulation of physical processes and phenomena. The physical scale of such simulations ranges from quantum mechanics at the subatomic scale to the formation of galaxies at the astronomical scale. Such simulations rarely, if ever, have closed form mathematical solutions; therefore, spatial and/or temporal discretization is performed, and numerical methods such as finite difference methods, finite element methods, or N-body simulations are applied. The accuracy of such simulations depends not only on the mathematical models used for the physical systems but also on the granularity of the spatial and temporal discretizations. The number of computations required for a simulation often grows much faster than linearly with the number of discrete elements modeled; a direct N-body calculation, for example, requires on the order of N^2 pairwise interactions per time step. Once a sufficiently large number of computations is required, it becomes impractical, if not impossible, to simulate the system on standard computer systems; thus, supercomputers are required.
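To make the cost of finer discretization concrete, the following sketch counts cell updates for an explicit time-stepping scheme on a uniform three-dimensional grid. It assumes, purely for illustration, that the time step must shrink in proportion to the grid spacing; the function name and all numbers are hypothetical.

```python
def update_count(domain_length, dx, total_time, dt, dims=3):
    """Number of cell updates: (cells per side)^dims * number of time steps."""
    cells = round(domain_length / dx) ** dims
    steps = round(total_time / dt)
    return cells * steps


# Coarse run: 100 cells per side, 1000 time steps.
coarse = update_count(domain_length=1.0, dx=0.01, total_time=1.0, dt=0.001)

# Halve the grid spacing, and (by assumption) halve the time step too.
fine = update_count(domain_length=1.0, dx=0.005, total_time=1.0, dt=0.0005)

print(coarse, fine, fine // coarse)
```

Under these assumptions, doubling the resolution multiplies the work by 16 (8 times more cells, 2 times more steps), so a modest gain in accuracy can move a problem from a workstation to a supercomputer.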

In addition to performing simulations, supercomputers are also used to process experimental data. As scientific experimental research produces ever-increasing volumes of data, supercomputing resources will be needed to sift through the massive data archives and process the data. Examples of data collection systems that will require supercomputing resources include the Large Hadron Collider[1], Large Synoptic Survey Telescope[1], and the numerous sensors collecting data used for climate research[1].

4 Trends in supercomputers

The increase in supercomputer performance over time appears to be a manifestation of Moore's Law. Since the birth of modern computers in the early 1940s, there has been a general exponential growth in the computational power of supercomputers.

Maximal reported LINPACK performance achieved, as of 13 Nov 2009

As of November 2009, Top500.org has documented the top five supercomputers to be the following:

Name               # Cores    Petaflops (Rmax)
Jaguar             224,162    1.76
Roadrunner         122,400    1.04
Kraken             98,928     0.832
Jugene             294,912    0.826
Tianhe-1 Cluster   71,680     0.563
Source: TOP500_200911_Poster.png

5 Supercomputing architectures

Although algorithms have the dominant influence on performance, the architecture of a supercomputer is also important in determining its speed. Both the hardware and the software are responsible for the overall performance of a supercomputer.

Supercomputers can be broadly classified using Flynn's taxonomy [1]. Based on the manipulation of data and instruction streams, Flynn defined four main architectural classes: SISD, SIMD, MISD and MIMD. This section categorizes and briefly describes the various supercomputer architectural classes.[1]

  • SISD (examples include mainframes, workstations, PCs).
  • SIMD Shared Memory (examples of this type are the vector machines from Cray, NEC, Hitachi, Fujitsu, etc.).
  • MIMD Shared Memory (Encore, Alliant, Sequent, KSR, Tera, Silicon Graphics, Sun, DEC/Compaq, HP).
  • SIMD Distributed Memory (ICL/AMT/CPP DAP, TMC CM-2, Maspar).
  • MIMD Distributed Memory (nCUBE, Intel, Transputers, TMC CM-5, recent PC and workstation clusters (IBM SP2, HP Alpha, Sun) connected with various networking/switching technologies offered by Cisco, etc.).
  • Cluster Computing

5.1 Single Instruction, Single Data (SISD)

Single Instruction, Single Data stream (SISD) machines consist of a single CPU and therefore execute a single instruction stream sequentially. SISD machines are also known as von Neumann machines. A SISD system may contain more than one CPU; it is still considered a SISD machine because each CPU executes only one instruction stream sequentially. Workstations, desktops, laptops and notebooks as we know them today do not fall completely into the SISD category. Modern processors use many concepts from vector and parallel architectures, such as pipelining, parallel execution of instructions, and pre-fetching of data, in order to achieve one or more arithmetic operations per clock cycle. (A clock cycle is the basic internal unit of time for the system.)

5.2 Single Instruction, Multiple Data (SIMD)

In the SIMD architecture, a single instruction stream is broadcast to all the processors. These processors execute the same instructions in lock-step, each on its own local data stream. SIMD systems usually contain a large number of processing units, ranging from 1,024 to 16,384. These machines are synchronous, with fine-grained parallelism. They run a large number of parallel processes, one for each data element in a parallel vector or array. Examples of SIMD machines include the CPP DAP Gamma II and the Quadrics Apemille.

A subclass of the SIMD systems is the vector processors. Vector processors act on arrays of similar data using specially structured CPUs. When data can be manipulated by these vector units, results can be delivered at a rate of one, two or three per clock cycle. Vector processors operate on their data in parallel when executing in vector mode, in which case they are several times faster than when executing in conventional scalar mode. Vector processors are therefore regarded as SIMD machines. An example of such a system is the NEC SX-6i.
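The SIMD idea of one instruction applied to many data elements can be modeled in a few lines. The sketch below is only a conceptual model: Python executes the lanes one after another, whereas real SIMD hardware applies the instruction to all lanes in lock-step. The helper name simd_apply is hypothetical.

```python
import operator


def simd_apply(instruction, *lanes):
    """Apply one instruction to every lane: same operation, different data."""
    return [instruction(*operands) for operands in zip(*lanes)]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# A single "add" instruction is broadcast; each lane adds its own operands.
total = simd_apply(operator.add, a, b)

# A single "multiply" instruction scales every element by the same factor.
scaled = simd_apply(operator.mul, a, [2.0] * len(a))

print(total)
print(scaled)
```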

5.3 Multiple Instruction, Multiple Data (MIMD)

In this type of architecture, each processor can independently execute its own instruction stream on its own local data stream. MIMD machines are asynchronous, with more coarse-grained parallelism. They run a smaller number of parallel processes, one for each processor, operating on the large chunks of data local to each processor.

These machines execute several instruction streams in parallel on different data. The difference from the multi-processor SISD machines mentioned above is that in this architecture the instructions and data are related, because they represent different parts of the same task. MIMD systems may therefore run several sub-tasks in parallel in order to shorten the time-to-solution for the main task. Systems such as the four-processor NEC SX-6 and a thousand-processor SGI/Cray T3E fall in this class.

Shared memory systems: Shared memory systems have multiple CPUs sharing the same address space. Here the user is not concerned with where data is stored, since all CPUs access the same memory. Shared memory systems can be either SIMD or MIMD. Single-CPU vector processors can be regarded as an example of SIMD, while the multi-CPU models are examples of MIMD.

Distributed memory systems: Each CPU has its own associated memory. The CPUs are connected by a network and may exchange data between their respective memories as needed. In contrast to shared memory systems, the end user has to know the location of the data in the local memories and has to move or distribute this data explicitly as needed. Distributed memory systems may be either SIMD or MIMD.
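The explicit data distribution required on distributed memory systems can be sketched as a scatter, compute, gather pattern. The example below is a minimal model that uses threads from the Python standard library as stand-ins for nodes; on a real distributed memory machine each chunk would be sent over the network, typically with a message passing library. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(data, n_workers):
    """Split data into contiguous chunks, at most one per worker."""
    chunk = max(1, (len(data) + n_workers - 1) // n_workers)
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]


def local_sum_of_squares(local_data):
    """Work done by one node using only the data in its own memory."""
    return sum(x * x for x in local_data)


def distributed_sum_of_squares(data, n_workers=4):
    chunks = scatter(data, n_workers)            # distribute data explicitly
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_sum_of_squares, chunks))
    return sum(partials)                         # gather and combine results


result = distributed_sum_of_squares(list(range(1000)))
print(result)
```

Each worker touches only its own chunk, so the programmer, not the hardware, decides where every piece of data lives.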

5.4 Cluster Computing

What is the significance of clusters, and why are they needed? Clusters are needed for the very same reasons that supercomputers are needed: the requirement for increased computational power to solve complex scientific problems, or to process complex SQL queries faster in database applications, e-commerce and other web applications. These requirements often include:

  • Large run times or real time constraints
  • Large memory usage
  • High I/O usage
  • Fault tolerance (web or scientific computation)
  • Interest in low cost alternative to expensive parallel machines

Cluster computing refers to a collection of workstations and/or personal computers connected together by a local area or wide-area network. The first cluster, known as the Beowulf cluster, was introduced in 1994. Since then, the application of cluster computing has virtually exploded, specifically in the areas of data warehousing and business intelligence. Clusters are preferred because both the hardware and software are cheap in comparison to conventional supercomputers, and because engineers who can work with them are abundant. Other resources in the form of consultants, books, scientific journals and papers are also available to learn, advance and use this technology. In addition, various hardware and software vendors have started building systems and offering both inside-the-box and outside-the-box solutions that can be used in the cluster computing architecture.[1]

The various categories of clusters include:

High-availability (HA) clusters: HA clusters, also known as failover clusters, are implemented primarily to improve the availability of the services that the cluster provides. They operate by having redundant nodes, which are used to provide service when system components fail. The most common size for an HA cluster is two nodes, which is the minimum required to provide redundancy. HA cluster implementations attempt to use redundancy of cluster components to eliminate single points of failure. There are many commercial implementations of high-availability clusters for many operating systems. The most commonly used HA clusters are based on the Linux operating system. Several other flavors of Unix, such as Solaris and HP-UX, are also used in HA clusters.

Load-balancing clusters: In a load-balancing cluster, multiple computers are linked together to share the computational workload. Physically they are multiple machines, but from the user's perspective they function as a single virtual machine. Requests initiated by users are managed by, and distributed among, all the standalone computers that form the cluster. This results in balanced computational work among the different machines, improving the performance of the cluster system. Clusters of this type are mostly used for e-commerce, popular news websites, social networking sites and sites offering streaming video services.
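One simple way to distribute requests among nodes is round-robin assignment. The sketch below is illustrative only: the text does not say which policy a given cluster uses, and the node and request names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def round_robin_dispatch(requests, nodes):
    """Assign each incoming request to the next node in a fixed rotation."""
    assignment = {}
    node_cycle = cycle(nodes)
    for request in requests:
        assignment[request] = next(node_cycle)
    return assignment


nodes = ["node-a", "node-b", "node-c"]          # hypothetical node names
requests = [f"req-{i}" for i in range(9)]       # hypothetical requests

assignment = round_robin_dispatch(requests, nodes)
load = Counter(assignment.values())

print(assignment)
print(load)
```

Round-robin ignores how expensive each request is; production load balancers often also weigh current node load, which this sketch does not attempt.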

Compute clusters: Clusters are often used primarily for computational purposes, rather than for handling I/O-oriented operations such as web services or databases. For instance, a cluster might support computational simulations of weather or vehicle crashes. The primary distinction within compute clusters is how tightly coupled the individual nodes are. For instance, a single compute job may require frequent communication among nodes; this implies that the cluster shares a dedicated network, is densely located, and probably has homogeneous nodes. This cluster design is usually referred to as a Beowulf cluster (see the section on Beowulf Clusters). The other extreme is where a compute job uses one or a few nodes and needs little or no inter-node communication. This latter category is sometimes called "Grid" computing (see below). Tightly coupled compute clusters are designed for work that might traditionally have been called "supercomputing". Middleware such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) permits compute clustering programs to be portable to a wide variety of clusters.

Grid computing: Grids are usually computer clusters, but more focused on throughput. Grids usually incorporate a heterogeneous collection of computers, distributed geographically. These dispersed systems are usually administered by entirely different organizations. Grid computing is optimized for workloads consisting of many independent jobs. Resources such as storage may be shared by all the nodes, but intermediate results of one job do not affect other jobs in the grid.

An example of a very large grid is the Folding@home project, which analyzes data used by researchers to find cures for diseases such as Alzheimer's and cancer. Another large project is SETI@home, which may be the largest distributed grid in existence. It uses approximately three million home computers all over the world to analyze data from the Arecibo Observatory radio telescope, searching for evidence of extraterrestrial intelligence. In both of these cases, there is no inter-node communication or shared storage. Individual nodes connect to a central location to retrieve a small processing job, perform the computation, and return the result to the central server. In the case of the @home projects, the software generally runs when the computer is idle. UC Berkeley has developed an open source application, BOINC, that allows individual users to contribute to the above and other projects, such as lhc@home (Large Hadron Collider), from a single manager, which can be set up to allocate a percentage of idle time to each of the projects a node has signed up for. The grid setup means that a node can take as many jobs as it is able to process in one session, return the results, and then acquire a new job from the central project server.
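The fetch, compute, return cycle described above can be sketched with a queue of independent work units. This is a minimal in-process model, not the protocol used by BOINC or any @home project; the class and function names are hypothetical.

```python
from collections import deque


class ProjectServer:
    """Hypothetical central server handing out independent work units."""

    def __init__(self, work_units):
        self.pending = deque(work_units)   # (unit_id, payload) pairs
        self.results = {}

    def fetch(self):
        """Return the next work unit, or None when nothing is left."""
        return self.pending.popleft() if self.pending else None

    def submit(self, unit_id, result):
        self.results[unit_id] = result


def run_node(server, compute):
    """One node: repeatedly fetch a unit, process it locally, return it."""
    completed = 0
    while True:
        unit = server.fetch()
        if unit is None:
            break
        unit_id, payload = unit
        server.submit(unit_id, compute(payload))
        completed += 1
    return completed


server = ProjectServer([(i, list(range(i, i + 5))) for i in range(4)])
done = run_node(server, compute=sum)

print(done, server.results)
```

Because no unit depends on another, any number of nodes could call run_node against the same server without communicating with each other, which is what distinguishes this pattern from a tightly coupled compute cluster.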

The TOP500 organization's semiannual list of the 500 fastest computers usually includes many clusters. TOP500 is a collaboration between the University of Mannheim, the University of Tennessee, and the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory. The TOP500 organization measures performance in TFLOPS using the High-Performance LINPACK benchmark.

Consumer game consoles: Due to the increasing computing power of each generation of game consoles, a novel use has emerged in which they are repurposed into HPC clusters. Examples of game console clusters include Sony PlayStation clusters and Microsoft Xbox clusters. It has been suggested in a parody news source that countries which are restricted from buying supercomputing technologies may be obtaining game systems to build computer clusters for military use.

Lastly, note that FLOPS (floating point operations per second) are not always the best metric for supercomputer speed. Clusters can have a very high FLOPS rating, but they cannot access all the data in the cluster at once. Clusters are therefore excellent for parallel computation, but much poorer than traditional supercomputers when non-parallel computations are required.
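The penalty for non-parallel work is commonly quantified with Amdahl's law, which is not named in the text and is introduced here only as an illustration: if a fraction p of a program can be parallelized over n processors, the speedup is 1 / ((1 - p) + p / n). The fractions and processor counts below are hypothetical.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup predicted by Amdahl's law for a fixed-size problem."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Fully parallel work scales with the number of processors.
print(amdahl_speedup(1.0, 8))

# With 5% serial work, even 1000 processors give a speedup below 20.
print(amdahl_speedup(0.95, 1000))

# Purely serial work gains nothing from additional nodes.
print(amdahl_speedup(0.0, 1000))
```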

5.5 Beowulf Clusters

The name is derived from the epic poem Beowulf and originally referred to a specific computer built in 1994 at NASA by Sterling and Becker. At the time it was built, it was a high-performance parallel computing cluster assembled from inexpensive PCs. Beowulf clusters were deployed for scientific computing. The Beowulf system was Linux-based, with its PCs networked into a TCP/IP Ethernet LAN, and had libraries (Message Passing Interface, or MPI, and Parallel Virtual Machine, or PVM) and programs that allowed processing to be shared among the networked PCs. These libraries allowed the developer to divide a task among a group of networked computers and collect the results of processing. By making some changes to the Linux kernel, Sterling and Becker implemented channel bonding, i.e., they combined multiple Ethernet connections between nodes into a single virtual channel. In this way, they overcame the bandwidth limitation of the 10 Mbit/s Ethernet available at that time. In a Beowulf cluster, most client nodes do not have keyboards or monitors, and are accessed only via remote login or possibly a serial terminal. Beowulf nodes can be thought of as a CPU + memory package which can be plugged into the cluster, just as a CPU or memory module can be plugged into a motherboard.[1]

Although many software packages, such as kernel modifications, PVM and MPI libraries, and configuration tools, make the Beowulf architecture faster, easier to configure, and much more usable, one can build a Beowulf-class machine using a standard Linux distribution without any additional software.

Several Federal agencies, national research laboratories and educational institutions to this day operate and maintain Beowulf Clusters for complex and scientific computing.

6 Algorithms vs. Hardware

Although computer hardware performance (FLOPS) has increased by a factor of 1.2·10^9 since the mid-1940s, it can be shown relatively easily that larger reductions in computation time can often be achieved by selecting a more efficient algorithm.
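The growth factor quoted above can be converted into an average doubling time. The calculation below assumes an elapsed period of 64 years (roughly 1945 to 2009); that period is an assumption made for this example, not a figure given in the text.

```python
import math


def doubling_time_years(growth_factor, elapsed_years):
    """Average time for performance to double under exponential growth."""
    doublings = math.log2(growth_factor)
    return elapsed_years / doublings


# 1.2e9 is the growth factor from the text; 64 years is assumed.
t = doubling_time_years(1.2e9, 64)
print(f"about {t:.1f} years per doubling")
```

A factor of 1.2·10^9 corresponds to about 30 doublings, so under this assumption hardware performance doubled roughly every two years.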

6.1 Example: Sorting Lists (Quadratic time vs Linearithmic time)

Migrating an algorithm that runs in quadratic time (Θ(n^2)) to one that runs in linearithmic time (Θ(n log n)) can result in a significant reduction in computation time. As an example, the bubble sort method is compared to the quicksort method used in Octave:

List Size   Bubble Sort Execution Time   Quicksort Execution Time
100         0.14 seconds                 -
500         2.22 seconds                 -
1,000       9.06 seconds                 -
5,000       212 seconds                  -
1·10^4      555 seconds                  -
1·10^5      ~15 hours                    0.46 seconds
5·10^5      ~385 hours                   0.13 seconds
1·10^6      -                            0.22 seconds
5·10^6      -                            1.10 seconds
1·10^7      -                            2.28 seconds
5·10^7      ~439 years                   12.77 seconds

For large datasets, such as a list of 5·10^7 elements, the bubble sort method would take about 1.4·10^10 seconds (the ~439 years in the table), roughly 10^9 times longer than the 12.77 seconds taken by the quicksort method. Note that in the worst case quicksort can take Θ(n^2) time to complete; on average, however, it completes in Θ(n log n).
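The comparison can be reproduced on a small scale. The sketch below is written in Python rather than Octave, and both sort functions are illustrative implementations, not the routines used to produce the timings in the table. The quicksort picks a random pivot, which makes the Θ(n^2) worst case unlikely.

```python
import random
import time


def bubble_sort(values):
    """Quadratic-time sort: repeatedly swap adjacent out-of-order items."""
    items = list(values)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:          # already sorted, stop early
            break
    return items


def quicksort(values):
    """Average-case linearithmic sort using a random pivot."""
    items = list(values)
    if len(items) <= 1:
        return items
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)


def time_sort(sort_fn, values):
    """Wall-clock seconds taken by sort_fn on a copy of values."""
    start = time.perf_counter()
    sort_fn(values)
    return time.perf_counter() - start


data = [random.random() for _ in range(1000)]
bubble_seconds = time_sort(bubble_sort, data)
quick_seconds = time_sort(quicksort, data)
print(f"bubble sort: {bubble_seconds:.4f} s, quicksort: {quick_seconds:.4f} s")
```

Absolute timings depend on the machine and interpreter; the point of the experiment is how the gap widens as the list size is increased.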

7 Drivers of High(er) Performance Computing

While many supercomputers have been built over the years and successfully used to solve difficult problems, there is a continual push to develop faster, more capable systems. A number of factors drive this push:

  • Finer resolution: The accuracy of simulations and numerical calculations often depends on the spatial/temporal resolution at which the computational domain is discretized. Building faster computers enables calculations at finer resolutions.
  • Larger dimensions: Spatial dimensions of problems must be bounded, whether the computations are for modeling nanoscale systems, global climate modeling, or astrophysical simulations. Faster supercomputers enable the dimensions of simulations to be increased.
  • Greater physical realism: Simulation requires the use of mathematical models for the systems and phenomena being studied. Executing simulations even on state-of-the-art supercomputers often requires simplification of analytical models to enable timely execution. These simplifications limit the accuracy/realism of the results. Improved supercomputing capabilities allow more complex models to be used, increasing the accuracy of simulations.
  • Real-time processing: There are many situations where it is desirable to process data in real time, either to enable timely decision-making or because storing a large data stream is not feasible. High performance computing systems enable complex event processing of high throughput data streams. Improving supercomputing capabilities enables real-time processing of data streams that could otherwise not be handled and enables processing of larger data streams for existing systems.
  • Automated analysis: Computers are increasingly used to perform automated analysis of empirical data and simulation output. Faster systems enable scientists to process/mine larger data sets and perform greater numbers of simulations (e.g., Monte Carlo simulations).
  • Automated design: The ever-increasing complexity of many modern devices has made it increasingly difficult for humans to develop optimal designs. Evolutionary algorithms enable high performance systems to rapidly evolve system designs that might never be considered by human engineers[1].
Computer-evolved antenna designs
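The automated design item above mentions evolutionary algorithms. A minimal sketch of the idea follows: candidate designs are mutated at random and a mutation is kept only when it improves a cost function. The cost function, target values and parameter settings are all hypothetical; real design problems such as antenna evolution evaluate each candidate with an expensive simulation, which is where high performance computing comes in.

```python
import random


def evolve(fitness, initial, generations=200, offspring=10, sigma=0.1, seed=0):
    """Keep the best design found; mutate it with Gaussian noise each round."""
    rng = random.Random(seed)
    best = list(initial)
    best_score = fitness(best)
    for _ in range(generations):
        for _ in range(offspring):
            child = [gene + rng.gauss(0.0, sigma) for gene in best]
            score = fitness(child)
            if score < best_score:       # lower cost is better
                best, best_score = child, score
    return best, best_score


def toy_cost(design):
    """Hypothetical stand-in for an expensive simulation of a design."""
    target = [0.5, -1.0, 2.0]
    return sum((g - t) ** 2 for g, t in zip(design, target))


start = [0.0, 0.0, 0.0]
design, cost = evolve(toy_cost, start)
print(design, cost)
```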

8 Comparisons to brain function

Comparisons are often made between the performance of supercomputers and the human brain. While computer architectures are vastly different from the neural functions of animal brains, the question remains: when will computers perform at the same level as the human brain? In narrowly defined tasks, computers already outperform humans. For decades, computers have been able to perform mathematical calculations at rates well beyond human capacity. In 1997, IBM's Deep Blue defeated world chess champion Garry Kasparov in a chess match. While performing competitively at the game of Go has proven more challenging for computers, supercomputers have recently also begun to play Go competitively against expert human players [1].

from Moravec (1998)[1]

While comparisons of MIPS and FLOPS focus on a more narrow aspect of computation, there are also comparisons of supercomputer-based artificial intelligence with intelligence of humans and animals. One controversial hypothesis is that the acceleration of supercomputing capabilities and corresponding artificial intelligence will eventually result in a technological singularity, after which human intelligence will become irrelevant.

9 Supercomputing challenges

Supercomputer simulations can make the effects of global warming more tangible to the population, thereby instilling in people "the political will to change," - Al Gore.

Some of the supercomputing challenges include:[1]

  • Applications in fusion energy and nuclear power for the production of electricity.
  • Environmental research into carbon sequestration or the mitigation of global warming.
  • Simulation, understanding and prediction of natural catastrophes such as earthquakes, tsunamis and hurricanes.
  • Simulation of climate models and their diagnostic analysis such as the Atmosphere-Ocean General Circulation Models (AOGCMs).
  • Research and analysis of turbulence in oceans.
  • Simulation of clouds on a global scale and simulation of global solar radiation based on these cloud observations.
  • Studying protein membranes for discovering new and improved drugs.
  • DNA Supercomputers
  • Diagnostics of diseases, such as visualization of the molecules that cause infectious diseases (avian flu, SARS, malaria, etc.).

more...coming soon

10 Links & References
