
17 Parallel Processing

William Stallings
Computer Organization
and Architecture
9th Edition


1
Lecture slides prepared for “Computer Organization and Architecture”, 9/e, by William Stallings, Chapter 17 “Parallel Processing”.

Chapter 17
Parallel Processing


Traditionally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms as sequences of instructions. Processors execute programs by executing machine instructions in a sequence and one at a time. Each instruction is executed in a sequence of operations (fetch instruction, fetch operands, perform operation, store results).

This view of the computer has never been entirely true. At the micro-operation level, multiple control signals are generated at the same time. Instruction pipelining, at least to the extent of overlapping fetch and execute operations, has been around for a long time. Both of these are examples of performing functions in parallel. This approach is taken further with superscalar organization, which exploits instruction-level parallelism. With a superscalar machine, there are multiple execution units within a single processor, and these may execute multiple instructions from the same program in parallel.

As computer technology has evolved, and as the cost of computer hardware has dropped, computer designers have sought more and more opportunities for
parallelism, usually to enhance performance and, in some cases, to increase availability. After an overview, this chapter looks at some of the most prominent approaches to parallel organization. First, we examine symmetric multiprocessors (SMPs), one of the earliest and still the most common example of parallel organization. In an SMP organization, multiple processors share a common memory. This organization raises the issue of cache coherence, to which a separate section is devoted. Next, the chapter examines multithreaded processors and chip multiprocessors. Then we describe clusters, which consist of multiple independent computers organized in a cooperative fashion. Clusters have become increasingly common to support workloads that are beyond the capacity of a single SMP. Another approach to the use of multiple processors that we examine is that of non-uniform memory access (NUMA) machines. The NUMA approach is relatively new and not yet proven in the marketplace, but is often considered as an alternative to the SMP or cluster approach. Finally, this chapter looks at hardware organizational approaches to vector computation. These approaches optimize the ALU for processing vectors or arrays of floating-point numbers. They are common on the class of systems known as supercomputers.

2

Multiple Processor Organization
Single instruction, single data (SISD) stream
Single processor executes a single instruction stream to operate on data stored in a single memory
Uniprocessors fall into this category

Single instruction, multiple data (SIMD) stream
A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis
Vector and array processors fall into this category
Multiple instruction, single data (MISD) stream
A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence
Not commercially implemented
Multiple instruction, multiple data (MIMD) stream
A set of processors simultaneously execute different instruction sequences on different data sets
SMPs, clusters and NUMA systems fit this category


3
A taxonomy first introduced by Flynn [FLYN72] is still the most common way of categorizing systems with parallel processing capability. Flynn proposed the following categories of computer systems:

Single instruction, single data (SISD) stream: A single processor executes a single instruction stream to operate on data stored in a single memory. Uniprocessors fall into this category.

Single instruction, multiple data (SIMD) stream: A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis. Each processing element has an associated data memory, so that instructions are executed on different sets of data by different processors. Vector and array processors fall into this category, and are discussed in Section 17.7.
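As a concrete, if simplified, illustration of the SIMD idea, the following C sketch uses x86 SSE intrinsics; it assumes an x86-64 processor and a compiler that provides <immintrin.h>. A single instruction (_mm_add_ps) performs four floating-point additions in lockstep.

```c
/* Illustrative only: one SIMD instruction (_mm_add_ps) performs four
   floating-point additions in lockstep. Assumes an x86-64 processor and a
   compiler that provides <immintrin.h>.                                     */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load four elements of a             */
    __m128 vb = _mm_loadu_ps(b);     /* load four elements of b             */
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four additions     */
    _mm_storeu_ps(c, vc);            /* store the four results              */

    for (int i = 0; i < 4; i++)
        printf("%.0f ", c[i]);       /* prints: 11 22 33 44                 */
    printf("\n");
    return 0;
}
```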

Multiple instruction, single data (MISD) stream: A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence. This structure is not commercially implemented.

Multiple instruction, multiple data (MIMD) stream: A set of processors simultaneously execute different instruction sequences on different data sets. SMPs, clusters, and NUMA systems fit into this category.

4
With the MIMD organization, the processors are general purpose; each is able to process all of the instructions necessary to perform the appropriate data transformation. MIMDs can be further subdivided by the means in which the processors communicate (Figure 17.1). If the processors share a common memory, then each processor accesses programs and data stored in the shared memory, and processors communicate with each other via that memory. The most common form of such system is known as a symmetric multiprocessor (SMP), which we examine in Section 17.2. In an SMP, multiple processors share a single memory or pool of memory by means of a shared bus or other interconnection mechanism; a distinguishing feature is that the memory access time to any region of memory is approximately the same for each processor. A more recent development is the non-uniform memory access (NUMA) organization, which is described in Section 17.5. As the name suggests, the memory access time to different regions of memory may differ for a NUMA processor.

A collection of independent uniprocessors or SMPs may be interconnected to form a cluster. Communication among the computers is either via fixed paths or via some network facility.

5
Figure 17.2 illustrates the general organization of the taxonomy of Figure 17.1. Figure 17.2a shows the structure of an SISD. There is some sort of control unit (CU) that provides an instruction stream (IS) to a processing unit (PU). The processing unit operates on a single data stream (DS) from a memory unit (MU). With an
SIMD, there is still a single control unit, now feeding a single instruction stream to multiple PUs. Each PU may have its own dedicated memory (illustrated in Figure 17.2b), or there may be a shared memory. Finally, with the MIMD, there are multiple control units, each feeding a separate instruction stream to its own PU. The MIMD may be a shared-memory multiprocessor (Figure 17.2c) or a distributed-memory multicomputer (Figure 17.2d).

The design issues relating to SMPs, clusters, and NUMAs are complex, involving issues relating to physical organization, interconnection structures, inter-processor communication, operating system design, and application software techniques. Our concern here is primarily with organization, although we touch briefly on operating system design issues.

Symmetric Multiprocessor (SMP)

6
Until fairly recently, virtually all single-user personal computers and most workstations contained a single general-purpose microprocessor. As demands for performance increase and as the cost of microprocessors continues to drop, vendors have introduced systems with an SMP organization. The term SMP refers to a computer hardware architecture and also to the operating system behavior that reflects that architecture. An SMP can be defined as a standalone computer system with the following characteristics:

1. There are two or more similar processors of comparable capability.

2. These processors share the same main memory and I/O facilities and are interconnected by a bus or other internal connection scheme, such that memory access time is approximately the same for each processor.

3. All processors share access to I/O devices, either through the same channels or through different channels that provide paths to the same device.

4. All processors can perform the same functions (hence the term symmetric).

5. The system is controlled by an integrated operating system that provides interaction between processors and their programs at the job, task, file, and data element levels.

Points 1 to 4 should be self-explanatory. Point 5 illustrates one of the contrasts with a loosely coupled multiprocessing system, such as a cluster. In the latter, the physical unit of interaction is usually a message or complete file. In an SMP, individual data elements can constitute the level of interaction, and there can be a high degree of cooperation between processes.

A standalone computer with the following characteristics:

Two or more similar processors of comparable capacity

Processors share same memory and I/O facilities

Processors are connected by a bus or other internal connection

Memory access time is approximately the same for each processor

All processors share access to I/O devices

Either through same channels or different channels giving paths to same devices

All processors can perform the same functions (hence “symmetric”)

System controlled by integrated operating system

Provides interaction between processors and their programs at job, task, file and data element levels

Multiprogramming
and
Multiprocessing

The operating system of an SMP schedules processes or threads across all of the processors. An SMP organization has a number of potential advantages over a uniprocessor organization, including the following:

• Performance: If the work to be done by a computer can be organized so that some portions of the work can be done in parallel, then a system with multiple processors will yield greater performance than one with a single processor of the same type (Figure 17.3).

• Availability: In a symmetric multiprocessor, because all processors can perform the same functions, the failure of a single processor does not halt the machine. Instead, the system can continue to function at reduced performance.

• Incremental growth: A user can enhance the performance of a system by adding an additional processor.

• Scaling: Vendors can offer a range of products with different price and performance characteristics based on the number of processors configured in the system.

It is important to note that these are potential, rather than guaranteed, benefits. The operating system must provide tools and functions to exploit the parallelism in an SMP system.

An attractive feature of an SMP is that the existence of multiple processors is transparent to the user. The operating system takes care of scheduling of threads or processes on individual processors and of synchronization among processors.

7

8
Figure 17.4 depicts in general terms the organization of a multiprocessor system. There are two or more processors. Each processor is self-contained, including a control unit, ALU, registers, and, typically, one or more levels of cache. Each processor has access to a shared main memory and the I/O devices through some form of interconnection mechanism. The processors can communicate with each other through memory (messages and status information left in common data areas). It may also be possible for processors to exchange signals directly. The memory is often organized so that multiple simultaneous accesses to separate blocks of memory are possible. In some configurations, each processor may also have its own private main memory and I/O channels in addition to the shared resources.

Symmetric Multiprocessor Organization

The most common organization for personal computers, workstations, and servers is the time-shared bus. The time-shared bus is the simplest mechanism for constructing a multiprocessor system (Figure 17.5). The structure and interfaces are basically the same as for a single-processor system that uses a bus interconnection. The bus consists of control, address, and data lines. To facilitate DMA transfers from I/O subsystems to processors, the following features are provided:

Addressing: It must be possible to distinguish modules on the bus to determine the source and destination of data.

Arbitration: Any I/O module can temporarily function as “master.” A mechanism is provided to arbitrate competing requests for bus control, using some sort of priority scheme.

Time-sharing: When one module is controlling the bus, other modules are locked out and must, if necessary, suspend operation until bus access is achieved.

These uniprocessor features are directly usable in an SMP organization. In this latter case, there are now multiple processors as well as multiple I/O processors all attempting to gain access to one or more memory modules via the bus.

9

Simplicity
Simplest approach to multiprocessor organization
Flexibility
Generally easy to expand the system by attaching more processors to the bus
Reliability
The bus is essentially a passive medium and the failure of any attached device should not cause failure of the whole system
The bus organization has several
attractive features:


10
The bus organization has several attractive features:

• Simplicity: This is the simplest approach to multiprocessor organization. The physical interface and the addressing, arbitration,
and time-sharing logic of each processor remain the same as in a single-processor system.

• Flexibility: It is generally easy to expand the system by attaching more processors to the bus.

• Reliability: The bus is essentially a passive medium, and the failure of any attached device should not cause failure of the whole system.

Main drawback is performance
All memory references pass through the common bus
Performance is limited by bus cycle time
Each processor should have cache memory
Reduces the number of bus accesses
Leads to problems with cache coherence
If a word is altered in one cache it could conceivably invalidate a word in another cache
To prevent this the other processors must be alerted that an update has taken place
Typically addressed in hardware rather than the operating system
Disadvantages of the bus organization:


11
The main drawback to the bus organization is performance. All memory references pass through the common bus. Thus, the bus cycle time limits the speed of the system. To improve performance, it is desirable to equip each processor with a cache memory. This should reduce the number of bus accesses dramatically. Typically, workstation and PC SMPs have two levels of cache, with the L1 cache internal (same chip as the processor) and the L2 cache either internal or external. Some processors now employ an L3 cache as well.

The use of caches introduces some new design considerations. Because each local cache contains an image of a portion of memory, if a word is altered in one
cache, it could conceivably invalidate a word in another cache. To prevent this, the other processors must be alerted that an update has taken place. This problem is known as the cache coherence problem and is typically addressed in hardware rather than by the operating system.

Multiprocessor Operating System Design Considerations
Simultaneous concurrent processes
OS routines need to be reentrant to allow several processors to execute the same IS code simultaneously
OS tables and management structures must be managed properly to avoid deadlock or invalid operations
Scheduling
Any processor may perform scheduling so conflicts must be avoided
Scheduler must assign ready processes to available processors
Synchronization
With multiple active processes having potential access to shared address spaces or I/O resources, care must be taken to provide effective synchronization
Synchronization is a facility that enforces mutual exclusion and event ordering
Memory management
In addition to dealing with all of the issues found on uniprocessor machines, the OS needs to exploit the available hardware parallelism to achieve the best performance
Paging mechanisms on different processors must be coordinated to enforce consistency when several processors share a page or segment and to decide on page replacement
Reliability and fault tolerance
OS should provide graceful degradation in the face of processor failure
Scheduler and other portions of the operating system must recognize the loss of a processor and restructure accordingly


12
An SMP operating system manages processor and other computer resources so that the user perceives a single operating system controlling system resources. In fact, such a configuration should appear as a single-processor multiprogramming system. In both the SMP and uniprocessor cases, multiple jobs or processes may be active at one time, and it is the responsibility of the operating system to schedule their execution and to allocate resources. A user may construct applications that use multiple processes or multiple threads within processes without regard to whether a single processor or multiple processors will be available. Thus, a multiprocessor operating system must provide all the functionality of a multiprogramming system plus additional features to accommodate multiple processors. Among the key design issues:

Simultaneous concurrent processes: OS routines need to be reentrant to allow several processors to execute the same IS code simultaneously. With multiple processors executing the same or different parts of the OS, OS tables and management structures must be managed properly to avoid deadlock or invalid operations.

Scheduling: Any processor may perform scheduling, so conflicts must be avoided. The scheduler must assign ready processes to available processors.

Synchronization: With multiple active processes having potential access to shared address spaces or shared I/O resources, care must be taken to provide effective synchronization. Synchronization is a facility that enforces mutual exclusion and event ordering.
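As a minimal sketch of the synchronization facility just described, the following C program uses POSIX threads and a mutex to enforce mutual exclusion on a shared counter; the thread count and loop bound are illustrative values, not taken from the text.

```c
/* A minimal sketch of mutual exclusion on an SMP using POSIX threads; the
   thread count and loop bound are illustrative. Compile with -pthread.     */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* enforce mutual exclusion          */
        counter++;                     /* critical section: shared data     */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 with the lock in place  */
    return 0;
}
```

Without the lock, the increments from different processors can interleave and updates can be lost, which is exactly the kind of hazard the operating system's synchronization facility must guard against.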

Memory management: Memory management on a multiprocessor must deal with all of the issues found on uniprocessor machines, as is discussed in Chapter 8. In addition, the operating system needs to exploit the available hardware parallelism, such as multi-ported memories, to achieve the best performance. The paging mechanisms on different processors must be coordinated to enforce consistency when several processors share a page or segment and to decide on page replacement.

Reliability and fault tolerance: The operating system should provide graceful degradation in the face of processor failure. The scheduler and other portions of the operating system must recognize the loss of a processor and restructure management tables accordingly.

Cache Coherence
Attempt to avoid the need for additional hardware circuitry and logic by relying on the compiler and operating system to deal with the problem
Attractive because the overhead of detecting potential problems is transferred from run time to compile time, and the design complexity is transferred from hardware to software
However, compile-time software approaches generally must make conservative decisions, leading to inefficient cache utilization
Software Solutions


Software cache coherence schemes attempt to avoid the need for additional hardware circuitry and logic by relying on the compiler and operating system to deal with the problem. Software approaches are attractive because the overhead of detecting potential problems is transferred from run time to compile time, and the design complexity is transferred from hardware to software. On the other hand, compile-time software approaches generally must make conservative decisions, leading to inefficient cache utilization.

Compiler-based coherence mechanisms perform an analysis on the code to determine which data items may become unsafe for caching, and they mark those items accordingly. The operating system or hardware then prevents non-cacheable items from being cached.

The simplest approach is to prevent any shared data variables from being cached. This is too conservative, because a shared data structure may be exclusively used during some periods and may be effectively read-only during other periods. It is only during periods when at least one process may update the variable and at least one other process may access the variable that cache coherence is an issue.

More efficient approaches analyze the code to determine safe periods for shared variables. The compiler then inserts instructions into the generated code to enforce cache coherence during the critical periods. A number of techniques have been developed for performing the analysis and for enforcing the results; see [LILJ93] and [STEN90] for surveys.

13

Cache Coherence
Generally referred to as cache coherence protocols
These solutions provide dynamic recognition at run time of potential inconsistency conditions
Because the problem is only dealt with when it actually arises there is more effective use of caches, leading to improved performance over a software approach
Approaches are transparent to the programmer and the compiler, reducing the software development burden
Can be divided into two categories:
Directory protocols
Snoopy protocols
Hardware-Based Solutions


Hardware-based solutions are generally referred to as cache coherence protocols. These solutions provide dynamic recognition at run time of potential inconsistency conditions. Because the problem is only dealt with when it actually arises, there is more effective use of caches, leading to improved performance over a software approach. In addition, these approaches are transparent to the programmer and the compiler, reducing the software development burden.

Hardware schemes differ in a number of particulars, including where the state information about data lines is held, how that information is organized, where coherence is enforced, and the enforcement mechanisms. In general, hardware schemes can be divided into two categories: directory protocols and snoopy protocols.

14

Directory Protocols

15
Directory protocols collect and maintain information about where copies of lines reside. Typically, there is a centralized controller that is part of the main memory controller, and a directory that is stored in main memory. The directory contains global state information about the contents of the various local caches. When an individual cache controller makes a request, the centralized controller checks and issues necessary commands for data transfer between memory and caches or between caches. It is also responsible for keeping the state information up to date; therefore, every local action that can affect the global state of a line must be reported to the central controller.

Typically, the controller maintains information about which processors have a copy of which lines. Before a processor can write to a local copy of a line, it must request exclusive access to the line from the controller. Before granting this exclusive access, the controller sends a message to all processors with a cached copy of this line, forcing each processor to invalidate its copy. After receiving acknowledgments back from each such processor, the controller grants exclusive access to the requesting processor. When another processor tries to read a line that is exclusively granted to another processor, it will send a miss notification to the controller. The controller then issues a command to the processor holding that line that requires the processor to do a write back to main memory. The line may now be shared for reading by the original processor and the requesting processor.
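The C sketch below models the kind of per-line bookkeeping a directory controller might keep: a global state plus a bit vector of sharers that is walked to send invalidations before exclusive access is granted. The structure layout, the 32-processor limit, and the function names are assumptions made for illustration, not the protocol of any particular machine.

```c
/* Illustrative sketch of the per-line bookkeeping a directory controller
   might keep: a global state plus a bit vector of sharers. The 32-processor
   limit, field layout, and function names are assumptions for this example. */
#include <stdint.h>
#include <stdio.h>

enum line_state { UNCACHED, SHARED, EXCLUSIVE };

struct directory_entry {
    enum line_state state;   /* global state of the memory line              */
    uint32_t sharers;        /* bit i set if processor i holds a cached copy */
};

/* Before granting exclusive (write) access, invalidate every other sharer. */
static void grant_exclusive(struct directory_entry *e, int requester,
                            void (*send_invalidate)(int proc))
{
    for (int p = 0; p < 32; p++)
        if ((e->sharers & (1u << p)) && p != requester)
            send_invalidate(p);          /* force that processor's copy invalid */
    e->sharers = 1u << requester;        /* only the requester holds it now     */
    e->state = EXCLUSIVE;
}

static void print_invalidate(int proc) {
    printf("invalidate sent to processor %d\n", proc);
}

int main(void) {
    struct directory_entry line = { SHARED, 0x0000000D };   /* procs 0, 2, 3 */
    grant_exclusive(&line, 3, print_invalidate);
    printf("sharers after grant = 0x%08X\n", (unsigned)line.sharers); /* 0x00000008 */
    return 0;
}
```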

Directory schemes suffer from the drawbacks of a central bottleneck and the overhead of communication between the various cache controllers and the central controller. However, they are effective in large-scale systems that involve multiple buses or some other complex interconnection scheme.

Collect and maintain information about copies of data in cache

Directory stored in main memory

Requests are checked against directory

Appropriate transfers are performed

Creates central bottleneck

Effective in large scale systems with complex interconnection schemes

Snoopy Protocols
Distribute the responsibility for maintaining cache coherence among all of the cache controllers in a multiprocessor
A cache must recognize when a line that it holds is shared with other caches
When updates are performed on a shared cache line, it must be announced to other caches by a broadcast mechanism
Each cache controller is able to “snoop” on the network to observe these broadcast notifications and react accordingly
Suited to bus-based multiprocessor because the shared bus provides a simple means for broadcasting and snooping
Care must be taken that the increased bus traffic required for broadcasting and snooping does not cancel out the gains from the use of local caches
Two basic approaches have been explored:
Write invalidate
Write update (or write broadcast)

16
Snoopy protocols distribute the responsibility for maintaining cache coherence among all of the cache controllers in a multiprocessor. A cache must recognize when a line that it holds is shared with other caches.
When an update action is performed on a shared cache line, it must be announced to all other caches by a broadcast mechanism. Each cache controller is able to “snoop” on the network to observe these broadcasted notifications, and react accordingly.

Snoopy protocols are ideally suited to a bus-based multiprocessor, because the shared bus provides a simple means for broadcasting and snooping. However, because one of the objectives of the use of local caches is to avoid bus accesses, care must be taken that the increased bus traffic required for broadcasting and snooping does not cancel out the gains from the use of local caches.

Two basic approaches to the snoopy protocol have been explored: write invalidate and write update (or write broadcast).

Write Invalidate
Multiple readers, but only one writer at a time
When a write is required, all other caches of the line are invalidated
Writing processor then has exclusive (cheap) access until line is required by another processor
Most widely used in commercial multiprocessor systems such as the Pentium 4 and PowerPC
State of every line is marked as modified, exclusive, shared or invalid
For this reason the write-invalidate protocol is called MESI


17
With a write-invalidate protocol, there can be multiple readers but only one writer at a time. Initially, a line may be shared among several caches for reading purposes. When one of the caches wants to perform a write to the line, it first issues a notice that invalidates that line in the other caches, making the line exclusive to the writing cache. Once the line is exclusive, the owning processor can make cheap local writes until some other processor requires the same line.

The write-invalidate approach is the most widely used in commercial multiprocessor systems, such as the Pentium 4 and PowerPC. It marks the state of every cache line (using two extra bits in the cache tag) as modified, exclusive, shared, or invalid. For this reason, the write-invalidate protocol is called MESI. In the remainder of this section, we will look at its use among local caches across a multiprocessor. For simplicity in the presentation, we do not examine the mechanisms involved in coordinating among both level 1 and level 2 locally as well as at the same time coordinating across the distributed multiprocessor. This would not add any new principles but would greatly complicate the discussion.

Write Update
Can be multiple readers and writers
When a processor wishes to update a shared line the word to be updated is distributed to all others and caches containing that line can update it
Some systems use an adaptive mixture of both write-invalidate and write-update mechanisms


18
With a write-update protocol, there can be multiple writers as well as multiple readers. When a processor wishes to update a shared line, the word to be updated is distributed to all others, and caches containing that line can update it.

Neither of these two approaches is superior to the other under all circumstances. Performance depends on the number of local caches and the pattern of memory reads and writes. Some systems implement adaptive protocols that employ both write-invalidate and write-update mechanisms.

MESI Protocol
Modified
The line in the cache has been modified and is available only in this cache
Exclusive
The line in the cache is the same as that in main memory and is not present in any other cache
Shared
The line in the cache is the same as that in main memory and may be present in another cache
Invalid
The line in the cache does not contain valid data
To provide cache consistency on an SMP the data cache supports a protocol known as MESI:


To provide cache consistency on an SMP, the data cache often supports a protocol known as MESI. For MESI, the data cache includes two status bits per tag, so that each line can be in one of four states:

• Modified: The line in the cache has been modified (different from main memory) and is available only in this cache.

• Exclusive: The line in the cache is the same as that in main memory and is not present in any other cache.

• Shared: The line in the cache is the same as that in main memory and may be present in another cache.

• Invalid: The line in the cache does not contain valid data.
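A highly simplified C sketch of the processor-initiated MESI transitions is given below; it covers only read and write actions by the local processor and omits the bus transactions and snooped events, and the shared_elsewhere flag is an assumed stand-in for the bus response that reports whether another cache also holds the line.

```c
/* A highly simplified sketch of processor-initiated MESI transitions for one
   cache line. Bus transactions and snooped events are omitted, and the
   shared_elsewhere flag is an assumed stand-in for the bus response that
   says whether another cache also holds the line.                           */
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

static mesi_state on_processor_read(mesi_state s, int shared_elsewhere) {
    if (s == INVALID)                        /* read miss: fetch the line    */
        return shared_elsewhere ? SHARED : EXCLUSIVE;
    return s;                                /* read hit: state unchanged    */
}

static mesi_state on_processor_write(mesi_state s) {
    (void)s;
    /* Write-invalidate: any other cached copies are invalidated, so after a
       write this cache holds the only (modified) copy of the line.          */
    return MODIFIED;
}

int main(void) {
    mesi_state s = INVALID;
    s = on_processor_read(s, 1);   /* miss, line held elsewhere -> SHARED    */
    s = on_processor_write(s);     /* invalidate others         -> MODIFIED  */
    printf("final state = %d (0 = MODIFIED)\n", s);
    return 0;
}
```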

19

Table 17.1
MESI Cache Line States

Table 17.1 summarizes the meaning of the four states.
20

MESI State Transition Diagram

21
Figure 17.6 displays a state diagram for the MESI protocol. Keep in mind that each line of the cache has its own state bits and therefore its own realization of the state diagram. Figure 17.6a shows the transitions that occur due to actions initiated by the processor attached to this cache. Figure 17.6b shows the transitions that occur due to events that are snooped on the common bus. This presentation of separate state diagrams for processor-initiated and bus-initiated actions helps to clarify the logic of the MESI protocol.

At any time a cache line is in a single state. If the next event is from the attached processor, then the transition is dictated by Figure 17.6a and if the next event is from the bus, the transition is dictated by Figure 17.6b.

Multithreading and Chip Multiprocessors
Processor performance can be measured by the rate at which it executes instructions
MIPS rate = f * IPC
f = processor clock frequency, in MHz
IPC = average instructions per cycle
Increase performance by increasing clock frequency and increasing instructions that complete during cycle
Multithreading
Allows for a high degree of instruction-level parallelism without increasing circuit complexity or power consumption
Instruction stream is divided into several smaller streams, known as threads, that can be executed in parallel


The most important measure of performance for a processor is the rate at which it executes instructions. This can be expressed as
MIPS rate = f * IPC
where f is the processor clock frequency, in MHz, and IPC (instructions per cycle) is the average number of instructions executed per cycle. Accordingly, designers have pursued the goal of increased performance on two fronts: increasing clock frequency and increasing the number of instructions executed or, more properly, the number of instructions that complete during a processor cycle. As we have seen in earlier chapters, designers have increased IPC by using an instruction pipeline and then by using multiple parallel instruction pipelines in a superscalar architecture. With pipelined and multiple-pipeline designs, the principal problem is to maximize the utilization of each pipeline stage. To improve throughput, designers have created ever more complex mechanisms, such as executing some instructions in a different order from the way they occur in the instruction stream and beginning execution of instructions that may never be needed. But as was discussed in Section 2.2, this approach may be reaching a limit due to complexity and power consumption concerns.
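As a quick worked example of the MIPS formula above (the figures are illustrative, not taken from the text): a processor clocked at f = 2,000 MHz that completes an average of IPC = 1.5 instructions per cycle delivers a MIPS rate of 2,000 * 1.5 = 3,000 MIPS; raising either the clock frequency or the average IPC raises the rate proportionally.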

An alternative approach, which allows for a high degree of instruction-level parallelism without increasing circuit complexity or power consumption, is called multithreading. In essence, the instruction stream is divided into several smaller streams, known as threads, such that the threads can be executed in parallel.

The variety of specific multithreading designs, realized in both commercial systems and experimental systems, is vast. In this section, we give a brief survey of the major concepts.

22

Definitions of Threads
and Processes

The concept of thread used in discussing multithreaded processors may or may not be the same as the concept of software threads in a multiprogrammed operating system. It will be useful to define terms briefly:

• Process: An instance of a program running on a computer. A process embodies two key characteristics:

—Resource ownership: A process includes a virtual address space to hold the process image; the process image is the collection of program, data, stack, and attributes that define the process. From time to time, a process may be allocated control or ownership of resources, such as main memory, I/O channels, I/O devices, and files.

—Scheduling/execution: The execution of a process follows an execution path (trace) through one or more programs. This execution may be interleaved with that of other processes. Thus, a process has an execution state (Running, Ready, etc.) and a dispatching priority and is the entity that is scheduled and dispatched by the operating system.

• Process switch: An operation that switches the processor from one process to another, by saving all the process control data, registers, and other information for the first and replacing them with the process information for the second.

• Thread: A dispatchable unit of work within a process. It includes a processor context (which includes the program counter and stack pointer) and its own data area for a stack (to enable subroutine branching). A thread executes sequentially and is interruptible so that the processor can turn to another thread.

• Thread switch: The act of switching processor control from one thread to another within the same process. Typically, this type of switch is much less costly than a process switch.

Thus, a thread is concerned with scheduling and execution, whereas a process is concerned with both scheduling/execution and resource ownership. The multiple threads within a process share the same resources. This is why a thread switch is much less time consuming than a process switch. Traditional operating systems, such as earlier versions of UNIX, did not support threads. Most modern operating systems, such as Linux, other versions of UNIX, and Windows, do support threads. A distinction is made between user-level threads, which are visible to the application program, and kernel-level threads, which are visible only to the operating system. Both of these may be referred to as explicit threads, defined in software.
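The contrast between threads (shared resources) and processes (separate resource ownership) can be seen in the small C sketch below, which assumes a POSIX system: a thread created with pthread_create sees the creator's global variable, while a child created with fork gets its own copy.

```c
/* Sketch, assuming a POSIX system: a thread shares the creating process's
   address space, while a child process created with fork() gets its own
   copy of 'value', so its update is not visible to the parent.              */
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 0;

static void *thread_body(void *arg) {
    (void)arg;
    value = 1;                  /* same address space as the creating thread */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread:  value = %d\n", value);    /* 1: memory is shared  */

    value = 0;
    pid_t pid = fork();
    if (pid == 0) {             /* child process owns a separate copy        */
        value = 1;
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after process: value = %d\n", value);    /* 0: copies diverged   */
    return 0;
}
```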

23

Thread in multithreaded processors may or may not be the same as the concept of software threads in a multiprogrammed operating system

Thread is concerned with scheduling and execution, whereas a process is concerned with both scheduling/execution and resource ownership

Process:

An instance of program running on computer

Two key characteristics:

Resource ownership

Scheduling/execution

Process switch

Operation that switches the processor from one process to another by saving all the process control data, registers, and other information for the first and replacing them with the process information for the second

Thread:

Dispatchable unit of work within a process

Includes processor context (which includes the program counter and stack pointer) and data area for stack

Executes sequentially and is interruptible so that the processor can turn to another thread

Thread switch

The act of switching processor control between threads within the same process

Typically less costly than process switch

Implicit and Explicit Multithreading
All commercial processors and most experimental ones use explicit multithreading
Concurrently execute instructions from different explicit threads
Interleave instructions from different threads on shared pipelines or parallel execution on parallel pipelines
Implicit multithreading is concurrent execution of multiple threads extracted from single sequential program
Implicit threads defined statically by compiler or dynamically by hardware


All of the commercial processors and most of the experimental processors so far have used explicit multithreading. These systems concurrently execute instructions from different explicit threads, either by interleaving instructions from different threads on shared pipelines or by parallel execution on parallel pipelines. Implicit multithreading refers to the concurrent execution of multiple threads extracted from a single sequential program. These implicit threads may be defined either statically by the compiler or dynamically by the hardware. In the remainder of this section we consider explicit multithreading.

24

Approaches to Explicit Multithreading
Interleaved
Fine-grained
Processor deals with two or more thread contexts at a time
Switching thread at each clock cycle
If thread is blocked it is skipped
Simultaneous (SMT)
Instructions are simultaneously issued from multiple threads to execution units of superscalar processor
Blocked
Coarse-grained
Thread executed until event causes delay
Effective on in-order processor
Avoids pipeline stall
Chip multiprocessing
Processor is replicated on a single chip
Each processor handles separate threads
Advantage is that the available logic area on a chip is used effectively


Broadly speaking, there are four principal approaches to multithreading:

• Interleaved multithreading: This is also known as fine-grained multithreading. The processor deals with two or more thread contexts at a time, switching from one thread to another at each clock cycle. If a thread is blocked because of data dependencies or memory latencies, that thread is skipped and a ready thread is executed.

• Blocked multithreading: This is also known as coarse-grained multithreading. The instructions of a thread are executed successively until an event occurs that may cause delay, such as a cache miss. This event induces a switch to another thread. This approach is effective on an in-order processor that would stall the pipeline for a delay event such as a cache miss.

• Simultaneous multithreading (SMT): Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor. This combines the wide superscalar instruction issue capability with the use of multiple thread contexts.

• Chip multiprocessing: In this case, the entire processor is replicated on a single chip and each processor handles separate threads. The advantage of this approach is that the available logic area on a chip is used effectively without depending on ever-increasing complexity in pipeline design.

25

Approaches to Executing Multiple Threads


For the first two approaches, instructions from different threads are not executed simultaneously. Instead, the processor is able to rapidly switch from one thread to another, using a different set of registers and other context information. This results in a better utilization of the processor’s execution resources and avoids a large penalty due to cache misses and other latency events. The SMT approach involves true simultaneous execution of instructions from different threads, using replicated execution resources. Chip multiprocessing also enables simultaneous execution of instructions from different threads.

Figure 17.7, based on one in [UNGE02], illustrates some of the possible pipeline architectures that involve multithreading and contrasts these with approaches that do not use multithreading. Each horizontal row represents the potential issue slot or slots for a single execution cycle; that is, the width of each row corresponds to the maximum number of instructions that can be issued in a single clock cycle. The vertical dimension represents the time sequence of clock cycles. An empty (shaded) slot represents an unused execution slot in one pipeline. A no-op is indicated by N.

The first three illustrations in Figure 17.7 show different approaches with a scalar (i.e., single-issue) processor:

• Single-threaded scalar: This is the simple pipeline found in traditional RISC and CISC machines, with no multithreading.

• Interleaved multithreaded scalar: This is the easiest multithreading approach to implement. By switching from one thread to another at each clock cycle, the pipeline stages can be kept fully occupied, or close to fully occupied. The hardware must be capable of switching from one thread context to another between cycles.

• Blocked multithreaded scalar: In this case, a single thread is executed until a latency event occurs that would stop the pipeline, at which time the processor switches to another thread.

In the case of interleaved multithreading, it is assumed that there are no control or data dependencies between threads, which simplifies the pipeline design and therefore should allow a thread switch with no delay. However, depending on the specific design and implementation, blocked multithreading may require a clock cycle to perform a thread switch, as illustrated in Figure 17.7. This is true if a fetched instruction triggers the thread switch and must be discarded from the pipeline [UNGE03].

Although interleaved multithreading appears to offer better processor utilization than blocked multithreading, it does so at the sacrifice of single-thread performance. The multiple threads compete for cache resources, which raises the probability of a cache miss for a given thread.

More opportunities for parallel execution are available if the processor can issue multiple instructions per cycle. Figures 17.7d through 17.7i illustrate a number of variations among processors that have hardware for issuing four instructions per cycle.

Figure 17.7c shows a situation in which the time to perform a thread switch is one cycle, whereas Figure 17.7b shows that thread switching occurs in zero cycles.

26

Example Systems
More recent models of the Pentium 4 use a multithreading technique that Intel refers to as hyperthreading
Approach is to use SMT with support for two threads
Thus the single multithreaded processor is logically two processors
Chip used in high-end PowerPC products
Combines chip multiprocessing with SMT
Has two separate processors, each of which is a multithreaded processor capable of supporting two threads concurrently using SMT
Designers found that having two two-way SMT processors on a single chip provided superior performance to a single four-way SMT processor
Pentium 4
IBM Power5


More recent models of the Pentium 4 use a multithreading technique that the Intel literature refers to as hyperthreading [MARR02]. In essence, the Pentium 4 approach is to use SMT with support for two threads. Thus, the single multithreaded processor is logically two processors.

The IBM Power5 chip, which is used in high-end PowerPC products, combines chip multiprocessing with SMT [KALL04]. The chip has two separate processors, each of which is a multithreaded processor capable of supporting two threads concurrently using SMT. Interestingly, the designers simulated various alternatives and found that having two two-way SMT processors on a single chip provided superior performance to a single four-way SMT processor. The simulations showed that additional multithreading beyond the support for two threads might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.

27

Power5 Instruction Data Flow

Figure 17.8 shows the IBM Power5’s instruction flow diagram. Only a few of the elements in the processor need to be replicated, with separate elements dedicated to separate threads. Two program counters are used. The processor alternates fetching instructions, up to eight at a time, between the two threads. All the instructions are stored in a common instruction cache and share an instruction translation facility, which does a partial instruction decode. When a conditional branch is encountered, the branch prediction facility predicts the direction of the branch and, if possible, calculates the target address. For predicting the target of a subroutine return, the processor uses a return stack, one for each thread.

Instructions then move into two separate instruction buffers. Then, on the basis of thread priority, a group of instructions is selected and decoded in parallel. Next, instructions flow through a register-renaming facility in program order. Logical registers are mapped to physical registers. The Power5 has 120 physical general-purpose registers and 120 physical floating-point registers. The instructions are then moved into issue queues. From the issue queues, instructions are issued using simultaneous multithreading. That is, the processor has a superscalar architecture and can issue instructions from one or both threads in parallel. At the end of the pipeline, separate thread resources are needed to commit the instructions.

28

Clusters
Alternative to SMP as an approach to providing high performance and high availability
Particularly attractive for server applications
Defined as:
A group of interconnected whole computers working together as a unified computing resource that can create the illusion of being one machine
(The term whole computer means a system that can run on its own, apart from the cluster)
Each computer in a cluster is called a node
Benefits:
Absolute scalability
Incremental scalability
High availability
Superior price/performance


29
An important and relatively recent development in computer system design is clustering. Clustering is an alternative to symmetric multiprocessing as an approach to providing high performance and high availability and is particularly attractive for server applications. We can define a cluster as a group of interconnected whole computers working together as a unified computing resource that can create the illusion of being one machine. The term whole computer means a system that can run on its own, apart from the cluster; in the literature, each computer in a cluster is typically referred to as a node.

[BREW97] lists four benefits that can be achieved with clustering. These can also be thought of as objectives or design requirements:

• Absolute scalability: It is possible to create large clusters that far surpass the power of even the largest standalone machines. A cluster can have tens, hundreds, or even thousands of machines, each of which is a multiprocessor.

• Incremental scalability: A cluster is configured in such a way that it is possible to add new systems to the cluster in small increments. Thus, a user can start out with a modest system and expand it as needs grow, without having to go through a major upgrade in which an existing small system is replaced with a larger system.

• High availability: Because each node in a cluster is a standalone computer, the failure of one node does not mean loss of service. In many products, fault tolerance is handled automatically in software.

• Superior price/performance: By using commodity building blocks, it is possible to put together a cluster with equal or greater computing power than a single large machine, at much lower cost.

Cluster Configurations


30
In the literature, clusters are classified in a number of different ways. Perhaps the simplest classification is based on whether the computers in a cluster share access to the same disks. Figure 17.9a shows a two-node cluster in which the only interconnection is by means of a high-speed link that can be used for message exchange to coordinate cluster activity. The link can be a LAN that is shared with other computers that are not part of the cluster or the link can be a dedicated interconnection facility. In the latter case, one or more of the computers in the cluster will have a link to a LAN or WAN so that there is a connection between the server cluster and remote client systems. Note that in the figure, each computer is depicted as being a multiprocessor. This is not necessary but does enhance both performance and availability.

In the simple classification depicted in Figure 17.9, the other alternative is a shared-disk cluster. In this case, there generally is still a message link between
nodes. In addition, there is a disk subsystem that is directly linked to multiple computers within the cluster. In this figure, the common disk subsystem is a RAID system. The use of RAID or some similar redundant disk technology is common in clusters so that the high availability achieved by the presence of multiple computers is not compromised by a shared disk that is a single point of failure.

Table 17.2
Clustering Methods: Benefits and Limitations

A clearer picture of the range of cluster options can be gained by looking at functional alternatives. Table 17.2 provides a useful classification along functional lines.
31

Operating System Design Issues
How failures are managed depends on the clustering method used
Two approaches:
Highly available clusters
Fault tolerant clusters
Failover
The function of switching applications and data resources over from a failed system to an alternative system in the cluster
Failback
Restoration of applications and data resources to the original system once it has been fixed
Load balancing
Incremental scalability
Automatically include new computers in scheduling
Middleware needs to recognize that processes may switch between machines


How failures are managed by a cluster depends on the clustering method used (Table 17.2). In general, two approaches can be taken to dealing with failures: highly available clusters and fault-tolerant clusters. A highly available cluster offers a high probability that all resources will be in service. If a failure occurs, such as a system goes down or a disk volume is lost, then the queries in progress are lost. Any lost query, if retried, will be serviced by a different computer in the cluster. However, the cluster operating system makes no guarantee about the state of partially executed transactions. This would need to be handled at the application level.

A fault-tolerant cluster ensures that all resources are always available. This is achieved by the use of redundant shared disks and mechanisms for backing out uncommitted transactions and committing completed transactions.

The function of switching applications and data resources over from a failed system to an alternative system in the cluster is referred to as failover. A related function is the restoration of applications and data resources to the original system once it has been fixed; this is referred to as failback. Failback can be automated, but this is desirable only if the problem is truly fixed and unlikely to recur. If not, automatic failback can cause subsequently failed resources to bounce back and forth between computers, resulting in performance and recovery problems.

A cluster requires an effective capability for balancing the load among available computers. This includes the requirement that the cluster be incrementally scalable. When a new computer is added to the cluster, the load-balancing facility should automatically include this computer in scheduling applications. Middleware mechanisms need to recognize that services can appear on different members of the cluster and may migrate from one member to another.

32

Parallelizing Computation

In some cases, effective use of a cluster requires executing software from a single application in parallel. [KAPP00] lists three general approaches to the problem:

• Parallelizing compiler: A parallelizing compiler determines, at compile time, which parts of an application can be executed in parallel. These are then split off to be assigned to different computers in the cluster. Performance depends on the nature of the problem and how well the compiler is designed. In general, such compilers are difficult to develop.

• Parallelized application: In this approach, the programmer writes the application from the outset to run on a cluster, and uses message passing to move data, as required, between cluster nodes. This places a high burden on the programmer but may be the best approach for exploiting clusters for some applications.

• Parametric computing: This approach can be used if the essence of the application is an algorithm or program that must be executed a large number of times, each time with a different set of starting conditions or parameters. A good example is a simulation model, which will run a large number of different scenarios and then develop statistical summaries of the results. For this approach to be effective, parametric processing tools are needed to organize, run, and manage the jobs in an effective manner.
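For the parallelized-application approach, a minimal C sketch is shown below, assuming MPI as the message-passing library and using a stand-in computation per node; each cluster node computes a share of the work and the partial results are combined by message passing.

```c
/* A minimal sketch of the parallelized-application approach, assuming MPI as
   the message-passing library and a stand-in computation per node.
   Compile with mpicc and launch with mpirun across the cluster nodes.       */
#include <mpi.h>
#include <stdio.h>

/* Stand-in for real per-node work, parameterized by the node's rank. */
static double simulate(int parameter) {
    double sum = 0.0;
    for (int i = 1; i <= 1000000; i++)
        sum += (double)parameter / i;
    return sum;
}

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this node's identity          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of nodes in the job    */

    double local = simulate(rank + 1);      /* each node computes its share  */
    double total = 0.0;

    /* Combine the partial results on node 0 by message passing. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("combined result from %d nodes: %f\n", size, total);

    MPI_Finalize();
    return 0;
}
```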

33

Effective use of a cluster requires executing software from a single application in parallel

Three approaches are:

Parallelizing compiler

Determines at compile time which parts of an application can be executed in parallel

These are then split off to be assigned to different computers in the cluster

Parallelized application

Application written from the outset to run on a cluster and uses message passing to move data between cluster nodes

Parametric computing

Can be used if the essence of the application is an algorithm or program that must be executed a large number of times, each time with a different set of starting conditions or parameters

Cluster Computer Architecture

Figure 17.10 shows a typical cluster architecture. The individual computers are connected by some high-speed LAN or switch hardware. Each computer is capable of operating independently. In addition, a middleware layer of software is installed in each computer to enable cluster operation. The cluster middleware provides a unified system image to the user, known as a single-system image. The middleware is also responsible for providing high availability, by means of load balancing and responding to failures in individual components.

A cluster will also include software tools for enabling the efficient execution of programs that are capable of parallel execution.

34

Example
100-Gbps Ethernet Configuration for Massive Blade Server Site

A common implementation of the cluster approach is the blade server. A blade server is a server architecture that houses multiple server modules (“blades”) in a single chassis. It is widely used in data centers to save space and improve system management. Either self-standing or rack mounted, the chassis provides the power supply, and each blade has its own processor, memory, and hard disk.

An example of the application is shown in Figure 17.11, taken from [NOWE07]. The trend at large data centers, with substantial banks of blade servers, is the deployment of 10-Gbps ports on individual servers to handle the massive multimedia traffic provided by these servers. Such arrangements are stressing the on-site Ethernet switches needed to interconnect large numbers of servers. A 100-Gbps rate provides the bandwidth required to handle the increased traffic load. The 100-Gbps Ethernet switches are deployed in switch uplinks inside the data center as well as providing interbuilding, intercampus, wide area connections for enterprise networks.

35

Clusters Compared to SMP
Easier to manage and configure
Much closer to the original single processor model for which nearly all applications are written
Less physical space and lower power consumption
Well established and stable
Far superior in terms of incremental and absolute scalability
Superior in terms of availability
All components of the system can readily be made highly redundant
SMP
Clustering
Both provide a configuration with multiple processors to support high demand applications
Both solutions are available commercially


Both clusters and symmetric multiprocessors provide a configuration with multiple processors to support high-demand applications. Both solutions are commercially available, although SMP schemes have been around far longer.

The main strength of the SMP approach is that an SMP is easier to manage and configure than a cluster. The SMP is much closer to the original single-processor
model for which nearly all applications are written. The principal change required in going from a uniprocessor to an SMP is to the scheduler function. Another benefit of the SMP is that it usually takes up less physical space and draws less power than a comparable cluster. A final important benefit is that the SMP products are well established and stable.

Over the long run, however, the advantages of the cluster approach are likely to result in clusters dominating the high-performance server market. Clusters are far superior to SMPs in terms of incremental and absolute scalability. Clusters are also superior in terms of availability, because all components of the system can readily be made highly redundant.

36

Nonuniform Memory Access (NUMA)
Alternative to SMP and clustering
Uniform memory access (UMA)
All processors have access to all parts of main memory using loads and stores
Access time to all regions of memory is the same
Access time to memory for different processors is the same
Nonuniform memory access (NUMA)
All processors have access to all parts of main memory using loads and stores
Access time of processor differs depending on which region of main memory is being accessed
Different processors access different regions of memory at different speeds
Cache-coherent NUMA (CC-NUMA)
A NUMA system in which cache coherence is maintained among the caches of the various processors

+

In terms of commercial products, the two common approaches to providing a multiple-processor system to support applications are SMPs and clusters. For some years, another approach, known as nonuniform memory access (NUMA), has been the subject of research and commercial NUMA products are now available.

Before proceeding, we should define some terms often found in the NUMA literature.

• Uniform memory access (UMA): All processors have access to all parts of main memory using loads and stores. The memory access time of a processor to all regions of memory is the same. The access times experienced by different processors are the same. The SMP organization discussed in Sections 17.2 and 17.3 is UMA.

• Nonuniform memory access (NUMA): All processors have access to all parts of main memory using loads and stores. The memory access time of a processor differs depending on which region of main memory is accessed. The last statement is true for all processors; however, for different processors, which memory regions are slower and which are faster differ.

• Cache-coherent NUMA (CC-NUMA): A NUMA system in which cache coherence is maintained among the caches of the various processors.

A NUMA system without cache coherence is more or less equivalent to a cluster. The commercial products that have received much attention recently are CC-NUMA systems, which are quite distinct from both SMPs and clusters. Usually, but unfortunately not always, such systems are in fact referred to in the commercial literature as CC-NUMA systems.

37

Motivation

With an SMP system, there is a practical limit to the number of processors that can be used. An effective cache scheme reduces the bus traffic between any one processor and main memory. As the number of processors increases, this bus traffic also increases. Also, the bus is used to exchange cache-coherence signals, further adding to the burden. At some point, the bus becomes a performance bottleneck. Performance degradation seems to limit the number of processors in an SMP
configuration to somewhere between 16 and 64 processors. For example, Silicon Graphics’ Power Challenge SMP is limited to 64 R10000 processors in a single system; beyond this number performance degrades substantially.

The processor limit in an SMP is one of the driving motivations behind the development of cluster systems. However, with a cluster, each node has its own private main memory; applications do not see a large global memory. In effect, coherency is maintained in software rather than hardware. This memory granularity affects performance and, to achieve maximum performance, software must be tailored to this environment. One approach to achieving large-scale multiprocessing while retaining the flavor of SMP is NUMA. For example, the Silicon Graphics Origin NUMA system is designed to support up to 1024 MIPS R10000 processors [WHIT97] and the Sequent NUMA-Q system is designed to support up to 252 Pentium II processors [LOVE96].

The objective with NUMA is to maintain a transparent system wide memory while permitting multiple multiprocessor nodes, each with its own bus or other internal interconnect system.

38

SMP has practical limit to number of processors that can be used

Bus traffic limits an SMP to between 16 and 64 processors

In clusters each node has its own private main memory

Applications do not see a large global memory

Coherency is maintained by software rather than hardware

NUMA retains SMP flavor while giving large scale multiprocessing

Objective with NUMA is to maintain a transparent system wide memory while permitting multiple multiprocessor nodes, each with its own bus or internal interconnect system

CC-NUMA Organization

+

Figure 17.12 depicts a typical CC-NUMA organization. There are multiple independent nodes, each of which is, in effect, an SMP organization. Thus, each node contains multiple processors, each with its own L1 and L2 caches, plus main memory. The node is the basic building block of the overall CC-NUMA organization. For example, each Silicon Graphics Origin node includes two MIPS R10000 processors; each Sequent NUMA-Q node includes four Pentium II processors. The nodes are interconnected by means of some communications facility, which could be a switching mechanism, a ring, or some other networking facility.

Each node in the CC-NUMA system includes some main memory. From the point of view of the processors, however, there is only a single addressable memory, with each location having a unique system wide address. When a processor initiates a memory access, if the requested memory location is not in that processor’s cache, then the L2 cache initiates a fetch operation. If the desired line is in the local portion of the main memory, the line is fetched across the local bus. If the desired line is in a remote portion of the main memory, then an automatic request is sent out to fetch that line across the interconnection network, deliver it to the local bus, and then deliver it to the requesting cache on that bus. All of this activity is automatic and transparent to the processor and its cache.

In this configuration, cache coherence is a central concern. Although implementations differ as to details, in general terms we can say that each node must maintain some sort of directory that gives it an indication of the location of various portions of memory and also cache status information.

39

NUMA Pros and Cons
Main advantage of a CC-NUMA system is that it can deliver effective performance at higher levels of parallelism than SMP without requiring major software changes
Bus traffic on any individual node is limited to a demand that the bus can handle
If many of the memory accesses are to remote nodes, performance begins to break down
Does not transparently look like an SMP
Software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system
Concern with availability

+

The main advantage of a CC-NUMA system is that it can deliver effective performance at higher levels of parallelism than SMP, without requiring major software changes. With multiple NUMA nodes, the bus traffic on any individual node is limited to a demand that the bus can handle. However, if many of the memory accesses are to remote nodes, performance begins to break down. There is reason to believe that this performance breakdown can be avoided. First, the use of L1 and L2 caches is designed to minimize all memory accesses, including remote ones. If much of the software has good temporal locality, then remote memory accesses should not be excessive. Second, if the software has good spatial locality, and if virtual memory is in use, then the data needed for an application will reside on a limited number of frequently used pages that can be initially loaded into the memory local to the running application. The Sequent designers report that such spatial locality does appear in representative applications [LOVE96]. Finally, the virtual memory scheme can be enhanced by including in the operating system a page migration mechanism that will move a virtual memory page to a node that is frequently using it; the Silicon Graphics designers report success with this approach [WHIT97].

Even if the performance breakdown due to remote access is addressed, there are two other disadvantages for the CC-NUMA approach [PFIS98]. First, a CC-NUMA does not transparently look like an SMP; software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system. These include page allocation, already mentioned, process allocation, and load balancing by the operating system. A second concern is that of availability. This is a rather complex issue and depends on the exact implementation of the CC-NUMA system; the interested reader is referred to [PFIS98].

40

Vector Computation
There is a need for computers to solve mathematical problems of physical processes in disciplines such as aerodynamics, seismology, meteorology, and atomic, nuclear, and plasma physics
Need for high precision and a program that repetitively performs floating point arithmetic calculations on large arrays of numbers
Most of these problems fall into the category known as continuous-field simulation
Supercomputers were developed to handle these types of problems
However, they have limited use and a limited market because of their price tag
There is a constant demand to increase performance
Array processor
Designed to address the need for vector computation
Configured as peripheral devices by both mainframe and minicomputer users to run the vectorized portions of programs

+

Although the performance of mainframe general-purpose computers continues to improve relentlessly, there continue to be applications that are beyond the reach of the contemporary mainframe. There is a need for computers to solve mathematical problems of physical processes, such as occur in disciplines including aerodynamics, seismology, meteorology, and atomic, nuclear, and plasma physics.

Typically, these problems are characterized by the need for high precision and a program that repetitively performs floating-point arithmetic operations on large arrays of numbers. Most of these problems fall into the category known as continuous-field simulation. In essence, a physical situation can be described by a surface or region in three dimensions (e.g., the flow of air adjacent to the surface of a rocket). This surface is approximated by a grid of points. A set of differential equations defines the physical behavior of the surface at each point. The equations are represented as an array of values and coefficients, and the solution involves repeated arithmetic operations on the arrays of data.
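The flavor of such a computation can be conveyed by a toy grid relaxation (my illustration, not an example from the text): the region is approximated by a grid of points, and the same floating-point update is applied repeatedly to every interior point of a large array:

PROGRAM FIELD_SIM
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER :: I, J, STEP
  REAL :: U(N, N), UNEW(N, N)
  U = 0.0
  U(1, :) = 1.0                      ! fixed boundary condition on one edge
  DO STEP = 1, 500                   ! repeated arithmetic over the whole array
    UNEW = U
    DO J = 2, N - 1
      DO I = 2, N - 1
        ! Replace each interior point by the average of its four neighbors.
        UNEW(I, J) = 0.25 * (U(I-1, J) + U(I+1, J) + U(I, J-1) + U(I, J+1))
      END DO
    END DO
    U = UNEW
  END DO
  PRINT *, 'Centre value after 500 steps:', U(N/2, N/2)
END PROGRAM FIELD_SIM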

Supercomputers were developed to handle these types of problems. These machines are typically capable of billions of floating-point operations per second. In contrast to mainframes, which are designed for multiprogramming and intensive I/O, the supercomputer is optimized for the type of numerical calculation just described.

The supercomputer has limited use and, because of its price tag, a limited market. Comparatively few of these machines are operational, mostly at research centers and some government agencies with scientific or engineering functions. As with other areas of computer technology, there is a constant demand to increase the performance of the supercomputer. Thus, the technology and performance of the supercomputer continues to evolve.

There is another type of system that has been designed to address the need for vector computation, referred to as the array processor. Although a supercomputer is optimized for vector computation, it is a general-purpose computer, capable of handling scalar processing and general data processing tasks. Array processors do not include scalar processing; they are configured as peripheral devices by both mainframe and minicomputer users to run the vectorized portions of programs.

41

Vector Addition Example

The key to the design of a supercomputer or array processor is to recognize that the main task is to perform arithmetic operations on arrays or vectors of floating-point numbers. In a general-purpose computer, this will require iteration through each element of the array. For example, consider two vectors (one-dimensional arrays) of numbers, A and B. We would like to add these and place the result in C. In the example of Figure 17.13, this requires six separate additions. How could we speed up this computation? The answer is to introduce some form of parallelism.
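Using the vectors of Figure 17.13, the computation can be written either as an element-by-element loop or, in Fortran array notation, as a single vector operation (a minimal sketch, not the book's code):

PROGRAM VEC_ADD
  IMPLICIT NONE
  INTEGER :: I
  REAL :: A(6), B(6), C(6)
  A = (/ 1.5, 7.1, 6.9, 100.5, 0.0, 59.7 /)
  B = (/ 2.0, 39.7, 1000.003, 11.0, 21.1, 19.7 /)
  ! Scalar form: six separate additions, one per loop iteration.
  DO I = 1, 6
    C(I) = A(I) + B(I)
  END DO
  ! Vector form: the whole addition expressed as one array operation.
  C = A + B
  PRINT *, C
END PROGRAM VEC_ADD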

42

Matrix Multiplication
(C = A * B)

+

Figure 17.14a shows a FORTRAN program that can be run on an ordinary scalar processor.

One approach to improving performance can be referred to as vector processing. This assumes that it is possible to operate on a one-dimensional vector of data. Figure 17.14b is a FORTRAN program with a new form of instruction that allows vector computation to be specified. The notation (J = 1, N) indicates that operations on all indices J in the given interval are to be carried out as a single operation. How this can be achieved is addressed shortly.

The program in Figure 17.14b indicates that all the elements of the ith row are to be computed in parallel. Each element in the row is a summation, and the summations (across K) are done serially rather than in parallel. Even so, only N² vector multiplications are required for this algorithm as compared with N³ scalar multiplications for the scalar algorithm.

Another approach, parallel processing, is illustrated in Figure 17.14c. This approach assumes that we have N independent processors that can function in parallel. To utilize processors effectively, we must somehow parcel out the computation to the various processors. Two primitives are used. The primitive FORK n causes an independent process to be started at location n. In the meantime, the original process continues execution at the instruction immediately following the FORK. Every execution of a FORK spawns a new process. The JOIN instruction is essentially the inverse of the FORK. The statement JOIN N causes N independent processes to be merged into one that continues execution at the instruction following the JOIN. The operating system must coordinate this merger, and so the execution does not continue until all N processes have reached the JOIN instruction.

The program in Figure 17.14c is written to mimic the behavior of the vector-processing program. In the parallel processing program, each column of C is computed by a separate process. Thus, the elements in a given row of C are computed in parallel.
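FORK and JOIN here are notational primitives rather than the statements of any particular language; as a rough modern sketch (my illustration, not from the text), the same one-process-per-column structure can be written with an OpenMP parallel loop in Fortran, where the implicit barrier at the end of the construct plays the role of JOIN:

PROGRAM PAR_MATMUL
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 4
  INTEGER :: I, J, K
  REAL :: A(N, N), B(N, N), C(N, N)
  A = 1.0
  B = 2.0
  ! Each value of J (each column of C) is handled by an independent thread,
  ! mirroring "one process per column"; the end of the construct acts as the JOIN.
!$OMP PARALLEL DO PRIVATE(I, K)
  DO J = 1, N
    DO I = 1, N
      C(I, J) = 0.0
      DO K = 1, N
        C(I, J) = C(I, J) + A(I, K) * B(K, J)
      END DO
    END DO
  END DO
!$OMP END PARALLEL DO
  PRINT *, 'C(1,1) =', C(1, 1)
END PROGRAM PAR_MATMUL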

43

Approaches to
Vector
Computation

+

The preceding discussion describes approaches to vector computation in logical or architectural terms. Let us turn now to a consideration of types of processor organization that can be used to implement these approaches. A wide variety of organizations have been and are being pursued. Three main categories stand out:

• Pipelined ALU

• Parallel ALUs

• Parallel processors

Figure 17.15 illustrates the first two of these approaches.

44

Pipelined Processing of Floating-Point Operations

+

Because floating-point operations are rather complex, there is opportunity for decomposing a floating-point operation into stages, so that different stages can operate on different sets of data concurrently. This is illustrated in Figure 17.16a. Floating-point addition is broken up into four stages (see Figure 10.22): compare, shift, add, and normalize. A vector of numbers is presented sequentially to the first stage. As the processing proceeds, four different sets of numbers will be operated on concurrently in the pipeline.

The mechanism illustrated in Figure 17.16 could be referred to as pipelining within an operation. That is, we have a single arithmetic operation (e.g., C = A + B) that is to be applied to vector operands, and pipelining allows multiple vector elements to be processed in parallel. This mechanism can be augmented with pipelining across operations. In this latter case, there is a sequence of arithmetic vector operations, and instruction pipelining is used to speed up processing. One approach to this, referred to as chaining, is found on the Cray supercomputers. The basic rule for chaining is this: A vector operation may start as soon as the first element of the operand vector(s) is available and the functional unit (e.g., add, subtract, multiply, divide) is free. Essentially, chaining causes results issuing from one functional unit to be fed immediately into another functional unit and so on. If vector registers are used, intermediate results do not have to be stored into memory and can be used even before the vector operation that created them runs to completion.

It should be clear that this organization is suitable for vector processing. To see this, consider the instruction pipelining described in Chapter 14. The processor goes through a repetitive cycle of fetching and processing instructions. In the absence of branches, the processor is continuously fetching instructions from sequential locations. Consequently, the pipeline is kept full and a savings in time is achieved. Similarly, a pipelined ALU will save time only if it is fed a stream of data from sequential locations. A single, isolated floating-point operation is not speeded up by a pipeline. The speedup is achieved when a vector of operands is presented to the ALU. The control unit cycles the data through the ALU until the entire vector is processed.

Another way to achieve vector processing is by the use of multiple ALUs in a single processor, under the control of a single control unit. In this case, the control unit routes data to ALUs so that they can function in parallel. It is also possible to use pipelining on each of the parallel ALUs. This is illustrated in Figure 17.16b. The example shows a case in which four ALUs operate in parallel.

As with pipelined organization, a parallel ALU organization is suitable for vector processing. The control unit routes vector elements to ALUs in a round-robin fashion until all elements are processed. This type of organization is more complex than a single-ALU CPU.

Finally, vector processing can be achieved by using multiple parallel processors. In this case, it is necessary to break the task up into multiple processes to be executed in parallel. This organization is effective only if the software and hardware for effective coordination of parallel processors is available.

45

A Taxonomy of
Computer Organizations

We can expand our taxonomy of Section 17.1 to reflect these new structures, as shown in Figure 17.17. Computer organizations can be distinguished by the presence of one or more control units. Multiple control units imply multiple processors. Following our previous discussion, if the multiple processors can function cooperatively on a given task, they are termed parallel processors.

The reader should be aware of some unfortunate terminology likely to be encountered in the literature. The term vector processor is often equated with a pipelined ALU organization, although a parallel ALU organization is also designed for vector processing, and, as we have discussed, a parallel processor organization may also be designed for vector processing. Array processing is sometimes used to refer to a parallel ALU, although, again, any of the three organizations is optimized for the processing of arrays. To make matters worse, array processor usually refers to an auxiliary processor attached to a general-purpose processor and used to perform vector computation. An array processor may use either the pipelined or parallel ALU approach.

At present, the pipelined ALU organization dominates the marketplace. Pipelined systems are less complex than the other two approaches. Their control unit and operating system design are well developed to achieve efficient resource allocation and high performance. The remainder of this section is devoted to a more detailed examination of this approach, using a specific example.

46

IBM 3090 with Vector Facility

+

Figure 17.18 shows the general organization of the vector facility. Although the vector facility is seen to be a physically separate add-on to the processor, its architecture is an extension of the System/370 architecture and is compatible with it. The vector facility is integrated into the System/370 architecture in the following ways:

* Existing System/370 instructions are used for all scalar operations.

* Arithmetic operations on individual vector elements produce exactly the same result as do corresponding System/370 scalar instructions. For example, one design decision concerned the definition of the result in a floating-point DIVIDE operation. Should the result be exact, as it is for scalar floating-point division, or should an approximation be allowed that would permit higher-speed implementation but could sometimes introduce an error in one or more low-order bit positions? The decision was made to uphold complete compatibility with the System/370 architecture at the expense of a minor performance degradation.

* Vector instructions are interruptible and their execution can be resumed from the point of interruption after appropriate action has been taken, in a manner compatible with the System/370 program-interruption scheme.

* Arithmetic exceptions are the same as, or extensions of, exceptions for the scalar arithmetic instructions of the System/370, and similar fix-up routines can be used. To accommodate this, a vector interruption index is employed that indicates the location in a vector register that is affected by an exception (e.g., overflow). Thus, when execution of the vector instruction resumes, the proper place in a vector register is accessed.

* Vector data reside in virtual storage with page faults being handled in a standard manner.

This level of integration provides a number of benefits. Existing operating systems can support the vector facility with minor extensions. Existing application programs, language compilers, and other software can be run unchanged. Software that could take advantage of the vector facility can be modified as desired.

47

Alternative
Programs
for Vector
Calculation

+

A key issue in the design of a vector facility is whether operands are located in registers or memory. The IBM organization is referred to as register to register, because the vector operands, both input and output, can be staged in vector registers. This approach is also used on the Cray supercomputer. An alternative approach, used on Control Data machines, is to obtain operands directly from memory. The main disadvantage of the use of vector registers is that the programmer or compiler must take them into account for good performance. For example, suppose that the length of the vector registers is K and the length of the vectors to be processed is N > K. In this case, a vector loop must be performed, in which the operation is performed on K elements at a time and the loop is repeated N/K times. The main advantage of the vector register approach is that the operation is decoupled from slower main memory and instead takes place primarily with registers.
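A minimal sketch of such a vector loop, often called strip mining (my illustration, with an assumed register length of K = 128 elements, not IBM code): the vector is processed K elements at a time, with a shorter final strip when K does not divide N evenly.

PROGRAM STRIP_MINE
  IMPLICIT NONE
  INTEGER, PARAMETER :: K = 128, N = 1000   ! K = assumed vector-register length
  INTEGER :: J, STRIP
  REAL :: A(N), B(N), C(N)
  A = 1.0
  B = 2.0
  DO J = 1, N, K
    STRIP = MIN(K, N - J + 1)               ! last strip may be shorter than K
    ! Each strip corresponds to one vector operation on a K-element vector register.
    C(J:J+STRIP-1) = A(J:J+STRIP-1) + B(J:J+STRIP-1)
  END DO
  PRINT *, C(N)
END PROGRAM STRIP_MINE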

The speedup that can be achieved using registers is demonstrated in Figure 17.19. The FORTRAN routine multiplies vector A by vector B to produce vector C, where each vector has a real part (AR, BR, CR) and an imaginary part (AI, BI, CI). The 3090 can perform one main-storage access per processor, or clock, cycle (either read or write); has registers that can sustain two accesses for reading and one for writing per cycle; and produces one result per cycle in its arithmetic unit. Let us assume the use of instructions that can specify two source operands and a result. Part (a) of the figure shows that, with memory-to-memory instructions, each iteration of the computation requires a total of 18 cycles. With a pure register-to-register architecture (part (b)), this time is reduced to 12 cycles. Of course, with register-to-register operation, the vector quantities must be loaded into the vector registers prior to computation and stored in memory afterward. For large vectors, this fixed penalty is relatively small. Figure 17.19c shows that the ability to specify both storage and register operands in one instruction further reduces the time to 10 cycles per iteration. This latter type of instruction is included in the vector architecture.

48

Registers for the IBM 3090 Vector Facility

+

Figure 17.20 illustrates the registers that are part of the IBM 3090 vector facility. There are sixteen 32-bit vector registers. The vector registers can also be coupled to form eight 64-bit vector registers. Any register element can hold an integer or floating-point value. Thus, the vector registers may be used for 32-bit and 64-bit integer values, and 32-bit and 64-bit floating-point values.

The architecture specifies that each register contains from 8 to 512 scalar elements. The choice of actual length involves a design trade-off. The time to do a vector operation consists essentially of the overhead for pipeline startup and register filling plus one cycle per vector element. Thus, the use of a large number of register elements reduces the relative startup time for a computation. However, this efficiency must be balanced against the added time required for saving and restoring vector registers on a process switch and the practical cost and space limits. These considerations led to the use of 128 elements per register in later 3090 implementations.
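To see the trade-off concretely (my own arithmetic, with an assumed startup overhead of s cycles, not a figure from the text): if a vector operation costs roughly s startup cycles plus one cycle per element, the cost per element is (s + N)/N = 1 + s/N. With s = 20, an 8-element register pays 2.5 extra cycles per element, while a 128-element register pays only about 0.16, which is why longer registers are attractive until the save/restore and space costs just mentioned take over.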

Three additional registers are needed by the vector facility. The vector-mask register contains mask bits that may be used to select which elements in the vector registers are to be processed for a particular operation. The vector-status register contains control fields, such as the vector count, that determine how many elements in the vector registers are to be processed. The vector-activity count keeps track of the time spent executing vector instructions.
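As a loose software analogy (my illustration, not the 3090 instruction set), the effect of the vector-mask register resembles a Fortran WHERE construct, in which a logical mask selects which elements of a vector operation are actually performed:

PROGRAM MASKED_OP
  IMPLICIT NONE
  REAL :: A(8), C(8)
  LOGICAL :: MASK(8)
  A = (/ 1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0 /)
  C = 0.0
  MASK = A > 0.0             ! plays the role of the vector-mask register
  WHERE (MASK) C = 2.0 * A   ! only the selected elements are processed
  PRINT *, C
END PROGRAM MASKED_OP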

49

Table 17.3
IBM 3090 Vector Facility:
Arithmetic and Logical Instructions

Table 17.3 summarizes the arithmetic and logical operations that are defined for the vector architecture. In addition, there are memory-to-register
load and register-to-memory store instructions. Note that many of the instructions use a three-operand format. Also, many instructions have a number of variants, depending on the location of the operands. A source operand may be a vector register (V), storage (S), or a scalar register (Q). The target is always a vector register, except for comparison, the result of which goes into the vector-mask register. With all these variants, the total number of opcodes (distinct instructions) is 171. This rather large number, however, is not as expensive to implement as might be imagined. Once the machine provides the arithmetic units and the data paths to feed operands from storage, scalar registers, and vector registers to the vector pipelines, the major hardware cost has been incurred. The architecture can, with little difference in cost, provide a rich set of variants on the use of those registers and pipelines.

Most of the instructions in Table 17.3 are self-explanatory. The two summation instructions warrant further explanation. The accumulate operation adds together the elements of a single vector (ACCUMULATE) or the elements of the product of two vectors (MULTIPLY-AND-ACCUMULATE). These instructions present an interesting design problem. We would like to perform this operation as rapidly as possible, taking full advantage of the ALU pipeline. The difficulty is that the sum of two numbers put into the pipeline is not available until several cycles later. Thus, the third element in the vector cannot be added to the sum of the first two elements until those two elements have gone through the entire pipeline. To overcome this problem, the elements of the vector are added in such a way as to produce four partial sums. In particular, elements 0, 4, 8, 12, . . . , 124 are added in that order to produce partial sum 0; elements 1, 5, 9, 13, . . . , 125 to partial sum 1; elements 2, 6, 10, 14, . . . , 126 to partial sum 2; and elements 3, 7, 11, 15, . . . , 127 to partial sum 3. Each of these partial sums can proceed through the pipeline at top speed, because the delay in the pipeline is roughly four cycles. A separate vector register is used to hold the partial sums. When all elements of the original vector have been processed, the four partial sums are added together to produce the final result. The performance of this second phase is not critical, because only four vector elements are involved.
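A small software sketch of the same interleaving (my illustration, assuming a 128-element vector and a pipeline roughly four stages deep): four interleaved partial sums are accumulated independently and combined only at the end, so consecutive additions never wait on each other in the pipeline.

PROGRAM PARTIAL_SUMS
  IMPLICIT NONE
  INTEGER :: I
  REAL :: V(0:127), PARTIAL(0:3), TOTAL
  DO I = 0, 127
    V(I) = REAL(I)
  END DO
  PARTIAL = 0.0
  ! Elements 0, 4, 8, ... go to partial sum 0; 1, 5, 9, ... to partial sum 1; and so on,
  ! so each running sum is touched only once every four additions.
  DO I = 0, 127
    PARTIAL(MOD(I, 4)) = PARTIAL(MOD(I, 4)) + V(I)
  END DO
  TOTAL = PARTIAL(0) + PARTIAL(1) + PARTIAL(2) + PARTIAL(3)
  PRINT *, 'Sum =', TOTAL   ! 0 + 1 + ... + 127 = 8128
END PROGRAM PARTIAL_SUMS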

50

Summary
Multiple processor organizations
Types of parallel processor systems
Parallel organizations
Symmetric multiprocessors
Organization
Multiprocessor operating system design considerations
Cache coherence and the MESI protocol
Software solutions
Hardware solutions
The MESI protocol
Multithreading and chip multiprocessors
Implicit and explicit multithreading
Approaches to explicit multithreading
Example systems
Clusters
Cluster configurations
Operating system design issues
Cluster computer architecture
Blade servers
Clusters compared to SMP
Nonuniform memory access
Motivation
Organization
NUMA Pros and cons
Vector computation
Approaches to vector computation
IBM 3090 vector facility

Chapter 17

Parallel
Processing

+

51
Chapter 17 summary.

Figure 17.1 A Taxonomy of Parallel Processor Architectures
(Processor organizations: SISD (uniprocessor); SIMD (vector processor, array processor); MISD; MIMD, divided into shared memory (tightly coupled): symmetric multiprocessor (SMP) and nonuniform memory access (NUMA), and distributed memory (loosely coupled): clusters.)

Figure 17.2 Alternative Computer Organizations
(a) SISD; (b) SIMD (with distributed memory); (c) MIMD (with shared memory); (d) MIMD (with distributed memory)
CU = control unit; IS = instruction stream; PU = processing unit; DS = data stream; MU = memory unit; LM = local memory
SISD = single instruction, single data stream; SIMD = single instruction, multiple data stream; MIMD = multiple instruction, multiple data stream

Figure 17.3 Multiprogramming and Multiprocessing
(a) Interleaving (multiprogramming, one processor); (b) interleaving and overlapping (multiprocessing; two processors). Processes 1 to 3 are shown blocked or running over time.

Figure 17.4 Generic Block Diagram of a Tightly Coupled Multiprocessor
(Processors and I/O modules attached to main memory through an interconnection network.)

Figure 17.5 Symmetric Multiprocessor Organization
(Multiple processors, each with L1 and L2 caches, on a shared bus together with main memory and an I/O subsystem reached through I/O adapters.)

MESI Cache Line States
Modified (M): this cache line is valid; the memory copy is out of date; no copies exist in other caches; a write to this line does not go to the bus.
Exclusive (E): this cache line is valid; the memory copy is valid; no copies exist in other caches; a write to this line does not go to the bus.
Shared (S): this cache line is valid; the memory copy is valid; copies may exist in other caches; a write to this line goes to the bus and updates the cache.
Invalid (I): this cache line is not valid; copies may exist in other caches; a write to this line goes directly to the bus.

Figure 17.6 MESI State Transition Diagram
(a) Line in cache at initiating processor; (b) line in snooping cache
RH = read hit; RMS = read miss, shared; RME = read miss, exclusive; WH = write hit; WM = write miss; SHR = snoop hit on read; SHW = snoop hit on write or read-with-intent-to-modify
Bus actions shown: dirty line copyback, invalidate transaction, read-with-intent-to-modify, cache line fill
Figure 17.7 Approaches to Executing Multiple Threads
(a) single-threaded scalar; (b) interleaved multithreading scalar; (c) blocked multithreading scalar; (d) superscalar; (e) interleaved multithreading superscalar; (f) blocked multithreading superscalar; (g) VLIW; (h) interleaved multithreading VLIW; (i) blocked multithreading VLIW; (j) simultaneous multithreading (SMT); (k) chip multiprocessor (multicore)
(Each panel plots issue slots across the issue bandwidth against cycles; A, B, C, and D mark instructions from different threads, N marks an empty slot, and thread switches are indicated between rows.)

Figure 17.8 Power5 Instruction Data Flow
BXU = branch execution unit; CRL = condition register logical execution unit; FPU = floating-point execution unit; FXU = fixed-point execution unit; LSU = load/store unit
(Each thread has its own program counter and instruction buffer; branch prediction, instruction cache and translation, the register mappers, issue queues, execution units, and the data cache and L2 cache are shared by the two threads.)

Figure 17.9 Cluster Configurations
(a) Standby server with no shared disk: two computers, each with processors, memory, and I/O, connected by a high-speed message link
(b) Shared disk: the same arrangement with both computers also cabled to a RAID disk subsystem

Clustering Methods, Benefits, and Limitations

Passive Standby
Description: A secondary server takes over in case of primary server failure.
Benefits: Easy to implement.
Limitations: High cost because the secondary server is unavailable for other processing tasks.

Active Secondary
Description: The secondary server is also used for processing tasks.
Benefits: Reduced cost because secondary servers can be used for processing.
Limitations: Increased complexity.

Separate Servers
Description: Separate servers have their own disks. Data is continuously copied from primary to secondary server.
Benefits: High availability.
Limitations: High network and server overhead due to copying operations.

Servers Connected to Disks
Description: Servers are cabled to the same disks, but each server owns its disks. If one server fails, its disks are taken over by the other server.
Benefits: Reduced network and server overhead due to elimination of copying operations.
Limitations: Usually requires disk mirroring or RAID technology to compensate for risk of disk failure.

Servers Share Disks
Description: Multiple servers simultaneously share access to disks.
Benefits: Low network and server overhead. Reduced risk of downtime caused by disk failure.
Limitations: Requires lock manager software. Usually used with disk mirroring or RAID technology.

Figure 17.10 Cluster Computer Architecture [BUYY99a]
(PC/workstation nodes, each with communications software and network interface hardware, joined by a high-speed network or switch; cluster middleware provides the single system image and availability infrastructure, above which sequential applications, parallel applications, and a parallel programming environment run.)

Figure 17.11 Example 100-Gbps Ethernet Configuration for Massive Blade Server Site
(Blade computers attach to Ethernet switches at 10 Gbps and 40 Gbps; the switches are interconnected by N x 100-Gbps links.)

Figure 17.12 CC-NUMA Organization
(Nodes 1 through N, each containing processors 1 through m with L1 and L2 caches, a directory, main memory, and I/O, connected by an interconnect network.)

A = (1.5, 7.1, 6.9, 100.5, 0, 59.7)
B = (2.0, 39.7, 1000.003, 11, 21.1, 19.7)
A + B = C = (3.5, 46.8, 1006.903, 111.5, 21.1, 79.4)

Figure 17.13 Example of Vector Addition

DO 100 I = 1, N
  DO 100 J = 1, N
    C(I, J) = 0.0
    DO 100 K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
100 CONTINUE

(a) Scalar processing

DO 100 I = 1, N
  C(I, J) = 0.0 (J = 1, N)
  DO 100 K = 1, N
    C(I, J) = C(I, J) + A(I, K) * B(K, J) (J = 1, N)
100 CONTINUE

(b) Vector processing

DO 50 J = 1, N - 1
  FORK 100
50 CONTINUE
J = N
100 DO 200 I = 1, N
  C(I, J) = 0.0
  DO 200 K = 1, N
    C(I, J) = C(I, J) + A(I, K) * B(K, J)
200 CONTINUE
JOIN N

(c) Parallel processing

Figure 17.14 Matrix Multiplication (C = A * B)

Figure 17.15 Approaches to Vector Computation
(a) Pipelined ALU: input registers feed a single pipelined ALU from memory, with results returned through an output register
(b) Parallel ALUs: input registers feed several ALUs operating in parallel, with results returned through an output register

Figure 17.16 Pipelined Processing of Floating-Point Operations
(a) Pipelined ALU; (b) four parallel ALUs. The pipeline stages are C = compare exponent, S = shift significand, A = add significands, N = normalize; successive element pairs (xi, yi) stream through the stages to produce results zi.

Figure 17.17 A Taxonomy of Computer Organizations
Single control unit: uniprocessor; pipelined ALU; parallel ALUs
Multiple control units: multiprocessor; parallel processors

Figure 17.18 IBM 3090 with Vector Facility
(The 3090 CPU comprises a scalar processor and an optional vector processor; the instruction decoder passes scalar values and instructions to the scalar processor and vector instructions and vector elements to the vector processor, with both fed from the cache and main memory.)

FORTRAN ROUTINE:

DO 100 J = 1, 50
  CR(J) = AR(J) * BR(J) - AI(J) * BI(J)
100 CI(J) = AR(J) * BI(J) + AI(J) * BR(J)

(a) Storage to storage

Operation                          Cycles
AR(J) * BR(J) → T1(J)                3
AI(J) * BI(J) → T2(J)                3
T1(J) - T2(J) → CR(J)                3
AR(J) * BI(J) → T3(J)                3
AI(J) * BR(J) → T4(J)                3
T3(J) + T4(J) → CI(J)                3
TOTAL                               18

(b) Register to register

Operation                          Cycles
AR(J) → V1(J)                        1
BR(J) → V2(J)                        1
V1(J) * V2(J) → V3(J)                1
AI(J) → V4(J)                        1
BI(J) → V5(J)                        1
V4(J) * V5(J) → V6(J)                1
V3(J) - V6(J) → V7(J)                1
V7(J) → CR(J)                        1
V1(J) * V5(J) → V8(J)                1
V4(J) * V2(J) → V9(J)                1
V8(J) + V9(J) → V0(J)                1
V0(J) → CI(J)                        1
TOTAL                               12

(c) Storage to register

Operation                          Cycles
AR(J) → V1(J)                        1
V1(J) * BR(J) → V2(J)                1
AI(J) → V3(J)                        1
V3(J) * BI(J) → V4(J)                1
V2(J) - V4(J) → V5(J)                1
V5(J) → CR(J)                        1
V1(J) * BI(J) → V6(J)                1
V3(J) * BR(J) → V7(J)                1
V6(J) + V7(J) → V8(J)                1
V8(J) → CI(J)                        1
TOTAL                               10

(d) Compound instruction

Operation                          Cycles
AR(J) → V1(J)                        1
V1(J) * BR(J) → V2(J)                1
AI(J) → V3(J)                        1
V2(J) - V3(J) * BI(J) → V2(J)        1
V2(J) → CR(J)                        1
V1(J) * BI(J) → V4(J)                1
V4(J) + V3(J) * BR(J) → V5(J)        1
V5(J) → CI(J)                        1
TOTAL                                8

Vi = vector registers
AR, BR, AI, BI = operands in memory
Ti = temporary locations in memory

Figure 17.19 Alternative Programs for Vector Calculation

Figure 17.20 Registers for the IBM 3090 Vector Facility
(Sixteen 32-bit vector registers of 128 elements each, which can be coupled into eight 64-bit vector registers; a 128-bit vector-mask register; a vector-status register; and a vector-activity count.)

Operation | Data Types (Long, Short, Binary or Logical) | Operand Locations
Add FL FS BI V + V → V V + S → V Q + V → V Q + S → V
Subtract FL FS BI V – V → V V – S → V Q – V → V Q – S → V
Multiply FL FS BI V × V → V V × S → V Q × V → V Q × S → V
Divide FL FS — V/V → V V/S → V Q/V → V Q/S → V
Compare FL FS BI V ⋅ V → V V ⋅ S → V Q ⋅ V → V Q ⋅ S → V
Multiply and Add FL FS — V + V × S → V V + Q × V → V V + Q × S → V
Multiply and Subtract FL FS — V – V × S → V V – Q × V → V V – Q × S → V
Multiply and Accumulate FL FS — P + ⋅ V → V P + ⋅ S → V
Complement FL FS BI –V → V
Positive Absolute FL FS BI |V| → V
Negative Absolute FL FS BI –|V| → V
Maximum FL FS — Q ⋅ V → Q
Maximum Absolute FL FS — Q ⋅ V → Q
Minimum FL FS — Q ⋅ V → Q
Shift Left Logical — — LO ⋅ V → V
Shift Right Logical — — LO ⋅ V → V
And — — LO V & V → V V & S → V Q & V → V Q & S → V
OR — — LO V | V → V V | S→ V Q | V→ V Q | S → V
Exclusive-OR — — LO V ⊕ V → V V ⊕ S → V Q ⊕ V → V Q ⊕ S → V

Explanation: Data Types Operand Locations
FL Long floating point V Vector register
FS Short floating point S Storage
BI Binary integer Q Scalar (general or floating-point register)
LO Logical P Partial sums in vector register
⋅ Special operation
