
9
Architecture

Uwe R. Zimmer – The Australian National University

Computer Organisation & Program Execution 2021


References for this chapter

[Patterson17]
David A. Patterson & John L. Hennessy
Computer Organization and Design – The Hardware/Software Interface
Chapter 4 “The Processor”,
Chapter 6 “Parallel Processors from Client to Cloud”
ARM edition, Morgan Kaufmann 2017

Definition: Processor

Hardware origins
18th century machines

Programmable, yet not a computer in today’s definition (not Turing complete)

L’Ecrivain (1770)
Pierre Jaquet-Droz, Henri-Louis Jaquet-Droz & Jean-Frédéric Leschot

Definition: Processor

Digital Computers
Hardware origins

• Patents by Konrad Zuse (Germany), 1936.

• First digital computer: Z1 (Germany), 1937: relays, programmable via punch tape, clock: 1 Hz,
64 words of memory at 22 bits each, 2 registers, floating point unit, weight: 1 t.

• First freely programmable (Turing complete) relay computer: Z3 (Germany), 1941: 5.3 Hz.

• Atanasoff Berry Computer (US), 1942: vacuum tubes (not Turing complete).

• Colossus Mark 1 (UK), 1944: vacuum tubes (not Turing complete).

• “First Draft of a Report on the EDVAC” (Electronic Discrete Variable Automatic Computer)
by John von Neumann (US), 1945: influential article about the core elements of a computer:
arithmetic unit, control unit (sequencer), memory (holding data and program), and I/O.

• First high-level programming language: Plankalkül (“Plan Calculus”) by Konrad Zuse, 1945.

• ENIAC (Electronic Numerical Integrator And Computer) (US), 1946: programmed by plugboard,
first Turing complete vacuum-tube-based computer, clock: 100 kHz, weight: 27 t on 167 m².

Konrad Zuse with Z1, © Dr. Horst Zuse
(replica of the 1937 computer)

ENIAC 1946,
Glen Beck (background), Betty Jennings (foreground)


Computer Architectures

Harvard Architecture

• Control unit
Concurrently addresses program and data
memory and fetches next instruction.
Controls next ALU operations and
determines the next instruction
(based on ALU status).

• Arithmetic Logic Unit (ALU)
Fetches data from memory.
Executes arithmetic/logic operation.
Writes data to memory.

• Input/Output

• Program memory

• Data memory

[Figure: Harvard architecture. Blocks: program memory, control unit, arithmetic logic unit, data memory, input/output. Connections: address, instructions, data, control, status.]

Computer Architectures

von Neumann Architecture

• Control unit
Sequentially addresses program and data
memory and fetches next instruction.
Controls next ALU operations and
determines the next instruction
(based on ALU status).

• Arithmetic Logic Unit (ALU)
Fetches data from memory.
Executes arithmetic/logic operation.
Writes data to memory.

• Input/Output

• Memory
Program and data are not distinguished.

Programs can change themselves.
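
The shared memory of the von Neumann architecture can be sketched as a toy machine. This is a minimal illustration, not a real instruction set: the operations LOAD, ADD, STORE and HALT, the tuple encoding, and the function name run are all assumptions made for this example. The program overwrites one of its own instructions, which only works because instructions and data live in the same memory.

```python
# Toy von Neumann machine: a single memory holds instructions and data alike.
# The instruction set and the tuple encoding are invented for this sketch.

def run(memory, max_steps=100):
    """Run the program in `memory` until HALT and return the accumulator."""
    pc = 0    # instruction pointer
    acc = 0   # accumulator register
    for _ in range(max_steps):
        op, arg = memory[pc]       # fetch and decode the next instruction
        pc += 1
        if op == "LOAD":           # acc <- memory[arg]
            acc = memory[arg]
        elif op == "ADD":          # acc <- acc + memory[arg]
            acc += memory[arg]
        elif op == "STORE":        # memory[arg] <- acc
            memory[arg] = acc
        elif op == "HALT":
            return acc
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit reached without HALT")

memory = [
    ("LOAD", 5),    # 0: acc <- the instruction stored as data in cell 5
    ("STORE", 3),   # 1: overwrite the instruction in cell 3 with it
    ("LOAD", 6),    # 2: acc <- 5
    ("HALT", 0),    # 3: replaced by ("ADD", 7) before it is reached
    ("HALT", 0),    # 4
    ("ADD", 7),     # 5: an instruction that is handled as plain data
    5,              # 6: data
    10,             # 7: data
]
result = run(memory)
print(result)  # 15; the unmodified program would halt at cell 3 with 5
```

In a Harvard machine the STORE in cell 1 could only reach data memory, so the program could not rewrite itself this way.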

[Figure: von Neumann architecture. Blocks: memory, control unit, arithmetic logic unit, input/output. Connections: address, data, instructions, control, status.]

Computer Architectures

A simple processor (CPU)

• Decoder/Sequencer
Can be a machine in itself which breaks CPU
instructions into concurrent micro code.

• Execution Unit / Arithmetic-Logic-Unit (ALU)
A collection of transformational logic.

• Memory

• Registers
Instruction pointer, stack pointer,
general purpose and specialized registers.

• Flags
Indicating the states of the
latest calculations.

• Code/Data management
Fetching, Caching, Storing.
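
How flags record the state of the latest calculation can be sketched for a 32 bit addition. The flag names N, Z, C and V follow the ARM convention, which is an assumption here (the slide does not name a flag set), and add_with_flags is a hypothetical helper, not a real API.

```python
MASK32 = 0xFFFFFFFF  # keep values within 32 bits


def sign_bit(value):
    """Return bit 31 of a 32-bit value."""
    return (value >> 31) & 1


def add_with_flags(a, b):
    """Add two 32-bit values and return (result, flags)."""
    a &= MASK32
    b &= MASK32
    full = a + b                 # unbounded Python integer, may exceed 32 bits
    result = full & MASK32       # what a 32-bit register would hold
    flags = {
        "N": sign_bit(result) == 1,   # negative: top bit of the result is set
        "Z": result == 0,             # zero: the result is exactly 0
        "C": full > MASK32,           # carry: unsigned overflow out of bit 31
        # signed overflow: both operands share a sign the result does not have
        "V": sign_bit(a) == sign_bit(b) and sign_bit(result) != sign_bit(a),
    }
    return result, flags


wrap, wrap_flags = add_with_flags(0xFFFFFFFF, 1)       # unsigned wrap-around
overflow, overflow_flags = add_with_flags(0x7FFFFFFF, 1)  # signed overflow
```

A conditional branch that follows such an addition only has to inspect these few bits, not the full result.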

[Figure: simple processor. Blocks: ALU, memory, sequencer, decoder, code management, registers, flags, data management.]

Processor Architectures

Pipeline
Some CPU actions are naturally sequential
(e.g. instructions need to be first loaded, then
decoded before they can be executed).

More fine grained sequences can
be introduced by breaking CPU
instructions into micro code.

Overlapping those sequences in time
will lead to the concept of pipelines.

Same latency, yet higher throughput.
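
The latency and throughput claim can be checked with a small calculation. This sketch assumes an idealised pipeline in which every stage takes one cycle and nothing ever stalls; both function names are hypothetical.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """The first instruction fills the pipeline; then one completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


stages = 5
for n in (1, 100):
    print(n, cycles_unpipelined(n, stages), cycles_pipelined(n, stages))
# 1 instruction:    5 cycles either way (same latency)
# 100 instructions: 500 cycles unpipelined, 104 cycles pipelined
```

A single instruction still needs all five cycles, while a long instruction stream approaches one completed instruction per cycle.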

(Conditional) branches
might break the pipelines

Branch predictors become essential.
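
One common textbook scheme for branch prediction is a two-bit saturating counter. The slides do not say which predictor is meant, so this scheme is an assumption chosen for illustration, and the function name is hypothetical.

```python
def count_correct_predictions(outcomes, state=2):
    """Simulate a two-bit saturating counter on one branch.

    `outcomes` is a sequence of booleans (True = branch taken).
    States 0 and 1 predict "not taken", states 2 and 3 predict "taken".
    Returns how many outcomes were predicted correctly.
    """
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move one step towards the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# A loop branch: taken nine times, then not taken once, repeated three times.
loop_branch = ([True] * 9 + [False]) * 3
print(count_correct_predictions(loop_branch), "of", len(loop_branch))  # 27 of 30
```

Only the loop exits are mispredicted, because a single "not taken" outcome does not flip the prediction for the next round of the loop.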

[Figure: processor block diagram for pipelining. Blocks: ALU, memory, sequencer, decoder, code management, registers, flags, data management, Int.]

Processor Architectures

Parallel pipelines
Filling parallel pipelines
(by alternating incoming commands between
pipelines) may employ multiple ALUs.

(Conditional) branches might
again break the pipelines.

Interdependencies might limit
the degree of concurrency.

Same latency, yet even higher throughput.

Compilers need to be aware of the options.

[Figure: processor block diagram for parallel pipelines. Blocks: multiple ALUs, memory, sequencer, decoder, code management, registers, flags, data management, Int.]

Processor Architectures

Pipeline hazards
Structural hazard
Lack of hardware to run operations in parallel,
… e.g. loading a new instruction and loading new data in parallel.

Control hazard
A decision depends on the previous instruction,
… e.g. a conditional branch based on the flags from the previous instruction.

Data hazard
Needed data is not yet available,
… e.g. the result of an arithmetic operation is needed in the next instruction.
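
The data hazard case can be illustrated by scanning a short instruction sequence for results that are needed by the very next instruction. The instruction encoding (destination register plus a tuple of source registers), the register names and the function name are hypothetical, and only directly adjacent instructions are checked.

```python
def find_data_hazards(program):
    """Return the indices of instructions that read the previous result.

    Each instruction is a pair (destination, sources), where `sources`
    is a tuple of register names.
    """
    hazards = []
    for i in range(1, len(program)):
        previous_destination = program[i - 1][0]
        sources = program[i][1]
        if previous_destination in sources:
            hazards.append(i)
    return hazards


program = [
    ("r1", ("r2", "r3")),   # 0: r1 <- r2 + r3
    ("r4", ("r1", "r5")),   # 1: needs r1 from instruction 0 (hazard)
    ("r6", ("r7", "r8")),   # 2: independent of instruction 1
    ("r9", ("r6", "r4")),   # 3: needs r6 from instruction 2 (hazard)
]
print(find_data_hazards(program))  # [1, 3]
```

A compiler or the hardware could remove such a hazard by moving an independent instruction between the producer and the consumer.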

[Figure: processor block diagram for pipeline hazards. Blocks: ALU, memory, sequencer, decoder, code management, registers, flags, data management, Int.]

Processor Architectures

Out of order execution
Breaking the sequence inside each pipeline
leads to ‘out of order’ CPU designs.

Replaces pipelines with a hardware scheduler.

Results need to be
“re-sequentialized” or possibly discarded.

“Conditional branch prediction” executes
the most likely branch or multiple branches.

Works better if the presented code
sequence has more independent
instructions and fewer conditional branches.

This hardware will require (extensive)
code optimization to be fully utilized.
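
Why independent instructions help can be sketched by comparing strictly sequential execution with an idealised out of order schedule. The model is an assumption made for illustration: unlimited execution units, no branches, and each instruction starts as soon as its source registers are ready. The encoding (destination, sources, latency in cycles) and the function names are hypothetical.

```python
def sequential_cycles(program):
    """One instruction at a time: the latencies simply add up."""
    return sum(latency for _, _, latency in program)


def out_of_order_cycles(program):
    """Every instruction starts as soon as all of its inputs are available."""
    ready_at = {}   # register -> cycle in which its value becomes available
    finish = 0
    for destination, sources, latency in program:
        start = max((ready_at.get(s, 0) for s in sources), default=0)
        ready_at[destination] = start + latency
        finish = max(finish, ready_at[destination])
    return finish


independent = [
    ("r1", ("r0",), 3),        # slow operation
    ("r2", ("r1",), 1),        # depends on r1
    ("r3", ("r0",), 3),        # second slow operation, independent of r1 and r2
    ("r4", ("r3",), 1),        # depends on r3
    ("r5", ("r2", "r4"), 1),   # joins both chains
]
chain = [
    ("r1", ("r0",), 2),
    ("r2", ("r1",), 2),
    ("r3", ("r2",), 2),
]
print(sequential_cycles(independent), out_of_order_cycles(independent))  # 9 5
print(sequential_cycles(chain), out_of_order_cycles(chain))              # 6 6
```

The two independent chains overlap almost completely, while the fully dependent chain gains nothing from the scheduler.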

[Figure: processor block diagram for out of order execution. Blocks: multiple ALUs, memory, sequencer, decoder, code management, registers, flags, data management, Int.]

Processor Architectures

SIMD ALU units
Provides the facility to apply the same
instruction to multiple data concurrently.
Also referred to as “vector units”.

Examples: Altivec, MMX, SSE[2|3|4], …

Requires specialized compilers
or programming languages with
implicit concurrency.
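
The idea of one instruction acting on multiple data can be emulated in software by packing four 8 bit lanes into one 32 bit word. This is only a sketch of the principle: real vector units such as those listed above do this in hardware on wider registers, and the names packed_add8 and lanes are hypothetical.

```python
LOW7 = 0x7F7F7F7F    # the lower seven bits of every 8-bit lane
HIGH1 = 0x80808080   # the top bit of every 8-bit lane


def packed_add8(a, b):
    """Add four 8-bit lanes at once; each lane wraps around on its own."""
    # Adding only the lower seven bits cannot carry into a neighbouring lane.
    low = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is then combined without generating a carry.
    return (low ^ ((a ^ b) & HIGH1)) & 0xFFFFFFFF


def lanes(word):
    """Split a 32-bit word into its four 8-bit lanes, highest lane first."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


total = packed_add8(0x01FF7F10, 0x0101017F)
print([hex(lane) for lane in lanes(total)])  # ['0x2', '0x0', '0x80', '0x8f']
```

The second lane wraps from 0xFF to 0x00 without disturbing its neighbours, which is the behaviour a plain 32 bit addition would not provide.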

GPU processing
Graphics processor as a vector unit.

Unifying architecture languages are used for
general-purpose GPU (GPGPU) programming (e.g. OpenCL, CUDA).

[Figure: processor block diagram with SIMD ALU units. Blocks: multiple ALUs, memory, sequencer, decoder, code management, registers, flags, data management, Int.]

Processor Architectures

Hyper-threading
Emulates multiple virtual CPU cores
by means of replication of:

• Register sets

• Sequencer

• Flags

• Interrupt logic

while keeping the “expensive” resources
like the ALU central yet accessible by
multiple hyper-threads concurrently.

Requires programming languages with
implicit or explicit concurrency.

Examples: Intel Pentium 4, Core i5/i7, Xeon,
Atom, Sun UltraSPARC T2 (8 threads per core)

[Figure: processor block diagram for hyper-threading. Replicated: registers (IP, SP), flags, sequencer, decoder, Int. Shared: ALU, memory, code management, data management.]

Processor Architectures

Multi-core CPUs
Full replication of multiple CPU cores
on the same chip package.

• Often combined with hyper-threading
and/or multiple other means (as
introduced above) on each core.

• Cleanest and most explicit implementation
of concurrency on the CPU level.

Requires synchronized atomic operations.

Requires programming languages with
implicit or explicit concurrency.
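
The need for synchronisation can be sketched with several threads incrementing one shared counter. A lock from the Python standard library stands in here for the atomic operations the hardware provides; this is an assumption of the example, and depending on the interpreter the threads may not actually run on separate cores. The function name is hypothetical.

```python
import threading


def count_with_lock(n_threads=4, increments=10_000):
    """Let several threads increment one shared counter under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # "counter += 1" is a read-modify-write sequence; the lock makes
            # sure no other thread interleaves between the read and the write.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


total_count = count_with_lock(n_threads=4, increments=1_000)
print(total_count)  # 4000
```

Without the lock, two threads could read the same old value and one of the increments would be lost.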

Historically, the introduction of multi-core
CPUs ended the “GHz race” in the early 2000s.

[Figure: processor block diagram for multi-core CPUs. Blocks per core: ALU, sequencer, decoder, code management, registers (IP, SP), flags, data management, Int.; memory.]

Processor Architectures

Virtual memory
Translates logical memory addresses
into physical memory addresses
and provides memory protection features.

• Does not introduce concurrency by itself.

Is still essential for concurrent programming
as hardware memory protection
guarantees memory integrity for
individual processes / threads.
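
Address translation and memory protection can be sketched with a single-level page table. The page size of 4096 bytes, the table layout (page number mapped to a frame number and a writable bit) and all names are assumptions made for this example; real hardware uses multi-level tables and dedicated translation caches.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class ProtectionFault(Exception):
    """Raised when an access is not permitted by the page table."""


def translate(page_table, virtual_address, write=False):
    """Translate a virtual address into a physical address.

    `page_table` maps a virtual page number to (physical frame, writable).
    """
    page, offset = divmod(virtual_address, PAGE_SIZE)
    entry = page_table.get(page)
    if entry is None:
        raise ProtectionFault(f"page {page} is not mapped for this process")
    frame, writable = entry
    if write and not writable:
        raise ProtectionFault(f"page {page} is read-only")
    return frame * PAGE_SIZE + offset


# Hypothetical page table of one process.
page_table = {
    0: (5, False),   # virtual page 0 -> physical frame 5, read-only
    1: (2, True),    # virtual page 1 -> physical frame 2, writable
}
print(hex(translate(page_table, 0x1010)))  # 0x2010
print(hex(translate(page_table, 0x0004)))  # 0x5004
```

Each process gets its own table, so an address that is not mapped in that table cannot reach the physical memory of another process.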

[Figure: processor block diagram with virtual memory. Blocks: ALU, memory, sequencer, decoder, code management, registers, flags, data management, Int.; physical memory and virtual memory.]


Alternative Processor Architectures: Parallax Propeller (2006)

Low cost 32 bit processor ($8)

8 cores with 2 kB local memory

40 kB shared memory

No interrupts!

8 semaphores

Alternative Processor Architectures: IBM Cell processor (2001)

theoretical 25.6 GFLOPS at 3.2 GHz

8 cores for specialized high-bandwidth floating point
operations and 128 bit registers

Multiple interconnect topologies

64 bit PowerPC core

Cache


Multi-CPU systems

Scaling up:

• Multi-CPU on the same memory
multiple CPUs on same motherboard and memory bus, e.g. servers, workstations

• Multi-CPU with high-speed interconnects
various supercomputer architectures, e.g. Cray XE6:

• 12-core AMD Opteron, up to 192 per cabinet (2304 cores)

• 3D torus interconnect (160 GB/sec capacity, 48 ports per node)

• Cluster computer (Multi-CPU over network)
multiple computers connected by network interface,
e.g. Sun Constellation Cluster at ANU:

• 1492 nodes, each: 2x Quad core Intel Nehalem, 24 GB RAM

• QDR InfiniBand network, 2.6 GB/sec

Summary

Architecture

• History

• Architectures

• Pipelines

• Parallel pipelines

• Out of order execution

• Vector machines

• Multi-core CPUs

• Virtual memory
