[Figure: basic organization of a computer system – peripherals attached via communication lines; the CPU contains the functional units]
Central Processing Unit
System Interconnection
Input/Output
Communication lines
Main Memory
Implicit Parallelism
Higher levels of device integration have made available a large number of transistors.
How best to utilize these resources?
Conventionally, use these resources in multiple functional units and execute multiple instructions in the same cycle (Instruction Level Parallelism).
Pipelining
Pipelining overlaps various stages of instruction execution (Fetch : Decode : Execute : ...).
The more pipeline stages, the faster the clock can run.
Also used in arithmetic units, e.g. fp multiply.
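As a worked illustration (a standard textbook model, not stated on the slide): with k one-cycle stages, n instructions finish in k + n - 1 cycles instead of the n*k cycles an unpipelined unit would need, so the speedup approaches k for long instruction streams:

    % pipeline speedup, ideal k-stage pipeline with one-cycle stages
    S = \frac{T_{\text{unpipelined}}}{T_{\text{pipelined}}}
      = \frac{n\,k}{k + n - 1} \;\longrightarrow\; k \quad (n \to \infty)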
Limitations:
– The speed of a pipeline is eventually limited by the speed of the slowest stage.
– More stages, or very deep pipelines, require very accurate branch prediction. However, in typical program traces, every 5-6th instruction is a conditional branch.
– The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.
Superscalar execution
Multiple redundant functional units in each CPU so that multiple instructions can be executed on separate data items concurrently.
Early ones: a couple of ALUs and a single FPU; modern ones have more, e.g. several ALUs and FPUs, as well as two SIMD units.
The scheduler – a piece of hardware – looks at a large number of instructions in an instruction queue and selects the appropriate number of instructions to execute concurrently.
Very Long Instruction Word (VLIW) processors instead rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently, saving on hardware dependency-checking logic.
Limitations:
– The degree of intrinsic parallelism in the instruction stream, i.e. limited instruction-level parallelism.
– The complexity and time cost of the scheduling operation.
Dependencies limit the performance:
– Data Dependency: The result of one operation is an input to the next.
– Resource Dependency: Two operations require the same resource.
– Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a-priori.
The performance will suffer if the scheduler is unable to keep all of the units busy (a minimal code sketch follows this list).
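A minimal sketch of these dependency types (mine, not from the slides; function and variable names are made up):

    // Hypothetical fragment showing the three dependency types that
    // limit superscalar/VLIW instruction scheduling.
    float deps(float a, float b, float c, float d) {
        float x = a * b;   // op 1
        float y = x + c;   // data dependency: needs op 1's result
        float p = a * c;   // resource dependency: on a core with one FP
        float q = b * d;   //   multiplier, p and q compete for that unit
        if (x > 0.0f)      // branch dependency: work after the branch cannot
            y += p;        //   be scheduled deterministically a-priori
        return y + q;
    }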
Technology: Power and Heat – Intel Embraces Multicore
May 17, 2004 ... Intel, the world's largest chip maker, publicly acknowledged that it had hit a "thermal wall" on its microprocessor line. As a result, the company is changing its product strategy and disbanding one of its most advanced design groups. Intel also said that it would abandon two advanced chip development projects ...
Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor ... Intel's decision to change course and embrace a "dual core" processor structure shows the challenge of overcoming the effects of heat generated by the constant on-off movement of tiny switches in modern computers ... some analysts and former Intel designers said that Intel was coming to terms with escalating heat problems so severe they threatened to cause its chips to fracture at extreme temperatures ...
– The New York Times, May 17, 2004
[Figure: a multicore chip – a number of processors, each with its own on-chip cache, a shared on-chip cache, and shared global memory (external cache and DRAM)]
All PCs have a GPU – the main chip inside a computer which calculates and generates the positioning of graphics on a computer screen.
Games typically render 10,000s of triangles @ 60 fps.
Screen resolution is typically 1600x1200 and each pixel is recalculated every frame; this corresponds to processing 1600 x 1200 x 60 = 115,200,000 pps.
Parallel hardware is needed to make these operations fast: new types of GPUs contain multiple cores (or many-cores) that utilise hardware multithreading and SIMD.
SM – Streaming Multiprocessor (more or less a core), containing:
– Register file
– Shared memory
– Constant cache (read-only for SM)
– Texture cache (read-only for SM)
SP – Streaming Processor ("processor core")
Multicores: huge, complex cores with lots of internal concurrency and latency hiding (e.g. Intel processors); used for general-purpose processing.
Manycores: simpler cores with little internal concurrency, latency-sensitive (e.g. NVIDIA GPUs); used as accelerators.
Flynn's Taxonomy
Prof Michael Flynn (Stanford University) proposed the method to classify computers in the 60's, by instruction stream and data stream (each either single or multiple); modern computers are combinations of these categories.
– SISD: Single Instruction, Single Data.
– SIMD: Single Instruction, Multiple Data – processor arrays, pipelined vector processors.
– MISD: Multiple Instruction, Single Data – rarely used.
– MIMD: Multiple Instruction, Multiple Data – multiprocessors, multicomputers.
Single Instruction Multiple Data architecture
A single instruction can operate on multiple data elements in parallel, exploiting the highly structured nature of the underlying computations.
Data parallelism of this kind is widely applied in vector processing and multimedia processing (e.g., graphics, image and video).
Scalar processing: the traditional mode – one operation in one Processing Unit produces one result.
SIMD processing: vector operations, e.g., SSEx on Intel – data parallelism, multiple operations in parallel, so one instruction produces multiple results.
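A minimal sketch of the scalar/SIMD contrast (mine, not from the slides), using Intel SSE intrinsics; array names are made up, and the arrays are assumed 16-byte aligned with a length that is a multiple of 4:

    #include <xmmintrin.h>   // SSE intrinsics (host-only code)

    // Scalar: one operation, one result per iteration.
    void add_scalar(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // SIMD: one instruction operates on 4 floats at a time.
    void add_sse(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);           // load 4 floats
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(c + i, _mm_add_ps(va, vb));  // 4 adds in one op
        }
    }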
GPU is a type of SIMD machine, originally designed for graphics processing.
It has potential for high performance at low cost, and nowadays is widely used for certain kinds of parallel applications (data parallel) – GPGPU.
e.g. NVIDIA Fermi: 512 Processing Elements (SPs).
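As a sketch of the GPGPU programming style (mine, not from the slides; names such as vec_add are made up): a CUDA program in which each GPU thread adds one pair of array elements.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Each thread computes one element: data parallelism in the SIMD style.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // grid may overshoot n
            c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
              *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;                 // device (GPU) memory
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // launch kernel
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);          // expect 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }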
Parallel architectures can be classified (based on the memory organization) into shared-memory or distributed-memory architectures (or a combination of both, i.e. distributed shared memory):
– shared-memory multiprocessors communicate together through a common memory;
– distributed-memory multicomputers communicate together through a communication network.
Parallel Architectures: SHARED-MEMORY MULTIPROCESSOR
All the processors in the system can access all memory modules in the system; the system can have uniform memory access (UMA) or non-uniform memory access (NUMA).
[Figure: processors and memory modules connected by an Interconnection Network]
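A minimal sketch of the shared-memory model (mine, not from the slides): two CPU threads communicate through a counter in the common address space, using C++ std::thread and std::atomic in host code.

    #include <thread>
    #include <atomic>
    #include <cstdio>

    std::atomic<int> shared_counter{0};   // lives in the common memory

    void worker(int n) {
        for (int i = 0; i < n; i++)
            shared_counter.fetch_add(1);  // both threads update the same word
    }

    int main() {
        std::thread t1(worker, 1000), t2(worker, 1000);  // two "processors"
        t1.join(); t2.join();
        std::printf("counter = %d\n", shared_counter.load());  // prints 2000
        return 0;
    }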
Parallel Architectures: DISTRIBUTED-MEMORY MULTICOMPUTER
Processors have direct access only to their local memory; communication takes place via message passing (each PE can itself be a shared-memory processor).
[Figure: processing elements with local memories connected by an Interconnect Network]
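A minimal message-passing sketch (mine, not from the slides), using MPI: rank 0 sends a value to rank 1 over the interconnect rather than through shared memory.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = 0;                    // each rank has its own local copy
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // explicit message
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }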