Parallel Architectures
Institute for Computing Systems Architecture
Parallel Architectures
How to build computers that execute tasks concurrently
– Tasks can be instructions, methods, threads, programs etc.
▪ How to provide support for coordination and communication
– coherence protocols, memory consistency model, synchronisation instructions, transactional memory etc.
Parallel Architectures: Why?
Be a good (systems) programmer
– Most computers today are parallel (supercomputers, datacentres, even mobile phones), need to understand them if you want to program them well!
Research future computer architectures and systems
– A job at Intel or ARM, or a platform/infrastructure job at Google, Microsoft, Amazon, etc.
– Academic researcher.
Appreciate other related courses
– Extreme Computing, Distributed Systems, etc.
General Information
It is recommended (but not required) that students have passed Inf 2C Computer systems. (Talk to me if you have not).
Exam: 75%, Coursework: 25%
Coursework:
– Coursework 1 – out 20-01-20; due 31-01-20
– Coursework 2 – out 03-02-20; due 06-03-20
▪ Recommended (but not required) Books:
– Culler & Singh, Parallel Computer Architecture: A Hardware/Software Approach
– Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 5th edition
Lecture slides (no lecture notes)
▪ More info: http://course.inf.ed.ac.uk/pa
▪ Please interrupt with questions at any time
Enroll on Piazza: piazza.com/ed.ac.uk/spring2020/infr11024 (access
What is a Parallel Architecture?
“A collection of processing elements that cooperate to solve large problems fast”
Almasi and Gottlieb, 1989
Examples: Parallel Architectures
– 8-stage pipeline
– Up to 8 instructions in-flight
Intel Pentium 4
– 31-stage superscalar pipeline
– Up to 124 instructions in-flight
Intel Ice Lake
– Quad core
– 2 threads per core (SMT)
– GPU
Examples: Parallel Architectures
Supercomputer
– 9,217 IBM POWER9 CPUs (22 cores each)
– 27,648 NVIDIA Tesla V100 GPUs
– Up to 200 petaflops
Ascend 910 AI SoC
– 32 Da Vinci cores
– 256 TFLOPS
Google Network
– ??? Linux machines
– Several connected cluster farms
Why Parallel Architectures?
▪ Important applications that require performance (pull)
– Nuclear reactor simulation; predicting future climate, etc.
– Machine Learning / Deep Learning
– Cloud / Big data:
▪ Google: 40K searches per second
▪ Facebook: about a billion active users per day
▪ Technological reasons (push)
– What to do with all those transistors?
– Performance of sequential architectures is limited:
▪ Computation/data flows through logic gates and memory devices
▪ Each of these has a non-zero delay (at the very least, the delay imposed by the speed of light)
▪ Thus, the speed of light and the minimum physical feature sizes impose a hard limit on the speed of any sequential computation
Technological Trends: Moore’s Law
▪ 1965 – Gordon Moore's "Law": densities double every year (2x per year)
▪ 1975 – Moore's Law revised: densities double every 2 years (~1.41x per year)
▪ Actually ~5x every 5 years (~1.38x per year)
[Figure: Growth in transistor density – transistors per device and transistor size (nm) vs. year of introduction (1960–2000), for Intel CPUs and the Siroyan CPU]
[Figure: Growth in microprocessor clock frequency – clock frequency (MHz) vs. year of introduction (1960–2000), for Intel and Alpha processors]
Future Technology Predictions (2003)
– Moore's Law will continue to ~2016
– Processors will have 2.2 billion transistors
– DRAM capacity to reach 128 …
– Clocks should reach 40 GHz
[Figure: Desktop microprocessor transistor count (millions of transistors) projected for 2004–2016, alongside poly 1/2 pitch and gate length scaling. Source: International Technology Roadmap for Semiconductors, 2003]
State-of-the-art – January 2020
Qualcomm Centriq 2400
▪ 18B transistors
▪ ~120W @ 2.2GHz
Mobile phone SoC
▪ 6 CPU cores + 3 GPU cores + "Neural Engine"
▪ 4.3B transistors, 89 mm²
▪ ~2W @ ~2.3GHz
64-core server CPU
▪ 64 cores, 2 threads per core
▪ ~39B transistors, ~1000 mm²
▪ ~280W @ ~2.6GHz
End of the Uniprocessor?
▪ Frequency has stopped scaling: Power Wall – end of Dennard scaling
▪ Memory wall
– Instructions and data must be fetched
– Memory becomes the bottleneck
▪ ILP wall
– Dependences between instructions limit ILP
▪ The end of performance scaling for uniprocessors has forced industry to turn to chip-multiprocessors (multicores)
Multicores
▪ Use transistors for adding cores
▪ (But Note: Moore’s law slowly ending, now for real!)
▪ (According to ITRS 2017, by ~2030 it will not be viable to shrink transistors any further!)
▪ Lots of effort on making it easier to write parallel programs
– e.g., transactional memory
▪ But, software must be parallel! – Remember Amdahl’s law
Amdahl’s Law
Let: F → fraction of the problem that can be optimized
     S_opt → speedup obtained on the optimized fraction

S_overall = 1 / ((1 - F) + F/S_opt)

e.g.: F = 0.5 (50%), S_opt = 10:
S_overall = 1 / ((1 - 0.5) + 0.5/10) ≈ 1.8

As S_opt → ∞:
S_overall → 1 / ((1 - 0.5) + 0) = 2

▪ Bottom line: performance improvements must be balanced
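A minimal sketch of this calculation in C, useful for sanity-checking the numbers above (the function name amdahl_speedup is illustrative, not from the slides):

#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f of the work is
 * sped up by a factor s_opt and the remainder runs unchanged. */
static double amdahl_speedup(double f, double s_opt)
{
    return 1.0 / ((1.0 - f) + f / s_opt);
}

int main(void)
{
    /* The example from the slide: F = 0.5, S_opt = 10 -> ~1.8x. */
    printf("S_overall = %.2f\n", amdahl_speedup(0.5, 10.0));

    /* Even with an arbitrarily fast optimized fraction, the untouched
     * 50% caps the overall speedup at 2x. */
    printf("Limit     = %.2f\n", amdahl_speedup(0.5, 1e9));
    return 0;
}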
Amdahl's Law and Efficiency

Let: F → fraction of the problem that can be parallelized
     S_par → speedup obtained on the parallelized fraction
     P → number of processors

S_overall = 1 / ((1 - F) + F/S_par)

E = S_overall / P

e.g.: 16 processors (S_par = 16), F = 0.9 (90%):
S_overall = 1 / ((1 - 0.9) + 0.9/16) ≈ 6.4
E = 6.4 / 16 = 40%

▪ For good scalability: E > 50%; when resources are "free", lower efficiencies are acceptable
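The same sketch extended to parallel efficiency (again with illustrative helper names, not from the slides), showing how E falls off as processors are added for F = 0.9:

#include <stdio.h>

/* Amdahl speedup and the corresponding parallel efficiency E = S/P,
 * assuming the parallel fraction f scales perfectly across p processors. */
static double speedup(double f, int p)    { return 1.0 / ((1.0 - f) + f / p); }
static double efficiency(double f, int p) { return speedup(f, p) / p; }

int main(void)
{
    const double f = 0.9;   /* 90% of the work is parallelizable */
    for (int p = 2; p <= 64; p *= 2)
        printf("P = %2d  S = %5.2f  E = %4.1f%%\n",
               p, speedup(f, p), 100.0 * efficiency(f, p));
    /* For P = 16: S = 6.4, E = 40% -- below the 50% "good scalability" bar. */
    return 0;
}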
The era of Specialization
▪ Multicores are not a panacea (dark silicon!)
▪ Moore's law is slowing down
▪ The importance of AI applications (vision, ML, etc.)
▪ Thus: specialized, domain-specific processors!
Accelerator level Parallelism
▪ Parallel architectures that contain a collection of accelerators (e.g., GPUs, DSPs, and others)
▪ Workloads execute concurrently across these accelerators
▪ Case in point: a mobile SoC
Types of Parallelism

▪ Parallelism in Hardware
– Parallelism in a uniprocessor
▪ Pipelining
▪ Superscalar, VLIW, etc.
– SIMD instructions, vector processors, GPUs
– Accelerators
– Multiprocessors
▪ Symmetric shared-memory multiprocessors
▪ Distributed-memory multiprocessors
▪ Chip-multiprocessors a.k.a. multicores
– Multicomputers a.k.a. clusters
▪ Parallelism in Software
– Instruction-level parallelism
– Task-level parallelism
– Data parallelism
– Transaction-level parallelism
Taxonomy of Parallel Computers
According to instruction and data streams (Flynn):
– Single instruction, single data (SISD): the standard uniprocessor
– Single instruction, multiple data streams (SIMD):
▪ The same instruction is executed on all processors, with different data
▪ E.g., vector processors, SIMD instructions, GPUs
– Multiple instruction, single data streams (MISD):
▪ Different instructions on the same data
▪ E.g., fault-tolerant computers, systolic arrays, near-memory computing (Micron Automata Processor)
– Multiple instruction, multiple data streams (MIMD): the "common" multiprocessor
▪ Each processor uses its own data and executes its own program
▪ Most flexible approach
▪ Easier/cheaper to build by putting together "off-the-shelf" processors
Taxonomy of Parallel Computers
According to physical organization of processors and memory:
– Physically centralized memory, uniform memory access (UMA)
▪ All memory is at the same distance from all processors
▪ Also called symmetric multiprocessors (SMP)
▪ Memory bandwidth is fixed and must accommodate all processors → does not scale to a large number of processors
▪ Used in CMPs today (single-socket ones)
[Figure: UMA organization – CPUs connected through an interconnection network to a single main memory]
Taxonomy of Parallel Computers
According to physical organization of processors and memory:
– Physically distributed memory, non-uniform memory access (NUMA)
▪ A portion of memory is allocated with each processor (node)
▪ Accessing local memory is much faster than accessing remote memory
▪ If most accesses are to local memory, then overall memory bandwidth increases linearly with the number of processors
▪ Used in multi-socket CMPs, e.g., Intel Nehalem
[Figure 1: Block diagram of the AMD (left) and Intel (right) dual-socket system architectures. Dual-socket SMP systems based on AMD Opteron 23** (Shanghai) and Intel Xeon 55** (Nehalem-EP) processors have a similar high-level design: L1 and L2 caches per core, a shared L3 cache per socket (non-inclusive on Shanghai, inclusive on Nehalem-EP), local DDR2/DDR3 memory attached to each socket, and serial point-to-point links (HyperTransport / QuickPath Interconnect) between sockets. Both use extended versions of the MESI protocol for cache coherence.]
Taxonomy of Parallel Computers
According to memory communication model:
– Shared address or shared memory
▪ Processes in different processors can use the same virtual address space
▪ Any processor can directly access memory in another processor's node
▪ Communication is done through shared memory variables
▪ Explicit synchronization with locks and critical sections
▪ Arguably easier to program??
– Distributed address or message passing
▪ Processes in different processors use different virtual address spaces
▪ Each processor can only directly access memory in its own node
▪ Communication is done through explicit messages
▪ Synchronization is implicit in the messages
▪ Arguably harder to program??
▪ Some standard message passing libraries (e.g., MPI)
Shared Memory vs. Message Passing

Shared memory
  Producer (p1)          Consumer (p2)
  flag = 0;              flag = 0;
  ...                    ...
  a = 10;                while (!flag) {}
  flag = 1;              x = a * y;

Message passing
  Producer (p1)          Consumer (p2)
  ...                    ...
  a = 10;                receive(p1, b, label);
  send(p2, a, label);    x = b * y;
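The shared-memory column above is pseudocode; below is a minimal runnable sketch in C with POSIX threads (variable and function names are illustrative, not from the slides). An atomic flag with release/acquire ordering stands in for the plain flag so the spin loop is safe under a real memory consistency model – a topic covered later in the course. The message-passing version would use explicit messages instead, e.g. MPI_Send/MPI_Recv.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int a;                 /* data produced by p1                     */
static atomic_int flag;       /* 0 until 'a' is ready (zero-initialized) */

static void *producer(void *arg)   /* p1 */
{
    (void)arg;
    a = 10;                                                /* produce   */
    atomic_store_explicit(&flag, 1, memory_order_release); /* flag = 1  */
    return NULL;
}

static void *consumer(void *arg)   /* p2 */
{
    (void)arg;
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                                  /* spin-wait */
    int y = 3;
    printf("x = %d\n", a * y);                             /* x = a * y */
    return NULL;
}

int main(void)
{
    pthread_t p1, p2;
    pthread_create(&p1, NULL, producer, NULL);
    pthread_create(&p2, NULL, consumer, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return 0;
}

(Compile with something like cc -pthread file.c; the file name is up to you.)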
Types of Parallelism in Applications
Instruction-level parallelism (ILP)
– Multiple instructions from the same instruction stream can be executed concurrently
– Generated and managed by hardware (superscalar) or by compiler (VLIW)
– Limited in practice by data and control dependences
Thread-level or task-level parallelism (TLP)
– Multiple threads or instruction sequences from the same application can be executed concurrently
– Generated by compiler/user and managed by compiler and hardware
– Limited in practice by communication/synchronization overheads and by algorithm characteristics
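As a small added illustration (not from the slides): the first loop below has a loop-carried dependence, so its additions must execute one after another and expose little ILP; the second has fully independent iterations that superscalar hardware can overlap freely.

#include <stddef.h>

/* Loop-carried dependence: each iteration needs the previous sum,
 * so there is little instruction-level parallelism to exploit. */
double reduce_sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];          /* depends on s from the previous iteration */
    return s;
}

/* Independent iterations: c[i] depends only on a[i] and b[i], so many
 * iterations can be in flight at once (abundant ILP). */
void vector_add(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}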
Types of Parallelism in Applications
Transaction-level parallelism
– Multiple threads/processes from different transactions can be executed concurrently
– Limited by concurrency overheads
Data-level parallelism (DLP)
– Instructions from a single stream operate concurrently on several data
– Limited by non-regular data manipulation patterns and by memory bandwidth
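A small data-parallel sketch in plain C (the saxpy name and the optional #pragma omp simd hint are assumptions, not from the slides): one operation applied element-wise over an array, which optimizing compilers typically map onto SIMD instructions.

#include <stddef.h>

/* Data-level parallelism: the same operation is applied to every element.
 * With optimization enabled, most compilers vectorize this loop; the
 * pragma is only a hint and requires an OpenMP-capable compiler. */
void saxpy(float *y, const float *x, float a, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}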
▪ Fundamental concepts
– Introduction
– Types of parallelism
▪ Uniprocessor parallelism
– Pipelining, superscalars
▪ Hardware multithreading
▪ Shared memory multiprocessors
– Cache coherence and memory consistency
– Synchronization and transactional memory
▪ Vector, SIMD processors, GPUs
▪ AI accelerators
▪ Datacenter datastores (if time permits)