SEC204
Computer architectures and low level programming
Dr. Vasilios Kelefouras
Email: v.kelefouras@plymouth.ac.uk Website: https://www.plymouth.ac.uk/staff/vasilios-kelefouras
School of Computing (University of Plymouth)
Date: 04/11/2019
Computer Architectures – Last Pieces of the Puzzle
Too many puzzling words:
• x86, RISC, CISC, EPIC, VLIW, Harvard architecture
• SIMD
• Microcontrollers, ASIC, ASIP, FPGA, GPU, DSP
• Pipeline, vector processing, superscalar, hyper-threading, multi-threading
• Heterogeneous systems
Outline
Different computer architectures – classified by purpose
General Purpose Processors
Application Specific Processors
Coprocessors / accelerators
Multi-core processors
Many-core processors
Simultaneous Multithreading
Single Instruction Multiple Data
Heterogeneous Systems
Computer architectures – classified by purpose (1)
Fig.1. CPU market analysis
Computer architectures – classified by purpose (2)
1. General Purpose Processors
2. Specific Purpose Processors
3. Accelerators, also called co-processors
General Purpose Processors (GPP)
They are classified into:
1. General purpose microprocessors – general purpose computers, e.g., desktop PCs, laptops
   Very powerful CPUs, e.g., Intel, AMD and now Arm too
   Superscalar and out-of-order, big cache memories, lots of pipeline stages
2. Microcontrollers – embedded systems
   Less powerful CPUs, e.g., ARM, Texas Instruments
   They are usually designed for specific tasks in embedded systems
   They usually have control-oriented peripherals
   They have an on-chip CPU, a fixed amount of RAM, ROM and I/O ports
   Lower cost, lower performance, lower power consumption and smaller than microprocessors
   Appropriate for applications in which cost, power consumption and chip area are critical
GPP – General Purpose Microprocessor
General Purpose Microprocessor – general purpose computers
They are designed for general purpose computers such as PCs, workstations, laptops, notepads, etc.
Higher CPU frequency than microcontrollers
Higher cost than microcontrollers
Higher performance than microcontrollers
Higher power consumption than microcontrollers
General purpose processors are designed to execute multiple applications and perform multiple tasks
GPP – Microcontrollers
Fig.2. Microcontrollers
Fig.3. Components of the Microcontroller
Application Specific Processors (1)
General purpose processors offer good performance across all applications, but application specific processors offer better performance for a specific task
Application specific processors emerged as a solution for
  higher performance
  lower power consumption
  lower cost
Application specific processors have become part of our lives and can be found in almost every device we use daily
Devices such as TVs, mobile phones and GPS units all contain application specific processors
They are classified into
1. Digital Signal Processors (DSPs)
2. Application Specific Instruction Set Processors (ASIPs)
3. Application Specific Integrated Circuits (ASICs)
1. Digital Signal Processors (DSPs)
DSP: Programmable microprocessor for extensive real-time mathematical computations
A specialized microprocessor whose architecture is optimized for the operational needs of digital signal processing
DSP processors are designed specifically to perform large numbers of arithmetic calculations as quickly as possible
DSPs tend to have a different arithmetic unit architecture, with specialized hardware units such as bit-reversal addressing, multiple Multiply-Accumulate (MAC) units, etc.
Normally DSPs have a small instruction cache but no data cache memory
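To make the multiply-accumulate idea concrete, below is a minimal C sketch of an FIR filter inner loop – the kind of kernel a DSP's MAC units execute at one tap per cycle. The function and variable names are illustrative only:

    /* One output sample of an FIR filter: y = sum_k h[k] * x[-k].
       x points at the newest sample of a buffer holding at least
       'taps' samples of history. A DSP with a hardware MAC unit
       can issue one multiply-accumulate per loop iteration. */
    float fir_sample(const float *x, const float *h, int taps)
    {
        float acc = 0.0f;                /* the accumulator */
        for (int k = 0; k < taps; k++)
            acc += h[k] * x[-k];         /* one MAC per tap */
        return acc;
    }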
2. Application Specific Instruction Set Processor (ASIP)
ASIP: Programmable microprocessor where hardware and instruction set are designed together for one special application
Instruction set, microarchitecture and/or memory system are customised for an application or family of applications
Usually, they are divided into two parts: static logic which defines a minimum ISA and configurable logic which can be used to design new instructions
The configurable logic can be programmed to extend the instruction set, similarly to FPGAs
Better performance, lower cost and lower power consumption than GPPs
3. Application Specific Integrated Circuit (ASIC)
ASIC: Algorithm completely implemented in hardware
An Integrated Circuit (IC) designed for a specific product line of a company – full custom
It cannot be modified – it is produced as a single, specific product for a particular application only
Proprietary by nature and not available to the general public
ASICs are full custom and therefore require very high development costs
An ASIC is built for one and only one customer
An ASIC is used in only one product line
Only volume production of an ASIC makes sense: unit cost is low for high-volume products; otherwise, it is not cost efficient
Implementing an ASIC takes a lot of effort – designs are written in hardware description languages such as VHDL and Verilog
Building an application specific system on an embedded system (1)
Consider that we want to build an application specific system. We can choose:
1. GPP
The functionality of the system is built exclusively in software
It is not efficient in terms of performance, power consumption, cost, chip area and heat dissipation
2. ASIC:
No flexibility and extensibility
3. ASIP:
a compromise between the two extremes
used in embedded and system-on-chip solutions
Fig.4. Comparison between performance and flexibility for GPP, ASIP and ASIC
Building an application specific system on an embedded system (2)

              GPP          ASIP               ASIC
Performance   Low          High               Very high
Flexibility   Excellent    Good               Poor
HW design     None         Large              Very large
SW design     Small        Large              None
Power         Large        Medium             Small
Reuse         Excellent    Good               Poor
Market        Very large   Relatively large   Small
Cost          High         Medium             Volume sensitive

Table 1. Comparison between different approaches for building embedded systems [1]
Accelerators – coprocessors
Accelerators / co-processors are used to perform some functions more efficiently than the CPU
They offer
Higher performance
Lower power consumption
Higher performance per Watt
But they are harder to program
Field Programmable Gate Arrays (FPGAs)
FPGAs are devices that allow us to create our own digital circuits
An FPGA (Field Programmable Gate Array) is an array of logic gates that can be hardware-programmed to fulfill user-specified tasks
FPGAs contain programmable logic components called “logic blocks”, and a hierarchy of reconfigurable interconnects that allow the blocks to be “wired together”
An application can be implemented entirely in HW
The FPGA configuration is generally specified using a hardware description language (HDL) like VHDL and Verilog – hard to program
High Level Synthesis (HLS) provides a solution to this problem: engineers write C/C++ code instead, but it is not that efficient yet
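As a taste of HLS, here is a minimal sketch of a C function that a synthesis tool can turn into a hardware pipeline. The #pragma HLS directive follows the Xilinx Vivado/Vitis HLS convention and is an assumption here – other HLS tools use different directives:

    /* HLS sketch: the tool synthesises this loop into a pipelined
       datapath on the FPGA fabric rather than compiling it for a CPU. */
    void vadd(const int a[1024], const int b[1024], int c[1024])
    {
        for (int i = 0; i < 1024; i++) {
    #pragma HLS PIPELINE II=1   /* request one result per clock cycle */
            c[i] = a[i] + b[i];
        }
    }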
FPGAs (2)
FPGAs come on a board, which is connected to a PC and programmed; the board can then work as a standalone component
FPGAs (3)
Unlike an ASIC, the circuit design is not fixed – you can reconfigure an FPGA as many times as you like!
Creating an ASIC can also cost millions of dollars and take weeks or months
However, the recurring (per-unit) cost of an ASIC is lower than that of an FPGA (no silicon area is wasted in ASICs)
ASICs are cheaper only when the production volume is very high
Intel plans hybrid CPU-FPGA chips
GPUs (1)
The GPU’s advanced capabilities were originally used primarily for 3D game graphics. But now those capabilities are being harnessed more broadly to accelerate computational workloads in other areas too
GPUs are very efficient for
Data parallel applications
Throughput intensive applications – the algorithm is going to process lots of data elements
GPUs (2) – why do we need GPUs?
GPUs (3)
A GPU is always connected to a CPU – GPUs are coprocessors
GPUs run at lower frequencies than CPUs
GPUs have many processing elements (up to 1000)
GPUs have smaller and faster cache memories
OpenCL is the dominant open general-purpose GPU computing language, and is an open standard
The dominant proprietary framework is Nvidia CUDA
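For flavour, here is a minimal OpenCL C kernel sketch (host-side setup – context, buffers, command queue – is omitted); the GPU launches one instance of it per data element, which is exactly the data-parallel style GPUs are built for:

    /* OpenCL C kernel: one work-item per array element.
       Thousands of these instances run in parallel on the GPU. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        int i = get_global_id(0);   /* this work-item's index */
        c[i] = a[i] + b[i];
    }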
Schematic of Nvidia GPU architecture
Multi-core CPUs
Multiple cores on the same chip using a shared cache
Typically from 2 to 8 cores
The cores compete for the same hardware resources
All cores are identical
Every core is a superscalar, out-of-order CPU
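Software only benefits from the extra cores if it is multi-threaded. A minimal C sketch using OpenMP (assuming compiler support, e.g., GCC's -fopenmp flag) splits a loop's iterations across the cores:

    #include <omp.h>

    /* Each core executes a contiguous chunk of the iterations. */
    void scale(float *a, int n, float k)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= k;
    }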
Multi-core CPUs – ARM Cortex-A15
ARM Cortex-A15
Multi-core CPUs – Intel i7 architecture
The figure below shows the Intel i7 CPU, where four CPU cores and a GPU reside on the same chip
Many-core Processors – Intel Xeon Phi
They are intended for use in supercomputers, servers, and high-end workstations
57-61 in-order cores, simpler than i7 cores
1-1.7 GHz
512-bit vector instructions
Each core is connected to a ring interconnect via the Core Ring Interface
Comparison

Fig.: Spectrum from flexibility and programming abstraction (Intel CPU, DSP, MultiCore, ManyCore, GPU) to performance, area and power efficiency (FPGA, ASIC)

CPU:
• Market-agnostic
• Accessible to many programmers (Python, C++)
• Flexible, portable

FPGA:
• Somewhat restricted market
• Harder to program (VHDL, Verilog)
• More efficient than SW, more expensive than ASIC

ASIC:
• Market-specific
• Fewer programmers
• Rigid, less programmable
• Hard to build (physical)
Superscalar and Out of Order is not enough (1)
The approach of exploiting ILP through superscalar execution is seriously weakened by the fact that normally programs don't have a lot of fine-grained parallelism in them
Because of this, CPUs normally don’t exceed more than 3 instructions per cycle when running most mainstream, real-world software, due to a combination of load latencies, cache misses, branching and dependencies between instructions
Fig.: Superscalar out-of-order pipeline – Fetch, Decode, Dispatch, Reservation Stations, Execute (Load unit, two ALUs, FPU), Reorder Buffer, Store Buffer, Commit
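A small C sketch of why dependencies cap ILP (the names are illustrative): in the first loop every addition waits for the previous one, so extra issue width is wasted; using two independent accumulators exposes parallelism the superscalar hardware can actually use:

    /* Dependent chain: each add needs the previous value of s,
       so the core cannot overlap the additions. */
    float sum_serial(const float *a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Two independent chains: both adds can issue in the same
       cycle on different units (n assumed even; floating-point
       rounding may differ slightly due to reassociation). */
    float sum_pair(const float *a, int n)
    {
        float s0 = 0.0f, s1 = 0.0f;
        for (int i = 0; i < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        return s0 + s1;
    }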
Superscalar and Out of Order is not enough (2)
Issuing many instructions in the same cycle only ever happens for short bursts
Moreover, the dispatch logic of a 5-issue processor is more than 50% larger than a 4-issue design (chip area), with 6-issue being more than twice as large, 7-issue over 3 times the size, 8-issue more than 4 times larger than 4-issue (for only 2 times the width), and so on
• Exploiting instruction-level parallelism is expensive
Superscalar and Out of Order is not enough (3)
Very important features that further improve the performance of CPUs are:
Simultaneous multi-threading (SMT), or Hyper-Threading in Intel processors
Single Instruction Multiple Data (SIMD) – vectorization
Simultaneous multi-threading (SMT) as a solution to improve CPU’s performance (1)
SMT is the process of a CPU splitting each of its physical cores into virtual cores
Normally 2 threads are executed in one physical CPU core
If additional independent instructions aren’t available within the program being executed, there is another potential source of independent instructions – other running programs, or other threads within the same program
Simultaneous multi-threading (SMT) is a processor design technique which exploits exactly this type of thread-level parallelism
The idea is to fill the empty bubbles in the pipelines with useful instructions, but this time, rather than using instructions from further down in the same code, the instructions come from multiple threads running at the same time, all on the one processor core
So, an SMT processor appears to the rest of the system as if it were multiple independent processors, just like a true multi-processor system
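This is visible from software: the operating system simply reports more processors. A minimal POSIX C sketch (Linux/Unix assumed) – on a 4-core CPU with 2-way SMT it typically prints 8:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Logical processors = physical cores x SMT threads per core */
        long logical = sysconf(_SC_NPROCESSORS_ONLN);
        printf("Logical processors: %ld\n", logical);
        return 0;
    }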
Simultaneous multi-threading (SMT) as a solution to improve CPU’s performance (2)
From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the “execution state” of each thread
These parts constitute only a tiny fraction of the overall processor's hardware
The really large and complex parts, such as the decoders and dispatch logic, the functional units, and the caches, are all shared between the threads
On top of this, the fact that the threads in an SMT design are all sharing just one processor core and just one set of caches, has major performance downsides compared to a true multi-processor (or multi-core)
SMT performance can actually be worse than single-thread performance
Speedups from SMT on the Pentium 4 ranged from around -10% to +30% depending on the application(s)
Single Instruction Multiple Data (SIMD) – Vectorization (1)
In addition to instruction-level parallelism, there is yet another source of parallelism – data parallelism
Rather than looking for ways to execute groups of instructions in parallel, the idea is to look for ways to make one instruction apply to a group of data values in parallel
This is sometimes called SIMD parallelism (single instruction multiple data). More often, it’s called vector processing
Single Instruction Multiple Data (SIMD) – Vectorization (2)
Single Instruction Multiple Data (SIMD) – Vectorization (3)
There is specific hardware (HW) supporting a variety of vector instructions as well as wide registers
General Purpose Microprocessors
Laptops, desktops, servers
From 64-bit up to 512-bit vector instructions – all kinds of instructions are supported, e.g., load/store, add/multiply, if-conditions
Microprocessors for embedded systems or Microcontrollers
From 32-bit up to 128-bit vector instructions
Limited instruction set for microcontrollers, but not for microprocessors
Single Instruction Multiple Data (SIMD) – Vectorization (4)
Modern compilers use auto-vectorization – the compiler does this for us
For applications where this type of data parallelism is available and easy to extract, SIMD vector instructions can produce amazing speedups
Unfortunately, it’s quite difficult for a compiler to automatically make use of vector instructions
Hand-written code is more efficient
The key problem is that the way programmers write programs tends to serialize everything, which makes it difficult for a compiler to prove two given operations are independent and can be done in parallel.
Rewriting just a small amount of code in key places has a widespread effect across many applications
Almost every CPU has now added SIMD vector extensions
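As an illustration of hand-written vector code, here is a minimal C sketch using x86 SSE intrinsics (assumptions: an SSE-capable CPU and an array length that is a multiple of 4); each intrinsic maps to one SIMD instruction operating on four floats at once:

    #include <xmmintrin.h>   /* x86 SSE intrinsics */

    /* c[i] = a[i] + b[i], four elements per iteration. */
    void vec_add_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* 4 adds, 1 instruction */
            _mm_storeu_ps(&c[i], vc);          /* store 4 floats */
        }
    }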
Current trend
Hardware Trends
From single core processors to heterogeneous systems on a chip
Taken from https://embb.io/downloads/MTAPI_EMBB.pdf
The CPU frequency has ceased to grow
Hardware Evolution
Scalar Processors
Pipelined Processors
Superscalar and VLIW Processors
Out-of-order Processors
Vectorization
Multicore Processors
Heterogeneous systems
Heterogeneous computing (1)
Single core Era -> Multi-core Era -> Heterogeneous Systems Era
Heterogeneous computing refers to systems that use more than one kind of processor or core
These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar (co)processors, usually incorporating specialized processing capabilities to handle particular tasks
Systems with General Purpose Processors (GPPs), GPUs, DSPs, ASIPs etc.
Heterogeneous systems offer the opportunity to significantly increase system performance and reduce system power consumption
Heterogeneous computing (2)
Software issues: offloading
Programmability – think about CPU code (C code), GPU code (CUDA), FPGA code (VHDL)
Portability – what happens if your code runs on a machine with an FPGA instead of a GPU?
Heterogeneous computing (3) – A mobile phone system
Fig.: A mobile phone system – two CPU clusters of 4 cores each
Think-Pair-Share Exercise
What, in your opinion, is the most appropriate computer architecture for a smartphone, and why?
a. 1 microcontroller
b. 1 normal speed GPP, e.g., Pentium II
c. 1 quad-core Intel i7
d. A heterogeneous computer architecture with 1 normal speed GPP, 1 DSP, 1 GPU and a few Microcontrollers
Conclusions
Modern Computer Systems include Parallel Heterogeneous Computer Architectures
General purpose processors + specific purpose processors + co-processors
Heterogeneous systems offer the opportunity to:
significantly increase performance
reduce power consumption
reduce cost
Issues:
Programmability
Portability
Designing good compilers that optimize the code
References and Further Reading
[1] Nohl, A., Schirrmeister, F. & Taussig, D., "Application specific processor design: Architectures, design methods and tools", Computer-Aided Design (ICCAD), 2010 IEEE/ACM International Conference on, Nov. 2010.
[2] Tom Spyrou, Challenges in the Static Timing Analysis of FPGAs, ALTERA, TAU 2015, 3/2015.
[3] Modern Microprocessors – A 90-Minute Guide!, Lighterra, http://www.lighterra.com/papers/modernmicroprocessors/
[4] Introduction to GPU computing, available at http://www.int.washington.edu/PROGRAMS/12-2c/week3/clark_01.pdf
[5] Yousef Qasim, Pradyumna Janga, Sharath Kumar, Hani Alesaimi, Application Specific Processors, ECE/CS 570 Project Final Report, available at http://web.engr.oregonstate.edu/~qassimy/index_files/Final_ECE570_ASP_2012_Project_Report.pdf
[6] William Stallings, Computer Organization & Architecture: Designing for Performance, Seventh Edition.
[7] Andrew S. Tanenbaum, Todd Austin, Structured Computer Organization, Sixth Edition, Pearson.
Thank you
School of Computing (University of Plymouth)
Date: 04/11/2019