
SEC204
Computer architectures and low level programming
Dr. Vasilios Kelefouras
Email: v.kelefouras@plymouth.ac.uk Website: https://www.plymouth.ac.uk/staff/vasilios-kelefouras
1
School of Computing (University of Plymouth)
Date
04/11/2019

2
Computer Architectures – Last Pieces of the Puzzle
Too many puzzling words:
• x86, RISC, CISC, EPIC, VLIW, Harvard architecture
• SIMD
• Microcontrollers, ASIC, ASIP, FPGA, GPU, DSP
• Pipeline, vector processing, superscalar, hyper-threading, multi-threading
• Heterogeneous systems

3
Outline
 Different computer architectures – classified regarding purpose
 General Purpose Processors
 Application Specific Processors
 Coprocessors / accelerators
 Multi-core processors
 Many-core processors
 Simultaneous Multithreading
 Single Instruction Multiple Data
 Heterogeneous Systems

4

5
Computer architectures – classified regarding purpose (1)
Fig.1. CPU market analysis

6
Computer architectures – classified regarding purpose (2)
1. General Purpose Processors
2. Specific Purpose Processors
3. Accelerators, also called co-processors

7
General Purpose Processors (GPP)
 They are classified into:
1. General purpose microprocessors – general purpose computers, e.g., desktop PCs, laptops
 Very powerful CPUs, e.g., Intel, AMD and now Arm too
 Superscalar and Out of Order, big cache memories, lots of pipeline stages
2. Microcontrollers – Embedded systems
 Less powerful CPUs, e.g., ARM, Texas Instruments
 They are usually designed for specific tasks in embedded systems
 They usually have control oriented peripherals
 They have on chip CPU, fixed amount of RAM, ROM, I/O ports
 Lower cost, lower performance, lower power consumption, smaller than microprocessors
 Appropriate for applications in which cost, power consumption and chip area are critical

8
GPP – General Purpose Microprocessor
 General Purpose Microprocessor – general purpose computers
 They are designed for general purpose computers such as PCs, workstations, laptops, notepads, etc.
 Higher CPU frequency than microcontrollers
 Higher cost than microcontrollers
 Higher performance than microcontrollers
 Higher power consumption than microcontrollers
 General purpose processors are designed to execute multiple applications and perform multiple tasks

9
GPP – Microcontrollers
Fig.2. Microcontrollers
Fig.3. Components of the Microcontroller

10
Application Specific Processors (1)
 General purpose processors offer good performance across all different applications, but specific purpose processors offer better performance for a specific task
 Application specific processors emerged as a solution for
 higher performance
 lower power consumption
 lower cost
 Application specific processors have become a part of our life and can be found almost in every device we use on a daily basis
 Devices such as TVs, mobile phones and GPS receivers all contain application specific processors
 They are classified into
1. Digital Signal Processors (DSPs)
2. Application Specific Instruction Set Processors (ASIPs)
3. Application Specific Integrated Circuits (ASICs)

11
1. Digital Signal Processors (DSPs)
DSP: Programmable microprocessor for extensive real-time mathematical computations
 specialized microprocessor with its architecture optimized for the operational needs of digital signal processing
 DSP processors are designed specifically to perform large numbers of complex arithmetic calculations as quickly as possible
 DSPs tend to have a different arithmetic unit architecture:
 specialized hardware units, such as bit-reversal addressing and multiple Multiply-Accumulate (MAC) units (see the FIR sketch at the end of this slide)
 Normally DSPs have a small instruction cache but no data cache memory
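To see why MAC units matter, here is a minimal illustrative sketch (plain C, written for this handout – the function and variable names are ours, not from the slides) of a finite impulse response (FIR) filter, one of the most common DSP kernels. The inner loop is one multiply-accumulate per filter tap, which is exactly the operation a DSP can issue in a single cycle per MAC unit.

#include <stddef.h>

/* FIR filter: y[i] = sum over k of h[k] * x[i-k]
   Each inner-loop iteration is one multiply-accumulate (MAC),
   the operation that dedicated DSP hardware accelerates. */
void fir(const float *x, const float *h, float *y, size_t n, size_t taps)
{
    for (size_t i = taps - 1; i < n; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++)
            acc += h[k] * x[i - k];   /* multiply-accumulate */
        y[i] = acc;
    }
}

On a DSP with, say, two MAC units and hardware loop support the inner loop runs at roughly two taps per cycle, whereas a plain scalar CPU needs separate multiply, add and loop-control instructions per tap.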

12
2. Application Specific Instruction Set Processor (ASIP)
ASIP: Programmable microprocessor where hardware and instruction set are designed together for one special application
 Instruction set, micro architecture and/or memory system are customised for an application or family of applications
 Usually, they are divided into two parts: static logic which defines a minimum ISA and configurable logic which can be used to design new instructions
 The configurable logic can be programmed to extend the instruction set, similarly to FPGAs
 Better performance, lower cost, and lower power consumption than a GPP

13
3. Application Specific Integrated Circuit (ASIC)
ASIC: Algorithm completely implemented in hardware
 An Integrated Circuit (IC) designed for a specific product line of a company – full custom
 It cannot be modified – it is produced as a single, specific product for a particular application only
 Proprietary by nature and not available to the general public
 ASICs are full custom therefore they require very high development costs
 ASIC is just built for one and only one customer
 ASIC is used only in one product line
 Only volume production of an ASIC for one product makes sense: the unit cost is low for high-volume products, otherwise an ASIC is not cost efficient (a worked example follows at the end of this slide)
 Implementing an ASIC takes a lot of effort – dedicated hardware description languages such as VHDL and Verilog are used
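A hypothetical worked example of the volume argument above (the figures are illustrative, not from the slides): suppose an ASIC needs $2,000,000 of one-off design and mask (NRE) cost plus $5 per chip, while an off-the-shelf FPGA doing the same job costs $50 per chip with no NRE. The ASIC only becomes cheaper once more than 2,000,000 / (50 − 5) ≈ 45,000 units are produced; below that volume the FPGA (or a programmable processor) is the more economical choice.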

14
Building an application specific system on an embedded system (1)
Consider that we want to build an application specific system. We can choose:
1. GPP
The functionality of the system is built exclusively at the software level
 it is not efficient in terms of performance, power consumption, cost, chip area and heat dissipation
2. ASIC:
No flexibility and extensibility
3. ASIP:
a compromise between the two extremes
 used in embedded and system-on-chip solutions
Fig.4. Comparison between performance and flexibility for GPP, ASIP and ASIC

15
Building an application specific system on an embedded system (2)

              GPP          ASIP               ASIC
Performance   Low          High               Very High
Flexibility   Excellent    Good               Poor
HW design     None         Large              Very large
SW design     Small        Large              None
Power         Large        Medium             Small
Reuse         Excellent    Good               Poor
Market        Very large   Relatively large   Small
Cost          High         Medium             Volume sensitive

Table 1. Comparison between different approaches for Building Embedded Systems [1]

16
Accelerators – coprocessors
 Accelerators / co-processors are used to perform some functions more efficiently than the CPU
 They offer
 Higher performance
 Lower power consumption
 Higher performance per Watt
 But they are harder to program

17
Field Programmable Gate Arrays (FPGAs)
 FPGAs are devices that allow us to create our own digital circuits
 An FPGA (Field Programmable Gate Array) is an array of logic gates that can be hardware-programmed to fulfill user-specified tasks
 FPGAs contain programmable logic components called “logic blocks”, and a hierarchy of reconfigurable interconnects that allow the blocks to be “wired together”
 An application can be implemented entirely in HW
 The FPGA configuration is generally specified using a hardware description language (HDL) like VHDL and Verilog – hard to program
 High Level Synthesis (HLS) provides a solution to this problem. Engineers write C/C++ code instead, but it is not that efficient yet
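As a rough illustration of what HLS input looks like, the sketch below is ordinary C with a tool-specific directive; the #pragma follows the Xilinx Vivado HLS convention and the function itself is a made-up example, not taken from the slides.

/* Vector addition written for a High Level Synthesis tool.
   The tool turns the loop into dedicated pipelined hardware;
   the pragma asks for one new loop iteration per clock cycle
   (Vivado HLS syntax – other HLS tools use different directives). */
void vadd(const int a[1024], const int b[1024], int c[1024])
{
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        c[i] = a[i] + b[i];
    }
}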

18
FPGAs (2)
 FPGAs come on a board. This board is connected to a PC and programmed. Then, it can work as a standalone component

19
FPGAs (3)
 Unlike an ASIC the circuit design is not set and you can reconfigure an FPGA as many times as you like!
 Creating an ASIC also costs potentially millions of dollars and takes weeks or months to create.
 However, the recurring cost is lower than the cost of the FPGA (no silicon area is wasted in ASICs).
 ASICs are cheaper only when the production number is very high
 Intel plans hybrid CPU-FPGA chips

20
GPUs (1)
 The GPU’s advanced capabilities were originally used primarily for 3D game graphics. But now those capabilities are being harnessed more broadly to accelerate computational workloads in other areas too
 GPUs are very efficient for
 Data parallel applications
 Throughput intensive applications – the algorithm is going to process lots of data elements
 Graphics Processing Unit (GPU)

21
GPUs (2) – why do we need GPUs?

22
GPUs (3)
 A GPU is always connected to a CPU – GPUs are coprocessors
 GPUs work in lower frequencies than CPUs
 GPUs have many simple processing elements (hundreds or even thousands)
 GPUs have smaller and faster cache memories
 OpenCL is the dominant open general-purpose GPU computing language, and is an open standard
 The dominant proprietary framework is Nvidia CUDA

23
Schematic of Nvidia GPU architecture

24
Multi-core CPUs
 Multiple cores on the same chip using a shared cache
 Typically from 2 to 8 cores
 The cores compete for the same hardware resources
 All cores are identical
 Every core is a superscalar out-of-order CPU
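The slides above describe the hardware; as a hedged illustration of how software typically exploits multiple identical cores, the minimal sketch below uses OpenMP (the array names and size are ours). The compiler and runtime split the loop iterations across the available cores.

#include <omp.h>
#include <stdio.h>

#define N 1000000

/* The iterations of the parallel loop are divided between all
   available CPU cores; each core works on its own chunk. */
int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("up to %d threads were available\n", omp_get_max_threads());
    return 0;
}

Compiled with an OpenMP-enabled compiler (e.g., gcc -fopenmp), the same source runs unchanged on a dual-core or an 8-core CPU.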

25
Multi-core CPUs – ARM Cortex-A15

26
ARM Cortex- A15

27
Multi-core CPUs – Intel i7 architecture
 The figure below shows an Intel i7 CPU, where four CPU cores and the GPU reside on the same chip

28
Many core Processors – Intel Xeon Phi
 They are intended for use in supercomputers, servers, and high-end workstations
 57-61 in-order cores, simpler than the i7's cores
 1-1.7 GHz
 512-bit vector instructions
 each core is connected to a ring interconnect via the Core Ring Interface

29
Comparison

Moving from CPUs (Intel CPU, MultiCore, ManyCore) through DSPs and GPUs to FPGAs and ASICs, flexibility and programming abstraction decrease while performance, area and power efficiency increase.

CPU:
• Market-agnostic
• Accessible to many programmers (Python, C++)
• Flexible, portable

FPGA:
• Somewhat restricted market
• Harder to program (VHDL, Verilog)
• More efficient than SW
• More expensive than ASIC

ASIC:
• Market-specific
• Fewer programmers
• Rigid, less programmable
• Hard to build (physical)

30
Superscalar and Out of Order is not enough (1)
 The approach of exploiting ILP through superscalar execution is seriously weakened by the fact that programs normally don't have a lot of fine-grained parallelism in them
 Because of this, CPUs normally don't exceed 3 instructions per cycle when running most mainstream, real-world software, due to a combination of load latencies, cache misses, branching and dependencies between instructions (see the sketch at the end of this slide)
Fig. Superscalar out-of-order pipeline: Fetch, Decode, Dispatch, Reservation Stations, Execute (Load unit, ALU, ALU, FPU), Reorder Buffer, Store Buffer, Commit
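To make the dependency problem concrete, here is a minimal C sketch (our own example, not from the slides): in the first function every operation needs the previous result, so a superscalar core cannot issue the multiplies in parallel; in the second, the three multiplies are independent and can issue in the same cycle on a 3-wide core.

/* Dependent chain: each multiply waits for the previous result,
   so instruction-level parallelism is close to 1. */
double chain(double x)
{
    double a = x * 1.1;
    double b = a * 1.2;   /* waits for a */
    double c = b * 1.3;   /* waits for b */
    return c;
}

/* Independent operations: all three multiplies can be issued in
   the same cycle by a 3-wide superscalar core. */
double independent(double x, double y, double z)
{
    double a = x * 1.1;
    double b = y * 1.2;
    double c = z * 1.3;
    return a + b + c;
}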

31
Superscalar and Out of Order is not enough (2)
 Issuing many instructions in the same cycle only ever happens for short bursts
 Moreover, the dispatch logic of a 5-issue processor is more than 50% larger than a 4-issue design (chip area), with 6-issue being more than twice as large, 7-issue over 3 times the size, 8-issue more than 4 times larger than 4-issue (for only 2 times the width), and so on
• Exploiting instruction-level parallelism is expensive

32
Superscalar and Out of Order is not enough (3)
Very important features that further improve the performance of CPUs are:
 Simultaneous multi-threading (SMT), or Hyper-Threading in Intel processors
 Single Instruction Multiple Data (SIMD) – vectorization

33
Simultaneous multi-threading (SMT) as a solution to improve CPU’s performance (1)
 SMT is the process of a CPU splitting each of its physical cores into virtual cores
 Normally 2 threads are executed in one physical CPU core
 If additional independent instructions aren’t available within the program being executed, there is another potential source of independent instructions – other running programs, or other threads within the same program
 Simultaneous multi-threading (SMT) is a processor design technique which exploits exactly this type of thread-level parallelism
 Fill the empty bubbles in the pipelines with useful instructions, but this time rather than using instructions from further down in the same code, the instructions come from multiple threads running at the same time, all on the one processor core
 So, an SMT processor appears to the rest of the system as if it were multiple independent processors, just like a true multi-processor system

34
Simultaneous multi-threading (SMT) as a solution to improve CPU’s performance (2)
 From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the “execution state” of each thread
 These parts only constitute a tiny fraction of the overall processor's hardware
 The really large and complex parts, such as the decoders and dispatch logic, the functional units, and the caches, are all shared between the threads
 On top of this, the fact that the threads in an SMT design are all sharing just one processor core and just one set of caches, has major performance downsides compared to a true multi-processor (or multi-core)
 SMT performance can actually be worse than single-thread performance
 Speedups from SMT on the Pentium 4 ranged from around -10% to +30% depending on the application(s)

35
Single Instruction Multiple Data (SIMD) – Vectorization (1)
 In addition to instruction-level parallelism, there is yet another source of parallelism – data parallelism
 Rather than looking for ways to execute groups of instructions in parallel, the idea is to look for ways to make one instruction apply to a group of data values in parallel
 This is sometimes called SIMD parallelism (single instruction multiple data). More often, it’s called vector processing

36
Single Instruction Multiple Data (SIMD) – Vectorization (2)

37
Single Instruction Multiple Data (SIMD) – Vectorization (3)
 There is specific hardware (HW) supporting a variety of vector instructions as well as wide registers
 General Purpose Microprocessors
 Laptops, desktops, servers
 From 64-bit up to 512-bit vector instructions – all kinds of instructions are supported, e.g., load/store, add/multiply, if-conditions
 Microprocessors for embedded systems or Microcontrollers
 From 32-bit up to 128-bit vector instructions
 the vector instruction set is limited on microcontrollers, but not on embedded microprocessors

38
Single Instruction Multiple Data (SIMD) – Vectorization (4)
 Modern compilers use auto-vectorization – the compiler does this for us
 For applications where this type of data parallelism is available and easy to extract, SIMD vector instructions can produce amazing speedups
 Unfortunately, it's quite difficult for a compiler to automatically make use of vector instructions
 hand-written code is often more efficient (see the sketch at the end of this slide)
 The key problem is that the way programmers write programs tends to serialize everything, which makes it difficult for a compiler to prove two given operations are independent and can be done in parallel.
 Rewriting just a small amount of code in key places has a widespread effect across many applications
 Almost every CPU has now added SIMD vector extensions
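Here is a minimal sketch of what hand-written vector code looks like, using x86 SSE intrinsics (the function and array names are ours, and we assume n is a multiple of 4 and the arrays are 16-byte aligned). The scalar loop is what a programmer writes naturally and what an auto-vectorizer tries to transform; the second version states the data parallelism explicitly.

#include <immintrin.h>   /* x86 SSE/AVX intrinsics */

/* Scalar version: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Hand-vectorized version: four single-precision additions per
   instruction using 128-bit SSE registers. */
void add_sse(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));
    }
}

The same idea scales to 256-bit AVX (8 floats per instruction) and 512-bit AVX-512 (16 floats per instruction) on CPUs that support them.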

39
Current trend

40
Hardware Trends
From single core processors to heterogeneous systems on a chip
Taken from https://embb.io/downloads/MTAPI_EMBB.pdf

41
The CPU frequency has ceased to grow

42

43
Hardware Evolution
 Scalar Processors
 Pipelined Processors
 Superscalar and VLIW Processors
 Out of order Processors
 Vectorization
 Multicore Processors
 Heterogeneous systems

44
Heterogeneous computing (1)
Single core Era -> Multi-core Era -> Heterogeneous Systems Era
 Heterogeneous computing refers to systems that use more than one kind of processor or core
 These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar (co)- processors, usually incorporating specialized processing capabilities to handle particular tasks
 Systems with General Purpose Processors (GPPs), GPUs, DSPs, ASIPs etc.
 Heterogeneous systems offer the opportunity to significantly increase system performance and reduce system power consumption

45
Heterogeneous computing (2)
 Software issues:
 Offloading (a sketch follows at the end of this slide)
 Programmability – think about CPU code (C code), GPU code (CUDA), FPGA code (VHDL)
 Portability – what happens if your code runs on a machine with an FPGA instead of a GPU?
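One common way to express offloading without writing CUDA or VHDL directly is a directive-based model such as OpenMP target offloading. The sketch below is illustrative (the array names and size are ours): the map clauses describe the data copied to and from the accelerator, and if the compiler or machine has no supported device the loop simply falls back to the host CPU – which is exactly the portability question raised above.

#include <stdio.h>

#define N 1000000

/* Offload a data-parallel loop to an attached accelerator (e.g. a GPU)
   using OpenMP 4.5 target directives. */
int main(void)
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}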

46
Heterogeneous computing (3) – A mobile phone system
Fig. A mobile phone SoC containing two CPU clusters of 4 cores each

47
Think-Pair-Share Exercise
 What is in your opinion the most appropriate computer architecture for a smart phone and why?
a. 1 microcontroller
b. 1 normal speed GPP, e.g., Pentium II
c. 1 quad-core Intel i7
d. A heterogeneous computer architecture with 1 normal speed GPP, 1 DSP, 1 GPU and a few Microcontrollers

48
Conclusions
 Modern Computer Systems include Parallel Heterogeneous Computer Architectures
 General purpose processors + specific purpose processors + co- processors
 Heterogeneous systems offer the opportunity to significantly
 increase performance
 reduce power consumption
 reduce cost
 Issues:
 Programmability
 Portability
 Design good Compilers – optimize the code

49
References and Further Reading
[1] Nohl, A., Schirrmeister, F. & Taussig, D., “Application specific processor design: Architectures, design methods and tools”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2010.
[2] Tom Spyrou, “Challenges in the Static Timing Analysis of FPGAs”, Altera, TAU 2015, March 2015.
[3] “Modern Microprocessors: A 90-Minute Guide!”, Lighterra, http://www.lighterra.com/papers/modernmicroprocessors/
[4] “Introduction to GPU Computing”, available at http://www.int.washington.edu/PROGRAMS/12-2c/week3/clark_01.pdf
[5] Yousef Qasim, P. Radyumna Janga, Sharath Kumar, Hani Alesaimi, “Application Specific Processors”, ECE/CS 570 Project Final Report, available at http://web.engr.oregonstate.edu/~qassimy/index_files/Final_ECE570_ASP_2012_Project_Report.pdf
[6] William Stallings, Computer Organization & Architecture. Designing for Performance, Seventh Edition
[7] Andrew S. Tanenbaum, Todd Austin, Structured Computer Organization. Sixth Edition, PEARSON

Thank you
School of Computing (University of Plymouth)
Date 04/11/2019