MEC302 Embedded Computer Systems
Parallelism: Instruction level and multicore Architecture
Types of Processors
• Microprocessors and Microcontrollers
• DSP Processors

• Graphics Processors
Dr. Sanghyuk Lee
Parallelism
• Parallelism vs Concurrency
• Pipelining
• Instruction-Level Parallelism
• Multicore Architectures

Parallelism in Hardware
• Achieving high performance requires parallelism in the hardware
• This takes two broad forms:
− Instruction-level parallelism (ILP)
− Multicore architectures

Instruction Level Parallelism (ILP)
• A processor supporting ILP is able to perform multiple independent operations in each instruction cycle.
• There are 4 major forms of ILP:
− CISC (complex instruction set computer) instructions, in contrast to RISC (reduced instruction set computer)
− Subword parallelism
− Superscalar
− VLIW (very long instruction word)

Complex Instruction Set Computer (CISC)
• A processor with complex instructions is called a CISC machine.
• The philosophy behind such processors is different from that of RISC (reduced instruction set computers).
When do we use CISC machines?
• DSPs are typically CISC machines; they include instructions supporting FIR filtering.
• In fact, to qualify as a DSP, a processor must be able to perform FIR filtering in one instruction cycle per tap.
• Disadvantage: it is extremely challenging for a compiler to make optimal use of such an instruction set.
− DSPs are therefore used with code libraries written and optimized in assembly language.

• Consider the inner loop of an FIR computation, $y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i)$:
RPT numberOfTaps - 1    ; zero-overhead loop
MAC *AR2+, *AR3+, A
– RPT sets up a zero-overhead loop: the instruction that comes after it executes a number of times equal to one plus the argument of the RPT instruction.
– The MAC instruction is a multiply-accumulate instruction, also prevalent in DSP architectures.
– It has three arguments specifying the calculation a := a + x * y, where a is the contents of an accumulator register named A, and x and y are values found in memory.
– The addresses of these values are held in auxiliary registers AR2 and AR3. These registers are incremented automatically after each access.
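
The RPT/MAC pair above can be mimicked in a few lines of Python. This is only an illustrative sketch, not DSP code: the function name `fir_output`, the index variables `ar2` and `ar3` (standing in for the auxiliary registers), and the sample data are all assumptions made for the example.

```python
def fir_output(coeffs, samples):
    """Compute one FIR output y(n) with one multiply-accumulate per tap.

    coeffs[i] holds a_i and samples[i] holds x(n - i) (newest sample first).
    """
    acc = 0   # plays the role of accumulator register A
    ar2 = 0   # index standing in for auxiliary register AR2
    ar3 = 0   # index standing in for auxiliary register AR3
    # "RPT numberOfTaps - 1" makes the next instruction run numberOfTaps times.
    for _ in range(len(coeffs)):
        acc += coeffs[ar2] * samples[ar3]   # MAC: a := a + x * y
        ar2 += 1                            # post-increment, as in *AR2+
        ar3 += 1                            # post-increment, as in *AR3+
    return acc


coeffs = [1, 2, 3]    # a_0, a_1, a_2 (made-up values)
samples = [4, 5, 6]   # x(n), x(n-1), x(n-2) (made-up values)
print(fir_output(coeffs, samples))   # 1*4 + 2*5 + 3*6 = 32
```

On a DSP the loop control and the pointer increments cost no extra cycles, which is what allows one tap per instruction cycle; in Python they are ordinary statements.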

RISC Vs CISC: An Example Multiplying Two Numbers in Memory
On the right is a diagram representing the storage scheme for a generic computer. The main memory is divided into locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4.
The execution unit is responsible for carrying out all computations. However, the execution unit can only operate on data that has been loaded into one of the six registers (A, B, C, D, E, or F).
Let’s say we want to find the product of two numbers – one stored in location 2:3 and another stored in location 5:2 – and then store the product back in the location 2:3.

CISC Approach
• The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible.
• This is achieved by building processor hardware that is capable of understanding and executing a series of operations. For this particular task, a CISC processor would come prepared with a specific instruction (we’ll call it “MULT”).
• When executed, this instruction loads the two values into separate registers, multiplies the operands in the execution unit, and then stores the product in the appropriate register. Thus, the entire task of multiplying two numbers can be completed with one instruction:
MULT 2:3, 5:2
• MULT is what is known as a “complex instruction.” It operates directly on the computer’s memory banks and does not require the programmer to explicitly call any loading or storing functions.
• One of the primary advantages of this system is that the compiler has to do very little work to translate a high-level language statement into assembly. Because the length of code is relatively short, very little RAM is required to store instructions.
• The emphasis is put on building complex instructions directly into the hardware.

RISC Approach
• RISC processors use only simple instructions that can each be executed within one clock cycle. The “MULT” command is therefore divided into three separate commands:
• “LOAD”, which moves data from the memory bank to a register; “PROD”, which finds the product of two operands located within the registers; and “STORE”, which moves data from a register to the memory banks.
• To perform the exact series of steps described in the CISC approach, a programmer would need to code four lines of assembly: two LOADs, one PROD, and one STORE.
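
The two approaches can be compared on a toy model of the generic computer. The sketch below is an assumption-laden illustration: the helper names (`make_machine`, `load`, `prod`, `store`, `mult`), the operand values 6 and 7, and the choice of registers A and B are invented for the example, and `mult` is assumed to write the product back to location 2:3 as the task statement requires.

```python
def make_machine():
    """Fresh machine state: the two memory cells of interest and six registers."""
    memory = {(2, 3): 6, (5, 2): 7}          # made-up operand values
    registers = dict.fromkeys("ABCDEF", 0)
    return memory, registers


def load(memory, registers, reg, loc):
    registers[reg] = memory[loc]             # memory -> register


def prod(registers, dst, src):
    registers[dst] = registers[dst] * registers[src]   # register * register


def store(memory, registers, loc, reg):
    memory[loc] = registers[reg]             # register -> memory


def mult(memory, registers, loc_a, loc_b):
    """CISC-style MULT: the same steps, hidden inside one instruction."""
    load(memory, registers, "A", loc_a)
    load(memory, registers, "B", loc_b)
    prod(registers, "A", "B")
    store(memory, registers, loc_a, "A")


# RISC version: four simple instructions written out by the programmer.
risc_memory, risc_regs = make_machine()
load(risc_memory, risc_regs, "A", (2, 3))    # LOAD A, 2:3
load(risc_memory, risc_regs, "B", (5, 2))    # LOAD B, 5:2
prod(risc_regs, "A", "B")                    # PROD A, B
store(risc_memory, risc_regs, (2, 3), "A")   # STORE 2:3, A

# CISC version: one complex instruction.
cisc_memory, cisc_regs = make_machine()
mult(cisc_memory, cisc_regs, (2, 3), (5, 2))  # MULT 2:3, 5:2

print(risc_memory[(2, 3)], cisc_memory[(2, 3)])   # 42 42
```

Both versions leave the same product in location 2:3. The difference is where the sequencing lives: in the program (RISC) or inside the instruction (CISC).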

RISC Vs CISC (Advantages)
• At first, the RISC approach may seem like a much less efficient way of completing the operation. Because there are more lines of code, more RAM is needed to store the assembly-level instructions.
• The compiler must also perform more work to convert a high-level language statement into code of this form.
• However, the RISC approach also has important advantages:
− Because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle “MULT” command.
− RISC “reduced instructions” require fewer transistors than the complex instructions, leaving more room for general-purpose registers. Because all of the instructions execute in a uniform amount of time (i.e. one clock cycle), pipelining is possible.

Subword Parallelism
Example 1:
• In image processing, the data are typically 8-bit integers, each representing a colour intensity.
• The colour of a pixel can be represented by 3 bytes in RGB format.
• Each of the RGB bytes has a value ranging from 0 to 255, representing the intensity of the corresponding colour.
• It is not efficient to use a 64-bit ALU to process a single 8-bit number (a waste of resources!).
• To support such data types, some processors provide subword parallelism, where a wide ALU is divided into narrow slices, enabling simultaneous arithmetic (or logical) operations on smaller words.

Subword Parallelism
Example 2:
• Intel introduced subword parallelism into the Pentium processor as MMX.
• MMX instructions divide the 64-bit datapath into slices as small as 8 bits.
• This supports identical operations on multiple bytes of image pixel data simultaneously.
• It enhances the performance of image applications.
• Many processors, including DSPs, support subword parallelism.
• Vector processor: the instruction set includes operations on multiple data elements simultaneously.
− Subword parallelism is a form of vector processing.
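
The idea of slicing a wide datapath can be imitated in software by packing eight 8-bit values into one 64-bit integer and adding all eight lanes with a handful of word-wide operations. The sketch below is illustrative only: the names `pack`, `unpack`, and `packed_add` are invented, and it implements wraparound (modulo 256) addition per lane, which is an assumption about the desired behaviour rather than a description of any particular MMX instruction.

```python
LANES = 8
LOW7 = 0x7F7F7F7F7F7F7F7F   # low 7 bits of every 8-bit lane
HIGH = 0x8080808080808080   # top bit of every 8-bit lane


def pack(values):
    """Pack eight 8-bit values into one 64-bit word (lane 0 in the low byte)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]


def packed_add(a, b):
    """Add eight 8-bit lanes at once, each wrapping modulo 256."""
    # Adding only the low 7 bits of each lane cannot carry into the next lane.
    low = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is then fixed up with XOR, discarding its carry.
    return low ^ ((a ^ b) & HIGH)


pixels_a = [10, 200, 255, 0, 1, 2, 3, 4]
pixels_b = [5, 100, 1, 0, 9, 8, 7, 6]
total = unpack(packed_add(pack(pixels_a), pack(pixels_b)))
print(total)   # [15, 44, 0, 0, 10, 10, 10, 10]
```

Hardware with subword parallelism achieves the same lane isolation by cutting the carry chain of the ALU at the slice boundaries, so no masking is needed.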

Superscalar
• In superscalar processors, the hardware can simultaneously dispatch multiple instructions to distinct hardware units when it detects that doing so will not change the behavior of the program.
• Superscalar processors are rarely (if ever) used for embedded systems.
− Execution times are very difficult to predict and may not be repeatable.
• Instead, processors intended for embedded applications often use VLIW architectures.
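
The check that superscalar hardware performs before dispatching two instructions together can be sketched as a register-dependency test. This is a heavily simplified illustration: the `Instr` tuple, the register names, and the function `can_dual_issue` are hypothetical, and real hardware also considers functional-unit availability and can remove some of these conflicts through register renaming.

```python
from collections import namedtuple

# Hypothetical instruction form: one destination register, a tuple of sources.
Instr = namedtuple("Instr", "dest srcs")


def can_dual_issue(first, second):
    """Return True if issuing both instructions at once preserves the result."""
    raw = first.dest in second.srcs    # second reads what first writes
    waw = first.dest == second.dest    # both write the same register
    war = second.dest in first.srcs    # second overwrites an input of first
    return not (raw or waw or war)


i1 = Instr("r1", ("r2", "r3"))   # r1 := r2 op r3
i2 = Instr("r4", ("r5", "r6"))   # independent of i1
i3 = Instr("r7", ("r1", "r5"))   # needs the result of i1

print(can_dual_issue(i1, i2))   # True
print(can_dual_issue(i1, i3))   # False
```

Because this decision is made at run time and depends on the instruction stream actually encountered, the number of cycles a code sequence takes can vary, which is the source of the timing unpredictability noted above.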

Very Long Instruction Word (VLIW)
• VLIW processors include multiple function units
• Instead of dynamically determining which instructions can be executed simultaneously, each instruction specifies what each function unit should do in a particular cycle.
• Effectively a VLIW instruction set combines multiple independent operations into a single instruction.
• The multiple operations are executed simultaneously on distinct hardware.
• It is up to the compiler to ensure that the simultaneous operations are independent.
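
A VLIW instruction can be pictured as a bundle with one operation per function unit, checked for independence at compile time and then executed in a single cycle. The sketch below is a toy model under stated assumptions: the bundle format `(op, dest, src1, src2)`, the register names, and the functions `check_bundle` and `execute_bundle` are all invented, and the independence check is deliberately conservative.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def check_bundle(bundle):
    """Compiler-side check: no slot touches another slot's destination."""
    dests = [dest for _, dest, _, _ in bundle]
    if len(set(dests)) != len(dests):
        return False                       # two slots write the same register
    for i, (_, _, src1, src2) in enumerate(bundle):
        others = dests[:i] + dests[i + 1:]
        if src1 in others or src2 in others:
            return False                   # a slot reads another slot's result
    return True


def execute_bundle(regs, bundle):
    """Execute all slots 'simultaneously': read everything, then write."""
    results = [(dest, OPS[op](regs[s1], regs[s2])) for op, dest, s1, s2 in bundle]
    for dest, value in results:
        regs[dest] = value


regs = {"r1": 1, "r2": 2, "r3": 3, "r4": 4, "r5": 0, "r6": 0}
good = [("add", "r5", "r1", "r2"), ("mul", "r6", "r3", "r4")]
bad = [("add", "r5", "r1", "r2"), ("mul", "r6", "r5", "r4")]   # slot 2 needs r5

print(check_bundle(good), check_bundle(bad))   # True False
execute_bundle(regs, good)
print(regs["r5"], regs["r6"])                  # 3 12
```

Since the grouping is fixed in the instruction stream rather than decided by the hardware at run time, the timing of a VLIW program is easier to predict than that of a superscalar one.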

Multicore Architectures
• A multicore machine is a combination of several processors on a single chip.
• For embedded applications, multicore architectures have a significant potential advantage over single-core architectures in that real-time and safety-critical tasks can have a dedicated processor.
• This is one reason multicore architectures are now popular in mobile phones: radio and speech processing are hard real-time functions that have a considerable computational load.
• In such multicore architectures, user applications cannot interfere with real-time functions.

The End of Lecture
