
COMP 30080 Computer Systems
4. Microarchitecture
UCD School of Computer Science. Scoil na Ríomheolaíochta UCD.


What is microarchitecture?
• “Computer Architecture is the conceptual design and operational structure of a computer system.” It includes:
• “The Instruction Set Architecture (ISA) is the programmer-visible components of the computer, their operation and their interaction.” This includes the list of instructions, the registers, addressing modes, memory architecture, data structures, interrupt handlers, and I/O interfaces.
• “The microarchitecture (or computer organisation) is the specific arrangement and interconnection of hardware components that implement the processor.”

1. Processor Performance
– Measuring Performance
– Understanding Performance
– Power Consumption
2. Performance Trends
3. Single cycle Microarchitecture (already covered)
4. Multi-cycle Microarchitecture
5. Pipelined Microarchitecture
6. Multiple Issue Microarchitecture
– Static Multiple Issue
– Dynamic Multiple Issue
7. Advanced Microarchitectures

Performance

Program Listing:
main:
loop:
    beq  $t0,$t1,exit
    addi $t0,$t0,1
    mul  $t2,$a0,2
    add  $v0,$t2,3

Program Trace (the instructions actually executed, in order; the loop body repeats once per iteration):
    beq  $t0,$t1,exit
    addi $t0,$t0,1
    mul  $t2,$a0,2
    add  $v0,$t2,3
    beq  $t0,$t1,exit
    addi $t0,$t0,1
    mul  $t2,$a0,2
    add  $v0,$t2,3
    beq  $t0,$t1,exit
    addi $t0,$t0,1
    mul  $t2,$a0,2
    add  $v0,$t2,3
    beq  $t0,$t1,exit
    addi $t0,$t0,1
    mul  $t2,$a0,2
    add  $v0,$t2,3
    beq  $t0,$t1,exit

Program Profile (count of each instruction type executed):
    Instruction count: 2 4 3 3 3 3

Program Execution
• “A computer program is a sequence of instructions written to perform a specific task on a computer.”
• “A program trace is a log of the instructions performed during program execution.”
• “A program profile is an analysis of the usage of instructions during program execution”

How do you measure the performance of a processor?
• Option 1:
• Number of clock cycles per second
• But… what if the processor doesn’t do one instruction per clock cycle? 🙁

How do you measure the performance of a processor?
• Option 2:
• Number of instructions per second
• MIPS = Millions of Instructions Per Second
• But… how do you know the instructions are equally powerful?
• Different computers could have different ISAs. Is the comparison fair? 🙁

How do you measure the performance of a processor?
• Option 3:
• If you are interested in mathematics, you could settle on floating point calculations as your standard instruction.
• Floating Point Operations Per Second (flops)

How do you measure the performance of a processor?
• Option 4:
• Measure the execution time of a benchmark program
• Rate performance by comparing the execution time to that of the same program on a reference computer.
• Too specific to a particular program. 🙁

How do you measure the performance of a processor?
• Option 5:
• Measure the total execution time (Ts) over a suite of N benchmark programs:
• Score performance relative to a reference computer.
• Best option. 🙂
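A plausible reconstruction of the scoring implied above (a sketch of one common scheme, not necessarily the exact formula used by every suite; $T_i$ is the execution time of benchmark $i$ on the machine under test, and the "ref" superscript denotes the reference computer):

$T_s = \sum_{i=1}^{N} T_i$, relative score $= \dfrac{T_s^{\mathrm{ref}}}{T_s}$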

Example Benchmark Suites
• A number of standard suites of benchmark programs are available. These are written in high-level languages to facilitate portability between computers. The results are usually expressed relative to a specific reference computer.
• SPEC (Standard Performance Evaluation Corporation) – www.spec.org
• LINPACK: Linear algebra benchmark
‣ www.top500.org
• EEMBC (Embedded Microprocessor Benchmark Consortium)
‣ www.eembc.org
• UL Benchmarks (formerly Futuremark)
‣ 3DMark, UL Procyon
• Geekbench
‣ browser.primatelabs.com

[Number-scale reference: 10^18, 10^15, 10^12, 10^9; 10^-9, 10^-12, 10^-15 (i.e. exa, peta, tera, giga; nano, pico, femto)]

Example top-end figure: 537,212,000,000,000,000 flop/s (about 537 Pflop/s)

LINPACK Benchmarks on Your (Intel) PC
• Google “Intel Math Kernel Library Benchmarks Suite”
‣ https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite
‣ download
‣ open terminal window
‣ change directory to m_mklb/benchmarks_//mkl/benchmarks/linpack
‣ close all other apps & services
‣ ./runme64

LINPACK Benchmarks for my PC
• Late 2016 MacBook Pro
• 2.7 GHz Intel Core i7 processor (4 core)
• Peak 166.6928 GFlops = 166,692,800,000 Flops

https://www.top500.org/

http://www.irishsupercomputerlist.org/

Power Consumption

Power Consumption
• Power consumption determines:
– Battery life in mobile systems
– Cost in server and supercomputer systems
• Power is dissipated as heat. If processors get too hot, they will fail. Expensive cooling systems (heat sinks, fans, air conditioning) are needed for recent processors.

Power Consumption
• In CMOS circuits:
• Switching activity arises when the voltage level on a wire switches from 0 to 1 (charge) and from 1 to 0 (discharge). The total activity is dependent on:
‣ the number of wires that are switching
‣ the frequency with which the wires are switching
• Capacitance is determined by manufacturing technology and chip design.
• Supply voltage is determined by the process technology but can be adjusted.
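As a sketch, the standard CMOS dynamic power relation that these bullets describe is the following (symbols assumed here: activity factor $\alpha$, switched capacitance $C$, supply voltage $V_{DD}$, clock frequency $f$):

$P_{\mathrm{dynamic}} \approx \alpha \, C \, V_{DD}^{2} \, f$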

Power Consumption
• But… lower voltage means slower transistor switching

Transistors per Chip Over Time
• Moore’s Law: “The number of transistors on a single integrated circuit doubles approximately every one to two years”

Transistor Size over Time

Performance over Time
• Roughly 2/3 of the performance gain comes from process shrink, and 1/3 from using the extra transistors for better microarchitectures.

Performance over Time
• Pollack’s Rule: “Performance increase due to microarchitecture advances is roughly proportional to the square root of the increase in complexity (i.e. transistors available)”
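Written as a formula (a common paraphrase of the rule, not quoted from the slides):

$\text{performance gain} \approx \sqrt{\text{increase in transistor count}}$

For example, doubling the transistors devoted to a core yields roughly $\sqrt{2} \approx 1.4\times$ the performance.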

Clock Frequency Scaling with Time
• has stopped…

Voltage over Time
• Supply voltage has not scaled at the same rate as feature shrink.

Power Consumption
• Look at the power equation
• Large increases in:
‣ number of transistors per mm2
‣ frequency of switching of these transistors
• Small decreases in:
‣ capacitance
• Overall large increases in power density (W/mm2)

HAD TO STOP FREQUENCY SCALING

Power Density Over Time
• Power density plateaued because designers stopped increasing clock frequency and adjusted chip layouts to avoid problems.

Performance Gains?
• Use the increased numbers of transistors to:
‣ improve microarchitectures
‣ add parallelism (multi-core)
‣ put more memory (cache) in the chip with the CPU to reduce main memory access times

Single Cycle MIPS

[Single-cycle datapath fragment: a multiplexer selects the rt field (0) or rd field (1) as the write register number sent to the Reg File; ALUControl drives the ALU]

[Table: number of each instruction type executed; 18 instructions in total]
Clock frequency, f = 2 MHz
Clock period, t = 0.5 μs
Execution time, T = 18 * 0.5 μs = 9 μs

Performance
• All instructions take 1 clock cycle.
• Therefore the total execution time E of a program on a single cycle processor is equal to the total number of instructions executed N multiplied by the clock period T.
• Of course, clock period is the inverse of clock frequency F:
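A plausible reconstruction of the formulas referred to above:

$E = N \times T = \dfrac{N}{F}$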

Multicycle MIPS

[Stage-delay table: Reg. & ALU, Data Memory, Reg. Write, Total (ps); individual values mostly not recovered]
Single Cycle: TOTAL = 1350 ps
Multi Cycle: shorter stages of 100 ps each (partially recovered), TOTAL = 1200 ps

Multi-cycle Processor
• In the single-cycle implementation:
‣ one instruction is performed every clock cycle.
‣ for ease of digital implementation, the clock period must be fixed.
‣ clock must be set to allow for the slowest (worst case) instruction (load instruction)
‣ time is wasted in all other cases.
• In the multi-cycle implementation:
‣ Split up an instruction into stages.
‣ Perform one stage every clock cycle.
‣ Start the next instruction as soon as all the stages needed for the current instruction are finished.

Understanding Performance
• We can say that the execution time E on a multi-cycle processor equals, for each of the M instructions i in the ISA, the number of times that instruction is executed n(i) multiplied by the number of clock cycles it takes c(i), summed over all instructions and multiplied by the clock period T.
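In symbols, a plausible reconstruction of the formula described above:

$E = T \sum_{i=1}^{M} n(i)\, c(i)$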

[Table: for each instruction type, the number executed n(i) and the number of clock cycles each takes c(i); totals: 18 instructions, 67 clock cycles]
Clock frequency, f = 10 MHz
Clock period, t = 0.1 μs
Execution time, T = 67 * 0.1 μs = 6.7 μs

Understanding Performance
• A figure of merit for a processor is the average number of Clock cycles per Instruction (CPI) I. CPI is the number of clock cycles performed C divided by the number of instructions performed N for a typical benchmark program:
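A plausible reconstruction of the definition:

$I = \dfrac{C}{N}$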

Example (same workload: 18 instructions, 67 clock cycles):
Number of instructions, N = 18
Number of clock cycles, C = 67
Clock cycles Per Instruction, I = 67/18 = 3.72

Comparison
• Multicycle:
• Single cycle:
• Note that the basis of the equations is the same, except that for a single cycle processor the CPI is I = 1.
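Plausible reconstructions of the two formulas labelled above, using the earlier notation:

$E_{\text{multi}} = N \times I \times T_{\text{multi}}$, $E_{\text{single}} = N \times 1 \times T_{\text{single}}$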

Comparison
• We can say:
• So comparing multi-cycle to single cycle performance, P:
i.e. 40% faster
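A plausible reconstruction of the ratio referred to above:

$P = \dfrac{E_{\text{single}}}{E_{\text{multi}}} = \dfrac{N \times T_{\text{single}}}{N \times I \times T_{\text{multi}}} = \dfrac{T_{\text{single}}}{I \times T_{\text{multi}}}$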

Performance Factors
• Sometimes, processor datasheets refer to Instructions Per Clock cycle (IPC):
• Handy when CPI is less than 1, e.g. CPI=0.5, IPC=2.
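Equivalently, from the definitions above:

$\text{IPC} = \dfrac{1}{\text{CPI}} = \dfrac{N}{C}$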

• For the given workload, what is the CPI for each implementation?
• How much faster is the multi-cycle compared to the single-cycle implementation?
[Table: for the single cycle processor and the multi-cycle processor, the clock period and the cycles per instruction class (e.g. branch or jump) for the given workload; values not recovered]

Multicycle Implementation Approach
• Break up each instruction into stages; each stage takes one clock cycle.
• Balance the amount of work to be done in each cycle.
• Restrict each cycle to use only one major functional unit.
• At the end of a cycle, store values for use in later cycles -> introduce additional “internal” registers

Multicycle Implementation
[Multicycle datapath diagram: a single Memory (instruction or data) feeding an Instruction register and a Memory data register, the Registers block with its register-number inputs, and a single ALU with an ALUOut register]

Multicycle Implementation Changes
• Single memory unit for data and instructions.
• Single ALU, no separate adders.
• Registers after every functional unit to hold output until value is used in next clock cycle.
– Instruction Register (IR)
– Memory Data Register (MDR)
– Register data registers (A, B)
– ALU output register (ALUOut)
• More multiplexers to control flow of data.

Multicycle Implementation Not Changed
• The Instruction Set Architecture is unchanged.
• The same executable programs will run.
• However, the way they are executed is changed.
• The additional intermediate registers are not visible to the programmer.

Multicycle Implementation

Control New Signals
IorD = 0: Memory address is PC (Instruction).
IorD = 1: Memory address is ALUOut (Data).
IRWrite = 1: Update IR on positive edge of the clock.
PCWrite = 1: Update PC on positive edge of the clock.
PCWriteCond = 1: Update PC on positive edge of the clock if Zero = 1.

Control New Signals
ALUSrcA = 0: First ALU input is the PC.
ALUSrcA = 1: First ALU input is the A register.
ALUSrcB = 00: Second ALU input is the B register.
ALUSrcB = 01: Second ALU input is the constant 4.
ALUSrcB = 10: Second ALU input is sign-extended IR[15:0].
ALUSrcB = 11: Second ALU input is sign-extended IR[15:0] left shifted by 2.
PCSource = 00: PC input is the ALU output (bypassing the ALUOut register).
PCSource = 01: PC input is ALUOut.
PCSource = 10: PC input is {PC+4[31:28], IR[25:0], 2'b00} (jump target).

Instruction: add $t1, $t2, $t3
1. Fetch & PC update
• IR <= Memory[PC];
• PC <= PC + 4;
• IorD = 0 (Instruction)
• IRWrite = 1 (update IR)
• ALUSrcA = 0 (PC)
• ALUSrcB = 01 (+4)
• ALUOp = 00 (add)
• PCSource = 00 (ALU output)
• PCWrite = 1 (do write)
• other control signals are 'neutral'.
(the '<=' notation indicates parallel register updates)

Instruction: add $t1, $t2, $t3
2. Decode & register fetch
• A <= Reg[IR[25:21]];
• B <= Reg[IR[20:16]];
• ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
• ALUSrcA = 0 (PC)
• ALUSrcB = 11 (IR[15:0] << 2)
• ALUOp = 00 (add)

Instruction: add $t1, $t2, $t3
3. Execute
• ALUOut <= A op B;
• ALUSrcA = 1 (A register)
• ALUSrcB = 00 (B register)
• ALUOp = 10 (follow funct field)

Instruction: add $t1, $t2, $t3
4. Register update
• Reg[IR[15:11]] <= ALUOut
• RegDst = 1 (rd field)
• MemtoReg = 0 (ALUOut)
• RegWrite = 1

Instruction: lw $t1, offset($t2)
• Cycle 1 Fetch: unchanged
• Cycle 2 Decode: unchanged

Instruction: lw $t1, offset($t2)
3. Memory address computation
• ALUOut <= A + sign_extend(IR[15:0])
• ALUSrcA = 1 (A register)
• ALUSrcB = 10 (sign_extend(IR[15:0]))
• ALUOp = 00 (add)

Instruction: lw $t1, offset($t2)
4. Memory access
• MDR <= Memory[ALUOut];
• IorD = 1 (Data)
• MemRead = 1

Instruction: lw $t1, offset($t2)
5. Memory read completion
• Reg[IR[20:16]] <= MDR;
• MemtoReg = 1 (data from memory)
• RegDst = 0 (rt field)
• RegWrite = 1

Instruction: beq $t1, $t2, offset
• Cycle 1 Fetch: unchanged (PC <= PC + 4)
• Cycle 2 Decode: unchanged (branch target address computed and placed in ALUOut)

Instruction: beq $t1, $t2, offset
3. Branch completion
• if (A == B) PC <= ALUOut;
• ALUSrcA = 1 (A register)
• ALUSrcB = 00 (B register)
• ALUOp = 01 (subtract)
• PCSource = 01 (ALUOut)
• PCWriteCond = 1 (write if Zero=1)

Instruction: j (jump)
• Cycle 1 Fetch: unchanged (PC <= PC + 4)
• Cycle 2 Decode: unchanged
• Cycle 3 Jump
– PC <= {PC[31:28], IR[25:0], 2'b00};  (the {} notation concatenates the bits)
– PCSource = 10 (jump)
– PCWrite = 1

Control Changes
• Single cycle implementation:
– Control only depends on the instruction read during that cycle.
– No memory of previous cycles is needed.
– Therefore, the Control Unit is built from combinational logic.
• Multicycle implementation:
– Control depends on an instruction read several cycles ago.
– Memory is needed.
– Therefore, the Control Unit is a Finite State Machine.

Control Decode
• The Control Unit uses a two stage decode.
[Diagram: the op field Instruction[31:26] and the State register feed the main control, which produces RegDst, RegWrite, MemRead, MemWrite, ALUSrcA, ALUSrcB, IorD, IRWrite, PCWrite, PCWriteCond, PCSource and the next state; the funct field Instruction[5:0] feeds a second decoder that produces ALUControl]

Finite State Machine
[State diagram, partially recovered: after Decode, the opcode selects the path; (Op = lw) or (Op = sw) leads to Address compute, then Memory access (lw) or Memory write (sw), then Reg Update; R-type instructions go to Execute then Reg Update; Op = beq goes to Branch; Op = j goes to Jump; all paths return to fetch]
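To make the cycle-by-cycle control concrete, here is a minimal Python sketch of the control FSM described above. It is only an illustrative software model, not the lecture's hardware design: the state names, the dictionary layout and the next_state helper are assumptions of this sketch, while the control-signal values are the ones listed on the slides.

# Minimal software model of the multicycle control FSM.
# Each state maps to the control signals asserted in that clock cycle;
# unlisted signals are assumed 'neutral'.

FSM = {
    "FETCH":     {"IorD": 0, "IRWrite": 1, "ALUSrcA": 0, "ALUSrcB": "01",
                  "ALUOp": "00", "PCSource": "00", "PCWrite": 1},
    "DECODE":    {"ALUSrcA": 0, "ALUSrcB": "11", "ALUOp": "00"},
    # R-type (e.g. add $t1,$t2,$t3): execute, then write rd
    "R_EXECUTE": {"ALUSrcA": 1, "ALUSrcB": "00", "ALUOp": "10"},
    "R_WRITE":   {"RegDst": 1, "MemtoReg": 0, "RegWrite": 1},
    # lw: address computation, memory read, register write
    "MEM_ADDR":  {"ALUSrcA": 1, "ALUSrcB": "10", "ALUOp": "00"},
    "LW_READ":   {"IorD": 1, "MemRead": 1},
    "LW_WRITE":  {"RegDst": 0, "MemtoReg": 1, "RegWrite": 1},
    # beq and j complete in their third cycle
    "BRANCH":    {"ALUSrcA": 1, "ALUSrcB": "00", "ALUOp": "01",
                  "PCSource": "01", "PCWriteCond": 1},
    "JUMP":      {"PCSource": "10", "PCWrite": 1},
}

def next_state(state, opcode):
    """Choose the next state from the current state and the opcode
    decoded in the DECODE cycle (hypothetical opcode names)."""
    if state == "FETCH":
        return "DECODE"
    if state == "DECODE":
        return {"add": "R_EXECUTE", "lw": "MEM_ADDR",
                "beq": "BRANCH", "j": "JUMP"}[opcode]
    if state == "R_EXECUTE":
        return "R_WRITE"
    if state == "MEM_ADDR":
        return "LW_READ"
    if state == "LW_READ":
        return "LW_WRITE"
    return "FETCH"   # last cycle of every instruction returns to fetch

# Example: an add instruction takes FETCH, DECODE, R_EXECUTE, R_WRITE (4 cycles)
state = "FETCH"
for _ in range(4):
    print(state, FSM[state])
    state = next_state(state, "add")

Running it prints the four control-signal bundles asserted while an add executes; lw would pass through five states, beq and j through three, matching the cycle counts above.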
Control Exceptions
• “An exception is an unscheduled event that disrupts program execution.” E.g. an error detected by the processor.
• “An interrupt is an exception that comes from outside the processor.” E.g. a signal from the keyboard.
• The Control Unit must handle these exceptions. Typically:
– The current instruction is completed.
– Registers which are in use but which need not be preserved by a subroutine are put on the stack.
– A table is looked up to find the address of the Interrupt Service Routine (ISR) associated with that exception number.
– The processor jumps and links to the ISR.

Control Implementation
• In modern processors the Control Unit is very complex.
• A FSM implemented using logic gates is inflexible.
• Modern Control Units are implemented using a mini-processor which executes microinstructions. The technique is called microprogramming.

Pipelined MIPS
• “Instruction Level Parallelism (ILP) is executing more than one instruction at a time.”
• We consider two main approaches in this chapter:
– Pipelining (this section). There is a single datapath that executes stages of multiple instructions in parallel (temporal parallelism).
– Multiple issue (next section).

• Split instructions into stages (like multi-cycle).
• Design the hardware so that the stages are independent and can be performed in parallel (like multi-cycle).
• Make all instructions last for 5 stages (unlike multi-cycle).
• As soon as the first stage of the current instruction is finished, start the first stage of the next instruction (because the hardware for the first stage is not busy) (unlike multi-cycle).

MIPS Pipeline
• 5 stage pipeline:
‣ IF: Instruction Fetch
‣ ID: Instruction Decode
‣ EX: Execute
‣ MEM: Memory access
‣ WB: Write Back (i.e. register update)

[Timing diagrams comparing the Single Cycle Processor (10 instructions take 10 * 500 = 5,000 ps to execute), the Multi Cycle Processor and the Pipelined Processor; stage labels such as Write Back and Memory Write partially recovered]

MIPS Pipeline Organisation
• Theoretically, how much faster is the following program on the pipelined processor than on a single cycle processor?
– 10,000 instructions to be executed
• Single cycle processor
– clock period = 600 ps
• Pipelined processor
– clock period = 200 ps
– assume there are no stalls

Pipelined Processor
• In the pipeline:
‣ Same ISA, so same number of instructions to perform.
‣ Stages are similar to multicycle, so the clock frequency is about 4.5 times that of the single cycle design.
‣ Ideally, one instruction finishes every clock cycle, so CPI is 1.
• But...
– “Hazards are situations in pipelining when the next instruction cannot execute in the following cycle.”
– “A pipeline stall (or bubble) is a delay in execution of an instruction in a pipeline to allow time to resolve a hazard.”

Comparison
• The pipeline stall rate S is the fraction of clock cycles experiencing a stall.
• The execution time of a program on a pipelined processor is then: E = N * (1 + S) * T.
• Multicycle: E = N * I * T.
• Single cycle: E = N * T.

Comparison
• We can say: E_single = N * T_single and E_pipe = N * (1 + S) * T_pipe.
• Comparing performance between pipelined and single cycle: P = E_single / E_pipe = T_single / ((1 + S) * T_pipe).

• What happens when there is a hazard?
– The Control Unit determines in advance that a hazard will occur and delays the start of the next instruction until the hazard is resolved.
– This is called a pipeline stall or bubble.
– This increases CPI but, on average, it is still better than in the multicycle implementation.
• Types of hazards:
‣ Structural hazards
‣ Data hazards

Structural Hazard
• “A structural hazard is an occurrence in which a planned instruction cannot execute in the proper clock cycle because the hardware cannot support the combination of instructions that are set to execute in the given clock cycle.”
• This is due to resource conflicts, e.g. two successive instructions want to use the data memory in the same cycle.
• MIPS was designed so that structural hazards are easy to avoid.
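As a rough numerical illustration of the execution-time relations used in this section, here is a small Python sketch. The formulas follow the slides; the instruction count, clock periods, CPI and stall rate plugged in at the bottom are assumed values for illustration only, not lecture data.

# Execution-time model for the three implementations discussed in this section.
# E_single = N * T            (one cycle per instruction)
# E_multi  = N * I * T        (I = average cycles per instruction, CPI)
# E_pipe   = N * (1 + S) * T  (S = pipeline stall rate, ideal CPI of 1)

def e_single(n, clock_period):
    return n * clock_period

def e_multi(n, cpi, clock_period):
    return n * cpi * clock_period

def e_pipe(n, stall_rate, clock_period):
    return n * (1 + stall_rate) * clock_period

# Illustrative (assumed) figures:
N = 1_000_000                          # instructions executed
print(e_single(N, 900e-12), "s")       # 900 ps single-cycle clock
print(e_multi(N, 3.7, 200e-12), "s")   # CPI 3.7, 200 ps multicycle clock
print(e_pipe(N, 0.2, 200e-12), "s")    # 20% of cycles stalled, 200 ps pipeline clock

Comparing the printed times gives the speedup ratios discussed above; setting the stall rate to 0 recovers the ideal CPI = 1 case.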