CSE141-s121-L5
Week 2 — Wed July 7
Performance (Cont.) & Single Cycle Processor
CSE-141 Summer Session I 2021
Many slides adapted from Dean Tullsen
Administrative
• Weekly Check-In is due midnight
• Do not forget
• Homework 1 Will be posted tonight
• Due next Thursday 11:59 pm
• Make sure you have MIPS Ref Sheet ready
Previously on CSE-141…
Our definition of Performance
• only has meaning in the context of a program or workload
• Not very intuitive as an absolute measure, but most of the time we’re more
interested in relative performance.
PerformanceX =
1
Execution TimeX
, for program X
Relative Performance (Speedup), the Definition
PerformanceX
Execution TimeXPerformanceY
Speedup Execution TimeY= = = n(X/Y)
What is Time?
CPU Execution Time = CPU clock cycles * Clock cycle time
• Every conventional processor has a clock with an associated clock cycle time or clock rate
• Every program runs in an integral number of clock cycles
Cycle Time
MHz = millions of cycles/second, GHz = billions of cycles/second
X MHz = 1000/X nanoseconds cycle time
Y GHz = 1/Y nanoseconds cycle time
All Together Now
CPU Execution
Time
Instruction
Count
CPI Clock Cycle
Time= X X
instructions
cycles/instruction seconds/cycle
seconds
Putting it all together
• IC = 4 billion, 2 GHz processor, execution time of 3 seconds. What is the CPI
for this program?
• Suppose we reduce CPI to 1.2 (through an architectural improvement). What
is the new execution time?
CPU Execution
Time
Instruction
Count
CPI Clock Cycle
Time= X X
An aside…
Cycle Time/Clock Rate
An aside…
Cycle Time/Clock Rate
• Increasingly, modern processors can execute at multiple clock rates (cycle times).
An aside…
Cycle Time/Clock Rate
• Increasingly, modern processors can execute at multiple clock rates (cycle times).
• Why?
An aside…
Cycle Time/Clock Rate
• Increasingly, modern processors can execute at multiple clock rates (cycle times).
• Why?
• However, the granularity at which we can change the cycle time tends to be
fairly coarse, so all of these principles and formulas still apply.
Who Affects Performance?
CPU Execution Time Instruction Count CPI Clock Cycle Time= X X
Who Affects Performance?
• programmer
• compiler
CPU Execution Time Instruction Count CPI Clock Cycle Time= X X
Who Affects Performance?
• programmer
• compiler
• instruction-set architect
• machine architect
CPU Execution Time Instruction Count CPI Clock Cycle Time= X X
Who Affects Performance?
• programmer
• compiler
• instruction-set architect
• machine architect
• hardware designer
CPU Execution Time Instruction Count CPI Clock Cycle Time= X X
Who Affects Performance?
• programmer
• compiler
• instruction-set architect
• machine architect
• hardware designer
• materials scientist/physicist/silicon engineer
CPU Execution Time Instruction Count CPI Clock Cycle Time= X X
Performance Variation
CPU Execution Time Instruction Count CPI Clock Cycle Time= X X
Number of
Instructions
CPI Clock Cycle Time
Same machine,
different programs
Same programs,
different machine,
same ISA
Same programs,
different machines
Other Performance Metrics
• MIPS
• MFLOPS, GigaFLOPS, PetaFLOPS (million, billion, quadrillion Floating Point
Operations/second)
MIPS
MIPS = Millions of Instructions Per Second
• program-independent
• deceptive
=
Instruction_Count
Execution_Time × 106
=
Clock_Rate
CPI × 106
Which Programs?
• peak throughput measures (simple programs)?
Which Programs?
• peak throughput measures (simple programs)?
• synthetic benchmarks (whetstone, dhrystone,…)?
Which Programs?
• peak throughput measures (simple programs)?
• synthetic benchmarks (whetstone, dhrystone,…)?
• Real applications
Which Programs?
• peak throughput measures (simple programs)?
• synthetic benchmarks (whetstone, dhrystone,…)?
• Real applications
• SPEC (best of both worlds, but with problems of their own)
• System Performance Evaluation Cooperative
• Provides a common set of real applications along with strict guidelines for how to run them.
• provides a relatively unbiased means to compare machines.
• Now, we also have more specialized benchmarks for big data, for datacenter
applications, for machine learning, …
Amdahl’s Law
• The impact of a performance improvement is limited by the fraction of
execution time affected by the improvement
Amdahl’s Law
• The impact of a performance improvement is limited by the fraction of
execution time affected by the improvement
Execution time
after improvement
=
Execution Time Affected
Amount of Improvement
Execution Time Unaffected+
Amdahl’s Law
• The impact of a performance improvement is limited by the fraction of
execution time affected by the improvement
• Make the common case fast!! (usually)
Execution time
after improvement
=
Execution Time Affected
Amount of Improvement
Execution Time Unaffected+
Amdahl’s Law and Massive Parallelism
.9 .1
Amdahl’s Law and Massive Parallelism
.9 .1
.1.45
Speedup
1.0
1/.55 = 1.82
Amdahl’s Law and Massive Parallelism
.9 .1
.1
.1
.45
.225
Speedup
1.0
1/.55 = 1.82
1/.325 = 3.07
Amdahl’s Law and Massive Parallelism
.9 .1
.1
.1
.1
.45
.225
Speedup
1.0
1/.55 = 1.82
1/.325 = 3.07
< 10 Performance Key Points Performance Key Points • Be careful how you specify performance Performance Key Points • Be careful how you specify performance • Execution time = instructions * CPI * cycle time Performance Key Points • Be careful how you specify performance • Execution time = instructions * CPI * cycle time • Use real applications Performance Key Points • Be careful how you specify performance • Execution time = instructions * CPI * cycle time • Use real applications • Use standards, if possible Performance Key Points • Be careful how you specify performance • Execution time = instructions * CPI * cycle time • Use real applications • Use standards, if possible • Make the common case fast Just 2 min break! Designing a Single Cycle Datapath or The Do-It-Yourself CPU Kit Computer Architecture Computer Architecture = Instruction Set Architecture + Machine Organization Computer Architecture Computer Architecture = Instruction Set Architecture + Machine Organization How you talk to the machine Computer Architecture Computer Architecture = Instruction Set Architecture + Machine Organization What the machine hardware looks like How you talk to the machine The Big Picture: Where are We Now? • The Five Classic Components of a Computer • Today’s Topic: the ALU, Datapath Design, then Control Design Control Datapath Memory Processor Input Output The Big Picture: The Performance Perspective • Processor design (datapath and control) will determine: – Clock cycle time – Clock cycles per instruction • Starting today: – Single cycle processor: ▪ Advantage: One clock cycle per instruction ▪ Disadvantage: long cycle time •ET = Insts * CPI * Cycle Time Execute an entire instruction Clocking Methodology • All storage elements are clocked by the same clock edge • This picture is valid whether we’re talking about a single-cycle processor, multi-cycle processor, pipelined processor, … Clk Don’t Care Setup Hold . . . . . . . . . . . . Setup Hold The Processor: Datapath & Control • We're ready to look at an implementation of MIPS simplified to contain only: – memory-reference instructions: lw, sw – arithmetic-logical instructions: add, sub, and, or, slt – control flow instructions: beq • Generic Implementation: – use the program counter (PC) to supply instruction address – get the instruction from memory – read registers – use the instruction to decide exactly what to do Arithmetic -- The heart of instruction execution 32 32 32 operation result a b ALU Instruction Fetch Instruction Decode Operand Fetch Execute Result Store Next Instruction Okay, let’s stop and recall a little math… Recall: 2’s complement • We would like a number system that provides – obvious representation of 0,1,2... – uses an adder for both unsigned and signed addition – single value of 0 – equal coverage of positive and negative numbers – easy detection of sign – easy negation Two’s Complement Representation • 2’s complement representation of negative numbers – Take the bitwise inverse and add 1 • Biggest 4-bit Binary Number: 7 • Smallest 4-bit Binary Number: -8 Decimal -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 Two’s Complement Binary 1000 1001 1010 1011 1100 1101 1110 1111 0000 0001 0010 0011 0100 0101 0110 0111 Two’s Complement Arithmetic • Examples: 7 + (- 6) = 1 3 + (- 5) = -2 2’s Complement Binary2’s Complement BinaryDecimal 0 0000 1 0001 2 0010 3 0011 1111 1110 1101 Decimal -1 -2 -3 4 0100 5 0101 6 0110 7 0111 1100 1011 1010 1001 -4 -5 -6 -7 1000-8 0 1 1 1 1 0 1 0+ 0 0 1 1 1 0 1 1+ Designing an Arithmetic Logic Unit ALU Control Lines (ALUop) Function 000 And 001 Or 010 Add 110 Subtract 111 Set-on-less-than A LU N N N A B Result Overflow Zero 3 ALUop CarryOut Start Small: A One Bit ALU • This 1-bit ALU will perform AND, OR, and ADD 1 1 0 0 1 1 1 0+ 1 0 1 0 1 - 4 - 2 - 6 1 0 0 Start Small: A One Bit ALU • This 1-bit ALU will perform AND, OR, and ADD 1 1 0 0 1 1 1 0+ 1 0 1 0 1 - 4 - 2 - 6 1 0 0 A 32-bit ALU 1-bit ALU 32-bit ALU How About Subtraction? • Keep in mind the following: • (A - B) is the same as: A + (-B) • 2’s Complement negate: inverse every bit and add 1 • Bit-wise inverse of B is !B: • A - B = A + (-B) = A + (!B + 1) = A + !B + 1 • How does this help us? How About Subtraction? • Keep in mind the following: • (A - B) is the same as: A + (-B) • 2’s Complement negate: inverse every bit and add 1 • Bit-wise inverse of B is !B: • A - B = A + (-B) = A + (!B + 1) = A + !B + 1 • How does this help us? What signals accomplish what? sign bit (adder output from bit 31) Binvert Cin Oper ADD SUB AND OR SLT BEQ MULTIPLY hardware • 64-bit Multiplicand reg, 64-bit ALU, 64-bit Product reg, 32-bit multiplier reg MULTIPLY hardware • 64-bit Multiplicand reg, 64-bit ALU, 64-bit Product reg, 32-bit multiplier reg Okay, we’re not going to go over this. Big point. Multiply requires a bunch of adds to complete, so it’s obviously going to be slower than an add. Divide is similar, but worse. DIVIDE hardware • 64-bit Divisor reg, 64-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg Floating Point Floating Point • Typically will have their own ALU (and register file!) Floating Point • Typically will have their own ALU (and register file!) • Because they require normalization, then math, then renormalization, then rounding (a lot of serialized steps)… Floating Point • Typically will have their own ALU (and register file!) • Because they require normalization, then math, then renormalization, then rounding (a lot of serialized steps)… • FP computations are always slower than integer. – Eg, Intel’s Skylake ▪ Int add 1, int mul 4, FP add 3, FP mul 5 Okay, now that we have an ALU… Let’s look beyond just Execute now. Instruction Fetch Instruction Decode Operand Fetch Execute Result Store Next Instruction CSE 141 CC BY-NC-ND Pat Pannuto – Many slides adapted from Dean Tullsen 52 Our previous view of a computer had no organization • From Part I… So what does this tell us? Register Transfer Language (RTL) • is a mechanism for describing the movement and manipulation of data between storage elements: R[3] <- R[5] + R[7] PC <- PC + 4 + R[5] R[rd] <- R[rs] + R[rt] R[rt] <- Mem[R[rs] + immed] Review: The MIPS Instruction Formats • All MIPS instructions are 32 bits long. The three instruction formats: R-type I-type J-type op target address 02631 6 bits 26 bits op rs rt rd shamt funct 061116212631 6 bits 6 bits5 bits5 bits5 bits5 bits op rs rt immediate 016212631 6 bits 16 bits5 bits5 bits The MIPS Subset • R-type – add rd, rs, rt – sub, and, or, slt • LOAD and STORE – lw rt, rs, imm16 – sw rt, rs, imm16 • BRANCH: – beq rs, rt, imm16 op rs rt rd shamt funct 061116212631 6 bits 6 bits5 bits5 bits5 bits5 bits op rs rt immediate 016212631 6 bits 16 bits5 bits5 bits op rs rt displacement 016212631 6 bits 16 bits5 bits5 bits PC = PC+4 R[rd] = R[rs] OP R[rt] PC = PC+4 R[rt] = Mem[R[rs] + SE(imm)] OR Mem[R[rs] + SE(imm)] = R[rt] ZERO = (R[rs] – R[rt] == 0) if(ZERO) PC = PC + 4+ (SE(Imm)<<2) Else PC = PC+4 Storage elements • RTL describes data movement between storage elements, but we don’t actually have our data elements yet. • So… Storage Element: Register • Register – Similar to the D Flip Flop except ▪ N-bit input and output ▪ Write Enable input – Write Enable: ▪ 0: Data Out will not change ▪ 1: Data Out will become Data In (on the clock edge) Clk Data In Write Enable N N Data Out Let’s try to make a Register File Register File Clk Write Data RegWrite 32 32 Read Data 1 32 Read Data 2 32 32-bit Registers 5 5 5 RR1 RR2 WR Memory Clk Write Data MemWrite 32 32 Read Data Address MemRead Can we layout a high-level design to do all this now? Putting it All Together: A Single Cycle Datapath • We have everything except control signals (later) Key Points • CPU is just a collection of state and combinational logic • We just designed a the datapath for a very rich processor, at least in terms of functionality – where does the single-cycle machine fit in? ET = IC * CPI * CT