CS计算机代考程序代写 x86 chain mips flex assembly Welcome to Computer Organization and Assembly!

Welcome to Computer Organization and Assembly!

Multicycle and Microcode
CS/COE 0447
Jarrett Billingsley

1

Class announcements
in the next few days I’ll be releasing a departmental 447 quiz
it is not required but I do ask that you do it if you can!
it’s not very long and largely covers what we had on the first exam (plus some stuff from this half of the semester)
CS447
2

Multicycle CPUs
CS447
3

Taking advantage of the phases
remember the five phases of execution?
all* instructions have F, D, X, but only some write to memory/regs
CS447
4
F
D
X
beq/j
F
D
X
W
add/sub etc.
F
D
X
M
M
…many more…
M
M
W
lw
you might be thinking “well if data memory is slow, shouldn’t fetching from instruction memory be slow too?”
well…………………………………. yes…………………………………………………………..
but shhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh don’t worry about it right now

* if you’re clever, you can make some instructions skip the X phase by doing their work during decoding (unconditional jumps, for example)
– as with every single point on every slide in every lecture, there’s an entire class’s worth of “actually it’s more complex than that”
– for now, let’s just assume instruction memory is much faster than data memory (which can be true in e.g. some embedded systems)
4

Chop chop
not all instructions take the same amount of time, so…
let’s make different instructions take different amounts of time!
and by that, we mean different numbers of clock cycles!
CS447
5
j
or
lw
j
j
j
or
or
or
or
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
3 cycles
4 cycles
100 cycles?
before, we had to make the clock tick as slow as the longest instruction (lw)…
but now, it only has to tick as slow as the longest phase of any instruction.
so the clock can be much faster.

– this is cause since different parts of the CPU have different propagation delay, and different instructions combine those parts in different ways
– for practical reasons, we can’t change the length of the clock cycle for each instruction, so it’s easier to just make the clock faster and split each instruction into multiple cycles.
5

Why does this work?
because we’re wasting less time on the fast instructions.
CS447
6

Single Cycle

lw’s work

add’s work
doing nothing

doing nothing

beq’s work
Multicycle

lw’s work

add’s work

beq’s work
15 ns
1 ns

– the total time in the single cycle is 45ns (15*3), and the multicycle is only 21ns (15+3+3)
– ofc, if we’re talking a 100:1 speed ratio for the RAM, then we’re saving even more time.
– a 15 ns clock time = ~66MHz, and a 1 ns clock time = 1GHz
6

Changing the CPU design
each phase of execution has its own unit, and between phases, we insert pipeline registers to hold onto the data for the next phase.
CS447
7

Instruction Memory
F
Control
Register File
D

X
Data Memory
M

W

– why they’re called “pipeline registers” is for next time.
– the first pipeline register holds e.g. the PC and the instruction that was fetched
– the next one holds the register values, immediate value, and control signals for all the subsequent phases
– the third one holds the ALU result and one of the register values
– the fourth one holds the loaded value from memory if any
– this is just how THIS PARTICULAR CPU WORKS, THEY ARE NOT ALL DESIGNED EXACTLY LIKE THIS
7

Watching an add (animated)
let’s watch an add instruction flow through the CPU!
CS447
8

Instruction Memory
F
Control
Register File
D

X
Data Memory
M

W
add
come up with control signals…
Clock!
Clock!
add…
Clock!
Clock!
data flows back to registers…

– the “add” thing moving here is just an abstract concept of “the data and control signals needed for the add instruction”
– the encoded machine instruction isn’t actually going to every part of the CPU
8

Doing a Science to it
CS447
9

Cycles Per Instruction (CPI)
CPI is how many cycles it takes to complete one instruction.
lower is better: slower instructions take more cycles.
the best we can do is 1 CPI.
cause how could you even do an instruction in less than 1 cycle…?
one way of calculating CPI is by just measuring a program.
you measure how many cycles it took to finish…
and how many instructions were executed…
and divide!
but that won’t give us good insights about why the CPI is what it is.
for that, we have to look at what instructions the program uses.
CS447
10

– you can’t do an instruction in less than 1 cycle…
– but you CAN do MORE THAN ONE instruction per cycle, which means the CPI < 1!!! - but for that we typically use the reciprocal, IPC (instructions per cycle) 10 Measuring the benefit consider two programs: program A is a billion add instructions in a row. program B is a billion lw instructions in a row. with the single-cycle CPU, which one is faster? neither. they're both a billion instructions, so both a billion cycles. for single-cycle machines, CPI = 1 by definition. but with the multi-cycle CPU, which one is faster? program A. it's 3 billion cycles. program B is like, 100 billion cycles? of course, real programs will be more complex… CS447 11 Calculating Average CPI every program is different, and every program has a different instruction mix – how many of each kind of instruction it uses let's say we have a program where 60% of the instructions are ALU (add/sub/etc), 20% are branches, 15% are lw, and 5% are sw. CS447 12 ALU Branches lw sw Proportion 0.6 0.2 0.15 0.05 Cycles 4 3 10 9 Weighted 1. for each category, multiply the proportion by the number of cycles for that category to get the weighted portion 2.4 0.6 1.5 0.45 2. Now sum them up: = 4.95 this is the Average CPI for THIS program. different mixes give different CPIs! + + + - it's pretty common for loads to outnumber stores. - think about it: if you put a value in a variable, and then use that variable several times, that's one store followed by several loads. 12 The performance equation the question we're really asking in any performance measurement is: how long does it take to finish running a program? we could time it, but we can also calculate it from what we know. CS447 13 or in English, it's the product of the instruction count, the CPI, and the length of one clock cycle. if you have the clock frequency instead, there are two ways to solve that… what are they? (algebra!) - time = 1/frequency, so you can get the frequency's reciprocal. - or just do (n x CPI) ÷ f. - multiplying by the reciprocal is the same as dividing by it. 13 So how much better is it?!??!? say we execute 500 million (5 108) instructions for the single-cycle CPU: CPI = 1 cycle time = 15ns (1.5 x 10-8 s) total time = n × CPI × cycle time = (5 × 108) × 1 ×(1.5 × 10-8) = 5 × 1 × 1.5 = 7.5 seconds. for the multi-cycle CPU: CPI = 4.95 (much worse!) cycle time = 1ns (much better!) total time = 5 × 108 × 4.95 × 1 × 10-9 = 2.475 seconds! CS447 14 Well it's a start not bad! I guess? I mean, we had to increase the clock speed by a factor of 15... but we only got 3 times the performance out of it. if our CPI were also close to 1, it'd be 15 times as fast as the single-cycle machine... HMMMMMMMMMMMMMMMMMMMMMM. CS447 15 - piiiipeliiiiining 15 Multi-cycle control CS447 16 One problem solved (animated) instruction fetch and lw/sw now happen on different cycles! CS447 17 Instruction Memory F Control Register File D X Data Memory M Memory W we've transformed into a Von Neumann architecture. What's a load look like now? (animated) the numbers are clock cycles. CS447 18 Control Register File Memory instruction data (pipeline registers) 1. Fetch using PC PC computed address lw t0, 4(s0) 2. Decode 3. Calculate s0 + 4 s0 4 s0+4 4. Load value using the computed address 17 5. copy into register So how do we control it all? what about other instructions? loads, stores, branches… they all have different sequences of steps CS447 19 fetched instruction step register MemWrite ALUSrc ALUOp etc… RegDataSrc this "step register" keeps track of what step of the instruction we're on Control oh no this looks familiar these FSMs just will not leave us alone in a multicycle implementation, control is an FSM CS447 20 what's the first step of ANY instruction? Fetch and then? Decode and then? lw add etc… but eventually… Ways of implementing that FSM we could hardwire the FSM design. we'd write the transition table, turn that into logic, etc. and that's very similar to what you're doing for your project! except with that extra "step register" or, we could put the transition table in a special kind of memory. then the sequence of steps for each instruction is read from it. and if we get really crazy… we could even change the contents of that memory. CS447 21 Microprogramming CS447 22 Microprogramming is when… the control is an FSM that reads its transition table from memory. that memory contains microcode (µCode): the small "programs" that encode the sequence of steps for each instruction. CS447 23 MIPS instruction µPC µCode ROM Sequencer etc… MemWrite ALUOp RegDataSrc ALUSrc each µInstruction is a set of control signals and FSM control flow information. the µCode ROM (read-only memory) is indexed by the µPC. the sequencer decides what step to do next (i.e. control flow through the µProgram). - in the simplest form, it's just a dumb FSM, but it can be made arbitrarily complex - some historical CPUs made the microcode processor SO complex that you could literally make the CPU act like different CPUs by loading different microcode… when it booted up - yeah it got crazy there for a while 23 Read-only memory? Well… the real power comes when we make it possible to reprogram the µcode ROM! CS447 24 CPU µCode ROM it's inside the CPU, and it can be made of EEPROM or Flash firmware.bin so we can change it! firmware is software which serves a very important function and is hard to change (get it? it's softer than hardware but harder than software…) so what could you do if you could reprogram how your CPU works? - CLEARLY you need to have a VERY STRONG CHAIN OF TRUST when updating your CPU's firmware 24 The sky ROM size is the limit CS447 25 add new instructions! fix gigantic security holes in old ones! vpmultishiftqb phminposuw sqrmaxrstdec improve performance! CPI before: 3.3 CPI now: 2.8 make add instructions subtract instead!! haha pranked - only sqrmaxrstdec is a fake x86 instruction. the other two are real :^) 25 So what's the catch? CS447 26 WHO WOULD WIN? Sequencer µPC µCode ROM a complicated mini-CPU inside your CPU a handful of G A T E Y B O Y S microcoded control is flexible, but that comes at a cost: speed. hardwired (non-microcoded) control – even if it's a hardwired FSM – can be very fast. 26 The origins of RISC by the late 70s and early 80s, CISC architectures were reaching their peak of complexity, and that meant big complex control units. CS447 27 CISC CPU control registers ALU it wasn't unusual for the control to take up the majority of the chip. RISC and MIPS took the risk of simplifying control... RISC CPU control registers ALU …and this was the right risk to take! One tool of many microcode is absolutely still in use, even in RISC machines. most CPUs use some hybrid of hardcoded and microcoded control. for instance, an x86 CPU implementation might do something like… CS447 28 inc [eax + ecx*16 + 4] shl tmp1, ecx, 4 add tmp1, eax add tmp1, 4 load tmp2, [tmp1] add tmp2, 1 store [tmp1], tmp2 what the CPU core might actually execute. the CPU fetches this… then uses µCode to dynamically translate that into… but most CPUs use microcode for complex, multi-step procedures; things that happen outside the "normal flow" of execution. - one example is exceptions, which is "when your program crashes" - like those out of bounds memory accesses - exceptions cause a whole bunch of things to happen under a variety of different circumstances - so having a microcoded program to do that is much easier than making a complex hardcoded FSM - other microcode applications include interrupts (similar to exceptions), switching processes, handling virtual memory, and so on. 28 /docProps/thumbnail.jpeg