Welcome to Computer Organization and Assembly!
Multicycle and Microcode
CS/COE 0447
Jarrett Billingsley
1
Class announcements
in the next few days I’ll be releasing a departmental 447 quiz
it is not required but I do ask that you do it if you can!
it’s not very long and largely covers what we had on the first exam (plus some stuff from this half of the semester)
CS447
2
Multicycle CPUs
CS447
3
Taking advantage of the phases
remember the five phases of execution?
all* instructions have F, D, X, but only some write to memory/regs
CS447
4
F
D
X
beq/j
F
D
X
W
add/sub etc.
F
D
X
M
M
…many more…
M
M
W
lw
you might be thinking “well if data memory is slow, shouldn’t fetching from instruction memory be slow too?”
well…………………………………. yes…………………………………………………………..
but shhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh don’t worry about it right now
* if you’re clever, you can make some instructions skip the X phase by doing their work during decoding (unconditional jumps, for example)
– as with every single point on every slide in every lecture, there’s an entire class’s worth of “actually it’s more complex than that”
– for now, let’s just assume instruction memory is much faster than data memory (which can be true in e.g. some embedded systems)
4
Chop chop
not all instructions take the same amount of time, so…
let’s make different instructions take different amounts of time!
and by that, we mean different numbers of clock cycles!
CS447
5
j
or
lw
j
j
j
or
or
or
or
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
lw
3 cycles
4 cycles
100 cycles?
before, we had to make the clock tick as slow as the longest instruction (lw)…
but now, it only has to tick as slow as the longest phase of any instruction.
so the clock can be much faster.
– this is cause since different parts of the CPU have different propagation delay, and different instructions combine those parts in different ways
– for practical reasons, we can’t change the length of the clock cycle for each instruction, so it’s easier to just make the clock faster and split each instruction into multiple cycles.
5
Why does this work?
because we’re wasting less time on the fast instructions.
CS447
6
Single Cycle
lw’s work
add’s work
doing nothing
doing nothing
beq’s work
Multicycle
lw’s work
add’s work
beq’s work
15 ns
1 ns
– the total time in the single cycle is 45ns (15*3), and the multicycle is only 21ns (15+3+3)
– ofc, if we’re talking a 100:1 speed ratio for the RAM, then we’re saving even more time.
– a 15 ns clock time = ~66MHz, and a 1 ns clock time = 1GHz
6
Changing the CPU design
each phase of execution has its own unit, and between phases, we insert pipeline registers to hold onto the data for the next phase.
CS447
7
Instruction Memory
F
Control
Register File
D
X
Data Memory
M
W
– why they’re called “pipeline registers” is for next time.
– the first pipeline register holds e.g. the PC and the instruction that was fetched
– the next one holds the register values, immediate value, and control signals for all the subsequent phases
– the third one holds the ALU result and one of the register values
– the fourth one holds the loaded value from memory if any
– this is just how THIS PARTICULAR CPU WORKS, THEY ARE NOT ALL DESIGNED EXACTLY LIKE THIS
7
Watching an add (animated)
let’s watch an add instruction flow through the CPU!
CS447
8
Instruction Memory
F
Control
Register File
D
X
Data Memory
M
W
add
come up with control signals…
Clock!
Clock!
add…
Clock!
Clock!
data flows back to registers…
– the “add” thing moving here is just an abstract concept of “the data and control signals needed for the add instruction”
– the encoded machine instruction isn’t actually going to every part of the CPU
8
Doing a Science to it
CS447
9
Cycles Per Instruction (CPI)
CPI is how many cycles it takes to complete one instruction.
lower is better: slower instructions take more cycles.
the best we can do is 1 CPI.
cause how could you even do an instruction in less than 1 cycle…?
one way of calculating CPI is by just measuring a program.
you measure how many cycles it took to finish…
and how many instructions were executed…
and divide!
but that won’t give us good insights about why the CPI is what it is.
for that, we have to look at what instructions the program uses.
CS447
10
– you can’t do an instruction in less than 1 cycle…
– but you CAN do MORE THAN ONE instruction per cycle, which means the CPI < 1!!!
- but for that we typically use the reciprocal, IPC (instructions per cycle)
10
Measuring the benefit
consider two programs:
program A is a billion add instructions in a row.
program B is a billion lw instructions in a row.
with the single-cycle CPU, which one is faster?
neither.
they're both a billion instructions, so both a billion cycles.
for single-cycle machines, CPI = 1 by definition.
but with the multi-cycle CPU, which one is faster?
program A. it's 3 billion cycles.
program B is like, 100 billion cycles?
of course, real programs will be more complex…
CS447
11
Calculating Average CPI
every program is different, and every program has a different instruction mix – how many of each kind of instruction it uses
let's say we have a program where 60% of the instructions are ALU (add/sub/etc), 20% are branches, 15% are lw, and 5% are sw.
CS447
12
ALU Branches lw sw
Proportion 0.6 0.2 0.15 0.05
Cycles 4 3 10 9
Weighted
1. for each category, multiply the proportion by the number of cycles for that category to get the weighted portion
2.4
0.6
1.5
0.45
2. Now sum them up:
= 4.95
this is the Average CPI for THIS program. different mixes give different CPIs!
+
+
+
- it's pretty common for loads to outnumber stores.
- think about it: if you put a value in a variable, and then use that variable several times, that's one store followed by several loads.
12
The performance equation
the question we're really asking in any performance measurement is: how long does it take to finish running a program?
we could time it, but we can also calculate it from what we know.
CS447
13
or in English, it's the product of the instruction count, the CPI, and the length of one clock cycle.
if you have the clock frequency instead, there are two ways to solve that… what are they? (algebra!)
- time = 1/frequency, so you can get the frequency's reciprocal.
- or just do (n x CPI) ÷ f.
- multiplying by the reciprocal is the same as dividing by it.
13
So how much better is it?!??!?
say we execute 500 million (5 108) instructions
for the single-cycle CPU:
CPI = 1
cycle time = 15ns (1.5 x 10-8 s)
total time = n × CPI × cycle time
= (5 × 108) × 1 ×(1.5 × 10-8)
= 5 × 1 × 1.5
= 7.5 seconds.
for the multi-cycle CPU:
CPI = 4.95 (much worse!)
cycle time = 1ns (much better!)
total time = 5 × 108 × 4.95 × 1 × 10-9
= 2.475 seconds!
CS447
14
Well it's a start
not bad! I guess?
I mean, we had to increase the clock speed by a factor of 15...
but we only got 3 times the performance out of it.
if our CPI were also close to 1, it'd be 15 times as fast as the single-cycle machine...
HMMMMMMMMMMMMMMMMMMMMMM.
CS447
15
- piiiipeliiiiining
15
Multi-cycle control
CS447
16
One problem solved (animated)
instruction fetch and lw/sw now happen on different cycles!
CS447
17
Instruction Memory
F
Control
Register File
D
X
Data Memory
M
Memory
W
we've transformed into a Von Neumann architecture.
What's a load look like now? (animated)
the numbers are clock cycles.
CS447
18
Control
Register File
Memory
instruction
data
(pipeline registers)
1. Fetch using PC
PC
computed address
lw t0, 4(s0)
2. Decode
3. Calculate s0 + 4
s0
4
s0+4
4. Load value using
the computed address
17
5. copy into
register
So how do we control it all?
what about other instructions?
loads, stores, branches…
they all have different sequences of steps
CS447
19
fetched instruction
step register
MemWrite
ALUSrc
ALUOp
etc…
RegDataSrc
this "step register" keeps track of what step of the instruction we're on
Control
oh no this looks familiar
these FSMs just will not leave us alone
in a multicycle implementation, control is an FSM
CS447
20
what's the first step of ANY instruction?
Fetch
and then?
Decode
and then?
lw
add
etc…
but eventually…
Ways of implementing that FSM
we could hardwire the FSM design.
we'd write the transition table, turn that into logic, etc.
and that's very similar to what you're doing for your project!
except with that extra "step register"
or, we could put the transition table in a special kind of memory.
then the sequence of steps for each instruction is read from it.
and if we get really crazy…
we could even change the contents of that memory.
CS447
21
Microprogramming
CS447
22
Microprogramming is when…
the control is an FSM that reads its transition table from memory.
that memory contains microcode (µCode): the small "programs" that encode the sequence of steps for each instruction.
CS447
23
MIPS instruction
µPC
µCode ROM
Sequencer
etc…
MemWrite
ALUOp
RegDataSrc
ALUSrc
each µInstruction is a set of control signals and FSM control flow information.
the µCode ROM (read-only memory) is indexed by the µPC.
the sequencer decides what step to do next (i.e. control flow through the µProgram).
- in the simplest form, it's just a dumb FSM, but it can be made arbitrarily complex
- some historical CPUs made the microcode processor SO complex that you could literally make the CPU act like different CPUs by loading different microcode… when it booted up
- yeah it got crazy there for a while
23
Read-only memory? Well…
the real power comes when we make it possible to reprogram the µcode ROM!
CS447
24
CPU
µCode ROM
it's inside the CPU, and it can be made of EEPROM or Flash
firmware.bin
so we can change it!
firmware is software which serves a very important function and is hard to change
(get it? it's softer than hardware but harder than software…)
so what could you do if you could reprogram how your CPU works?
- CLEARLY you need to have a VERY STRONG CHAIN OF TRUST when updating your CPU's firmware
24
The sky ROM size is the limit
CS447
25
add new instructions!
fix gigantic security holes in old ones!
vpmultishiftqb
phminposuw
sqrmaxrstdec
improve performance!
CPI before: 3.3
CPI now: 2.8
make add instructions subtract instead!!
haha pranked
- only sqrmaxrstdec is a fake x86 instruction. the other two are real :^)
25
So what's the catch?
CS447
26
WHO WOULD WIN?
Sequencer
µPC
µCode ROM
a complicated mini-CPU inside your CPU
a handful of
G A T E Y B O Y S
microcoded control is flexible, but that comes at a cost: speed.
hardwired (non-microcoded) control – even if it's a hardwired FSM – can be very fast.
26
The origins of RISC
by the late 70s and early 80s, CISC architectures were reaching their peak of complexity, and that meant big complex control units.
CS447
27
CISC CPU
control
registers
ALU
it wasn't unusual for the control to take up the majority of the chip.
RISC and MIPS took the risk of simplifying control...
RISC CPU
control
registers
ALU
…and this was the right risk to take!
One tool of many
microcode is absolutely still in use, even in RISC machines.
most CPUs use some hybrid of hardcoded and microcoded control.
for instance, an x86 CPU implementation might do something like…
CS447
28
inc [eax + ecx*16 + 4]
shl tmp1, ecx, 4
add tmp1, eax
add tmp1, 4
load tmp2, [tmp1]
add tmp2, 1
store [tmp1], tmp2
what the CPU core might actually execute.
the CPU fetches this…
then uses µCode to dynamically translate that into…
but most CPUs use microcode for complex, multi-step procedures; things that happen outside the "normal flow" of execution.
- one example is exceptions, which is "when your program crashes"
- like those out of bounds memory accesses
- exceptions cause a whole bunch of things to happen under a variety of different circumstances
- so having a microcoded program to do that is much easier than making a complex hardcoded FSM
- other microcode applications include interrupts (similar to exceptions), switching processes, handling virtual memory, and so on.
28
/docProps/thumbnail.jpeg