Welcome to Computer Organization and Assembly!
Performance
CS/COE 0447
Jarrett Billingsley
1
Class announcements
proj2 is out!!
CS447
2
First, a tangent
Scientific Notation
and SI Prefixes
CS447
3
Numbers mean science
when dealing with tiny fractions of time or many operations per second, scientific notation is the best representation.
CS447
4
small numbers have negative exponents. large numbers have positive exponents.
1.83 × 10-5
7.50 × 107
= 0.0000183
= 75,000,000
when you move the decimal point left, you add to the exponent.
18.3 × 10-6
= 1.83 × 10-5
75.0 × 106
= 7.50 × 107
and when you move the decimal point right, you subtract from the exponent.
-6 + 1 = -5!!!
– adding/subtracting to the exponent fucks me up too
4
Reciprocals…
probably one of the most common mistakes:
CS447
5
negate the exponent…
and get the reciprocal of the significand!!!
= 0.394 × 10-3
= 3.94 × 10-4
– do you know how many times I have had to write “1/8 != 8” on exams? do you?
5
Multiplication is commutative and associative
when multiplying, group the significands together, and add the exponents of the bases.
CS447
6
(4.9 × 102) × (7.2 × 108) × (3.3 × 10-5)
= (4.9 × 7.2 × 3.3) × (102 × 108 × 10-5)
= 116.424 × 105
= 1.16424 × 107
6
SI Prefixes
we deal with some pretty big and small numbers.
it’s good to memorize these prefixes.
CS447
7
(T)era 1012
(G)iga 109
(M)ega 106
(k)ilo 103
(m)illi 10-3
(µ)icro 10-6
(n)ano 10-9
(p)ico 10-12
for practical reasons, micro is sometimes typed as u (e.g. 10 us instead of 10 µs)
What is performance?
CS447
8
The layman’s understanding
your ancient computer takes 30 seconds to open the browser
you get a new computer. it opens the browser in 3 seconds.
which computer is faster?
CS447
9
yeah but this is computer science, not computer guessing.
– a lot of science is guessing tho.
9
We had these Dells in SENSQ when I was here
you wanna copy a CD as many times as you can in 12 minutes
both the PC and this… thing take 4 minutes to copy
which device will make more copies in 12 minutes? why?
numbers usually mean we’re getting more science-y, right?
CS447
10
10
Response time and throughput
response time is the length of time from start to finish
throughput is the amount of work you can do in a span of time
CS447
11
30 seconds
3s
response time
(time per task)
throughput
(tasks per time)
12 minutes
with response time, we make comparisons using the same task;
and with throughput, we make comparisons using the same time.
Response time can improve throughput!
you put a brand new 52X CD burner in your sweet Dell. it burns a CD in only 2 minutes.
CS447
12
4min
4min
4min
2min
2min
2min
2min
2min
2min
the response time for a single CD burn is improved, but this also causes our throughput to double!
this is because the measurement period stayed the same (12min)
4min
4min
4min
4min
4min
4min
4min
4min
Throughput can improve response time!
someone wants you to make them 20 copies ASAP
how long would it take with one CD duplicator?
how long would it take with two?
CS447
13
with one:
20min
with two:
12min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
4min
this is because the workload stayed the same (20 copies)
Applying it to a CPU
the CPU’s job is to run instructions. so we could…
do each instruction faster (i.e. reduce latency)
do more instructions at once (i.e. increase throughput)
for a long time, we did the former…
clock speed increased by 3 orders of magnitude from 1980 to 2000
but then we hit a wall.
and that’s when multi-core CPUs became common.
CS447
14
– this is a massive oversimplification of the history of performance but it’s the basic idea.
– we were increasing throughput like crazy in the 90s too, but with instruction-level parallelism rather than multiple cores
14
Reducing latency
you put a new Pentium 4 in your sweet Dell. it executes a single instruction in only 0.8 nanoseconds.
CS447
15
1.6ns
1.6ns
1.6ns
.8ns
.8ns
.8ns
.8ns
.8ns
.8ns
if each instruction takes only 0.8ns, that means 0.8ns between clock pulses. How fast is the clock running?
well how fast could we possibly make it run?
(frequency = 1/time)
– it would be a very good idea to review SI prefixes and what powers they mean.
15
Real-world clocking issues
CS447
16
The Critical Path
the critical path is the path through a circuit that requires the longest series of sequential operations
they depend on each other and can’t be done in parallel!
for the following, what’s the length of the critical path, in gates?
CS447
17
– the top left one is 2 (NOT, then AND)
– the bottom left one is 3 (NOT, AND, OR)
– the right one is 1 (no matter what path, you only go through 1 OR gate)
17
Propagation Delay
propagation delay is the time it takes for a signal to pass from the inputs to the outputs
or in a loop-shaped circuit, how long it takes for the signal to make “one lap” around the loop
during that delay, the outputs are invalid (they can fluctuate)
after that delay, the outputs are valid
if you try to use the output while it’s invalid, things break
stuff like 2 + 2 = 17??
so there are limits to how fast your processor can run
CS447
18
– so it stands to reason: the faster you want your circuit to run, the shorter the propagation delays should be
18
Determining clock speed (animated)
CS447
19
time
A is clocked
A’s Q becomes valid
the adder has finished; clock A to store
0ns
2ns
5ns
1
A
D
Q
+
19
Determining clock speed
it takes 5ns for a signal to propagate through our circuit
how fast can we clock it?
if the time between clocks is less than 5ns, we’ll clock the register too early (while the adder’s outputs are invalid)
if the time between clocks is more than 5ns, no big deal
to summarize, the fastest we can clock a sequential circuit is the reciprocal of the critical path’s propagation delay
CS447
20
– if there is more time between the clocks, we call that “slack”
20
Clock Skew (animated)
the clock signal itself isn’t immune to propagation delay!
CS447
21
IN
CLK
12
???
B
D Q
EN
A
D Q
EN
12
???
watch the input as the clock pulse travels down the wire to B.
this is a race condition: the data and clock are having a race, and the outcome depends on who wins
the winner could change based on temperature, power, etc!
– fortunately you don’t have to deal with this in logisim.
– but this is another reason there’s a limit to the clock speed.
21
Register File
imm field
ALU
MemWrite
ALUSrc
ALUOp
rd
RegWrite
RegDataSrc
rs
rt
Data Memory
Data
Address
How fast can we clock a CPU?
what’s the thing that limits the clock speed?
the critical path. in our case it happens to be…
and here is the Achilles’ heel of a single-cycle datapath:
CS447
22
Memory
is SLOW.
A.
F.
– we’ve been learning about (and making) a single-cycle CPU because it’s easier to understand, but in real life, single-cycle is a Bad Idea™
– 447 is about “how to make a CPU”
– 1541 is “how to make a CPU run at a remotely acceptable speed”
22
It’s bad.
typical access times for modern DDR4 RAM is 12-15 ns
that’s a single-cycle CPU’s critical path time
the inverse of that is… 66-83 MHz
YEESH
the lw and sw instructions are holding us back
all the other instructions are gonna be WAY faster
CS447
23
add
beq
or
lw/sw
– RAM has very good throughput (e.g. 128- or 256-bit wide interfaces), but even then the bandwidth (throughput over time) is not great
– so we have caches
– 1% of the instructions are using 99% of the time
23
The Law of Diminishing Returns
(“Amdahl’s Law” as it’s misnamed)
CS447
24
Don’t waste your time…
suppose you’re trying to get better at time management
you got an app that lets you time how long you do stuff
CS447
25
if you wanted to get more free time by halving the amount of time it takes to do one task, which task would you choose?
25
Time
Watching Youtube Working Meals Commuting Hygiene 6 6 2 1 1
If you cut Youtube by half…
look at all the extra free time you have!
CS447
26
Time
Watching Youtube Working Meals Commuting Hygiene Free time! 3 6 2 1 1 3
If you cut commuting by half…
look at … all the extra free time… you have.
CS447
27
Time
Watching Youtube Working Meals Commuting Hygiene Free time! 6 6 2 0.5 1 0.5
I am the law
say you have a process that has three parts
CS447
28
7s
2s
1s
these three parts sum to 10 seconds.
if we try to improve the blue (right) part, what’s the shortest we can make the whole process?
what about the green (middle) part?
what about the red (left) part?
if you improve a feature’s performance, the overall performance is only affected proportionally to how much the feature is used.
this means…
focus improvement on things that are used the most.
– if we shave off at most 1 second, we can make the whole process 9 seconds.
– the green, 8 seconds.
– the red, 3 seconds.
28
So what should we focus on?
if lw/sw take way longer than the rest…
maybe we should focus on those?
but what if our program is only 0.1% loads and stores? is it worth all the extra effort to make them faster if they don’t contribute much to the overall time?
but even if we make them way better…
they’re still gonna be way slower.
and our clock is still gonna have to be slow.
SO…………………….
CS447
29
/docProps/thumbnail.jpeg