Lecture 4: Instructions: Language of
the Computer (3/3)
John Owens
Introduction to Computer Architecture UC Davis EEC 170, Winter 2021
From last time …
▪ What instructions look like
– RISC-V: 32 bit instructions, dierent types (R, I, S, and more)
– RISC-V: Instructions either compute something or move something to/from memory
– Last lecture: logical, branch, jump instructions ▪ Calling procedures
– Arguments and return values
– Jump instructions and return addresses
– Saving registers and RISC-V register conventions
– The stack, and memory regions
69 UC Davis EEC 170, Winter 2021 / © John Owens
Character Data
▪ Byte-encoded character sets – ASCII: 128 characters
– 95 graphic, 33 control – Latin-1: 256 characters
– ASCII, +96 more graphic characters ▪ Unicode: 32-bit character set
– Used in Java, C++ wide characters, …
– Most of the world’s alphabets, plus symbols
– UTF-8, UTF-16: variable-length encodings
70
UC Davis EEC 170, Winter 2021 / © John Owens
§2.9 Communicating with People
Byte/Halfword/Word Operations
▪
RISC-V byte/halfword/word load/store
– Load byte/halfword/word: Sign extend to 64 bits in rd
-lb rd, offset(rs1) -lh rd, offset(rs1) -lw rd, offset(rs1)
– Load byte/halfword/word unsigned: Zero extend to 64 bits in rd
-lbu rd, offset(rs1) -lhu rd, offset(rs1) -lwu rd, offset(rs1)
– Store byte/halfword/word: Store rightmost 8/16/32 bits
-sb rs2, offset(rs1) -sh rs2, offset(rs1) -sw rs2, offset(rs1)
71
UC Davis EEC 170, Winter 2021 / © John Owens
String Copy Example ▪ C code:
– Null-terminated string
void strcpy (char x[], char y[]) {
size_t i;
i = 0;
while ((x[i]=y[i]) != ‘\0′)
i += 1; }
// C idiom: while (*x++ = *y++);
72
UC Davis EEC 170, Winter 2021 / © John Owens
String Copy Example ▪ RISC-V code:
x19: save, use as temporary
x6, x7: don’t save, also temporaries arguments: x10 = &y, x11 = &x
no return value
strcpy:
addi sp,sp,-8
sd x19,0(sp)
add x19,x0,x0 // i=0 (x19 contains i)
lbu x6,0(x5) add x7,x19,x11 sb x6,0(x7) beq x6,x0,L2 addi x19,x19,1 jal x0,L1
L2: ld x19,0(sp)
addi sp,sp,8
jalr x0,0(x1)
// x6 = y[i]
// x11 = &x; x7 = addr of x[i]
// x[i] = y[i]
// if y[i] == 0 then exit
// i = i + 1
// next iteration of loop
// restore saved x19
// pop 1 doubleword from stack
// and return
// adjust stack for 1 doubleword
// push x19
L1: add x5,x19,x10 // x10 = &y; x5 = addr of y[i]
73
UC Davis EEC 170, Winter 2021 / © John Owens
32-bit Constants
▪ Most constants are small
– 12-bit immediate is sucient
▪ For the occasional 32-bit constant lui rd, constant
– Copies 20-bit constant to bits [31:12] of rd
– Extends bit 31 to bits [63:32]
– Clears bits [11:0] of rd to 0 lui x19, 976 // 0x003D0
addi x19,x19,1280 // 0x500
0000 0000 0000 0000
0000 0000 0000 0000
0000 0000 0011 1101 0000
0000 0000 0000
0000 0000 0000 0000
0000 0000 0000 0000
74
UC Davis EEC 170, Winter 2021 / © John Owens
0000 0000 0011 1101 0000
0101 0000 0000
§2.10 RISC-V Addressing for Wide Immediates and Addresses
Branch Addressing
▪ Branch instructions specify
– Opcode, two registers, target address
▪ Most branch targets are near branch – Forward or backward
▪ SB format: imm[12]
imm[11]
imm [10:5]
rs2
rs1
funct3
imm [4:1]
opcode
– “The address uses an unusual encoding, which simpli#es data path design but complicates assembly.”
▪ PC-relative addressing
– Target address = PC + immediate $ 2
– Why 2? “The RISC-V architects wanted to support the possibility of instructions that are 2 bytes long.”
75
UC Davis EEC 170, Winter 2021 / © John Owens
Jump Addressing
▪
▪
Use jal x0, Label to jump (goto) to Label UJ format:
Jump and link (jal) target uses 20-bit immediate for larger range – Also uses PC-relative addressing
–
(unconditional jump)
imm[10:1]
imm[19:12]
rd
opcode
▪
For long jumps, eg, to 32-bit absolute address
– lui: load address[31:12] to temp register
– jalr: add address[11:0] and jump to target
imm[20] imm[11]
5 bits
7 bits
76
UC Davis EEC 170, Winter 2021 / © John Owens
RISC-V Addressing Summary
77
UC Davis EEC 170, Winter 2021 / © John Owens
RISC-V Encoding Summary
78
UC Davis EEC 170, Winter 2021 / © John Owens
Synchronization
▪ Two processors sharing an area of memory
– P1 writes, then P2 reads
– Data race if P1 and P2 don’t synchronize
– Result depends of order of accesses ▪ Example (next slide):
– load balance from memory to register
– add $1 to register value
– store balance from register to memory
79
UC Davis EEC 170, Winter 2021 / © John Owens
§2.11 Parallelism and Instructions: Synchronization
Synchronization example
Suppose two cash machines, A and B, are both working on a deposit at the same time. Here’s how
the deposit() step typically breaks down into low-level processor instructions:
get balance (balance=0)
add 1
write back the result (balance=1)
When A and B are running concurrently, these low-level instructions interleave with each other
(some might even be simultaneous in some sense, but let’s just worry about interleaving for now):
A get balance (balance=0)
A add 1
A write back the result (balance=1)
A get balance (balance=0)
A add 1
A write back the result (balance=1)
B get balance (balance=1)
B add 1
B write back the result (balance=2)
This interleaving is fine – we end up with balance 2, so both A and B successfully put in a dollar.
But what if the interleaving looked like this:
B get balance (balance=0)
B add 1
B write back the result (balance=1)
http://web.mit.edu/6.005/www/fa14/classes/17-concurrency/#interleaving 80 UC Davis EEC 170, Winter 2021 / © John Owens
Synchronization
▪ Two processors sharing an area of memory
– P1 writes, then P2 reads
– Data race if P1 and P2 don’t synchronize
– Result depends of order of accesses ▪ Hardware support required
– Atomic read/write memory operation
– No other access to the location allowed between the read and write
▪ Could be a single instruction
–
– Or an atomic pair of instructions
E.g., atomic swap of register ↔ memory 81
UC Davis EEC 170, Winter 2021 / © John Owens
Synchronization in RISC-V
▪ Load reserved: lr.d rd,(rs1)
– Load from address in rs1 to rd
– Place reservation on memory address
▪ Store conditional: sc.d rd,(rs1),rs2
– Store from rs2 (the value to be stored) to address in rs1
–
– Fails if location is changed
– Returns non-zero value in rd
Succeeds if location not changed since the lr.d – Returns 0 in rd
82
UC Davis EEC 170, Winter 2021 / © John Owens
Synchronization in RISC-V
▪
Example 1: atomic swap (to test/set lock variable)
again: lr.d x10,(x20)
sc.d x11,(x20),x23 // X11 = status
bne x11,x0,again // branch if store failed
▪
addi x23,x10,0
Example 2: lock
again:
–
// “locked” == 1
// read lock
// X23 = loaded value
// X23 and Mem[x20] have swapped
addi x12,x0,1
lr.d x10,(x20)
bne x10,x0,again // check if it is 0 yet
sc.d x11,(x20),x12 // attempt to store “locked” == 1 bne x11,x0,again // branch if fails
Unlock:
sd x0,0(x20) // free lock
83
UC Davis EEC 170, Winter 2021 / © John Owens
Translation and Startup
Many compilers produce object modules directly
84
UC Davis EEC 170, Winter 2021 / © John Owens
Static linking
§2.12 Translating and Starting a Program
Assembler tasks
▪ Translate assembly instructions into binary
▪ Do stu that makes assembly writers’ job easier
–
– Pseudoinstructions:
Translate labels to osets (beq a1, a2, Label)
–
not in instruction set
– – –
If it’s small enough, assembler generates addi If it’s bigger, lui then addi
mv is a copy instruction (not in instruction set)
li is “load immediate” (load a number into a register),
85 UC Davis EEC 170, Winter 2021 / © John Owens
Producing an Object Module
▪ Assembler (or compiler) translates program into machine instructions
▪ Provides information for building a complete program from the pieces (following is Unix):
– Header: describes contents of object module
– Text segment: machine code
– Static data segment: data allocated for the life of the program
– Relocation info: which instructions/data words depend on absolute addresses in this program?
– Address space layout randomization (e.g.) requires this
– Symbol table: labels that are not de#ned (external references)
– Debug info: for associating with source code
86 UC Davis EEC 170, Winter 2021 / © John Owens
Linking Object Modules
▪ ▪
▪
Could leave location dependencies for #xing by a relocating loader
– But with virtual memory, no need to do this
– Program can be loaded into absolute location in virtual memory space
▪
Nice example in the book (p. 128)
87 UC Davis EEC 170, Winter 2021 / © John Owens
Much faster to link than recompile Produces an executable image
1. Merges segments
2. Resolve labels (determine their addresses)
3. Patch location-dependent and external refs
Loading a Program
▪ Load from image #le on disk into memory
1. 2. 3.
4. 5. 6.
Read header to determine segment sizes Create virtual address space
Copy text and initialized data into memory
– Or set page table entries so they can be faulted in Set up arguments on stack
Initialize registers (including sp, fp, gp)
Jump to startup routine
– Copies arguments to x10, … and calls main
– When main returns, do exit syscall
88 UC Davis EEC 170, Winter 2021 / © John Owens
Dynamic Linking
▪ Only link/load library procedure when it is called
– Requires procedure code to be relocatable
– Avoids image bloat caused by static linking of all (transitively) referenced libraries
– Automatically picks up new library versions
89 UC Davis EEC 170, Winter 2021 / © John Owens
Eect of Compiler Optimization ▪ Compiled with gcc for Pentium 4 under Linux
2.6 1.95 1.3 0.65
120000
90000
60000
30000
Relative Performance
Instruction count
00
none O1
O2 O3
none O1 160000
120000
80000
40000
0 none
O2 O3
Clock Cycles
CPI
O1
O2 O3
1.8 1.35 0.9 0.45 0
none O1
O2 O3
90
UC Davis EEC 170, Winter 2021 / © John Owens
Eect of Language and Algorithm
3 2.25 1.5 0.75 0
2 1.5 1 0.5 0
3000
2250
1500
750 0
C/none C/O1 C/O2
C/O3 Java/int
Java/JIT
Quicksort Relative Performance
C/none C/O1 C/O2
C/O3 Java/int
Java/JIT
Bubblesort Relative Performance
Quicksort vs. Bubblesort Speedup
C/none C/O1 C/O2
C/O3 Java/int
Java/JIT
91
UC Davis EEC 170, Winter 2021 / © John Owens
Lessons Learnt
▪ Instruction count and CPI are not good performance indicators in isolation
▪ Compiler optimizations are sensitive to the algorithm
▪ Java/JIT compiled code is signi#cantly faster than JVM
interpreted
– Comparable to optimized C in some cases
▪ Nothing can #x a dumb algorithm!
92 UC Davis EEC 170, Winter 2021 / © John Owens
MIPS Instructions
▪ MIPS: commercial predecessor to RISC-V ▪ Similar basic set of instructions
– 32-bit instructions
– 32 general purpose registers, register 0 is always 0
– 32 %oating-point registers
– Memory accessed only by load/store instructions
– Consistent use of addressing modes for all data sizes ▪ Dierent conditional branches
– For <, <=, >, >=
– RISC-V: blt, bge, bltu, bgeu
– MIPS: slt, sltu (set less than, result is 0 or 1)
– Then use beq, bne to complete the branch
96 UC Davis EEC 170, Winter 2021 / © John Owens
§2.16 Real Stu!: MIPS Instructions
Instruction Encoding: RISC-V vs. MIPS
97
UC Davis EEC 170, Winter 2021 / © John Owens
The Intel x86 ISA
▪ Evolution with backward compatibility – 8080 (1974): 8-bit microprocessor
– Accumulator, plus 3 index-register pairs – 8086 (1978): 16-bit extension to 8080
– Complex instruction set (CISC)
– 8087 (1980): %oating-point coprocessor
– Adds FP instructions and register stack – 80286 (1982): 24-bit addresses, MMU
– Segmented memory mapping and protection – 80386 (1985): 32-bit extension (now IA-32)
– Additional addressing modes and operations
– Paged memory mapping as well as segments
98 UC Davis EEC 170, Winter 2021 / © John Owens
§2.17 Real Stu!: x86 Instructions
The Intel x86 ISA
▪ Further evolution…
– i486 (1989): pipelined, on-chip caches and FPU
– Compatible competitors: AMD, Cyrix, …
– Pentium (1993): superscalar, 64-bit datapath
– Later versions added MMX (Multi-Media eXtension) instructions
– The infamous FDIV bug
– Pentium Pro (1995), Pentium II (1997)
– New microarchitecture (see Colwell, The Pentium Chronicles) – Pentium III (1999)
– Added SSE (Streaming SIMD Extensions) and associated registers – Pentium 4 (2001)
– New microarchitecture
– Added SSE2 instructions
99 UC Davis EEC 170, Winter 2021 / © John Owens
The Intel x86 ISA
▪
▪
– AVX: Advanced Vector Extension (announced 2008)
– Longer SSE registers, more instructions
– AVX-512 proposed 2013, implemented in Skylake (x86) 2017
And further…
– AMD64 (2003): extended architecture to 64 bits
– EM64T – Extended Memory 64 Technology (2004)
– AMD64 adopted by Intel (with re#nements)
– Added SSE3 instructions
– Intel Core (2006)
– Added SSE4 instructions, virtual machine support
– AMD64 (announced 2007): SSE5 instructions – Intel declined to follow, instead…
If Intel didn’t extend with compatibility, its competitors would! – Technical elegance & market success
100
UC Davis EEC 170, Winter 2021 / © John Owens
Basic x86 Registers
101
UC Davis EEC 170, Winter 2021 / © John Owens
Basic x86 Addressing Modes ▪ Two operands per instruction
Source/dest operand
Second source operand
Register
Register
Register
Immediate
Register
Memory
Memory
Register
Memory
Immediate
▪ Memory addressing modes
– Address in register
– Address = Rbase + displacement
– Address = Rbase + 2scale $ Rindex (scale = 0, 1, 2, or 3) – Address = Rbase + 2scale $ Rindex + displacement
102
UC Davis EEC 170, Winter 2021 / © John Owens
x86 Instruction Encoding ▪ Variable length encoding
– Post#x bytes specify addressing mode
– Pre#x bytes modify operation
– Operand length, repetition, locking, …
103
UC Davis EEC 170, Winter 2021 / © John Owens
Implementing IA-32
▪ Complex instruction set makes implementation dicult
– Hardware translates instructions to simpler microoperations
– Simple instructions: 1–1
– Complex instructions: 1–many
– Microengine similar to RISC
– Market share makes this economically viable
▪ Comparable performance to RISC
– Compilers avoid complex instructions
104 UC Davis EEC 170, Winter 2021 / © John Owens
Other RISC-V Instructions
▪ Base integer instructions (RV64I)
– Those previously described, plus
– auipc rd, immed // rd = (imm<<12) + pc
- follow by jalr (adds 12-bit immed) for long jump
- slt, sltu, slti, sltui: set less than (like MIPS)
- addw, subw, addiw: 32-bit add/sub
- sllw, srlw, srlw, slliw, srliw, sraiw: 32-bit shift
▪ 32-bit variant: RV32I
- registers are 32-bits wide, 32-bit operations
105 UC Davis EEC 170, Winter 2021 / © John Owens
§2.18 The Rest of the RISC-V Instruction Set
Instruction Set Extensions
▪ M: integer multiply, divide, remainder ▪ A: atomic memory operations
▪ F: single-precision %oating point
▪ D: double-precision %oating point
▪ C: compressed instructions
- 16-bit encoding for frequently used instructions
106 UC Davis EEC 170, Winter 2021 / © John Owens
Fallacies
▪ Powerful instruction ⇒ higher performance
- Fewer instructions required
- But complex instructions are hard to implement
- May slow down all instructions, including simple ones
- Compilers are good at making fast code from simple instructions
▪ Use assembly code for high performance
- But modern compilers are better at dealing with modern
processors
More lines of code ⇒ more errors and less productivity
-
107 UC Davis EEC 170, Winter 2021 / © John Owens
§2.19 Fallacies and Pitfalls
Fallacies
▪ Backward compatibility ⇒ instruction set doesn’t change - But they do accrue more instructions
108
UC Davis EEC 170, Winter 2021 / © John Owens
x86 instruction set
Pitfalls
▪ Sequential words are not at sequential addresses
- MIPS-V addresses are byte addresses
- Increment by 4 or 8, not by 1!
▪ Keeping a pointer to an automatic variable after procedure returns
- e.g., passing pointer back via an argument
- Pointer becomes invalid when stack popped
109 UC Davis EEC 170, Winter 2021 / © John Owens
Concluding Remarks
▪ Design principles
- 1. Simplicity favors regularity
- 2. Smaller is faster
- 3. Good design demands good compromises
▪ Make the common case fast ▪ Layers of software/hardware
- Compiler, assembler, hardware ▪ RISC-V: typical of RISC ISAs
- c.f. x86
110 UC Davis EEC 170, Winter 2021 / © John Owens
§2.20 Concluding Remarks
We likely don’t have time for the next few slides ▪ Great example though!
111 UC Davis EEC 170, Winter 2021 / © John Owens
C Sort Example
▪ Illustrates use of assembly instructions for a C bubble sort function
▪ Swap procedure (leaf)
void swap(long long int v[],
long long int k)
{
long long int temp;
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
}
- v in x10, k in x11, temp in x5 112
UC Davis EEC 170, Winter 2021 / © John Owens
§2.13 A C Sort Example to Put It All Together
The Procedure Swap
swap:
slli x6,x11,3
add x6,x10,x6
ld x5,0(x6)
ld x7,8(x6)
sd x7,0(x6)
sd x5,8(x6)
jalr x0,0(x1)
// reg x6 = k * 8
// reg x6 = v + (k * 8)
// reg x5 (temp) = v[k]
// reg x7 = v[k + 1]
// v[k] = reg x7
// v[k+1] = reg x5 (temp)
// return to calling routine
113
UC Davis EEC 170, Winter 2021 / © John Owens
The Sort Procedure in C ▪ Non-leaf (calls swap)
void sort (long long int v[], size_t n)
{
size_t i, j;
for (i = 0; i < n; i += 1) {
for (j = i – 1;
j >= 0 && v[j] > v[j + 1];
j -= 1) {
swap(v,j); }
}
}
– v in x10, n in x11, i in x19, j in x20 114
UC Davis EEC 170, Winter 2021 / © John Owens
The Outer Loop ▪ Skeleton of outer loop:
– for (i = 0; i
addi x20,x19,-1 // j = i −1
for2tst:
blt x20,x0,exit2 // go to exit2 if X20 < 0 (j < 0)
slli x5,x20,3
add x5,x10,x5
ld x6,0(x5)
ld x7,8(x5)
ble x6,x7,exit2
mv x21, x10
mv x22, x11
mv x10, x21
mv x11, x20
jal x1,swap
addi x20,x20,-1
j for2tst
exit2:
// reg x5 = j * 8
// reg x5 = v + (j * 8)
// reg x6 = v[j]
// reg x7 = v[j + 1]
// go to exit2 if x6 ≤ x7
// copy parameter x10 into x21
// copy parameter x11 into x22
// first swap parameter is v
// second swap parameter is j
// call swap
// j –= 1
// branch to test of inner loop
116
UC Davis EEC 170, Winter 2021 / © John Owens
Preserving Registers
▪
Preserve saved registers:
addi sp,sp,-40 // make room on stack for 5 regs
sd x1,32(sp) // save x1 on stack
sd x22,24(sp) // save x22 on stack
sd x21,16(sp) // save x21 on stack
sd x20,8(sp) // save x20 on stack ▪sd x19,0(sp) // save x19 on stack
Restore saved registers:
exit1:
sd x19,0(sp) // restore x19 from stack
sd x20,8(sp) // restore x20 from stack
sd x21,16(sp) // restore x21 from stack
sd x22,24(sp) // restore x22 from stack
sd x1,32(sp) // restore x1 from stack
addi sp,sp, 40 // restore stack pointer
jalr x0,0(x1)
117
UC Davis EEC 170, Winter 2021 / © John Owens