CSE 371 Computer Organization and Design
CIS 501 | Dr. | ISAs & Single Cycle
Computer Organization and Design
Copyright By PowCoder代写 加微信 powcoder
Unit 4: Single-Cycle Datapath
Based on slides by Profs. , C.J. Taylor, &
CIS 501 | Dr. | ISAs & Single Cycle
This Unit: Single-Cycle Datapath
Overview of ISAs
Datapath storage elements
MIPS Datapath
MIPS Control
System software
credential:
bring a computer
This can be an hidden slide. I just want to use this to do my own planning.
I have rearranged Culler’s lecture slides slightly and add more slides. This covers everything he covers in his first lecture (and more) but may
We will save the fun part, “ Levels of Organization,” at the end (so student can stay awake): I will show the internal stricture of the SS10/20.
Notes to Patterson: You may want to edit the slides in your section or add extra slides to taylor your needs.
CIS 501 | Dr. | ISAs & Single Cycle
Sections 4.1 – 4.4
Recall from CIS240…
CIS 501 | Dr. | ISAs & Single Cycle
240 Review: Applications
Applications (Firefox, iTunes, Skype, Word, Google)
Run on hardware … but how?
CIS 501 | Dr. | ISAs & Single Cycle
System software
240 Review: I/O
Apps interact with us & each other via I/O (input/output)
With us: display, sound, keyboard, mouse, touch-screen, camera
With each other: disk, network (wired or wireless)
Most I/O proper is analog-digital and domain of EE
I/O devices present rest of computer a digital interface (1s and 0s)
CIS 501 | Dr. | ISAs & Single Cycle
System software
240 Review: OS
I/O (& other services) provided by OS (operating system)
A super-app with privileged access to all hardware
Abstracts away a lot of the nastiness of hardware
Virtualizes hardware to isolate programs from one another
Each application is oblivious to presence of others
Simplifies programming, makes system more robust and secure
Privilege is key to this
Commons OSes are Windows, Linux, MacOS
CIS 501 | Dr. | ISAs & Single Cycle
System software
240 Review: ISA
App/OS are software … execute on hardware
HW/SW interface is ISA (instruction set architecture)
A “contract” between SW and HW
Encourages compatibility, allows SW/HW to evolve independently
Functional definition of HW storage locations & operations
Storage locations: registers, memory
Operations: add, multiply, branch, load, store, etc.
Precise description of how to invoke & access them
Instructions (bit-patterns hardware interprets as commands)
CIS 501 | Dr. | ISAs & Single Cycle
System software
240 Review: LC4 ISA
LC4: a toy ISA you know
16-bit ISA (what does this mean?)
16-bit insns
8 registers (integer)
~30 different insns
Simple OS support
Assembly language
Human-readable ISA representation
CIS 501 | Dr. | ISAs & Single Cycle
System software
array .BLKW #100
sum .FILL #0
CONST R5, #0
LEA R1, array
LEA R2, sum
array_sum_loop
LDR R3, R1, #0
LDR R4, R2, #0
ADD R4, R3, R4
STR R4, R2, #0
ADD R1, R1, #1
ADD R5, R5, #1
CMPI R5, #100
BRn array_sum_loop
371/501 Preview: A Real ISA
MIPS: example of real ISA
32/64-bit operations
32-bit insns
64 registers
32 integer, 32 floating point
~100 different insns
Full OS support
CIS 501 | Dr. | ISAs & Single Cycle
System software
array: .space 100
sum: .word 0
array_sum:
la $1, array
la $2, sum
array_sum_loop:
lw $3, 0($1)
lw $4, 0($2)
add $4, $3, $4
sw $4, 0($2)
addi $1, $1, 1
addi $5, $5, 1
li $6, 100
blt $5, $6, array_sum_loop
Example code is MIPS, but
all ISAs are similar at some level
240 Review: Program Compilation
Program written in a “high-level” programming language
C, C++, Java, C#
Hierarchical, structured control: loops, functions, conditionals
Hierarchical, structured data: scalars, arrays, pointers, structures
Compiler: translates program to assembly
Parsing and straight-forward translation
Compiler also optimizes
Compiler itself another application … who compiled compiler?
CIS 501 | Dr. | ISAs & Single Cycle
System software
int array[100], sum;
void array_sum() {
for (int i=0; i<100;i++) {
sum += array[i];
240 Review: Assembly Language
Assembly language
Human-readable representation
Machine language
Machine-readable representation
1s and 0s (often displayed in “hex”)
Translates assembly to machine
CIS 501 | Dr. | ISAs & Single Cycle
System software
x9A00 CONST R5, #0
x9200 CONST R1, array
xD320 HICONST R1, array
x9464 CONST R2, sum
xD520 HICONST R2, sum
x6640 LDR R3, R1, #0
x6880 LDR R4, R2, #0
x18C4 ADD R4, R3, R4
x7880 STR R4, R2, #0
x1261 ADD R1, R1, #1
x1BA1 ADD R5, R5, #1
x2B64 CMPI R5, #100
x03F8 BRn array_sum_loop
Machine code
Assembly code
240 Review: Insn Execution Model
The computer is just finite state machine
Registers (few of them, but fast)
Memory (lots of memory, but slower)
Program counter (next insn to execute)
Sometimes called “instruction pointer”
A computer executes instructions
Fetches next instruction from memory
Decodes it (figure out what it does)
Reads its inputs (registers & memory)
Executes it (adds, multiply, etc.)
Write its outputs (registers & memory)
Next insn (adjust the program counter)
Program is just “data in memory”
Makes computers programmable (“universal”)
CIS 501 | Dr. | ISAs & Single Cycle
System software
Read Inputs
Write Output
Instruction Insn
Role of the Compiler
CIS 501 | Dr. | ISAs & Single Cycle
CIS 501 | Dr. | ISAs & Single Cycle
Compiler Optimizations
Primarily goal: reduce instruction count
Eliminate redundant computation, keep more things in registers
Registers are faster, fewer loads/stores
An ISA can make this difficult by having too few registers
Reduce branches and jumps (later)
Reduce cache misses (later)
Reduce dependences between nearby insns (later)
An ISA can make this difficult by having implicit dependences
How effective are these?
Can give 4X performance over unoptimized code
Collective wisdom of 40 years (“Proebsting’s Law”): 4% per year
Allows higher-level languages to perform adequately (Javascript)
Compiler Optimization Example (LC4)
Left: common sub-expression elimination
Remove calculations whose results are already in some register
Right: register allocation
Keep temporary in register across statements, avoid stack spill/fill
CIS 501 | Dr. | ISAs & Single Cycle
;; temp = *first
LDR R7, R5, #2 ; R7=first
LDR R4, R7, #0
STR R4, R5, #-1
;; *first = *second
LDR R3, R5, #3 ; R3=second
LDR R2, R3, #0
LDR R7, R5, #2 ; redundant
STR R2, R7, #0
;; *second = temp
LDR R4, R5, #-1
LDR R3, R5, #3 ; redundant
STR R4, R3, #0
;; temp = *first
LDR R7, R5, #2
LDR R4, R7, #0
STR R4, R5, #-1 ; unneeded
;; *first = *second
LDR R3, R5, #3
LDR R2, R3, #0
STR R2, R7, #0
;; *second = temp
LDR R4, R5, #-1 ; unneeded
STR R4, R3, #0
What is an ISA?
CIS 501 | Dr. | ISAs & Single Cycle
CIS 501 | Dr. | ISAs & Single Cycle
What Is An ISA?
ISA (instruction set architecture)
A well-defined hardware/software interface
The “contract” between software and hardware
Functional definition of storage locations & operations
Storage locations: registers, memory
Operations: add, multiply, branch, load, store, etc
Precise description of how to invoke & access them
Not in the “contract”: non-functional aspects
How operations are implemented
Which operations are fast and which are slow and when
Which operations take more power and which take less
Instructions
Bit-patterns hardware interprets as commands
Instruction Insn (instruction is too long to write in slides)
CIS 501 | Dr. | ISAs & Single Cycle
A Language Analogy for ISAs
Communication
Person-to-person software-to-hardware
Similar structure
Narrative program
Sentence insn
Verb operation (add, multiply, load, branch)
Noun data item (immediate, register value, memory value)
Adjective addressing mode
Many different languages, many different ISAs
Similar basic structure, details differ (sometimes greatly)
Key differences between languages and ISAs
Languages evolve organically, many ambiguities, inconsistencies
ISAs are explicitly engineered and extended, unambiguous
CIS 501 | Dr. | ISAs & Single Cycle
LC4 vs Real ISAs
LC4 has the basic features of a real-world ISAs
LC4 lacks a good bit of realism
Address size is only 16 bits
Only one data type (16-bit signed integer)
Little support for system software, none for multiprocessing (later)
Many real-world ISAs to choose from:
Intel x86 (laptops, desktop, and servers)
MIPS (used throughout in book)
ARM (in all your mobile phones)
PowerPC (servers & game consoles)
SPARC (servers)
Intel’s Itanium
Historical: IBM 370, VAX, Alpha, PA-RISC, 68k, …
Some Key Attributes of ISAs
Instruction encoding
Fixed length (16-bit for LC4, 32-bit for MIPS & ARM)
Variable length (1 byte to 16 bytes, average of ~3 bytes)
Number and type of registers
LC-4 has 8 registers
MIPS has 32 “integer” registers and 32 “floating point” registers
ARM & x86 both have 16 “integer” regs and 16 “floating point” regs
Address space
LC4: 16-bit addresses at 16-bit granularity (128KB total)
ARM: 32-bit addresses at 8-bit granularly (4GB total)
Modern x86 and ARM64: 64-bit addresses (16 exabytes!)
Memory addressing modes
MIPS & LC4: address calculated by “reg+offset”
x86 and others have much more complicated addressing modes
CIS 501 | Dr. | ISAs & Single Cycle
Advantages of fixed-length insn over variable-length:
Compact execution
Simple to decode
Easy to add instructions
Access Granularity & Alignment
Byte addressability
An address points to a byte (8 bits) of data
The ISA’s minimum granularity to read or write memory
ISAs also support wider load/stores
“Half” (2 bytes), “Longs” (4 bytes), “Quads” (8 bytes)
Load.byte [6] -> r1 Load.long [12] -> r2
However, physical memory systems operate on even larger chunks
Load.long [4] -> r1 Load.long [11] -> r2 “unaligned”
Access alignment: if address % size != 0, then it is “unaligned”
A single unaligned access may require multiple physical memory accesses
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
01001001 00101101 01101001 11001011 00001001 01011000 00111001 11011101 01001001 00101101 01101001 11001011 00001001 01011000 00111001 11011101
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
01001001 00101101 01101001 11001011 00001001 01011000 00111001 11011101 01001001 00101101 01101001 11001011 00001001 01011000 00111001 11011101
CIS 501 | Dr. | ISAs & Single Cycle
Q: unaligned byte access?
Handling Unaligned Accesses
Access alignment: if address % size != 0, then it is “unaligned”
A single unaligned access may require multiple physical memory accesses
How to handle such unaligned accesses?
1. Disallow (unaligned operations are considered illegal)
MIPS, ARMv5 and earlier took this route
2. Support in hardware? (allow such operations)
x86, ARMv6+ allow regular loads/stores to be unaligned
Unaligned access still slower, adds significant hardware complexity
3. Trap to software routine?
Simpler hardware, but high penalty when unaligned
4. In software (compiler can use regular instructions when possibly unaligned
Load, shift, load, shift, and (slow, needs help from compiler)
CIS 501 | Dr. | ISAs & Single Cycle
What does LC4 do?
How big is this struct?
struct foo {
CIS 501 | Dr. | ISAs & Single Cycle
hint: avoid unaligned accesses
Java doesn’t enforce field ordering
Another Addressing Issue: Endian-ness
Endian-ness: arrangement of bytes in a multi-byte number
Big-endian: sensible order (e.g., MIPS, PowerPC, ARM)
A 4-byte integer: “00000000 00000000 00000010 00000011” is 515
Little-endian: reverse order (e.g., x86)
A 4-byte integer: “00000011 00000010 00000000 00000000” is 515
Why little endian?
00000011 00000010 00000000 00000000
starting address
integer casts are free on little-endian architectures
CIS 501 | Dr. | ISAs & Single Cycle
sign checks cheaper on big-endian
ISA Code Examples
CIS 501 | Dr. | ISAs & Single Cycle
Array Sum Loop: LC4
CIS 501 | Dr. | ISAs & Single Cycle
array .BLKW #100
sum .FILL #0
CONST R5, #0
LEA R1, array
LEA R2, sum
LDR R3, R1, #0
LDR R4, R2, #0
ADD R4, R3, R4
STR R4, R2, #0
ADD R1, R1, #1
ADD R5, R5, #1
CMPI R5, #100
int array[100];
void array_sum() {
for (int i=0; i<100;i++) {
sum += array[i];
Array Sum Loop: LC4 MIPS
CIS 501 | Dr. | ISAs & Single Cycle
array .BLKW #100
sum .FILL #0
CONST R5, #0
LEA R1, array
LEA R2, sum
LDR R3, R1, #0
LDR R4, R2, #0
ADD R4, R3, R4
STR R4, R2, #0
ADD R1, R1, #1
ADD R5, R5, #1
CMPI R5, #100
array: .space 100
sum: .word 0
array_sum:
la $1, array
la $2, sum
lw $3, 0($1)
lw $4, 0($2)
add $4, $3, $4
sw $4, 0($2)
addi $1, $1, 1
addi $5, $5, 1
li $6, 100
blt $5, $6, L1
Syntactic differences:
register names begin with $
immediates are un-prefixed
MIPS (right) similar to LC4
Only simple addressing modes
syntax: displacement(reg)
Left-most register is generally destination register
Array Sum Loop: LC4 x86
CIS 501 | Dr. | ISAs & Single Cycle
array .BLKW #100
sum .FILL #0
CONST R5, #0
LEA R1, array
LEA R2, sum
LDR R3, R1, #0
LDR R4, R2, #0
ADD R4, R3, R4
STR R4, R2, #0
ADD R1, R1, #1
ADD R5, R5, #1
CMPI R5, #100
.comm array,400,32
.comm sum,4,4
.globl array_sum
array_sum:
movl $0, -4(%rbp)
movl -4(%rbp), %eax
movl array(,%eax,4), %edx
movl sum(%rip), %eax
addl %edx, %eax
movl %eax, sum(%rip)
addl $1, -4(%rbp)
cmpl $99,-4(%rbp)
x86 (right) is different
Syntactic differences:
register names begin with %
immediates begin with $
%rbp is base (frame) pointer
Many addressing modes
x86 Operand Model
x86 uses explicit accumulators
Both register and memory
Distinguished by addressing mode
CIS 501 | Dr. | ISAs & Single Cycle
.comm array,400,32
.comm sum,4,4
.globl array_sum
array_sum:
movl $0, -4(%rbp)
movl -4(%rbp), %eax
movl array(,%eax,4), %edx
movl sum(%rip), %eax
addl %edx, %eax
movl %eax, sum(%rip)
addl $1, -4(%rbp)
cmpl $99,-4(%rbp)
Register accumulator: %eax = %eax + %edx
Memory accumulator:
Memory[%rbp-4] = Memory[%rbp-4] + 1
Two operand insns
(right-most is typically source & destination)
“L” insn suffix and “%e…” reg. prefix mean “32-bit value”
CIS 501 | Dr. | ISAs & Single Cycle
Implementing an ISA
CIS 501 | Dr. | ISAs & Single Cycle
CIS 501 | Dr. | ISAs & Single Cycle
Implementing an ISA
Datapath: performs computation (registers, ALUs, etc.)
ISA specific: can implement every insn (single-cycle: in one pass!)
Control: determines which computation is performed
Routes data through datapath (which regs, which ALU op)
Fetch: get insn, translate opcode into control
Fetch Decode Execute “cycle”
CIS 501 | Dr. | ISAs & Single Cycle
Two Types of Components
Purely combinational: stateless computation
ALUs, muxes, control
Arbitrary Boolean functions
Combinational/sequential: storage
PC, insn/data memories, register file
Internally contain some combinational components
Example Datapath
CIS 501 | Dr. | ISAs & Single Cycle
Partial implementation of LC4
Leaves out the condition codes – NZP and privilege bit
insn[11:9]
LC4 Datapath
insn[11:9]
insn[11:9]
216 by 16 bit
CIS 501 | Dr. | ISAs & Single Cycle
MIPS Datapath
CIS 501 | Dr. | ISAs & Single Cycle
CIS 501 | Dr. | ISAs & Single Cycle
Unified vs Split Memory Architecture
Unified architecture: unified insn/data memory
“Harvard” architecture: split insn/data memories
Insn/Data Memory
CIS 501 | Dr. | ISAs & Single Cycle
Datapath for MIPS ISA
MIPS: 32-bit instructions, registers are $0, $2… $31
Consider only the following instructions
add $1,$2,$3 $1 = $2 + $3 (add)
addi $1,$2,3 $1 = $2 + 3 (add immed)
lw $1,4($3) $1 = Memory[4+$3] (load)
sw $1,4($3) Memory[4+$3] = $1 (store)
beq $1,$2,PC_relative_target (branch equal)
j absolute_target (unconditional jump)
Why only these?
Most other instructions are the same from datapath viewpoint
The ones that aren’t are left for you to figure out
MIPS Instruction layout
CIS 501 | Dr. | ISAs & Single Cycle
There is also a J-format instruction
Remember there are 32 registers defined in MIPS so 18, 19 and 17 are valid register indices.
CIS 501 | Dr. | ISAs & Single Cycle
Start With C and instruction memory (split insn/data architecture, for now)
A +4 incrementer computes default next instruction PC
How would Verilog for this look given insn memory as interface?
CIS 501 | Dr. | ISAs & Single Cycle
First Instruction: add
Add register file
Add arithmetic/logical unit (ALU)
Wire Select in Verilog
How to rip out individual fields of an insn? Wire select
wire [31:0] insn;
wire [5:0] op = insn[31:26];
wire [4:0] rs = insn[25:21];
wire [4:0] rt = insn[20:16];
wire [4:0] rd = insn[15:11];
wire [4:0] sh = insn[10:6];
wire [5:0] func = insn[5:0];
CIS 501 | Dr. | ISAs & Single Cycle
CIS 501 | Dr. | ISAs & Single Cycle
Second Instruction: addi
Destination register can now be either Rd or Rt
Add sign extension unit and mux into second ALU input
Verilog Wire Concatenation
Recall two Verilog constructs
Wire concatenation: {bus0, bus1, … , busn}
Wire repeat: {repeat_x_times{w0}}
How do you specify sign extension? Wire concatenation
wire [31:0] insn;
wire [15:0] imm16 = insn[15:0];
wire [31:0] sximm16 = {{16{imm16[15]}}, imm16};
CIS 501 | Dr. | ISAs & Single Cycle
CIS 501 | Dr. | ISAs & Single Cycle
Third Instruction: lw
Add data memory, address is ALU output
Add register write data mux to select memory output or ALU output
CIS 501 | Dr. | ISAs & Single Cycle
Fourth Instruction: sw
Add path from second input register to data memory data input
CIS 501 | Dr. | ISAs & Single Cycle
Fifth Instruction: beq
Add left shift unit and adder to compute PC-relative branch target
Add PC input mux to select PC+4 or branch target
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com