Classifying Instruction Set Architectures
l Using the type of internal storage in the CPU.
Internal Storage in CPU
Explicit operands per ALU instruction
Copyright By PowCoder代写 加微信 powcoder
Destination for result
Access operand by
(operand implicit on stack)
Push or Pop on Stack
B5500 HP3000/70
Accumulator
Accumulator
Load/Store accumulator
Motorola 6809
General-purpose registers
Register or Memory
Load/Store register
l Using the number of explicit operands named per instructions.
l Using operand location.
Can ALU operands be located in memory?
––RISC architecture requires all operands in register. ––Stack architecture requires all operands in stack.
(top portion of the stack inside CPU; the rest in memory) l Using operations provided in the ISA.
l Using types and size of operands.
cs420/520-CH2-5/14/99–Page 1-
Metrics for Evaluating Different ISA
l Code size (How many bytes in a program’s executable code?)
Code density (How many instructions in x K Bytes?)
Instruction length.
––Stack architecture has the best code density. (No operand in ALU ops)
l Code efficiency. (Are there limitation on operand access?) ––Stack can not be randomly accessed.
e.g. Mtop-x x>=2 cannot be directly accessed.
Mtop-Mtop-1 is translated into subtract followed by negate
l Bottleneck in operand traffic.
––Stack will be the bottleneck;
––both input operands from the stack and result goes back to the stack.
l Memory traffic (How many memory references in a program (i+d)?) ––Accumulator Arch: Each ALU operation involves one memory reference.
l Easiness in writing compilers.
––General-Purpose Register Arch: we have more registers to allocate. more choices more difficult to write?
l Easiness in writing assembly programs.
––Stack Arch: you need to use reverse polish expression.
chow cs420/520-CH2-5/14/99–Page 2-
Comparison of Instruction Set Architectures
In the following let us compare five different ISAs. l Stack Architecture
l Accumulator Architecture
l Load/Store Architecture: GPR(0,3) Arch. –– e.g. MIPS R2000, SPARC 0→no memory operand allowed in ALU instruction;
3→max. 3 operands in ALU instructions;
GPR stands for General-Purpose Register
l Register-Memory Architecture: GPR(1,2) Arch.–– e.g. M68020 l Memory-Memory Architecture: GPR(3,3) Arch.–– e.g. VAX
Let us compare how f=(a-b)/(c-d*e) are translated.
Assume all six variables are 4-bytes integers.
Assume opcodes are 1 bytes except push and pop in stack arch. are 2 bits. Memory address is 16 bits and register address is 5 bits.
In reality, VAX and M68020 do not have 32 registers.
We will evaluate the instruction density and memory traffic.
See the actual compiled assembly code of GPR Architectures in ~cs520/src/arch.s*.
chow cs420/520-CH2-5/14/99–Page 3-
Compute f=(a-b)/(c-d*e) Using Stack Architecture
Assume that top 4 elements of the stack are inside CPU.
Instructions
Stack ContentMstacktop’←Mstacktop-1 bop Mstacktop binaryoperator
push b[a,b] subtract[a-b]
push c[a-b,c] push d[a-b,c,d] push e[a-b,c,d,e] mult [a-b, c, d*e] subtract[a-b, c-d*e] divide [a-b/(c-d*e)] pop f
The execution of subtract operation:
ALU to ALU-1
stack before
1. pop b to ALU-2
3. push a-b
stack after
Instruction count=10
Total bits for the instructions=bits for opcodes + bits for addresses =(6*2+4*8)+(6*16)=140 bits
Memory traffic=140 bits+6*4*8 bits = 332 bits
chow cs420/520-CH2-5/14/99–Page 4-
Compute f=(a-b)/(c-d*e) Using Accumulator Architecture
Instructions AC temp(amemorylocation)AC←ACbopMp
mult e[d*e]
store temp
sub temp[c-d*e] store temp
The execution of subtract operation:
a Accumulator
from memory
Instruction count=10
Total bits for the instructions=bits for opcodes + bits for addresses =(10*8)+(10*16)=240 bits
Memory traffic=240 bits+10*4*8 bits = 560 bits
chow cs420/520-CH2-5/14/99–Page 5-
Compute f=(a-b)/(c-d*e)
Using Load/Store Architecture: GPR(0,3)
MIPS R2000 instructionssp is a special register called stack pointer. Instructions (op dst, src1, src2)dst←src1 bop src2$n: nth register
lw $14, 20($sp)a was allocated at M[20+sp]
lw $15, 16($sp)b was allocated at M[16+sp]
subu $24, $14, $15$24=a-b
lw $25, 8($sp)d was allocated at M[8+sp]
lw $8, 4($sp)e was allocated at M[4+sp]
mul $9, $25, $8$9=d*e
lw $10, 12($sp)c was allocated at M[12+sp]
subu $11,$10,$9$11=c-d*e
div $12, $24, $11 Displacement address mode
sw $12, 0($sp)f was allocated at M[0+sp]
Instruction count=10
Total bits for the instructions=bits for opcodes + bits for addresses =(10*8)+(4*(5+5+5)+6*(5+16))=266 bits
Memory traffic=266 bits+6*4*8 bits = 458 bits
chow cs420/520-CH2-5/14/99–Page 6-
Compute f=(a-b)/(c-d*e)
Using Register-Memory Architecture: GPR(1,2)
M68020 instructions
Instructions (op src, dst)
movl d0 mulsl d0 movl d1 subl d0, d1
movl d0 subl d0 divsl d1, d0
Instruction count=8
a6 is a stack pointer d0 & d1 are registers
[c] [c-d*e]
dst←dst bop src
d was allocated at M[a6-16] e was allocated at M[a6-20] c was allocated at M[a6-12]
a was allocated at M[a6-4] b was allocated at M[a6-8]
Displacement address mode
f was allocated at M[a6-24]
Total bits for the instructions=bits for opcodes + bits for addresses =(8*8)+(6*(5+16)+2*(5+5))=210 bits
Memory traffic=210 bits+6*4*8 bits = 402 bits
chow cs420/520-CH2-5/14/99–Page 7-
Compute f=(a-b)/(c-d*e)
Using Memory-Memory Architecture: GPR(3,3)
VAX instructions fp is a frame (stack) pointer r0 & r1 are registers
Instructions (op src2,src1,dst) r0
r1 dst←src1 bop src2
a,b were allocated at M[fp-4],[fp-8] d,e were allocated at M[fp-16],[fp-20] [c-d*e]c was allocated at M[fp-12]
f was allocated at M[fp-24]
subl3 mull3 subl3 divl2 movl
-8(fp), -4(fp), r0 [a-b] -20(fp), -16(fp), r1[d*e] r1,-12(fp), r1
r0, -24(fp)
Instruction count=5
[(a-b)/(c-d*e)]
Total bits for the instructions=bits for opcodes + bits for addresses =(5*8)+(6*16+7*5)=171 bits
Memory traffic=171 bits+6*4*8 bits = 363 bits
chow cs420/520-CH2-5/14/99–Page 8-
Comparison of ISAs
In terms of Instruction Density and Memory Traffic
Architecture
Instruction Count
Code Size (bits)
Memory Traffic (bits)
Accumulator
Load-Store
How about the speed of execution?
chow cs420/520-CH2-5/14/99–Page 9-
Misalignment causes hardware complications
A misalignment memory access results in multiple aligned memory references.
register 3 CPU
an integer not aligned is stored in bytes 2002, 2003, 2004, 2005
32 bit wide bus
Mem. 32bits
To get the integer in to the register, it not only requires two memory accesses but also requires CPU to perform shift and or operations when receiving data.
chow cs420/520-CH2-5/14/99–Page 10-
Memory Alignment may require more storage
How many bytes will be allocated for the following structure on SPARC and PC? How about on SPARC and PC?
struct PenRecord{
unsigned char msgType;
short size; int X;
int Y;} pr;
Assume pr is allocated at memory starting at address 2000.
After we executed “pr.msgType=8, pr.size=4, pr.X=1, pr.Y=256;”, here are the content of memory on differen machines in hexadecimal.
address 2000 size X Y 2011 Onsparc: 08 00 00 04 00 00 00 01 00 00 01 00
OnPentium:08 fd 04 00 01 00 00 00 00 01 00 00
Onold86: 08 04 00 01 00 00 00 00 01 00 00
Bytes with red color are padding bytes. A structure variable is allocated with memory that is multiple of its largest element size.
chow cs420/520-CH2-5/14/99–Page 11-
Byte Ordering
There are two ways to ordering the bytes of short, int, long data in memory. l Little endian––(e.g, PC) put the byte with the less significant value in the
lower address of the allocated memory. Least significant byte sends/saves first.
l Big endian––(e.g, SPARC) put the byte with more significant value in the lower address of the allocated memory. Most signficant byte sends first.
address 2000 size X Y 2011 Onsparc: 08 00 00 04 00 00 00 01 00 00 01 00
big endian machine: For X, it was allocated at addresses 2004-7. The least significant of the four bytes, 01, is allocated at address 2007 (the higher address).
OnPentium:08 fd 04 00 01 00 00 00 00 01 00 00
little endian machine: For X, it was allocated at addresses 2004-7. The least significant of the four bytes, 01, is allocated at addresses 2004 (the lower address).
chow cs420/520-CH2-5/14/99–Page 12-
Memory Alignment
What happens if we change the declaration by moving the size field to the end?
struct PenRecord{
unsigned char msgType;
int Y; short size;
Assume pr is allocated at memory starting at address 2000.
After we executed “pr.msgType=8, pr.size=4, pr.X=1, pr.Y=256, here are the content of memory on different machines generated by write2.c
address 2000 X Y size 2015 On sparc: 08 00 00 00 00 00 00 01 00 00 01 00 00 04 06 38
OnPentium: 08000000010000000001000004004000
On old 86: 08 01 00 00 00 00 01 00 00 04 00
Bytes with red color are padding bytes. A structure variable is allocated with memory that is multiple of its largest element size. What if we change int to double
chow cs420/520-CH2-5/14/99–Page 13-
Memory Alignment Exercise
In ~cs520/alignment, there is a program, called writer.c, that declares a variable of the type PenRecord, sets its field values to be msgtype=8, X=1, Y=256, size=4, and print out sizeof(struct PenRecord), save the variable in binary form in a file, called datafile. Another program, reader.c, reads in the datafile and print the values of PenRecord fields. Assume you are in ~cs520 directory.
a) Run “SPARC/writer” on a SPARC, then run “LPC/reader datafile” on a Linux PC. Observe difference.
b) Run “SPARC/reader datafile” on a SPARC. Observe differences.
In chow.uccs.edu (a Win95 machine), at c:\cs520,
a) telnet to chow.uccs.edu with login=guest and password=cs520 b) run “reader datafile.spa” and see how data are interpreted.
Character does not require alignment; integer, short, and long require alignment. On Pentium PC, DEC3100, DECalpha and Sparc, int is interpreted as 4 bytes. On DECalpha, long is 8 bytes; while on other machines, long is 4 bytes. DEC3100, DECalpha, has stringent aligned accesses of objects.
SUN3, a 680×0 machine, has less stringent in aligned accesses. Both short and integer only need to align at even address.
Compilers on Pentium PC with Linux and Win32 generate code that requires memory alignment. The old 386 and 486 compilers generate code without memory requirement.
chow cs420/520-CH2-5/14/99–Page 14-
cs420/520-CH3-5/14/99–Page 15-a
~cs520/alignment/writer.c
#include
unsigned char msgType; short size;
int fd, cc;
int accessible;
struct PenRecord r;
if (access(“datafile”, F_OK) == 0) {
/* file with same name exist */
if ((fd=open(“datafile”, O_WRONLY, 00777)) <= 0) {
printf(“fd=%d\n”, fd); exit(1);}
} else if ((fd=open(“datafile”, O_CREAT, 00777)) <= 0) { printf(“fd=%d\n”, fd);
exit(1);} r.msgType=8;
printf(“size of PenRecord =%d\n”, sizeof(struct PenRecord)); cc=write(fd, &(r.msgType), sizeof(struct PenRecord));
printf(“no.of bytes written is %d\nr.msgType=%d, r.X=%d, r.Y=%d, r.size=%d\n”, cc, r.msgType, r.X, r.Y, r.size);
close(fd); }
~cs520/alignment/reader.c
#include
unsigned char msgType; short size;
int Y;} r;
int fd, cc;
if ((fd=open(“datafile”, O_RDONLY, 0)) <= 0) exit(1);
printf(“size of PenRecord = %d\n”, sizeof(struct PenRecord));
cc=read(fd, &r, sizeof(struct PenRecord));
printf(“no.of bytes read is %d\nr.msgType=%d, r.X=%d, r.Y=%d, r.size=%d\n”,
cc, r.msgType, r.X, r.Y, r.size); close(fd);
Run writer on SPARC, we get the following output:
size of PenRecord = 12
no.of bytes written is 12
r.msgType=8, r.X=1, r.Y=256, r.size=4
Run reader on DEC3100 with the datafile generated by writer on SPARC, got: size of PenRecord = 12
no.of bytes read is 12
r.msgType=8, r.X=16777216, r.65536, , r.size=1024
Run reader on SPARC with the datafile generated by writer on SPARC, got: size of PenRecord = 12
no.of bytes read is 12
r.msgType=8, r.X=1, r.Y=256, r.size=4
chow cs420/520-CH3-5/14/99--Page 16-a
Examples of Addressing Modes VAX instructions
LL0:.align 1
.globl _main
# static int sdata; .lcomm L16,4
.set L12,0x0
_main:.wordL12
# int i, j;
# char buf[128];
movab -144(sp),sp
movl $3,-4(fp)
# j = i+4;
addl3 $4,-4(fp),-8(fp)
moval -4(fp),-12(fp)
incl *-12(fp)
# sdata = i;
movl -4(fp),L16
# c = buf[i];
moval -140(fp),r0
movl -4(fp),r1
movb (r0)[r1],-141(fp)
# for (j=1; j<10; j++) { movl $1,-8(fp)
L2000001:moval-140(fp),r0 movl -8(fp),r1
movb -141(fp),(r0)[r1] #}
aoblss $10,-8(fp),L2000001 #}
Similar compiled codes for DEC3100 on ~cs520/src/addrmode.s*
buf[127:124]
fp-8 fp-12
# $3: Immediate address mode
# -4(fp): Displacement address mode
# *-12(fp): Memory indirect address mode
# L16: Direct address mode
# (r0)[r1]: Scaled address mode
buf[j] = c;
chow cs420/520-CH3-5/14/99--Page 17-a
Direct /Immediate Indirect
Self Modifying
Full Size or Short Literal
Register vs. Memory
as intermediate location
Full Size vs. Short Displacement Single vs. Multiple Index Registers
Pre- vs. Post-Increment Pre- vs. Post-Decrement Fixed vs. Data Dependent Increment (Decrement)
ADD R4, #3
ADD R4, (R1) ADD R4, @(R1)
LW R12, (R11+R2) ADD R1, 100(R2)[R3]
ADD R1, (R2)+ ADD R1, -(R2)
Design Consideration of Addressing Modes
Powerful instruction set architectures such as VAX were designed to translate the constructs of programming languages in very few instructions.
→ providing high level hardware support for languages.
But they also made the design of those processors very difficult and caused the delay of the delivery.
→In early 1980s, the direction of computer architecture starts the shift to a simpler architecture and RISC was born.
chow cs420/520-CH2-5/14/99--Page 18-
Trade-off in Encoding Address Modes
1. The desire to have as many registers and addressing modes as possible.
2. The impact of the size of the register and addressing mode fields on the average instruction size and hence on the average program size.
3. A desire to have instructions encode into lengths that will be easy to handle in the implementation, e.g. instructions in multiples of bytes, rather than arbitrary length. Many architects have chosen to use a fixed-length instruction.
→ Most RISC architectures use 32-bit fixed-length instruction.
chow cs420/520-CH2-5/14/99--Page 19-
Operations in the Instruction Set
Operator type
Arithmetic and Logical
Integer arithmetic and logical operations: add, and, subtract, or
Data Transfer
Load/Store (move instructions on machines with memory addressing)
Branch, jump, procedure call and return, traps
Operating system call, virtual memory management instructions
Floating point
Floating-point operations: add, multiply
Decimal add, decimal multiply, decimal-to-character conversions
String move, string compare, string search
chow cs420/520-CH2-5/14/99--Page 20-
Statistics about Control Flow Change
l Branches every 6 to 20 instructions
l Procedure calls every 70 to 300 instructions
l Branches 12 to 14 times more frequent than procedure calls.
l 0 is the most frequent used immediate value in compared (83%).
l Most backward-going branches are loop branches
l Loop branches are taken with 90% probability
l Branch behavior is application dependent and sometimes compiler- dependent.
chow cs420/520-CH2-5/14/99--Page 21-
conditional branches dominate
Procedure Call/Return
Two basic approaches to save registers in procedure calls: l Caller-saving
l Callee-saving
Where to save registers in procedure calls:
l Memory area pointed by the stack pointer
l Register Windows in Sparc architecture
chow cs420/520-CH2-5/14/99--Page 22-
ISA and Compiler Design
Compiler Writer’s goals:
l Generate correct code for valid program.
l Speed of compiled code.
l Fast compilation, debugging support, interoperability among different languages.
Recent compiler Structure
Dependencies
Language dependent; machine independent
Front-end per language
Transform language to common intermediate form
Procedure inlining and loop transformation
Global+local optimization register allocation
Detailed instruction selection and machine-dependent optimization; may include or be followed by assembler
cs420/520-CH2-5/14/99--Page 23-
Highly machine dependent; language independent
Intermediate representation
High-level optimization Global optimization
Code Generation
addn(int src, int n) {
return src+n;
j = addn(gv, i); printf(“j=%d\n”, j);
replace this statement with j=gv+i
Procedure Inlining/Integration
l Avoid the overhead of procedure call/return by replacing the procedure call with the procedure body (with proper variable substitution).
l Trade-off between the size of the procedure body and the frequencies of the procedure call. The code size of expanded program vs. call/return overhead.
l Some modern programming languages such as C++ have explicit inline language construct.
cs420/520-CH2-5/14/99--Page 24-
int A[128], B[128]; inti,j,k;
B[i+j*k] = A[i+j*k];
$14, 8($sp) $15, 4($sp) $24, $14, $15 $25, 12($sp) $8, $25, $24 $9, $8, 4
$10, $sp, 1040 $11, $9, $10 $12, -512($11) $12, -1024($11)
Note that the common subexpression, i+j*k, is only calculated once.
This is MIPS code compiled on DEC3100 using cc -S -O1 commsubexp.c
cs420/520-CH2-5/14/99--Page 25-
lw mul lw addu mul addu addu lw sw
# $8 = i+j*k
# each integer is 4 bytes,
# add the base address of the stack
# -512 = the offset from base address to A[0] # -1024 = the offset from base address to B[0]
Common Subexpression Elimination
int i, j, k, l;
$14, 12($sp) j=i+4;
$15, 8($sp) k=i;
$24, 4($sp)
$25, 0($sp)
# i+4 is optimized to constant 7
# const 3 is propagated to k
Constant Propagation
l Note that the constant propagation will be stopped when the value of the variable is replaced by the result of a non-constant expression.
chow cs420/520-CH2-5/14/99--Page 26-
Code Motion
_codemotion: |#PROLOGUE# 0
link a6,#-524
moveml #192,sp@ |#PROLOGUE# 1
moveq #1,d6
mulsl moveq #4,d7
asll #2,d0
movl d0,a0
addl d7,a0
k* j is moved out of the loop
movl auto-increment
addql #1,d6 addql #4,d7 moveq #40,d1 cmpl d1,d7 jlt L77003
addressing mode
|#PROLOGUE# 2
unlk a6 |#PROLOGUE# 3
codemotion(k, j) int k;
int i, base; int A[128];
for (i=1; i<10; i++) { base = k*j;
A[base+i] = i; }
M68020 code
cs420/520-CH2-5/14/99--Page 27-
Register Saving in Procedure Call/Return
int fib(n) int n; {
if (n<0) return -1;
if (n==0) return 0;
if (n==1) return 1; return fib(n-1)+fib(n-2);
spold parameter n
return addr
temporary variables
jal: save PC+4 (next instruction address) to $31, them jump to the target address
new spnew+36
spnew+32 spnew+28 spnew+24 spnew+20 spnew+16 spnew+12
spnew+0 =SPold-40
cs420/520-CH3-5/14/99--Page 28-a
subu sw sw sw sw
lw bne move b
lw addu jal move lw addu jal move addu
$31, 28($sp)
$4, 40($sp)
$17, 24($sp)
$16, 20($sp)
if (n<0) return -1; $14, 40($sp)
$14, 0, $32
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com