Computer architecture: processors
Dr Fei Xia and Dr Alex Bystrov
Introduction to processors
• The brain of the computing system, meant to carry out the intended functionality, as and when needed.
A simplified view v1.0
[Figure label: Processed data]
29/10/20 Architecture topics, EEE8087 1
Simplified View v2.0 – data types
[Figure labels: Instructions; Processed data; simplified example of an instruction]
Functional view
[Figure label: Instruction / Data]
CPU Structure
[Figure labels: Arithmetic and Logic Units; Internal Interconnects; Control Unit; Computer I/O; System bus]
Control unit: data flow
[Figure labels: Arithmetic and Logic Units; Internal Interconnects; Control Unit; Computer I/O; System bus]
CPU control steps: data flow
• Fetch instructions
• Interpret instructions
• Fetch data
• Process data
• Write data
[Figure: simplified view of the instruction cycle: fetch next instruction, decode instruction, execute instruction]
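The control steps above can be sketched as a loop. This is a minimal illustration only, assuming a hypothetical accumulator machine: the opcodes (LOAD, ADD, STORE, HALT), the memory layout, and the function name `run` are invented for the example and do not correspond to any real instruction set.

```python
# Toy fetch-decode-execute loop for a hypothetical accumulator machine.
# Opcodes, memory layout, and the name run() are illustrative assumptions.

def run(memory):
    """Execute the program held in `memory` until HALT; return the final memory."""
    pc = 0          # program counter: address of the next instruction
    acc = 0         # accumulator register
    while True:
        instr = memory[pc]              # fetch instruction
        pc += 1
        opcode, operand = instr         # interpret (decode) instruction
        if opcode == "HALT":
            break
        if opcode == "LOAD":
            acc = memory[operand]       # fetch data
        elif opcode == "ADD":
            acc += memory[operand]      # fetch data and process it
        elif opcode == "STORE":
            memory[operand] = acc       # write data
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return memory

program = [
    ("LOAD", 5),    # acc = memory[5]
    ("ADD", 6),     # acc += memory[6]
    ("STORE", 7),   # memory[7] = acc
    ("HALT", None),
    None,           # unused cell
    2, 3, 0,        # data at addresses 5, 6, 7
]
result = run(list(program))
print(result[7])    # 2 + 3 stored at address 7
```

In this sketch, instructions and data sit in the same list, so a single memory serves both.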
Data flow: execute
• Fetch and Decode are common to all CPU architectures; the Execute flow, however, may take many forms
• Depends on instruction being executed
• May include
– Memory read/write
– Input/Output
– Register transfers
– ALU operations
• Some architectures add a prefetch step to improve performance
• Can fetch next instruction during execution of current instruction (pipelining)
• Called instruction prefetch
• Prefetch can require accessing main memory
[Figure: instruction N+1 is (pre-)fetched while instruction N executes]
Improved Performance through Prefetch
• Prefetch improves performance because fetching the next instruction overlaps with execution, hiding part of the latency between the CPU and main memory
• But performance is not doubled:
– Fetch is usually shorter than execution
• Prefetch more than one instruction?
– Any jump or branch means that the prefetched instructions are not the required instructions
• Add more stages, or time-multiplex the stages, to improve performance
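Why prefetch does not double performance can be seen with a rough timing model. The sketch below assumes no branches and uses made-up stage durations; the function names and the numbers are illustrative, not measurements.

```python
# Simple timing model of instruction prefetch (two overlapped stages).
# fetch_time and exec_time are hypothetical durations in arbitrary units.

def time_without_prefetch(n, fetch_time, exec_time):
    """Each instruction is fetched, then executed, strictly in sequence."""
    return n * (fetch_time + exec_time)

def time_with_prefetch(n, fetch_time, exec_time):
    """Fetch of instruction i+1 overlaps execution of instruction i (no branches)."""
    # The first fetch cannot be hidden; after that the slower stage sets the pace,
    # and the last instruction still needs its full execution time.
    return fetch_time + (n - 1) * max(fetch_time, exec_time) + exec_time

n = 100
for fetch_time, exec_time in [(1, 1), (1, 3)]:
    serial = time_without_prefetch(n, fetch_time, exec_time)
    overlapped = time_with_prefetch(n, fetch_time, exec_time)
    print(fetch_time, exec_time, round(serial / overlapped, 2))
```

With equal stage times the gain approaches 2; when fetch is shorter than execution the gain is smaller, because the execute stage sets the pace.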
Pipelining
• Detailed data flow
– Fetch instruction
– Decode instruction
– Control operand addresses
– Fetch operands
– Execute instructions
– Write result
• Overlap these operations
Timing of Pipeline – 6 stages
• FI: fetch instruction
• DI: decode instruction
• CO: control operand address
• FO: fetch operands
• EI: execute instructions
• WO: write-back operands
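The timing of the six stages can be reproduced with a short script. This is a sketch of the ideal case only, assuming one instruction enters the pipeline per clock cycle and nothing stalls; the function name `pipeline_schedule` is made up for the example.

```python
# Timing of an ideal 6-stage pipeline with no stalls or branches.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def pipeline_schedule(n_instructions, stages=STAGES):
    """Return rows of the timing diagram: rows[i][c] is the stage that
    instruction i occupies in clock cycle c (0-based), or '' if none."""
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name       # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows

rows = pipeline_schedule(9)
for i, row in enumerate(rows, start=1):
    print(f"I{i:<2}", " ".join(f"{cell:2}" for cell in row))
print("total cycles:", len(rows[0]))
```

Nine instructions finish in 6 + 9 - 1 = 14 cycles, compared with 9 x 6 = 54 cycles if the stages were not overlapped.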
Branch in a Pipeline
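Instruction 3 causes a branch to instruction 15
Instructions 4-7 are already in the pipeline when the branch is taken, so they must be discarded; the cycles spent on them are the branch penalty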
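The cost of the branch can be worked out from the stage in which the branch is resolved. The sketch below assumes the branch outcome is known at the end of the EI stage and that one instruction is fetched per cycle; both are assumptions of this example, and `branch_penalty` is a made-up helper name.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def branch_penalty(branch_instr, resolve_stage="EI", stages=STAGES):
    """For a taken branch that is instruction number `branch_instr` (1-based,
    fetched in cycle `branch_instr`), return (flushed instructions,
    cycle in which the branch target is fetched)."""
    resolve_cycle = branch_instr + stages.index(resolve_stage)
    # Every instruction fetched after the branch, up to the resolving cycle, is wasted.
    flushed = list(range(branch_instr + 1, resolve_cycle + 1))
    return flushed, resolve_cycle + 1

flushed, target_fetch_cycle = branch_penalty(3)
print(flushed, target_fetch_cycle)
```

Under these assumptions, a branch at instruction 3 discards instructions 4 to 7, and the target is fetched in cycle 8.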
Resource conflict stalls
[Figure: pipeline timing diagram. Horizontal axis: time (clock cycles); rows: Load, Instr 1, Instr 2, Instr 3]
Apart from branching, stalls can also be caused by resource conflicts, for example when an instruction fetch and a data access need the same memory in the same cycle
Avoiding them needs careful processor pipeline design, with appropriate arbitration between streams
(e.g. skip cycle 4)
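A resource conflict of this kind can be sketched as follows. The example assumes a hypothetical pipeline in which a load uses the memory three cycles after it is fetched, and a single memory port shared by instruction fetch and data access; the function name and parameters are illustrative.

```python
def fetch_cycles(is_load, shared_memory=True, mem_stage_offset=3):
    """Return the cycle (1-based) in which each instruction is fetched.
    is_load[i] is True when instruction i accesses data memory,
    `mem_stage_offset` cycles after its fetch."""
    busy = set()            # cycles in which a load occupies the single memory port
    cycles = []
    cycle = 1
    for load in is_load:
        while shared_memory and cycle in busy:
            cycle += 1      # stall: the fetch must wait for the memory port
        cycles.append(cycle)
        if load:
            busy.add(cycle + mem_stage_offset)
        cycle += 1
    return cycles

program = [True, False, False, False]    # Load, Instr 1, Instr 2, Instr 3
print(fetch_cycles(program, shared_memory=True))    # fetch of Instr 3 skips cycle 4
print(fetch_cycles(program, shared_memory=False))   # separate memories: no stall
```

With separate instruction and data memories the conflict disappears, which is one motivation for splitting the memories close to the CPU.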
Dealing with Branches
• Multiple Streams
• Prefetch Branch Target
• Loop buffer
• Branch prediction
• Delayed branching
Prefetching the branch target
• Prefetch the instructions at the branch target and store them somewhere non-conflicting
• The target of the branch is prefetched in addition to the instructions following the branch
• Keep the target until the branch is executed
• Used as far back as the IBM 360/91
Loop Buffer
• Jump targets often lie within a loop: a short sequence of instructions that is executed repeatedly
• A very fast memory (IRs) stores these N instructions in sequence
• The instructions in the loop can be pipelined
• Maintained by fetch stage of pipeline
• Check buffer before fetching from memory
• Very good for small loops or jumps
• Used by CRAY-1
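The effect of a loop buffer can be sketched by counting how many instruction fetches still reach main memory. This is a simplified model, assuming the buffer keeps the most recently fetched instruction addresses in first-in, first-out order; the addresses and buffer sizes are hypothetical.

```python
from collections import deque

def count_memory_fetches(trace, buffer_size):
    """Count how many instruction fetches go to main memory when the fetch
    stage first checks a loop buffer holding the last `buffer_size` addresses."""
    buffer = deque(maxlen=buffer_size)   # oldest entry is dropped automatically
    memory_fetches = 0
    for address in trace:
        if address in buffer:
            continue                     # hit: served from the loop buffer
        memory_fetches += 1              # miss: fetch from memory, then remember it
        buffer.append(address)
    return memory_fetches

loop_body = [100, 101, 102, 103]         # hypothetical instruction addresses
trace = loop_body * 50                   # the loop runs 50 times
print(count_memory_fetches(trace, buffer_size=8))
print(count_memory_fetches(trace, buffer_size=2))
```

When the whole loop fits in the buffer, only the first pass goes to memory; when it does not fit, this simple buffer gives no benefit at all, which is why loop buffers suit small loops.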
Branch Prediction (1)
• Predict never taken (pessimistic)
– Assume that the jump will not happen
– Always fetch the next instruction
– Examples: Motorola 68020 and VAX 11/780 (manufactured by DEC)
– Do not prefetch after branch
• Predict always taken (optimistic)
C source:
    int a, b, c[50];
    for (a = 0; a < 50; a++)
        c[a] = a * b;

Compiled ARM code (labels .L2 and .L3 restored from the branch targets):
            mov   r3, ...              @ operand missing in the source
            str   r3, [fp, #-16]
            mov   r3, #0
            str   r3, [fp, #-20]
            b     .L2
    .L3:    ldr   r1, [fp, #-20]
            ldr   r2, [fp, #-20]
            ldr   r3, [fp, #-16]
            mul   r0, r3, r2
            mvn   r2, #207
            mov   r3, r1, asl #2
            sub   r1, fp, #12
            add   r3, r3, r1
            add   r3, r3, r2
            str   r0, [r3, #0]
            ldr   r3, [fp, #-20]
            add   r3, r3, #1
            str   r3, [fp, #-20]
    .L2:    ldr   r3, [fp, #-20]
            cmp   r3, #49
            ble   .L3
            sub   sp, fp, #12
            ldmfd sp, {fp, sp, pc}
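Predicting "always taken" for the loop branch (ble .L3) is correct every time except the final exit from the loop, a success rate of about 98% (50 of the 51 times the branch is evaluated); predicting "never taken" is correct only on the exit, a success rate of about 2% (1 of 51)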
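The success rates of the two static predictions can be checked by replaying the outcomes of the loop branch. The sketch assumes the loop test sits at the bottom of the loop and is entered through the test (as the `b .L2` / `ble .L3` pair suggests), so the branch is evaluated once more than the loop body runs; `loop_branch_outcomes` is a made-up helper.

```python
def loop_branch_outcomes(iterations):
    """Outcomes of the loop-closing conditional branch (True = taken) for a
    loop compiled with its test at the bottom and entered through the test."""
    a = 0
    outcomes = []
    while True:
        taken = a < iterations   # same condition as `a < 50` (cmp r3, #49 / ble)
        outcomes.append(taken)
        if not taken:
            break
        a += 1                   # loop body runs, then a++
    return outcomes

outcomes = loop_branch_outcomes(50)
always_taken_hits = sum(outcomes)                     # right whenever the branch is taken
never_taken_hits = len(outcomes) - always_taken_hits  # right only on the loop exit
print(len(outcomes), always_taken_hits, never_taken_hits)
```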
Branch Prediction (2)
• Predict by Opcode
– Some instructions are more likely to result in a jump than others
– For example COMPARE instructions
– Can get up to 75% success
• Taken/Not taken switch
– Based on previous history (machine learning aided)
– Good for loops
• Delayed Branch
– Do not take the jump until you have to
– Complete the instructions already in the pipeline, in sequence, before the jump takes effect
– Rearrange instructions so that this work is useful
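A taken/not-taken switch based on previous history can be sketched as a small saturating counter per branch. The example below is an illustration under stated assumptions: a single branch, a counter that starts at "not taken", and a hypothetical trace of a 50-iteration loop that is run three times.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of a saturating-counter predictor with `bits` bits
    of history (1 = taken/not-taken switch, 2 = two-bit counter)."""
    max_state = (1 << bits) - 1
    threshold = (max_state + 1) // 2     # predict taken when state >= threshold
    state = 0                            # start in the strongest 'not taken' state
    wrong = 0
    for taken in outcomes:
        predicted_taken = state >= threshold
        if predicted_taken != taken:
            wrong += 1
        # Move the counter towards the actual outcome, saturating at the ends.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return wrong

# Hypothetical trace: each run of the loop gives 50 taken branches
# followed by one not-taken exit; the loop is run 3 times.
trace = ([True] * 50 + [False]) * 3
print(mispredictions(trace, bits=1))
print(mispredictions(trace, bits=2))
```

The one-bit switch mispredicts twice per run of the loop (on entry and on exit). The two-bit counter mispredicts only the exit once it has warmed up, because a single not-taken outcome does not flip its prediction.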
Speedup from pipelining
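• Ideally, the speedup equals the number of pipeline stages (pipeline depth)
Without pipelining, CPI equals the number of stages in the data flow, assuming each stage requires 1 cycle (= Ideal CPI × Pipeline depth)
CPI = clocks per instruction, ideally = 1 with pipelining
$$\mathrm{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle time}_{\text{unpipelined}}}{\text{Cycle time}_{\text{pipelined}}}$$
Here "Pipeline stall CPI" is the average number of stall cycles per instruction.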
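The speedup formula can be evaluated directly. The sketch below assumes an ideal CPI of 1 and equal cycle times unless stated otherwise; the stall figure of 0.5 cycles per instruction is a hypothetical value chosen for illustration.

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, cycle_time_ratio=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI)
                 x (unpipelined cycle time / pipelined cycle time)."""
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio

print(pipeline_speedup(depth=6, stall_cpi=0.0))   # ideal case: equals the depth
print(pipeline_speedup(depth=6, stall_cpi=0.5))   # hypothetical 0.5 stall cycles per instruction
```

With no stalls a 6-stage pipeline gives a speedup of 6; an average of half a stall cycle per instruction already reduces it to 4.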
Pipelined architecture examples
ARM7TDMI – 3 stage pipeline
[Figure: FETCH (instruction fetch), DECODE (Thumb→ARM decompress, ARM decode, Reg Select), EXECUTE]
ARM9TDMI – 5 stage pipeline
[Figure: FETCH (instruction fetch), DECODE (ARM or Thumb Inst Decode, Reg Decode), EXECUTE (Shift + ALU), Memory Access]
Control unit: CPU types
[Figure labels: Arithmetic and Logic Units; Internal Interconnects; Control Unit; Computer I/O; System bus]
Von Neumann architecture
• Also known as the “Princeton architecture”
• Data and instructions share the same memory and the same memory interface with the CPU
• Input and output may be on separate interconnects
• Usually simplified to a single bus for all data and instruction transfers
• Most classical and current systems follow this model to some degree
Source: Kapooht
Harvard architecture
• Separate instruction and data memories connected to the processor’s control unit using separate interconnects
• I/O share the same interconnect
A bit of both
• Modified Harvard architecture
– Sometimes called the “almost Von Neumann architecture”
– Memories inside and close to the CPU are divided into instruction and data
• Instruction registers and data registers
• Instruction cache and data cache (usually L1 cache)
• Connected with separate interconnects
– Memories further away from the CPU are organized in Von Neumann fashion
– Current ARM and Intel processors use this
– Review the pipeline stalls that occur when an instruction fetch clashes with a data store (see "Resource conflict stalls")
CISC and RISC
• CISC: complex instruction set computer
• RISC: reduced instruction set computer
• A group at Berkeley coined the term RISC and built a CPU called RISC I; soon afterwards, Stanford built a similar CPU called MIPS
• SPARC also emerged, from Sun Microsystems
• ARM has a range of RISC architectures
• Early RISC CPUs had about 50 instructions, compared with the 200-300 common for CISC
– The aim was to simplify the CPU so that it could process (and start) instructions faster
RISC philosophy
• Instructions of fixed length executing in a single clock cycle
• Pipelines to achieve one-instruction-per-one-clock-cycle throughput (need to predict branches in program flow in advance)
• Simple control logic to increase clock speed, no micro-code
• Operations performed on internal registers only; only LOAD and STORE instructions access external memory
MIPS example: add $rd, $rs, $rt
  B31-26    B25-21       B20-16       B15-11       B10-6          B5-0
  opcode    register s   register t   register d   shift amount   function
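The field layout above can be turned into a 32-bit instruction word by shifting each field to its bit position. The sketch below uses the standard MIPS encoding of add (opcode 0, function code 0x20) and the usual register numbers for $t0, $t1, and $t2; the helper name `encode_r_type` is made up for this example.

```python
def encode_r_type(rs, rt, rd, shamt=0, funct=0x20, opcode=0):
    """Pack the six R-type fields into one 32-bit word.
    Defaults give `add`: opcode 0 and function code 0x20."""
    for value, width in ((opcode, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)):
        if not 0 <= value < (1 << width):
            raise ValueError("field does not fit in its bit width")
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $t1, $t2  ->  rd = $t0 (reg 8), rs = $t1 (reg 9), rt = $t2 (reg 10)
word = encode_r_type(rs=9, rt=10, rd=8)
print(f"{word:#010x}")
```

Because every instruction has the same length and the fields sit at fixed positions, decoding needs only shifts and masks, which keeps the control logic simple.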
CISC characteristics
• Binary compatibility
– Old binary code can run on newer versions
• Complex control logic to support many instructions
• Use of micro-code
– One program instruction can execute in many cycles
• Variable-length instructions to save program memory
• Small internal register sets compared with RISC
• Complex addressing modes, operands can reside in external memory or internal registers
A CISC versus RISC example
One way of looking at it...
• Runtime = clock-period x CPI x Ninstr
• CISC tries to reduce the number of instructions
– Fewer instructions to do more
– Increased CPI
– Complex CPU design (multi-mode registers, and multi-cycle executions)
• RISC tries to reduce the clock cycles per instruction
– fewer cycles per instruction
– more instructions
– simpler CPU design
• Obvious trade-offs can be seen!
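The trade-off can be made concrete with the runtime formula. All numbers below are hypothetical, chosen only to show how the three factors interact; they are not measurements of any real CISC or RISC processor.

```python
def runtime(clock_period_ns, cpi, n_instr):
    """Runtime = clock period x CPI x number of instructions (in nanoseconds)."""
    return clock_period_ns * cpi * n_instr

# Hypothetical figures for the same task on two made-up machines.
cisc = runtime(clock_period_ns=2.0, cpi=4.0, n_instr=1_000_000)   # fewer, slower instructions
risc = runtime(clock_period_ns=1.0, cpi=1.2, n_instr=3_000_000)   # more, faster instructions
print(cisc, risc)
```

Which machine wins depends entirely on the figures: a different instruction count or CPI reverses the result, so the formula frames the comparison rather than settling it.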
Another way of looking at it
• CISC assembler code may be easier for human programmers to handle
– When manually coding
• But is this advantage really relevant these days?