Ref: “ARM System-on-Chip Architecture” (2nd Edition) by Steve Furber, Addison Wesley
Due Acknowledgement of the Reference URL at:
http://thinkingeek.com/2013/01/09/arm-assembler-raspberry-pi-chapter-1/

Structure of the ARM processor…
• Below is the structure of the ARM1176JZF-S processor, which is commonly used in many micro-controllers / mobile devices.
• Each processor / core is made up of different components / functional units that serve various purposes.
ELEC 6036 – HPC Written by Dr. V. Tam 2

About the core/processor…
• The ARM1176JZF-S is a 32-bit processor / core: its word length is 32 bits, meaning the processor / core can handle a 32-bit instruction or data item in each clock cycle.
00101011 01101001 10110110 00110101
(32-bit instruction / data)
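
To make the 32-bit word above concrete, here is a minimal sketch (Python is used only for illustration; the variable names are hypothetical) that parses the bit pattern shown above and checks that it occupies exactly four bytes.

```python
# The 32-bit pattern from the slide, written as four 8-bit groups.
bits = "00101011 01101001 10110110 00110101"

# Remove the spaces and interpret the string as an unsigned binary number.
word = int(bits.replace(" ", ""), 2)

# A 32-bit word occupies exactly 4 bytes (4 x 8 bits).
raw = word.to_bytes(4, byteorder="big")

print(f"hex value : 0x{word:08X}")
print(f"byte count: {len(raw)}")
print(f"fits in 32 bits: {word < 2**32}")
```

Whether such a word is an instruction or a piece of data depends only on how the processor uses it; the bit pattern itself does not say.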

Supporting Units to the Core…
• Each core / processor is supported by a number of components / units inside the processor chip for fast computation and data storage / retrieval;
• The ARM1176JZF-S has:
– 33 general-purpose 32-bit registers (banked across the processor modes, so only 16 of them, r0 to r15, are visible at any one time);
– 7 dedicated / specialized registers;
– Arithmetic Logic Unit (ALU): the ALU performs all arithmetic and logic operations, and generates the condition codes that instructions use to set specific flags;
– Vector Floating Point (VFP) Co-Processor: for much faster floating-point arithmetic.

Memory Management Unit..
• The processor memory management unit (MMU) works with the cache memory system to control accesses to and from external / main memory;
• The MMU also controls the translation of virtual addresses to physical addresses;
• Capacity of the main / external memory = storage size (i.e. number of memory addresses) × [size of each memory address / cell], e.g.
$\text{Capacity} = 2^{32} \times [32\ \text{bits}] = (2^{10} \times 2^{10} \times 2^{10} \times 2^{2}) \times [4 \times 8\ \text{bits}] = (1\text{K} \times 1\text{K} \times 1\text{K} \times 4) \times 4\ \text{bytes} = 16\ \text{Gbytes}$ (since $2^{10} = 1\text{K}$ and 8 bits = 1 byte).
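
The capacity calculation above can be checked with a few lines of Python. This is only a sketch under the same assumption as the example, namely that every one of the 2^32 addresses holds a 32-bit cell; the constant names are hypothetical.

```python
ADDRESS_BITS = 32   # width of a memory address, as in the example above
CELL_BITS = 32      # size of each memory cell (assumption taken from the example)
BITS_PER_BYTE = 8

# Number of distinct addresses multiplied by the size of each cell in bytes.
num_addresses = 2 ** ADDRESS_BITS
capacity_bytes = num_addresses * (CELL_BITS // BITS_PER_BYTE)

# 1 Gbyte is taken as 2**30 bytes (1K x 1K x 1K).
capacity_gbytes = capacity_bytes // 2**30
print(capacity_gbytes, "Gbytes")
```

Under the alternative assumption of byte-sized cells (CELL_BITS = 8), the same calculation gives 4 Gbytes.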

The Prefetch Unit & Instruction Cache..
• The prefetch unit fetches (16-bit or 32-bit) instructions from the instruction cache (also called the I-cache), the Instruction Tightly Coupled Memory (TCM), or external memory, and predicts the outcome of branches in the instruction stream (to be covered later);
• Modern microprocessors make extensive use of caches (L1 / L2 / L3; D-cache or I-cache) for fast access to and storage of data / instructions. Access speed: registers >> caches >> main memory (where >> means "faster than").

Load & Store Unit (LSU)…
• The Load Store Unit (LSU) manages all LOAD and STORE operations, e.g. LOAD a value from the memory address $A0A1B007 to register R0;
• The load-store pipeline decouples load and store operations from the other pipelines, such as those for the ALU operations.

VFPv2 Registers…
• ARMv6 defines a floating-point subarchitecture called Vector Floating-point v2 (VFPv2), for which the Raspberry Pi provides a hardware implementation;
• We already know that the ARM architecture provides 16 general-purpose registers, r0 to r15, some of which play special roles: r13, r14 and r15;
• Despite their name, these general-purpose registers cannot be used to operate on floating-point numbers, so VFPv2 provides its own dedicated registers;
• These VFPv2 registers are named s0 to s31 for single-precision and d0 to d15 for double-precision floating-point operations.
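
The two register names refer to the same storage: in VFPv2 each double-precision register d<n> overlaps the pair of single-precision registers s<2n> and s<2n+1>. The sketch below models that overlap in Python; the helper names are hypothetical, and placing the low 32-bit word in the even-numbered register is stated here as an assumption of the model.

```python
import struct


def d_to_s_pair(n):
    """Return the indices of the two s registers that overlap register d<n>."""
    if not 0 <= n <= 15:
        raise ValueError("VFPv2 has double-precision registers d0 to d15 only")
    return 2 * n, 2 * n + 1


def split_double(value):
    """Split a 64-bit double into (low word, high word).

    Model assumption: the low word sits in s<2n>, the high word in s<2n+1>.
    """
    low, high = struct.unpack("<II", struct.pack("<d", value))
    return low, high


print(d_to_s_pair(3))                      # s registers shared with d3
print([hex(w) for w in split_double(1.0)])  # the two 32-bit halves of 1.0
```

This is why s0 to s31 (32 x 32 bits) and d0 to d15 (16 x 64 bits) describe the same amount of register storage.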

The 5-Stage Pipeline in ARM8…
Basically, the ARM8 processor uses a 5-stage pipeline, with the prefetch unit occupying the first stage and the integer unit using the remaining four stages:
1. Instruction prefetch (IF).
2. Instruction decode and register read (ID).
3. Execute (shift and ALU) (EX).
4. Data memory access (Mem).
5. Write-back results (WB).
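
The overlap of the five stages listed above can be sketched as follows. This is an idealized model only (one instruction enters per cycle, with no stalls or branch corrections, which a real ARM8 would not guarantee), and the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "Mem", "WB"]


def pipeline_schedule(num_instructions):
    """Return {cycle: {stage: instruction number}} for an ideal pipeline."""
    schedule = {}
    for instr in range(num_instructions):
        for depth, stage in enumerate(STAGES):
            # Instruction k (counting from 1) reaches stage s in cycle k + s.
            cycle = instr + depth + 1
            schedule.setdefault(cycle, {})[stage] = instr + 1
    return schedule


def total_cycles(num_instructions):
    """Cycles needed to finish all instructions in the ideal pipeline."""
    return len(STAGES) + num_instructions - 1


for cycle, busy in sorted(pipeline_schedule(4).items()):
    row = "  ".join(f"{stage}:i{busy[stage]}" for stage in STAGES if stage in busy)
    print(f"cycle {cycle}: {row}")
print("total cycles:", total_cycles(4))
```

Without pipelining the same four instructions would need 4 x 5 = 20 stage-times in sequence; the idealized pipeline finishes them in 8 cycles.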

The Pipeline Organization in ARM8
[Figure: ARM8 pipeline organization, showing the prefetch unit (IFetch) and the integer unit (ID / EX / Mem / WB).]

The ARM8 Applications…
• The above ARM8 was designed as a general-purpose processor core that can readily be manufactured by ARM Limited’s many licensees.
• It offers significantly (two to three times) higher performance than the simpler ARM7 cores for a similar increase in silicon area, and requires the support of double-bandwidth on-chip memory if it is to realize its full potential.
• One application of the ARM8 core is to build a high-performance CPU such as the ARM810.

Branch Prediction by the Prefetch Unit…
• The prefetch unit of the ARM8 processor is responsible for branch prediction. It uses static prediction based on the branch direction (backward branches are predicted ‘taken’, whereas forward branches are predicted ‘not taken’) to guess where the instruction stream will go;
• The integer unit then computes the exact stream and issues corrections to the prefetch unit where necessary.
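
The direction-based rule described above can be sketched in a few lines. The addresses and the loop outcome sequence below are made-up illustrations, not values from any real program, and the function names are hypothetical.

```python
def predict_taken(branch_address, target_address):
    """Static rule: a backward branch (target below the branch) is predicted taken."""
    return target_address < branch_address


def count_mispredictions(branch_address, target_address, outcomes):
    """Count how often the fixed static prediction disagrees with the real outcomes."""
    prediction = predict_taken(branch_address, target_address)
    return sum(1 for taken in outcomes if taken != prediction)


# Hypothetical loop: the branch at 0x8010 jumps back to 0x8000 nine times,
# then falls through once when the loop ends.
loop_outcomes = [True] * 9 + [False]
print(count_mispredictions(0x8010, 0x8000, loop_outcomes))

# Hypothetical forward branch that is rarely taken (e.g. an error check).
rare_outcomes = [False] * 9 + [True]
print(count_mispredictions(0x8010, 0x8040, rare_outcomes))
```

The rule works well for loops because the backward branch that closes a loop is taken on every iteration except the last.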

Independent Fetch Unit
[Figure: instruction (pre)fetch with branch prediction sends a stream of instructions (in-order issue) to the execution unit (the integer unit in ARM8), which returns correctness feedback on branch results.]
• Instruction fetch is decoupled from execution;
• Issue logic is often included with the fetch unit.

Prediction: Branches, Dependencies, or even Data..
• Prediction has become essential to getting good performance from scalar instruction streams.
• We will discuss predicting branches. However, architects are now predicting everything: data dependencies, actual data, and the results of groups of instructions;
• At what point does computation become a probabilistic operation plus verification?
• We are pretty close with control hazards already…
• Why does prediction work?
– the underlying algorithm has regularities;
– the data being operated on has regularities;
– the instruction sequence has redundancies that are artifacts of the way that humans / compilers think about problems.
• Prediction => compressible information streams?

Dynamic Branch Prediction…
• Prediction can be “static” (at compile time, as used in the ARM8 architecture) or “dynamic” (at run-time)
– for our example, if we were to statically predict “taken”, we would only be wrong once on each pass through the loop;
• Is dynamic branch prediction better than static branch prediction?
– it seems to be, although there is still some debate about this (we will see some analysis later);
– today, a lot of hardware is devoted to dynamic branch predictors.

Dynamic Branch Prediction..
• Solution: a 2-bit scheme in which the prediction is changed only after mispredicting twice;
• Red: stop, not taken;
• Green: go, taken;
• This adds hysteresis to the decision-making process.
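
The effect of the hysteresis can be seen by comparing a 1-bit predictor with a 2-bit one on a loop branch. The sketch below uses a 2-bit saturating counter, which is one common way to realize a 2-bit scheme and is an assumption here, since the exact state diagram is not reproduced in the text. The class names, the starting states and the outcome sequence are all hypothetical.

```python
class OneBitPredictor:
    """Predicts whatever the branch did last time."""

    def __init__(self, last_taken=True):
        self.last_taken = last_taken

    def predict(self):
        return self.last_taken

    def update(self, taken):
        self.last_taken = taken


class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step towards the observed outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)


def mispredictions(predictor, outcomes):
    """Run the predictor over the outcomes and count wrong guesses."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# Three passes through a loop whose closing branch is taken 9 times, then not taken.
loop = ([True] * 9 + [False]) * 3
print("1-bit:", mispredictions(OneBitPredictor(), loop))
print("2-bit:", mispredictions(TwoBitPredictor(), loop))
```

In this model the 1-bit predictor is wrong twice per pass after the first (once at the loop exit and once on re-entry), while the 2-bit counter is wrong only at each loop exit, because a single not-taken outcome is not enough to flip its prediction.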

~~~END of Module-4~~~