
High Perf. Comp. Architecture
Module 0 : AN INTRODUCTION to Basic Computer Systems Concepts & High Performance Computing (HPC)

About the Evolution of Computer Systems..
[**mainly taken from Chp. 1, pp. 6 ~ 12, of “Advanced Computer Architecture – Parallelism, Scalability & Programmability” by Kai Hwang]
 From the 1st generation of electronic computers (like the IBM 701, based on vacuum tubes) thru’ the 4th generation (incl. the VAX 9000 or IBM 3090, based on VLSI), and now up to the 5th generation of massively parallel computers (like the Fujitsu VPP500, Cray/MPP and CM-5, using ULSI processors) built to achieve teraflops (10^12 floating-point operations per sec.), it has long been recognized that the concept of computer architecture is NO LONGER restricted to the bare machine hardware !

System Software…
 A modern computer is an integrated system consisting of machine hardware, an instruction set [to be elaborated later], system software [e.g. pre-processor, compiler, linker, loader, or instruction/process scheduler], application programs [e.g. assembly programs we showed in this course or examples from the reference books], and user interfaces.

Architecture of A Modern Computer System = H/W + S/W

System S/W Supports…
 System software (including the compiler or loader) is needed for the development of “efficient programs”, esp. for parallel computation, in high-level languages (HLL).

System S/W Supports…
 The compiler is generally used to translate source code written in HLL into object code. The (optimizing) compiler assigns variables to registers or to memory words, and reserves functional units (FUs) for operators. FUs are sometimes called processing elements (PEs) and are similar to the processor cores in each CPU.

System S/W Supports…
 An assembler is used to translate the compiled object code into machine code which can be directly recognized by the machine H/W;
 A loader is used to initiate the program execution through the OS kernel / manager.

To Achieve Greater Performance Gain Thru’ Parallelism !!
 Over the past few decades (~ 40 yrs), we have found that greater performance gain can be achieved by moving the execution of computer instructions or blocks of code from a sequential mode (one after another) to concurrent / parallel execution.
Assuming: all the involved instructions are independent of each other !
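To give a rough feel for this gain, the sketch below counts the time steps needed for a batch of mutually independent instructions. The function name `steps_needed` and the cost model (each functional unit completes one instruction per time step) are assumptions made for illustration only, not figures from the course.

```python
import math

def steps_needed(num_instructions: int, num_units: int) -> int:
    """Time steps to run independent instructions, one per unit per step."""
    if num_units < 1:
        raise ValueError("need at least one functional unit")
    return math.ceil(num_instructions / num_units)

sequential = steps_needed(8, 1)   # one instruction after another
parallel = steps_needed(8, 4)     # four independent instructions per step
speedup = sequential / parallel

print(sequential, parallel, speedup)
```

The model only holds under the stated assumption of independence; instructions that depend on each other's results cannot all be issued in the same step.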

Models of the Parallel Computation…
 Basically, the development of computer architecture for parallel computation in the past decades has gone thru’ evolutionary rather than revolutionary changes, as depicted in the diagram that follows.

Evolutionary Models of the Parallel Computation..
The scope of this course runs all the way from “scalar” thru’ “pipeline” to “implicit vector” in the above diagram.
[taken from Fig. 1.2 of Kai Hwang’s “Advanced Comp. Arch.”]

Further Notes…
 SIMD – stands for “single instruction stream over multiple data streams”. SIMD machines are an example of vector computers.
 MIMD – stands for “multiple instruction streams over multiple data streams”. MIMD machines are an example of parallel, specifically MPP, computers.
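The difference between the two definitions can be sketched in a few lines. The helper names `simd` and `mimd` are hypothetical, and plain Python runs the elements one after another, so this only models which operation is applied to which data stream, not the parallel hardware itself.

```python
from typing import Callable, List

def simd(op: Callable[[int], int], data: List[int]) -> List[int]:
    """Single instruction stream: the same operation on every data element."""
    return [op(x) for x in data]

def mimd(ops: List[Callable[[int], int]], data: List[int]) -> List[int]:
    """Multiple instruction streams: each data element gets its own operation."""
    if len(ops) != len(data):
        raise ValueError("need exactly one operation per data element")
    return [op(x) for op, x in zip(ops, data)]

data = [1, 2, 3, 4]

# SIMD: one "multiply by 2" instruction applied across all four data streams.
simd_result = simd(lambda x: x * 2, data)

# MIMD: four different instructions, one per data stream.
mimd_result = mimd(
    [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x],
    data,
)

print(simd_result)
print(mimd_result)
```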

KEY Modules of this Course..
 This course covers 5 KEY modules as a fully integrative approach covering BOTH fundamental and NEW development in HPC !
Mod. 1 – Basic Issues in Pipelining
Mod. 2 – Adv. Pipelining & Dynamic Scoreboard
Mod. 3 – Dynamic Tomasulo’s Approach
Mod. 4 – * ARM Design & Predictive Approach
Mod. 5 – * Cloud Computing & Sys. Architectures
Core / Fundamental HPC Concepts

Fundamental Concepts/Def’s for Computer Systems
 Recall from p.10 of “Motivational Notes on HPC” that a central processing unit (CPU) is the “brain” of each computer system;
 Each CPU consists of a few [say 2 − 8] (processor) cores, where each core is a processing unit that reads in instructions to perform specific actions.
 Thus, to look into the behavior / performance of a computer system, we first study the structure of its CPU / processor.

Structure of the ARM processor…
 Below is the structure of the ARM1176JZF-S processor commonly used in many micro-controllers / mobile devices.
We can see that each processor / core is made up of different components / functional units serving various purposes.

About the core/processor…
 The ARM1176JZF-S is a 32-bit processor / core, i.e. its computer word length is 32 bits, meaning the processor / core can handle a 32-bit instruction / data word in each clock cycle.
(32-bit instruction / data)
00101011 01101001 10110110 00110101
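As a small check on what a 32-bit word looks like to software, the sketch below parses the bit pattern shown above. The variable names are illustrative, and treating the word as an unsigned integer is an assumption; the slide does not say how the bits are meant to be interpreted.

```python
WORD_BITS = 32

# The 32-bit pattern from the slide, with the byte separators removed.
bits = "00101011 01101001 10110110 00110101".replace(" ", "")
if len(bits) != WORD_BITS:
    raise ValueError("expected exactly one 32-bit word")

word = int(bits, 2)                      # unsigned value of the word
as_hex = f"{word:08X}"                   # 8 hex digits = 32 bits
byte_values = [int(bits[i:i + 8], 2) for i in range(0, WORD_BITS, 8)]
max_unsigned = 2 ** WORD_BITS - 1        # largest value a 32-bit word can hold

print(as_hex)
print(byte_values)
print(max_unsigned)
```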

Supporting Units to the Core…
 Each core / processor is well supported by a no. of components / units inside the processor chip for fast computation and data storage / retrieval;

The ARM1176JZF-S has:
 33 general-purpose 32-bit registers (R0…R32);
 7 dedicated / specialized registers;
 Arithmetic Logic Unit (ALU) – the ALU performs all arithmetic and logic operations, and generates the condition codes for instructions to set specific flags;
 Vector Floating Point (VFP) Co-Processor – for much faster floating-point arithmetic.

Memory Management Unit..
 The processor memory management unit (MMU) works with the cache memory system to control accesses to and from external/main memory;
 The MMU also controls the translation of virtual addresses to physical addresses;
 Capacity of the Main / External Memory = Storage Size (i.e. No. of Memory Addresses) × [Size of Each Memory Address / Cell]
e.g. Capacity = 2^32 × [32 bits]
= 1K × 1K × 1K × 2^2 × [4 × 8 bits] (where 2^10 = 1K)
= 16 G bytes (since 8 bits = 1 byte)
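The capacity arithmetic above can be reproduced directly. The helper `capacity_in_gbytes` is a hypothetical name, and the sketch assumes, as the example does, that every address holds one full cell (word-addressed memory) and that 1 G = 2^30.

```python
BITS_PER_BYTE = 8
BYTES_PER_G = 2 ** 30   # 1 G = 1K x 1K x 1K

def capacity_in_gbytes(address_bits: int, cell_bits: int) -> int:
    """Capacity = (no. of addresses) x (size of each cell), in G bytes."""
    num_addresses = 2 ** address_bits
    capacity_bytes = num_addresses * cell_bits // BITS_PER_BYTE
    return capacity_bytes // BYTES_PER_G

# The slide's example: 2^32 addresses, each holding a 32-bit cell.
word_addressed = capacity_in_gbytes(32, 32)

# For contrast (an added assumption): the same 2^32 addresses with 8-bit cells.
byte_addressed = capacity_in_gbytes(32, 8)

print(word_addressed, byte_addressed)
```

The contrast shows why the cell size matters: the same number of addresses gives a quarter of the capacity when each address holds only one byte.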

The Prefetch Unit & Instruction Cache..
 The prefetch unit fetches (16-bit or 32-bit) instructions from the instruction cache (also called I-cache), Instruction Tightly Coupled Memory (TCM), or from external memory and predicts the outcome of branches in the instruction stream (to be covered in Mod-4);
 Modern microprocessors make extensive use of caches, i.e. the (L1/L2/L3) D-cache or I-cache, for fast access and storage of data / instructions. Access speed : Registers >> Caches >> Main memory (>> means faster than)

Load & Store Unit (LSU)…
 The Load Store Unit (LSU) manages all LOAD and STORE operations, e.g. LOAD a value from the memory address $A0A1B007 to register R0;
 The load-store pipeline decouples load and store operations from the other pipelines such as those for the ALU operations.

5 Basic Types of Instructions for ALL Computer Systems..
 To facilitate our subsequent discussion in all upcoming lecture notes about pipelined computer systems like the MIPS R3000 / R4000 architecture, we generally categorize ALL assembly instructions into 5 BASIC TYPES for MOST computer systems;
 In the 2nd (Interpretation) column of the following tables,
 the first comment highlighted in blue is the interpretation / meaning of the instruction on the MIPS R3000/R4000 pipeline, or similarly the DLX architecture, as commonly adopted in our subsequent lecture notes;
 the second comment highlighted in orange is a possible interpretation on other architectures.

5 Basic Types of Instructions for ALL Computer Systems..
Types of Instructions & Examples
Interpretation/Meaning
 R-type : instructions relating to registers only
e.g. ADD R1, R2, R3
– the interpretation is ADD [DST], [SRC1], [SRC2]
So, R2 + R3 → (i.e. assign to) R1 as the destination
– other possible interpretation: ADD [SRC1], [SRC2], [DST]
So, R1 + R2 → R3
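The two readings of ADD R1, R2, R3 give different results on the same register contents, as the sketch below shows. The function names and the starting register values are made up for illustration; registers are modelled as a plain dictionary.

```python
def add_dst_first(regs: dict, a: str, b: str, c: str) -> None:
    """ADD a, b, c read as ADD [DST], [SRC1], [SRC2]: b + c -> a."""
    regs[a] = regs[b] + regs[c]

def add_dst_last(regs: dict, a: str, b: str, c: str) -> None:
    """ADD a, b, c read as ADD [SRC1], [SRC2], [DST]: a + b -> c."""
    regs[c] = regs[a] + regs[b]

start = {"R1": 1, "R2": 2, "R3": 4}      # illustrative values only

mips_style = dict(start)
add_dst_first(mips_style, "R1", "R2", "R3")   # R2 + R3 -> R1

other_style = dict(start)
add_dst_last(other_style, "R1", "R2", "R3")   # R1 + R2 -> R3

print(mips_style)
print(other_style)
```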

5 Basic Types of Instructions for ALL Computer Systems..
 ORi : Operands (as already stored in memory) and Register i
e.g. ADD R1, $1010, R3
– the interpretation is ADD [DST], [SRC1], [SRC2]
So, [$1010] + R3 → (i.e. assign to) R1 as the destination
– other possible interpretation: ADD [SRC1], [SRC2], [DST]
So, R1 + [$1010] → R3

5 Basic Types of Instructions for ALL Computer Systems..
 LW : Load a (computer) Word from the main memory to a register
e.g. LW R1, $1010
– the interpretation is [$1010] → R1 as the destination
– other possible interpretation: SAME AS ABOVE
 SW : Store a (computer) Word from a register to the main memory
e.g. SW R1, $1010
– the interpretation is R1 → [$1010] as the destination
– other possible interpretation: SAME AS ABOVE
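LW and SW move one word in opposite directions between memory and a register. In the minimal sketch below, memory is a dictionary keyed by address; reading `$1010` as the hexadecimal address 0x1010, the stored value 42, and the function names are all assumptions for illustration.

```python
def lw(regs: dict, memory: dict, reg: str, addr: int) -> None:
    """LW reg, addr: copy the word at memory[addr] into the register."""
    regs[reg] = memory[addr]

def sw(regs: dict, memory: dict, reg: str, addr: int) -> None:
    """SW reg, addr: copy the register's word into memory[addr]."""
    memory[addr] = regs[reg]

memory = {0x1010: 42}    # illustrative content of address $1010
regs = {"R1": 0}

lw(regs, memory, "R1", 0x1010)   # [$1010] -> R1
regs["R1"] += 1                  # some work done on the register
sw(regs, memory, "R1", 0x1010)   # R1 -> [$1010]

print(regs["R1"], memory[0x1010])
```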

5 Basic Types of Instructions for ALL Computer Systems..
 BEQ : Branch on Equal (or Branch instructions in general)
e.g. BEQ R1, $1832
– the interpretation is
IF (R1 == 0) then jump to addr. $1832 to execute the instruction there
– other possible interpretation: SAME AS ABOVE
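A branch only decides which address is executed next. The sketch below follows the simplified one-register form used above (jump when the register is 0). The fall-through step of one address per instruction, the starting program counter 0x1000, and the function name `beq` are assumptions for illustration.

```python
INSTR_STEP = 1   # assumed: the next sequential instruction is one address on

def beq(regs: dict, pc: int, reg: str, target: int) -> int:
    """Return the next program counter after BEQ reg, target."""
    if regs[reg] == 0:
        return target            # branch taken: jump to the target address
    return pc + INSTR_STEP       # branch not taken: fall through

taken = beq({"R1": 0}, 0x1000, "R1", 0x1832)
not_taken = beq({"R1": 5}, 0x1000, "R1", 0x1832)

print(hex(taken), hex(not_taken))
```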

ARM Instructions & MIPS/DLX..
 In fact, ARM instructions are also very similar to the 5 basic formats of MIPS/DLX (i.e. the first [blue] highlighted interpretation), e.g.
Common ARM Instructions    Description
ADD x, y, z                y + z → x [DST] (ref. : R-type)
LDR r, addr                Load into register r from addr (ref.)
STR r, addr                Store from register r to addr (ref. )
BEQ

~~ END of Module 0 ~~