Arithmetic for Computers 2: Floating Point Numbers
CS 154: Computer Architecture Lecture #9
Winter 2020
Ziad Matni, Ph.D.
Dept. of Computer Science, UCSB

Administrative
• Lab 4 due today!
• Lab 5 out soon
• Syllabus (Schedule Section) has been updated
2/5/20 Matni, CS154, Wi20

Midterm Exam (Wed. 2/12)
What’s on It?
• Everything we’ve done so far from start to Monday, 2/10
What Should I Bring?
• Your pencil(s), eraser, MIPS Reference Card (on 1 page)
• You can bring 1 sheet of hand-written notes (turn it in with exam). 2 sides ok.
What Else Should I Do?
• IMPORTANT: Come to the classroom 5-10 minutes EARLY
• If you are late, I may not let you take the exam
• IMPORTANT: Use the bathroom before the exam – once inside, you cannot leave
• Random seat assignments
• Bring your UCSB ID

Lecture Outline
• Floating Point Number Representations
• IEEE 754 F-P Standard
• Arithmetic in F-P
• Instructions for F-P
• Hardware implementations

Floating Point
• Representation for non-integral numbers
• Including very small and very large numbers
• Usually follows some “normalized” form of scientific notation

Floating Point Numbers in CPUs
We need 3 pieces of information to produce a binary floating point number:
+/- N x 2^E
• The sign of the number (positive or negative): +/-
• The mantissa (aka significand) of the number: N
• The exponent of the number: E

Representation in MIPS (Single Precision)
• The actual form is: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
• Called the IEEE 754 F-P Standard (more on this coming up)
• MIPS design for “single-precision” has:
1 bit for sign, 8 bits for exponent and 23 bits for fraction
• Gives a range from about 1.2 x 10^-38 to about 3.4 x 10^38 (quite large!)
• Overflow can occur: here it means that the exponent is too large to be represented in the exponent field.
• If a negative exponent is too large in magnitude to fit in the exponent field, then we get underflow.

Double Precision Floating Points
• Single Precision is float in C/C++
• Double Precision is double in C/C++
• 64 bits (2 words) instead of 32 bits
• 11 bits for exponent (instead of 8)
• 52 bits for fraction (instead of 23)
Gives a wider range and greater precision than single-precision
Range is: about 2.2 x 10^-308 to about 1.8 x 10^308

IEEE 754 Floating-Point Standard
• Includes single and double-precision definitions (since 1980s)
• Very widespread in almost all CPUs today
• S = 0 → positive, S = 1 → negative
• The “Bias” is 127 for single-precision and 1023 for double-precision
• The “1” in “1 + Fraction” is implicit
Examples with single-precision:
S=0, E=0x82, F=0 is: (+1) x (1 + 0) x 2^(130-127) = 1 x 2^3 = 8
S=0, E=0x83, F=0x600000 is: (+1) x (1 + 0.11 (bin)) x 2^(131-127) = 1.11 (bin) x 2^4 = 11100 (bin) = 28
Useful website: https://www.h-schmidt.net/FloatConverter/IEEE754.html

More Examples!
• Hex word for single-precision F-P is: 0x3FA00000
• So:
0011 1111 1010 0000 … 0000
S=0  E=0x7F=127  F=010…0
• So:
Number = (+1) x (1 + 0.01 (bin)) x 2^(127 - 127) = 1.01 (bin) = 1 + 1 x 2^-2 = 1.25

Yet More Examples!!
• Hex word for single-precision F-P is: 0xBF300000
• So:
1011 1111 0011 0000 … 0000
S=1  E=0x7E=126  F=011…0
• So:
Number
= (-1) x (1 + 0.011 (bin)) x 2^(126 - 127)
= -1.011 (bin) x 2^-1
= -(1 + (1 x 2^-2) + (1 x 2^-3)) x 2^-1
= -(1 + 0.25 + 0.125) x 0.5
= -0.6875
Reminder:
2^-1 = 0.5
2^-2 = 0.25
2^-3 = 0.125
2^-4 = 0.0625
2^-5 = 0.03125

Even More Examples!!!
• What is the single-precision word (in hex) of the F-P number 29.125?
• Ok, here we go:
I am reminded that 0.125 = 2^-3
And, I know that 29 in binary is: 11101
So 29.125 (dec) = 11101.001 (bin) = 1.1101001 (bin) x 2^4
This is a positive number, so S = 0
F = 1101001000…0 (23 bits in all)
E = 4 + 127 = 131 = 10000011
• So:
Number in bin = 0 10000011 1101001000…0
or 0100 0001 1110 1001 0…0 = 0x41E90000
Reminder:
2^-1 = 0.5
2^-2 = 0.25
2^-3 = 0.125
2^-4 = 0.0625
2^-5 = 0.03125

Special Exponent Values
Consider Single-Precision Numbers:
• Exponents 0x00 and 0xFF are reserved
• Smallest exponent is 1 → Actual exponent = 1 - 127 = -126
• Smallest fraction is 0
• So, I get ±1.0 x 2^-126 ≅ ±1.2 x 10^-38
• Largest exponent is 0xFE = 254 → Actual exp. = 254 - 127 = 127
• Largest fraction is 111…11, which approaches 1
• So, I get ±2.0 x 2^+127 ≅ ±3.4 x 10^+38

Special IEEE 754 Values
• IEEE 754 allows for special symbols to represent “unusual events”
• When S=0, E=0xFF, F=0, IEEE calls the number “inf” (i.e. infinity)
• “-inf” is when S=1, E=0xFF, F=0
• These are to optionally allow programmers to divide by 0.
• Also allows for the result of invalid operations
These are called “Not a Number” or “NaN” (E=0xFF, F≠0)
• Example: 0/0, inf - inf, etc…

Floating-Point Addition
Consider a 4-digit decimal example:
9.999 x 10^1 + 1.610 x 10^-1
1. Align decimal points
• Shift number with smaller exponent
• 9.999 x 10^1 + 0.016 x 10^1
2. Add significands
• 10.015 x 10^1
3. Normalize result & check for over/underflow
• 1.0015 x 10^2
4. Round and renormalize if necessary (what? why? Be patient…)
• 1.002 x 10^2
Floating-Point Addition
Consider a 4-digit binary example:
1.000 x 2^-1 + -1.110 x 2^-2
1. Align binary points
• Shift number with smaller exponent
• 1.000 x 2^-1 + -0.111 x 2^-1
2. Add significands
• 0.001 x 2^-1
3. Normalize result & check for over/underflow
• 1.000 x 2^-4
4. Round and renormalize if necessary
• 1.000 x 2^-4 = 0.0625

Re: Rounding in Binary F-P
• Can we create ANY floating point number in binary?
• What about 0.3333… (i.e. 1/3)?
• In binary, 1/10 is the infinitely repeating fraction 0.0001100110011001100110011001100110011001100…
• Since we cannot create ALL F-P numbers in binary, rounding (i.e. approximating) is necessary
• Many users are not aware of the approximation because of the way values are displayed
• The actual stored value is the nearest representable binary fraction

C++ Program to Illustrate Rounding in Binary F-P
#include <iostream>
#include <iomanip>
int main()
{
    // Try running the program without the next 2 lines
    // as a comparison. Or change the precision number around.
    std::cout << std::setprecision(30);
    std::cout << std::fixed;
    float a = 1.0/3;
    double b = 1.0/3;
    std::cout << a << "\n" << b << "\n";
    float x = 1.0/10;
    double y = 1.0/10;
    std::cout << x << "\n" << y;
}

Floating-Point Adder Hardware
• Much more complex than integer adder
• Remember the 4 steps from a couple of slides ago?...
• Doing it in one clock cycle would take too long
  • Would force a slower clock on the system
  • How much we can do in 1 clock cycle is a matter for later discussion
• FP adder usually takes several cycles
  • Can be pipelined for more efficient operation

FP Adder Hardware
(figure not reproduced)

Other FP Arithmetic Hardware
• FP multiplier is of similar complexity to FP adder
  • But uses a multiplier for significands instead of an adder
• FP arithmetic hardware (incl. addition) is usually in a co-processor & does:
  • Addition, subtraction, multiplication, division, reciprocal, square-root
  • FP ↔ integer conversion
• Operations usually take several cycles
  • Can be pipelined

MIPS FP Instructions
                 Single-Precision   Double-Precision
Addition         add.s              add.d
Subtraction      sub.s              sub.d
Multiplication   mul.s              mul.d
Division         div.s              div.d
Comparisons      c.xx.s             c.xx.d
Load             lwc1               ldc1
Store            swc1               sdc1
Where xx can be eq, neq, lt, gt, le, ge. Example: c.eq.s
Also, F-P branch, true (bc1t) and branch, false (bc1f)

MIPS FP Instructions
• FP instructions operate only on FP registers
  • Programs generally don’t do integer ops on FP data, or vice versa
  • More registers with minimal code-size impact

The Floating Point Registers
• MIPS has 32 separate registers for floating point:
  • $f0, $f1, etc...
• Paired for double-precision
  • $f0/$f1, $f2/$f3, etc...
• Example MIPS assembly code:
lwc1  $f4, 0($sp)     # Load 32b F.P. number into F4
lwc1  $f6, 4($sp)     # Load 32b F.P. number into F6
add.s $f2, $f4, $f6   # F2 = F4 + F6 single precision
swc1  $f2, 8($sp)     # Store 32b F.P. number from F2

Example Code
C++ code:
float f2c (float fahr) {
    return ((5.0/9.0)*(fahr - 32.0));
}
Assume: fahr in $f12, result in $f0, constants in global memory space (i.e. defined in .data)
Compiled MIPS code:
f2c: lwc1  $f16, const5        # $f16 = 5.0
     lwc1  $f18, const9        # $f18 = 9.0
     div.s $f16, $f16, $f18    # $f16 = 5.0 / 9.0
     lwc1  $f18, const32       # $f18 = 32.0
     sub.s $f18, $f12, $f18    # $f18 = fahr - 32.0
     mul.s $f0, $f16, $f18     # $f0 = (5.0/9.0) * (fahr - 32.0)
     jr    $ra                 # return to caller

YOUR TO-DOs for the Week
• Readings!
• Work on Lab 5!
• Start studying for the midterm!