4.1: Floating Point Numbers
CSU11022 – Introduction to Computing II
Dr / School of Computer Science and Statistics
D.A.Patterson, J.L.Hennessy, “Computer Organisation and Design: ARM Edition”, Morgan-Kaufmann, 2016.
(Section 3.5: Floating Point, available in the Library, doesn’t have to be the ARM Edition)
// some really small numbers and one large number
public class SumOrder {
    public static void main(String[] args) {
        float[] vals = {
            3.7e-5f, 4.8e-5f, 1.7e-5f, 2.4e-5f,
            3.7e-5f, 4.8e-5f, 1.7e-5f, 2.4e-5f,
            3.7e-5f, 4.8e-5f, 1.7e-5f, 2.4e-5f,
            3.7e-5f, 4.8e-5f, 1.7e-5f, 2.4e-5f,
            12345.0f // the one large number (value inferred from the printed sums)
        };
        float result;

        // add the numbers first-to-last
        result = 0;
        for (int i = 0; i < vals.length; i++) {
            result += vals[i];
        }
        System.out.println("sum first-to-last: " + String.format("%.8f", result));
        // output: sum first-to-last: 12345.00097656

        // add the numbers last-to-first
        result = 0;
        for (int i = vals.length - 1; i >= 0; i--) {
            result += vals[i];
        }
        System.out.println("sum last-to-first: " + String.format("%.8f", result));
        // output: sum last-to-first: 12345.00000000
    }
}
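One way to see why the order of addition matters is to look at the spacing between neighbouring float values near 12345, the magnitude of the printed sums. The sketch below is an illustration only; the class and method names (UlpDemo, spacingAt, isAbsorbed) are made up for this example, while Math.ulp is a standard library method.

```java
public class UlpDemo {
    /** Gap between x and the next representable float of larger magnitude. */
    static float spacingAt(float x) {
        return Math.ulp(x);
    }

    /** True if adding small to big leaves big unchanged in float arithmetic. */
    static boolean isAbsorbed(float big, float small) {
        return big + small == big;
    }

    public static void main(String[] args) {
        float big = 12345.0f;   // magnitude taken from the printed sums
        float small = 4.8e-5f;  // the largest of the small values
        System.out.println("spacing near 12345: " + spacingAt(big));
        System.out.println("small value absorbed: " + isAbsorbed(big, small));
    }
}
```

Each small value is less than half the spacing near 12345, so adding it to a running total that is already about 12345 changes nothing. Summed on their own first, the small values grow large enough to move the total by one step.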
Binary number representation
32 bits give 2^32 unique values that we can use to represent different things
e.g. unsigned integers
0 … 2^32 - 1 (or 0 … 4,294,967,295)
e.g. signed integers using 2’s complement
-2^31 … 0 … +2^31 - 1 (or -2,147,483,648 … 0 … +2,147,483,647)
How do we represent real numbers like 2½ or 3.14159265…? Also, how do we represent values with really large or really small magnitudes?
e.g. 2.2 × 10^11 e.g. 1.3 × 10^-8
Trinity College Dublin, The University of Dublin © / Trinity College Dublin 2015-2023
Scientific notation – decimal
The values 2.2 × 10^11 and 1.3 × 10^-8 are examples of (normalised) scientific notation in decimal form
Values expressed in normalised scientific notation, f × 10^e, satisfy the condition:
1 ≤ |f| < 10
Normalised scientific notation gives us one canonical form in which to express a value and allows quick, visual comparison of magnitude
As computer scientists, we avoid expressing the same thing in different ways (a==b?)
e.g. the same value written four ways, of which only the last is normalised:
372.98    37.298 × 10^1
3729.8 × 10^-1    3.7298 × 10^2
Binary Floating-Point Numbers
Convert the following binary numbers to decimal numbers with fractions
10010101 = 1×2^7 + 1×2^4 + 1×2^2 + 1×2^0 = 149
1.1 = 1×2^0 + 1×2^-1 = 1½ = 1.5
101000.01 = 1×2^5 + 1×2^3 + 1×2^-2 = 40¼ = 40.25
Convert the following decimal numbers to binary floating point numbers (repeatedly multiply the fractional part by 2; the integer part of each product is the next bit after the radix point)
7.75 = 111.11
0.75 × 2 = 1.5    0.5 × 2 = 1.0
2.1 = 10.000110011001100 ...
9.3125 = 1001.0101
0.3125 × 2 = 0.625    0.625 × 2 = 1.25    0.25 × 2 = 0.5    0.5 × 2 = 1.0
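The multiply-by-2 conversion can be sketched in Java as follows. This is a minimal illustration, not a library routine: the class BinaryFraction and the method toBinary are hypothetical names, the value is assumed to be non-negative, and maxFractionBits is an assumed cut-off for fractions such as 2.1 that never terminate in binary.

```java
public class BinaryFraction {
    /** Converts a non-negative value to a binary string with at most maxFractionBits fraction bits. */
    static String toBinary(double value, int maxFractionBits) {
        long whole = (long) value;          // integer part
        double frac = value - whole;        // fractional part, 0 <= frac < 1
        StringBuilder bits = new StringBuilder(Long.toBinaryString(whole));
        if (frac == 0.0) {
            return bits.toString();         // nothing after the radix point
        }
        bits.append('.');
        for (int i = 0; i < maxFractionBits && frac != 0.0; i++) {
            frac *= 2;                      // shift the next bit left of the radix point
            if (frac >= 1.0) {
                bits.append('1');
                frac -= 1.0;                // keep only the remaining fraction
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(7.75, 23));    // 111.11
        System.out.println(toBinary(9.3125, 23));  // 1001.0101
        System.out.println(toBinary(2.1, 15));     // 10.000110011001100
    }
}
```

The loop stops early when the remaining fraction reaches zero, which is why 7.75 and 9.3125 produce short, exact results while 2.1 runs until the bit limit.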
Scientific notation – binary
Like decimal values, we can express binary values using scientific notation (again, in normalised form)
1010.1 = 1.0101 × 2^3    0.00101 = 1.01 × 2^-3
The general form is again:
f × 2^e
and in normalised form, f satisfies the following condition:
1_2 ≤ |f| < 10_2
e.g. 5.75_10 = 101.11_2 × 2^0 = 1.0111_2 × 2^2
The normalized form of a binary number expressed using scientific notation forms the basis for its representation in a computer
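As a quick check of the normalised form, Java can report the exponent and the normalised fraction of a value directly. The sketch below uses the standard Math.getExponent and Math.scalb methods; the class and method names (BinaryNormalise, exponentOf, significandOf) are invented for the example, and it assumes an ordinary non-zero value.

```java
public class BinaryNormalise {
    /** Unbiased power of two e such that value = f x 2^e with 1 <= |f| < 2. */
    static int exponentOf(float value) {
        return Math.getExponent(value);
    }

    /** Normalised fraction f, obtained by scaling the value by 2^-e. */
    static float significandOf(float value) {
        return Math.scalb(value, -Math.getExponent(value));
    }

    public static void main(String[] args) {
        float[] samples = { 5.75f, 10.5f, 0.15625f };
        for (float v : samples) {
            System.out.println(v + " = " + significandOf(v) + " x 2^" + exponentOf(v));
        }
    }
}
```

For 5.75 this gives a fraction of 1.4375 (1.0111 in binary) and an exponent of 2, matching the example above. The other two samples are 1010.1 and 0.00101 in binary.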
4.2: IEEE-754
https://www.h-schmidt.net/FloatConverter/IEEE754.html
IEEE 754 Floating-Point representation
Use a different interpretation of a 32-bit value to represent floating point numbers, e.g. IEEE 754
sign (s): bit 31 | exponent (e): bits 30–23 | fraction (f): bits 22–0
How can we represent ...
... positive and negative values?
... values with positive and negative exponents?
Where is the binary (radix) point?
(-1)^s × f × 2^e
Sign of Exponent and Fraction
Sign bit (s):
0 ⇒ positive floating-point number
1 ⇒ negative floating-point number
Positive and negative exponents?
Option 1: 2’s complement exponents
Option 2: Biased exponents
Subtract a constant bias (b = 127) from the stored exponent to obtain the signed exponent
(-1)^s × f × 2^(e-b)
Storing the fraction
The following two representations are of the same value (3.5_10)
s = 0, e = 10000000 (128), f = 1.1100000000000000000000
+1.11 × 2^(128-127) = 11.1_2 = 3.5_10
s = 0, e = 10000001 (129), f = 0.11100000000000000000000
+0.111 × 2^(129-127) = 11.1_2 = 3.5_10 (same value!)
We don’t want multiple representations of the same value!
if (a == b) ...
Storing floating-point numbers in normalised form avoids this problem: 1_2 ≤ |f| < 10_2, so f is in the form 1.ddddd...
Normalization and the Hidden Bit
With normalisation, 0.0101 × 2^-4 becomes 1.0100 × 2^-6:
adjust the fraction so there is a single 1 to the left of the radix point
compensate by adjusting the exponent accordingly
If there is always going to be a 1 to the left of the radix point, we don’t need to store it!
Increases precision by one bit!
Final IEEE 754 Floating-Point Representation
sign (s): bit 31 | exponent (e): bits 30–23 | fraction (f): bits 22–0
(-1)^s × 1.f × 2^(e-b)
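The three fields can be pulled out of a Java float with shifts and masks. The following is a small sketch rather than a definitive tool: Float.floatToIntBits is a standard method, while FloatFields and its methods are hypothetical names chosen for this illustration.

```java
public class FloatFields {
    /** Sign bit s (bit 31). */
    static int sign(float v) {
        return Float.floatToIntBits(v) >>> 31;
    }

    /** Stored (biased) exponent e (bits 30..23). */
    static int storedExponent(float v) {
        return (Float.floatToIntBits(v) >>> 23) & 0xFF;
    }

    /** Stored fraction f without the hidden bit (bits 22..0). */
    static int fraction(float v) {
        return Float.floatToIntBits(v) & 0x7FFFFF;
    }

    /** Formats the fields as "s eeeeeeee fffffffffffffffffffffff". */
    static String layout(float v) {
        String e = String.format("%8s", Integer.toBinaryString(storedExponent(v))).replace(' ', '0');
        String f = String.format("%23s", Integer.toBinaryString(fraction(v))).replace(' ', '0');
        return sign(v) + " " + e + " " + f;
    }

    public static void main(String[] args) {
        System.out.println(layout(3.5f));  // 0 10000000 11000000000000000000000
    }
}
```

For 3.5 the stored exponent is 128 (1 + 127) and the stored fraction begins 11, since the leading 1 of 1.11 is hidden.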
4.3: IEEE-754 Examples
1.25 = 1.01_2 (already normalised)
e = 0 + 127 = 127 = 01111111_2
f = 1.01_2 or .01_2 after removing the hidden bit
0 01111111 01000000000000000000000
0011 1111 1010 0000 0000 0000 0000 0000 = 0x3FA00000
10.75 = 1010.11_2 × 2^0 = 1.01011_2 × 2^3
e = 3 + 127 = 130 = 10000010_2
f = 1.01011_2 or .01011_2 after removing the hidden bit
0 10000010 01011000000000000000000
0100 0001 0010 1100 0000 0000 0000 0000 = 0x412C0000
-0.125 = -0.001_2 × 2^0 = -1.0_2 × 2^-3
s = 1 (negative)
e = -3 + 127 = 124 = 01111100_2
f = 1.0_2 or .0_2 after removing the hidden bit
1 01111100 00000000000000000000000
1011 1110 0000 0000 0000 0000 0000 0000 = 0xBE000000
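The encoding steps used in these examples (normalise, add the bias, drop the hidden bit) can be sketched by hand in Java and compared against the built-in Float.floatToIntBits. This is a simplified illustration with assumed names (FloatEncode, encode); it only handles ordinary normalised values and rejects zero, very small (denormalised) values, infinity and NaN.

```java
public class FloatEncode {
    /** Hand-rolled single-precision encoding of a normalised, finite, non-zero value. */
    static int encode(float value) {
        if (Float.isNaN(value) || Float.isInfinite(value) || Math.abs(value) < Float.MIN_NORMAL) {
            throw new IllegalArgumentException("only normalised finite values are handled");
        }
        int sign = value < 0 ? 1 : 0;
        float mag = Math.abs(value);
        int exp = 0;
        // Normalise: bring the magnitude into the range 1 <= mag < 2
        while (mag >= 2f) { mag /= 2f; exp++; }
        while (mag < 1f)  { mag *= 2f; exp--; }
        int stored = exp + 127;                        // add the bias
        int frac = (int) ((mag - 1f) * (1 << 23));     // drop the hidden bit, keep 23 fraction bits
        return (sign << 31) | (stored << 23) | frac;
    }

    public static void main(String[] args) {
        float[] samples = { 1.25f, 10.75f, -0.125f };
        for (float v : samples) {
            System.out.println(v + " -> " + String.format("%08X", encode(v))
                    + " (library: " + String.format("%08X", Float.floatToIntBits(v)) + ")");
        }
    }
}
```

Because the division and multiplication by 2 are exact for these values, the hand-rolled result agrees with the library for each of the three worked examples.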
decode 0x414a0000
0 10000010 10010100000000000000000
s = 0 (positive)
e = 130 (2^(130-127) = 2^3)
f = 1.100101 (after adding the hidden bit)
+1.100101 × 2^3 = +1100.101 = +12.625
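Decoding can be sketched the same way: split the 32-bit pattern into s, e and f, restore the hidden bit, and apply (-1)^s × 1.f × 2^(e-127). The class and method names below (FloatDecode, decode) are hypothetical, and the sketch assumes a normalised value (stored exponent between 1 and 254); the standard Float.intBitsToFloat is used only as a cross-check.

```java
public class FloatDecode {
    /** Decodes a single-precision bit pattern, assuming a normalised value. */
    static float decode(int bits) {
        int s = bits >>> 31;              // sign bit
        int e = (bits >>> 23) & 0xFF;     // stored (biased) exponent
        int f = bits & 0x7FFFFF;          // stored fraction, hidden bit not included
        double significand = 1.0 + f / (double) (1 << 23);   // restore the hidden bit: 1.f
        double magnitude = significand * Math.pow(2, e - 127);
        return (float) (s == 0 ? magnitude : -magnitude);
    }

    public static void main(String[] args) {
        int bits = 0x414a0000;
        System.out.println("decoded: " + decode(bits));                   // 12.625
        System.out.println("library: " + Float.intBitsToFloat(bits));     // 12.625
    }
}
```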
Special Values
Special bit patterns, e.g.
Zero (±): s = 0 or 1, e = 00000000, f = 00000000000000000000000
Infinity (±): s = 0 or 1, e = 11111111, f = 00000000000000000000000
Not a Number (NaN): e = 11111111, f = ??????????????????????? ( != 0 )
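These special patterns can be recognised by checking the exponent and fraction fields. The sketch below is illustrative only: SpecialValues and classify are made-up names, and every pattern that is not one of the three special cases is simply labelled "other".

```java
public class SpecialValues {
    /** Classifies a single-precision bit pattern by its exponent and fraction fields. */
    static String classify(int bits) {
        int e = (bits >>> 23) & 0xFF;
        int f = bits & 0x7FFFFF;
        if (e == 0 && f == 0) return "zero";         // sign bit selects +0 or -0
        if (e == 0xFF && f == 0) return "infinity";  // sign bit selects +inf or -inf
        if (e == 0xFF) return "NaN";                 // all-ones exponent, non-zero fraction
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify(Float.floatToIntBits(-0.0f)));
        System.out.println(classify(Float.floatToIntBits(Float.NEGATIVE_INFINITY)));
        System.out.println(classify(Float.floatToIntBits(Float.NaN)));
        // Both zeros compare equal, while NaN is not equal even to itself
        System.out.println(0.0f == -0.0f);
        System.out.println(Float.NaN == Float.NaN);
    }
}
```

The last two lines show that the special values also behave specially in comparisons: the two zeros are equal, and NaN compares unequal to everything, including itself.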
Single and Double Precision
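32-Bit Single Precision (bias = 127)
sign (s): bit 31 | exponent (e): bits 30–23 (8 bits) | fraction (f): bits 22–0 (23 bits)
64-Bit Double Precision (bias = 1023)
sign (s): bit 63 | exponent (e): bits 62–52 (11 bits) | fraction (f): bits 51–0 (52 bits)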
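The double-precision layout can be inspected in the same way as single precision, using 64-bit shifts and masks. In this sketch the class DoubleFields and its methods are hypothetical names; Double.doubleToLongBits is the standard library call.

```java
public class DoubleFields {
    /** Sign bit s (bit 63). */
    static long sign(double v) {
        return Double.doubleToLongBits(v) >>> 63;
    }

    /** Stored exponent e (bits 62..52), biased by 1023. */
    static long storedExponent(double v) {
        return (Double.doubleToLongBits(v) >>> 52) & 0x7FF;
    }

    /** Stored fraction f (bits 51..0), hidden bit not included. */
    static long fraction(double v) {
        return Double.doubleToLongBits(v) & 0xFFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        double v = 1.25;
        System.out.println("s = " + sign(v));
        System.out.println("e = " + storedExponent(v) + " (unbiased " + (storedExponent(v) - 1023) + ")");
        System.out.println("f = 0x" + Long.toHexString(fraction(v)));
    }
}
```

For 1.25 the stored exponent is 1023 rather than 127, because the unbiased exponent 0 is offset by the larger double-precision bias.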
4.4: Floating Point addition
Floating Point Addition
We can add the fractions of two floating point values if their exponents are the same
If their exponents are not the same to begin with, shift the fraction of the value with the smaller exponent to compensate
1.01101 × 2^3 + 1.00110 × 2^-2
= 1.01101 × 2^3 + 0.0000100110 × 2^3
= 1.0111000110 × 2^3
  1.0110100000
+ 0.0000100110
= 1.0111000110
1. Compare the exponents of the two numbers; shift the fraction of the smaller number to the right until its exponent matches the larger exponent
2. Add the fractions
3. Normalise the result by either shifting right and incrementing the exponent or shifting left and decrementing the exponent
   Overflow / Underflow? If so, raise an exception
4. Round the fraction to the appropriate number of bits
   Still normalised? If not, repeat from step 3
1.5 + 1.75 = 3.25
A = 0x3fc00000 (1.5)    B = 0x3fe00000 (1.75)
A: 0 01111111 10000000000000000000000 (s e f)
B: 0 01111111 11000000000000000000000 (s e f)
A: e = 01111111 (127 - 127 = 0, i.e. 2^0)
   f = 1.100000 (remember hidden bit!)
B: e = 01111111 (127 - 127 = 0, i.e. 2^0)
   f = 1.110000 (remember hidden bit!)
   1.100000 × 2^0   A
+  1.110000 × 2^0   B
= 11.010000 × 2^0   Result (not normalised)
=  1.1010000 × 2^1  Result (normalised)
0 10000000 10100000000000000000000 (encoding s e f) = 0x40500000 (3.25)
1.5 + 10.5 = 12.0
A = 0x3fc00000 (1.5)    B = 0x41280000 (10.5)
A: 0 01111111 10000000000000000000000 (s e f)
B: 0 10000010 01010000000000000000000 (s e f)
A: e = 01111111 (127 - 127 = 0, i.e. 2^0)
   f = 1.1000000 (remember hidden bit!)
B: e = 10000010 (130 - 127 = 3, i.e. 2^3)
   f = 1.0101000 (remember hidden bit!)
  0.0011000 × 2^3   A (adjust fraction so exponents are equal)
+ 1.0101000 × 2^3   B
= 1.1000000 × 2^3   Result (already normalised)
0 10000010 10000000000000000000000 (encoding s e f) = 0x41400000 (12.0)
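The addition procedure, as applied in the two worked examples, can be sketched directly on the bit patterns. This is a simplified illustration under stated assumptions: both inputs are positive normalised values, rounding is replaced by plain truncation of shifted-out bits, and the names FloatAddSketch and addPositive are invented for the example.

```java
public class FloatAddSketch {
    /** Adds two positive, normalised single-precision bit patterns (truncating, not rounding). */
    static int addPositive(int a, int b) {
        int expA = (a >>> 23) & 0xFF;
        int expB = (b >>> 23) & 0xFF;
        int sigA = (a & 0x7FFFFF) | 0x800000;   // restore the hidden bit: 1.f as a 24-bit integer
        int sigB = (b & 0x7FFFFF) | 0x800000;

        // Align: shift the value with the smaller exponent to the right
        int exp = Math.max(expA, expB);
        sigA >>>= Math.min(exp - expA, 31);
        sigB >>>= Math.min(exp - expB, 31);

        // Add the fractions
        int sum = sigA + sigB;

        // Normalise: a carry out of bit 23 means the result is 1x.xxx, so shift right once
        if (sum >= (1 << 24)) {
            sum >>>= 1;
            exp++;
        }
        if (exp >= 255) {
            throw new ArithmeticException("overflow");
        }

        // Re-encode: sign 0, stored exponent, fraction without the hidden bit
        return (exp << 23) | (sum & 0x7FFFFF);
    }

    public static void main(String[] args) {
        System.out.println(String.format("%08X", addPositive(0x3fc00000, 0x3fe00000)));  // 1.5 + 1.75
        System.out.println(String.format("%08X", addPositive(0x3fc00000, 0x41280000)));  // 1.5 + 10.5
    }
}
```

With the two example pairs this produces 40500000 and 41400000, the encodings of 3.25 and 12.0. Because bits shifted out during alignment are discarded rather than rounded, results for inexact sums may differ from real hardware in the last bit.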
Floating Point Addition
What about adding negative values (S==1)?
Proceed as before, but before adding, convert the fraction of each value with S==1 to its 2’s complement
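A rough sketch of signed addition follows. It relies on the fact that Java int values are already stored in 2's complement, so negating the aligned fraction of a value with S==1 performs the conversion. The names SignedFloatAdd and add are hypothetical, inputs are assumed to be normalised, and shifted-out bits are truncated instead of rounded.

```java
public class SignedFloatAdd {
    /** Adds two normalised single-precision bit patterns of either sign (truncating, not rounding). */
    static int add(int a, int b) {
        int expA = (a >>> 23) & 0xFF;
        int expB = (b >>> 23) & 0xFF;
        int sigA = (a & 0x7FFFFF) | 0x800000;   // restore hidden bit
        int sigB = (b & 0x7FFFFF) | 0x800000;

        // Align both fractions to the larger exponent
        int exp = Math.max(expA, expB);
        sigA >>>= Math.min(exp - expA, 31);
        sigB >>>= Math.min(exp - expB, 31);

        // Values with S == 1 take part in the addition as 2's complement negatives
        if ((a >>> 31) == 1) sigA = -sigA;
        if ((b >>> 31) == 1) sigB = -sigB;
        int sum = sigA + sigB;
        if (sum == 0) {
            return 0;                            // exact cancellation gives +0.0
        }

        // Convert the result back to sign and magnitude
        int sign = sum < 0 ? 1 : 0;
        int mag = Math.abs(sum);

        // Normalise: shift right after a carry, shift left after cancellation
        if (mag >= (1 << 24)) { mag >>>= 1; exp++; }
        while (mag < (1 << 23)) { mag <<= 1; exp--; }
        if (exp <= 0 || exp >= 255) {
            throw new ArithmeticException("overflow or underflow");
        }
        return (sign << 31) | (exp << 23) | (mag & 0x7FFFFF);
    }

    public static void main(String[] args) {
        // 1.5 + (-1.75) = -0.25
        System.out.println(Float.intBitsToFloat(add(0x3fc00000, 0xbfe00000)));
        // 10.5 + (-1.5) = 9.0
        System.out.println(Float.intBitsToFloat(add(0x41280000, 0xbfc00000)));
    }
}
```

Subtraction can leave leading zeros in the fraction, which is why the normalisation step here may need to shift left several times, unlike the addition of two positive values.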