Floating Point I CSE 351 Autumn 2016
Floating Point I
http://xkcd.com/571/
CS295
L12: Floating Point I
Number Representation Revisited
What can we represent in one word?
Signed and Unsigned Integers
Characters (ASCII)
Addresses
How do we encode the following:
Real numbers (e.g. 3.14159)
Very large numbers (e.g. 6.02×1023)
Very small numbers (e.g. 6.626×10-34)
Special numbers (e.g. ∞, NaN)
2
Floating
Point
CS295
L12: Floating Point I
Number Representation Really Matters
1991: Patriot missile targeting error
clock skew due to conversion from integer to floating point
1996: Ariane 5 rocket exploded ($1 billion)
overflow converting 64-bit floating point to 16-bit integer
2000: Y2K problem
limited (decimal) representation: overflow, wrap-around
2038: Unix epoch rollover
Unix epoch = seconds since 12am, January 1, 1970
signed 32-bit integer representation rolls over to TMin in 2038
Other related bugs:
1982: Vancouver Stock Exchange 10% error in less than 2 years
1994: Intel Pentium FDIV (floating point division) HW bug ($475 million)
1997: USS Yorktown “smart” warship stranded: divide by zero
1998: Mars Climate Orbiter crashed: unit mismatch ($193 million)
3
CS295
L12: Floating Point I
This is the part where I try to scare you…
Floating Point Topics
Fractional binary numbers
IEEE floating-point standard
Floating-point operations and rounding
Floating-point in C
There are many more details that we won’t cover
It’s a 58-page standard…
4
CS295
L12: Floating Point I
Start by talking about representing fractional numbers in binary in general
Then about the IEEE Standard for floating point representation, operations,
Will not cover:
All the details of the 58-page spec
Or expect you to perform operations
Want to show you how floating point values are represented
To give you a sense of the kinds of things that can go wrong with them
Representation of Fractions
“Binary Point,” like decimal point, signifies boundary between integer and fractional parts:
Example 6-bit
representation:
Example: 10.10102 = 1×21 + 1×2-1 + 1×2-3 = 2.62510
5
xx.yyyy
21
20
2-1
2-2
2-3
2-4
CS295
L12: Floating Point I
Representation of Fractions
“Binary Point,” like decimal point, signifies boundary between integer and fractional parts:
Example 6-bit
representation:
In this 6-bit representation:
What is the encoding and value of
the smallest (most negative) number?
What is the encoding and value of
the largest (most positive) number?
What is the smallest number greater
than 2 that we can represent?
6
xx.yyyy
21
20
2-1
2-2
2-3
2-4
CS295
L12: Floating Point I
Scientific Notation (Decimal)
Normalized form: exactly one digit (non-zero) to left of decimal point
Alternatives to representing 1/1,000,000,000
Normalized: 1.0×10-9
Not normalized: 0.1×10-8,10.0×10-10
7
6.0210 × 1023
radix (base)
decimal point
exponent
mantissa
CS295
L12: Floating Point I
Scientific Notation (Binary)
Computer arithmetic that supports this called floating point due to the “floating” of the binary point
Declare such variable in C as float (or double)
8
1.012 × 2-1
radix (base)
binary point
exponent
mantissa
CS295
L12: Floating Point I
9
Translating To and From Scientific Notation
Consider the number 1.011two×24
To convert to ordinary number, shift the decimal to the right by 4
– Result: 10110two = 22ten
For negative exponents, shift decimal to the left
– 1.011two×2-2 => 0.01011two = 0.34375ten
Go from ordinary number to scientific notation by shifting until in normalized form
– 1101.001two => 1.101001two×23
CS295
L12: Floating Point I
“Father” of Floating Point Standard
IEEE Standard 754 for Binary Floating- Point Arithmetic
www.cs.berkeley.edu/~wkahan/ieee754status/754story.html
Prof. Kahan
Prof. Emeritus UC Berkeley
1989
ACM Turing Award Winner!
CS295
L12: Floating Point I
Scientific Notation Translation
Convert from scientific notation to binary point
Perform the multiplication by shifting the decimal until the exponent disappears
Example: 1.011224 = 101102 = 2210
Example: 1.01122-2 = 0.010112 = 0.3437510
Convert from binary point to normalized scientific notation
Distribute out exponents until binary point is to the right of a single digit
Example: 1101.0012 = 1.1010012×23
Practice: Convert 11.37510 to binary scientific notation
11
CS295
L12: Floating Point I
Fraction like 1/3 needs an infinite number of digits to be represented, even in base 10!
Floating Point Topics
Fractional binary numbers
IEEE floating-point standard
Floating-point operations and rounding
Floating-point in C
There are many more details that we won’t cover
It’s a 58-page standard…
12
CS295
L12: Floating Point I
Start by talking about representing fractional numbers in binary in general
Then about the IEEE Standard for floating point representation, operations,
Will not cover:
All the details of the 58-page spec
Or expect you to perform operations
Want to show you how floating point values are represented
To give you a sense of the kinds of things that can go wrong with them
IEEE Floating Point
IEEE 754
Established in 1985 as uniform standard for floating point arithmetic
Main idea: make numerically sensitive programs portable
Specifies two things: representation and result of floating operations
Now supported by all major CPUs
Driven by numerical concerns
Scientists/numerical analysts want them to be as real as possible
Engineers want them to be easy to implement and fast
In the end:
Scientists mostly won out
Nice standards for rounding, overflow, underflow, but…
Hard to make fast in hardware
Float operations can be an order of magnitude slower than integer ops
13
CS295
L12: Floating Point I
Back in the 80s, it was like the wild west out there
Institute of Electrical and Electronics Engineers
Committee had to balance concerns
Scientists wanted them to be as real as possible
Engineers wanted it to be as easy to implement in HW as possible
Floating point can be order of magnitude slower than integer
When we get super desperate, turn to fixed point
But computers are pretty fast, right?
Floating Point Encoding
Use normalized, base 2 scientific notation:
Value: ±1 × Mantissa × 2Exponent
Bit Fields: (-1)S × 1.M × 2(E–bias)
Representation Scheme:
Sign bit (0 is positive, 1 is negative)
Mantissa (a.k.a. significand) is the fractional part of the number in normalized form and encoded in bit vector M
Exponent weights the value by a (possibly negative) power of 2 and encoded in the bit vector E
14
S
E
M
31 30
23 22
0
1 bit
8 bits
23 bits
CS295
L12: Floating Point I
Why use biased notation for the exponent?
–Remember that we want floating point numbers
to look small when their actual value is small
We don’t like how in 2’s complement, -1 looks bigger than 0. Bias notation preserves the linearity of value
The Exponent Field
S Exponent Significand
31 30 23 22 0
1 bit 8 bits 23 bits
15
CS295
L12: Floating Point I
The Exponent Field
Use biased notation
Read exponent as unsigned, but with bias of 2w-1-1 = 127
Representable exponents roughly ½ positive and ½ negative
Exponent 0 (Exp = 0) is represented as E = 0b 0111 1111
Why biased?
Makes floating point arithmetic easier
Makes somewhat compatible with two’s complement
Practice: To encode in biased notation, add the bias then encode in unsigned:
Exp = 1 E = 0b
Exp = 127 E = 0b
Exp = -63 E = 0b
16
CS295
L12: Floating Point I
E = 1 128 0b 1000 0000
E = 127 254 0b 1111 1110
E = -63 64 0b 0100 0000
The Mantissa (Fraction) Field
Note the implicit 1 in front of the M bit vector
Example: 0b 0011 1111 1100 0000 0000 0000 0000 0000
is read as 1.12 = 1.510, not 0.12 = 0.510
Gives us an extra bit of precision
Mantissa “limits”
Low values near M = 0b0…0 are close to 2Exp
High values near M = 0b1…1 are close to 2Exp+1
17
(-1)S x (1 . M) x 2(E–bias)
S
E
M
31 30
23 22
0
1 bit
8 bits
23 bits
CS295
L12: Floating Point I
Peer Instruction Question
What is the correct value encoded by the following floating point number?
0b 0 10000000 11000000000000000000000
+ 0.75
+ 1.5
+ 2.75
+ 3.5
We’re lost…
18
CS295
L12: Floating Point I
Precision and Accuracy
Precision is a count of the number of bits in a computer word used to represent a value
Capacity for accuracy
Accuracy is a measure of the difference between the actual value of a number and its computer representation
High precision permits high accuracy but doesn’t guarantee it. It is possible to have high precision but low accuracy.
Example: float pi = 3.14;
pi will be represented using all 24 bits of the mantissa (highly precise), but is only an approximation (not accurate)
19
CS295
L12: Floating Point I
Need Greater Precision?
Double Precision (vs. Single Precision) in 64 bits
C variable declared as double
Exponent bias is now 210–1 = 1023
Advantages: greater precision (larger mantissa),
greater range (larger exponent)
Disadvantages: more bits used,
slower to manipulate
20
S
E (11)
M (20 of 52)
63 62
52 51
32
M (32 of 52)
31
0
CS295
L12: Floating Point I
21
Floating Point Numbers Summary
Exponent Significand Meaning
0 ? ?
0 ? ?
1-254 anything ± fl. pt
255 ? ?
255 ? ?
CS295
L12: Floating Point I
Representing Zero
0 00000000 00000000000000000000000
22
1 00000000 00000000000000000000000
0
sign exponent significand
31 30 23 22
sign exponent significand
But wait… what happened to zero?
–Using standard encoding 0x00000000 is 1.0×2-127≠0
All because of that dang implicit 1
–Special case: Exp and Significand all zeros = 0
–Two zeros! But at least 0x00000000 = 0 like integers
31 30 23 22 0
+0
-0
CS295
L12: Floating Point I
23
Floating Point Numbers Summary
Exponent Significand Meaning
0 0 ± 0
0 ? ?
1-254 anything ± fl. pt
255 ? ?
255 ? ?
CS295
L12: Floating Point I
Representing ± ∞
24
Division by zero
infinity is a number!
okay to do further comparison eg. x/0 > y
Representation
Max exponent = 255
all zero significand
0 11111111 00000000000000000000000
31 30 23 22
0
significand
1 11111111 00000000000000000000000
sign exponent
31 30
23 22
0
sign exponent
significand
+ ∞
– ∞
CS295
L12: Floating Point I
Floating Point Numbers Summary
25
Exponent Significand Meaning
0 0 ± 0
0 non-zero ?
1-254 anything ± fl. pt
255 0 ± ∞
255 non-zero ?
CS295
L12: Floating Point I
Representing NaN
26
0/0, sqrt(-4), ∞–∞ ?
Useful for debugging
Op(NaN, some number) = NaN
Representation
Max exponent = 255
non-zero significand
0 11111111 Non-zero
31 30 23 22
0
significand
1 11111111 Non-zero
sign exponent
31 30
23 22
0
sign exponent
significand
+ NaN
– NaN
CS295
L12: Floating Point I
Floating Point Numbers Summary
27
Exponent Significand Meaning
0 0 ± 0
0 non-zero ?
1-254 anything ± Norm fl. pt
255 0 ± ∞
255 non-zero NaN
CS295
L12: Floating Point I
Representing Very Small Numbers
But wait… what happened to zero?
Using standard encoding 0x00000000 =
Special case: E and M all zeros = 0
Two zeros! But at least 0x00000000 = 0 like integers
New numbers closest to 0:
a = 1.0…02×2-126 = 2-126
b = 1.0…012×2-126 = 2-126 + 2-149
Normalization and implicit 1 are to blame
Special case: E = 0, M ≠ 0 are denormalized numbers
28
0
+∞
-∞
Gaps!
a
b
CS295
L12: Floating Point I
Denorm Numbers
29
Short for “denormalized numbers”
–No leading 1
–Careful! Implicit exponent = -126 when Exp = 0x00 (intuitive reason: the “binary point” moves one
more bit to the left of the leading bit)
Now what do the gaps look like?
–Smallest denorm: ± 0.0…01two×2-126 = ± 2-149
–Largest denorm: ± 0.1…1two×2-126 = ± (2-126 – 2-149)
–Smallest norm: ± 1.0…0two×2-126 = ± 2-126
No uneven gap! Increments by 2-149
So much closer to 0
CS295
L12: Floating Point I
Floating Point Numbers Summary
30
Exponent Significand Meaning
0 0 ± 0
0 non-zero ± Denorm fl pt.
1-254 anything ± Norm fl. pt
255 0 ± ∞
255 non-zero NaN
CS295
L12: Floating Point I
Converting From Hex and Decimal
Convert 0x40600000 to decimal
1 bit for sign, 8 bits for exponent, 23 bits for significand, bias of -127
CS295
L12: Floating Point I
Step 1: Convert to Binary
0x40600000 = 0100 0000 0110 0000 0000 0000 0000 0000
CS295
L12: Floating Point I
Step 2: Split Bits Up
0100 0000 0110 0000 0000 0000 0000 0000
Sign Exponent Significand
0 100 0000 0 110 0000 0000 0000 0000 0000
1 bit 8 bits 23 bits
CS295
L12: Floating Point I
0100 0000 0110 0000 0000 0000 0000 0000
Sign Exponent Significand
0 100 0000 0 110 0000 0000 0000 0000 0000
1 bit 8 bits 23 bits Exponent is not 00000000, so normalized!
Step 3: Check If Norm/Denorm
CS295
L12: Floating Point I
Step 4: Evaluate
Plug into normalized formula
CS295
L12: Floating Point I
Step 4: Evaluate
Plug into normalized formula
Ignore trailing 0’s
0 10000000 11000000000000000000000
Sign = 0, Exp = 128, Bias = 127, 1.significand = 1.11
NOTE: In the context of this formula, Bias = 127
CS295
L12: Floating Point I
Step 4: Evaluate
Plug into normalized formula
0 10000000 11000000000000000000000
Sign = 0, Exp = 128, Bias = 127, 1.significand = 1.11
Ignore trailing 0’s
(-1)0 ∗ 2128 – 127 ∗ 1.112 = 2 ∗1.112
CS295
L12: Floating Point I
Step 4: Evaluate
Plug into normalized formula
0 10000000 11000000000000000000000
Sign = 0, Exp = 128, Bias = -127, 1.significand = 1.11
Ignore trailing 0’s
(-1)0 ∗ 2128 – 127 ∗ 1.112 = 21 ∗1.112
= 11.12
= 21 + 20 + 2-1
exponent is 1 = shifting decimal right by 1
= 3.5
CS295
L12: Floating Point I
Converting From Decimal to Binary
Convert -5.625 to binary
1 bit for sign, 8 bits for exponent, 23 bits for significand, bias of -127
CS295
L12: Floating Point I
Step 1: Convert Left Side of Decimal
-5.625
Ignore sign for now (just make sign bit a 1 at the end)
5 = 22 + 20
= 1012
CS295
L12: Floating Point I
Step 2: Convert Right Side of Decimal
.625 = .5 + .125
= 2-1 + 2-3
= .1012
CS295
L12: Floating Point I
Step 3: Combine Both Results and Normalize
5.625 = 5 + .625
= 1012 + .1012
= 101.1012
101.1012 = 1.011012 ∗ 22
Decimal moved 2 places to the left
CS295
L12: Floating Point I
Step 4: Convert to Binary
-1.011012 ∗ 22
sign exponent
1 10000001
(negative) (2 + 127 for bias)
significand
01101
(ignore implicit 1)
Combine to get: -5.625 = 11000000101101000000000000000000
CS295
L12: Floating Point I