CSE 2421, Autumn 2013
IEEE 754 Floating Point
Required Reading: Computer Systems: A Programmer’s Perspective, 3rd Edition
Chapter 2, Sections 2.4 through 2.4.4
IEEE floating point
IEEE Standard 754 floating point is the most common representation today for real numbers on computers, including Intel-based PCs, Macintoshes, and most Unix platforms
Floating point encodes rational numbers/values of the form
V = x × 2^y
Useful for computations involving:
Very large numbers: |V| >> 0
Numbers very close to 0: |V| << 1
More generally, as an approximation to real numbers
Single precision and double precision:
Single precision: 32 bits
Double precision: 64 bits
Floating point
Limited range and precision (finite space):
Range (for given precision):
Single: +/- 2^-149 to +/- 2^127
Compare with int: -2^31 to 2^31 - 1
Double: +/- 2^-1074 to +/- 2^1023
Compare with long int: -2^63 to 2^63 - 1
Overflow means that values have grown too large for the representation, much in the same way that you can have integer overflow.
Underflow (a value too small to be represented) is a less serious problem because it just denotes a loss of precision; the value is guaranteed to be closely approximated by zero.
Floating Point
“Real numbers” having a decimal portion that is not necessarily equal to 0
Example: 123.14 base 10
1×10^2 + 2×10^1 + 3×10^0 + 1×10^-1 + 4×10^-2
Digit format: d_m d_(m-1) … d_1 d_0 . d_(-1) d_(-2) … d_(-n)
Value: d = summation over i = -n to m of (d_i × 10^i)
Example: 110.11 base 2
1×2^2 + 1×2^1 + 0×2^0 + 1×2^-1 + 1×2^-2
Digit format: b_m b_(m-1) … b_1 b_0 . b_(-1) b_(-2) … b_(-n)
Value: b = summation over i = -n to m of (b_i × 2^i)
In both cases, digits on the left of the “point” are weighted by a positive power and those on the right are weighted by a negative power
Reminder: raising to a negative power is 1/(the base raised to the positive power), i.e., b^-k = 1/b^k
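As a quick check of the summation formula, here is a minimal C sketch (digits and exponents hard-coded for the 110.11 example above) that weights each digit by the appropriate power of 2:

#include <stdio.h>

/* Evaluate 110.11 (base 2) by summing b_i * 2^i, as in the formula above. */
int main(void) {
    int digits[] = { 1, 1, 0, 1, 1 };      /* 1 1 0 . 1 1 */
    int exps[]   = { 2, 1, 0, -1, -2 };    /* weights 2^2 ... 2^-2 */
    double value = 0.0;
    for (int i = 0; i < 5; i++) {
        /* negative power = 1 / (base raised to the positive power) */
        double weight = (exps[i] >= 0) ? (double)(1 << exps[i])
                                       : 1.0 / (double)(1 << -exps[i]);
        value += digits[i] * weight;
    }
    printf("%f\n", value);                 /* prints 6.750000 */
    return 0;
}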
Floating Point
Shifting* the binary point one position left:
Divides the number by 2
Compare 101.11 (5 3/4) base 2 with 10.111 (2 7/8) base 2
Shifting the binary point one position right:
Multiplies the number by 2
Compare 101.11 (5 3/4) base 2 with 1011.1 (11 1/2) base 2
Numbers such as 0.111…11 base 2 represent numbers just below 1: 0.111111 base 2 = 63/64 (decimal 0.984375), and 0.11111111 base 2 = 255/256 (decimal 0.99609375)
*NOTE: the binary point shifting discussed above is completely different from shifting bits
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
Check: 101.11 = 5 3/4 = 23/4 = 46/8; shifted one place left, 10.111 = 23/8 = 2 7/8 (half as large).
Check: 101.11 = 5 3/4 = 23/4; shifted one place right, 1011.1 = 46/4 = 23/2 = 11 1/2 (twice as large).
Some limitations
Only finite-length encodings can be stored
1/3 and 5/7 cannot be represented exactly (in general, only rational numbers whose denominators are powers of 2, such as 1/2 or 3/16, have a terminating, or finite-length, binary encoding)
Fractional binary notation can only represent numbers that can be written x × 2^y (for integers x and y)
Example: 63/64 = 63 × 2^-6; 255/256 = 255 × 2^-8
Otherwise, value can only be approximated
Get increasing accuracy by lengthening the binary representation, but we still have finite space, so precision will always be limited
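A short C illustration of this limited precision: 0.1 and 0.2 have non-terminating binary expansions, so their stored values are approximations and the sum is not exactly 0.3.

#include <stdio.h>

int main(void) {
    double a = 0.1, b = 0.2;                   /* neither is exactly representable */
    if (a + b == 0.3)
        printf("exact\n");
    else
        printf("not exact: %.17f\n", a + b);   /* prints 0.30000000000000004 */
    return 0;
}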
Scientific Notation
When we represent a number in scientific notation
Typically only 1 digit to the left of the decimal point
+6.385 × 10^8
-9.66712 × 10^17
Four fields
Sign
Left side of decimal point
Right side of decimal point
Exponent
Hold these thoughts as we move on to IEEE 754
IEEE 754 encoding
The bits in the encoding of the real number value are divided into three fields (details about the sizes of these fields on the next slide):
S field (or sign field): encodes the sign of the number;
E field (or exponent field): encodes the exponent of the real number (positive, 0 or negative), in biased form;
F field (or significand/mantissa field): encodes a number greater than or equal to 0 and less than 2, which is multiplied by 2 raised to the exponent to obtain the real value encoded by the entire IEEE 754 representation (this is x in a representation of the value of the number as |V| = x × 2^y)
Biased Integer Representation
An integer representation that skews the bit pattern so as to look just like unsigned but actually represent negative numbers.
examples: given 4 bits, we BIAS values by 2^3 (8)
TRUE VALUE to be represented 3
add in the bias +8
----
unsigned value 11
so the bit pattern of 3 in 4-bit biased-8 representation will be 1011
Biased Integer Representation
Going the other way, suppose we were given a biased-8 representation of 0110
unsigned 0110 represents 6
subtract out the bias - 8
----
TRUE VALUE represented -2
This representation allows operations on the biased numbers to be the same as for unsigned integers, but actually represents both positive and negative values.
Choosing a bias:
The bias chosen is most often based on the number of bits available for representing the integer. To get an approximately equal distribution of true values above and below 0, the bias should be 2^(n-1) or 2^(n-1) - 1 (IEEE 754 uses 2^(n-1) - 1).
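A minimal C sketch of biased encoding and decoding, using the 4-bit, bias-8 examples above (the function names are just for illustration):

#include <stdio.h>

static unsigned encode_biased(int true_value, int bias) {
    return (unsigned)(true_value + bias);    /* add in the bias */
}

static int decode_biased(unsigned stored, int bias) {
    return (int)stored - bias;               /* subtract out the bias */
}

int main(void) {
    printf("%u\n", encode_biased(3, 8));     /* 11, i.e., bit pattern 1011 */
    printf("%d\n", decode_biased(6, 8));     /* 0110 decodes to 6 - 8 = -2 */
    return 0;
}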
Floating point
The real number decimal value represented is: (-1)^s × f × 2^e
-------------------------------------------------------
| S | E | F |
-------------------------------------------------------
S is one bit representing the sign of the number (it directly encodes s)
0 for positive, 1 for negative
E is a biased unsigned integer (8 bits for single precision, 11 for double; NOTE: not B2T!) representing the exponent e
e = E - bias        E = e + bias
F is an unsigned integer (ε depends on the E field: if E is 0, ε is zero; otherwise, ε is 1)
f = F/2^n + ε        F = (f - ε) × 2^n
For 32 bit, n = 23 and bias = 127
For 64 bit, n = 52 and bias = 1023
                             Sign      Exponent      Fraction      Bias
Single Precision (4 bytes)   1 [31]    8 [30-23]     23 [22-00]    127
Double Precision (8 bytes)   1 [63]    11 [62-52]    52 [51-00]    1023
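Assuming the usual case where C's float is an IEEE 754 single-precision value, this sketch pulls the three fields out of the bit pattern, following the layout in the table above:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void dump_fields(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* view the same 32 bits as an integer */
    uint32_t S = bits >> 31;                 /* bit 31 */
    uint32_t E = (bits >> 23) & 0xFFu;       /* bits 30-23 */
    uint32_t F = bits & 0x7FFFFFu;           /* bits 22-00 */
    printf("%g: S=%u E=%u (e = %d) F=0x%06X\n",
           x, (unsigned)S, (unsigned)E, (int)E - 127, (unsigned)F);
}

int main(void) {
    dump_fields(1.0f);      /* S=0 E=127 (e = 0)  F=0x000000 */
    dump_fields(-0.5f);     /* S=1 E=126 (e = -1) F=0x000000 */
    return 0;
}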
Representing Floating Point Numbers
Think of IEEE 754 as similar to scientific notation.
IEEE 754 standard is used so that there is consistency across computers
This is not the only way to represent floating point numbers; it is just the IEEE standard way of doing it.
https://ryanstutorials.net/binary-tutorial/binary-floating-point.php
Floating point example
Represent the decimal number 64.2 in IEEE standard single precision floating point format.
5 step process to obtain Sign(S), Exponent(E), and Mantissa(F)
Floating point example
Step 1:
Get a binary representation for 64.2, with at least the number of bits for the F field + 1 (24 for single precision, or 53 for double precision). To do this, get unsigned binary representations for the stuff to the left and right of the decimal point separately.
64 is 100 0000
.2 can be computed using the algorithm:
Whole number part
.2 x 2 = 0.4 0
.4 x 2 = 0.8 0
.8 x 2 = 1.6 1
.6 x 2 = 1.2 1
.2 x 2 = 0.4 0 now this whole pattern (0011) repeats.*
.4 x 2 = 0.8 0
.8 x 2 = 1.6 1
.6 x 2 = 1.2 1
Write the whole number parts from top to bottom, left to right, after the binary point:
So, a binary representation for .2 is .0011 0011 0011…
Putting the halves back together again, and writing 24 bits:
64.2 is 1000000.00110011001100110
*NOTE: The pattern doesn’t always repeat. Consider .25 = .0100 0000 0000 000
Floating point example
Step 2:
Normalize the binary representation (i.e., make it look like scientific notation) with a single 1 to the left of the binary point (for now – more on this later):
100 0000.0011 0011 0011 0011 0
normalizes to
1.0000 0000 1100 1100 1100 110 × 2^6
NOTE: the bit to the left of the binary point will be considered the “hidden bit” (it does not appear in the encoding, but is implied by the value of the exponent)
The “Hidden Bit”
There are 23 bits for the mantissa in a single precision IEEE 754 representation.
It turns out that, if floating point numbers are always stored in a normalized form, then the leading bit (the one on the left of the binary point) is always a 1. So, why store it at all?
We can put it back into the number (giving 24 bits of precision for the mantissa) for any calculation, but we only have to store 23 bits.
Floating point example
Step 3:
6 is the true exponent (e). For the standard form, to get E, it needs to be in 8-bit, biased-127 representation (add 127 to the true exponent to get the biased-127 representation).
6
+ 127
-----
133
Convert this decimal value to binary:
133 in 8-bit, unsigned representation is 1000 0101
This is the bit pattern used for E in the standard form.
Floating point example
Step 4:
The mantissa/significand stored (F) is the bits to the right of the binary point in the normalized form (1.0000 0000 1100 1100 1100 110 × 2^6)
We need 23 bits of it for single precision.
0000 0000 1100 1100 1100 110
Floating point example
Step 5: Put it all together (and include the correct sign bit: the original number is positive, so the sign bit is 0):
S E F
0 1000 0101 0000 0000 1100 1100 1100 110
0b 0100 0010 1000 0000 0110 0110 0110 0110
0x 4 2 8 0 6 6 6 6
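This encoding can be checked directly in C (again assuming float is IEEE 754 single precision); the stored bit pattern of 64.2f should match the result above:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float x = 64.2f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);         /* copy the float's bytes into an integer */
    printf("0x%08X\n", (unsigned)bits);     /* expect 0x42806666 */
    return 0;
}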
Putting it all together
So, to sum up:
The sign bit is 0 for positive, 1 for negative.
The exponent's base is two.
The exponent field (E) contains 127 (the bias) plus the true exponent for single-precision, or 1023 plus the true exponent for double precision.
The first bit of the mantissa/significand is implied to be 1 for biased exponents in the range 1 – 254 (but assumed to be 0 for exponent of 0); i.e., the mantissa/significand is interpreted as 1.F, where F is the field of fraction bits.
Floating point example
Given binary 1100 0100 1000 0000 0010 1100 0000 0000
S = 1
E = 1000 1001 = 137 = 127 + 10, so e = 10
F = 000 0000 0010 1100 0000 0000
f = 1.000 0000 0010 1100 0000 0000 × 2^10
Move binary point right 10 positions:
100 0000 0001.0110 0000 0000
Convert to decimal (don’t forget to check S!)
= -1025.375
Reappearance of Hidden Bit
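Going the other direction in C (assuming IEEE 754 single-precision float), the bit pattern above decodes back to -1025.375:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint32_t bits = 0xC4802C00u;     /* 1100 0100 1000 0000 0010 1100 0000 0000 */
    float x;
    memcpy(&x, &bits, sizeof x);     /* reinterpret the bit pattern as a float */
    printf("%f\n", x);               /* prints -1025.375000 */
    return 0;
}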
IEEE 754 types of real number values
IEEE uses three different encodings to represent real values:
1. Denormalized values: These are extremely small values, very close to 0, including 0. If such a value becomes too small, it can cause underflow (the largest denormalized value is approximately +/- 1.18 × 10^-38).
2. Normalized values: These are values greater than the largest denormalized value, but small enough to fit into the representation (that is, without overflow).
3. Special values: These are values which are either +/- infinity (created by overflow) or NaN (not a number – values which are not real numbers, possibly caused by invalid operations, such as division by 0).
We will primarily look at conversion for normalized values, but also briefly discuss the others.
S, E, and F fields for different types of values
Type of value    S        E                      F                       “hidden” bit
0.0              0 or 1   all 0’s                all 0’s                 0
denormalized     0 or 1   all 0’s                not all 0’s             0
normalized       0 or 1   > 0, but not all 1’s   any bit pattern         1
+/- infinity     0 or 1   all 1’s                all 0’s
NaN*             0 or 1   all 1’s                anything but all 0’s
*Not a Number
Note: like the B2O and B2S integer representations, IEEE 754 has two zero values, +0 and -0.
Denormalized Values
Denormalized values were created for one primary purpose: gradual underflow. They are a way to keep the relative difference between tiny numbers small. If you go straight from the smallest normal number to zero (abrupt underflow), the relative change is infinite. If you go to denormals on underflow, the relative change is still not fully accurate, but at least more reasonable. And that difference shows up in calculations.
To put it a different way: floating-point numbers are not distributed uniformly. There are always the same number of values between successive powers of two: 2^52 (for double precision). So without denormals, you always end up with a gap between 0 and the smallest floating-point number that is 2^52 times the size of the difference between the smallest two numbers. Denormals fill this gap uniformly.
As an example of the effects of abrupt vs. gradual underflow, look at the mathematically equivalent tests x == y and x - y == 0. If x and y are tiny but different and you use abrupt underflow, then if their difference is less than the minimum cutoff value, their difference will be zero, and so the equivalence is violated.
With gradual underflow, the difference between two tiny but different normal numbers gets to be a denormal, which is still not zero. The equivalence is preserved.
So, using denormals on purpose is not advised, because they were designed only as a backup mechanism in exceptional cases.
Reference: http://stackoverflow.com/questions/15140847/denormalized-numbers-ieee-754-floating-point
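A small C sketch of gradual underflow (assuming IEEE single precision; link with -lm on most systems): the difference between the smallest normalized float and the next float above it is a denormal, not zero, so the x == y / x - y == 0 equivalence is preserved.

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    float x = FLT_MIN;                     /* smallest normalized single-precision value */
    float y = nextafterf(x, 2.0f * x);     /* next representable value above x */
    float d = y - x;                       /* a denormal (2^-149), not zero */
    printf("d = %g, subnormal: %s\n", (double)d,
           fpclassify(d) == FP_SUBNORMAL ? "yes" : "no");
    printf("x == y: %d   x - y == 0: %d\n", x == y, (x - y) == 0.0f);
    return 0;
}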
Floating point example
Represent the decimal number 76.875 in IEEE standard single precision floating point format.
This value will be a normalized value, because it is clearly larger than the largest denormalized value (it is not a value less than 1, but very close to 0), and it is neither infinity nor NaN
5 step process to obtain Sign(S), Exponent(E), and Mantissa(F)
Floating point example
Step 1:
Get a binary representation for 76.875, with at least the number of bits for the F field + 1 (24 for single precision, or 53 for double precision). To do this, get unsigned binary representations for the stuff to the left and right of the decimal point separately.
76 is 100 1100; repeatedly divide by 2 and collect the remainders:
76 / 2 = 38  remainder 0
38 / 2 = 19  remainder 0
19 / 2 =  9  remainder 1
 9 / 2 =  4  remainder 1
 4 / 2 =  2  remainder 0
 2 / 2 =  1  remainder 0
 1 / 2 =  0  remainder 1
.875 can be computed using the algorithm (whole number part of each product):
.875 x 2 = 1.75   1
.75  x 2 = 1.5    1
.5   x 2 = 1.0    1
.0   x 2 = 0.0    0
Write the remainders for 76 from bottom to top, left to right: 100 1100
Write the whole number parts for .875 from top to bottom, left to right: .1110 0000…
Putting the halves back together again, and writing 24 bits:
76.875 is 100 1100.1110 0000 0000 0000 0
Floating point example
Step 2:
Normalize the binary representation (i.e., make it look like scientific notation) with a single 1 to the left of the binary point:
100 1100.1110 0000 0000 0000
normalizes to
1.0011 0011 1000 0000 0000 000 × 2^6
NOTE: the bit to the left of the binary point will be considered the “hidden bit” (it does not appear in the encoding, but is implied by the value of the exponent)
Floating point example
Step 3:
6 is the true exponent (e). For the standard form, to get E, it needs to be in 8-bit, biased-127 representation (add 127 to the true exponent to get the biased-127 representation).
6
+ 127
-----
133
Convert this decimal value to binary:
133 in 8-bit, unsigned representation is 1000 0101
This is the bit pattern used for E in the standard form.
Floating point example
Step 4:
The mantissa/significand stored (F) is the value to the right of the binary point in the normalized form (1.0011 0011 1000 0000 0000 000 × 2^6)
We need 23 bits of it for single precision.
0011 0011 1000 0000 0000 000
Floating point example
Step 5: Put it all together (and include the correct sign bit: the original number is positive, so the sign bit is 0):
S E F
0 1000 0101 0011 0011 1000 0000 0000 000
0b 0100 0010 1001 1001 1100 0000 0000 0000
0x 4 2 9 9 C 0 0 0
Floating point example
Given binary 1100 0110 1000 0110 0010 1100 0110 0000
S = 1
E = 1000 1101 = 141 = 127 + 14, so e = 14
F = 000 0110 0010 1100 0110 0000
f = 1.000 0110 0010 1100 0110 0000 × 2^14
Move binary point right 14 positions
100 0011 0001 0110.0011 0000 0
Convert to decimal (don’t forget to check S!)
= -17,174.1875
Reappearance of Hidden Bit
What are the trade-offs in floating point?
Can’t use integers for everything
Trying to cover a much broader range of real values; but something has to give, and it’s the precision
Pi is a good example:
Whether or not a rational number has a terminating expansion depends on the base.
For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333…).
In base-2 only rational numbers with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational number with a denominator that has a prime factor other than 2 will have an infinite binary expansion.
Range and Precision for Normalized Numbers
Next, we look somewhat more at the range and precision of IEEE single precision values
To begin, let’s see the range of values which can be encoded by the F field of a single precision number
Range of F for Normalized Numbers
Minimum value which can be encoded:
1.0000 0000 0000 0000 0000 000
Clearly, this = 1.0
Range of F for Normalized Numbers (cont)
Maximum value which can be encoded:
1.1111 1111 1111 1111 1111 111
= 1 + ((2^23 - 1) / 2^23)
= 1 + (((1024 × 1024 × 8) - 1) / (1024 × 1024 × 8))
= 1 + 0.99999988
= 1.99999988
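In C (assuming IEEE single precision), the largest representable float below 2.0 is exactly this value, 2 - 2^-23; a quick check using nextafterf (link with -lm):

#include <stdio.h>
#include <math.h>

int main(void) {
    float max_sig = nextafterf(2.0f, 0.0f);   /* largest representable float below 2.0 */
    printf("%.8f\n", max_sig);                /* prints 1.99999988 */
    return 0;
}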
IEEE 754 Single Precision Range
Using these values, we can see the range of values which can be represented in IEEE single precision for various exponents
Please see the following pdf file on Carmen:
IEEE_754_Single_Precision_Range_Table.pdf
Floating Point Precision
Question: It is usually said that floating point values are an approximation to real numbers, but does the floating point representation always allow for a fractional part of the value? (In other words, are there values which can only be encoded in IEEE 754 as integers?)
Intuitively, since we have a limited number of bits for F, we should guess that the answer is that a fractional portion of a value cannot always be encoded.
Let’s look at single precision.
Floating Point Precision
Question: Can we determine the minimum value in IEEE single precision that does not allow for the representation of any fractional portion of the value? In other words, the minimum value which can only represent an integer value?
At what point are all the bits in the F field required to encode the integer portion of the value, leaving no bits for any fractional portion?
Floating Point Precision
Here is the encoding (assume positive):
0 1001 0110 0000 0000 0000 0000 0000 000
E = 127 + e = 127 + 23 = 150
For FP, the value is 1.0000 0000 0000 0000 0000 000 × 2^23 = 2^23 = 8,388,608
0000 0000 1000 0000 0000 0000 0000 0000 (integer in 32 bits)
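A quick check in C (assuming IEEE single precision): at 2^23 the spacing between consecutive single-precision values is exactly 1, so a value such as 8388608.5 cannot keep its fractional part; with the default round-to-nearest mode it is stored as a whole number.

#include <stdio.h>
#include <math.h>

int main(void) {
    float x = 8388608.0f;                          /* 2^23 */
    float next = nextafterf(x, INFINITY);          /* next representable float */
    printf("spacing at 2^23 = %g\n", next - x);    /* prints 1 */
    printf("8388608.5f is stored as %.1f\n", 8388608.5f);   /* prints 8388608.0 */
    return 0;
}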
Floating Point Precision
We have seen that in some cases, the encoding allows for values which are no more precise than integers (i.e., only integer values can be encoded).
Question: Are there cases where IEEE 754 encoded values are even less precise than integers?
Floating Point Precision
Consider:
1.00000000000000000000000 × 2^50 =
and
1.00000000000000000000001 × 2^50 =
We have incremented our value by the minimum amount that we can, right?
Floating Point Precision
Consider:
1.00000000000000000000000 × 2^50 =
1,125,899,906,842,624.0 decimal
and
1.00000000000000000000001 × 2^50 =
1,125,900,041,060,352.0 decimal
That’s a difference of 134,217,728!!
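The same gap can be observed in C with nextafterf (assuming IEEE single precision; link with -lm):

#include <stdio.h>
#include <math.h>

int main(void) {
    float x = ldexpf(1.0f, 50);              /* 1.0 * 2^50 */
    float y = nextafterf(x, INFINITY);       /* the next representable float */
    printf("x   = %.1f\n", x);               /* 1125899906842624.0 */
    printf("y   = %.1f\n", y);               /* 1125900041060352.0 */
    printf("gap = %.1f\n", y - x);           /* 134217728.0, i.e., 2^27 */
    return 0;
}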
Exercise
In class exercise 7:
IEEE 754 Problem
Encode the number 259.375 in IEEE single precision (32 bits)
Solution:
Whole number part:
259 = 256 + 2 + 1 = 1 0000 0011
Fractional part: 0.375
Fractional number part
2 * 0.375 = 0.75 0
2 * 0.75 = 1.50 1
2 * 0.50 = 1.00 1
2 * 0.00 = 0.0 0
So, (0.375)10 = 0.011000…. binary
Put the two parts together, in 24 bits:
1 0000 0011.0110 0000 0000 000
Solution (continued)
Normalize:
1 0000 0011.0110 0000 0000 000 × 2^0 =
1.0000 0011 0110 0000 0000 000 × 2^8
Everything after the binary point is the F field in the IEEE 754 encoding
The true exponent = 8
Solution (continued)
Bias the exponent:
8 + 127 (bias for single precision) = 135
135 in 8 bits: 128 + 4 + 2 + 1
1000 0111
S is 0, because the value is positive, so the encoding is:
S E F
0 1000 0111 0000 0011 0110 0000 0000 000
0b 0100 0011 1000 0001 1011 0000 0000 0000
0x 4 3 8 1 B 0 0 0
Exercise
In class exercise:
IEEE 754 Problem
Given binary 1100 0100 1000 0110 0010 1100 0000 0000
Convert it to decimal
Solution:
S = 1
E = 1000 1001 = 137 = 127 + 10, so e = 10
F = 000 0110 0010 1100 0000 0000
f = 1.000 0110 0010 1100 0000 0000 × 2^10
Move binary point right 10 positions
100 0011 0001.0110 0000 0000 0
Convert to decimal (don’t forget to check S!)
= -(2^10 + 2^5 + 2^4 + 2^0 + 2^-2 + 2^-3) = -(1024 + 32 + 16 + 1 + 0.25 + 0.125)
= -1073.375
In-Class Exercise
1) Encode the number 159.475 in IEEE single precision (32 bits)
2) Given binary IEEE single precision FP number:
1100 0100 1110 0110 0010 1100 0000 0000
Convert it to decimal