18: IEEE 754 Floating Point
Required Reading: Computer Systems: A Programmer’s Perspective, 3rd Edition
• Chapter 2, Sections 2.4 through 2.4.4
• Pearson Tutorial Assignment
IEEE Standard 754 floating point is the most common
representation today for real numbers on computers, including
Intel-based PCs, Macintoshes, and most Unix platforms
Floating point encodes rational numbers/values of the form
V = x × 2^y
Useful for computations involving:
◦ Very large numbers: |V| >> 0
◦ Numbers very close to 0: |V| << 1
◦ More generally, as an approximation to real numbers
Single precision and double precision:
◦ Single precision: 32 bits
◦ Double precision: 64 bits
Limited range and precision (finite space):
Range (for given precision):
◦ Single: +/- 2^-149 to +/- 2^127
  Compare with int: -2^31 to 2^31 - 1
◦ Double: +/- 2^-1074 to +/- 2^1023
  Compare with long: -2^63 to 2^63 - 1
Overflow means that values have grown too large for the
representation, much in the same way that you can have
integer overflow.
Underflow (a value too small to be represented) is a less
serious problem because it just denotes a loss of precision,
which is guaranteed to be closely approximated by zero.
We've still only got 2^w bit patterns
to work with! What gives?!?!
PRECISION!
"Real numbers" having a decimal portion that is not
necessarily equal to 0
Example: 123.14 base 10
◦ 1*10^2 + 2*10^1 + 3*10^0 + 1*10^-1 + 4*10^-2
◦ Digit format: d_m d_(m-1) ... d_1 d_0 . d_(-1) d_(-2) ... d_(-n)
◦ value = Σ (i = -n to m) d_i * 10^i
Example: 110.11 base 2
◦ 1*2^2 + 1*2^1 + 0*2^0 + 1*2^-1 + 1*2^-2
◦ Digit format: b_m b_(m-1) ... b_1 b_0 . b_(-1) b_(-2) ... b_(-n)
◦ value = Σ (i = -n to m) b_i * 2^i
In both cases, digits on the left of the "point" are weighted by
a positive power and those on the right are weighted by a
negative power
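To make the summation concrete, here is a small C sketch (an illustration added here, not from the text) that evaluates 110.11 base 2 with it:

    #include <stdio.h>

    int main(void) {
        int int_bits[]  = {1, 1, 0};   /* b2 b1 b0 of 110.11 */
        int frac_bits[] = {1, 1};      /* b-1 b-2 of 110.11  */
        double value = 0.0, weight = 1.0;

        /* digits left of the point: weights 2^2, 2^1, 2^0 */
        for (int i = 0; i < 3; i++)
            value += int_bits[i] * (double)(1 << (2 - i));

        /* digits right of the point: weights 2^-1, 2^-2 */
        for (int i = 0; i < 2; i++) {
            weight /= 2.0;
            value += frac_bits[i] * weight;
        }
        printf("%f\n", value);         /* prints 6.750000 */
        return 0;
    }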
Shifting* the binary point one position left:
◦ Divides the number by 2
◦ Compare 101.11 (5¾) base 2 with 10.111 (2 7/8) base 2
Shifting* the binary point one position right:
◦ Multiplies the number by 2
◦ Compare 101.11 (5¾) base 2 with 1011.1 (11½) base 2
Numbers such as 0.111…11 base 2 represent numbers just below 1:
0.111111 base 2 = 63/64 (decimal 0.984375), and
0.11111111 base 2 = 255/256 (decimal 0.99609375)
*NOTE: the binary point shifting discussed above is completely different from the
bitwise shifting discussed previously
Only finite-length encodings can be stored
◦ 1/3 and 5/7 cannot be represented exactly (generally, only
rational numbers with denominators that are powers of 2,
i.e., numbers such as 1/2 or 3/16, have a terminating, or
finite-length, binary encoding)
Fractional binary notation can only represent numbers that can
be written x * 2^y (for integers x and y)
◦ Example: 63/64 = 63 * 2^-6; 255/256 = 255 * 2^-8
◦ Otherwise, the value can only be approximated
◦ Get increasing accuracy by lengthening the binary
representation, but we still have finite space, so precision will
always be limited
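A quick C illustration (not from the text): since 1/10 has a denominator with a prime factor other than 2, the stored double only approximates it:

    #include <stdio.h>

    int main(void) {
        /* 0.1 has no finite binary encoding; what is stored is the
           nearest representable double. */
        printf("%.20f\n", 0.1);   /* prints 0.10000000000000000555... */
        return 0;
    }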
When we represent a number in scientific notation
◦ Typically, only 1 digit to the left of the decimal point
  6.385 * 10^4 == 63,850
  -9.66712 * 10^7 == -96,671,200
  1.23456 * 10^3 == 1,234.56
  2.75 * 10^-2 == 0.0275
◦ Four fields
  Sign
  Left side of decimal point
  Right side of decimal point
  Exponent
Hold these thoughts as we move on to IEEE 754
An integer representation that skews the bit pattern so as to
look just like unsigned but actually represents negative numbers.
Examples: given 4 bits, we BIAS values by 2^3 (8)
  TRUE VALUE to be represented     3    -3
  add in the bias                 +8    +8
                                 ----  ----
  unsigned value                  11     5
So, the bit pattern of 3 (or -3) in 4-bit biased-8 representation will
be 1011 (or 0101)
Think of it as sliding the number line from
between -2^(w-1) and 2^(w-1) - 1 to between 0 and 2^w - 1
Going the other way, suppose we were given a biased-8 representation of
0110
  unsigned 0110 represents    6
  subtract out the bias     - 8
                            ----
  TRUE VALUE represented    -2
This representation allows operations on the biased numbers to be the same
as for unsigned integers, but actually represents both positive and negative
values.
Choosing a bias:
The bias chosen is most often based on the number of bits available for
representing an integer. To get an approximately equal distribution of true values
above and below 0, the bias should be 2^(n-1) or 2^(n-1) - 1
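A small C sketch of both conversions (illustrative only; encode_biased and decode_biased are made-up helper names):

    #include <stdio.h>

    /* n-bit biased representation with a caller-supplied bias */
    unsigned encode_biased(int true_value, int bias)   { return (unsigned)(true_value + bias); }
    int      decode_biased(unsigned pattern, int bias) { return (int)pattern - bias; }

    int main(void) {
        printf("%u\n", encode_biased( 3, 8));  /* 11 -> bit pattern 1011 */
        printf("%u\n", encode_biased(-3, 8));  /*  5 -> bit pattern 0101 */
        printf("%d\n", decode_biased( 6, 8));  /* 0110 -> -2 */
        return 0;
    }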
The bits in the encoding of the real number value are divided
into three fields (details about the sizes of these fields on the
next slide):
◦ S field (or sign field): encodes the sign of the number;
◦ E field (or exponent field): encodes the exponent of the real
number (positive, 0 or negative), in biased form;
◦ F field (or significand/mantissa field): encodes a number
greater than or equal to 0, and less than 2, which is multiplied
by 2 raised to the exponent to obtain the real value encoded by
the entire IEEE 754 representation (this is x in a representation
of the value of the number as |V| = x × 2^y)
The real number decimal value represented is: (-1)^s * f * 2^e
  -------------------------------------------------------
  |  S  |       E        |              F                |
  -------------------------------------------------------
S is one bit representing the sign of the number (it directly encodes s)
◦ 0 for positive, 1 for negative
E is an 8-bit biased integer (NOTE: not B2T!) representing the exponent e
◦ e = E - bias        E = e + bias
F is an unsigned integer; ε depends on the E field:
◦ if E is 0, ε is zero; otherwise, ε is 1
◦ f = ( F / 2^n ) + ε        F = (f - ε) * 2^n
For 32-bit, n = 23 and bias = 127
For 64-bit, n = 52 and bias = 1023
                             Sign     Exponent     Fraction     Bias
Single Precision (4 bytes)   1 [31]   8 [30-23]    23 [22-00]   127
Double Precision (8 bytes)   1 [63]   11 [62-52]   52 [51-00]   1023
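Assuming an IEEE 754 machine, the three fields can be extracted in C with shifts and masks; a minimal sketch (illustrative, not from the text):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float v = 64.2f;
        uint32_t bits;
        memcpy(&bits, &v, sizeof bits);    /* reinterpret the 32 bits  */

        uint32_t S = bits >> 31;           /* 1 sign bit      [31]     */
        uint32_t E = (bits >> 23) & 0xFF;  /* 8 exponent bits [30-23]  */
        uint32_t F = bits & 0x7FFFFF;      /* 23 fraction bits [22-00] */
        printf("S=%u E=%u (e=%d) F=0x%06X\n", S, E, (int)E - 127, F);
        return 0;
    }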
Think of IEEE 754 as similar to scientific notation.
The IEEE 754 standard is used so that there is consistency
across computers.
This is not the only way to represent floating point
numbers; it is just the IEEE standard way of doing it.
Represent the decimal number 64.2 in IEEE standard
single precision floating point format.
5-step process to obtain Sign (S), Exponent (E), and
Mantissa (F)
Step 1:
Get a binary representation for 64.2, with at least the number of bits for the F field + 1 (24 for single precision, or 53 for
double precision). To do this, determine unsigned binary representations for the values to the left and right of the
binary point separately.
64 is 100 0000
.2 can be computed using the algorithm:
                       Whole number part
  .2 x 2 = 0.4               0
  .4 x 2 = 0.8               0
  .8 x 2 = 1.6               1
  .6 x 2 = 1.2               1
  .2 x 2 = 0.4               0   now this whole pattern (0011) repeats.*
  .4 x 2 = 0.8               0
  .8 x 2 = 1.6               1
  .6 x 2 = 1.2               1
Write the whole-number parts from top to bottom, left to right. So, a binary representation
for 0.2 is .0011 0011 0011…
Putting the halves back together again, and writing 24 bits: 64.2 is 1000000.0011 0011 0011 0011 0
*NOTE: The pattern doesn't always repeat. Consider .25 = .0100 0000 0000 000
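The fraction algorithm translates directly into C. A sketch (illustrative; note the C literal 0.2 is itself already an approximation, though its first bits match the hand computation):

    #include <stdio.h>

    /* Print the first n fraction bits of a value in [0, 1) using the
       multiply-by-2 algorithm: the whole-number part is the next bit. */
    void print_frac_bits(double frac, int n) {
        for (int i = 0; i < n; i++) {
            frac *= 2.0;
            if (frac >= 1.0) { putchar('1'); frac -= 1.0; }
            else             { putchar('0'); }
        }
        putchar('\n');
    }

    int main(void) {
        print_frac_bits(0.2, 17);   /* prints 00110011001100110 */
        return 0;
    }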
Step 2:
Normalize the binary representation (i.e., make it look like
scientific notation) with a single 1 to the left of the binary point:
100 0000.0011 0011 0011 0011 0
normalizes to
1.0000 0000 1100 1100 1100 110 x 2^6
NOTE: the bit to the left of the binary point will be
considered a "hidden bit" (it does not appear in the encoding,
but is implied by the value of the exponent)
Step 3:
6 is the true exponent (e). For the standard form, to get E, it needs to be in 8-
bit, biased-127 representation (add 127 to the true exponent to get the biased-
127 representation).
     6 = e
  + 127
  -----
   133 = E
Convert this decimal value to binary:
133 in 8-bit, unsigned representation is 1000 0101 = E
This is the bit pattern used for E in the standard form.
There are 23 bits for the mantissa in a single precision
IEEE 754 representation.
It turns out that, if floating point numbers are always
stored in a normalized form, then the leading bit (the
one on the left of the binary point) is always a 1. So,
why store it at all?
We can put it back into the number (giving 24 bits of
precision for the mantissa) for any calculation, but we
only store 23 bits.
Step 4:
The mantissa/significand stored (F) consists of the bits to the right of
the binary point in the normalized form
(1.000 0000 0110 0110 0110 0110 x 2^6)
We need 23 bits of it for single precision.
So, F = 000 0000 0110 0110 0110 0110
(The leading 1 to the left of the binary point is the "hidden bit"; it is not stored.)
Step 5: Put it all together (and include the correct sign bit: the
original number is positive, so the sign bit is 0):
  S   E           F
  0   1000 0101   000 0000 0110 0110 0110 0110
= 0b 0100 0010 1000 0000 0110 0110 0110 0110
= 0x 4 2 8 0 6 6 6 6   (hex representation)
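Assuming an IEEE 754 machine, this result can be checked directly in C (illustrative sketch):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float v = 64.2f;
        uint32_t bits;
        memcpy(&bits, &v, sizeof bits);   /* view the float's raw bits */
        printf("0x%08X\n", bits);         /* prints 0x42806666         */
        return 0;
    }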
So, to sum up:
◦ The sign bit is 0 for positive, 1 for negative.
◦ The exponent's base is two.
◦ The exponent field (E) contains 127 (the bias) plus
the true exponent for single precision, or 1023 plus
the true exponent for double precision.
◦ The first bit of the mantissa/significand is implied
to be 1 for biased exponents in the range 1-254
(but assumed to be 0 for an exponent of 0); i.e., the
mantissa/significand is interpreted as 1.F, where F
is the field of fraction bits.
• Given binary 1100 0100 1000 0000 0010 1100 0000 0000
S = 1
E = 1000 1001 = 137 = 127 + 10, so e = 10
F = 000 0000 0010 1100 0000 0000
f = 1.000 0000 0010 1100 0000 0000 x 2^10 (the hidden bit reappears)
• Move the binary point right 10 positions:
1000 0000 001.0 1100 0000 0000
• Convert to decimal (don't forget to check S!)
= 2^10 + 2^0 + 2^-2 + 2^-3 = 1024 + 1 + 0.25 + 0.125
= -1025.375
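The same decoding can be done programmatically; a C sketch (illustrative) built from the fields above:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        unsigned S = 1;          /* sign bit                      */
        unsigned E = 0x89;       /* 1000 1001 = 137               */
        unsigned F = 0x002C00;   /* 000 0000 0010 1100 0000 0000  */

        double f = 1.0 + (double)F / (1 << 23);    /* restore the hidden bit */
        int    e = (int)E - 127;                   /* remove the bias        */
        double v = (S ? -1.0 : 1.0) * ldexp(f, e); /* (-1)^s * f * 2^e       */
        printf("%f\n", v);                         /* prints -1025.375000    */
        return 0;
    }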
IEEE uses three different encodings to represent real values:
◦ 1. Denormalized values: These are extremely small values, very close to
0, including 0. If such a value becomes too small, it can cause underflow
(the largest denormalized value is approximately +/- 1.18 x 10^-38).
◦ 2. Normalized values: These are values greater than the largest
denormalized value, but small enough to fit into the representation (that
is, without overflow).
◦ 3. Special values: These are values which are either +/- infinity (created
by overflow) or NaN (not a number: values which are not real numbers,
possibly caused by invalid operations, such as 0 divided by 0).
We will primarily look at conversion for normalized values,
but also briefly discuss the others.
Type of value    S        E                     F                      "hidden" bit
0                0 or 1   all 0's               all 0's                0
denormalized     0 or 1   all 0's               not all 0's            0
normalized       0 or 1   >0, but not all 1's   any bit pattern        1
+/- infinity     0 or 1   all 1's               all 0's                -
NaN*             0 or 1   all 1's               anything but all 0's   -
*Not a Number
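The table maps directly to code; a classification sketch in C (illustrative; the standard library's fpclassify() does the same job):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    const char *classify(float v) {
        uint32_t bits;
        memcpy(&bits, &v, sizeof bits);
        uint32_t E = (bits >> 23) & 0xFF;   /* exponent field */
        uint32_t F = bits & 0x7FFFFF;       /* fraction field */

        if (E == 0)    return (F == 0) ? "zero" : "denormalized";
        if (E == 0xFF) return (F == 0) ? "+/- infinity" : "NaN";
        return "normalized";
    }

    int main(void) {
        printf("%s\n", classify(0.0f));     /* zero         */
        printf("%s\n", classify(1e-45f));   /* denormalized */
        printf("%s\n", classify(64.2f));    /* normalized   */
        return 0;
    }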
Denormals were created for one primary purpose: gradual underflow. It's a way to keep the
relative difference between tiny numbers small. If you go straight from the smallest
normal number to zero (abrupt underflow), the relative change is infinite. If you go
to denormals on underflow, the relative change is still not fully accurate, but at
least more reasonable. And that difference shows up in calculations.
To put it a different way: floating-point numbers are not distributed uniformly.
There is always the same number of values between successive powers of two:
2^52 (for double precision). So without denormals, you always end up with a gap
between 0 and the smallest floating-point number that is 2^52 times the size of the
difference between the smallest two numbers. Denormals fill this gap uniformly.
As an example of the effects of abrupt vs. gradual underflow, look at the
mathematically equivalent x == y and x - y == 0. If x and y are tiny but different
and you use abrupt underflow, then if their difference is less than the minimum
cutoff value, their difference will be zero, and so the equivalence is violated.
With gradual underflow, the difference between two tiny but different normal
numbers gets to be a denormal, which is still not zero. The equivalence is
preserved.
So, using denormals on purpose is not advised, because they were designed
only as a backup mechanism in exceptional cases.
Reference: http://stackoverflow.com/questions/15140847/denormalized-numbers-ieee-754-floating-point
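The x - y example can be observed in C (illustrative sketch; assumes denormals are not flushed to zero, e.g., no -ffast-math):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void) {
        float x = FLT_MIN;              /* smallest normal float   */
        float y = nextafterf(x, 1.0f);  /* the next float above it */

        printf("x == y     : %d\n", x == y);           /* 0 */
        printf("x - y == 0 : %d\n", (x - y) == 0.0f);  /* 0: the difference
                                                          is a denormal */
        return 0;
    }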
Represent the decimal number 76.875 in IEEE
standard single precision floating point format.
This value will be a normalized value, because it is
clearly larger than the largest denormalized value (it
is not a value less than 1 and very close to 0), and it
is neither infinity nor NaN.
5-step process to obtain Sign (S), Exponent (E), and
Mantissa (F)
Step 1:
Get a binary representation for 76.875, with at least the number of bits for the F field + 1 (24 for
single precision, or 53 for double precision). To do this, get unsigned binary representations for
the values to the left and right of the binary point separately.
76 can be computed by repeated division by 2 (remainders on the right):
  76 / 2 = 38   remainder 0
  38 / 2 = 19   remainder 0
  19 / 2 =  9   remainder 1
   9 / 2 =  4   remainder 1
   4 / 2 =  2   remainder 0
   2 / 2 =  1   remainder 0
   1 / 2 =  0   remainder 1
Reading the remainders from bottom to top: 76 is 100 1100
.875 can be computed using the algorithm:
                          Whole number part
  .875 x 2 = 1.75               1
  .75  x 2 = 1.5                1
  .5   x 2 = 1.0                1
  .0   x 2 = 0.0                0
Reading the whole-number parts from top to bottom: .875 is .111000000…
Putting the halves back together again, and writing 24 bits:
76.875 is 100 1100.1110 0000 0000 0000 0
NOTE: This fraction doesn't repeat!!! Fill in the
remaining bits with trailing 0s!
Step 2:
Normalize the binary representation (i.e., make it look like scientific notation)
with a single 1 to the left of the binary point:
100 1100.1110 0000 0000 0000 0
normalizes to
1.001 1001 1100 0000 0000 0000 x 2^6
NOTE: the bit to the left of the binary point will be considered the "hidden
bit" (it does not appear in the encoding, but is implied by the value of the
exponent)
Step 3:
6 is the true exponent (e). For the standard form, to get E, it needs to be in 8-bit,
biased-127 representation (add 127 to the true exponent to get the biased-127
representation).
     6
  + 127
  -----
   133
Convert this decimal value to binary:
133 in 8-bit, unsigned representation is 1000 0101
This is the bit pattern used for E in the standard form.
Step 4:
The mantissa/significand stored (F) is the value to the right of the binary point
in the normalized form (1.0011 0011 1000 0000 0000 000 x 2^6)
We need 23 bits of it for single precision. Fill in with trailing 0s
if need be.
F = 0011 0011 1000 0000 0000 000
Step 5: Put it all together (and include the correct sign bit: the
original number is positive, so the sign bit is 0):
  S   E           F
  0   1000 0101   0011 0011 1000 0000 0000 000
= 0b 0100 0010 1001 1001 1100 0000 0000 0000
= 0x 4 2 9 9 C 0 0 0
Given binary 1 10001101 00001100010110001100000
S = 1
E = 1000 1101 = 141 = 127 + 14, so e = 14
F = 000 0110 0010 1100 0110 0000
f = 1.000 0110 0010 1100 0110 0000 x 2^14 (the hidden bit reappears)
• Move the binary point right 14 positions:
1000 0110 0010 110.0 0110 0000
• Convert to decimal (don't forget to check S!)
= -17,174.1875
Can't use integers for everything
Trying to cover a much broader range of real values; but
something has to give, and it's the precision
Pi is a good example (an irrational value can only be
approximated), and even many rational values cannot be encoded exactly:
◦ Whether or not a rational number has a terminating expansion
depends on the base.
  For example, in base-10 the number 1/2 has a terminating
  expansion (0.5) while the number 1/3 does not (0.333…).
  In base-2 only rational numbers with denominators that
  are powers of 2 (such as 1/2 or 3/16) are terminating. Any
  rational number with a denominator that has a prime factor
  other than 2 will have an infinite binary expansion.
Next, we look somewhat more at the range and
precision of IEEE single precision values
To begin, let’s see the range of values which can be
encoded by the F field of a single precision number
Minimum value which can be encoded:
◦ 1.0000 0000 0000 0000 0000 000
Hidden bit of 1, with F as all 0s
Clearly, this means f = 1.0
Maximum value which can be encoded:
◦ 1.1111 1111 1111 1111 1111 111
  Hidden bit of 1, with F as all 1s
  = 1 + ( (2^23 - 1) / 2^23 )
  = 1 + (((1024 x 1024 x 8) - 1) / (1024 x 1024 x 8))
  = 1 + 0.99999988
  f = 1.99999988
Using these values, we can see the range of values which
can be represented in IEEE single precision for various
exponents
Please see the following pdf file on Piazza:
IEEE_754_Single_Precision_Range_Table.pdf
Question: We say that floating point values are an
approximation to real numbers, but does the floating-
point representation always allow for a fractional part
of the value? (In other words, are there values which
can only be encoded in IEEE 754 as integers?)
Intuitively, since we have a limited number of bits for
F, we should guess that the answer is that a fractional
portion of a value cannot always be encoded.
Let’s look at single precision.
Question: Can we determine the minimum value in
IEEE single precision that does not allow for the
representation of any fractional portion of the value?
In other words, the minimum value which can only
represent an integer value?
At what point are all the bits in the F field required to
encode the integer portion of the value, leaving no
bits for any fractional portion?
Here is the encoding (assume positive):
0 1001 0110 0000 0000 0000 0000 0000 000
This value is 1.0000 0000 0000 0000 0000 000 x 2^23
= 1 0000 0000 0000 0000 0000 000.0 = 2^23 = 8,388,608
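So 2^23 = 8,388,608 is the smallest positive float whose neighbors are a whole 1.0 apart; a quick C check (illustrative):

    #include <stdio.h>

    int main(void) {
        float v = 8388608.0f;          /* 2^23 */
        /* All 23 F bits now carry integer weight, so there is no room
           for a fraction: the 0.25 below is simply rounded away. */
        printf("%.2f\n", v + 0.25f);   /* prints 8388608.00 */
        return 0;
    }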
We have seen that in some cases, the encoding allows
for values which are no more precise than integers
(i.e., only integer values can be encoded).
Question: Are there cases where IEEE 754 encoded
values are even less precise than integers?
Consider:
1.00000000000000000000000 x 2^50 =
and
1.00000000000000000000001 x 2^50 =
We have incremented our value by the minimum amount that we can,
right?
Consider:
1.00000000000000000000000 x 2^50 =
1,125,899,906,842,624.0 decimal
and
1.00000000000000000000001 x 2^50 =
1,125,900,041,060,352.0 decimal
That's a difference of 134,217,728!!
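That gap is exactly the spacing between adjacent floats at 2^50: 2^(50-23) = 2^27 = 134,217,728. A C check (illustrative):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float v    = 1125899906842624.0f;      /* 2^50, exactly representable */
        float next = nextafterf(v, INFINITY);  /* the adjacent float above    */
        printf("%.1f\n", (double)(next - v));  /* prints 134217728.0          */
        return 0;
    }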
What if
  float a = 64.2;
  double b = 64.2;
  double c = (double)a;
  if (a == b) printf("float equals double\n");
  else printf("float doesn't equal double\n");
Which line prints?
float doesn’t equal double WHAT ?!?!?
float a = 64.2 is:
0x42806666
This is consistent with what we converted 64.2 to be in
IEEE 754 in our earlier example.
float a = 64.2;
double b = 64.2;
double c = (double)a;
float a = 64.2 is:
0x42806666
double b = 64.2 is:
0x40500ccccccccccd
I’m comfortable believing this is the IEEE 754 double
precision value for 64.2. (I got the value via gdb
command x/g &b)
float a = 64.2;
double b = 64.2;
double c = (double)a;
float a = 64.2 is:
0x42806666
double b = 64.2 is:
0x40500ccccccccccd
double c = (double)a is:
0x40500cccc0000000 Hmmmm.
As they used to say on Hill Street Blues, “Let’s be careful out there!”
float a = 64.2;
double b = 64.2;
double c = (double)a;
float a = 64.25 is:
0x42808000
double b = 64.25 is:
0x4050100000000000
double c = (double)a is:
0x4050100000000000 Hmmmm.
Why the different result?
float a = 64.25;
double b = 64.25;
double c = (double)a;
Why the different result? 64.25 has a terminating binary expansion
(1000000.01, with no repeating pattern), so float, double, and the
float widened to double all hold exactly the same value.
What should you do???
◦ Some say you should not use relational
equality when dealing with floating point
values
◦ Some say you should use an epsilon and do
something like this:
  if (fabs(a - b) < epsilon) { ..... }
The problem, though, is that there is no agreement
with respect to what epsilon should be…
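One common pattern scales the tolerance to the operands' magnitude; this C sketch (illustrative; nearly_equal and the rel_eps value are made up, not a standard) is a starting point, not a universal answer:

    #include <stdio.h>
    #include <math.h>

    /* Relative comparison: |a-b| within rel_eps of the larger magnitude. */
    int nearly_equal(double a, double b, double rel_eps) {
        double diff  = fabs(a - b);
        double scale = fmax(fabs(a), fabs(b));
        return diff <= rel_eps * scale;
    }

    int main(void) {
        float  a = 64.2f;
        double b = 64.2;
        printf("%d\n", nearly_equal((double)a, b, 1e-6));  /* prints 1 */
        return 0;
    }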