18: IEEE 754 Floating Point
Required Reading: Computer Systems: A Programmer’s Perspective, 3rd Edition
• Chapter 2, Sections 2.4 through 2.4.4
• Pearson Tutorial Assignment
IEEE Standard 754 floating point is the most common
representation today for real numbers on computers, including
Intel-based PCs, Macintoshes, and most Unix platforms
Floating point encodes rational numbers/values of the form
V = x × 2^y
Useful for computations involving:
◦ Very large numbers: |V| >> 0
◦ Numbers very close to 0: |V| << 1
◦ More generally, as an approximation to real numbers
Single precision and double precision:
◦ Single precision: 32 bits
◦ Double precision: 64 bits
Limited range and precision (finite space):
Range (for given precision):
◦ Single: +/- 2^-149 to +/- 2^127
  Compare with int: -2^31 to 2^31 - 1
◦ Double: +/- 2^-1074 to +/- 2^1023
  Compare with long: -2^63 to 2^63 - 1
Overflow means that values have grown too large for the
representation, much in the same way that you can have
integer overflow.
Underflow (a value too small to be represented) is a less
serious problem because it just denotes a loss of precision,
which is guaranteed to be closely approximated by zero.
We've still only got 2^w bit patterns
to work with! What gives?!?!
PRECISION!
"Real numbers" having a decimal portion that is not
necessarily equal to 0
Example: 123.14 base 10
◦ 1*10^2 + 2*10^1 + 3*10^0 + 1*10^-1 + 4*10^-2
◦ Digit format: d_m d_(m-1) ... d_1 d_0 . d_(-1) d_(-2) ... d_(-n)
◦ value = Σ (i = -n to m) d_i * 10^i
Example: 110.11 base 2
◦ 1*2^2 + 1*2^1 + 0*2^0 + 1*2^-1 + 1*2^-2
◦ Digit format: b_m b_(m-1) ... b_1 b_0 . b_(-1) b_(-2) ... b_(-n)
◦ value = Σ (i = -n to m) b_i * 2^i
In both cases, digits on the left of the "point" are weighted by
a positive power and those on the right are weighted by a
negative power
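To make the summation concrete, here is a small C sketch (an illustration added here, not from the text) that evaluates 110.11 base 2 with it:

    #include <stdio.h>

    int main(void) {
        int int_bits[]  = {1, 1, 0};   /* b2 b1 b0 of 110.11 */
        int frac_bits[] = {1, 1};      /* b-1 b-2 of 110.11  */
        double value = 0.0, weight = 1.0;

        /* digits left of the point: weights 2^2, 2^1, 2^0 */
        for (int i = 0; i < 3; i++)
            value += int_bits[i] * (double)(1 << (2 - i));

        /* digits right of the point: weights 2^-1, 2^-2 */
        for (int i = 0; i < 2; i++) {
            weight /= 2.0;
            value += frac_bits[i] * weight;
        }
        printf("%f\n", value);         /* prints 6.750000 */
        return 0;
    }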
Shifting* the binary point one position left:
◦ Divides the number by 2
◦ Compare 101.11 (5¾) base 2 with 10.111 (2 7/8) base 2
Shifting* the binary point one position right:
◦ Multiplies the number by 2
◦ Compare 101.11 (5¾) base 2 with 1011.1 (11½) base 2
Numbers such as 0.111…11 base 2 represent numbers just below 1:
0.111111 base 2 = 63/64 (decimal 0.984375), and
0.11111111 base 2 = 255/256 (decimal 0.99609375)
*NOTE: the binary point shifting discussed above is completely different from the
bitwise shifting discussed previously
Only finite-length encodings can be stored
◦ 1/3 and 5/7 cannot be represented exactly (generally, only
rational numbers with denominators that are powers of 2,
i.e., numbers such as 1/2 or 3/16, have a terminating, or
finite-length, binary encoding)
Fractional binary notation can only represent numbers that can
be written x * 2^y (for integers x and y)
◦ Example: 63/64 = 63 * 2^-6; 255/256 = 255 * 2^-8
◦ Otherwise, the value can only be approximated
◦ Get increasing accuracy by lengthening the binary
representation, but we still have finite space, so precision will
always be limited
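A quick C illustration (not from the text): since 1/10 has a denominator with a prime factor other than 2, the stored double only approximates it:

    #include <stdio.h>

    int main(void) {
        /* 0.1 has no finite binary encoding; what is stored is the
           nearest representable double. */
        printf("%.20f\n", 0.1);   /* prints 0.10000000000000000555... */
        return 0;
    }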
When we represent a number in scientific notation
◦ Typically, only 1 digit to the left of the decimal point
  6.385 * 10^4 == 63,850
  -9.66712 * 10^7 == -96,671,200
  1.23456 * 10^3 == 1,234.56
  2.75 * 10^-2 == 0.0275
◦ Four fields
  Sign
  Left side of decimal point
  Right side of decimal point
  Exponent
Hold these thoughts as we move on to IEEE 754
An integer representation that skews the bit pattern so as to
look just like unsigned but actually represents negative numbers.
Examples: given 4 bits, we BIAS values by 2^3 (8)
  TRUE VALUE to be represented     3    -3
  add in the bias                 +8    +8
                                 ----  ----
  unsigned value                  11     5
So, the bit pattern of 3 (or -3) in 4-bit biased-8 representation will
be 1011 (or 0101)
Think of it as sliding the number line from
between -2^(w-1) and 2^(w-1) - 1 to between 0 and 2^w - 1
Going the other way, suppose we were given a biased-8 representation of
0110
  unsigned 0110 represents    6
  subtract out the bias     - 8
                            ----
  TRUE VALUE represented    -2
This representation allows operations on the biased numbers to be the same
as for unsigned integers, but actually represents both positive and negative
values.
Choosing a bias:
The bias chosen is most often based on the number of bits available for
representing an integer. To get an approximately equal distribution of true values
above and below 0, the bias should be 2^(n-1) or 2^(n-1) - 1
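A small C sketch of both conversions (illustrative only; encode_biased and decode_biased are made-up helper names):

    #include <stdio.h>

    /* n-bit biased representation with a caller-supplied bias */
    unsigned encode_biased(int true_value, int bias)   { return (unsigned)(true_value + bias); }
    int      decode_biased(unsigned pattern, int bias) { return (int)pattern - bias; }

    int main(void) {
        printf("%u\n", encode_biased( 3, 8));  /* 11 -> bit pattern 1011 */
        printf("%u\n", encode_biased(-3, 8));  /*  5 -> bit pattern 0101 */
        printf("%d\n", decode_biased( 6, 8));  /* 0110 -> -2 */
        return 0;
    }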
The bits in the encoding of the real number value are divided
into three fields (details about the sizes of these fields on the
next slide):
◦ S field (or sign field): encodes the sign of the number;
◦ E field (or exponent field): encodes the exponent of the real
number (positive, 0 or negative), in biased form;
◦ F field (or significand/mantissa field): encodes a number
greater than or equal to 0, and less than 2, which is multiplied
by 2 raised to the exponent to obtain the real value encoded by
the entire IEEE 754 representation (this is x in a representation
of the value of the number as |V| = x × 2^y)
The real number decimal value represented is: (-1)^s * f * 2^e
  -------------------------------------------------------
  |  S  |       E        |              F                |
  -------------------------------------------------------
S is one bit representing the sign of the number (it directly encodes s)
◦ 0 for positive, 1 for negative
E is an 8-bit biased integer (NOTE: not B2T!) representing the exponent e
◦ e = E - bias        E = e + bias
F is an unsigned integer; ε depends on the E field:
◦ if E is 0, ε is zero; otherwise, ε is 1
◦ f = ( F / 2^n ) + ε        F = (f - ε) * 2^n
For 32-bit, n = 23 and bias = 127
For 64-bit, n = 52 and bias = 1023
                             Sign     Exponent     Fraction     Bias
Single Precision (4 bytes)   1 [31]   8 [30-23]    23 [22-00]   127
Double Precision (8 bytes)   1 [63]   11 [62-52]   52 [51-00]   1023
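Assuming an IEEE 754 machine, the three fields can be extracted in C with shifts and masks; a minimal sketch (illustrative, not from the text):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float v = 64.2f;
        uint32_t bits;
        memcpy(&bits, &v, sizeof bits);    /* reinterpret the 32 bits  */

        uint32_t S = bits >> 31;           /* 1 sign bit      [31]     */
        uint32_t E = (bits >> 23) & 0xFF;  /* 8 exponent bits [30-23]  */
        uint32_t F = bits & 0x7FFFFF;      /* 23 fraction bits [22-00] */
        printf("S=%u E=%u (e=%d) F=0x%06X\n", S, E, (int)E - 127, F);
        return 0;
    }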
Think of IEEE 754 as similar to scientific notation.
The IEEE 754 standard is used so that there is consistency
across computers.
This is not the only way to represent floating point
numbers; it is just the IEEE standard way of doing it.
Represent the decimal number 64.2 in IEEE standard
single precision floating point format.
5-step process to obtain Sign (S), Exponent (E), and
Mantissa (F)
Step 1:
Get a binary representation for 64.2, with at least the number of bits for the F field + 1 (24 for single precision, or 53 for
double precision). To do this, determine unsigned binary representations for the values to the left and right of the
binary point separately.
64 is 100 0000
.2 can be computed using the algorithm:
                       Whole number part
  .2 x 2 = 0.4               0
  .4 x 2 = 0.8               0
  .8 x 2 = 1.6               1
  .6 x 2 = 1.2               1
  .2 x 2 = 0.4               0   now this whole pattern (0011) repeats.*
  .4 x 2 = 0.8               0
  .8 x 2 = 1.6               1
  .6 x 2 = 1.2               1
Write the whole-number parts from top to bottom, left to right. So, a binary representation
for 0.2 is .0011 0011 0011…
Putting the halves back together again, and writing 24 bits: 64.2 is 1000000.0011 0011 0011 0011 0
*NOTE: The pattern doesn't always repeat. Consider .25 = .0100 0000 0000 000
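The fraction algorithm translates directly into C. A sketch (illustrative; note the C literal 0.2 is itself already an approximation, though its first bits match the hand computation):

    #include <stdio.h>

    /* Print the first n fraction bits of a value in [0, 1) using the
       multiply-by-2 algorithm: the whole-number part is the next bit. */
    void print_frac_bits(double frac, int n) {
        for (int i = 0; i < n; i++) {
            frac *= 2.0;
            if (frac >= 1.0) { putchar('1'); frac -= 1.0; }
            else             { putchar('0'); }
        }
        putchar('\n');
    }

    int main(void) {
        print_frac_bits(0.2, 17);   /* prints 00110011001100110 */
        return 0;
    }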
Step 2:
Normalize the binary representation (i.e., make it look like
scientific notation) with a single 1 to the left of the binary point:
100 0000.0011 0011 0011 0011 0
normalizes to
1.0000 0000 1100 1100 1100 110 x 2^6
NOTE: the bit to the left of the binary point will be
considered a "hidden bit" (it does not appear in the encoding,
but is implied by the value of the exponent)
Step 3:
6 is the true exponent (e). For the standard form, to get E, it needs to be in 8-
bit, biased-127 representation (add 127 to the true exponent to get the biased-
127 representation).
     6 = e
  + 127
  -----
   133 = E
Convert this decimal value to binary:
133 in 8-bit, unsigned representation is 1000 0101 = E
This is the bit pattern used for E in the standard form.
There are 23 bits for the mantissa in a single precision
IEEE 754 representation.
It turns out that, if floating point numbers are always
stored in a normalized form, then the leading bit (the
one on the left of the binary point) is always a 1. So,
why store it at all?
We can put it back into the number (giving 24 bits of
precision for the mantissa) for any calculation, but we
only store 23 bits.
Step 4:
The mantissa/significand stored (F) consists of the bits to the right of
the binary point in the normalized form
(1.000 0000 0110 0110 0110 0110 x 2^6)
We need 23 bits of it for single precision.
So, F = 000 0000 0110 0110 0110 0110
(The leading 1 to the left of the binary point is the "hidden bit"; it is not stored.)
Step 5: Put it all together (and include the correct sign bit: the
original number is positive, so the sign bit is 0):
  S   E           F
  0   1000 0101   000 0000 0110 0110 0110 0110
= 0b 0100 0010 1000 0000 0110 0110 0110 0110
= 0x 4 2 8 0 6 6 6 6   (hex representation)
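Assuming an IEEE 754 machine, this result can be checked directly in C (illustrative sketch):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float v = 64.2f;
        uint32_t bits;
        memcpy(&bits, &v, sizeof bits);   /* view the float's raw bits */
        printf("0x%08X\n", bits);         /* prints 0x42806666         */
        return 0;
    }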
So, to sum up:
◦ The sign bit is 0 for positive, 1 for negative.
◦ The exponent's base is two.
◦ The exponent field (E) contains 127 (the bias) plus
the true exponent for single precision, or 1023 plus
the true exponent for double precision.
◦ The first bit of the mantissa/significand is implied
to be 1 for biased exponents in the range 1-254
(but assumed to be 0 for an exponent of 0); i.e., the
mantissa/significand is interpreted as 1.F, where F
is the field of fraction bits.
• Given binary 1100 0100 1000 0000 0010 1100 0000 0000
S = 1
E = 1000 1001 = 137 = 127 + 10, so e = 10
F = 000 0000 0010 1100 0000 0000
f = 1.000 0000 0010 1100 0000 0000 x 2^10 (the hidden bit reappears)
• Move the binary point right 10 positions:
1000 0000 001.0 1100 0000 0000
• Convert to decimal (don't forget to check S!)
= 2^10 + 2^0 + 2^-2 + 2^-3 = 1024 + 1 + 0.25 + 0.125
= -1025.375
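The same decoding can be done programmatically; a C sketch (illustrative) built from the fields above:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        unsigned S = 1;          /* sign bit                      */
        unsigned E = 0x89;       /* 1000 1001 = 137               */
        unsigned F = 0x002C00;   /* 000 0000 0010 1100 0000 0000  */

        double f = 1.0 + (double)F / (1 << 23);    /* restore the hidden bit */
        int    e = (int)E - 127;                   /* remove the bias        */
        double v = (S ? -1.0 : 1.0) * ldexp(f, e); /* (-1)^s * f * 2^e       */
        printf("%f\n", v);                         /* prints -1025.375000    */
        return 0;
    }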
IEEE uses three different encodings to represent real values:
◦ 1. Denormalized values: These are extremely small values, very close to
0, including 0. If such a value becomes too small, it can cause underflow
(the largest denormalized value is approximately +/- 1.18 x 10^-38).
◦ 2. Normalized values: These are values greater than the largest
denormalized value, but small enough to fit into the representation (that
is, without overflow).
◦ 3. Special values: These are values which are either +/- infinity (created
by overflow) or NaN (not a number: values which are not real numbers,
possibly caused by invalid operations, such as 0 divided by 0).
We will primarily look at conversion for normalized values,
but also briefly discuss the others.
Type of value    S        E                     F                      "hidden" bit
0                0 or 1   all 0's               all 0's                0
denormalized     0 or 1   all 0's               not all 0's            0
normalized       0 or 1   >0, but not all 1's   any bit pattern        1
+/- infinity     0 or 1   all 1's               all 0's                -
NaN*             0 or 1   all 1's               anything but all 0's   -
*Not a Number
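The table maps directly to code; a classification sketch in C (illustrative; the standard library's fpclassify() does the same job):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    const char *classify(float v) {
        uint32_t bits;
        memcpy(&bits, &v, sizeof bits);
        uint32_t E = (bits >> 23) & 0xFF;   /* exponent field */
        uint32_t F = bits & 0x7FFFFF;       /* fraction field */

        if (E == 0)    return (F == 0) ? "zero" : "denormalized";
        if (E == 0xFF) return (F == 0) ? "+/- infinity" : "NaN";
        return "normalized";
    }

    int main(void) {
        printf("%s\n", classify(0.0f));     /* zero         */
        printf("%s\n", classify(1e-45f));   /* denormalized */
        printf("%s\n", classify(64.2f));    /* normalized   */
        return 0;
    }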
Denormals were created for one primary purpose: gradual underflow. It's a way to keep the
relative difference between tiny numbers small. If you go straight from the smallest
normal number to zero (abrupt underflow), the relative change is infinite. If you go
to denormals on underflow, the relative change is still not fully accurate, but at
least more reasonable. And that difference shows up in calculations.
To put it a different way: floating-point numbers are not distributed uniformly.
There is always the same number of values between successive powers of two:
2^52 (for double precision). So without denormals, you always end up with a gap
between 0 and the smallest floating-point number that is 2^52 times the size of the
difference between the smallest two numbers. Denormals fill this gap uniformly.
As an example of the effects of abrupt vs. gradual underflow, look at the
mathematically equivalent x == y and x - y == 0. If x and y are tiny but different
and you use abrupt underflow, then if their difference is less than the minimum
cutoff value, their difference will be zero, and so the equivalence is violated.
With gradual underflow, the difference between two tiny but different normal
numbers gets to be a denormal, which is still not zero. The equivalence is
preserved.
So, using denormals on purpose is not advised, because they were designed
only as a backup mechanism in exceptional cases.
Reference: http://stackoverflow.com/questions/15140847/denormalized-numbers-ieee-754-floating-point
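The x - y example can be observed in C (illustrative sketch; assumes denormals are not flushed to zero, e.g., no -ffast-math):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void) {
        float x = FLT_MIN;              /* smallest normal float   */
        float y = nextafterf(x, 1.0f);  /* the next float above it */

        printf("x == y     : %d\n", x == y);           /* 0 */
        printf("x - y == 0 : %d\n", (x - y) == 0.0f);  /* 0: the difference
                                                          is a denormal */
        return 0;
    }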
Represent the decimal number 76.875 in IEEE
standard single precision floating point format.
This value will be a normalized value, because it is
clearly larger than the largest denormalized value (it
is not a value less than 1 and very close to 0), and it
is neither infinity nor NaN.
5-step process to obtain Sign (S), Exponent (E), and
Mantissa (F)
Step 1:
Get a binary representation for 76.875, with at least the number of bits for the F field + 1 (24 for
single precision, or 53 for double precision). To do this, get unsigned binary representations for
the values to the left and right of the binary point separately.
76 can be computed by repeated division by 2 (remainders on the right):
  76 / 2 = 38   remainder 0
  38 / 2 = 19   remainder 0
  19 / 2 =  9   remainder 1
   9 / 2 =  4   remainder 1
   4 / 2 =  2   remainder 0
   2 / 2 =  1   remainder 0
   1 / 2 =  0   remainder 1
Reading the remainders from bottom to top: 76 is 100 1100
.875 can be computed using the algorithm:
                          Whole number part
  .875 x 2 = 1.75               1
  .75  x 2 = 1.5                1
  .5   x 2 = 1.0                1
  .0   x 2 = 0.0                0
Reading the whole-number parts from top to bottom: .875 is .111000000…
Putting the halves back together again, and writing 24 bits:
76.875 is 100 1100.1110 0000 0000 0000 0
NOTE: This fraction doesn't repeat!!! Fill in the
remaining bits with trailing 0s!
Step 2:
Normalize the binary representation (i.e., make it look like scientific notation)
with a single 1 to the left of the binary point:
100 1100.1110 0000 0000 0000 0
normalizes to
1.001 1001 1100 0000 0000 0000 x 2^6
NOTE: the bit to the left of the binary point will be considered the "hidden
bit" (it does not appear in the encoding, but is implied by the value of the
exponent)
Step 3:
6 is the true exponent (e). For the standard form, to get E, it needs to be in 8-bit,
biased-127 representation (add 127 to the true exponent to get the biased-127
representation).
     6
  + 127
  -----
   133
Convert this decimal value to binary:
133 in 8-bit, unsigned representation is 1000 0101
This is the bit pattern used for E in the standard form.
Step 4:
The mantissa/significand stored (F) is the value to the right of the binary point
in the normalized form (1.0011 0011 1000 0000 0000 000 x 2^6)
We need 23 bits of it for single precision. Fill in with trailing 0s
if need be.
F = 0011 0011 1000 0000 0000 000
Step 5: Put it all together (and include the correct sign bit: the
original number is positive, so the sign bit is 0):
  S   E           F
  0   1000 0101   0011 0011 1000 0000 0000 000
= 0b 0100 0010 1001 1001 1100 0000 0000 0000
= 0x 4 2 9 9 C 0 0 0
Given binary 1 10001101 00001100010110001100000
S = 1
E = 1000 1101 = 141 = 127 + 14, so e = 14
F = 000 0110 0010 1100 0110 0000
f = 1.000 0110 0010 1100 0110 0000 x 2^14 (the hidden bit reappears)
• Move the binary point right 14 positions:
1000 0110 0010 110.0 0110 0000
• Convert to decimal (don't forget to check S!)
= -17,174.1875
Can't use integers for everything
Trying to cover a much broader range of real values; but
something has to give, and it's the precision
Pi is a good example (an irrational value can only be
approximated), and even many rational values cannot be encoded exactly:
◦ Whether or not a rational number has a terminating expansion
depends on the base.
  For example, in base-10 the number 1/2 has a terminating
  expansion (0.5) while the number 1/3 does not (0.333…).
  In base-2 only rational numbers with denominators that
  are powers of 2 (such as 1/2 or 3/16) are terminating. Any
  rational number with a denominator that has a prime factor
  other than 2 will have an infinite binary expansion.
Next, we look somewhat more at the range and
precision of IEEE single precision values
To begin, let’s see the range of values which can be
encoded by the F field of a single precision number
Minimum value which can be encoded:
◦ 1.0000 0000 0000 0000 0000 000
Hidden bit of 1, with F as all 0s
Clearly, this means f = 1.0
Maximum value which can be encoded:
◦ 1.1111 1111 1111 1111 1111 111
  Hidden bit of 1, with F as all 1s
  = 1 + ( (2^23 - 1) / 2^23 )
  = 1 + (((1024 x 1024 x 8) - 1) / (1024 x 1024 x 8))
  = 1 + 0.99999988
  f = 1.99999988
Using these values, we can see the range of values which
can be represented in IEEE single precision for various
exponents
Please see the following pdf file on Piazza:
IEEE_754_Single_Precision_Range_Table.pdf
Question: We say that floating point values are an
approximation to real numbers, but does the floating-
point representation always allow for a fractional part
of the value? (In other words, are there values which
can only be encoded in IEEE 754 as integers?)
Intuitively, since we have a limited number of bits for
F, we should guess that the answer is that a fractional
portion of a value cannot always be encoded.
Let’s look at single precision.
Question: Can we determine the minimum value in
IEEE single precision that does not allow for the
representation of any fractional portion of the value?
In other words, the minimum value which can only
represent an integer value?
At what point are all the bits in the F field required to
encode the integer portion of the value, leaving no
bits for any fractional portion?
Here is the encoding (assume positive):
0 1001 0110 0000 0000 0000 0000 0000 000
This value is 1.0000 0000 0000 0000 0000 000 x 2^23
= 1 0000 0000 0000 0000 0000 000.0 = 2^23 = 8,388,608
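So 2^23 = 8,388,608 is the smallest positive float whose neighbors are a whole 1.0 apart; a quick C check (illustrative):

    #include <stdio.h>

    int main(void) {
        float v = 8388608.0f;          /* 2^23 */
        /* All 23 F bits now carry integer weight, so there is no room
           for a fraction: the 0.25 below is simply rounded away. */
        printf("%.2f\n", v + 0.25f);   /* prints 8388608.00 */
        return 0;
    }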
We have seen that in some cases, the encoding allows
for values which are no more precise than integers
(i.e., only integer values can be encoded).
Question: Are there cases where IEEE 754 encoded
values are even less precise than integers?
Consider:
1.00000000000000000000000 x 2^50 =
and
1.00000000000000000000001 x 2^50 =
We have incremented our value by the minimum amount that we can,
right?
Consider:
1.00000000000000000000000 x 2^50 =
1,125,899,906,842,624.0 decimal
and
1.00000000000000000000001 x 2^50 =
1,125,900,041,060,352.0 decimal
That's a difference of 134,217,728!!
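That gap is exactly the spacing between adjacent floats at 2^50: 2^(50-23) = 2^27 = 134,217,728. A C check (illustrative):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float v    = 1125899906842624.0f;      /* 2^50, exactly representable */
        float next = nextafterf(v, INFINITY);  /* the adjacent float above    */
        printf("%.1f\n", (double)(next - v));  /* prints 134217728.0          */
        return 0;
    }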
What if
  float a = 64.2;
  double b = 64.2;
  double c = (double)a;
  if (a == b) printf("float equals double\n");
  else printf("float doesn't equal double\n");
Which line prints?
float doesn’t equal double WHAT ?!?!?
float a = 64.2 is:
0x42806666
This is consistent with what we converted 64.2 to be in
IEEE 754 in our earlier example.
float a = 64.2;
double b = 64.2;
double c = (double)a;
float a = 64.2 is:
0x42806666
double b = 64.2 is:
0x40500ccccccccccd
I’m comfortable believing this is the IEEE 754 double
precision value for 64.2. (I got the value via gdb
command x/g &b)
float a = 64.2;
double b = 64.2;
double c = (double)a;
float a = 64.2 is:
0x42806666
double b = 64.2 is:
0x40500ccccccccccd
double c = (double)a is:
0x40500cccc0000000 Hmmmm.
As they used to say on Hill Street Blues, “Let’s be careful out there!”
float a = 64.2;
double b = 64.2;
double c = (double)a;
float a = 64.25 is:
0x42808000
double b = 64.25 is:
0x4050100000000000
double c = (double)a is:
0x4050100000000000 Hmmmm.
Why the different result?
float a = 64.25;
double b = 64.25;
double c = (double)a;
Why the different result? 64.25 has a terminating binary expansion
(1000000.01, with no repeating pattern), so float, double, and the
float widened to double all hold exactly the same value.
What should you do???
◦ Some say you should not use relational
equality when dealing with floating point
values
◦ Some say you should use an epsilon and do
something like this:
  if (fabs(a - b) < epsilon) { ..... }
The problem, though, is that there is no agreement
with respect to what epsilon should be…
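One common pattern scales the tolerance to the operands' magnitude; this C sketch (illustrative; nearly_equal and the rel_eps value are made up, not a standard) is a starting point, not a universal answer:

    #include <stdio.h>
    #include <math.h>

    /* Relative comparison: |a-b| within rel_eps of the larger magnitude. */
    int nearly_equal(double a, double b, double rel_eps) {
        double diff  = fabs(a - b);
        double scale = fmax(fabs(a), fabs(b));
        return diff <= rel_eps * scale;
    }

    int main(void) {
        float  a = 64.2f;
        double b = 64.2;
        printf("%d\n", nearly_equal((double)a, b, 1e-6));  /* prints 1 */
        return 0;
    }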