FLOATING POINT NUMBERS
Bernhard Kainz (with thanks to A. Gopalan, N. Dulay and E.
Edwards)
b.kainz@imperial.ac.uk
IEEE floating point standard
• IEEE: Institute of Electrical and Electronics Engineers
(USA)
• Comprehensive standard for binary floating point
arithmetic
• Widely adopted, giving predictable results independent of
architecture
• Standard defines:
• Format of binary floating point numbers, i.e. how the fields are
stored in memory
• Semantics of arithmetic operations
• Rules for error conditions
Single precision format (32-bit)
• Coefficient is called the significand in the IEEE standard
• Value represented is $\pm 1.F \times 2^{E-127}$
• The normal bit (the 1.) is omitted from the significand
field: it is a hidden bit
• Single precision yields 24 bits (approx. 7 decimal digits
of precision)
• Normalised ranges in decimal are approximately:
$-10^{38}$ to $-10^{-38}$, 0, $10^{-38}$ to $10^{38}$
Field        Symbol  Width
Sign         S       1 bit
Exponent     E       8 bits
Significand  F       23 bits
Exponent field
• In the IEEE standard, exponents are stored as excess
values, not as 2’s complement
• Example: in 8-bit excess-127
    True exponent   Stored field
    −127            0000 0000
    …               …
    0               0111 1111
    1               1000 0000
    …               …
    128             1111 1111
• Allows non-negative floating point numbers to be
compared using simple integer comparisons
Double precision format (64-bit)
• Value represented is $\pm 1.F \times 2^{E-1023}$
• Double precision yields 53 bits (approx. 16 decimal
digits of precision)
• Normalised ranges in decimal are approximately:
$-10^{308}$ to $-10^{-308}$, 0, $10^{-308}$ to $10^{308}$
• Single precision is generally reserved for when memory is
scarce, or for debugging numerical calculations, since
rounding errors show up more quickly
Field        Symbol  Width
Sign         S       1 bit
Exponent     E       11 bits
Significand  F       52 bits
Example: conversion to IEEE format
What is 42.6875 in IEEE single precision format?
1. Convert to binary number: 42.6875 = 10 1010 . 1011
2. Normalise: 1.0101 0101 1 × 2^5
3. Significand field is thus:
0101 0101 1000 0000 0000 000
4. Exponent field is (5 + 127 = 132): 1000 0100
Field        Symbol  Bits
Sign         S       0
Exponent     E       1000 0100
Significand  F       0101 0101 1000 0000 0000 000
Hex: 422A C000
Example: conversion from IEEE format
What is the IEEE single precision value represented by
BEC0 0000 in decimal?
1. Exponent field: 0111 1101 = 125
2. True binary exponent: 125 − 127 = −2
3. Significand field + hidden bit:
1.1000 0000 0000 0000 0000 000
4. So unsigned value is 1.1 × 2^−2 = 0.011 (binary)
= 0.25 + 0.125 = 0.375 (decimal)
5. Applying the sign bit gives finally −0.375
Field        Symbol  Bits
Sign         S       1
Exponent     E       0111 1101
Significand  F       1000 0000 0000 0000 0000 000
Example: addition
Carry out the addition 42.6875 + 0.375 in IEEE single
precision arithmetic
Number Sign Exponent Significand
42.6875 0 1000 0100 0101 0101 1000 0000 0000 000
0.375 0 0111 1101 1000 0000 0000 0000 0000 000
• To add these numbers, the exponents must be the same:
make the smaller exponent equal to the larger by shifting the
significand accordingly
• Note: must restore hidden bit when carrying out floating
point operations
Example: addition (cont.)
• Significand of larger no.: 1.0101 0101 1000 0000 0000 000
• Significand of smaller no.: 1.1000 0000 0000 0000 0000 000
• Exponents differ by (1000 0100 − 0111 1101 = 7) so shift binary
point of smaller no. 7 places to the left:
• Significand of smaller no.: 0.0000 0011 0000 0000 0000 000
• Significand of larger no.: 1.0101 0101 1000 0000 0000 000
• Significand of sum: 1.0101 1000 1000 0000 0000 000
• So sum is 1.0101 1000 1 × 2^5 = 10 1011.0001 = 43.0625
Field        Symbol  Bits
Sign         S       0
Exponent     E       1000 0100
Significand  F       0101 1000 1000 0000 0000 000
Special values
• IEEE formats can encode five kinds of values: zero,
normalised numbers, denormalised numbers, infinity
and not-a-number (NaNs)
• Single precision representations:
IEEE value          Sign field  Exponent  Significand               True exponent
±0                  0 or 1      0         0 (all zeros)             n/a
±denormalised no.   0 or 1      0         Any non-zero bit pattern  −126
±normalised no.     0 or 1      1…254     Any bit pattern           −126…127
±∞                  0 or 1      255       0 (all zeros)             n/a
Not-a-number        0 or 1      255       Any non-zero bit pattern  n/a
Denormalised numbers
• An all zero exponent is used to represent both zero and
denormalised numbers
• An all one exponent is used to represent infinities and
not-a-numbers
• This reduces the range for normalised numbers: for
single precision the exponent range is −126…127 rather
than −127…128
• Denormalised numbers represent values between the
underflow limits and zero, i.e. for single precision we have
$\pm 0.F \times 2^{-126}$
• Allows a more gradual shift to zero – useful in some
numerical applications
Infinities and NaNs
• Infinities represent values exceeding the overflow limits
and the results of dividing non-zero quantities by zero
• You can do basic ‘arithmetic’ with them, e.g.:
∞ + 5 = ∞, ∞ + ∞ = ∞
• NaNs represent the result of operations which have no
(real) mathematical interpretation, e.g.
0 ÷ 0, (+∞) + (−∞), 0 × ∞, square root of a negative number
• Operations resulting in NaNs can either yield a NaN result
(quiet NaN) or an exception (signalling NaN)
Special Operations
Operation                 Result
N ÷ ±Infinity             0
±Infinity × ±Infinity     ±Infinity
±non-zero ÷ 0             ±Infinity
Infinity + Infinity       Infinity
±0 ÷ ±0                   NaN
Infinity − Infinity       NaN
±Infinity ÷ ±Infinity     NaN
±Infinity × 0             NaN
☺ SOME FUN ☺
Floating Point Precision
• C code:
    #include <stdio.h>

    int main(void) {
        float a, b, c;
        float EPSILON = 0.0000001;
        a = 1.345f; b = 1.123f;
        c = a + b;
        // 2.468 is a double constant, so c is promoted to double here
        if (c == 2.468)
            printf("They are equal.\n");
        else
            printf("\nThey are not equal! The value of c is %.10f or %f\n", c, c);
        // With some tolerance
        if (((2.468 - EPSILON) < c) && (c < (2.468 + EPSILON)))
            printf("\n%.10f is equal to 2.468 with tolerance\n\n", c);
        return 0;
    }
Finding Machine Epsilon
• Pseudo-code
    Set machineEps = 1.0
    Loop
        machineEps = machineEps / 2.0
    Until ((1 + machineEps/2.0) == 1)
    Print machineEps
Finding Machine Epsilon
• C code
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        float machEps = 1.0f;
        do {
            machEps /= 2.0f;
            // If next epsilon yields 1, then break, because current
            // epsilon is the machine epsilon.
        }
        while ((float)(1.0 + (machEps/2.0f)) != 1.0);
        printf("\nCalculated Machine epsilon: %G\n\n", machEps);
        return 0;
    }
Finding Machine Epsilon
• In Java
    public class machEps
    {
        private static void calculateMachineEpsilonFloat() {
            float machEps = 1.0f;
            do {
                machEps /= 2.0f;
            } while ((float)(1.0 + (machEps/2.0)) != 1.0);
            System.out.println("Calculated machine epsilon: " + machEps);
        }

        public static void main(String args[])
        {
            calculateMachineEpsilonFloat();
        }
    }
Special Operations
• Example
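    #include <stdio.h>

    int main(int argc, char **argv)
    {
        float a = 1.0/0.0;    // non-zero ÷ 0 gives +Infinity
        float b = a * -100;   // Infinity × negative gives −Infinity
        float c = b/a;        // Infinity ÷ Infinity gives NaN
        int d = 2 * 10 + 3;   // ordinary integer arithmetic for comparison
        printf("\nValue of a = %f\n\n", a);
        printf("\nValue of b = %f\n\n", b);
        printf("\nValue of c = %f\n\n", c);
        printf("\nValue of d = %d\n\n", d);
        return 0;
    }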