FLOATING POINT NUMBERS

Bernhard Kainz (with thanks to A. Gopalan, N. Dulay and E. Edwards)

b.kainz@imperial.ac.uk

IEEE floating point standard

• IEEE: Institute of Electrical and Electronics Engineers (USA)

• Comprehensive standard for binary floating point arithmetic

• Widely adopted → predictable results independent of architecture

• Standard defines:

  • Format of binary floating point numbers, i.e. how the fields are stored in memory

  • Semantics of arithmetic operations

  • Rules for error conditions

Single precision format (32-bit)

• Coefficient is called the significand in the IEEE standard

• Value represented is $\pm 1.F \times 2^{E-127}$, where $F$ is the significand field and $E$ is the exponent field

• The leading bit (the 1.) is omitted from the significand field → a hidden bit

• Single precision yields 24 bits of precision (23 stored bits plus the hidden bit), approx. 7 decimal digits

• Normalised ranges in decimal are approximately:

$-10^{38}$ to $-10^{-38}$, $0$, $10^{-38}$ to $10^{38}$

Sign     Exponent    Significand
1 bit    8 bits      23 bits

Exponent field

• In the IEEE standard, exponents are stored as excess values, not as 2’s complement

• Example: in 8-bit excess-127

Value    Stored as
−127     0000 0000
…        …
0        0111 1111
1        1000 0000
…        …
128      1111 1111

• Allows non-negative floating point numbers to be compared using simple integer comparisons

Double precision format (64-bit)

• Value represented is $\pm 1.F \times 2^{E-1023}$

• Double precision yields 53 bits of precision (52 stored bits plus the hidden bit), approx. 16 decimal digits

• Normalised ranges in decimal are approximately:

$-10^{308}$ to $-10^{-308}$, $0$, $10^{-308}$ to $10^{308}$

• Single precision is generally reserved for when memory is scarce, or for debugging numerical calculations, since rounding errors show up more quickly

Sign     Exponent    Significand
1 bit    11 bits     52 bits

Example: conversion to IEEE format

What is 42.6875 in IEEE single precision format?

1. Convert to binary number: 42.6875 = 10 1010 . 1011

2. Normalise: 1.0101 0101 1 × $2^{5}$

3. Significand field is thus (hidden bit dropped, padded to 23 bits):

0101 0101 1000 0000 0000 000

4. Exponent field is (5 + 127 = 132): 1000 0100

Sign    Exponent     Significand
0       1000 0100    0101 0101 1000 0000 0000 000

Hex: 422A C000

Example: conversion from IEEE format

What is the IEEE single precision value represented by BEC0 0000 in decimal?

In binary, BEC0 0000 = 1011 1110 1100 0000 0000 0000 0000 0000, so the sign bit is 1.

1. Exponent field: 0111 1101 = 125

2. True binary exponent: 125 − 127 = −2

3. Significand field + hidden bit:
1.1000 0000 0000 0000 0000 000

4. So unsigned value is 1.1 × $2^{-2}$ = 0.011 (binary)
= 0.25 + 0.125 = 0.375 (decimal)

5. Applying the sign bit (1, i.e. negative) finally gives $-0.375$

Sign    Exponent     Significand
1       0111 1101    1000 0000 0000 0000 0000 000

Example: addition

Carry out the addition $42.6875 + 0.375$ in IEEE single precision arithmetic

Number     Sign    Exponent     Significand
42.6875    0       1000 0100    0101 0101 1000 0000 0000 000
0.375      0       0111 1101    1000 0000 0000 0000 0000 000

• To add these numbers, exponents must be the same → make the smaller exponent equal to the larger by shifting the significand accordingly

• Note: must restore hidden bit when carrying out floating point operations

Example: addition (cont.)

• Significand of larger no.: 1.0101 0101 1000 0000 0000 000

• Significand of smaller no.: 1.1000 0000 0000 0000 0000 000

• Exponents differ by (1000 0100 − 0111 1101 = 7), so shift the binary point of the smaller no. 7 places to the left (i.e. shift its significand 7 places to the right):

• Significand of smaller no.: 0.0000 0011 0000 0000 0000 000

• Significand of larger no.: 1.0101 0101 1000 0000 0000 000

• Significand of sum: 1.0101 1000 1000 0000 0000 000

• So sum is 1.0101 1000 1 × $2^{5}$ = 10 1011.0001 = 43.0625

Sign    Exponent     Significand
0       1000 0100    0101 1000 1000 0000 0000 000

Special values

• IEEE formats can encode five kinds of values: zero, normalised numbers, denormalised numbers, infinity and not-a-number (NaNs)

• Single precision representations:

IEEE value            Sign field    Exponent    Significand                 True exponent
±0                    0 or 1        0           0 (all zeros)
± denormalised no.    0 or 1        0           Any non-zero bit pattern    −126
± normalised no.      0 or 1        1…254       Any bit pattern             −126…127
±∞                    0 or 1        255         0 (all zeros)
Not-a-number          0 or 1        255         Any non-zero bit pattern

Denormalised numbers

• An all-zero exponent field is used to represent both zero and denormalised numbers

• An all-one exponent field is used to represent infinities and not-a-numbers

• This reduces the range for normalised numbers: for single precision the exponent range is −126…127 rather than −127…128

• Denormalised numbers represent values between the underflow limits and zero, i.e. for single precision we have $\pm 0.F \times 2^{-126}$

• Allows a more gradual shift to zero, which is useful in some numerical applications

Infinities and NaNs

• Infinities represent values exceeding the overflow limits, and the results of dividing non-zero quantities by zero

• You can do basic ‘arithmetic’ with them, e.g.:

$\infty + 5 = \infty$, $\infty + \infty = \infty$

• NaNs represent the result of operations which have no (real) mathematical interpretation, e.g.

$0/0$, $+\infty + (-\infty)$, $0 \times \infty$, square root of a negative number

• Operations resulting in NaNs can either yield a NaN result (quiet NaN) or an exception (signalling NaN)

Special Operations

Operation                    Result
N ÷ ± Infinity               0
± Infinity × ± Infinity      ± Infinity
± non-zero ÷ 0               ± Infinity
Infinity + Infinity          Infinity
± 0 ÷ ± 0                    NaN
Infinity – Infinity          NaN
± Infinity ÷ ± Infinity      NaN
± Infinity × 0               NaN

☺ SOME FUN ☺

Floating Point Precision

• C code:

    #include <stdio.h>

    int main(void) {
        float a, b, c;
        float EPSILON = 0.0000001f;

        a = 1.345f; b = 1.123f;
        c = a + b;

        /* Exact comparison: c is a rounded float, 2.468 is a double constant */
        if (c == 2.468)
            printf("They are equal.\n");
        else
            printf("\nThey are not equal! The value of c is %.10f or %f\n", c, c);

        /* With some tolerance */
        if (((2.468 - EPSILON) < c) && (c < (2.468 + EPSILON)))
            printf("\n%.10f is equal to 2.468 with tolerance\n\n", c);

        return 0;
    }

Finding Machine Epsilon

• Pseudo-code

    Set machineEps = 1.0
    Loop
        machineEps = machineEps / 2.0
    Until ((1 + machineEps / 2.0) == 1)
    Print machineEps

Finding Machine Epsilon

• C code

    #include <stdio.h>

    int main( int argc, char **argv )
    {
        float machEps = 1.0f;

        do {
            machEps /= 2.0f;
            /* If the next epsilon yields 1, then stop, because the
               current epsilon is the machine epsilon. */
        }
        while ((float)(1.0 + (machEps / 2.0f)) != 1.0);

        printf("\nCalculated Machine epsilon: %G\n\n", machEps);
        return 0;
    }

Finding Machine Epsilon

• In Java

    public class machEps
    {
        private static void calculateMachineEpsilonFloat() {
            float machEps = 1.0f;

            // Halve until adding the next (halved) value no longer changes 1.0
            do {
                machEps /= 2.0f;
            } while ((float)(1.0 + (machEps / 2.0)) != 1.0);

            System.out.println("Calculated machine epsilon: " + machEps);
        }

        public static void main(String[] args)
        {
            calculateMachineEpsilonFloat();
        }
    }

Special Operations

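• Example

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        float a = 1.0 / 0.0;   /* non-zero / 0 gives +infinity */
        float b = a * -100;    /* infinity times a negative number gives -infinity */
        float c = b / a;       /* infinity / infinity gives NaN */
        int d = 2 * 10 + 3;    /* ordinary integer arithmetic, for comparison */

        printf("\nValue of a = %f\n\n", a);
        printf("\nValue of b = %f\n\n", b);
        printf("\nValue of c = %f\n\n", c);
        printf("\nValue of d = %d\n\n", d);
        return 0;
    }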
