
Floating-Point Arithmetic

ECS130 Winter 2017

January 27, 2017

Floating point numbers
I Floating-point representation of numbers (scientific notation) has four
components: sign, significand, base, and exponent. For example,

    −3.1416 × 10^1

has sign −, significand 3.1416, base 10, and exponent 1.

I Computers use binary (base 2) representation; each digit is the integer
0 or 1, e.g.,

    (10110)_2 = 1 × 2^4 + 0 × 2^3 + 1 × 2^2 + 1 × 2^1 + 0 × 2^0 = 22 in base 10

and

    1/10 = 1/16 + 1/32 + 1/256 + 1/512 + 1/4096 + 1/8192 + ...
         = 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + ...
         = 1.100110011... × 2^−4
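
Both facts are easy to check in MATLAB (an illustrative session, not part of
the original slides): bin2dec converts a binary string to base 10, and
printing 0.1 with extra digits exposes the rounded binary value actually
stored.

    >> bin2dec('10110')          % binary-to-decimal conversion
    ans =
        22
    >> fprintf('%.20f\n', 0.1)   % the double nearest to 1/10
    0.10000000000000000555
    >> 0.1 + 0.2 == 0.3          % false: each term is rounded in binary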

Floating point numbers
I The representation of a floating point number:

    x = ±b_0.b_1b_2···b_{p−1} × 2^E

where
  I it is normalized, i.e., b_0 = 1 (the hidden bit);
  I precision (= p) is the number of bits in the significand (mantissa),
    including the hidden bit;
  I machine epsilon ε_m = 2^{−(p−1)} is the gap between the number 1 and
    the smallest floating point number that is greater than 1.

I Special numbers:

    0, −0, Inf = ∞, −Inf = −∞, NaN = “Not a Number”.
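
These special values all arise naturally in MATLAB arithmetic (an
illustrative session, not part of the original slides):

    >> 1/0          % Inf
    >> -1/0         % -Inf
    >> 0/0          % NaN
    >> 0 == -0      % true: -0 compares equal to 0
    >> NaN == NaN   % false: NaN compares unequal to everything, even itself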

IEEE standard
I All computers designed since 1985 use the IEEE Standard for Binary

Floating-Point Arithmetic (ANSI/IEEE Std 754-1985), represent each
number as a binary number and use binary arithmetic.

I Essentials of the IEEE standard:

1. consistent representation of floating-point numbers by all machines
adopting the standard;

2. correctly rounded floating-point operations, using various rounding
modes;

3. consistent treatment of exceptional situations such as division by zero.

Citation of 1989 Turing Award to William Kahan: “… his devotion to
providing systems that can be safely and robustly used for numerical
computations has impacted the lives of virtually anyone who will use a
computer in the future”

IEEE single precision format
I Single format is 32 bits (= 4 bytes) long:

    | s |     E      |             f             |
     sign   exponent          fraction
      1      8 bits            23 bits

where f holds the bits after the binary point.

I It represents the number (−1)^s · (1.f) × 2^{E−127}.

I The leading 1 in the fraction need not be stored explicitly since it is
always 1 (hidden bit).

I Precision p = 24 and machine epsilon ε_m = 2^−23 ≈ 1.2 × 10^−7.

I The bias “E − 127” in the exponent avoids the need to store a sign bit
for the exponent.
E_min = (00000001)_2 = (1)_10, E_max = (11111110)_2 = (254)_10.

I The range of positive normalized numbers:

    N_min = 1.00···0 × 2^{E_min−127} = 2^−126 ≈ 1.2 × 10^−38
    N_max = 1.11···1 × 2^{E_max−127} ≈ 2^128 ≈ 3.4 × 10^38.

I Special representations for 0, ±∞ and NaN.
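
These fields can be inspected in MATLAB by reinterpreting the bits of a
single, and the format's constants are built in (illustrative, not from
the original slides):

    >> dec2bin(typecast(single(1), 'uint32'), 32)   % s | E | f = 0 | 01111111 | 0...0
    ans =
    00111111100000000000000000000000
    >> eps('single')      % 2^-23 ≈ 1.1921e-07
    >> realmin('single')  % 2^-126 ≈ 1.1755e-38
    >> realmax('single')  % ≈ 3.4028e+38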

IEEE double precision format
I IEEE double format is 64 bits (= 8 bytes) long:

    | s |     E      |             f             |
     sign   exponent          fraction
      1      11 bits           52 bits

where f holds the bits after the binary point.

I It represents the number (−1)^s · (1.f) × 2^{E−1023}.

I Precision p = 53 and machine epsilon ε_m = 2^−52 ≈ 2.2 × 10^−16.

I The range of positive normalized numbers is from

    N_min = 2^−1022 ≈ 2.2 × 10^−308
    N_max = 1.11···1 × 2^1023 ≈ 2^1024 ≈ 1.8 × 10^308.

I Special representations for 0, ±∞ and NaN.
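
MATLAB's default numeric type is exactly this format, and the corresponding
constants are built in (illustrative, not from the original slides):

    >> eps          % 2^-52 ≈ 2.2204e-16
    >> realmin      % 2^-1022 ≈ 2.2251e-308
    >> realmax      % ≈ 1.7977e+308
    >> 2*realmax    % overflows to Inf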

Rounding modes
I Let a positive real number x be in the normalized range, i.e.,
N_min ≤ x ≤ N_max, and write it in the normalized form

    x = 1.b_1b_2···b_{p−1}b_pb_{p+1}... × 2^E.

I Then the closest floating-point number less than or equal to x is

    x_− = 1.b_1b_2···b_{p−1} × 2^E,

i.e., x_− is obtained by truncating.

I The next floating-point number bigger than x_− (also the next one
bigger than x) is

    x_+ = (1.b_1b_2···b_{p−1} + 0.00···01) × 2^E.

I If x is negative, the situation is reversed.
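
The gap x_+ − x_− is what MATLAB's eps(x) reports: the spacing from |x| to
the next larger floating-point number (illustrative, not from the original
slides):

    >> eps(1)             % 2^-52: the gap just above 1
    >> eps(2^52)          % 1: doubles in [2^52, 2^53] are exactly the integers
    >> 2^53 + 1 == 2^53   % true: 2^53 + 1 rounds back to 2^53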

Rounding modes, cont’d

Four rounding modes:

1. round down: fl(x) = x_−

2. round up: fl(x) = x_+

3. round towards zero:
   fl(x) = x_− if x ≥ 0,
   fl(x) = x_+ if x ≤ 0

4. round to nearest (IEEE default rounding mode):
   fl(x) = x_− or x_+, whichever is nearer to x.

Note: except that if x > N_max, fl(x) = ∞, and if x < −N_max, fl(x) = −∞.
In the case of a tie, i.e., when x_− and x_+ are the same distance from x,
the one with its least significant bit equal to zero is chosen.

Rounding error
I When round to nearest is in effect,

    relerr(x) = |fl(x) − x| / |x| ≤ (1/2) ε_m.

I Therefore, we have

    relerr ≤ (1/2) · 2^{1−24} = 2^−24 ≈ 5.96 × 10^−8    for single precision,
    relerr ≤ (1/2) · 2^−52 = 2^−53 ≈ 1.11 × 10^−16      for double precision.

Floating-point arithmetic
I IEEE rules for correctly rounded floating-point operations: if x and y
are correctly rounded floating-point numbers, then

    fl(x + y) = (x + y)(1 + δ)
    fl(x − y) = (x − y)(1 + δ)
    fl(x × y) = (x × y)(1 + δ)
    fl(x / y) = (x / y)(1 + δ)

where |δ| ≤ (1/2) ε_m for round to nearest.

I The IEEE standard also requires that correctly rounded remainder and
square root operations be provided.

Floating-point arithmetic, cont’d
IEEE standard response to exceptions:

    Event             | Example                     | Set result to
    ------------------+-----------------------------+-------------------------
    Invalid operation | 0/0, 0 × ∞                  | NaN
    Division by zero  | finite nonzero/0            | ±∞
    Overflow          | |x| > N_max                 | ±∞ or ±N_max
    Underflow         | x ≠ 0, |x| < N_min          | ±0, ±N_min or subnormal
    Inexact           | whenever fl(x ◦ y) ≠ x ◦ y  | correctly rounded value

Floating-point arithmetic error
I Let x̂ and ŷ be floating-point numbers with

    x̂ = x(1 + τ_1) and ŷ = y(1 + τ_2),   for |τ_i| ≤ τ ≪ 1,

where the τ_i could be the relative errors incurred in collecting/getting
the data from the original source or in previous operations.

I Question: how do the four basic arithmetic operations behave?

Floating-point arithmetic error: +, −
Addition and subtraction:

    fl(x̂ + ŷ) = (x̂ + ŷ)(1 + δ),   |δ| ≤ (1/2) ε_m
              = x(1 + τ_1)(1 + δ) + y(1 + τ_2)(1 + δ)
              = x + y + x(τ_1 + δ + O(τ ε_m)) + y(τ_2 + δ + O(τ ε_m))
              = (x + y) [ 1 + x/(x+y) · (τ_1 + δ + O(τ ε_m))
                            + y/(x+y) · (τ_2 + δ + O(τ ε_m)) ]
              ≡ (x + y)(1 + δ̂),

where δ̂ can be bounded as follows:

    |δ̂| ≤ (|x| + |y|)/|x + y| · (τ + (1/2) ε_m + O(τ ε_m)).

Floating-point arithmetic error: +, −
Three possible cases:

1. If x and y have the same sign, i.e., xy > 0, then |x + y| = |x| + |y|;
this implies

    |δ̂| ≤ τ + (1/2) ε_m + O(τ ε_m) ≪ 1.

Thus fl(x̂ + ŷ) approximates x + y well.

2. If x ≈ −y, then |x + y| ≈ 0 and (|x| + |y|)/|x + y| ≫ 1; this implies
that |δ̂| could be nearly or much bigger than 1.
This is the so-called catastrophic cancellation; it causes relative errors
or uncertainties already present in x̂ and ŷ to be magnified.

3. In general, if (|x| + |y|)/|x + y| is not too big, fl(x̂ + ŷ) provides a
good approximation to x + y.

Catastrophic cancellation: example 1
I Computing √(x+1) − √x straightforwardly causes substantial loss of
significant digits for large x:

    x        | fl(√(x+1))              | fl(√x)                  | fl(fl(√(x+1)) − fl(√x))
    ---------+-------------------------+-------------------------+------------------------
    1.00e+10 | 1.00000000004999994e+05 | 1.00000000000000000e+05 | 4.99999441672116518e-06
    1.00e+11 | 3.16227766018419061e+05 | 3.16227766016837908e+05 | 1.58115290105342865e-06
    1.00e+12 | 1.00000000000050000e+06 | 1.00000000000000000e+06 | 5.00003807246685028e-07
    1.00e+13 | 3.16227766016853740e+06 | 3.16227766016837955e+06 | 1.57859176397323608e-07
    1.00e+14 | 1.00000000000000503e+07 | 1.00000000000000000e+07 | 5.02914190292358398e-08
    1.00e+15 | 3.16227766016838104e+07 | 3.16227766016837917e+07 | 1.86264514923095703e-08
    1.00e+16 | 1.00000000000000000e+08 | 1.00000000000000000e+08 | 0.00000000000000000e+00
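
A minimal MATLAB loop that reproduces this experiment (the exact digits may
vary slightly by platform; not part of the original slides):

    for k = 10:16
        x = 10^k;
        d = sqrt(x + 1) - sqrt(x);   % difference of two nearly equal numbers
        fprintf('%.2e   %.17e\n', x, d);
    end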

Catastrophic cancellation: example 1
I Catastrophic cancellation can sometimes be avoided if a formula is
properly reformulated.

I For example, one can compute √(x+1) − √x almost to full precision
by using the equality

    √(x+1) − √x = 1 / (√(x+1) + √x).

Consequently, the computed results are

    x        | fl(1/(√(x+1) + √x))
    ---------+----------------------
    1.00e+10 | 4.999999999875000e-06
    1.00e+11 | 1.581138830080237e-06
    1.00e+12 | 4.999999999998749e-07
    1.00e+13 | 1.581138830084150e-07
    1.00e+14 | 4.999999999999987e-08
    1.00e+15 | 1.581138830084189e-08
    1.00e+16 | 5.000000000000000e-09
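
The reformulated computation in MATLAB, over the same values of x (not part
of the original slides):

    for k = 10:16
        x = 10^k;
        d = 1 / (sqrt(x + 1) + sqrt(x));   % no subtraction of nearly equal numbers
        fprintf('%.2e   %.15e\n', x, d);
    end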

Catastrophic cancellation: example 2
I Consider the function

    h(x) = (1 − cos x) / x^2.

I Note that 0 ≤ h(x) < 1/2 for all x ≠ 0.

I Let x = 1.2 × 10^−8; then the computed fl(h(x)) = 0.770988... is
completely wrong!

I Alternatively, the function can be re-written as

    h(x) = (sin(x/2) / (x/2))^2.

I Consequently, for x = 1.2 × 10^−8, the computed function
fl(h(x)) = 0.499999... < 1/2 is fine!

Floating-point arithmetic error: ×, /
Multiplication and division:

    fl(x̂ × ŷ) = (x̂ × ŷ)(1 + δ) = xy(1 + τ_1)(1 + τ_2)(1 + δ) ≡ xy(1 + δ̂_×),
    fl(x̂ / ŷ) = (x̂ / ŷ)(1 + δ) = (x/y)(1 + τ_1)(1 + τ_2)^−1(1 + δ) ≡ (x/y)(1 + δ̂_÷),

where

    δ̂_× = τ_1 + τ_2 + δ + O(τ ε_m),
    δ̂_÷ = τ_1 − τ_2 + δ + O(τ ε_m).

Thus |δ̂_×| ≤ 2τ + (1/2) ε_m + O(τ ε_m) and |δ̂_÷| ≤ 2τ + (1/2) ε_m + O(τ ε_m).
Multiplication and division are very well-behaved!

Reading
I Section 1.7 of Numerical Computing with MATLAB by C. Moler
I Website discussions of numerical disasters:
  I T. Huckle, Collection of software bugs,
    http://www5.in.tum.de/~huckle/bugse.html
  I K. Vuik, Some disasters caused by numerical errors,
    http://ta.twi.tudelft.nl/nw/users/vuik/wi211/disasters.html
  I D. Arnold, Some disasters attributable to bad numerical computing,
    http://www.ima.umn.edu/~arnold/disasters/disasters.html
I In-depth material: D. Goldberg, What every computer scientist should know
  about floating-point arithmetic, ACM Computing Surveys, Vol. 23(1),
  pp. 5-48, 1991

LU factorization – revisited: the need for pivoting in LU factorization, numerically
I LU factorization without pivoting:

    A = [ .0001 1; 1 1 ] = [ 10^−4 1; 1 1 ] = LU = [ 1 0; l_21 1 ][ u_11 u_12; 0 u_22 ]

I In three decimal-digit floating-point arithmetic, we have

    L̂ = [ 1 0; fl(1/10^−4) 1 ] = [ 1 0; 10^4 1 ],
    Û = [ 10^−4 1; 0 fl(1 − 10^4 · 1) ] = [ 10^−4 1; 0 −10^4 ].

I Check:

    L̂Û = [ 1 0; 10^4 1 ][ 10^−4 1; 0 −10^4 ] = [ 10^−4 1; 1 0 ] ≉ A.

LU factorization – revisited: the need for pivoting in LU factorization, numerically
I Consider solving Ax = [ 1; 2 ] for x using this LU factorization.

I Solving L̂ŷ = [ 1; 2 ] gives

    ŷ_1 = fl(1/1) = 1,
    ŷ_2 = fl(2 − 10^4 · 1) = −10^4;

note that the value 2 has been “lost” by subtracting 10^4 from it.

I Solving Ûx̂ = ŷ = [ 1; −10^4 ] gives

    x̂_2 = fl((−10^4)/(−10^4)) = 1,
    x̂_1 = fl((1 − 1)/10^−4) = 0.

I x̂ = [ x̂_1; x̂_2 ] = [ 0; 1 ] is a completely erroneous solution; the
correct answer is x ≈ [ 1; 1 ].

LU factorization – revisited: the need for pivoting in LU factorization, numerically
I LU factorization with partial pivoting: PA = LU.

I Exchange the order of the rows of A, i.e., take

    P = [ 0 1; 1 0 ].

Then the LU factorization of PA is given by

    L̂ = [ 1 0; fl(10^−4/1) 1 ] = [ 1 0; 10^−4 1 ],
    Û = [ 1 1; 0 fl(1 − 10^−4 · 1) ] = [ 1 1; 0 1 ].

I The computed L̂Û approximates PA very accurately.

I As a result, the computed solution x̂ is also perfect!
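
A double-precision MATLAB sketch of the same phenomenon (not from the
original slides; the pivot is shrunk to 10^−20 so that the rounding loss
shown above in three-digit arithmetic also appears in doubles):

    % Solve [1e-20 1; 1 1] x = [1; 2]; the true solution is essentially [1; 1].
    A = [1e-20 1; 1 1];   b = [1; 2];

    % Gaussian elimination WITHOUT pivoting
    l21 = A(2,1)/A(1,1);             % 1e20
    u22 = A(2,2) - l21*A(1,2);       % fl(1 - 1e20) = -1e20: the 1 is lost
    y2  = b(2)  - l21*b(1);          % fl(2 - 1e20) = -1e20: the 2 is lost
    x2  = y2/u22;                    % 1
    x1  = (b(1) - A(1,2)*x2)/A(1,1); % (1 - 1)/1e-20 = 0
    x_nopivot = [x1; x2]             % [0; 1], completely erroneous

    % WITH partial pivoting (MATLAB's lu pivots by default)
    [L, U, P] = lu(A);
    x_pivot = U\(L\(P*b))            % ≈ [1; 1]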