Floating-Point Arithmetic
ECS130 Winter 2017
January 27, 2017
Floating point numbers
▶ Floating-point representation of numbers (scientific notation) has four
components, for example,
    −3.1416 × 10^1
with sign "−", significand 3.1416, base 10, and exponent 1.
▶ Computers use binary (base 2) representation; each digit is the integer
0 or 1, e.g.,
    10110 = 1×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 0×2^0 = 22 in base 10
and
    1/10 = 1/16 + 1/32 + 1/256 + 1/512 + 1/4096 + 1/8192 + ···
         = 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + ···
         = 1.100110011... × 2^−4
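This truncated binary expansion is easy to observe in practice. A minimal Python sketch (not from the original slides): printing the exact decimal value of the double nearest to 1/10 shows it is slightly larger than 0.1.

```python
from decimal import Decimal

# The double stored for 0.1 is the infinite expansion 1.100110011...x2^-4
# rounded to 53 significant bits, which is slightly larger than 1/10:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The representation error surfaces in ordinary arithmetic:
print(0.1 + 0.2 == 0.3)   # False
print(0.1 + 0.2)          # 0.30000000000000004
```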
Floating point numbers
▶ The representation of a floating point number:
    x = ±b_0.b_1b_2···b_{p−1} × 2^E
where
  ▶ it is normalized, i.e., b_0 = 1 (the hidden bit);
  ▶ precision (= p) is the number of bits in the significand (mantissa),
    including the hidden bit;
  ▶ machine epsilon ε_m = 2^{−(p−1)} is the gap between the number 1 and the
    smallest floating point number that is greater than 1 (see the sketch
    after this list).
▶ Special numbers:
    0, −0, Inf = ∞, −Inf = −∞, NaN = "Not a Number".
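As an illustration, here is a small Python sketch (Python's float is the IEEE double, so p = 53): machine epsilon is found by repeated halving.

```python
import sys

# Halve eps until 1 + eps/2 rounds back to 1; the last surviving eps is the
# gap between 1 and the next larger floating point number.
eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0

print(eps)                     # 2.220446049250313e-16 = 2**-52
print(sys.float_info.epsilon)  # the same value, as reported by the runtime

# The special numbers are available too:
print(float('inf'), -float('inf'), float('nan'))
```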
IEEE standard
▶ All computers designed since 1985 use the IEEE Standard for Binary
Floating-Point Arithmetic (ANSI/IEEE Std 754-1985): they represent each
number as a binary number and use binary arithmetic.
▶ Essentials of the IEEE standard:
  1. consistent representation of floating-point numbers by all machines
     adopting the standard;
  2. correctly rounded floating-point operations, using various rounding
     modes;
  3. consistent treatment of exceptional situations such as division by zero.
Citation of the 1989 Turing Award to William Kahan: "... his devotion to
providing systems that can be safely and robustly used for numerical
computations has impacted the lives of virtually anyone who will use a
computer in the future."
IEEE single precision format
▶ The single format is 32 bits (= 4 bytes) long (see the sketch after this
list):

    | s |    E     |            f             |
    sign  exponent          fraction
    (1)   (8 bits)          (23 bits)

  with the binary point just before f.
▶ It represents the number (−1)^s · (1.f) × 2^{E−127}.
▶ The leading 1 in the fraction need not be stored explicitly since it is
always 1 (the hidden bit).
▶ Precision p = 24 and machine epsilon ε_m = 2^{−23} ≈ 1.2 × 10^{−7}.
▶ The "E − 127" in the exponent is to avoid the need for storage of a
sign bit:
    E_min = (00000001)_2 = (1)_10,  E_max = (11111110)_2 = (254)_10.
▶ The range of positive normalized numbers:
    N_min = 1.00···0 × 2^{E_min−127} = 2^{−126} ≈ 1.2 × 10^{−38}
    N_max = 1.11···1 × 2^{E_max−127} ≈ 2^{128} ≈ 3.4 × 10^{38}.
▶ Special representations for 0, ±∞ and NaN.
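The three fields can be inspected directly. A minimal Python sketch (not part of the slides; the helper name float32_fields is ours):

```python
import struct

def float32_fields(x):
    # Pack x into the IEEE single format and split off sign, exponent, fraction.
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    s = bits >> 31            # 1 sign bit
    E = (bits >> 23) & 0xFF   # 8 exponent bits (biased by 127)
    f = bits & 0x7FFFFF       # 23 fraction bits (hidden bit not stored)
    return s, E, f

s, E, f = float32_fields(-3.1416)
print(s, E, f)                # s = 1, E = 128, i.e. true exponent E - 127 = 1
# Reassemble (-1)^s * (1.f) * 2^(E-127); agrees with -3.1416 up to
# single-precision rounding:
print((-1) ** s * (1 + f / 2 ** 23) * 2.0 ** (E - 127))
```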
IEEE double precision format
▶ The IEEE double format is 64 bits (= 8 bytes) long (see the sketch after
this list):

    | s |    E      |            f             |
    sign  exponent           fraction
    (1)   (11 bits)          (52 bits)

  with the binary point just before f.
▶ It represents the number (−1)^s · (1.f) × 2^{E−1023}.
▶ Precision p = 53 and machine epsilon ε_m = 2^{−52} ≈ 2.2 × 10^{−16}.
▶ The range of positive normalized numbers is from
    N_min = 2^{−1022} ≈ 2.2 × 10^{−308}
to
    N_max = 1.11···1 × 2^{1023} ≈ 2^{1024} ≈ 1.8 × 10^{308}.
▶ Special representations for 0, ±∞ and NaN.
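These constants are visible at runtime, since Python's float is the IEEE double (a small illustrative sketch):

```python
import sys

fi = sys.float_info
print(fi.mant_dig)  # 53: the precision p, counting the hidden bit
print(fi.epsilon)   # 2.220446049250313e-16, i.e. 2**-52
print(fi.min)       # 2.2250738585072014e-308, i.e. Nmin = 2**-1022
print(fi.max)       # 1.7976931348623157e+308, i.e. Nmax, just under 2**1024
```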
Rounding modes
▶ Let a positive real number x be in the normalized range, i.e.,
N_min ≤ x ≤ N_max, and write it in the normalized form
    x = 1.b_1b_2···b_{p−1}b_pb_{p+1}... × 2^E.
▶ Then the closest floating-point number less than or equal to x is
    x_− = 1.b_1b_2···b_{p−1} × 2^E,
i.e., x_− is obtained by truncating.
▶ The next floating-point number bigger than x_− (also the next one
bigger than x) is
    x_+ = (1.b_1b_2···b_{p−1} + 0.00···01) × 2^E.
▶ If x is negative, the situation is reversed.
Rounding modes, cont'd
Four rounding modes:
1. round down: fl(x) = x_−
2. round up: fl(x) = x_+
3. round towards zero: fl(x) = x_− if x ≥ 0; fl(x) = x_+ if x ≤ 0
4. round to nearest (IEEE default rounding mode): fl(x) = x_− or x_+,
   whichever is nearer to x.
Note: except that if x > N_max, then fl(x) = ∞, and if x < −N_max, then
fl(x) = −∞. In the case of a tie, i.e., x_− and x_+ are the same distance
from x, the one with its least significant bit equal to zero is chosen
(see the sketch below).
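The tie-breaking rule can be observed directly in Python (doubles, where the gap above 1 is 2^−52):

```python
# 1 + 2**-53 lies exactly halfway between the doubles 1 and 1 + 2**-52.
# Round to nearest breaks the tie toward the candidate whose last significand
# bit is zero, which here is 1 itself:
print(1.0 + 2.0 ** -53 == 1.0)                # True (tie, rounds to even)

# Perturb the halfway point slightly and it is no longer a tie:
print(1.0 + (2.0 ** -53 + 2.0 ** -60) > 1.0)  # True (rounds up to 1 + 2**-52)
```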
Rounding error
▶ When round to nearest is in effect,
    relerr(x) = |fl(x) − x| / |x| ≤ (1/2) ε_m.
▶ Therefore, we have
    relerr ≤ (1/2) · 2^{−23} = 2^{−24} ≈ 5.96 × 10^{−8}   (single precision)
    relerr ≤ (1/2) · 2^{−52} = 2^{−53} ≈ 1.11 × 10^{−16}  (double precision).
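The bound can be checked exactly with rational arithmetic. A Python sketch using the fractions module, with 1/3 as a convenient value that is not representable in binary:

```python
from fractions import Fraction

x = Fraction(1, 3)       # exact value; infinite binary expansion
fx = Fraction(float(x))  # the nearest double, converted back exactly

relerr = abs(fx - x) / x
print(relerr <= Fraction(1, 2 ** 53))  # True: within (1/2)*eps_m = 2**-53
print(float(relerr))                   # ~5.6e-17, under the 1.11e-16 bound
```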
Floating-point arithmetic
▶ IEEE rules for correctly rounded floating-point operations:
if x and y are correctly rounded floating-point numbers, then
    fl(x + y) = (x + y)(1 + δ)
    fl(x − y) = (x − y)(1 + δ)
    fl(x × y) = (x × y)(1 + δ)
    fl(x / y) = (x / y)(1 + δ)
where |δ| ≤ (1/2) ε_m for round to nearest.
▶ The IEEE standard also requires that correctly rounded remainder and
square root operations be provided.
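Both operations are exposed by Python's math module, which maps to the IEEE operations on conforming platforms (a small illustration, not from the slides):

```python
import math

# IEEE remainder: x - n*y with n the integer nearest to x/y, so the result
# can be negative; here n = round(5/3) = 2 and 5 - 2*3 = -1.
print(math.remainder(5.0, 3.0))  # -1.0

# sqrt is correctly rounded: a single rounding of the exact square root.
r = math.sqrt(2.0)
print(r)      # 1.4142135623730951
print(r * r)  # 2.0000000000000004: the *squaring* rounds, not the sqrt
```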
Floating-point arithmetic, cont’d
IEEE standard responses to exceptions:

    Event               Example                       Set result to
    Invalid operation   0/0, 0 × ∞                    NaN
    Division by zero    finite nonzero/0              ±∞
    Overflow            |x| > N_max                   ±∞ or ±N_max
    Underflow           x ≠ 0, |x| < N_min            ±0, ±N_min or subnormal
    Inexact             whenever fl(x ◦ y) ≠ x ◦ y    correctly rounded value
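These default responses can be observed with NumPy, which follows IEEE semantics instead of raising Python exceptions (a sketch for illustration; np.errstate merely silences the warnings):

```python
import numpy as np

with np.errstate(divide='ignore', invalid='ignore', over='ignore'):
    print(np.float64(1.0) / np.float64(0.0))        # inf  (division by zero)
    print(np.float64(0.0) / np.float64(0.0))        # nan  (invalid operation)
    print(np.float64(1e300) * np.float64(1e300))    # inf  (overflow)
    print(np.float64(1e-300) * np.float64(1e-300))  # 0.0  (underflow: 1e-600
                                                    # is below even subnormals)
```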
Floating-point arithmetic error
▶ Let x̂ and ŷ be floating-point numbers such that
    x̂ = x(1 + τ_1) and ŷ = y(1 + τ_2),  for |τ_i| ≤ τ ≪ 1,
where the τ_i could be the relative errors from the process of
collecting/getting the data from the original source or from previous
operations.
▶ Question: how do the four basic arithmetic operations behave?
Floating-point arithmetic error: +,−
Addition and subtraction:
    fl(x̂ + ŷ) = (x̂ + ŷ)(1 + δ),  |δ| ≤ (1/2) ε_m
              = x(1 + τ_1)(1 + δ) + y(1 + τ_2)(1 + δ)
              = x + y + x(τ_1 + δ + O(τ ε_m)) + y(τ_2 + δ + O(τ ε_m))
              = (x + y)·(1 + (x/(x + y))·(τ_1 + δ + O(τ ε_m))
                           + (y/(x + y))·(τ_2 + δ + O(τ ε_m)))
              ≡ (x + y)(1 + δ̂),
where δ̂ can be bounded as follows:
    |δ̂| ≤ ((|x| + |y|)/|x + y|)·(τ + (1/2) ε_m + O(τ ε_m)).
Floating-point arithmetic error: +,−
Three possible cases:
1. If x and y have the same sign, i.e., xy > 0, then |x + y| = |x| + |y|;
this implies
    |δ̂| ≤ τ + (1/2) ε_m + O(τ ε_m) ≪ 1.
Thus fl(x̂ + ŷ) approximates x + y well.
2. If x ≈ −y, then |x + y| ≈ 0, so (|x| + |y|)/|x + y| ≫ 1; this implies
that |δ̂| could be nearly as big as, or much bigger than, 1.
This is the so-called catastrophic cancellation: it causes relative errors
or uncertainties already present in x̂ and ŷ to be magnified.
3. In general, if (|x| + |y|)/|x + y| is not too big, fl(x̂ + ŷ) provides a
good approximation to x + y.
Catastrophic cancellation: example 1
▶ Computing √(x+1) − √x straightforwardly causes substantial loss of
significant digits for large x:

    x         fl(√(x+1))                fl(√x)                    fl(fl(√(x+1)) − fl(√x))
    1.00e+10  1.00000000004999994e+05   1.00000000000000000e+05   4.99999441672116518e-06
    1.00e+11  3.16227766018419061e+05   3.16227766016837908e+05   1.58115290105342865e-06
    1.00e+12  1.00000000000050000e+06   1.00000000000000000e+06   5.00003807246685028e-07
    1.00e+13  3.16227766016853740e+06   3.16227766016837955e+06   1.57859176397323608e-07
    1.00e+14  1.00000000000000503e+07   1.00000000000000000e+07   5.02914190292358398e-08
    1.00e+15  3.16227766016838104e+07   3.16227766016837917e+07   1.86264514923095703e-08
    1.00e+16  1.00000000000000000e+08   1.00000000000000000e+08   0.00000000000000000e+00
Catastrophic cancellation: example 1
▶ Catastrophic cancellation can sometimes be avoided if a formula is
properly reformulated.
▶ For example, one can compute √(x+1) − √x almost to full precision
by using the equality
    √(x+1) − √x = 1/(√(x+1) + √x).
Consequently, the computed results are

    x         fl(1/(√(x+1) + √x))
    1.00e+10  4.999999999875000e-06
    1.00e+11  1.581138830080237e-06
    1.00e+12  4.999999999998749e-07
    1.00e+13  1.581138830084150e-07
    1.00e+14  4.999999999999987e-08
    1.00e+15  1.581138830084189e-08
    1.00e+16  5.000000000000000e-09
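One row of the two tables can be reproduced in a few lines of Python (doubles, matching the precision the tables appear to use):

```python
import math

x = 1.0e+12
naive = math.sqrt(x + 1.0) - math.sqrt(x)           # cancels leading digits
stable = 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))  # no subtraction, no cancellation

print(naive)   # 5.00003807246685e-07  : only ~6 digits correct
print(stable)  # 4.999999999998749e-07 : essentially full precision
```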
Catastrophic cancellation: example 2
▶ Consider the function
    h(x) = (1 − cos x)/x².
Note that 0 ≤ h(x) < 1/2 for all x ≠ 0.
▶ Let x = 1.2 × 10^−8; then the computed fl(h(x)) = 0.770988... is
completely wrong!
▶ Alternatively, using 1 − cos x = 2 sin²(x/2), the function can be
re-written as
    h(x) = (1/2) · (sin(x/2)/(x/2))².
▶ Consequently, for x = 1.2 × 10^−8, the computed
fl(h(x)) = 0.499999... < 1/2 is fine!

Floating-point arithmetic error: ×, /
Multiplication and division:
    fl(x̂ × ŷ) = (x̂ × ŷ)(1 + δ) = xy(1 + τ_1)(1 + τ_2)(1 + δ) ≡ xy(1 + δ̂_×),
    fl(x̂ / ŷ) = (x̂ / ŷ)(1 + δ) = (x/y)(1 + τ_1)(1 + τ_2)^{−1}(1 + δ) ≡ (x/y)(1 + δ̂_÷),
where
    δ̂_× = τ_1 + τ_2 + δ + O(τ ε_m)
    δ̂_÷ = τ_1 − τ_2 + δ + O(τ ε_m).
Thus |δ̂_×| ≤ 2τ + (1/2) ε_m + O(τ ε_m) and |δ̂_÷| ≤ 2τ + (1/2) ε_m + O(τ ε_m).
Multiplication and division are very well-behaved!

Reading
▶ Section 1.7 of Numerical Computing with MATLAB by C. Moler
▶ Websites discussing numerical disasters:
  ▶ T. Huckle, Collection of software bugs,
    http://www5.in.tum.de/~huckle/bugse.html
  ▶ K. Vuik, Some disasters caused by numerical errors,
    http://ta.twi.tudelft.nl/nw/users/vuik/wi211/disasters.html
  ▶ D. Arnold, Some disasters attributable to bad numerical computing,
    http://www.ima.umn.edu/~arnold/disasters/disasters.html
▶ In-depth material: D. Goldberg, What every computer scientist should
know about floating-point arithmetic, ACM Computing Surveys, Vol. 23(1),
pp. 5-48, 1991

LU factorization – revisited: the need of pivoting in LU factorization, numerically
▶ LU factorization without pivoting:
    A = [.0001 1; 1 1] = [10^−4 1; 1 1] = LU = [1 0; l_21 1]·[u_11 u_12; 0 u_22]
▶ In three-decimal-digit floating-point arithmetic, we have
    L̂ = [1 0; fl(1/10^−4) 1] = [1 0; 10^4 1],
    Û = [10^−4 1; 0 fl(1 − 10^4 · 1)] = [10^−4 1; 0 −10^4].
▶ Check:
    L̂Û = [1 0; 10^4 1]·[10^−4 1; 0 −10^4] = [10^−4 1; 1 0] ≉ A.

LU factorization – revisited: the need of pivoting in LU factorization, numerically
▶ Consider solving Ax = [1; 2] for x using this LU factorization.
▶ Solving L̂y = [1; 2] gives
    ŷ_1 = fl(1/1) = 1
    ŷ_2 = fl(2 − 10^4 · 1) = −10^4;
note that the value 2 has been "lost" by subtracting 10^4 from it.
▶ Solving Ûx = ŷ = [1; −10^4] gives
    x̂_2 = fl((−10^4)/(−10^4)) = 1
    x̂_1 = fl((1 − 1)/10^−4) = 0.
▶ x̂ = [x̂_1; x̂_2] = [0; 1] is a completely erroneous solution; the correct
answer is x ≈ [1; 1].

LU factorization – revisited: the need of pivoting in LU factorization, numerically
▶ LU factorization with partial pivoting: PA = LU.
▶ By exchanging the order of the rows of A, i.e., P = [0 1; 1 0], the LU
factorization of PA is given by
    L̂ = [1 0; fl(10^−4/1) 1] = [1 0; 10^−4 1],
    Û = [1 1; 0 fl(1 − 10^−4 · 1)] = [1 1; 0 1].
▶ The computed L̂Û approximates PA very accurately.
▶ As a result, the computed solution x is also perfect!
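A double-precision analogue of this experiment (a NumPy sketch, not from the slides): shrinking the pivot to 10^−20 makes float64 absorb the subtracted terms exactly as the three-digit arithmetic did.

```python
import numpy as np

A = np.array([[1e-20, 1.0],
              [1.0,   1.0]])
b = np.array([1.0, 2.0])

# LU without pivoting, done by hand:
l21 = A[1, 0] / A[0, 0]          # 1e20
u22 = A[1, 1] - l21 * A[0, 1]    # fl(1 - 1e20) = -1e20: the 1 is absorbed
y2 = b[1] - l21 * b[0]           # fl(2 - 1e20) = -1e20: the 2 is "lost"
x2 = y2 / u22                    # 1.0
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
print(x1, x2)                    # 0.0 1.0 -- completely erroneous

# A solver with partial pivoting recovers the correct answer x ~ [1, 1]:
print(np.linalg.solve(A, b))     # [1. 1.]
```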