Floating-Point Arithmetic
ECS130 Winter 2017
January 27, 2017
Floating point numbers
▶ Floating-point representation of numbers (scientific notation) has four
components, for example,
    −3.1416 × 10^1
with sign "−", significand 3.1416, base 10, and exponent 1.
▶ Computers use binary (base 2) representation; each digit is the integer
0 or 1, e.g.,
    10110 = 1×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 0×2^0 = 22 in base 10
and
    1/10 = 1/16 + 1/32 + 1/256 + 1/512 + 1/4096 + 1/8192 + ···
         = 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + ···
         = 1.100110011... × 2^−4
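This truncated binary expansion is easy to observe in practice. A minimal Python sketch (not from the original slides): printing the exact decimal value of the double nearest to 1/10 shows it is slightly larger than 0.1.

```python
from decimal import Decimal

# The double stored for 0.1 is the infinite expansion 1.100110011...x2^-4
# rounded to 53 significant bits, which is slightly larger than 1/10:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The representation error surfaces in ordinary arithmetic:
print(0.1 + 0.2 == 0.3)   # False
print(0.1 + 0.2)          # 0.30000000000000004
```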
Floating point numbers
▶ The representation of a floating point number:
    x = ±b_0.b_1b_2···b_{p−1} × 2^E
where
  ▶ it is normalized, i.e., b_0 = 1 (the hidden bit);
  ▶ precision (= p) is the number of bits in the significand (mantissa),
    including the hidden bit;
  ▶ machine epsilon ε_m = 2^{−(p−1)} is the gap between the number 1 and the
    smallest floating point number that is greater than 1 (see the sketch
    after this list).
▶ Special numbers:
    0, −0, Inf = ∞, −Inf = −∞, NaN = "Not a Number".
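As an illustration, here is a small Python sketch (Python's float is the IEEE double, so p = 53): machine epsilon is found by repeated halving.

```python
import sys

# Halve eps until 1 + eps/2 rounds back to 1; the last surviving eps is the
# gap between 1 and the next larger floating point number.
eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0

print(eps)                     # 2.220446049250313e-16 = 2**-52
print(sys.float_info.epsilon)  # the same value, as reported by the runtime

# The special numbers are available too:
print(float('inf'), -float('inf'), float('nan'))
```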
IEEE standard
▶ All computers designed since 1985 use the IEEE Standard for Binary
Floating-Point Arithmetic (ANSI/IEEE Std 754-1985): they represent each
number as a binary number and use binary arithmetic.
▶ Essentials of the IEEE standard:
  1. consistent representation of floating-point numbers by all machines
     adopting the standard;
  2. correctly rounded floating-point operations, using various rounding
     modes;
  3. consistent treatment of exceptional situations such as division by zero.
Citation of the 1989 Turing Award to William Kahan: "... his devotion to
providing systems that can be safely and robustly used for numerical
computations has impacted the lives of virtually anyone who will use a
computer in the future."
IEEE single precision format
▶ The single format is 32 bits (= 4 bytes) long (see the sketch after this
list):

    | s |    E     |            f             |
    sign  exponent          fraction
    (1)   (8 bits)          (23 bits)

  with the binary point just before f.
▶ It represents the number (−1)^s · (1.f) × 2^{E−127}.
▶ The leading 1 in the fraction need not be stored explicitly since it is
always 1 (the hidden bit).
▶ Precision p = 24 and machine epsilon ε_m = 2^{−23} ≈ 1.2 × 10^{−7}.
▶ The "E − 127" in the exponent is to avoid the need for storage of a
sign bit:
    E_min = (00000001)_2 = (1)_10,  E_max = (11111110)_2 = (254)_10.
▶ The range of positive normalized numbers:
    N_min = 1.00···0 × 2^{E_min−127} = 2^{−126} ≈ 1.2 × 10^{−38}
    N_max = 1.11···1 × 2^{E_max−127} ≈ 2^{128} ≈ 3.4 × 10^{38}.
▶ Special representations for 0, ±∞ and NaN.
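The three fields can be inspected directly. A minimal Python sketch (not part of the slides; the helper name float32_fields is ours):

```python
import struct

def float32_fields(x):
    # Pack x into the IEEE single format and split off sign, exponent, fraction.
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    s = bits >> 31            # 1 sign bit
    E = (bits >> 23) & 0xFF   # 8 exponent bits (biased by 127)
    f = bits & 0x7FFFFF       # 23 fraction bits (hidden bit not stored)
    return s, E, f

s, E, f = float32_fields(-3.1416)
print(s, E, f)                # s = 1, E = 128, i.e. true exponent E - 127 = 1
# Reassemble (-1)^s * (1.f) * 2^(E-127); agrees with -3.1416 up to
# single-precision rounding:
print((-1) ** s * (1 + f / 2 ** 23) * 2.0 ** (E - 127))
```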
IEEE double precision format
▶ The IEEE double format is 64 bits (= 8 bytes) long (see the sketch after
this list):

    | s |    E      |            f             |
    sign  exponent           fraction
    (1)   (11 bits)          (52 bits)

  with the binary point just before f.
▶ It represents the number (−1)^s · (1.f) × 2^{E−1023}.
▶ Precision p = 53 and machine epsilon ε_m = 2^{−52} ≈ 2.2 × 10^{−16}.
▶ The range of positive normalized numbers is from
    N_min = 2^{−1022} ≈ 2.2 × 10^{−308}
to
    N_max = 1.11···1 × 2^{1023} ≈ 2^{1024} ≈ 1.8 × 10^{308}.
▶ Special representations for 0, ±∞ and NaN.
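These constants are visible at runtime, since Python's float is the IEEE double (a small illustrative sketch):

```python
import sys

fi = sys.float_info
print(fi.mant_dig)  # 53: the precision p, counting the hidden bit
print(fi.epsilon)   # 2.220446049250313e-16, i.e. 2**-52
print(fi.min)       # 2.2250738585072014e-308, i.e. Nmin = 2**-1022
print(fi.max)       # 1.7976931348623157e+308, i.e. Nmax, just under 2**1024
```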
Rounding modes
▶ Let a positive real number x be in the normalized range, i.e.,
N_min ≤ x ≤ N_max, and write it in the normalized form
    x = 1.b_1b_2···b_{p−1}b_pb_{p+1}... × 2^E.
▶ Then the closest floating-point number less than or equal to x is
    x_− = 1.b_1b_2···b_{p−1} × 2^E,
i.e., x_− is obtained by truncating.
▶ The next floating-point number bigger than x_− (also the next one
bigger than x) is
    x_+ = (1.b_1b_2···b_{p−1} + 0.00···01) × 2^E.
▶ If x is negative, the situation is reversed.
Rounding modes, cont'd
Four rounding modes:
1. round down: fl(x) = x_−
2. round up: fl(x) = x_+
3. round towards zero: fl(x) = x_− if x ≥ 0; fl(x) = x_+ if x ≤ 0
4. round to nearest (IEEE default rounding mode): fl(x) = x_− or x_+,
   whichever is nearer to x.
Note: except that if x > N_max, then fl(x) = ∞, and if x < −N_max, then
fl(x) = −∞. In the case of a tie, i.e., x_− and x_+ are the same distance
from x, the one with its least significant bit equal to zero is chosen
(see the sketch below).
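The tie-breaking rule can be observed directly in Python (doubles, where the gap above 1 is 2^−52):

```python
# 1 + 2**-53 lies exactly halfway between the doubles 1 and 1 + 2**-52.
# Round to nearest breaks the tie toward the candidate whose last significand
# bit is zero, which here is 1 itself:
print(1.0 + 2.0 ** -53 == 1.0)                # True (tie, rounds to even)

# Perturb the halfway point slightly and it is no longer a tie:
print(1.0 + (2.0 ** -53 + 2.0 ** -60) > 1.0)  # True (rounds up to 1 + 2**-52)
```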
Rounding error
▶ When round to nearest is in effect,
    relerr(x) = |fl(x) − x| / |x| ≤ (1/2) ε_m.
▶ Therefore, we have
    relerr ≤ (1/2) · 2^{−23} = 2^{−24} ≈ 5.96 × 10^{−8}   (single precision)
    relerr ≤ (1/2) · 2^{−52} = 2^{−53} ≈ 1.11 × 10^{−16}  (double precision).
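The bound can be checked exactly with rational arithmetic. A Python sketch using the fractions module, with 1/3 as a convenient value that is not representable in binary:

```python
from fractions import Fraction

x = Fraction(1, 3)       # exact value; infinite binary expansion
fx = Fraction(float(x))  # the nearest double, converted back exactly

relerr = abs(fx - x) / x
print(relerr <= Fraction(1, 2 ** 53))  # True: within (1/2)*eps_m = 2**-53
print(float(relerr))                   # ~5.6e-17, under the 1.11e-16 bound
```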
Floating-point arithmetic
▶ IEEE rules for correctly rounded floating-point operations:
if x and y are correctly rounded floating-point numbers, then
    fl(x + y) = (x + y)(1 + δ)
    fl(x − y) = (x − y)(1 + δ)
    fl(x × y) = (x × y)(1 + δ)
    fl(x / y) = (x / y)(1 + δ)
where |δ| ≤ (1/2) ε_m for round to nearest.
▶ The IEEE standard also requires that correctly rounded remainder and
square root operations be provided.
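Both operations are exposed by Python's math module, which maps to the IEEE operations on conforming platforms (a small illustration, not from the slides):

```python
import math

# IEEE remainder: x - n*y with n the integer nearest to x/y, so the result
# can be negative; here n = round(5/3) = 2 and 5 - 2*3 = -1.
print(math.remainder(5.0, 3.0))  # -1.0

# sqrt is correctly rounded: a single rounding of the exact square root.
r = math.sqrt(2.0)
print(r)      # 1.4142135623730951
print(r * r)  # 2.0000000000000004: the *squaring* rounds, not the sqrt
```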
Floating-point arithmetic, cont’d
IEEE standard responses to exceptions:

    Event               Example                       Set result to
    Invalid operation   0/0, 0 × ∞                    NaN
    Division by zero    finite nonzero/0              ±∞
    Overflow            |x| > N_max                   ±∞ or ±N_max
    Underflow           x ≠ 0, |x| < N_min            ±0, ±N_min or subnormal
    Inexact             whenever fl(x ◦ y) ≠ x ◦ y    correctly rounded value
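These default responses can be observed with NumPy, which follows IEEE semantics instead of raising Python exceptions (a sketch for illustration; np.errstate merely silences the warnings):

```python
import numpy as np

with np.errstate(divide='ignore', invalid='ignore', over='ignore'):
    print(np.float64(1.0) / np.float64(0.0))        # inf  (division by zero)
    print(np.float64(0.0) / np.float64(0.0))        # nan  (invalid operation)
    print(np.float64(1e300) * np.float64(1e300))    # inf  (overflow)
    print(np.float64(1e-300) * np.float64(1e-300))  # 0.0  (underflow: 1e-600
                                                    # is below even subnormals)
```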
Floating-point arithmetic error
▶ Let x̂ and ŷ be floating-point numbers such that
    x̂ = x(1 + τ_1) and ŷ = y(1 + τ_2),  for |τ_i| ≤ τ ≪ 1,
where the τ_i could be the relative errors from the process of
collecting/getting the data from the original source or from previous
operations.
▶ Question: how do the four basic arithmetic operations behave?
Floating-point arithmetic error: +,−
Addition and subtraction:
    fl(x̂ + ŷ) = (x̂ + ŷ)(1 + δ),  |δ| ≤ (1/2) ε_m
              = x(1 + τ_1)(1 + δ) + y(1 + τ_2)(1 + δ)
              = x + y + x(τ_1 + δ + O(τ ε_m)) + y(τ_2 + δ + O(τ ε_m))
              = (x + y)·(1 + (x/(x + y))·(τ_1 + δ + O(τ ε_m))
                           + (y/(x + y))·(τ_2 + δ + O(τ ε_m)))
              ≡ (x + y)(1 + δ̂),
where δ̂ can be bounded as follows:
    |δ̂| ≤ ((|x| + |y|)/|x + y|)·(τ + (1/2) ε_m + O(τ ε_m)).
Floating-point arithmetic error: +,−
Three possible cases:
1. If x and y have the same sign, i.e., xy > 0, then |x + y| = |x| + |y|;
this implies
    |δ̂| ≤ τ + (1/2) ε_m + O(τ ε_m) ≪ 1.
Thus fl(x̂ + ŷ) approximates x + y well.
2. If x ≈ −y, then |x + y| ≈ 0, so (|x| + |y|)/|x + y| ≫ 1; this implies
that |δ̂| could be nearly as big as, or much bigger than, 1.
This is the so-called catastrophic cancellation: it causes relative errors
or uncertainties already present in x̂ and ŷ to be magnified.
3. In general, if (|x| + |y|)/|x + y| is not too big, fl(x̂ + ŷ) provides a
good approximation to x + y.
Catastrophic cancellation: example 1
▶ Computing √(x+1) − √x straightforwardly causes substantial loss of
significant digits for large x:

    x         fl(√(x+1))                fl(√x)                    fl(fl(√(x+1)) − fl(√x))
    1.00e+10  1.00000000004999994e+05   1.00000000000000000e+05   4.99999441672116518e-06
    1.00e+11  3.16227766018419061e+05   3.16227766016837908e+05   1.58115290105342865e-06
    1.00e+12  1.00000000000050000e+06   1.00000000000000000e+06   5.00003807246685028e-07
    1.00e+13  3.16227766016853740e+06   3.16227766016837955e+06   1.57859176397323608e-07
    1.00e+14  1.00000000000000503e+07   1.00000000000000000e+07   5.02914190292358398e-08
    1.00e+15  3.16227766016838104e+07   3.16227766016837917e+07   1.86264514923095703e-08
    1.00e+16  1.00000000000000000e+08   1.00000000000000000e+08   0.00000000000000000e+00
Catastrophic cancellation: example 1
▶ Catastrophic cancellation can sometimes be avoided if a formula is
properly reformulated.
▶ For example, one can compute √(x+1) − √x almost to full precision
by using the equality
    √(x+1) − √x = 1/(√(x+1) + √x).
Consequently, the computed results are

    x         fl(1/(√(x+1) + √x))
    1.00e+10  4.999999999875000e-06
    1.00e+11  1.581138830080237e-06
    1.00e+12  4.999999999998749e-07
    1.00e+13  1.581138830084150e-07
    1.00e+14  4.999999999999987e-08
    1.00e+15  1.581138830084189e-08
    1.00e+16  5.000000000000000e-09
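One row of the two tables can be reproduced in a few lines of Python (doubles, matching the precision the tables appear to use):

```python
import math

x = 1.0e+12
naive = math.sqrt(x + 1.0) - math.sqrt(x)           # cancels leading digits
stable = 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))  # no subtraction, no cancellation

print(naive)   # 5.00003807246685e-07  : only ~6 digits correct
print(stable)  # 4.999999999998749e-07 : essentially full precision
```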
Catastrophic cancellation: example 2
▶ Consider the function
    h(x) = (1 − cos x)/x².
Note that 0 ≤ h(x) < 1/2 for all x ≠ 0.
▶ Let x = 1.2 × 10^−8; then the computed fl(h(x)) = 0.770988... is
completely wrong!
▶ Alternatively, using 1 − cos x = 2 sin²(x/2), the function can be
re-written as
    h(x) = (1/2) · (sin(x/2)/(x/2))².
▶ Consequently, for x = 1.2 × 10^−8, the computed
fl(h(x)) = 0.499999... < 1/2 is fine!

Floating-point arithmetic error: ×, /
Multiplication and division:
    fl(x̂ × ŷ) = (x̂ × ŷ)(1 + δ) = xy(1 + τ_1)(1 + τ_2)(1 + δ) ≡ xy(1 + δ̂_×),
    fl(x̂ / ŷ) = (x̂ / ŷ)(1 + δ) = (x/y)(1 + τ_1)(1 + τ_2)^{−1}(1 + δ) ≡ (x/y)(1 + δ̂_÷),
where
    δ̂_× = τ_1 + τ_2 + δ + O(τ ε_m)
    δ̂_÷ = τ_1 − τ_2 + δ + O(τ ε_m).
Thus |δ̂_×| ≤ 2τ + (1/2) ε_m + O(τ ε_m) and |δ̂_÷| ≤ 2τ + (1/2) ε_m + O(τ ε_m).
Multiplication and division are very well-behaved!

Reading
▶ Section 1.7 of Numerical Computing with MATLAB by C. Moler
▶ Websites discussing numerical disasters:
  ▶ T. Huckle, Collection of software bugs,
    http://www5.in.tum.de/~huckle/bugse.html
  ▶ K. Vuik, Some disasters caused by numerical errors,
    http://ta.twi.tudelft.nl/nw/users/vuik/wi211/disasters.html
  ▶ D. Arnold, Some disasters attributable to bad numerical computing,
    http://www.ima.umn.edu/~arnold/disasters/disasters.html
▶ In-depth material: D. Goldberg, What every computer scientist should
know about floating-point arithmetic, ACM Computing Surveys, Vol. 23(1),
pp. 5-48, 1991

LU factorization – revisited: the need of pivoting in LU factorization, numerically
▶ LU factorization without pivoting:
    A = [.0001 1; 1 1] = [10^−4 1; 1 1] = LU = [1 0; l_21 1]·[u_11 u_12; 0 u_22]
▶ In three-decimal-digit floating-point arithmetic, we have
    L̂ = [1 0; fl(1/10^−4) 1] = [1 0; 10^4 1],
    Û = [10^−4 1; 0 fl(1 − 10^4 · 1)] = [10^−4 1; 0 −10^4].
▶ Check:
    L̂Û = [1 0; 10^4 1]·[10^−4 1; 0 −10^4] = [10^−4 1; 1 0] ≉ A.

LU factorization – revisited: the need of pivoting in LU factorization, numerically
▶ Consider solving Ax = [1; 2] for x using this LU factorization.
▶ Solving L̂y = [1; 2] gives
    ŷ_1 = fl(1/1) = 1
    ŷ_2 = fl(2 − 10^4 · 1) = −10^4;
note that the value 2 has been "lost" by subtracting 10^4 from it.
▶ Solving Ûx = ŷ = [1; −10^4] gives
    x̂_2 = fl((−10^4)/(−10^4)) = 1
    x̂_1 = fl((1 − 1)/10^−4) = 0.
▶ x̂ = [x̂_1; x̂_2] = [0; 1] is a completely erroneous solution; the correct
answer is x ≈ [1; 1].

LU factorization – revisited: the need of pivoting in LU factorization, numerically
▶ LU factorization with partial pivoting: PA = LU.
▶ By exchanging the order of the rows of A, i.e., P = [0 1; 1 0], the LU
factorization of PA is given by
    L̂ = [1 0; fl(10^−4/1) 1] = [1 0; 10^−4 1],
    Û = [1 1; 0 fl(1 − 10^−4 · 1)] = [1 1; 0 1].
▶ The computed L̂Û approximates PA very accurately.
▶ As a result, the computed solution x is also perfect!
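A double-precision analogue of this experiment (a NumPy sketch, not from the slides): shrinking the pivot to 10^−20 makes float64 absorb the subtracted terms exactly as the three-digit arithmetic did.

```python
import numpy as np

A = np.array([[1e-20, 1.0],
              [1.0,   1.0]])
b = np.array([1.0, 2.0])

# LU without pivoting, done by hand:
l21 = A[1, 0] / A[0, 0]          # 1e20
u22 = A[1, 1] - l21 * A[0, 1]    # fl(1 - 1e20) = -1e20: the 1 is absorbed
y2 = b[1] - l21 * b[0]           # fl(2 - 1e20) = -1e20: the 2 is "lost"
x2 = y2 / u22                    # 1.0
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
print(x1, x2)                    # 0.0 1.0 -- completely erroneous

# A solver with partial pivoting recovers the correct answer x ~ [1, 1]:
print(np.linalg.solve(A, b))     # [1. 1.]
```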