CS计算机代考程序代写 Computer Arithmetic

Computer Arithmetic
CS/SE 4X03
Ned Nedialkov McMaster University January 14, 2021

What is the output
a(1) = (1/cos(100*pi+pi/4))^2; % (1/ cos(100π + π/4))2 = 2 a(2) = 3*tan(atan(1e7))/1e7; % 3tan(arctan(107))/107 = 3 x = 4;
for i=1:20 x = sqrt(x); end
for i=1:20 x = x*x; end
a(3) = x; % = 4
a(4) = 5*(1+exp(-100)-1)/(1+exp(-100)-1); % 51+e−100−1 = 5 1+e−100 −1
a(5) = log(exp(6e3))/1e3; % ln(e6000)/1000 = 6 for i = 1:5
fprintf(’%d: %.16f\n’, i+1, a(i));
end

Outline
Floating-point number system Rounding
Machine epsilon
IEEE 754
Cancellations

FP system Rounding Machine epsilon IEEE 754 Cancellations Floating-point number system
A floating-point (FP) system is characterized by four integers (β,t,L,U), where
• β base of the system or radix • t number of digits or precision • [L,U] exponent range
A number x in the system is represented as
􏰍 d1 d2 dt−1 􏰎
×β
e
x=± d0+β+β2+···+β
where
• 0≤di ≤β−1,i=0,…,t−1 • e∈[L,U]
t−1
Copyright © 2021 N. Nedialkov. All rights reserved.
4/22

FP system Rounding Machine epsilon IEEE 754 Cancellations Floating-point number system cont.
The string of base β digits d0d1 · · · dt−1 is called mantissa or significand
d1d2 · · · dt−1 is called fraction
A common way of expressing x is
± d0.d1 · · · dt−1 × βe A FP number is normalized if d0 is nonzero
denormalized otherwise
Copyright © 2021 N. Nedialkov. All rights reserved. 5/22

FP system Rounding Machine epsilon IEEE 754 Cancellations Floating-point number system cont.
Example 1. Consider the FP (10, 3, −2, 2). • Numbers are of the form
d0.d1d2×10e, d0̸=0,e∈[−2,2]
• largest positive number 9.99 × 102
• smallest positive normalized number 1.00 × 10−2
• smallest positive denormalized number 0.01 × 10−2
• denormalized numbers are e.g. 0.23 × 10−2, 0.11 × 10−2 • 0 is represented as 0.00 × 100
Copyright © 2021 N. Nedialkov. All rights reserved. 6/22

FP system Rounding Machine epsilon IEEE 754 Cancellations Rounding
How to store a real number
x = ± d0.d1 · · · dt−1dtdt+1 · · · × βe
in t digits?
Denote by fl (x) the FP representation of x
• Rounding by chopping (also called rounding towards zero)
• Rounding to nearest. fl (x) is the nearest FP to x
If a tie, round to the even FP
• Rounding towards +∞. fl (x) is the smallest FP ≥ x
• Rounding towards −∞. fl (x) is the largest FP ≤ x
Copyright © 2021 N. Nedialkov. All rights reserved. 7/22

FP system Rounding Machine epsilon IEEE 754 Cancellations Rounding cont.
Example 2. Consider the FP (10, 3, −2, 2). Let x = 1.2789 × 101
• chopping: fl (x) = 1.27 × 101 • nearest: fl (x) = 1.28 × 101
• +∞: fl(x)=1.28×101
• −∞: fl(x)=1.27×101
Let x = 1.275000 × 101.
• nearest: fl (x) = 1.28 × 101
Copyright © 2021 N. Nedialkov. All rights reserved. 8/22

FP system Rounding Machine epsilon IEEE 754 Cancellations Machine epsilon
Machine epsilon: the distance from 1 to the next larger FP number E.g. in(10,3,−2,2),εmach =1.01−1=0.01=10−2
Another definition: smallest ε > 0 such that fl (1 + ε) > 1 These two definitions are not equivalent
Unit roundoff: u = εmach/2 When rounding to the nearest
i.e.
fl(x)=x(1+ε), where|ε|≤u
fl(x)−x 􏰣􏰣fl(x)−x􏰣􏰣
x =ε, 􏰣 x 􏰣≤u
􏰣􏰣
Copyright © 2021 N. Nedialkov. All rights reserved. 9/22

FP system Rounding Machine epsilon IEEE 754 Cancellations IEEE 754
• IEEE 754 standard for FP arithmetic (1985)
• IEEE 754-2008, IEEE 754-2019
• Most common (binary) single and double precision since 2008 half precision
bits t L U εmach
single double
32 24 64 53
−126 127 −1022 1023
1.2 × 10−7 2.2 × 10−16
smallest
single double
normalized
±1.2 × 10−38
denormalized ±1.4 × 10−45 ±4.9 × 10−324
range ±3.4 × 1038
±1.8 × 10308
Copyright © 2021 N. Nedialkov. All rights reserved.
±2.2 × 10−308 (These are ≈ values)
10/22

FP system Rounding Machine epsilon IEEE 754 Cancellations IEEE 754 cont.
Exceptional values
• Inf, -Inf when the result overflows, e.g. 1/0.0
• NaN ”Not a Number” results from undefined operations e.g. 0/0, 0*Inf, Inf/Inf
NaNs typically propagate through computations
Copyright © 2021 N. Nedialkov. All rights reserved. 11/22

FP system Rounding Machine epsilon IEEE 754 Cancellations IEEE 754 cont.
(From https:
//www.geeksforgeeks.org/ieee- standard- 754- floating- point- numbers/)
• sign 0 positive, 1 negative
• exponent is biased
• first bit of mantissa is not stored, sticky bit, assumed 1
Copyright © 2021 N. Nedialkov. All rights reserved.
12/22

FP system Rounding Machine epsilon IEEE 754 Cancellations IEEE 754 cont.
Single precision
• Inf
◦ sign: 0 for +Inf, 1 for -Inf
◦ biased exponent: all 1’s, 255
◦ fraction: all 0’s
• NaN
◦ sign: 0 or 1
◦ biased exponent: all 1’s, 255
◦ fraction: at least one 1
•0
◦ sign: 0for+0,1for−0
◦ biased exponent: all 0’s
◦ mantissa: all 0’s
• FP numbers
◦ biased exponent: from 1 to 254, bias: 127
◦ actual exponent: 1 − 127 = −126 to 254 − 127 = 127
Copyright © 2021 N. Nedialkov. All rights reserved.
13/22

FP system Rounding Machine epsilon IEEE 754 Cancellations IEEE 754 cont.
Double precision
• bias 1023
• biased exponent: from 1 to 2046
• actual exponent: from −1022 to 1023 • rest similar to single
Copyright © 2021 N. Nedialkov. All rights reserved.
14/22

FP system Rounding Machine epsilon IEEE 754 Cancellations IEEE 754 cont.
FP arithmetic
For a real x and rounding to nearest fl(x)=x(1+ε), |ε|≤u
u is the unit roundoff of the precision
The arithmetic operations are correctly rounded, i.e. for x and y IEEE numbers (e.g. in double or single precision) and
rounding to the nearest
fl(x◦y)=(x◦y)(1+ε), ◦∈{+,−,∗,/}, |ε|≤u
Copyright © 2021 N. Nedialkov. All rights reserved. 15/22

FP system Rounding Machine epsilon IEEE 754 Cancellations IEEE 754 cont.
Example 3. Assume x, y, z machine numbers. Find the error in fl(z(x+y))
fl (z(x + y)) = fl (z) fl (x + y) (1 + δ1)
= z(fl (x) + fl (y))(1 + δ1)(1 + δ2)
= z(x + y)(1 + δ1)(1 + δ2) =z(x+y)(1+δ1 +δ2 +δ1δ2) =z(x+y)(1+δ1 +δ2),
where |δ1,2| ≤ u. |δ1δ2| ≪ u is very small, so we neglect it Denoting δ = δ1 + δ2
fl(z(x+y))=z(x+y)(1+δ), where|δ|≤2u Copyright © 2021 N. Nedialkov. All rights reserved.
16/22

FP system Rounding Machine epsilon IEEE 754 Cancellations IEEE 754 cont.
Example 4. Assume x, y real. What is the error in fl (xy)? .
fl (xy) = (fl (x) fl (y))(1 + δ1)
= x(1 + δ2)y(1 + δ3)(1 + δ1) ≈xy(1+δ1 +δ2 +δ3)
where |δ1,2,3| ≤ u Denoting δ = δ1 + δ2 + δ3,
fl(xy)=xy(1+δ), where|δ|≤3u
Copyright © 2021 N. Nedialkov. All rights reserved.
17/22

FP system Rounding Machine epsilon IEEE 754 Cancellations
Example 5 (Computing 􏰓x2 + y2).
• One can do sqrt(x*x+y*y)
• Assume double precision and suppose x = 1e200 and y= 1e100
• x*x will overflow and the result is Inf
• sqrt(Inf+1e200) gives Inf
• Let M = max{|x|, |y|} and assume M = |x|. Then
􏰓x2 + y2 = M􏰓1 + (y/M)2
• Setting M=1e200, y1=y/M, compute M*sqrt(1+y1*y1),
which gives 1e200
Copyright © 2021 N. Nedialkov. All rights reserved.
18/22

FP system Rounding
IEEE 754 cont.
Machine epsilon
IEEE 754
Cancellations
Note
expression
y1*y1
1+y1*y1 sqrt(1+y1*y1) 1
produces
1e-100 1
Copyright © 2021 N. Nedialkov. All rights reserved.
19/22

FP system Rounding Machine epsilon IEEE 754 Cancellations Cancellations
Example 6.
• Assume a decimal FP system with t = 5 digits.
• Let x = 1.2347 and y = 1.2316, and assume the underlined digits are incorrect, e.g. resulting from a previous computation.
• x−y=0.0031
• If we normalize 0.0031, we have 3.1000 × 10−3
• No accurate digits in the result: catastrophic cancellation
Copyright © 2021 N. Nedialkov. All rights reserved.
20/22

FP system Rounding Machine epsilon IEEE 754 Cancellations Cancellations cont.
Example 7.
• Consider FP arithmetic with 8 decimal digits • Letx=1,h=10−4
• Compute sin(x + h) − sin(x)
sin(x + h) = 8.4152501 × 10−1 sin(x) = 8.4147098 × 10−1 sin(x + h) − sin(x) = 0.0005403 × 10−1 normalize = 5.4030000 × 10−5 accurate = 5.4026023 × 10−5
• We lose 3 digits: cancellations. Relative error ≈ 7.3 × 10−5 • Note there are no roundoff errors in the subtraction
Copyright © 2021 N. Nedialkov. All rights reserved. 21/22

FP system Rounding Machine epsilon IEEE 754 Cancellations Cancellations cont.
Example 8.
• Consider FP arithmetic with 8 decimal digits • Letx=1,h=10−1
• Compute sin(x + h) − sin(x)
sin(x + h) = 8.9120736 × 10−1 sin(x) = 8.4147098 × 10−1 sin(x + h) − sin(x) = 0.4973638 × 10−1 normalize = 4.9736380 × 10−2 accurate = 4.9736375 × 10−2
• We lose 1 digit. Relative error ≈ 9.5 × 10−8 ≈ 10−7 • Note there are no roundoff errors in the subtraction
Copyright © 2021 N. Nedialkov. All rights reserved.
22/22