
I.1 Numbers¶
In this chapter we introduce two's-complement storage for integers and the
IEEE Standard for Floating-Point Arithmetic.
There are many possible ways of representing real numbers on a computer, and many
possible choices for the precise behaviour of operations such as addition and multiplication.
Before the 1980s each processor potentially had a different representation for
real numbers, as well as different behaviour for the operations.

The IEEE standard, introduced in 1985, was a means to standardise this across
processors so that algorithms would produce consistent and reliable results.

This chapter may seem very low level for a mathematics course but there are
two important reasons to understand the behaviour of integers and floating-point numbers:

Integer arithmetic can suddenly start giving wrong (negative) answers when numbers
become large.
Floating-point arithmetic is very precisely defined, and can even be used
in rigorous computations as we shall see in the problem sheets. But it is not exact,
and it's important to understand how errors in computations can accumulate.
Failure to understand floating-point arithmetic can cause catastrophic issues
in practice, with the extreme example being the
explosion of the Ariane 5 rocket.

In this chapter we discuss the following:

Binary representation: Any real number can be represented in binary, that is,
by an infinite sequence of 0s and 1s (bits). We review binary representation.
Integers: There are multiple ways of representing integers on a computer. We discuss
the different types of integers and their representation as bits, and how arithmetic operations behave
like modular arithmetic. As an advanced topic we discuss BigInt, which uses variable bit length storage.
Floating-point numbers: Real numbers are stored on a computer with a finite number of bits.
There are three types of floating-point numbers: normal numbers, subnormal numbers, and special numbers.
Arithmetic: We discuss how arithmetic operations in floating-point are exact up to rounding, and how the
rounding mode can be set. This allows us to bound errors in computations.
High-precision floating-point numbers: As an advanced topic, we discuss how the precision of floating-point arithmetic can be increased arbitrarily
using BigFloat.

Before we begin, we load two external packages. SetRounding.jl allows us
to set the rounding mode of floating-point arithmetic. ColorBitstring.jl
implements the functions printbits (and printlnbits),
which print the bits of numbers in colour (the latter adding a newline).

using SetRounding, ColorBitstring
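
The printbits function is used throughout this chapter on both integers and floating-point numbers. As a quick sanity check (an illustrative snippet, not part of the original notes; the colours depend on your terminal):

printbits(UInt8(5)) # prints the 8 bits 00000101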

1. Binary representation¶
Any integer can be represented in binary format, that is, as a sequence of 0s and 1s.

Definition
For $B_0,\ldots,B_p \in \{0,1\}$ denote a non-negative integer in binary format by:
(B_p\ldots B_1B_0)_2 := 2^pB_p + \cdots + 2B_1 + B_0
For $b_1,b_2,\ldots \in \{0,1\}$, denote a non-negative real number in binary format by:
(B_p \ldots B_0.b_1b_2b_3\ldots)_2 = (B_p \ldots B_0)_2 + {b_1 \over 2} + {b_2 \over 2^2} + {b_3 \over 2^3} + \cdots

First we show some examples of verifying a number's binary representation:

Example (integer in binary)
A simple integer example is $5 = 2^2 + 2^0 = (101)_2$.
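
We can sanity-check such expansions in Julia using the built-in string and parse functions (an illustrative snippet, not from the original notes):

string(5, base=2) # returns "101", the binary digits of 5
parse(Int, "101"; base=2) # returns 5, converting the binary digits back to an integer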

Example (rational in binary)
Consider the number 1/3. In decimal recall that:
1/3 = 0.3333\ldots = \sum_{k=1}^\infty {3 \over 10^k}
We will see that in binary
1/3 = (0.010101\ldots)_2 = \sum_{k=1}^\infty {1 \over 2^{2k}}
Both results can be proven using the geometric series:
\sum_{k=0}^\infty z^k = {1 \over 1 - z}
provided $|z| < 1$. That is, with $z = {1 \over 4}$ we verify the binary expansion:
\sum_{k=1}^\infty {1 \over 4^k} = {1 \over 1 - 1/4} - 1 = {1 \over 3}
A similar argument with $z = 1/10$ shows the decimal case.

2. Integers¶
On a computer one typically represents integers by a finite number of $p$ bits, with $2^p$ possible combinations of 0s and 1s.
For unsigned integers (non-negative integers) these bits are just the first $p$ binary digits: $(B_{p-1}\ldots B_1B_0)_2$.
Integers on a computer follow modular arithmetic:

Definition (ring of integers modulo $m$)
Denote the ring
{\mathbb Z}_{m} := \{0 \ ({\rm mod}\ m), 1 \ ({\rm mod}\ m), \ldots, m-1 \ ({\rm mod}\ m) \}

Integers represented with $p$ bits on a computer actually represent elements of ${\mathbb Z}_{2^p}$ and
integer arithmetic on a computer is equivalent to arithmetic modulo $2^p$.

Example (addition of 8-bit unsigned integers)
Consider the addition of two 8-bit numbers:
255 + 1 = (11111111)_2 + (00000001)_2 = (100000000)_2 = 256
The result is impossible to store in just 8 bits!
It would be too slow for a computer to increase the number of bits, or to throw an error (checks are slow).
So instead it treats the integers as elements of ${\mathbb Z}_{256}$:
255 + 1 \ ({\rm mod}\ 256) = (00000000)_2 \ ({\rm mod}\ 256) = 0 \ ({\rm mod}\ 256)
We can see this in Julia:

x = UInt8(255)
y = UInt8(1)
printbits(x); println(" + "); printbits(y); println(" = ")
printbits(x + y)

11111111 +
00000001 =
00000000

Example (multiplication of 8-bit unsigned integers)
Multiplication works similarly: for example,
254 * 2 \ ({\rm mod}\ 256) = 252 \ ({\rm mod}\ 256) = (11111100)_2 \ ({\rm mod}\ 256)
We can see this behaviour in code by printing the bits:

x = UInt8(254) # 254 represented in 8 bits as an unsigned integer
y = UInt8(2)   # 2 represented in 8 bits as an unsigned integer
printbits(x); println(" * "); printbits(y); println(" = ")
printbits(x * y)

11111110 *
00000010 =
11111100

Signed integers¶
Signed integers use the two's-complement convention. The convention is that if the first bit is 1 then the number is negative:
the number $2^p - y$ is interpreted as $-y$.
Thus for $p = 8$ we are interpreting $2^7$ through $2^8-1$ as negative numbers.

Example (converting bits to signed integers)
What 8-bit integer has the bits 01001001? Adding the corresponding decimal places we get:
2^0 + 2^3 + 2^6 = 73
What 8-bit (signed) integer has the bits 11001001? Because the first bit is 1 we know it's a negative number,
hence we need to sum the bits but then subtract $2^p$:
2^0 + 2^3 + 2^6 + 2^7 - 2^8 = -55
We can check the results using printbits:

printlnbits(Int8(73))
printbits(-Int8(55))

01001001
11001001

Arithmetic works precisely the same for signed and unsigned integers.

Example (addition of 8-bit integers)
Consider (-1) + 1 in 8-bit arithmetic. The number $-1$ has the same bits as $2^8 - 1 = 255$.
Thus this is equivalent to the previous example and we get the correct result of 0. In other words:
-1 + 1 \ ({\rm mod}\ 2^p) = 2^p-1 + 1 \ ({\rm mod}\ 2^p) = 2^p \ ({\rm mod}\ 2^p) = 0 \ ({\rm mod}\ 2^p)

Example (multiplication of 8-bit integers)
Consider (-2) * 2. $-2$ has the same bits as $2^8 - 2 = 254$ and $-4$ has the same bits as
$2^8 - 4 = 252$, and hence from the previous example we get the correct result of -4.
In other words:
(-2) * 2 \ ({\rm mod}\ 2^p) = (2^p-2) * 2 \ ({\rm mod}\ 2^p) = 2^{p+1}-4 \ ({\rm mod}\ 2^p) = -4 \ ({\rm mod}\ 2^p)

Example (overflow)
We can find the largest and smallest instances of a type using typemax and typemin:

printlnbits(typemax(Int8)) # 2^7-1 = 127
printbits(typemin(Int8))   # -2^7 = -128

01111111
10000000

As explained, due to modular arithmetic, when we add 1 to the largest 8-bit integer we get the smallest:

typemax(Int8) + Int8(1) # returns typemin(Int8)

-128

This behaviour is often not desired and is known as overflow, and one must be wary of using integers close to their largest value.

Variable bit representation (advanced)¶
An alternative representation for integers uses a variable number of bits,
with the advantage of avoiding overflow but with the disadvantage of a substantial speed penalty.
In Julia these are BigInts, which we can create by calling big on an integer:

x = typemax(Int64) + big(1) # Too big to be an `Int64`

9223372036854775808

Note in this case addition automatically promotes an Int64 to a BigInt. We can create very large numbers using BigInt:

3082994025277634745700106821545665721371798533305697458855342277921093731984476404705966539412410898240561729912372038501228893141921080152404642393776599077294434061519905424124601394226943601430916434383714716724720227331596950613701661034548948388721097667275438763758128508403297199458260277707301202460980093818414167080563342761482395862435185093942443540722363151770022221783243959592531336062998494209914752408019060720805124534382646051093613814848646062038662423487504326044361203708430489305864234333801401547140023376295718383390360728662900230671437151716615826286842267917560749586018165739492101920429719261285640125596833063891562865262157026023955919873792846823095854484520920509345944712871675691790827690907778485058829248588945681685288179787963931181062068092463984296225973082494056307958089189726701678735576365394146232076917088075949053636690459581128773097212746967276496496010810878000638239143750075543163240049874489986642327436441234458040254480825038220479904594615300602390556385799245276805580024937804723029319565942013515817048714543455250235208789745701165279569026248145395218985062991831707830217974393158466067785199581037714968820628241051867119832966361530047910339065726550260741036716100932205969655083257714244071120221654679340461084001560321676025443801248355439305974923873624147980728110581452806109011739005060060604228087667499288851218705078807364237925455813890575257569981450090997117697469299234094394984840574025401463942099019413361096233909056117427663439764954916401592565651111571414769257187704568268701243082044838400201357613851006471104244828842270232637747398962711875413488415772647088571125272932490717217468263604683325933469555629785507020775366368002753612709901526248456328209643292122899677436613886360765877886748185299249994921843183573130403496311896614949399409796011301191280067209053259341918813967552543176532349157376

Note the number of bits is not fixed: the larger the number, the more bits required
to represent it, so while overflow is impossible, it is possible to run out of memory if a number is
astronomically large: go ahead and try x^x (at your own risk).
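
To illustrate the variable bit length (a supplementary sketch, not from the original notes), we can count the binary digits of a BigInt with the built-in ndigits function:

y = big(2)^100 # an integer requiring 101 binary digits, far beyond typemax(Int64)
ndigits(y, base=2) # returns 101: the storage grows with the size of the number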
In addition to +, -, and *, we have integer division ÷, which rounds down:

5 ÷ 2 # equivalent to div(5,2)

2

Standard division / (or \ for division on the right) creates a floating-point number, which will be discussed shortly:

5 / 2 # alternatively 2 \ 5

2.5

We can also create rational numbers using //:

(1//2) + (3//4)

5//4

Rational arithmetic often leads to overflow so it is often best to combine big with rationals:

big(102324)//132413023 + 23434545//4243061 + 23434545//42430534435

26339037835007648477541540//4767804878707544364596461

3. Floating-point numbers¶
Floating-point numbers are a subset of real numbers that are representable using a fixed number of bits.

Definition (floating-point numbers)
Given integers $σ$ (the "exponential shift"), $Q$ (the number of exponent bits) and $S$ (the precision),
define the set of floating-point numbers by dividing into normal, sub-normal, and special number subsets:
F_{σ,Q,S} := F^{\rm normal}_{σ,Q,S} \cup F^{\rm sub}_{σ,Q,S} \cup F^{\rm special}.
The normal numbers $F^{\rm normal}_{σ,Q,S} \subset {\mathbb R}$ are defined by
F^{\rm normal}_{σ,Q,S} = \{\pm 2^{q-σ} \times (1.b_1b_2b_3\ldots b_S)_2 : 1 \leq q < 2^Q-1 \}.
The sub-normal numbers $F^{\rm sub}_{σ,Q,S} \subset {\mathbb R}$ are defined as
F^{\rm sub}_{σ,Q,S} = \{\pm 2^{1-σ} \times (0.b_1b_2b_3\ldots b_S)_2\}.
The special numbers $F^{\rm special} \not\subset {\mathbb R}$ are defined later.

Note this set of real numbers has no nice algebraic structure: it is not closed under addition, subtraction, etc.
We will therefore need to define approximate versions of algebraic operations later.

Floating-point numbers are stored in $1 + Q + S$ total number of bits, in the format
s q_{Q-1}\ldots q_0 b_1 \ldots b_S
The first bit ($s$) is the sign bit: 0 means positive and 1 means negative.
The bits $q_{Q-1}\ldots q_0$ are the exponent bits: they are the binary digits of the unsigned integer $q$:
q = (q_{Q-1}\ldots q_0)_2.
Finally, the bits $b_1\ldots b_S$ are the significand bits. If $1 \leq q < 2^Q-1$ then the bits represent the normal number
x = \pm 2^{q-σ} \times (1.b_1b_2b_3\ldots b_S)_2.
If $q = 0$ (i.e. all exponent bits are 0) then the bits represent the sub-normal number
x = \pm 2^{1-σ} \times (0.b_1b_2b_3\ldots b_S)_2.
If $q = 2^Q-1$ (i.e. all exponent bits are 1) then the bits represent a special number, discussed later.

IEEE floating-point numbers¶
Definition (IEEE floating-point numbers)
IEEE has 3 standard floating-point formats: 16-bit (half precision), 32-bit (single precision) and 64-bit (double precision) defined by:
\begin{align*}
F_{16} &:= F_{15,5,10} \\
F_{32} &:= F_{127,8,23} \\
F_{64} &:= F_{1023,11,52}
\end{align*}

In Julia these correspond to 3 different floating-point types:

Float64 is a type representing double precision ($F_{64}$). We can create a Float64 by including a decimal point when writing the number: 1.0 is a Float64. Float64 is the default format for scientific computing (on the Floating-Point Unit, FPU).
Float32 is a type representing single precision ($F_{32}$). We can create a Float32 by including an f0 when writing the number: 1f0 is a Float32. Float32 is generally the default format for graphics (on the Graphics Processing Unit, GPU), as the difference between 32 bits and 64 bits is indistinguishable to the eye in visualisation, and more data can be fit into a GPU's limited memory.
Float16 is a type representing half-precision ($F_{16}$). It is important in machine learning where one wants to maximise the amount of data and high accuracy is not necessarily helpful.
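
To connect the storage format $s\ q_{Q-1}\ldots q_0\ b_1\ldots b_S$ with a concrete number, the following sketch (an illustration using Base's bitstring rather than printbits; not from the original notes) splits the 64 bits of a Float64 into its sign, exponent and significand fields:

x = 2.5 # 2.5 = 2^1 * (1.01)_2
bits = bitstring(x) # the 64 bits of x as a string of 0s and 1s
bits[1] # the sign bit s: '0' since x is positive
q = parse(Int, bits[2:12]; base=2) # the exponent bits as an unsigned integer: q = 1024
q - 1023 # subtracting σ = 1023 recovers the exponent 1
bits[13:64] # the significand bits b_1...b_52: 0100...0, matching (1.01)_2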
Example (rational in Float32)
How is the number $1/3$ stored in Float32? Recall that
1/3 = (0.010101\ldots)_2 = 2^{-2} (1.0101\ldots)_2 = 2^{125-127} (1.0101\ldots)_2
and since $125 = (1111101)_2$ the exponent bits are 01111101.
For the significand we round the last bit to the nearest element of $F_{32}$ (this is explained in detail in the section on rounding), so we have
1.010101010101010101010101\ldots \approx 1.01010101010101010101011 \in F_{32}
and the significand bits are 01010101010101010101011. Thus the Float32 bits for $1/3$ are:

printbits(1f0/3)

00111110101010101010101010101011

For sub-normal numbers, the simplest example is zero, which has $q=0$ and all significand bits zero:

printbits(0.0)

0000000000000000000000000000000000000000000000000000000000000000

Unlike integers, we also have a negative zero:

printbits(-0.0)

1000000000000000000000000000000000000000000000000000000000000000

This is treated as identical to 0.0 (except for degenerate operations as explained in special numbers).

Special normal numbers¶
When dealing with normal numbers there are some important constants that we will use to bound errors.

Definition (machine epsilon/smallest positive normal number/largest normal number)
Machine epsilon is denoted
ϵ_{{\rm m},S} := 2^{-S}.
When $S$ is implied by context we use the notation $ϵ_{\rm m}$.
The smallest positive normal number has $q = 1$ and $b_k$ all zero:
\min |F_{σ,Q,S}^{\rm normal}| = 2^{1-σ}
where $|A| := \{|x| : x \in A\}$. The largest (positive) normal number is
\max F_{σ,Q,S}^{\rm normal} = 2^{2^Q-2-σ} (1.11\ldots1)_2 = 2^{2^Q-2-σ} (2-ϵ_{\rm m})

We confirm the simple bit representations:

σ,Q,S = 127,8,23 # Float32
εₘ = 2.0^(-S)
printlnbits(Float32(2.0^(1-σ))) # smallest positive normal Float32
printlnbits(Float32(2.0^(2^Q-2-σ) * (2-εₘ))) # largest Float32

00000000100000000000000000000000
01111111011111111111111111111111

For a given floating-point type, we can find these constants using the following functions:

eps(Float32), floatmin(Float32), floatmax(Float32)

(1.1920929f-7, 1.1754944f-38, 3.4028235f38)

Example (creating a sub-normal number)
If we divide the smallest normal number by two, we get a sub-normal number:

mn = floatmin(Float32) # smallest normal Float32
printlnbits(mn)
printbits(mn/2)

00000000100000000000000000000000
00000000010000000000000000000000

Can you explain the bits?

Special numbers¶
The special numbers extend the real line by adding $\pm \infty$ but also a notion of "not-a-number".

Definition (not a number)
Let ${\rm NaN}$ represent "not a number" and define
F^{\rm special} := \{\infty, -\infty, {\rm NaN}\}

Whenever the exponent bits $q$ of a floating-point number are all 1 then it represents an element of $F^{\rm special}$.
If all $b_k = 0$, then the number represents $\pm\infty$, called Inf and -Inf for 64-bit floating-point numbers
(or Inf16, Inf32 for 16-bit and 32-bit, respectively):

printlnbits(Inf16)
printbits(-Inf16)

0111110000000000
1111110000000000

All other special floating-point numbers represent ${\rm NaN}$. One particular representation of ${\rm NaN}$
is denoted by NaN for 64-bit floating-point numbers (or NaN16, NaN32 for 16-bit and 32-bit, respectively):

printbits(NaN16)

0111111000000000

These are needed for undefined algebraic operations, such as dividing zero by zero.

Example (many NaNs)
What happens if we change some other $b_k$ to be nonzero? We can create the bits as a string and see:

i = parse(UInt16, "0111110000010001"; base=2)
reinterpret(Float16, i)

NaN16

Thus, there is more than one NaN on a computer.
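
To see how these special numbers arise from ordinary arithmetic (an illustrative aside, not from the original notes):

1/0.0, -1/0.0, 0.0/0.0 # returns (Inf, -Inf, NaN)
Inf - Inf # also NaN: the difference of two infinities is undefined
NaN == NaN # returns false: a NaN compares unequal even to itself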
4. Arithmetic¶
Arithmetic operations on floating-point numbers are exact up to rounding. There are three basic rounding strategies: round up, round down, and round to nearest.
Mathematically we introduce a function to capture the notion of rounding:

Definition (rounding)
${\rm fl}^{\rm up}_{σ,Q,S} : \mathbb R \rightarrow F_{σ,Q,S}$ denotes the function that rounds a real number up to the nearest floating-point number that is greater than or equal.
${\rm fl}^{\rm down}_{σ,Q,S} : \mathbb R \rightarrow F_{σ,Q,S}$ denotes the function that rounds a real number down to the nearest floating-point number that is less than or equal.
${\rm fl}^{\rm nearest}_{σ,Q,S} : \mathbb R \rightarrow F_{σ,Q,S}$ denotes the function that rounds a real number to the nearest floating-point number. In case of a tie, it returns the floating-point number whose least significant bit is equal to zero.
We use the notation ${\rm fl}$ when $σ,Q,S$ and the rounding mode are implied by context, with ${\rm fl}^{\rm nearest}$ being the default rounding mode.

In Julia, the rounding mode is specified by the tags RoundUp, RoundDown, and RoundNearest.
(There are also more exotic rounding strategies RoundToZero, RoundNearestTiesAway and RoundNearestTiesUp that we won't use.)

WARNING (rounding performance, advanced)
These rounding modes are part of the FPU instruction set so will be (roughly) equally fast as the default, RoundNearest.
Unfortunately, changing the rounding mode is expensive and is not thread-safe.

Let's try rounding a Float64 to a Float32.

printlnbits(1/3) # 64 bits
printbits(Float32(1/3)) # round to nearest 32-bit

0011111111010101010101010101010101010101010101010101010101010101
00111110101010101010101010101011
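
As a sketch of how the rounding mode can be controlled in practice, the following assumes the setrounding(T, mode) do ... end form provided by SetRounding.jl (loaded at the start of this chapter); it temporarily changes the rounding mode for Float32 arithmetic so that the two results bracket the exact value:

x = setrounding(Float32, RoundDown) do
    1f0/3 # rounds down: the largest Float32 below 1/3
end
y = setrounding(Float32, RoundUp) do
    1f0/3 # rounds up: the smallest Float32 above 1/3
end
x <= 1/3 <= y # returns true: the exact value is bracketed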