1
1.
Least Squares Fitting
Find the best least squares straight line fit to the following measurements, and sketch your solution:
y = 2 at t = −1, y = 0 at t = 0,
y = −3 at t = 1, y = −5 at t = 2.
A middle-aged man was stretched on a rack to lengths L = 5, 6, and 7 feet under applied forces of F = 1, 2 and 4 tones. Assuming Hooke’s Law L = a + bF , find his normal length a by least squares.
Let
y = b0 +b1x1 +···+bp−1xp−1 (1)
denote the hyperplane in Rp that is the best least square hyperplane fit to a given collection of data points (yk, xk,1, . . . , xk,p−1) ∈ Rp, 1 ≤ k ≤ n. Derive the norm equations, in matrix notation, for determining this hyperplane. (That is:
Either
(a) Describe, in terms of partial derivatives, the normal equations that determine the constants b0, . . . , bn.
(b) Then use matrix notation to express these normal equations, making sure to define those matrices involved.
Or, more easily and more preferably,
Observe that on letting || || denote the Euclidean norm and writing
2.
3.
y1 1 x1,1 ··· x1,p−1 b0 .. .
y= . , A= . , b= . , yn 1 xn,1 ··· xn,p−1 bp
our least squares requirement is simply that b should be chosen so that ||Ab − y|| is as small as possible. In other words, the vector Ab − y should be perpendicular to the plane spanned by the vectors of the form Au with u ∈ Rp. In other words, for an arbitrary vector u ∈ Rp we need 0 = utAt(Ab − y) or, since u is arbitrary,
0 = At(Ab − y) (2)
Convince yourself that (2) is the (correct way to think of the) system of normal equations. )
1
4. Lety=b0+b1xdenotethebestleastsquaresstraightlinefittogivendatapoints(y1,x1),…,(yn,xn).
(a) Define the fitted values yˆi, residuals ei, sample mean y, total sum of squares SST O, error sum of squares SSE, regression sum of squares SSR, and coefficient of determination R2.
(b) Prove that ni=1 ei = 0.
(c) Prove that ni=1 yˆiei = 0.
(d) ProvethatSSTO=SSE+SSR.
(e) Provethat1≤R2 ≤1.
The theory of statistical inference can be applied to the output from a least squares fit. This topic is outside the main focus of this module. Nevertheless, the following two questions touch on the topic. These kinds of questions won’t appear on the exam.
2.
2
1.
2. 3.
4.
1.
The theory of statistical inference can be applied to the output from a least squares fitting. Suppose given a random variable
yi =β0 +β1xi,1 +···βp−1xi,p−1 +ǫi
where
• i = 1,2,…,n;
• xi,1,…,xi,p−1 areknownconstants;
• β0,…,βp−1areunkownfixedparameters;
• ǫi are independent random variables with common normal distribution N(0,σ2).
Describe a criterion for choosing between the two hypotheses C1 : β1 =β2 =···βp−1 =0
C2 : βi ̸=0foratleastone1≤i≤p−1
that controls Type I errors at level α.
In the context of the previous question, suppose that q ≤ p of the parameters βk need to be estimated jointly. Describe the Bonferroni confidence intervals with family coefficient 1 − α for these q parametrers.
Principal Component Analysis
Explain what is meant by Principal Component Analysis. Your explanation should include explanations of the terms: geometric information; covariance matrix; orthogonal transforma- tion; Spectral Theorem and describe how the technique can be used to reduce dimensionality while retaining much geometric information.
Prove that any real symmetric matrix has at least one real eigenvector.
Use the fact that any real symmetric n × n matrix A has at least one eigenvector to prove
that it has n linearly independent real eigenvectors.
Determinethemaximumvalueofthefunctionf(x,y)=x2+4xy+4y2 ontheunitsphere S1 = {(x,y) ∈ R2 : x2 + y2 = 1}. Also, find a linear homomorphism φ:R2 → R2,(x,y) → (x′,y′)suchthatfφ(x,y)=λ1x′2+λ2y′2 forsomeλ1,λ2 ∈R.
2
5.
6.
3
1.
2.
3.
4.
A data set S ⊂ Rp consists of vectors whose first component is the number of kilometers that a salesperson has travelled during the last month. A principal component analysis is performed on S, the set S is then projected onto the three principal components with largest eigenvalues, and the projected points are visualized in R3. Would this visualization be any different if distance had been measured in miles? Justify your answer.
Suppose given a finite set S of data points in R3 and that a visual inspection suggests that all points look to lie close to some 2-dimensional plane containing the origin. We could construct a plane by regarding the first coordinate as a dependent variable and taking a least squares fit. Alternatively, we could construct a plane using Principal Component Analysis and taking the span of the eigenvectors corresponding to the two larger eignevalues. In general, would the two constructed planes differ? If so, in what way?
Clustering and Persistence
Describe an algorithm that inputs an n×n distance (or dissimilarity) matrix for n items, ap- plies single-linkage hierarchical clustering, and returns the corrsponding barcode. Determine a worst-case time estimate for the algorithm.
Describe the Smith-Waterman algorithm for determining the optimal score of a local align- ment of two sequences of letters. Include an explanation of the terms local alignment and optimal score.
Perform the Smith-Waterman algorithm to find an optimal local alignment for the sequences X=TGCATA, Y =ATCTGAT
using column scores
δ x =1, δ xy =−1if x̸=y, δ x =−1.
Explain how cluster analysis and barcodes can be used to estimate the number of objects in the digital photograph.
Explain how cluster analysis can also be used to estimate the number of objects with holes.
3
6.
4
1. 2.
3. 4. 5.
5
1.
5.
Explain how cluster analysis and barcodes can be used to estimate the number of ‘limbs’ of an object such as a starfish from a digital image of the object.
A property I(S) of a set S ⊂ Rn is said to be a geometric invariant if its value does not change when S undergoes an orthogonal tranaformation. Explain why the barcodes in the preceding questions (3) and (4) are geometric invariants.
Nearest Neighbours
Describe a real-life situation where a k-th nearest neighbour algorithm could be used to make decisions.
Describe a method, based on Voronoi regions, for finding the point in a finite subset S of low- dimensional Euclidean space Rd that is closest to some given point v ∈ Rn. Your description should define the terms Voronoi region, facet, neighbour and describe an algorithm which assumes Voronoi regions have been computed for all points in S.
State the Johnson-Lindenstrauss Theorem. State the Norm Preservation Proposition.
Use the Norm Preservation Proposition to prove the Johnson-Lindenstrauss Theorem. (This wasn’t covered this year due to lack of time.)
Python Part I
Suppose that you have a text file with the following format:
p[0] = (12.3, 4.5) p[1] = (-1.6, 7.9) p[2] = (11.0, 9.8) …
(a) Write Python code that reads the file and creates two lists: xs and ys with the x-values and y-values of the points. For the example file above the initial parts of the lists would be xs = [12.3, -1.6, 11.0,…] and ys = [4.5, 7.9, 9.8, …]
(b) Write Python code that creates a plot of the points using matplotlib. (a) Give a concise explanation of the following concepts
2.
4
i. Classes and objects in Python.
ii. Abstract classes and how they can be simulated in Python.
iii. Some of Python’s special methods, namely __init__, __str__ and __call__.
(b) In a Python program dealing with vectors, you might need to calculate the distance between two vectors using different norms. Therefore you decide to implement classes EuclideanDist and InfinityDist for calculating the distance using the Euclidean norm, and the infinity norm, respectively. Show how to do this, by completing the code below:
class Distance:
def __init__(self):
# your code here
def __call__(self, p, q): # your code here
def __str__(self): # your code here
class EuclideanDistance(Distance): def __init__(self):
# your code here
def __call__(self, p, q): # your code here
def __str__(self): # your code here
class InfinityDistance(Distance): def __init__(self):
# your code here
def __call__(self, p, q): # your code here
def __str__(self): # your code here
Your implementation should allow for the following usage in the program.
euclideanDist = EuclideanDistance()
d = euclideanDist(vec1, vec2)
print(“The distance between vec1 and vec2 is”,
d, “using the”, euclideanDist, “distance”) assuming that vec1 and vec2 are vectors on the appropriate form.
3. Suppose that you are given two text files named a.txt and b.txt respectively, both con- taining n lines with n entries each. For example, one of the files could have the following contents:
12 4 9-1 1600 3457
21 15 -4 18
Write Python code that
(a) Reads the contents of the files into two numpy arrays called a and b. (b) Compute a new numpy array named c.
(c) Write c to the file c.txt using the same file format as for a.txt and b.txt 5
6
1.
2.
Use exceptions to handle any IO errors. Your program should not crash even if the reading and writing to files does not succeed.
CS4103
Simplicial complexes, clique complexes, and Mapper clus- tering
Define what is meant by a geometric k-simplex, a simplicial complex K, and the geometric realization |K| of a finite simplicial complex K. Illustrate your answers with a simple example.
Which of the following K are simplicial complexes? For those that are, sketch the simplicial complex, and determine the Euler characteristic
χ(K)=α0 −α1 +α2 −α3 +··· where αk denotes the number of k-simplexes.
(a) V = {1,2,3,4,5,6,7,8},K = {{1},{2},{3},{4},{5},{6},{7},{8}, {1,2},{1,6},{2,3}, {2,4},{2,5},{3,4},{3,5},{4,5},{4,6},{5,6}, {2,3,4},{2,3,5},{2,4,5},{3,4,5}}.
(b) V = {1,2,3,4,5,6,7,8},K = {{1},{2},{3},{4},{5},{6},{7},{8}, {1,2},{1,6},{2,3}, {2,4},{1,5},{3,4},{3,5},{4,5},{4,6},{5,6}, {2,3,4},{2,3,5},{2,4,5},{3,4,5}}.
Explain how, for each ǫ ≥ 0, one can associate a clique complex Kǫ to a symmetric n × n matrix of distances between n items. For ǫ = 2.5 and for the following 6 × 6 matrix of
3.
4. 5. 6. 7. 8. 9.
determine the simplicial complex Kǫ; then sketch the geometric realization |Kǫ| and calculate the Euler characteristic χ(Kǫ).
Consider the collection U = {{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4.6}, {1, 6}} of sets. Sketch the nerve NU and calculate its Euler characteristic χ(NU).
Define what is means for two maps to be homotopic and for two spaces to be homotopy equivalent.
Let Y ⊂ En be an arbitrary convex subset of Euclidean space and let X be an arbirary topological space. Prove that any two continuous maps f, g: X → Y are homotopy equivalent.
Prove that any convex subspace Y ⊂ En is homotopy equivalent to the space consisting of a single point.
Prove that homotopy equivalence of maps f ≃ g is an equivalence relation on the set of continuous maps X → Y from a given space X to a given space Y .
Give a careful proof that S1 = {z ∈ C : |z| = 1} is homotopy equivalent to C \ {0}.
distances
012333 101333 2 1 0 1 3 3 3 3 1 0 1 2 3 3 3 1 0 1 333210
6
10.
11. 12.
7
1.
2.
3.
4.
State Leray’s theorem about the nerve NU of an open cover U of a space X. Illustrate the theorem by using an appropriate open cover of the annulus X = {z ∈ C : 1 ≤ |z| ≤ 2}. Calculate the Euler characteristic NU in your illustration.
Give an account of the Mapper clustering algorithm. In your account, illustrate the algorithm on a set of points chosen from an annulus.
Give an example of a finite data set S ⊂ R3 for which ‘topological information’ will be lost under any linear projection R3 → R2, but for which this information might be preserved under a map S → K to a 2-dimensional simplicial complex K produced from the Mapper clustering procedure.
Homology and Persistent Homology
Consider the mod-2 chain complex C∗K of the simplicial complex with V = {1, 2, 3, 4}, K = {{1}, {2}, {3}, {4}, {1, 2}, {1, 3}, {2, 3}, {2, 4}, {3, 4}, {1, 2, 3}}.
(a) Determine the matrices for the linear homomorphisms d2: C1K → C0K and d1: C2K → C1K.
(b) Use these matrices to compute H0(C∗K) and H1(C∗K).
Consider the simplicial complex with V = {1, 2, 3, 4, 5, 6}, and with K consisting of the six 2-simplices {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 5, 6}, {1, 2, 6} together with all non-empty subsets of these 2-simplices. Sketch the geometric realization K. Then exhibit a simplicial subcomplex L of K for which the inclusion |L| ֒→ |K| is a homotopy equivalence. (Don’t prove the homotopy equivalence.) Hence determine the mod-2 homology Hn(K) for all n ≥ 0.
Explain what is meant by:
(a) a simplicial complex K,
(b) a filtered simplicial complex K,
(c) the degree n persistent Betti numbers βs,t of a filtered simplicial complex K. n
The degree 0 and degree 1 persistent Betti numbers of a certain filtered simplicial complex K are given by the matrices
3000 1000 βs,t=0 3 1 0 βs,t=0 3 2 2.
0 0 0 2 1 1 0 0 2 2 0003 0003
How many ‘persistent connected components’ and how many ‘persistent 1-dimensional holes’ does K have?
7