COMP6245: Coursework

1 Due date

The hand-in date for the assignment is Monday, December 3, 2018.

2 Introduction

You should implement all the exercises yourself. Tools such as scikit-learn are there for you to use in practical contexts, but to understand the underpinnings of many of those implementations you need to work them out for yourself. However, for many of the data preparation steps, such as loading and randomly splitting data into training and test sets, you can use canned routines such as those in scikit-learn.

A useful blog for many of the machine learning algorithms you will encounter in the module is at https://jermwatt.github.io/mlrefined/index.html, based on the book [5]. There are umpteen references on the web you may wish to consult. However, for most of the exercises, [2] and [3] should give you food for thought and influence the way you write your report.

The report should focus on the key conceptual steps that underpin each algorithm, illustrating your understanding by pointing to the results you obtain, which should be summarised in graphical or tabular form. Some of these are asked for explicitly in the questions below; others are left to your judgement. All else being equal, a well-articulated report will get a higher mark than one where the same results are described, but less cogently.

3 Classification

You will explore two methods on some toy datasets in order to get familiar with some of the issues to do with projections and data transformations. You will generate one of the two datasets yourself using a random number generator. The other is an old classic – Ronald Fisher's Iris dataset – which can be loaded using scikit-learn's data loader (just as you loaded the diabetes dataset).

You will first explore Fisher's LDA for binary classification with class labels a and b. In Fisher's method a direction defined by a vector w = (w1, w2)^T is chosen so that the data x_n projected onto w maximise the separation between the a and b type distributions. (Here n is a data index, ranging from 1 to n_a for class a and from 1 to n_b for class b, and is used as a superscript; it is not the n-th power of x.) Setting y_n^c ≜ w · x_n^c for class label c ∈ {a, b}, the scalar means and variances of the projected data become

\[
\mu_c = \frac{1}{n_c} \sum_{n=1}^{n_c} y_n^c, \qquad
\sigma_c^2 = \frac{1}{n_c} \sum_{n=1}^{n_c} \left( y_n^c - \mu_c \right)^2, \qquad c \in \{a, b\}.
\]

The direction w is chosen in order to maximise the Fisher ratio F(w):

\[
F(\mathbf{w}) \;\triangleq\; \frac{(\mu_a - \mu_b)^2}{\dfrac{n_a}{n_a + n_b}\,\sigma_a^2 + \dfrac{n_b}{n_a + n_b}\,\sigma_b^2}.
\]
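As a minimal illustrative sketch (not part of the required hand-in), the Fisher ratio above can be evaluated directly from two data matrices; the names fisher_ratio, Xa and Xb are assumptions made here, not something prescribed by the coursework:

import numpy as np

def fisher_ratio(w, Xa, Xb):
    """Fisher ratio F(w) for class-a data Xa (na x 2) and class-b data Xb (nb x 2)."""
    ya, yb = Xa @ w, Xb @ w                      # projected values y_n^a, y_n^b
    na, nb = len(ya), len(yb)
    mu_a, mu_b = ya.mean(), yb.mean()
    var_a, var_b = ya.var(), yb.var()            # (1/n_c) * sum_n (y_n^c - mu_c)^2
    pooled = (na / (na + nb)) * var_a + (nb / (na + nb)) * var_b
    return (mu_a - mu_b) ** 2 / pooled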

3.1 Data 1: separate 2 Gaussians

The data for this part are to be generated by you. Generate data from two 2-dimensional Gaussian distributions

xa ∼ N(x|ma, Sa), xb ∼ N(x|mb, Sb)

where ma, mb are the (2 × 1) mean vectors and Sa and Sb are the (2 × 2) covariance matrices that define the normal distributions. Let the number of data points of each type a, b be na and nb. If the classes are widely separated, or if they overlap significantly, it will be hard for you to explore and demonstrate the consequences of choosing different directions, and thus learn from the experience. Hence, make the differences between the means of the same order of magnitude as the standard deviations. The precise values will affect the results of the following tasks, and you can experiment with numerical values that help you appreciate the pros and cons of the classification method.
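For instance, a minimal sketch of this data generation; the particular means, covariances, sample sizes and seed below are illustrative assumptions only:

import numpy as np

np.random.seed(0)                                # fixed seed for reproducibility

na, nb = 200, 100                                # class sizes n_a, n_b (example values)
m_a = np.array([0.0, 0.0])                       # mean of class a
m_b = np.array([1.5, 1.0])                       # mean of class b, offset of roughly one std dev
S_a = np.array([[1.0, 0.3],
                [0.3, 1.0]])                     # covariance of class a
S_b = np.array([[1.0, -0.2],
                [-0.2, 2.0]])                    # covariance of class b

Xa = np.random.multivariate_normal(m_a, S_a, size=na)   # (na, 2) class-a points
Xb = np.random.multivariate_normal(m_b, S_b, size=nb)   # (nb, 2) class-b points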

The following tasks are broken up into two chunks to demarcate how the marks are to be allotted, even though they are all connected. You will need to write up your observations in no more than 2 pages, providing evidence based on your experiments, and discuss what you have learned. You should read Section 4.4 of [4] (or Section 4.3 of [3]) for guidance and ideas.

1. This part is for visual exploration of the consequences of projecting data onto a lower dimension.

[3 marks]

(a) Make a few (between 2 and 4) illustrative choices for the direction w and plot the histograms of the projected values y_n^a and y_n^b.

(b) Plot the dependence of F(w) on the direction of w by taking some random starting weight vector w(0) = (w1, w2)^T and rotating it by angles θ to get w(θ) = R(θ)w(0), where

\[
R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.
\]

Find the maximum value of F(w(θ)) and the corresponding direction w∗ (a sketch of this direct search appears after this part):

\[
\mathbf{w}^{*} = \operatorname*{arg\,max}_{\mathbf{w}} F(\mathbf{w}) = \operatorname*{arg\,max}_{\theta} F(\mathbf{w}(\theta)).
\]
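A possible sketch of this direct search over θ, reusing the illustrative fisher_ratio, Xa and Xb from the sketches above; the grid of 500 angles is an arbitrary choice, and scanning [0, π) suffices because F depends only on the direction of w:

import numpy as np

thetas = np.linspace(0.0, np.pi, 500)            # angles to scan
w0 = np.array([1.0, 0.0])                        # a starting vector w(0); any non-zero choice works

def rotate(theta, w):
    """Apply the 2x2 rotation matrix R(theta) to w."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ w

F_vals = np.array([fisher_ratio(rotate(t, w0), Xa, Xb) for t in thetas])
theta_star = thetas[np.argmax(F_vals)]
w_star = rotate(theta_star, w0)                  # direction w* maximising F(w(theta))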
2. This part makes you look at probability distributions visually by drawing contour plots and/or histograms.

[4 marks]

(a) Since the generating distributions for the classes are known, plot the equi-probable contour lines for each class and draw the direction of the optimal choice of vector w.

(b) Use Bayes' theorem to write down the logarithm of the ratio of conditional probabilities (called the log-odds)

\[
\ln \frac{P(c = a \mid \mathbf{x}_n)}{P(c = b \mid \mathbf{x}_n)}
\]

and plot the decision boundary where this quantity vanishes. In other words, for each point x ∈ R² check whether the log-odds is positive, negative or zero. The decision boundary is the set of points where the log-odds vanishes. (A sketch for the contours and this boundary appears after part (c) below.)

Do this for the two cases Sa = Sb and Sa ≠ Sb.


(c) Explore the consequences of using a formula for the discriminant

\[
F_{\text{unbalanced}}(\mathbf{w}) \;\triangleq\; \frac{(\mu_a - \mu_b)^2}{\sigma_a^2 + \sigma_b^2}
\]

that does not account for the different fractions of data in each class. How is this accounted for in the application of Bayes' rule?
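A minimal sketch covering parts 2(a) and 2(b), using scipy.stats.multivariate_normal for the class densities and reusing the illustrative m_a, S_a, m_b, S_b, na, nb and w_star from the sketches above; the grid limits and the use of the class fractions as priors are assumptions for illustration:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

pa, pb = na / (na + nb), nb / (na + nb)          # class priors P(c=a), P(c=b)
dist_a = multivariate_normal(m_a, S_a)
dist_b = multivariate_normal(m_b, S_b)

xx, yy = np.meshgrid(np.linspace(-4, 6, 300), np.linspace(-4, 6, 300))
grid = np.dstack([xx, yy])                       # (300, 300, 2) grid of points x

# log-odds ln[P(c=a|x)/P(c=b|x)] = ln p(x|a) + ln P(a) - ln p(x|b) - ln P(b)
log_odds = (dist_a.logpdf(grid) + np.log(pa)) - (dist_b.logpdf(grid) + np.log(pb))

plt.contour(xx, yy, dist_a.pdf(grid), levels=5, colors="C0")    # class-a equi-probable contours
plt.contour(xx, yy, dist_b.pdf(grid), levels=5, colors="C1")    # class-b equi-probable contours
plt.contour(xx, yy, log_odds, levels=[0.0], colors="k")         # decision boundary (log-odds = 0)
plt.quiver(*m_a, *w_star, angles="xy", scale_units="xy", scale=1)   # optimal direction w*
plt.show()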

3.2 Data 2: Iris data

[5 marks]

In this section you will perform the same LDA task on the famous Iris dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set), which can be downloaded from http://archive.ics.uci.edu/ml/datasets/Iris. While the previous exercise was done by finding the weight vector to project onto by direct search, here you follow Fisher and solve the generalised eigenvalue condition for the optimal weights.

With more features and classes, you will need to compute the between-class and within-class variance-covariance matrices ΣB and ΣW:

\[
\Sigma_B = \sum_{c} (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{T}, \quad \text{where } \boldsymbol{\mu} \text{ is the mean of the class means,}
\]

and ΣW is the sum of the covariance matrices for each class. In case there are different numbers of training data points from each class, you have to scale any class dependence by the corresponding fraction of class members in the population. (In the Iris data set all three classes have 50 members, so you can skip this step.)
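A minimal sketch of these two matrices for the Iris data (variable names are illustrative; the unweighted sums are used since the Iris classes are balanced, as noted above):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)                # X: (150, 4) features, y: labels 0, 1, 2
classes = np.unique(y)

class_means = np.array([X[y == c].mean(axis=0) for c in classes])   # per-class means, shape (3, 4)
mu = class_means.mean(axis=0)                    # mean of the class means

# Between-class scatter: sum over classes of (mu_c - mu)(mu_c - mu)^T
Sigma_B = sum(np.outer(m - mu, m - mu) for m in class_means)

# Within-class scatter: sum of the per-class covariance matrices
Sigma_W = sum(np.cov(X[y == c], rowvar=False) for c in classes)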

1. Find the optimal direction w∗ for projecting the data onto. You will need to solve the generalised eigenvalue problem ΣBw = λΣWw. Sections 16.2 and 16.3 of [1] and eq. (4.15) of [3] have further details.

NB: The standard libraries in numpy/scipy (and others) can give you the generalised eigenvalues and eigenvectors. Since the covariance matrix is symmetric, the function you should call in numpy/scipy is eigh, and not eig, although for problems of this scale it won't make a difference and you can eyeball the eigenvalues. In particular, the eigenvalues returned by eigh are sorted. You must also check the answer provided by verifying that the generalised eigenvalue condition (ΣB − λΣW)w = 0 holds. This will clarify the notational conventions of the software used. Sometimes it is the transpose of the returned matrix of vectors that contains the eigenvectors, so please make sure you understand what is being returned. (A sketch of this solve-and-check appears after task 4 below.)

2. Display the histograms of the three classes in the reduced dimensional space defined by w∗.

3. What would happen if, instead of choosing the generalised eigenvector w∗, you project onto a different vector w = w∗ + a? You should consider a to be constructed out of the other generalised eigenvectors (why?).

4. Present your results with reflections and evidence in no more than 2 pages. You need to discuss the relation between the class separation task and the generalised eigenvector formulation, taking into account the question in the last part. You should comment on how this method compares with the method in the two-Gaussian separation exercise above.
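A minimal sketch of the generalised eigen-solution, the suggested check and the projection histograms, reusing Sigma_B, Sigma_W, X, y and classes from the sketch above. scipy.linalg.eigh(A, B) solves the generalised symmetric problem and returns eigenvalues in ascending order, so the last column holds the leading eigenvector; treat the details here as assumptions to verify rather than a prescribed solution:

import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import eigh

# Generalised symmetric eigenproblem  Sigma_B w = lambda Sigma_W w
evals, evecs = eigh(Sigma_B, Sigma_W)            # eigenvalues ascending, eigenvectors in columns
w_star = evecs[:, -1]                            # eigenvector for the largest eigenvalue

# Check the condition (Sigma_B - lambda Sigma_W) w = 0 for the leading pair
residual = Sigma_B @ w_star - evals[-1] * (Sigma_W @ w_star)
print("max |residual| =", np.abs(residual).max())    # should be close to machine precision

# Project the data onto w* and plot one histogram per class
z = X @ w_star
for c in classes:
    plt.hist(z[y == c], bins=20, alpha=0.5, label="class %d" % c)
plt.legend()
plt.show()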

4 Linear Regression with non-linear functions

[4 marks]

In this exercise you have to perform the task of fitting polynomial functions to y(x) = sin(x) + ε, where ε is a noise term. You have to generate a number N of points for 0 ⩽ x < 2π with a small noise term.
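For example, a minimal sketch of this data generation (N, the noise scale and the seed are illustrative choices):

import numpy as np

np.random.seed(1)
N = 50                                           # number of points (example value)
x = np.sort(np.random.uniform(0.0, 2.0 * np.pi, N))
eps = 0.1 * np.random.randn(N)                   # small Gaussian noise term
y = np.sin(x) + eps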

4.1 Performing linear regression

Fit polynomial functions φ of different degrees of your choice to this data.

1. Learn the weights w = (w0, . . . , wp) by minimising the loss function

\[
\sum_{n=1}^{N} \Big( y_n - w_0 - \sum_{j=1}^{p} w_j\, \varphi_j(x_n) \Big)^2 + \lambda\, C(\mathbf{w})
\]

by gradient descent. C(w) is a penalty on the norm of the weights. Choose C(w) = ∥w∥², the L2 norm penalty.

2. For the L2 norm penalty, obtain the weights from the analytical expression

\[
\mathbf{w} = (X^{T} X + \lambda I_{p+1})^{-1} X^{T} \mathbf{y}, \qquad (1)
\]

where Ip+1 stands for a (p + 1) × (p + 1) identity matrix.

3. Split up your data into a training set Str and a test set Sts, and measure the mean of the squared residuals on the test set Sts as a performance metric for each model (a sketch implementing eq. (1) and this measure appears at the end of Section 4.2). Plot this measure as a function of:

(a) the complexity of the model. This could be the number of basis components φj (and therefore the number of weights p, or the degree of the polynomial);

(b) the strength of the regularisation coefficient λ.

4.2 How does linear regression generalise?

[4 marks]

This section gets you to explore the models learned by linear regression, by looking at the dependence of the optimal weights on the training data and the scale of regularisation. When writing up the results of this section on the ability of the models to generalise using the tasks below, try to relate the weights from eq. (1) to your analysis.

1. Split up your data into 10 (roughly) equal parts for 10-fold cross-validation. Evaluate the models learned on each of the 9/10 splits of the N data points on the remaining N/10. The results of this evaluation relate to the generalisation performance of the model (see the sketch below).
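A minimal sketch tying these pieces together, reusing x and y from the data-generation sketch in Section 4: a polynomial design matrix, the closed-form ridge weights of eq. (1), the test-set mean squared residual asked for in Section 4.1, and the 10-fold evaluation above. The helper names, the split fraction and the example degrees and λ values are illustrative choices; sklearn's splitting utilities are the kind of canned routine the introduction permits.

import numpy as np
from sklearn.model_selection import KFold, train_test_split

def design_matrix(x, degree):
    """Columns phi_j(x) = x**j for j = 0..degree (column 0 multiplies w_0)."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_weights(X, y, lam):
    """Closed-form weights w = (X^T X + lam I_{p+1})^{-1} X^T y, eq. (1)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def test_mse(x_tr, y_tr, x_ts, y_ts, degree, lam):
    """Mean of the squared residuals on the test set for one model."""
    w = ridge_weights(design_matrix(x_tr, degree), y_tr, lam)
    return np.mean((y_ts - design_matrix(x_ts, degree) @ w) ** 2)

# Section 4.1, task 3: a single train/test split, varying complexity and lambda
x_tr, x_ts, y_tr, y_ts = train_test_split(x, y, test_size=0.3, random_state=0)
for degree in [1, 3, 5, 9]:
    for lam in [0.0, 1e-3, 1e-1]:
        print(degree, lam, test_mse(x_tr, y_tr, x_ts, y_ts, degree, lam))

# Section 4.2: 10-fold cross-validation for one example model
degree, lam = 5, 1e-3
folds = KFold(n_splits=10, shuffle=True, random_state=0)
cv_mse = [test_mse(x[tr], y[tr], x[ts], y[ts], degree, lam) for tr, ts in folds.split(x)]
print("10-fold mean squared residual:", np.mean(cv_mse), "+/-", np.std(cv_mse))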

References

[1] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 04-2011 edition, 2011.

[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.


[3] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2nd edition, 2009.

[4] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated, 2014.

[5] Jeremy Watt, Reza Borhani, and Aggelos K. Katsaggelos. Machine Learning Refined: Foundations, Algorithms, and Applications. Cambridge University Press, 2016.
