
CSCI 567: Machine Learning


Lecture 9, Nov 3

Administrivia
• Project details are out
• Form groups (of 4) by Nov 11; the minimum team size is 3.
• Q5 on HW4 will help you get started on it.
• We’ll give an overview of the project and general tips in today’s discussion.

• HW4 is due in about two weeks (Nov 16 at 2pm).
• We’ll release another question on PCA tomorrow.

• Today’s plan:
• Finish ensemble methods
• Unsupervised learning:
• Clustering

Ensemble methods

• Random forests

• Boosting: Basics

• AdaBoost

• Gradient boosting

Bagging

Collect T subsets, each of some fixed size (say m), by sampling with replacement from the training data.

Let f_t(x) be the classifier (such as a decision tree) obtained by training on subset t ∈ {1, . . . , T}. Then the aggregated classifier f_agg(x) is given by:

f_agg(x) = (1/T) Σ_{t=1}^T f_t(x)  for regression,
f_agg(x) = MajorityVote{f_t(x)}_{t=1}^T  for classification.

• Reduces overfitting (i.e., variance)
• Can work with any type of classifier (here we focus on trees)
• Easy to parallelize (multiple trees can be trained in parallel)
• But loses the interpretability of a single decision tree (true for all ensemble methods)
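The aggregation rule above is easy to express in code. Below is a minimal sketch, not from the lecture: the helper names (bagging_fit, bagging_predict) are illustrative, the base learner is assumed to be a scikit-learn decision tree, and class labels are assumed to be non-negative integers.

```python
# Minimal bagging sketch (illustrative only): train T trees on bootstrap
# samples and aggregate by majority vote (classification case).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=50, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = n if m is None else m               # bootstrap sample size
    trees = []
    for _ in range(T):
        idx = rng.integers(0, n, size=m)    # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    # Majority vote over the T trees; assumes labels are 0, 1, 2, ...
    votes = np.stack([t.predict(X) for t in trees])   # shape (T, n_test)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

For regression, replacing the majority vote with the average of t.predict(X) over the T trees gives the f_agg formula above.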

Ensemble methods

• Random forests

• Boosting: Basics

• AdaBoost

• Gradient boosting

Random forests: When growing a tree on a bootstrapped (i.e. subsampled) dataset,
before each split select m ≤ d of the d input variables at random as candidates for
splitting.

When m = d → same as Bagging
When m < d → Random forests

m is a hyperparameter, tuned via cross-validation.

Ensemble methods

• Random forests
• Boosting: Basics
• AdaBoost
• Gradient boosting

Boosting: Idea

The boosted predictor is of the form f_boost(x) = sign(h(x)), where

h(x) = Σ_{t=1}^T α_t f_t(x),  with α_t ≥ 0 and f_t ∈ F.

The goal is to minimize ℓ(h(x), y), summed over the training data, for some loss function ℓ.

Q: We know how to find the best predictor in F on some data, but how do we find the best weighted combination h(x)?

A: Minimize the loss by a greedy approach, i.e. find α_t, f_t(x) one by one for t = 1, . . . , T.

Specifically, let h_t(x) = Σ_{τ=1}^t α_τ f_τ(x). Suppose we have found h_{t−1}(x); how do we find α_t, f_t(x)? Find the α_t, f_t(x) which minimize the loss ℓ(h_t(x), y).

Different loss functions ℓ give different boosting algorithms:
• ℓ(h(x), y) = (h(x) − y)^2 → least squares boosting,
• ℓ(h(x), y) = exp(−h(x) y) → AdaBoost.

Ensemble methods

• Random forests
• Boosting: Basics
• AdaBoost
• Gradient boosting

AdaBoost: Full algorithm

Given a training set S and a base algorithm A, initialize D_1 to be uniform.

For t = 1, . . . , T:
• obtain a weak classifier f_t(x) ← A(S, D_t)
• calculate the weight α_t of f_t(x) as α_t = (1/2) ln((1 − ε_t)/ε_t)  (so α_t > 0 ⟺ ε_t < 0.5), where ε_t = Σ_{i : f_t(x_i) ≠ y_i} D_t(i) is the weighted error of f_t(x)
• update the distribution: D_{t+1}(i) ∝ D_t(i) e^{−α_t y_i f_t(x_i)}, i.e. D_t(i) e^{−α_t} if f_t(x_i) = y_i and D_t(i) e^{α_t} otherwise.

Output the final classifier f_boost = sgn(Σ_{t=1}^T α_t f_t(x)).

Put more weight on instances that are difficult to classify and less on those already handled well: new weak learners are added sequentially and focus their training on the more difficult patterns.

AdaBoost: Example

10 data points in R^2; the size of + or − indicates the weight, which starts from the uniform distribution D_1. The weak classifiers are vertical or horizontal half-planes, i.e. the base algorithm is a decision stump. (Toy-example figures from the slides.)

Go through the calculations in the example to make sure you understand the algorithm.

Ensemble methods: Gradient boosting

Ensemble methods

• Random forests
• Boosting: Basics
• AdaBoost
• Gradient boosting

Gradient Boosting

Recall h_t(x) = Σ_{τ=1}^t α_τ f_τ(x). For AdaBoost (exponential loss), given h_{t−1}(x), we found what f_t(x) should be. Gradient boosting provides an iterative approach for a general (any) loss function ℓ(h(x), y):

• For all training datapoints (x_i, y_i), compute the negative gradient (the "pseudo-residual")
  r_i = −∂ℓ(h(x_i), y_i)/∂h(x_i), evaluated at h(x_i) = h_{t−1}(x_i).
• Use the weak learner to find the f_t which fits (x_i, r_i) as well as possible:
  f_t = argmin_{f ∈ F} Σ_i (r_i − f(x_i))^2.
• Update h_t(x) = h_{t−1}(x) + η f_t(x), for some step size η.

Gradient Boosting

Usually we add some regularization term to prevent overfitting (penalize the size of the tree, etc.).

Gradient boosting is extremely successful! A variant, XGBoost, is one of the most popular algorithms for structured data (tables with numbers and categories, where each feature typically has some meaning, unlike images or text). For example, in Kaggle competitions back in 2015, 17 out of 29 winning solutions used XGBoost.
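The gradient-boosting loop is easy to sketch in code. The snippet below is not from the lecture; names and hyperparameter values are illustrative. It uses the squared loss, whose pseudo-residual is simply y_i − h_{t−1}(x_i), with shallow regression trees as the weak learner.

```python
# Minimal gradient-boosting sketch for squared loss (illustrative only).
# Weak learner: a shallow regression tree fit to the pseudo-residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, T=100, eta=0.1, max_depth=3):
    h = np.zeros(len(y))             # h_0(x) = 0
    trees = []
    for _ in range(T):
        r = y - h                    # negative gradient of 0.5*(y - h)^2 w.r.t. h
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        h += eta * tree.predict(X)   # h_t = h_{t-1} + eta * f_t
        trees.append(tree)
    return trees

def gradient_boost_predict(trees, X, eta=0.1):
    return eta * sum(t.predict(X) for t in trees)
```

XGBoost follows the same loop but adds explicit regularization on the tree structure and uses a second-order approximation of the loss.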
Unsupervised learning: PCA

A simplistic taxonomy of ML:
• Supervised learning: aims to predict outputs of future datapoints.
• Unsupervised learning: aims to discover hidden patterns and explore data.
• Reinforcement learning: aims to make sequential decisions.

Principal Component Analysis (PCA)

• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion

Acknowledgement & further reading

Our presentation is closely based on the notes for CS168 at Stanford:
https://web.stanford.edu/class/cs168/l/l7.pdf
https://web.stanford.edu/class/cs168/l/l8.pdf
You can refer to these notes for further reading.

Also review our linear algebra Colab notebooks:
https://colab.research.google.com/drive/1Wo6pC6ybebfIV8Nz06aHAVa1amjYDGQO?usp=sharing

Dimensionality reduction & PCA

We'll start with a simple and fundamental unsupervised learning problem: dimensionality reduction.

Goal: reduce the dimensionality of a dataset so that
• it is easier to visualize and discover patterns,
• it takes less time and space to process for any downstream application (classification, regression, etc.),
• noise is reduced.

There are many approaches; we focus on a linear method: Principal Component Analysis (PCA).

PCA: Motivation

Consider the following dataset:
• 17 features, each representing the average consumption of some food,
• 4 data points, each representing some country.

What can you tell? It is hard to say anything by looking at all 17 features. (Picture from http://setosa.io/ev/principal-component-analysis/; see https://people.duke.edu/~hpgavin/SystemID/References/Richardson-PCA-2009.pdf for more details.)

PCA: Motivation

PCA can help us! We project the data onto its first principal component (PC1), i.e. we reduce the dimensionality from 17 to just 1. Now one data point is clearly different from the rest! That turns out to be the data from Northern Ireland, the only country not on the island of Great Britain out of the 4 samples.

We can also interpret the components: PC1 tells us that the Northern Irish eat more grams of fresh potatoes and fewer of fresh fruits and alcoholic drinks.

PCA: Motivation

We can find the second (and more) principal components of the data too, and the components themselves are interpretable too. (See https://people.duke.edu/~hpgavin/SystemID/References/Richardson-PCA-2009.pdf for more details.)

Principal Component Analysis (PCA)

• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion

High-level goal

Suppose we have a dataset of n datapoints x_1, x_2, . . . , x_n ∈ R^d. The high-level goal of PCA is to find a set of k principal components (PCs) / principal vectors v_1, . . . , v_k ∈ R^d such that for each x_i,

x_i ≈ Σ_{j=1}^k α_ij v_j,  for some coefficients α_ij ∈ R.

Preprocessing the data

• Before we apply PCA, we usually preprocess the data to center it (subtract the mean of each coordinate).
• In many applications, it is also important to scale each coordinate properly. This is especially true if the coordinates are in different units or scales.

(Figure from CS168 notes.)

Objective function for PCA

Key difference from supervised learning problems: no labels are given, which means there is no ground truth to measure the quality of the answer! However, we can still write an optimization problem based on our high-level goal.

For clarity, we first discuss the special case of k = 1. The optimization problem for finding the 1st principal component v_1 is

v_1 = argmax_{v : ‖v‖_2 = 1} (1/n) Σ_{i=1}^n ⟨x_i, v⟩^2.

(Figure and worked example from CS168 notes.)
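As a quick sanity check of this k = 1 objective (not from the slides), the sketch below centers a toy 2-D dataset and simply sweeps over unit vectors v(θ) = (cos θ, sin θ), keeping the one with the largest average squared projection; the data and names are made up for illustration.

```python
# Illustrative only: brute-force the k = 1 PCA objective in 2-D by sweeping
# unit vectors and maximizing (1/n) * sum_i <x_i, v>^2 on centered data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # toy data
X = X - X.mean(axis=0)                      # center the data first

thetas = np.linspace(0, np.pi, 1000)        # directions are sign-invariant
vs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # unit vectors
objective = ((X @ vs.T) ** 2).mean(axis=0)  # (1/n) sum_i <x_i, v>^2 per v
v1 = vs[objective.argmax()]                 # approximate first PC
print("approximate first principal component:", v1)
```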
Objective function for larger values of k

The generalization of the original formulation to general k is to find a k-dimensional subspace S such that the points are as close to it as possible:

S = argmin_{k-dim subspaces S} Σ_{i=1}^n (distance between x_i and the subspace S)^2.

By the same reasoning as for k = 1, this is equivalent to

S = argmax_{k-dim subspaces S} Σ_{i=1}^n (length of x_i's projection onto S)^2.

It is useful to think of the subspace S as the span of k orthonormal vectors v_1, . . . , v_k ∈ R^d:
• k = 1: the span is a line through the origin.
• k = 2: if v_1, v_2 are linearly independent, the span is a plane through the origin, and so on.

Formal problem solved by PCA: Given x_1, . . . , x_n ∈ R^d and a parameter k ≥ 1, compute orthonormal vectors v_1, . . . , v_k ∈ R^d to maximize

(1/n) Σ_{i=1}^n Σ_{j=1}^k ⟨x_i, v_j⟩^2.

Equivalent view:
• Pick v_1 to be the variance-maximizing direction.
• Pick v_2 to be the variance-maximizing direction orthogonal to v_1.
• Pick v_3 to be the variance-maximizing direction orthogonal to v_1 and v_2, and so on.

Principal Component Analysis (PCA)

• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion

Using PCA for data compression and visualization

Input: n datapoints x_1, x_2, . . . , x_n ∈ R^d, and the number of components k we want.

Step 1: Perform PCA to get the top k principal components v_1, . . . , v_k ∈ R^d.

Step 2: For each datapoint x_i, define its "v_1-coordinate" as ⟨x_i, v_1⟩, its "v_2-coordinate" as ⟨x_i, v_2⟩, and so on. We thus associate k coordinates to each datapoint x_i, where the j-th coordinate denotes the extent to which x_i points in the direction of v_j.

Step 3: We now have a new "compressed" dataset where each datapoint is k-dimensional. For visualization, we can plot the point x_i in R^k as the point (⟨x_i, v_1⟩, ⟨x_i, v_2⟩, . . . , ⟨x_i, v_k⟩).

Visualization example: Do genomes encode geography?

Dataset: genomes of 1,387 Europeans (each individual's genotype at about 200,000 locations in the genome), i.e. n = 1,387 and d ≈ 200,000. Project the datapoints onto the top 2 PCs; the resulting plot looks remarkably like the map of Europe! (From the paper "Genes mirror geography within Europe", Novembre et al., Nature 2008.)

Compression example: 256×256 (≈65K pixels) images of about 2,500 faces, all framed similarly, i.e. n = 2,500 and d ≈ 65,000. We can represent each image with high accuracy using only 100–150 principal components! The principal components (called "eigenfaces" here) are themselves interpretable too. (From the paper "Eigenfaces for recognition", Turk & Pentland, Journal of Cognitive Neuroscience 1991.)

Principal Component Analysis (PCA)

• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion

How to solve the PCA optimization problem? The diagonal case

Let's solve argmax_{v : ‖v‖_2 = 1} v^⊤Av for the special case where A is a diagonal matrix.

Any d × d matrix A can be thought of as a function that maps points in R^d back to points in R^d. For example, the matrix diag(2, 1) maps (x, y) to (2x, y): points on the circle {(x, y) : x^2 + y^2 = 1} are mapped to the ellipse {(x, y) : x^2/4 + y^2 = 1}. (Figure from CS168 notes.)

So which direction v maximizes v^⊤Av for a diagonal A? It should be the direction of maximum stretch, i.e. the axis with the largest diagonal entry (here e_1).

Diagonals in disguise

Now take the previous figure, rotated 45 degrees. The (no longer diagonal) matrix still does nothing other than stretch out different orthogonal axes, possibly with these axes being a "rotated version" of the original ones.

How do we formalize the concept of a rotation in high dimensions as a matrix operation? Answer: an orthogonal matrix (also called an orthonormal matrix).
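A small numeric check of this "diagonals in disguise" picture (not from the slides; the numbers are made up): rotating the diagonal example by 45 degrees gives A = QDQ^⊤, which still just stretches by a factor of 2, now along the rotated axis Qe_1.

```python
# Illustrative only: a rotated diagonal matrix A = Q D Q^T stretches along
# the rotated axes Q e_1, Q e_2 by the diagonal factors of D.
import numpy as np

theta = np.pi / 4                                   # 45-degree rotation
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # orthogonal: Q.T @ Q = I
D = np.diag([2.0, 1.0])                             # the diagonal example
A = Q @ D @ Q.T

v1 = Q[:, 0]                                        # rotated maximum-stretch axis
print(np.allclose(Q.T @ Q, np.eye(2)))              # True: Q is orthogonal
print(A @ v1, 2.0 * v1)                             # A stretches v1 by factor 2
print(v1 @ A @ v1)                                  # v^T A v = 2, the max over unit v
```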
Recall that we want to find v_1 = argmax_{v : ‖v‖_2 = 1} v^⊤Av.

Now consider an A that can be written as A = QDQ^⊤ for an orthogonal matrix Q and a diagonal matrix D with diagonal entries λ_1 ≥ λ_2 ≥ . . . ≥ λ_d ≥ 0. Which direction gets stretched the most?

(Informal answer) The maximum possible stretch by D is λ_1, and the direction of maximum stretch under D is e_1. Therefore the direction of maximum stretch under DQ^⊤, and hence under A = QDQ^⊤ (multiplying by Q preserves lengths), is the v such that Q^⊤v = e_1, i.e. v = (Q^⊤)^{−1} e_1 = Q e_1.

General covariance matrices

When k = 1, the solution to argmax_{v : ‖v‖_2 = 1} v^⊤Av is the first column of Q, where A = X^⊤X = QDQ^⊤ with Q orthogonal and D diagonal with sorted entries (X is the n × d matrix whose rows are the centered datapoints x_i).

General values of k

What is the solution to the PCA objective for general values of k?

Solution: Pick the first k columns of Q, where the covariance matrix X^⊤X = QDQ^⊤ with Q orthogonal and D diagonal with sorted entries. Since Q is orthogonal, the first k columns of Q are orthonormal vectors. These are called the top k principal components (PCs).

Eigenvalues & eigenvectors

How do we compute the top k columns of Q in the decomposition X^⊤X = QDQ^⊤? Solution: eigenvalue decomposition!

• Eigenvectors: the axes of stretch in the geometric intuition.
• Eigenvalues: the stretch factors.

PCA boils down to computing the k eigenvectors of the covariance matrix X^⊤X that have the largest eigenvalues.

Principal Component Analysis (PCA)

• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion

How many PCs to use?

For visualization, we usually choose k to be small and just pick the first few principal components. In other applications such as compression, it is a good idea to plot the eigenvalues and look: a lot of data is close to being low rank, so the eigenvalues may decay and become small.

We can also choose k based on how much variance we want to capture. Suppose we want to capture 90% of the variance in the data. Then we can pick the smallest k such that

Σ_{i=1}^k λ_i / Σ_{j=1}^d λ_j ≥ 0.9,

where λ_1 ≥ · · · ≥ λ_d are the sorted eigenvalues. Note that Σ_{j=1}^d λ_j = trace(X^⊤X), so there is no need to actually compute all the eigenvalues.

When and why does PCA fail?

1. The data is not properly scaled/normalized.
2. Non-orthogonal structure in the data: the PCs are forced to be orthogonal, and there may not be many orthogonal components in the data that are all interpretable.
3. Non-linear structure in the data.
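Putting the pieces of this section together, here is a minimal PCA sketch (not from the lecture; function names and the toy data are illustrative): center the data, eigendecompose the covariance matrix X^⊤X, keep the top eigenvectors, and choose k by the 90%-variance rule.

```python
# Illustrative only: PCA via eigendecomposition of the covariance matrix,
# with k chosen to capture at least 90% of the variance.
import numpy as np

def pca_fit(X, var_threshold=0.90):
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc                         # (unnormalized) covariance matrix
    lam, Q = np.linalg.eigh(cov)            # eigenvalues in ascending order
    lam, Q = lam[::-1], Q[:, ::-1]          # sort in decreasing order
    explained = np.cumsum(lam) / lam.sum()  # fraction of variance captured
    k = int(np.searchsorted(explained, var_threshold) + 1)
    return Q[:, :k], lam[:k]                # top-k PCs and their eigenvalues

def pca_transform(X, V):
    return (X - X.mean(axis=0)) @ V         # the k coordinates <x_i, v_j>

# Example: 100 points in R^20 that mostly live in a 3-D subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20)) \
    + 0.01 * rng.normal(size=(100, 20))
V, lam = pca_fit(X)
print(V.shape)                              # (20, k); k is likely 3 here
```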