
APS1070
Foundations of Data Analytics and Machine Learning
Fall 2021
Week 3:
• End-to-end Machine Learning
• Data Retrieval and Preparation
• Plotting and Visualization
• Making Predictions
• Decision Trees
Prof.

Agenda
➢Today’s focus is on Foundations of Learning
1. End-to-end machine learning
2. Python Libraries
—NumPy
—Matplotlib
—Pandas
—Scikit-Learn
3. Decision Trees
2

Part 1
End-to-End Machine Learning

End-to-End Machine Learning
1. Understand the problem
2. Retrieve the data
3. Explore and visualize the data to gain insights
4. Prepare the data for the algorithm/model
5. Select and train the algorithm/model
6. Fine-tune your algorithm/model
7. Present your solution
8. Launch, monitor, and maintain your system
4

End-to-End Machine Learning
Understand Problem
Data Visualization
Model Selection
Test and Assess
Data Collection
Data Preparation
Model Training
5

Classification vs. Regression
➢Classification: Discrete target
➢Separate the dataset
➢Apples or oranges?
➢Dog or cat?
➢Handwritten digit recognition
➢Regression: Continuous target
➢Fit the dataset
➢Price of a house
➢Revenue of a company
➢Age of a tree
(Figures: a classification example plotted on Feature #1 vs. Feature #2, and a regression example plotted on Feature #1 vs. Target.)
6

Understand the Problem
➢Often, we need to make some sort of decision (prediction)
➢Two common types of decisions that we make are:
➢ Classification
➢ Discrete number of possibilities
➢ Regression
➢ Continuous range of real-valued possibilities
             Supervised        Unsupervised
Discrete     classification    clustering
Continuous   regression        dimensionality reduction
7

Understand the Problem
Input data is represented by features that can come in many forms:
➢Raw pixels
➢Histograms
➢Tabular data
➢Spectrograms
➢…
8

Data Exploration
➢Understand your data through visualization
➢Assess the difficulty of the problem
➢You have a data set D = {(x(i), y(i))}
➢You want to learn y = f(x) from D
➢More precisely, you want to minimize error in predictions
➢What kind of model (algorithm) do you need?
(Figure: MNIST low-dimensional projection.)
9

Model Selection
Many classifiers to choose from
➢Support-Vector Machine (SVM)
➢Logistic Regression
➢Random Forests
➢Naive Bayes
➢Bayesian network
➢K-Nearest Neighbour
➢(Deep) Neural networks
➢Etc.
10

Model Selection
➢Often the easiest algorithm to implement is k-Nearest Neighbours
➢Match to similar data using a distance metric
Q: What happens as we increase #data?
Q: What about as #data approaches infinity?
11
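k-NN needs very little code in practice. Below is a minimal sketch using scikit-learn's KNeighborsClassifier; the tiny two-feature dataset and the choice of K = 3 are invented for illustration and are not from the course notebook.

# Minimal k-NN sketch (illustrative only; toy data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset: two features per sample, binary labels (e.g., apples vs. oranges).
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3 neighbours, Euclidean distance by default
knn.fit(X, y)                              # "training" just stores the data
print(knn.predict([[1.2, 1.9]]))           # the nearest neighbours vote -> [0]
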

Test and Assess
➢Unlike us, computers have no trouble with memorization.
➢The real question is, how well does our algorithm make predictions on new data?
➢We need a way to measure how well our algorithm (model) generalizes to new, never before seen, data.
12
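One standard way to measure generalization, sketched here with scikit-learn's train_test_split (the arrays are placeholders, not course data), is to hold out a portion of the data that the model never sees during training.

# Hold out 20% of the data to estimate generalization (illustrative sketch).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (made-up data)
y = np.array([0, 1] * 5)           # made-up labels

# 80% training / 20% testing, shuffled with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
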

Regression Example
➢Let’s look at a more concrete example…
➢Given noisy sample data (blue), we want to find the polynomial that generated the data
➢Q: What kind of a problem is this?
(Figure: noisy samples of stock price vs. media exposure.)
13

Mean Squared Error
➢We first need to define our error term; in this case we can use the mean squared error (MSE):
➢Error is measured by finding the squared error in the prediction of y(x) from x.
➢ The error for the red polynomial is the sum of the squared vertical errors
14
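For reference, the MSE can be written in the dataset notation D = {(x(i), y(i))} used earlier. The slide's own equation is not reproduced here, so treat this as the standard textbook form:

\mathrm{MSE}(f) = \frac{1}{N} \sum_{i=1}^{N} \left( f(x^{(i)}) - y^{(i)} \right)^{2}

that is, the squared vertical errors between the fitted curve f and the targets, averaged over the N samples.
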

Fitting the Data
Q: Which polynomial fits the data best?
➢based on the error term?
➢based on test data?
15

Overfitting vs Underfitting
➢Underfit: high training error and high test error
➢Good fit: acceptable training error and acceptable test error
➢Overfit: perfect (zero) training error and high test error
16

Generalization
➢Giving the model a greater capacity (more complexity) to fit the data… does not necessarily help
➢How do we evaluate the model performance?
Verify the model on new, unseen data (held out from training)
17

Overfitting
➢In brief: fitting characteristics of training data that do not generalize to future test data
➢Central problem in machine learning
➢Particularly problematic if #data << #parameters
➢... don't have enough data to "identify" parameters
18

Generalization
➢Machine learning is a game of balance, with our objective being to generalize to all possible future data
(Figure: error (% incorrect) vs. model capacity (complexity), with training-sample and new-sample error curves marking the under-fitting and over-fitting regions.)
19

Bias-Variance Trade-off
➢Models with too few parameters are inaccurate because of a large bias (not enough flexibility).
➢Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).
20

Inductive Bias
➢Let's avoid making assumptions about the model (polynomial order)
➢Assume for simplicity that D = {(x(i), y(i))} is noise free
➢x(i)'s in D only cover a small subset of the input space x
➢Q: What's the best we can do?
➢If we've seen x = x(i), report y = y(i)
➢If we have not seen x = x(i), we can't say anything (no assumptions)
➢This is called rote learning... boring, eh?
➢Key idea: you can't generalize to unseen data without assumptions!
➢Thus, the key to ML is generalization
➢To generalize, an ML algorithm must have some inductive bias
➢The bias is usually in the form of a restricted model (hypothesis) space
➢Important to understand the restrictions (and whether they are appropriate)
21

Inductive Bias
➢Example: Nearest neighbors
– We suppose that most of the cases in a small neighborhood in feature space belong to the same class.
– Given a case for which the class is unknown, we assume that it belongs to the same class as the majority in its immediate neighborhood. This is the bias used in the k-nearest neighbors algorithm.
– The assumption is that cases that are near each other tend to belong to the same class.
22

Training and Testing Data
➢Track generalization error by splitting data into training and testing
➢e.g., 80% training and 20% testing
➢More data = better model
➢We would like to use all our data for training; however, we need some way to evaluate our model
23

The problem with tracking test accuracy
➢What should K be?
➢If we track test error/accuracy in our training curve, then:
➢We may make decisions about model architecture using the test accuracy and make the testing meaningless.
➢The final test accuracy will not be a realistic estimate of how our model will perform on a new data set!
24

Validation Set
➢We still want to track the loss/accuracy on a data set not used for training
➢Idea: set aside a separate data set, called the validation set
➢Track validation accuracy in the training curve
➢Make decisions about model architecture using the validation set
25

Validation Set
➢K is a hyperparameter. We tune hyperparameters using the validation set.
26

Validation and Holdout Data
➢Training, validation and testing data
➢Less data for training your model
➢Ideally use the holdout data only once
➢Requires a great deal of discipline to not look at the holdout data
27

Cross-Validation
➢Split the training and validation data into several folds during training
➢This is known as k-fold cross-validation
➢Model parameters are selected based on the average performance achieved over the k folds
Source: scikit-learn
28

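A minimal k-fold cross-validation sketch with scikit-learn (illustrative only; the model, the random data, and the choice of 5 folds are stand-ins, not the course's Colab code):

# 5-fold cross-validation sketch (illustrative only).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 2))                  # made-up features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # made-up labels

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5)  # accuracy on each of the 5 folds
print(scores.mean())                       # average over folds, used to compare hyperparameters
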
Data Processing
➢Q: You test your model on new data and find that it fails to predict certain samples. Why could this be happening?
(Figure: training data vs. test data.)
29

Data Augmentation
➢For example, how can your algorithm (model) predict rotated samples if it has never seen one?
➢Apply data augmentation!
➢translation,
➢scaling,
➢rotation,
➢reflection,
➢...
➢Linear algebra to the rescue!
Source: https://morioh.com/p/928228425a08
30

More Data Processing
➢Q: A large input feature size (short and wide data) is problematic. Why do you think that is?
➢Curse of dimensionality!
➢As the number of features grows, you require more model capacity (complexity) to represent the data
➢Models of greater complexity require exponentially more training data
31

Dimensionality Reduction
Solution:
➢Reduce the number of features using dimensionality reduction
➢Principal Component Analysis
➢more details provided in weeks 7 and 8
Source: Data Courses
32

Deep Learning
➢Principal Component Analysis (PCA) is limited to linear transformations
➢Deep Learning techniques can be used to learn and apply nonlinear transformations for dimensionality reduction
➢More detail on model-based machine learning techniques in weeks 9 – 11
33

Roadmap for the rest of APS1070
(Diagram repeated from slide 5: Understand Problem, Data Visualization, Model Selection, Test and Assess, Data Collection, Data Preparation, Model Training.)
End-to-end machine learning is just one piece of the pie. The concepts we'll cover in this course have utility that goes far beyond machine learning.
34

Part 2
Python Libraries and Titanic

Basic Python Check-up
Tutorials 0 and 1: Python Basics
❑Data Types
❑Single: int, float, bool
❑Multiple: str, list, set, tuple, dict
❑index [], slice [::], mutability
❑Conditionals
❑if, elif, else
❑Functions
❑def, return, recursion, default vals
❑Loops
❑for, while, range
❑list comprehension
❑Operations
❑arithmetic: +, *, -, /, //, %, **
❑boolean: not, and, or
❑relational: ==, !=, >, <, >=, <=
❑Display
❑print, end, sep
❑Files
❑open, close, with
❑read, write
❑CSV
❑Object-Oriented Programming (OOP)
❑class, methods, attributes
❑__init__, __str__, polymorphism
37

Other resources for Python
➢Coursera - University of Toronto MOOCs
➢Learn to Program: The Fundamentals (https://www.coursera.org/learn/learn-to-program)
➢Learn to Program: Crafting Quality Code (https://www.coursera.org/learn/program-code)
➢Google is your (BEST) friend?
➢APS1070 Board
38

Scientific Computing Tools for Python
➢Scientific computing in Python builds upon a small core of packages:
➢NumPy, the fundamental package for numerical computation. It defines the numerical array and matrix types and basic operations on them.
➢The SciPy library, a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more.
➢Matplotlib, a mature and popular plotting package that provides publication-quality 2D plotting as well as rudimentary 3D plotting.
➢Data and computation:
➢pandas, providing high-performance, easy-to-use data structures.
➢scikit-learn, a collection of algorithms and tools for machine learning.
Source: https://www.scipy.org/about.html
39

NumPy
➢Let's start with NumPy. Among other things, NumPy contains:
➢A powerful N-dimensional array object.
➢Sophisticated (broadcasting/universal) functions.
➢Tools for integrating C/C++ and Fortran code.
➢Useful linear algebra, Fourier transform, and random number capabilities.
➢Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data.
➢Many other Python libraries are built on NumPy.
➢Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance.
40

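A small sketch of the vectorization point above (illustrative; the array size and the operation are arbitrary): the same element-wise computation written as an explicit Python loop and as a single NumPy expression.

# Vectorized NumPy operation vs. an explicit Python loop (illustrative).
import numpy as np

x = np.arange(100_000, dtype=np.float64)

# Python loop: one multiply-add at a time, in interpreted code.
y_loop = [2.0 * v + 1.0 for v in x]

# NumPy: the whole array at once, in compiled code (typically much faster).
y_vec = 2.0 * x + 1.0

print(np.allclose(y_loop, y_vec))  # True: same result either way
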
NumPy
➢The key to NumPy is the ndarray object, an n-dimensional array of homogeneous data types, with many operations being performed in compiled code for performance.
➢There are several important differences between NumPy arrays and the standard Python sequences:
➢NumPy arrays have a fixed size. Modifying the size means creating a new array.
➢The elements of a NumPy array must all be of the same data type (which can, however, be a Python object type).
➢More efficient mathematical operations than built-in sequence types.
41

NumPy
➢To begin, NumPy supports a wider variety of data types than are built in to the Python language by default. They are defined by the numpy.dtype class and include:
➢intc (same as a C integer) and intp (used for indexing)
➢int8, int16, int32, int64
➢uint8, uint16, uint32, uint64
➢float16, float32, float64
➢complex64, complex128
➢bool_, int_, float_, complex_ are shorthand for the defaults.
42

NumPy
➢There are a couple of mechanisms for creating arrays in NumPy:
➢Conversion from other Python structures (e.g., lists, tuples).
➢Built-in NumPy array creation functions (e.g., arange, ones, zeros, etc.).
➢Reading arrays from disk, either from standard or custom formats (e.g., reading in from a CSV file).
➢and others ...
43

NumPy
➢In general, any numerical data that is stored in an array-like container can be converted to an ndarray through use of the array() function. The most obvious examples are sequence types like lists and tuples.
44

SciPy
➢Collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and much more
➢Part of the SciPy ecosystem
➢Built on NumPy
➢With SciPy, an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.
45

SciPy
➢SciPy's functionality is implemented in a number of specific sub-modules. These include:
➢Special mathematical functions (scipy.special) -- airy, elliptic, bessel, etc.
➢Integration (scipy.integrate)
➢Optimization (scipy.optimize)
➢Interpolation (scipy.interpolate)
➢Fourier Transforms (scipy.fftpack)
➢Signal Processing (scipy.signal)
➢Linear Algebra (scipy.linalg)
➢Statistics (scipy.stats)
➢Multidimensional image processing (scipy.ndimage)
➢Data IO (scipy.io)
➢and more!
46

Pandas
➢Adds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R)
➢Provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation, etc.
➢Aggregation - computing a summary statistic for groups
➢min, max, count, sum, prod, mean, median, mode, mad, std, var
➢Allows for handling missing data
Source: http://pandas.pydata.org/
47

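A short pandas sketch of the aggregation and missing-data points above (the fruit table is made up; it is not the Titanic data used later in the course):

# Grouped aggregation and missing-data handling in pandas (illustrative).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "apple", "orange", "orange"],
    "width": [7.1, np.nan, 9.5, 9.9],          # one missing value
})

print(df.groupby("fruit")["width"].mean())     # summary statistic per group (NaN is skipped)
print(df.fillna(df["width"].mean()))           # one simple way to fill the missing value
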
Matplotlib
➢Matplotlib is an incredibly powerful (and beautiful!) 2-D plotting library. It's easy to use and provides a huge number of examples for tackling unique problems.
➢Similar to MATLAB
48

pyplot
➢At the center of most matplotlib scripts is pyplot.
➢The pyplot module is stateful and tracks changes to a figure. All pyplot functions revolve around creating or manipulating the state of a figure.
49

pyplot
➢The plot function can take any number of arguments.
➢The format string argument associated with a pair of sequence objects indicates the color and line type of the plot (e.g. 'bs' indicates blue squares and 'ro' indicates red circles).
➢Generally speaking, the x_values and y_values will be NumPy arrays; if not, they will be converted to NumPy arrays internally.
➢Line properties can be set via keyword arguments to the plot function. Examples include label, linewidth, animated, color, etc.
50

Jupyter Notebook
➢All of these libraries come preinstalled on Google Colab
➢Google Colab uses a Jupyter notebook environment that runs in the cloud and requires no setup to use
➢Runs in Python 3
➢Includes all the commonly used machine learning (data science) libraries
➢e.g. NumPy, SciPy, Matplotlib, Pandas, PyTorch, TensorFlow, etc.
➢Alternatively, you can use a Jupyter notebook on your own computer
51

Let's take a look at the Week 3 Jupyter Notebook
52

Part 3
Decision Trees

Decision Trees
➢A rule-based supervised learning algorithm
➢A powerful algorithm capable of fitting complex datasets
➢Can be applied to classification (discrete) and regression (continuous) tasks
➢Highly interpretable!
➢A fundamental component of Random Forests, which are one of the most used machine learning algorithms today
54

Lemon vs. Orange!
Flowchart-like structure!
55

Test example
56

Constructing a Decision Tree
➢Decision trees make predictions by recursively splitting on different attributes according to a tree structure
(Figure: recursive splits on width (cm).)
57

What if the attributes are discrete?
58

What if the attributes are discrete?
➢Attributes: features (inputs)! They can be discrete or continuous.
59

Output is Discrete
(Figure: tree with leaf decisions "Wait for table" and "Go somewhere else".)
60

Output is Continuous (Regression)
➢Instead of predicting a class at each leaf node, predict a value based on the average of all instances at the leaf node.
Source: GDCoder
61

Summary: Discrete vs Continuous Output
➢Classification Tree:
➢discrete output
➢output node (leaf) typically set to the most common value
➢Regression Tree:
➢continuous output
➢output node (leaf) value typically set to the mean value in the data
62

Generalization
➢Decision trees can fit any function arbitrarily closely
➢Could potentially create a leaf for each example in the training dataset
➢Not likely to generalize to test data!
➢Need some way to prune the tree!
63

Managing Overfitting
➢Add parameters to reduce the potential for overfitting
➢Parameters include:
➢depth of the tree
➢minimum number of samples
64

Random Forests
➢One of the most popular variants of decision trees
➢Addresses overfitting by training multiple trees on subsampled features, among other things
➢A majority vote of all the trees is used to make the final output
65

Model Interpretation
➢Decision Trees are intuitive, with easy-to-interpret decisions
➢...not the case for random forests, neural networks and many other machine learning algorithms
➢Gini impurity is a measurement of the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels from the data set.
66

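For reference, the standard formula behind that definition (it is not reproduced from the slide): for a node whose samples fall into K classes with proportions p_1, ..., p_K,

G = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^{2}

so a pure node has G = 0, a 50/50 binary node has G = 0.5, and tree-growing algorithms choose splits that reduce this impurity.
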
Comparison to k-NN
➢There are many advantages of Decision Trees over k-Nearest Neighbours:
➢Good with discrete attributes
➢Robust to the scale of inputs (does not require normalization)
➢Easily handle missing values
➢Good at handling lots of attributes, especially when only a few are important
➢Fast test time
➢More interpretable
➢Decision trees are not good at handling rotations in the data
67

Decision Trees Code Example (Google Colab); a minimal sketch follows at the end of these notes.
68

Next Time
➢Week 3 Q&A Support Session on Thursday, Sep 23rd
➢Help with Python and Project 1
➢Reading assignment 3 Due - Sep. 27 at 21:00
➢Pages 184-189 (page numbers from the pdf file), Section 6.2 in Chapter 6 of "Mathematics for Machine Learning" by Deisenroth et al., 2020 (link)
➢Project 1 Due - Oct. 1 at 21:00
➢Week 4 Lecture – Uncertainty and Performance
69
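Slide 68 points to a Colab code example that is not included in this text. As a stand-in, here is a minimal illustrative sketch of fitting and inspecting a decision tree with scikit-learn; the lemon/orange measurements are invented, and max_depth / min_samples_leaf are the overfitting controls mentioned earlier.

# Minimal decision-tree sketch (illustrative; not the course's Colab notebook).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up lemon (0) vs. orange (1) data: [width_cm, height_cm].
X = np.array([[7.0, 7.5], [7.2, 7.8], [9.5, 9.0], [9.8, 9.4]])
y = np.array([0, 0, 1, 1])

# max_depth and min_samples_leaf limit tree growth to manage overfitting.
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=1, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["width_cm", "height_cm"]))  # readable, flowchart-like rules
print(tree.predict([[8.0, 8.1]]))                                  # classify a new fruit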