DSCC 201/401
Tools and Infrastructure for Data Science
March 15, 2021
R
• Brief history and overview
• R interfaces
• Language syntax and examples
• Useful libraries
Data Pre-Processing
• An essential step before data analysis can be performed
• Data pre-processing can be categorized into 4 main operations:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
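The four operations above can be sketched in base R. The `sales` and `stores` data frames below are made-up illustrations, not course data, and median imputation is only one possible cleaning choice.

```r
# Hypothetical toy data (assumed for illustration only)
sales <- data.frame(id = c(1, 2, 3, 4),
                    amount = c(10, NA, 30, 1000))
stores <- data.frame(id = c(1, 2, 3, 4),
                     region = c("north", "south", "north", "south"))

# Data Cleaning: replace the missing amount with the median of observed values
sales$amount[is.na(sales$amount)] <- median(sales$amount, na.rm = TRUE)

# Data Integration: combine the two sources on their shared key
merged <- merge(sales, stores, by = "id")

# Data Transformation: log-scale the skewed amount column
merged$log_amount <- log10(merged$amount)

# Data Reduction: summarize to one row per region
reduced <- aggregate(amount ~ region, data = merged, FUN = mean)
print(reduced)
```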
Machine Learning with R
What is Machine Learning?
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (Tom M. Mitchell)
Types of Machine Learning
• Supervised Learning
• Unsupervised Learning
Goal is usually classification or regression!
Supervised Learning
• Training data fed into the algorithm includes the desired solutions (e.g. class labels for a classification task)
• Common Algorithms
• k-NN (k-nearest neighbors)
• Linear Regression
• Logistic Regression
• SVM (Support Vector Machines)
• Decision Trees and Random Forests
• Neural Networks
Unsupervised Learning
• Training data is unlabeled; the algorithm discovers structure (e.g. groupings) on its own
• Common Algorithms
• Clustering: k-Means
• Dimensionality Reduction: Principal Component Analysis (PCA)
• Association Rule Learning: Apriori
R Libraries
• Classification
library(class)
• Clustering
library(cluster)
• Classification and Regression (Classification and Regression Training)
library(caret)
Classification Library
library(class)
• Provides a convenient k-NN classifier, knn(), for R
• k-NN = k Nearest Neighbors
• “Supervised Learning” method
• Algorithm looks for the k observations in the training set that are closest to the new value
• Prediction for the new value is class of the majority of k nearest neighbors
• For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random
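The voting rule described above can be written out by hand and compared with `knn()` from the `class` package. This is a minimal sketch: `knn_predict_one` and the toy data are hypothetical names introduced here for illustration, not part of the library.

```r
library(class)

# Hypothetical helper: classify a single new point by majority vote
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training observations
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  winners <- names(votes)[votes == max(votes)]
  # Ties are broken at random, as in the description above
  if (length(winners) > 1) sample(winners, 1) else winners
}

# Toy training set with two well-separated classes (assumed data)
toy_x <- data.frame(x = c(1, 1.2, 0.8, 5, 5.2, 4.8),
                    y = c(1, 0.9, 1.1, 5, 5.1, 4.9))
toy_y <- factor(c("a", "a", "a", "b", "b", "b"))
new_point <- c(1.1, 1.0)

manual_pred <- knn_predict_one(toy_x, toy_y, new_point, k = 3)
library_pred <- knn(train = toy_x, test = matrix(new_point, nrow = 1),
                    cl = toy_y, k = 3)
print(manual_pred)
print(library_pred)
```

Both calls should agree here because the three nearest neighbors of the new point all belong to the same class.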
Classification Example with R: k-NN
• k-NN = k Nearest Neighbors
• Data scientist splits the data into a training set and a test set, e.g. 80% vs. 20%
• Algorithm looks for the k observations in the training set that are closest to the new value
• Prediction for the new value is class of the majority of k nearest neighbors
• Example: Restaurant data for NYC
• 3 attributes for each restaurant: average price per meal, number of visitors per weekend, number of stars
• Using a training set, can you predict the number of stars based on the price and number of visitors?
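A sketch of this workflow follows. The NYC restaurant data set is not reproduced here, so the block simulates a stand-in: the `restaurants` data frame, its column names, and the way stars are generated are all assumptions. Features are standardized with training-set statistics because price and visitor counts are on very different scales, which would otherwise let one attribute dominate the Euclidean distance.

```r
library(class)
set.seed(1)

# Simulated stand-in for the restaurant data (assumed, not the real data set)
n <- 200
restaurants <- data.frame(price = runif(n, 10, 100),
                          visitors = round(runif(n, 50, 1000)))
z <- as.numeric(scale(restaurants$price)) +
  as.numeric(scale(restaurants$visitors)) + rnorm(n, sd = 0.3)
restaurants$stars <- cut(z, breaks = quantile(z, probs = seq(0, 1, by = 0.2)),
                         labels = 1:5, include.lowest = TRUE)

# 80% training set, 20% test set
train_idx <- sample(n, size = round(0.8 * n))
train_raw <- restaurants[train_idx, c("price", "visitors")]
test_raw <- restaurants[-train_idx, c("price", "visitors")]

# Standardize both sets using the training means and standard deviations
ctr <- colMeans(train_raw)
sds <- apply(train_raw, 2, sd)
train_x <- scale(train_raw, center = ctr, scale = sds)
test_x <- scale(test_raw, center = ctr, scale = sds)

# Predict stars for the held-out restaurants from their 5 nearest neighbors
pred <- knn(train = train_x, test = test_x,
            cl = restaurants$stars[train_idx], k = 5)
accuracy <- mean(pred == restaurants$stars[-train_idx])
print(table(predicted = pred, actual = restaurants$stars[-train_idx]))
print(accuracy)
```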
Classification (k-NN) Demo
library(class)
Clustering Example with R: k-means
• Unsupervised machine learning method
• Algorithm: Partition n objects into k clusters in which each object belongs to the cluster with the nearest mean; i.e. minimize the objective function F
• Useful for determining relationships among data
• We will look at an example using supermarket data
• Model quality can be checked with a silhouette score, computed from the cluster assignments output by the model and the row-to-row distance matrix
F = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2
Clustering (k-means) Demo
library(cluster)
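A possible version of this demo is sketched below. The supermarket data is not included in the slides, so the `customers` data frame and its two columns are simulated assumptions. The block uses `kmeans()` from base R and `silhouette()` from the `cluster` package, and recomputes the objective function F by hand to show that it matches the value reported by the model.

```r
library(cluster)
set.seed(7)

# Simulated stand-in for supermarket customers (assumed, three loose groups)
customers <- data.frame(
  visits_per_month = c(rnorm(50, 4, 1), rnorm(50, 12, 1.5), rnorm(50, 8, 1)),
  avg_basket_usd = c(rnorm(50, 80, 10), rnorm(50, 25, 5), rnorm(50, 150, 15))
)

# Standardize so both columns contribute equally to the distances
x <- scale(customers)

# Partition the rows into k = 3 clusters; nstart repeats random initializations
km <- kmeans(x, centers = 3, nstart = 25)

# Objective F: sum of squared distances from each row to its cluster mean
objective <- sum((x - km$centers[km$cluster, ])^2)

# Silhouette score from the cluster assignments and row-to-row distance matrix
sil <- silhouette(km$cluster, dist(x))
mean_sil <- mean(sil[, "sil_width"])

print(km$size)
print(objective)
print(mean_sil)
```

A mean silhouette width close to 1 suggests well-separated clusters, while values near 0 suggest overlapping ones.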
Classification and Regression Library
library(caret)
• Caret (Classification and Regression Training) library provides functions for classification and regression
• Library includes functions for preprocessing data, splitting data, feature selection, and tuning models
• Large selection of models:
https://topepo.github.io/caret/available-models.html
Linear Regression
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
• Statistical process for estimating the relationships among variables
• Relationships between a dependent variable and independent variables
• Useful for prediction and forecasting
• Example: mpg ~ weight for car data using linear regression in caret (lm)
Linear Regression Demo
library(caret)
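One way the mpg ~ weight example might look is sketched below, assuming the car data is R's built-in `mtcars` data set (the slides do not name it), where weight is the `wt` column in units of 1000 lb. Because caret is not part of base R, the block fits the model with `lm()` first and runs the caret version only if the package is installed. The 5-fold cross-validation setting is an illustrative choice.

```r
# Base R fit; caret's method = "lm" wraps this same function
fit_lm <- lm(mpg ~ wt, data = mtcars)
print(coef(fit_lm))  # intercept (theta_0) and slope for wt (theta_1)

if (requireNamespace("caret", quietly = TRUE)) {
  library(caret)
  set.seed(123)
  # 5-fold cross-validation to estimate out-of-sample error
  ctrl <- trainControl(method = "cv", number = 5)
  fit_caret <- train(mpg ~ wt, data = mtcars, method = "lm", trControl = ctrl)
  print(fit_caret$results[, c("RMSE", "Rsquared")])
  # The final model is refit on all rows, so it matches fit_lm
  print(coef(fit_caret$finalModel))
}

# Predict fuel economy for a hypothetical car weighing 3000 lb
new_car <- data.frame(wt = 3.0)
predicted_mpg <- predict(fit_lm, newdata = new_car)
print(predicted_mpg)
```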