January 4, 2017
Today’s Class
Part I
Announcements
Course Admin
Course Overview
motivation
topics
timelines
Part II
Understanding and Preparing Data for Analysis
Basic definitions of data and how to manage, clean, analyse data at a
high level.
Part I
Course Introduction
Course Admin
Announcements
Evaluation
Schedule
Course Goals
Background
Learning Objectives
Topics
Tools and Resources
Data and Knowledge Modelling and Analysis
Instructor: Mark Crowley – mcrowley@uwaterloo.ca – E5 4114
Material: learn.uwaterloo.ca
Lectures: Wednesdays 11:30am-2:20pm
Location: E5 5106/5128
TA: Rasoul Mohammadi Nasiri – r26mohammadinasiri@uwaterloo.ca –
E5 5013
TA Office Hours: email to arrange time for that week
Announcements
Registering/Waiting List – course is full
Log on to learn.uwaterloo.ca
enable email notifications
use the message boards, let me know if you want specific groups or
categories, I can create them
talk to each other
Workload and Evaluation
Assignments 20%
Final Exam 50%
Project 30%
Proposal 5%
Presentation 7%
Proposal/Presentation Peer Reviews 4%
Report 14%
Assignments and Projects can be done in groups of 2-3
Course Dates
First Class January 4, 2017
No Class February 22, 2017
Last Class March 29, 2017
Final Exam To Be Determined
Important Dates (Subject to change)
Task Issued Due
Assignment 1 January 16 February 6
Project Pitch Session Feb 1
Project Proposal soon February 15
Assignment 2 February 15 March 8
Feedback on Proposal February 22
Project Presentations March 22, 29
Project Report March 29
What are the goals of this course?
Everyone has data to process; many tools and best practices already
exist to do this.
Data could come from experiments, databases, the internet, sensors
or any other files.
This course aims to
provide engineering graduate students with essential knowledge of
data representation, grouping, mining and knowledge discovery.
Level the playing field on data representation, processing, basic
statistics, analysis, data mining.
Introduce basic Machine Learning techniques.
Required Background
Math and Linear Algebra: sets, matrices, transpose, cross product,
dot product, matrix multiplication, solving systems of linear equations
Programming :
You should be comfortable programming in some language; not large
software applications, but lots of calculations, plotting, etc.
Writing and Presenting :
Assignments and Project require written reports. Suggested tools:
LaTeX (local or ShareLaTeX), Word, Google Docs
Presentation: LaTeX Beamer, PowerPoint, Keynote
Probability and Statistics : (not required, we will define or review
these, but it would help)
definition of probability, Bayes theorem, information, entropy,
KL-divergence, probability distributions (Gaussian, Bernoulli, Poisson,
. . . )
hypothesis testing, chi-squared
Learning Objectives
By the end of this course you will…
Explain the sources and nature of data.
Demonstrate how to best represent given data, summarize it, select
proper metrics to evaluate the quality of the data and preprocess it
for full analysis.
Demonstrate ability to process data to extract useful information and
knowledge.
Answering Questions
What is Data?
How do I prepare data for analysis to remove sources of error, bias,
noise?
What can I say about the questions I can answer using a given
dataset?
How do I train, test and evaluate my hypotheses using data?
Which algorithms are the most appropriate to answer my questions
using this data?
Topics to be covered
1 Data types, sources, nature, scales and distributions
2 Data representations, transformation, dimensionality reduction and
normalization
3 Classification: Statistical based, Distance based, Decision based,
Deep Learning.
4 Clustering: Partitional, Hierarchical, Model and Density based, others.
5 Retrieval and Mining: Similarity measures and matching techniques.
6 Knowledge discovery in data: Rule induction, Association rules
mining, text mining.
7 Performance measures and tools: Statistical Analysis, Validity and
Assessment Measures.
Tools for Data Management and Analysis
Use whatever you prefer, but. . .
Suggested: Matlab
See the short tutorial slides on LEARN.
Matlab's documentation is very detailed.
Good choice: python
numpy, scipy, scikit-learn
Lots of resources online: communities, modules, new code tools all the
time.
Other: R
The statistician's choice. Very powerful; less support from me, but a
large online community too.
Other: Java – Weka – data analysis tools for Java.
See the Computing Resources page on LEARN/website with tips on
servers/systems you can use on campus.
Sharcnet: research students can have their supervisor sponsor them to
use Sharcnet at no cost.
If you find useful resources, add them to the resources discussion
forum on LEARN.
Other Tools and Resources
Mendeley.com Community – online resource for academic papers.
Group for this course, once you have an account you can join it and
post your own paper or comments on papers.
(https://www.mendeley.com/community/fb96334c-a81c-3f85-8f10-
0acf493c1423/)
Kaggle Competition
Azure Tools – Microsoft, free to use for single user, single machine
smaller runs.
Similar tools from Amazon, Google?
Other Relevant Courses:
ECE 657: Tools of Intelligent Systems Design
ECE 750: Topic 5 – Distributed and Network-Centric Computing
ECE ???: Real time systems
CS 489/698: Big Data Infrastructure
CS 848/858: Models and Applications of Distributed Data Processing
Systems
CS 685: Machine Learning: Statistical and Computational
Foundations
STAT 841: Statistical Learning – Classification
SYDE 675: Pattern Recognition (similar to this course)
Part II
Understanding and Preparing Data
Outline of Today’s Lecture
Data, Data Types and Information
Types of Data
Data Representations
Summarizing Data
Central Tendency
Measures of Dispersion
Multiple Variables
Data Preprocessing
Data Examination
Data Cleaning
Data Transformation: Smoothing
Data Transformation: Normalization, Scaling
Data Reduction
Data and Information
One way to think about it…
Data: a value that is measured (continuous, e.g. 25, 108.3) or
counted/observed (discrete, e.g. male, married, 5). Data by itself
does not have a meaning.
Information: interpreted data; adding meaning to data and
understanding relations on data. E.g. if the measured data is 25 and
the measuring device is a thermometer, then the reading is a
temperature. The attribute temperature adds meaning to the data.
What is Information?
Another way to think about it… Entropy measures the uncertainty that is
resolved after observing a variable.

H = − ∑_{i=1}^{m} P_i log₂ P_i

If trials are equally probable and independent then you can add their
entropies to get the cumulative entropy.
If the next outcome is certain then the entropy is 0.
The outcome of a fair coin flip provides 1 bit of information.
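The formula can be checked with a short sketch in plain Python (the helper name `entropy` is mine, not from the slides):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin flip: 1.0 bit
print(entropy([1.0]))       # certain outcome: 0 bits
```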
Data Something Something. . .
Data Sources: measurements from sensors, records, files, document,
archives, transactions.
Data Modeling: Creating a structure, organization, function or an
abstract view of the data.
Data Analysis: Transforming or operating on data to extract useful
information, knowledge or conclusions.
Data Mining: Carrying this further to discover unforeseen or hidden
patterns in the data.
Big Data
Big data is about quality and performance given *huge* amounts of data;
that is not the primary focus of this course. But the tools and analysis
methods we learn are part of the basis you need to deal with Big Data.
Volume – Large amounts of data, social networks, phone, location,
embedded systems, environmental, satellites, ”full firehose”
Velocity – Streaming, online data, arriving quickly, time series,
real-time
Variety – Heterogeneous (many types), many sources, category data,
numerical data, continuous/discrete, text, images, audio,
video
Big Data
Some people say it also includes:
Veracity – Solution requires: Accuracy, Confidence, Precision, Error
Variability – Changes moment to moment, distribution can change at
different times (seasonal, trends, fads, memes)
Complexity – Combinatorial connections between entities in the data,
form networks, hierarchies
Types of Data Attributes
A data point has a set of attributes (also called dimensions, features or
variables):
25, 30, -1.282, 8.3e5
1st, 3rd
blue, red, green
hi, med, lo
Types of Data (Qualitative)
Nominal: no implication of quantity (qualitative)
e.g. occupation: engineer, teacher, dentist, bus driver; or color:
blue, red, green. Binary is a special case: 0, 1. Operations: =, ≠
Ordinal: relative ranking among values
(i.e., values are ordered in relation to each other)
e.g. (hi, med, lo), (disagree, neutral, agree), (5, 3, 1). Operations: <, >
[From [2] Chp 2]
Intervals
Interval: like the ordinal type, but on a scale of equal-size units;
i.e., a unit of measurement exists. Operations: +, −
Interpretation of numbers depends on the unit. The interval (range)
is important for the interpretation.
E.g. the significance of a mark of 10 is different on a 0–10 scale
than on a 0–100 scale.
Interval numbers represent differences between values, not absolute
quantities.
Temperature in C or F is an interval attribute.
Ratios
Ratio: like the interval type in order and uniformity, but the scale
has an inherent zero-point. Operations: *, /
e.g. the Kelvin temperature scale for heat: 0 K means no heat, and
50 K is double the heat of 25 K.
Can't say that for the Celsius scale; 0 C is just the amount of heat
at the freezing point.
Used for physical quantities: height, weight, length, etc.
Also locations, distance, money.
Structural Data
Values are represented in a tree or hierarchy or graph
Hierarchical structure could be whole-part, abstract-specific,
classes-instances
[Figure: example trees. Canada → (BC, Ontario, Quebec);
Vehicle → (Bus, Motorcycle, Car).]
Examples of tree structured data.
Structural Data
[Figure: class-instance tree. Data Modelling → Predictive Models
(Classification, Regression, Time Series Predictors) and Descriptive
Models (Clustering, Summarization, Association Rules, Sequence
Discovery).]
Examples of class-instance tree structured data.
Graphs or Trees
[Figure: example graph over Year, Month, Season, Day, Hour, AM/PM,
with Season → (Winter, Spring, Summer, Autumn).]
Data Bases
data that has the same structure (schema) or abstract view
independent of the physical layer
[Figure: entity-relationship model. Entities employee (ID, Name,
address, salary) and Job (Job-no, Job-des, Pay-range), linked by a
relationship.]
Entity-relationship model: a relationship is used to describe an
association between entities.
From [1]
Lists/Vector/Matrix/Data Cube
mainly table or vector or attribute-sample matrix which provides
relational view
[Figure: attribute-sample matrix with rows x1, x2, x3, … (samples) and
columns a1, a2, a3, … (attributes); a data cube adds one more
attribute, e.g. time.]
Data Bases
OLTP: databases providing online transaction processing; the
traditional transactional database approach.
Data Warehouse: set of data to support a decision support system,
Online Analytic Processing (OLAP)
Function: is a representation that describes the data as a
mathematical function, curve, nonlinear function etc.
Summarizing Data
We have data; we need to find patterns in it.
The simplest pattern is a summary of the data.
Summarizing A Single Variable
Given a univariate sample X1, . . . ,Xn (could be Real, Natural,
Integers)
Goal: Summarize the variable compactly with a few numbers:
We want to summarize properties like spread, variation, range.
Anything that can provide a summary statistic for the variable.
Average: the simplest and most common estimate of central
tendency.
mean(x) = µ = x̄ = (1/n) ∑_{i=1}^{n} X_i
Pro: If the samples come from a normal distribution then the average
is the optimal estimate.
Con: Sensitive to outliers. (could be noise, data entry error, actual
outliers)
Summarizing A Single Variable
Median: If the samples are sorted then the median is the value that
splits the list into half
Mode: is the most common value in the list of samples (data can be
bimodal or more)
Skew: (third moment) high skew means the bulk of the data is at
one end. Result: Median will be a better measure than mean.
Kurtosis: (fourth moment) A measure of the heaviness of the tail of
the distribution with respect to a set of points with a
normal/Gaussian distribution and the same variance.
[Excerpt from [2], Section 2.2, shown on the slide:]
The mode is the value that occurs most frequently in a data set; it can
be determined for qualitative and quantitative attributes. Data sets
with one, two, or three modes are called unimodal, bimodal, and
trimodal; in general, a data set with two or more modes is multimodal.
If each value occurs only once, there is no mode.
For unimodal numeric data that are moderately skewed, the empirical
relation mean − mode ≈ 3 × (mean − median) lets the mode be
approximated when the mean and median are known.
The midrange, the average of the largest and smallest values, can also
assess central tendency (easy to compute with the SQL aggregates
max() and min()).
In a perfectly symmetric distribution the mean, median, and mode
coincide; in positively skewed data the mode is smaller than the
median, and in negatively skewed data it is greater. (Figure 2.1 in [2].)
Central Moments of a Set of Points
Mean(1), Variance(2), Skew(3) and Kurtosis(4) are unified by a single
type of calculation on the n data points.
µ_k = ∫_{−∞}^{∞} (x − c)^k f(x) dx

µ̂_k ≈ (1/n) ∑_{i=1}^{n} (X_i − x̄)^k

The 3rd and 4th moments are usually normalized by powers of s, just as
the standard deviation is the normalized 2nd moment.
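A minimal sketch of the sample central moments, assuming the standard (1/n)-normalized form and an arbitrary made-up data set:

```python
import math

def central_moment(xs, k):
    """k-th central moment: (1/n) * sum((x - mean)^k)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** k for x in xs) / n

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # arbitrary example data
var = central_moment(xs, 2)                     # 2nd moment = variance
sd = math.sqrt(var)
skew = central_moment(xs, 3) / sd ** 3          # normalized 3rd moment
kurt = central_moment(xs, 4) / sd ** 4          # normalized 4th moment
print(var, skew, kurt)
```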
Types of Mean Functions
Trimmed Mean: ignore a small percentage of the highest and lowest
values
Geometric Mean:

(∏_{i=1}^{n} x_i)^{1/n} = exp[(1/n) ∑_{i=1}^{n} log x_i] ≤ arithmetic mean

Arithmetic mean of the logarithm-transformed x
Good for positive values and for averaging growth rates
Most appropriate for ranking normalized results (different
normalizations can alter the ordering under arithmetic or harmonic
means)
Types of Mean Functions
Harmonic mean: average of rates
H = n / (1/x₁ + 1/x₂ + · · · + 1/xₙ)

It is the reciprocal of the arithmetic mean of the reciprocals of the
sample points.
Appropriate for values that are inversely proportional to time such as
“speedup”.
Mean Examples (in Matlab)
Data: X = [1,1,1,1,1,1,100]
n = 7
Mean = sum(X)/n = 106/7 ≈ 15.14
Median = median(X) = 1
Mode = mode(X) = 1
Trimmed mean (25%) = 1
Geometric mean = 1.9307
Harmonic mean = 1.1647
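The same numbers can be reproduced in plain Python (no toolboxes); the trimmed mean here drops one sample from each end, reproducing the slide's 25% trimmed-mean result:

```python
import math
from statistics import median, mode, mean

X = [1, 1, 1, 1, 1, 1, 100]
n = len(X)

avg = sum(X) / n                                  # 106/7, about 15.14
med = median(X)                                   # 1
mod = mode(X)                                     # 1
k = round(n * 0.25 / 2)                           # trim ~12.5% from each end
trimmed = mean(sorted(X)[k:n - k])                # 1
geom = math.exp(sum(math.log(x) for x in X) / n)  # about 1.9307
harm = n / sum(1 / x for x in X)                  # about 1.1647
print(round(avg, 2), med, mod, trimmed, round(geom, 4), round(harm, 4))
```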
Measures of Dispersion: Variance and Deviation
Measure the spread of the data range.
Standard Deviation:

σ = √[(1/n) ∑_{i=1}^{n} (x_i − x̄)²]

Pro: same units as the data
Con: sensitive to outliers
matlab: std(x) (note: Matlab's std and var divide by n − 1 by default)
Variance:

σ² = S² = (1/n) ∑_{i=1}^{n} (x_i − x̄)²

matlab: var(x)
Variance and Deviation
Mean Absolute Deviation (MAD):

MAD = (1/n) ∑_{i=1}^{n} |x_i − x̄|
Less sensitive to outliers than STD
matlab:mad(x)
Interquartile Range (IQR): Difference between 75th (Q3) and 25th
(Q1) percentile of data
Deviation Examples
Data: X = [1,1,1,1,1,1,100]
n = 7
Range = range(X) = 99
Std = std(X) = 37.42
MAD = mad(X) = 24.24
IQR = 0
[Figure: box-plot of the 5-number summary: min, Q1, median, Q3, max.]
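The deviation example above (X = [1,1,1,1,1,1,100]) can be reproduced in Python; note that 37.42 matches the sample standard deviation with an n − 1 denominator, Matlab's default:

```python
from statistics import stdev, quantiles

X = [1, 1, 1, 1, 1, 1, 100]
n = len(X)
m = sum(X) / n

rng = max(X) - min(X)                 # 99
sd = stdev(X)                         # about 37.42 (n-1 denominator)
mad = sum(abs(x - m) for x in X) / n  # about 24.24
q1, _, q3 = quantiles(X, n=4)         # quartiles
iqr = q3 - q1                         # 0
print(rng, round(sd, 2), round(mad, 2), iqr)
```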
Pearson Correlation Coefficient
Measure of how strongly one attribute implies another
r = cov(v₁, v₂) / (s₁ s₂)

cov(v₁, v₂) = (1/n) (v₁ − v̄₁)(v₂ − v̄₂)ᵀ
Interpretation:
−1 ≤ r ≤ 1
-1 corresponds to negative correlation
+1 corresponds to positive correlation
Variance is a special case of covariance where v1 = v2
r ≠ 0 implies dependency
Independence implies covariance (and correlation) = 0
However, covariance or r = 0 does not in general imply
independence
PCC Examples
r = cov(v₁, v₂) / (s₁ s₂), cov(v₁, v₂) = (1/n) (v₁ − v̄₁)(v₂ − v̄₂)ᵀ

X = (2, 1, 3), Y = (1, 3, 2)
X̄ = 2, S²_X = 2/3; Ȳ = 2, S²_Y = 2/3
X − X̄ = (0, −1, 1), Y − Ȳ = (−1, 1, 0)
cov(X, Y) = (1/3)(0·(−1) + (−1)·1 + 1·0) = −1/3
r = (−1/3) / (2/3) = −0.5
PCC Examples
X=(2,1,3), Y=(1,3,2): r = −0.5, weak negative correlation
X=(2,1,2), Y=(1,3,1): r = −1, strong negative correlation
X=(2,1,2), Y=(4,2,4): r = 1, strong positive correlation
X=(2,1,2), Y=(5,6,7): r = 0, uncorrelated (not necessarily independent)
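A sketch that reproduces these r values, assuming the population (1/n) form of covariance and variance from the slide:

```python
import math

def pearson_r(xs, ys):
    """Population Pearson correlation: cov(x, y) / (sx * sy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(pearson_r((2, 1, 3), (1, 3, 2)))  # -0.5
print(pearson_r((2, 1, 2), (1, 3, 1)))  # -1
print(pearson_r((2, 1, 2), (4, 2, 4)))  # 1
print(pearson_r((2, 1, 2), (5, 6, 7)))  # 0
```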
Cross Correlation
Cross-correlation between two time series v₁(i), v₂(i): measures the
similarity between them by applying a time lag to one of them.
It can be used to find repeated patterns or periodic structure, and
hence for prediction.
Measured with the correlation coefficient r.
Autocorrelation: cross-correlation between values at different
points in time in the same series, separated by some lag:
v₁(i), v₁(i + lag) (also called autocovariance).
It can likewise be used to find repeated patterns or periodic structure
for prediction.
Multivariate Data Representation
Most common is sample-attribute matrix (pattern matrix or feature
matrix or observation matrix)
others: linked list, hierarchical
[Figure: n × d sample-attribute matrix; rows sample 1 … sample n,
columns attribute 1 … attribute d, entries value_ij.]
Data examination
Data Quality:
Accuracy: incorrect values (e.g. birthdates), inaccurate values,
transmission errors, duplicates
Completeness: values not recorded, unavailable, …
Consistency: delete inconsistent data? acquire more data? average?
Interpretability: how easily the data can be understood; correcting
errors or removing inconsistent data can make it harder to interpret
[From [2] Chp 3]
Data Cleaning
Examine the data to correct for:
missing values
outliers
noise.
[From [2] Chp 3]
Missing Values
Use attribute mean (or majority nominal value) to fill in missing values
If there are classes, use attribute mean (or majority nominal value) for
all samples in the same class
Can use prediction or interpolation to fill in missing values
Can remove the samples that have too many values missing
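A minimal sketch of the first strategy, mean imputation (the helper name is mine, and in practice this is applied per attribute/column):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

# mean of the observed values 1, 3, 5 is 3
print(fill_missing_with_mean([1.0, None, 3.0, None, 5.0]))
```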
Dealing with Outliers or Noise
Detection:
Use histograms to detect outliers
Use difference between mean, mode, median to indicate outliers
Use clustering to detect outliers.
Observe fluctuation in the values
Inconsistent values (negative values for positive attributes)
Fixing:
Remove samples that are way out of range.
Smoothing the data to get rid of fluctuations.
Use logic check to correct inconsistency.
Use prediction methods or fitting.
Binning Methods for Smoothing
Data Smoothing: the focus here is not correcting the data but softening it.
Sort data and partition into bins
equal width
equal frequency by number of samples
Smooth the values in each bin by:
replacing with the mean or median
replacing with the nearest bin boundary value
Binning Example
Sorted data: [4,8,9,15,21,21,24,25,26,28,29,34]
Using 3 equal-frequency bins of 4 samples each.
Bin 1 Bin 2 Bin 3
Binned Data: 4,8,9,15 21,21,24,25 26,28,29,34
means 9,9,9,9 23,23,23,23 29,29,29,29
boundaries 4,4,4,15 21,21,25,25 26,26,26,34
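The binning example can be reproduced as follows (the slide shows bin means rounded to integers: 22.75 → 23, 29.25 → 29):

```python
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
k = 4  # equal-frequency bins of 4 samples each
bins = [data[i:i + k] for i in range(0, len(data), k)]

# smooth by bin means (rounded, as on the slide)
means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smooth by bin boundaries: replace each value with the nearer of (min, max)
bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```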
Smoothing within a Window
If the values fluctuate rapidly, we can smooth them.
Smoothing within a window using a moving average.
For example, for window size 3, use the median or mean to smooth,
i.e., the mean or median of 3 consecutive values.
Example [From [5]]:
X = [0, 1, 2, 3, 4, 5, 6]
Y = [1, 3, 1, 4, 25, 3, 6]
Median-of-3 smoothing: [1, 1, 3, 4, 4, 6, 6] (gets rid of the outlier 25)
Mean-of-3 smoothing: [1, 1.67, 2.67, 10, 10.67, 11.33, 6]
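The window smoothing above can be sketched as follows (edge values are kept as-is, matching the slide's results):

```python
from statistics import median, mean

def smooth(ys, fn, w=3):
    """Window smoothing: replace each interior value by fn of its
    w-sample neighbourhood; edge values are kept unchanged."""
    h = w // 2
    out = list(ys)
    for i in range(h, len(ys) - h):
        out[i] = fn(ys[i - h:i + h + 1])
    return out

Y = [1, 3, 1, 4, 25, 3, 6]
print(smooth(Y, median))  # [1, 1, 3, 4, 4, 6, 6] -- removes the outlier 25
print([round(v, 2) for v in smooth(Y, mean)])
```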
Normalization
Map the values x₁, x₂, …, xₙ of attribute A to new values x′_i in
the interval [0, 1] (or any other interval).
Min-Max normalization:

x′_i = (x_i − minᵢ xᵢ) / (maxᵢ xᵢ − minᵢ xᵢ)

PRO: This makes the values invariant to rigid displacement of
coordinates.
CON: It will encounter an out-of-bounds error if a future input case
for normalization falls outside the original data range for A.
Subtract the mean: x′_i = x_i − Ā
[From [2] Chp 3.5]
Normalization
Z-score (standard score) normalization: scale by mean and standard
deviation

x′_i = (x_i − Ā) / σ_A
Positive means value is above the mean, Negative means it is below
the mean.
Pro: This method of normalization is useful when the actual
minimum and maximum of attribute A are unknown, or when there
are outliers that dominate the min-max normalization
Con: Normalization may or may not be desirable in some cases. It
may make samples that are dispersed in space closer to each other
and hence not separable.
Normalization Examples
A = (−200, 400, 600, 800)
Min = −200, Max = 800, max − min = 1000

Min-Max Normalization:
X′ = (0, 0.6, 0.8, 1)
mean′ = 0.6, σ_X′ = 0.374

Subtracting-Mean Normalization (mean = 400):
X″ = (−600, 0, 200, 400)
mean″ = 0, σ_X″ = 100√14

Z-score Normalization:
X‴ = (−6/√14, 0, 2/√14, 4/√14)
mean‴ = 0, σ_X‴ = 1
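These normalizations can be checked in Python, assuming the population σ (a 1/n denominator) as on the slide:

```python
import math

A = [-200, 400, 600, 800]
n = len(A)
m = sum(A) / n                                    # 400
sd = math.sqrt(sum((x - m) ** 2 for x in A) / n)  # population sigma = 100*sqrt(14)

# min-max normalization to [0, 1]
lo, hi = min(A), max(A)
minmax = [(x - lo) / (hi - lo) for x in A]        # [0.0, 0.6, 0.8, 1.0]

# subtract the mean
centered = [x - m for x in A]                     # [-600, 0, 200, 400]

# z-score normalization
zscores = [(x - m) / sd for x in A]

print(minmax, centered)
print([round(z, 3) for z in zscores])
```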
Normalization By Data Scaling
x′ = x / 10^j

where j is the smallest integer such that max |x′| < 1.
Example: if x ∈ [−986, 917] then max |x| = 986, so 10^j = 1000;
−986 normalizes to −0.986 and 917 to 0.917.
Note: normalization can change the characteristics of the original
data, but it is good for comparing values on different scales and
reduces the influence of large numbers in the data.

Normalization of Matrix Data

If A is the sample-feature matrix of normalized data, let

R = (1/n) AᵀA, with entries r_ij = (1/n) ∑_{k=1}^{n} x_ki x_kj

Under subtraction normalization, R is a covariance matrix.
Under z-score normalization, r_ij becomes the correlation coefficient
between features i and j, with r_jj = 1 for all j; R is then called
the correlation matrix.

Data Reduction

Goal: improve performance without hurting accuracy too much.
Dimensionality Reduction (more on this later): wavelet transforms;
principal component analysis.
Numerosity Reduction: regression; clustering (lots of time on this
later); sampling; data cube aggregation?
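The decimal-scaling rule x′ = x / 10^j can be sketched as follows (the helper name is mine):

```python
def decimal_scale(xs):
    """Normalize by decimal scaling: x' = x / 10^j, with j the smallest
    integer such that max(|x'|) < 1."""
    j = 0
    while max(abs(x) for x in xs) / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in xs]

print(decimal_scale([-986, 917]))  # [-0.986, 0.917]
```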
Sampling for Data Reduction

Sampling: obtaining a small subset s of data points to represent the
whole data set of size n.
Allows a mining algorithm to run in complexity that is potentially
sub-linear in the size of the data.
Key principle: choose a representative subset of the data.
Simple random sampling may have very poor performance in the
presence of skew.
Use adaptive sampling methods such as stratified sampling.
[From [2] Chp 3.4.8]

Types of Sampling

Simple random sampling: draw a random number from the sample
indices and select that object; treats all samples as equally likely to
be selected.
Sampling without replacement: remove selected objects from the
remaining samples; the original sample shrinks with every selection.
Sampling with replacement: a selected object is put back into the
original sample, so you may select the same object more than once.
Stratified sampling: similar to binning; the data set is partitioned and
samples are selected from each partition. Good for skewed data.

Summary

Definition and types of data
Summarizing data: mean, variance, deviation, correlation
Examination of data quality: accuracy, completeness, consistency,
interpretability
Data cleaning: missing values, outliers, noise
Transformation: smoothing with bins and windows; normalization and
scaling
Data reduction: dimensionality, numerosity

References

[Dunham, Data Mining Intro and Advanced Topics, 2003] Margaret
Dunham, Data Mining Introductory and Advanced Topics, ISBN
0130888923, Prentice Hall, 2003.
[Han, Kamber and Pei, Data Mining, 2011] Jiawei Han, Micheline
Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed.,
Morgan Kaufmann Publishers, May 2011.
[Duda, Pattern Classification, 2001] R. O. Duda, P. E. Hart and D. G.
Stork, Pattern Classification (2nd ed.), John Wiley and Sons, 2001.
[Jain and Dubes, Algorithms for Clustering Data, 1988] A. K. Jain and
R. C. Dubes, Algorithms for Clustering Data, ISBN 0-13-022278-X,
Prentice Hall, 1988.
[Cohen, Empirical Methods for Artificial Intelligence, 1995] P. Cohen,
Empirical Methods for Artificial Intelligence, ISBN 0-262-03225-2,
MIT Press, 1995.
[Ackoff, From Data to Wisdom, 1989] Ackoff, From Data to Wisdom,
Journal of Applied Systems Analysis, 1989.

Wrap-Up

Data set demo: let's see the class data so far, fill it all in!
Data entries will close tonight at midnight!
Link to spreadsheet will be posted in LEARN/dropbox.