January 4, 2017
Today’s Class
Part I
Announcements
Course Admin
Course Overview
motivation
topics
timelines
Part II
Understanding and Preparing Data for Analysis
Basic definitions of data and how to manage, clean, analyse data at a
high level.
Part I
Course Introduction
Course Admin
Announcements
Evaluation
Schedule
Course Goals
Background
Learning Objectives
Topics
Tools and Resources
Data and Knowledge Modelling and Analysis
Instructor: Mark Crowley – mcrowley@uwaterloo.ca – E5 4114
Material: learn.uwaterloo.ca
Lectures: Wednesdays 11:30am-2:20pm
Location: E5 5106/5128
TA: Rasoul Mohammadi Nasiri – r26mohammadinasiri@uwaterloo.ca –
E5 5013
TA Office Hours: email to arrange time for that week
Announcements
Registering/Waiting List – course is full
Log on to learn.uwaterloo.ca
enable email notifications
use the message boards, let me know if you want specific groups or
categories, I can create them
talk to each other
Workload and Evaluation
Assignments 20%
Final Exam 50%
Project 30%
Proposal 5%
Presentation 7%
Proposal/Presentation Peer Reviews 4%
Report 14%
Assignments and Projects can be done in groups of 2-3
Course Dates
First Class January 4, 2017
No Class February 22, 2017
Last Class March 29, 2017
Final Exam To Be Determined
Important Dates (Subject to change)
Task Issued Due
Assignment 1 January 16 February 6
Project Pitch Session Feb 1
Project Proposal soon February 15
Assignment 2 February 15 March 8
Feedback on Proposal February 22
Project Presentations March 22, 29
Project Report March 29
What are the goals of this course?
Everyone has data to process; many tools and best practices already
exist to do this.
Data could come from experiments, databases, the internet, sensors
or any other files.
This course aims to
provide engineering graduate students with essential knowledge of
data representation, grouping, mining and knowledge discovery.
Level the playing field on data representation, processing, basic
statistics, analysis, data mining.
Introduce basic Machine Learning techniques.
Required Background
Math and Linear Algebra: sets, matrices, transpose, cross product,
dot product, matrix multiplication, solving systems of linear equations
Programming :
You should be comfortable programming in some language; not large
software applications, but lots of calculations, plotting, etc.
Writing and Presenting :
Assignments and Project require written reports. Suggested tools:
LaTeX (local or ShareLaTeX), Word, Google Docs
Presentation: LaTeX Beamer, PowerPoint, Keynote
Probability and Statistics : (not required, we will define or review
these, but it would help)
definition of probability, Bayes theorem, information, entropy,
KL-divergence, probability distributions (Gaussian, Bernoulli, Poisson,
. . . )
hypothesis testing, chi-squared
Learning Objectives
By the end of this course you will…
Explain the sources and nature of data.
Demonstrate how to best represent given data, summarize it, select
proper metrics to evaluate the quality of the data and preprocess it
for full analysis.
Demonstrate ability to process data to extract useful information and
knowledge.
Answering Questions
What is Data?
How do I prepare data for analysis to remove sources of error, bias,
noise?
What can I say about the questions I can answer using a given
dataset?
How do I train, test and evaluate my hypotheses using data?
Which algorithms are the most appropriate to answer my questions
using this data?
Topics to be covered
1 Data types, sources, nature, scales and distributions
2 Data representations, transformation, dimensionality reduction and
normalization
3 Classification: Statistical based, Distance based, Decision based,
Deep Learning.
4 Clustering: Partitional, Hierarchical, Model and Density based, others.
5 Retrieval and Mining: Similarity measures and matching techniques.
6 Knowledge discovery in data: Rule induction, Association rules
mining, text mining.
7 Performance measures and tools: Statistical Analysis, Validity and
Assessment Measures.
Tools for Data Management and Analysis
Use whatever you prefer, but. . .
Suggested: Matlab
See the short tutorial slides on LEARN.
Matlab's documentation is very detailed.
Good choice: python
numpy, scipy, scikit-learn
Lots of resources online: communities, modules, new code tools all the
time.
Other: R
The statistician's choice. Very powerful; less support from me, but a
large online community too.
Other: Java – Weka – data analysis tools for Java.
See the Computing Resources page on LEARN/website with tips on
servers/systems you can use on campus.
Sharcnet: research students can have their supervisor sponsor them to
use Sharcnet at no cost.
If you find useful resources, add them to the resources discussion
forum on LEARN.
Other Tools and Resources
Mendeley.com Community – online resource for academic papers.
Group for this course, once you have an account you can join it and
post your own paper or comments on papers.
(https://www.mendeley.com/community/fb96334c-a81c-3f85-8f10-
0acf493c1423/)
Kaggle Competition
Azure Tools – Microsoft, free to use for single user, single machine
smaller runs.
Similar tools from Amazon, Google?
Other Relevant Courses:
ECE 657: Tools of Intelligent Systems Design
ECE 750: Topic 5 – Distributed and Network-Centric Computing
ECE ???: Real time systems
CS 489/698: Big Data Infrastructure
CS 848/858: Models and Applications of Distributed Data Processing
Systems
CS 685: Machine Learning: Statistical and Computational
Foundations
STAT 841: Statistical Learning – Classification
SYDE 675: Pattern Recognition (similar to this course)
Part II
Understanding and Preparing Data
Outline of Today’s Lecture
Data, Data Types and Information
Types of Data
Data Representations
Summarizing Data
Central Tendency
Measures of Dispersion
Multiple Variables
Data Preprocessing
Data Examination
Data Cleaning
Data Transformation: Smoothing
Data Transformation: Normalization, Scaling
Data Reduction
Data and Information
One way to think about it…
Data: a value that is measured (continuous, e.g. 25, 108.3) or
counted/observed (discrete, e.g. male, married, 5). Data by itself
does not have a meaning.
Information: interpreted data; adding meaning to data and
understanding relations on data. E.g. if the measured data is 25 and
the measuring device is a thermometer, then the reading is a
temperature. The attribute temperature adds meaning to the data.
What is Information?
Another way to think about it… Entropy measures the uncertainty that is
resolved after observing a variable.

H = − ∑_{i=1}^{m} P_i log₂ P_i

If trials are equally probable and independent then you can add their
entropies to get the cumulative entropy.
If the next outcome is certain then the entropy is 0.
The outcome of a fair coin flip provides 1 bit of information.
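The formula can be checked with a short sketch in plain Python (the helper name `entropy` is mine, not from the slides):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin flip: 1.0 bit
print(entropy([1.0]))       # certain outcome: 0 bits
```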
Data Something Something. . .
Data Sources: measurements from sensors, records, files, document,
archives, transactions.
Data Modeling: Creating a structure, organization, function or an
abstract view of the data.
Data Analysis: Transforming or operating on data to extract useful
information, knowledge or conclusions.
Data Mining: Carrying this further to discover unforeseen or hidden
patterns in the data.
Big Data
Big data is about quality and performance given *huge* amounts of data;
that is not the primary focus of this course. But the tools and analysis
methods we learn are part of the basis you need to deal with Big Data.
Volume – Large amounts of data, social networks, phone, location,
embedded systems, environmental, satellites, ”full firehose”
Velocity – Streaming, online data, arriving quickly, time series,
real-time
Variety – Heterogeneous (many types), many sources, category data,
numerical data, continuous/discrete, text, images, audio,
video
Big Data
Some people say it also includes:
Veracity – Solution requires: Accuracy, Confidence, Precision, Error
Variability – Changes moment to moment, distribution can change at
different times (seasonal, trends, fads, memes)
Complexity – Combinatorial connections between entities in the data,
form networks, hierarchies
Types of Data Attributes
A data point has a set of attributes (also called dimensions, features or
variables):
25, 30, -1.282, 8.3e5
1st, 3rd
blue, red, green
hi, med, lo
Types of Data (Qualitative)
Nominal: no implication of quantity (qualitative)
e.g. occupation: engineer, teacher, dentist, bus driver; or color:
blue, red, green. Binary is a special case: 0, 1. Operations: =, ≠
Ordinal: relative ranking among values
(i.e., values are ordered in relation to each other)
e.g. (hi, med, lo), (disagree, neutral, agree), (5, 3, 1). Operations: <, >
[From [2] Chp 2]
Intervals
Interval: like the ordinal type, but on a scale of equal-size units;
i.e., a unit of measurement exists. Operations: +, −
Interpretation of numbers depends on the unit. The interval (range)
is important for the interpretation.
E.g. the significance of a mark of 10 is different on a 0–10 scale
than on a 0–100 scale.
Interval numbers represent differences between values, not absolute
quantities.
Temperature in C or F is an interval attribute.
Ratios
Ratio: like the interval type in order and uniformity, but the scale
has an inherent zero-point. Operations: *, /
e.g. the Kelvin temperature scale for heat: 0 K means no heat, and
50 K is double the heat of 25 K.
Can't say that for the Celsius scale; 0 C is just the amount of heat
at the freezing point.
Used for physical quantities: height, weight, length, etc.
Also locations, distance, money.
Structural Data
Values are represented in a tree or hierarchy or graph
Hierarchical structure could be whole-part, abstract-specific,
classes-instances
[Figure: example trees. Canada → (BC, Ontario, Quebec);
Vehicle → (Bus, Motorcycle, Car).]
Examples of tree structured data.
Structural Data
[Figure: class-instance tree. Data Modelling → Predictive Models
(Classification, Regression, Time Series Predictors) and Descriptive
Models (Clustering, Summarization, Association Rules, Sequence
Discovery).]
Examples of class-instance tree structured data.
Graphs or Trees
[Figure: example graph over Year, Month, Season, Day, Hour, AM/PM,
with Season → (Winter, Spring, Summer, Autumn).]
Data Bases
data that has the same structure (schema) or abstract view
independent of the physical layer
[Figure: entity-relationship model. Entities employee (ID, Name,
address, salary) and Job (Job-no, Job-des, Pay-range), linked by a
relationship.]
Entity-relationship model: a relationship is used to describe an
association between entities.
From [1]
Lists/Vector/Matrix/Data Cube
mainly table or vector or attribute-sample matrix which provides
relational view
[Figure: attribute-sample matrix with rows x1, x2, x3, … (samples) and
columns a1, a2, a3, … (attributes); a data cube adds one more
attribute, e.g. time.]
Data Bases
OLTP: databases providing online transaction processing; the
traditional transactional database approach.
Data Warehouse: set of data to support a decision support system,
Online Analytic Processing (OLAP)
Function: is a representation that describes the data as a
mathematical function, curve, nonlinear function etc.
Summarizing Data
We have data; we need to find patterns in it.
The simplest pattern is a summary of the data.
Summarizing A Single Variable
Given a univariate sample X1, . . . ,Xn (could be Real, Natural,
Integers)
Goal: Summarize the variable compactly with a few numbers:
We want to summarize properties like spread, variation, range.
Anything that can provide a summary statistic for the variable.
Average: the simplest and most common estimate of central
tendency.
mean(x) = µ = x̄ = (1/n) ∑_{i=1}^{n} X_i
Pro: If the samples come from a normal distribution then the average
is the optimal estimate.
Con: Sensitive to outliers. (could be noise, data entry error, actual
outliers)
Summarizing A Single Variable
Median: If the samples are sorted then the median is the value that
splits the list into half
Mode: is the most common value in the list of samples (data can be
bimodal or more)
Skew: (third moment) high skew means the bulk of the data is at
one end. Result: Median will be a better measure than mean.
Kurtosis: (fourth moment) A measure of the heaviness of the tail of
the distribution with respect to a set of points with a
normal/Gaussian distribution and the same variance.
[Excerpt from [2], Section 2.2, shown on the slide:]
The mode is the value that occurs most frequently in a data set; it can
be determined for qualitative and quantitative attributes. Data sets
with one, two, or three modes are called unimodal, bimodal, and
trimodal; in general, a data set with two or more modes is multimodal.
If each value occurs only once, there is no mode.
For unimodal numeric data that are moderately skewed, the empirical
relation mean − mode ≈ 3 × (mean − median) lets the mode be
approximated when the mean and median are known.
The midrange, the average of the largest and smallest values, can also
assess central tendency (easy to compute with the SQL aggregates
max() and min()).
In a perfectly symmetric distribution the mean, median, and mode
coincide; in positively skewed data the mode is smaller than the
median, and in negatively skewed data it is greater. (Figure 2.1 in [2].)
Central Moments of a Set of Points
Mean(1), Variance(2), Skew(3) and Kurtosis(4) are unified by a single
type of calculation on the n data points.
µ_k = ∫_{−∞}^{∞} (x − c)^k f(x) dx

µ̂_k ≈ (1/n) ∑_{i=1}^{n} (X_i − x̄)^k

The 3rd and 4th moments are usually normalized by powers of s, just as
the standard deviation is the normalized 2nd moment.
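A minimal sketch of the sample central moments, assuming the standard (1/n)-normalized form and an arbitrary made-up data set:

```python
import math

def central_moment(xs, k):
    """k-th central moment: (1/n) * sum((x - mean)^k)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** k for x in xs) / n

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # arbitrary example data
var = central_moment(xs, 2)                     # 2nd moment = variance
sd = math.sqrt(var)
skew = central_moment(xs, 3) / sd ** 3          # normalized 3rd moment
kurt = central_moment(xs, 4) / sd ** 4          # normalized 4th moment
print(var, skew, kurt)
```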
Types of Mean Functions
Trimmed Mean: ignore a small percentage of the highest and lowest
values
Geometric Mean:

(∏_{i=1}^{n} x_i)^{1/n} = exp[(1/n) ∑_{i=1}^{n} log x_i] ≤ arithmetic mean

Arithmetic mean of the logarithm-transformed x
Good for positive values and for averaging growth rates
Most appropriate for ranking normalized results (different
normalizations can alter the ordering under arithmetic or harmonic
means)
Types of Mean Functions
Harmonic mean: average of rates
H = n / (1/x₁ + 1/x₂ + · · · + 1/xₙ)

It is the reciprocal of the arithmetic mean of the reciprocals of the
sample points.
Appropriate for values that are inversely proportional to time such as
“speedup”.
Mean Examples (in Matlab)
Data: X = [1,1,1,1,1,1,100]
n = 7
Mean = sum(X)/n = 106/7 ≈ 15.14
Median = median(X) = 1
Mode = mode(X) = 1
Trimmed mean (25%) = 1
Geometric mean = 1.9307
Harmonic mean = 1.1647
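The same numbers can be reproduced in plain Python (no toolboxes); the trimmed mean here drops one sample from each end, reproducing the slide's 25% trimmed-mean result:

```python
import math
from statistics import median, mode, mean

X = [1, 1, 1, 1, 1, 1, 100]
n = len(X)

avg = sum(X) / n                                  # 106/7, about 15.14
med = median(X)                                   # 1
mod = mode(X)                                     # 1
k = round(n * 0.25 / 2)                           # trim ~12.5% from each end
trimmed = mean(sorted(X)[k:n - k])                # 1
geom = math.exp(sum(math.log(x) for x in X) / n)  # about 1.9307
harm = n / sum(1 / x for x in X)                  # about 1.1647
print(round(avg, 2), med, mod, trimmed, round(geom, 4), round(harm, 4))
```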
Measures of Dispersion: Variance and Deviation
Measure the spread of the data range.
Standard Deviation:

σ = √[(1/n) ∑_{i=1}^{n} (x_i − x̄)²]

Pro: same units as the data
Con: sensitive to outliers
matlab: std(x) (note: Matlab's std and var divide by n − 1 by default)
Variance:

σ² = S² = (1/n) ∑_{i=1}^{n} (x_i − x̄)²

matlab: var(x)
Variance and Deviation
Mean Absolute Deviation (MAD):

MAD = (1/n) ∑_{i=1}^{n} |x_i − x̄|
Less sensitive to outliers than STD
matlab:mad(x)
Interquartile Range (IQR): Difference between 75th (Q3) and 25th
(Q1) percentile of data
Deviation Examples
Data: X = [1,1,1,1,1,1,100]
n = 7
Range = range(X) = 99
Std = std(X) = 37.42
MAD = mad(X) = 24.24
IQR = 0
[Figure: box-plot of the 5-number summary: min, Q1, median, Q3, max.]
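The deviation example above (X = [1,1,1,1,1,1,100]) can be reproduced in Python; note that 37.42 matches the sample standard deviation with an n − 1 denominator, Matlab's default:

```python
from statistics import stdev, quantiles

X = [1, 1, 1, 1, 1, 1, 100]
n = len(X)
m = sum(X) / n

rng = max(X) - min(X)                 # 99
sd = stdev(X)                         # about 37.42 (n-1 denominator)
mad = sum(abs(x - m) for x in X) / n  # about 24.24
q1, _, q3 = quantiles(X, n=4)         # quartiles
iqr = q3 - q1                         # 0
print(rng, round(sd, 2), round(mad, 2), iqr)
```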
Pearson Correlation Coefficient
Measure of how strongly one attribute implies another
r = cov(v₁, v₂) / (s₁ s₂)

cov(v₁, v₂) = (1/n) (v₁ − v̄₁)(v₂ − v̄₂)ᵀ
Interpretation:
−1 ≤ r ≤ 1
-1 corresponds to negative correlation
+1 corresponds to positive correlation
Variance is a special case of covariance where v1 = v2
r ≠ 0 implies dependency
Independence implies covariance (and correlation) = 0
However, covariance or r = 0 does not in general imply
independence
PCC Examples
r = cov(v₁, v₂) / (s₁ s₂), cov(v₁, v₂) = (1/n) (v₁ − v̄₁)(v₂ − v̄₂)ᵀ

X = (2, 1, 3), Y = (1, 3, 2)
X̄ = 2, S²_X = 2/3; Ȳ = 2, S²_Y = 2/3
X − X̄ = (0, −1, 1), Y − Ȳ = (−1, 1, 0)
cov(X, Y) = (1/3)(0·(−1) + (−1)·1 + 1·0) = −1/3
r = (−1/3) / (2/3) = −0.5
PCC Examples
X=(2,1,3), Y=(1,3,2): r = −0.5, weak negative correlation
X=(2,1,2), Y=(1,3,1): r = −1, strong negative correlation
X=(2,1,2), Y=(4,2,4): r = 1, strong positive correlation
X=(2,1,2), Y=(5,6,7): r = 0, uncorrelated (not necessarily independent)
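A sketch that reproduces these r values, assuming the population (1/n) form of covariance and variance from the slide:

```python
import math

def pearson_r(xs, ys):
    """Population Pearson correlation: cov(x, y) / (sx * sy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(pearson_r((2, 1, 3), (1, 3, 2)))  # -0.5
print(pearson_r((2, 1, 2), (1, 3, 1)))  # -1
print(pearson_r((2, 1, 2), (4, 2, 4)))  # 1
print(pearson_r((2, 1, 2), (5, 6, 7)))  # 0
```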
Cross Correlation
Cross-correlation between two time series v₁(i), v₂(i): measures the
similarity between them by applying a time lag to one of them.
It can be used to find repeated patterns or periodic structure, and
hence for prediction.
Measured with the correlation coefficient r.
Autocorrelation: cross-correlation between values at different
points in time in the same series, separated by some lag:
v₁(i), v₁(i + lag) (also called autocovariance).
It can likewise be used to find repeated patterns or periodic structure
for prediction.
Multivariate Data Representation
Most common is sample-attribute matrix (pattern matrix or feature
matrix or observation matrix)
others: linked list, hierarchical
[Figure: n × d sample-attribute matrix; rows sample 1 … sample n,
columns attribute 1 … attribute d, entries value_ij.]
Data examination
Data Quality:
Accuracy: incorrect values (e.g. birthdates), inaccurate values,
transmission errors, duplicates
Completeness: values not recorded, unavailable, …
Consistency: delete inconsistent data? acquire more data? average?
Interpretability: how easily the data can be understood; correcting
errors or removing inconsistent data can make it harder to interpret
[From [2] Chp 3]
Data Cleaning
Examine the data to correct for:
missing values
outliers
noise.
[From [2] Chp 3]
Missing Values
Use attribute mean (or majority nominal value) to fill in missing values
If there are classes, use attribute mean (or majority nominal value) for
all samples in the same class
Can use prediction or interpolation to fill in missing values
Can remove the samples that have too many values missing
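A minimal sketch of the first strategy, mean imputation (the helper name is mine, and in practice this is applied per attribute/column):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

# mean of the observed values 1, 3, 5 is 3
print(fill_missing_with_mean([1.0, None, 3.0, None, 5.0]))
```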
Dealing with Outliers or Noise
Detection:
Use histograms to detect outliers
Use difference between mean, mode, median to indicate outliers
Use clustering to detect outliers.
Observe fluctuation in the values
Inconsistent values (negative values for positive attributes)
Fixing:
Remove samples that are way out of range.
Smoothing the data to get rid of fluctuations.
Use logic check to correct inconsistency.
Use prediction methods or fitting.
Binning Methods for Smoothing
Data Smoothing: the focus here is not correcting the data but softening it.
Sort data and partition into bins
equal width
equal frequency by number of samples
Smooth the values in each bin by:
replacing with the mean or median
replacing with the nearest bin boundary value
Binning Example
Sorted data: [4,8,9,15,21,21,24,25,26,28,29,34]
Using 3 equal-frequency bins of 4 samples each.
Bin 1 Bin 2 Bin 3
Binned Data: 4,8,9,15 21,21,24,25 26,28,29,34
means 9,9,9,9 23,23,23,23 29,29,29,29
boundaries 4,4,4,15 21,21,25,25 26,26,26,34
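The binning example can be reproduced as follows (the slide shows bin means rounded to integers: 22.75 → 23, 29.25 → 29):

```python
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
k = 4  # equal-frequency bins of 4 samples each
bins = [data[i:i + k] for i in range(0, len(data), k)]

# smooth by bin means (rounded, as on the slide)
means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smooth by bin boundaries: replace each value with the nearer of (min, max)
bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```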
Smoothing within a Window
If the values fluctuate rapidly, we can smooth them.
Smoothing within a window using a moving average.
For example, for window size 3, use the median or mean to smooth,
i.e., the mean or median of 3 consecutive values.
Example [From [5]]:
X = [0, 1, 2, 3, 4, 5, 6]
Y = [1, 3, 1, 4, 25, 3, 6]
Median-of-3 smoothing: [1, 1, 3, 4, 4, 6, 6] (gets rid of the outlier 25)
Mean-of-3 smoothing: [1, 1.67, 2.67, 10, 10.67, 11.33, 6]
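The window smoothing above can be sketched as follows (edge values are kept as-is, matching the slide's results):

```python
from statistics import median, mean

def smooth(ys, fn, w=3):
    """Window smoothing: replace each interior value by fn of its
    w-sample neighbourhood; edge values are kept unchanged."""
    h = w // 2
    out = list(ys)
    for i in range(h, len(ys) - h):
        out[i] = fn(ys[i - h:i + h + 1])
    return out

Y = [1, 3, 1, 4, 25, 3, 6]
print(smooth(Y, median))  # [1, 1, 3, 4, 4, 6, 6] -- removes the outlier 25
print([round(v, 2) for v in smooth(Y, mean)])
```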
Normalization
Map the values x₁, x₂, …, xₙ of attribute A to new values x′_i in
the interval [0, 1] (or any other interval).
Min-Max normalization:

x′_i = (x_i − minᵢ xᵢ) / (maxᵢ xᵢ − minᵢ xᵢ)

PRO: This makes the values invariant to rigid displacement of
coordinates.
CON: It will encounter an out-of-bounds error if a future input case
for normalization falls outside the original data range for A.
Subtract the mean: x′_i = x_i − Ā
[From [2] Chp 3.5]
Normalization
Z-score (standard score) normalization: scale by mean and standard
deviation

x′_i = (x_i − Ā) / σ_A
Positive means value is above the mean, Negative means it is below
the mean.
Pro: This method of normalization is useful when the actual
minimum and maximum of attribute A are unknown, or when there
are outliers that dominate the min-max normalization
Con: Normalization may or may not be desirable in some cases. It
may make samples that are dispersed in space closer to each other
and hence not separable.
Normalization Examples
A = (−200, 400, 600, 800)
Min = −200, Max = 800, max − min = 1000

Min-Max Normalization:
X′ = (0, 0.6, 0.8, 1)
mean′ = 0.6, σ_X′ = 0.374

Subtracting-Mean Normalization (mean = 400):
X″ = (−600, 0, 200, 400)
mean″ = 0, σ_X″ = 100√14

Z-score Normalization:
X‴ = (−6/√14, 0, 2/√14, 4/√14)
mean‴ = 0, σ_X‴ = 1
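These normalizations can be checked in Python, assuming the population σ (a 1/n denominator) as on the slide:

```python
import math

A = [-200, 400, 600, 800]
n = len(A)
m = sum(A) / n                                    # 400
sd = math.sqrt(sum((x - m) ** 2 for x in A) / n)  # population sigma = 100*sqrt(14)

# min-max normalization to [0, 1]
lo, hi = min(A), max(A)
minmax = [(x - lo) / (hi - lo) for x in A]        # [0.0, 0.6, 0.8, 1.0]

# subtract the mean
centered = [x - m for x in A]                     # [-600, 0, 200, 400]

# z-score normalization
zscores = [(x - m) / sd for x in A]

print(minmax, centered)
print([round(z, 3) for z in zscores])
```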
Normalization By Data Scaling
x′ = x / 10^j

where j is the smallest integer such that max |x′| < 1.
Example: if x ∈ [−986, 917] then max |x| = 986, so 10^j = 1000;
−986 normalizes to −0.986 and 917 to 0.917.
Note: normalization can change the characteristics of the original
data, but it is good for comparing values on different scales and
reduces the influence of large numbers in the data.

Normalization of Matrix Data

If A is the sample-feature matrix of normalized data, let

R = (1/n) AᵀA, with entries r_ij = (1/n) ∑_{k=1}^{n} x_ki x_kj

Under subtraction normalization, R is a covariance matrix.
Under z-score normalization, r_ij becomes the correlation coefficient
between features i and j, with r_jj = 1 for all j; R is then called
the correlation matrix.

Data Reduction

Goal: improve performance without hurting accuracy too much.
Dimensionality Reduction (more on this later): wavelet transforms;
principal component analysis.
Numerosity Reduction: regression; clustering (lots of time on this
later); sampling; data cube aggregation?
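The decimal-scaling rule x′ = x / 10^j can be sketched as follows (the helper name is mine):

```python
def decimal_scale(xs):
    """Normalize by decimal scaling: x' = x / 10^j, with j the smallest
    integer such that max(|x'|) < 1."""
    j = 0
    while max(abs(x) for x in xs) / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in xs]

print(decimal_scale([-986, 917]))  # [-0.986, 0.917]
```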
Sampling for Data Reduction

Sampling: obtaining a small subset s of data points to represent the
whole data set of size n.
Allows a mining algorithm to run in complexity that is potentially
sub-linear in the size of the data.
Key principle: choose a representative subset of the data.
Simple random sampling may have very poor performance in the
presence of skew.
Use adaptive sampling methods such as stratified sampling.
[From [2] Chp 3.4.8]

Types of Sampling

Simple random sampling: draw a random number from the sample
indices and select that object; treats all samples as equally likely to
be selected.
Sampling without replacement: remove selected objects from the
remaining samples; the original sample shrinks with every selection.
Sampling with replacement: a selected object is put back into the
original sample, so you may select the same object more than once.
Stratified sampling: similar to binning; the data set is partitioned and
samples are selected from each partition. Good for skewed data.

Summary

Definition and types of data
Summarizing data: mean, variance, deviation, correlation
Examination of data quality: accuracy, completeness, consistency,
interpretability
Data cleaning: missing values, outliers, noise
Transformation: smoothing with bins and windows; normalization and
scaling
Data reduction: dimensionality, numerosity

References

[Dunham, Data Mining Intro and Advanced Topics, 2003] Margaret
Dunham, Data Mining Introductory and Advanced Topics, ISBN
0130888923, Prentice Hall, 2003.
[Han, Kamber and Pei, Data Mining, 2011] Jiawei Han, Micheline
Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed.,
Morgan Kaufmann Publishers, May 2011.
[Duda, Pattern Classification, 2001] R. O. Duda, P. E. Hart and D. G.
Stork, Pattern Classification (2nd ed.), John Wiley and Sons, 2001.
[Jain and Dubes, Algorithms for Clustering Data, 1988] A. K. Jain and
R. C. Dubes, Algorithms for Clustering Data, ISBN 0-13-022278-X,
Prentice Hall, 1988.
[Cohen, Empirical Methods for Artificial Intelligence, 1995] P. Cohen,
Empirical Methods for Artificial Intelligence, ISBN 0-262-03225-2,
MIT Press, 1995.
[Ackoff, From Data to Wisdom, 1989] Ackoff, From Data to Wisdom,
Journal of Applied Systems Analysis, 1989.

Wrap-Up

Data set demo: let's see the class data so far, fill it all in!
Data entries will close tonight at midnight!
Link to spreadsheet will be posted in LEARN/dropbox.