ECE 657A: Data and Knowledge Modelling and Analysis
Lecture 2: Preprocessing, Similarity, Parameter Estimation
Mark Crowley
January 11, 2016
Opening Data Example
Guess the dataset
Our class data: the spreadsheet will be posted on LEARN if people want to try to find a pattern.
Outline
Announcements
Data Preprocessing
Group forming ice breaker break
Measuring Data Similarity
Basic Parameter Estimation
Announcements
Project Description during group icebreaker
Assignment 1 will come out in the next week and be due three weeks later: Monday, Feb 6, 11:59pm, in groups of 2-3.
Part II
Data Preprocessing
Outline of Data Preprocessing
Data Examination and Cleaning
Accuracy, Completeness, Consistency, Interpretability
Data Transformation
Filling in Missing Data
Smoothing with Bins
Smoothing with Windows
Normalization
Scaling
Data Reduction
via Sampling
Data Preprocessing Overview
Examination of Data: Quality
accuracy, completeness, consistency, interpretability
Data Cleaning: missing values, outliers, noise
Transformation:
Smoothing with bins and windows
Normalization and Scaling
Data Reduction: dimensionality, numerosity
Data Examination and Cleaning: Accuracy, Completeness, Consistency, Interpretability
Data examination
Data Quality:
Accuracy: incorrect values (e.g. wrong birthdates), inaccurate measurements, transmission errors, duplicates
Completeness: values not recorded, unavailable, ...
Consistency: delete inconsistent data? acquire more data? average?
Interpretability: how easily the data can be understood; correcting errors or removing inconsistent data could make it harder to interpret
Data Cleaning
Examining the data to correct for:
missing values
outliers
noise
Data Transformation
Data Transformation: Filling in Missing Data
Missing Values
Use attribute mean (or majority nominal value) to fill in missing values
If there are classes, use attribute mean (or majority nominal value) for
all samples in the same class
Can use prediction or interpolation to fill in missing values: linear, polynomial, spline
Can remove the samples that have too many values missing
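For illustration (not from the original slides), a minimal Python/NumPy sketch of two of these strategies, filling with the attribute mean and filling by linear interpolation; the toy array and the use of NaN to mark missing entries are assumptions of the example:

import numpy as np

x = np.array([2.0, np.nan, 5.0, 7.0, np.nan, 11.0])   # NaN marks a missing value

# Strategy 1: replace missing values with the attribute mean
x_mean_filled = np.where(np.isnan(x), np.nanmean(x), x)

# Strategy 2: linear interpolation between the nearest observed neighbours
idx = np.arange(len(x))
x_interp = x.copy()
x_interp[np.isnan(x)] = np.interp(idx[np.isnan(x)], idx[~np.isnan(x)], x[~np.isnan(x)])

print(x_mean_filled)   # -> [2. 6.25 5. 7. 6.25 11.]
print(x_interp)        # -> [2. 3.5 5. 7. 9. 11.]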
Dealing with Outliers or Noise
Detection:
Use histograms to detect outliers
Use difference between mean, mode, median to indicate outliers
Use clustering to detect outliers.
Observe fluctuations in the values
Inconsistent values (negative values for positive attributes)
Fixing:
Remove samples that are far out of range.
Smooth the data to get rid of fluctuations.
Use logic checks to correct inconsistencies.
Use prediction methods or fitting.
Anomaly Detection (related): what if the outlier is what you are looking for?
Histograms for Outlier Detection
A histogram is a bar plot of values vs. frequency (MATLAB: hist(x, nbins)).
Values are divided into bins (ranges of values).
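The MATLAB call above has a direct NumPy/matplotlib analogue. The sketch below (an illustration, not from the slides) bins the values and flags sparsely populated bins as potential outlier locations; the synthetic data and the 5% cut-off are assumptions:

import numpy as np

x = np.concatenate([np.random.normal(50, 5, 500), [120.0, -30.0]])  # two injected outliers

counts, edges = np.histogram(x, bins=20)        # same idea as MATLAB hist(x, nbins)
threshold = 0.05 * counts.max()                 # assumed cut-off for "sparsely populated"
for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
    if 0 < c < threshold:
        print(f"suspicious bin [{lo:.1f}, {hi:.1f}) containing {c} value(s)")

# To plot: import matplotlib.pyplot as plt; plt.hist(x, bins=20); plt.show()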
Data Transformation: Smoothing with Bins
Binning Methods for Smoothing
Data Smoothing: the focus here is not on correcting the data but on softening it.
Sort data and partition into bins
equal width
equal frequency by number of samples
Smooth the values in each bin by:
replacing with the mean or median
replacing with the nearest bin boundary value
Binning Example
Sorted data: [4,8,9,15,21,21,24,25,26,28,29,34]
Using 3 equal-frequency bins of 4 samples each:
               Bin 1           Bin 2            Bin 3
Binned data:   4, 8, 9, 15     21, 21, 24, 25   26, 28, 29, 34
Means:         9, 9, 9, 9      23, 23, 23, 23   29, 29, 29, 29
Boundaries:    4, 4, 4, 15     21, 21, 25, 25   26, 26, 26, 34
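A small Python/NumPy sketch reproducing this example (added for illustration, not part of the original slides), smoothing by bin means and by nearest bin boundaries:

import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)  # already sorted
bins = np.array_split(data, 3)               # 3 equal-frequency bins of 4 samples each

by_mean, by_boundary = [], []
for b in bins:
    m = round(float(b.mean()))
    by_mean += [m] * len(b)                  # smooth by (rounded) bin mean
    lo, hi = float(b[0]), float(b[-1])       # bin boundaries
    by_boundary += [lo if v - lo <= hi - v else hi for v in b]

print(by_mean)       # -> [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
print(by_boundary)   # -> [4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34]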
Data Transformation: Smoothing with Windows
Smoothing within a Window
If the values fluctuate rapidly, we can smooth them.
Smoothing within a window using a moving average
For example, for window size 3, using the median or mean to smooth
i.e. mean or median of 3 consecutive values
Smoothing within a Window
Example [From [5]]
  X: 0  1  2  3  4  5  6
  Y: 1  3  1  4  25 3  6
[Figure: plot of Y against X showing the single large spike of 25 at x = 4]
Smoothing with a moving window of size 3 (keeping the edge values):
  median of 3: 1  1  3  4  4  6  6        (gets rid of the 25)
  mean of 3:   1  1.67  2.67  10  10.67  11.33  6
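A short Python sketch of the same window-3 smoothing (added for illustration; keeping the edge values unchanged is one possible boundary convention, matching the numbers above):

import numpy as np

y = np.array([1, 3, 1, 4, 25, 3, 6], dtype=float)

def smooth(values, stat, window=3):
    # centred moving window; edge values are left unchanged
    half = window // 2
    out = values.copy()
    for i in range(half, len(values) - half):
        out[i] = stat(values[i - half:i + half + 1])
    return out

print(smooth(y, np.median))             # -> 1 1 3 4 4 6 6 (the 25 is gone)
print(np.round(smooth(y, np.mean), 2))  # -> 1 1.67 2.67 10 10.67 11.33 6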
Data Transformation: Normalization
Normalization
Map the values x1, x2, ..., xn of the attribute A to new values x'_i in the interval [0,1] (or any other interval).
Min-Max normalization:
  x'_i = (x_i − min_i x_i) / (max_i x_i − min_i x_i)
PRO: This makes the values invariant to rigid displacement of coordinates.
CON: It will encounter an out-of-bounds error if a future input case for normalization falls outside of the original data range for A.
Subtract the mean: x'_i = x_i − Ā
[From [2] Chp 3.5]
Z-Score Normalization
Z-score (standard score) normalization: Scale by mean and standard
deviation
  x'_i = (x_i − Ā) / σ_A
Positive means value is above the mean, Negative means it is below
the mean.
Pro: This method of normalization is useful when the actual
minimum and maximum of attribute A are unknown, or when there
are outliers that dominate the min-max normalization
Con: Normalization may or may not be desirable in some cases. It may make samples that are dispersed in space closer to each other and hence more difficult to separate.
Normalization Examples
A = (−200, 400, 600, 800)
min = −200, max = 800, max − min = 1000
Min-Max Normalization:
  X' = (0, 0.6, 0.8, 1),   mean' = 0.6,   σ' = 0.374
Subtracting-Mean Normalization:
  mean = 400
  X'' = (−600, 0, 200, 400),   mean'' = 0,   σ'' = 100√14 ≈ 374
Z-score Normalization:
  X''' = (−6/√14, 0, 2/√14, 4/√14),   mean''' = 0,   σ''' = 1
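A minimal NumPy sketch reproducing the three normalizations on this attribute (added for illustration, not part of the slides):

import numpy as np

A = np.array([-200.0, 400.0, 600.0, 800.0])

min_max = (A - A.min()) / (A.max() - A.min())   # -> [0. 0.6 0.8 1.]
centered = A - A.mean()                         # -> [-600. 0. 200. 400.]
z_score = (A - A.mean()) / A.std()              # population std, sigma = 100*sqrt(14) ~ 374

print(min_max, centered, z_score, sep="\n")
print(z_score.mean(), z_score.std())            # -> 0.0 and 1.0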
Data Transformation: Scaling
Normalization By Data Scaling
  x' = x / 10^j
where j is the smallest integer such that max |x'| < 1.
Example: if x ∈ [−986, 917] then max |x| = 986, so j = 3 and we divide by 1000: −986 normalizes to −0.986 and 917 to 0.917.
Note: normalization can change the characteristics of the original data, but it is good for comparing values of different scales and it reduces the influence of large numbers in the data.

Normalization of Matrix Data
If A is the sample-feature matrix in terms of normalized data, then let
  R = (1/n) AᵀA,   r_ij = (1/n) Σ_{k=1}^n x_ki x_kj
Under subtraction normalization, R is a covariance matrix.
Under z-score normalization, r_ij becomes the correlation coefficient between features i and j, and r_jj = 1 for all j; R is then called the correlation matrix.

Data Reduction
Goal: improve performance without hurting accuracy too much.
Numerosity Reduction
  regression - replace many points with a smaller number of predictions or a function
  clustering - much more on this later
  sampling - to reduce data size
Dimensionality Reduction
  wavelet transforms
  principal component analysis (more on this later)

Sampling for Data Reduction
Sampling: obtaining a small subset of data points s to represent the whole data set n.
Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data.
Key principle: choose a representative subset of the data.
  Simple random sampling may have very poor performance in the presence of skew.
  Use adaptive sampling methods such as stratified sampling.
[From [2] Chp 3.4.8]

Types of Sampling
Simple random sampling: draw a random number from the sample indices and select that object. Treats all samples as equally likely to be selected.
Sampling without replacement: remove the objects you select from the remaining samples. The original sample gets reduced every time you make a selection.
Sampling with replacement: a selected object is put back into the original sample. You may select the same object more than once.
Stratified sampling: similar to binning, where the data set is partitioned and samples are selected from each partition. Good for skewed data.
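A small NumPy sketch of these sampling schemes (an illustration, not from the slides); the index array and the group labels used for stratification are assumptions of the example:

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)                               # 100 sample indices
strata = np.repeat(["A", "B", "C", "D"], 25)        # assumed group labels for stratification

without_repl = rng.choice(data, size=10, replace=False)   # simple random, without replacement
with_repl    = rng.choice(data, size=10, replace=True)    # with replacement (duplicates possible)

# Stratified sampling: draw the same fraction from every stratum
frac = 0.1
stratified = np.concatenate([
    rng.choice(data[strata == g], size=int(frac * np.sum(strata == g)), replace=False)
    for g in np.unique(strata)
])

print(without_repl)
print(with_repl)
print(stratified)          # contains ~10% of each of the four groups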
Part III
Measuring Data Similarity

Outline of Measuring Data Similarity
Definitions and Matrix Measures
Data Similarity For Binary and Nominal Types
Contingency Tables
Nominal Types
For Interval, Ratio (and Ordinal) Types
For Ordinal Types
Cosine Similarity

Measuring Data Similarity and Dissimilarity
Similarity: a measure of how close two data objects are to each other. The more similar the objects, the higher the value.
Dissimilarity (e.g. distance): a measure of how different two data objects are. The more dissimilar the objects, the larger the value of the measure. If the two objects are identical then the value is 0.
Proximity is used to express either similarity or dissimilarity.
Why is this important?
[From [2] Chp 2.4]

Data Matrix and Dissimilarity Matrix
Data (sample) matrix: n data sample points with d dimensions (features, attributes, ...)
  [ x_11 ... x_1r ... x_1d ]
  [  ...       ...     ... ]
  [ x_i1 ... x_ir ... x_id ]
  [  ...       ...     ... ]
  [ x_n1 ... x_nr ... x_nd ]
Distance matrix: n x n, where each entry is the distance between a pair of points; a triangular matrix with zero diagonal
  [   0                          ]
  [ d(2,1)    0                  ]
  [ d(3,1)  d(3,2)    0          ]
  [   :        :          ...    ]
  [ d(n,1)  d(n,2)   ...     0   ]
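To make the two matrices concrete, here is a small NumPy sketch (added for illustration, with a made-up 4 x 2 data matrix) that builds the Euclidean distance matrix; only the lower triangle is needed since the matrix is symmetric with a zero diagonal:

import numpy as np

X = np.array([[1.0, 2.0],        # n = 4 samples, d = 2 features
              [2.0, 0.0],
              [4.0, 4.0],
              [0.0, 1.0]])

# Pairwise Euclidean distances via broadcasting: D[i, j] = ||x_i - x_j||
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(np.tril(D), 2))   # lower-triangular distance matrix, zeros on the diagonal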
Data Similarity For Binary and Nominal Types
Measures for Binary Types
Given two binary vectors x and y of d dimensions each.
Dot Product: xᵀy measures the number of attributes with value 1 shared between the two vectors.
  It can be normalized by the geometric mean of the number of attributes with value 1 in each vector: xᵀy / √((xᵀx)(yᵀy))
Hamming Distance: measures the number of attributes that have different values in the two vectors (1 in one and 0 in the other, and vice versa). Equivalent to Σ_{i=1}^d |x_i − y_i|.
[From [2]] [From [4]]

Tanimoto Measure: measures the common attributes relative to the non-common elements:
  t(x, y) = xᵀy / (xᵀx + yᵀy − xᵀy)

Contingency Tables
Contingency tables can be used to define a number of other proximity measures.
Contingency table:
             y = 1   y = 0
    x = 1     m11     m10
    x = 0     m01     m00
A number of measures can be defined in terms of these 4 terms.

Distance Coefficients
Simple Matching Coefficient (SMC): (m11 + m00) / (m11 + m00 + m01 + m10)
  Similarity with equal weight given to 0 and 1.
Jaccard Coefficient (JC): m11 / (m11 + m01 + m10)
  Similarity ignoring 0-0 matches.
SMC and JC are always at most 1; a value of 1 means the two vectors are identical.
(1 − SMC) yields a distance; (1 − JC) yields the Jaccard distance.

Measures of Nominal Types
Converting nominal to binary: define a binary variable for each nominal value of the variable; the binary variable is 1 if the variable takes the corresponding value and 0 otherwise. Then we can use the binary measures.
Can use the number of matching attribute values relative to the number of attributes as a simple similarity measure, and 1 − similarity as a distance.
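A small Python sketch (illustration only, with made-up vectors) computing the contingency counts and the measures above for two binary vectors:

import numpy as np

x = np.array([1, 0, 1, 1, 0, 0, 1])
y = np.array([1, 1, 0, 1, 0, 0, 0])

m11 = np.sum((x == 1) & (y == 1))
m10 = np.sum((x == 1) & (y == 0))
m01 = np.sum((x == 0) & (y == 1))
m00 = np.sum((x == 0) & (y == 0))

hamming  = np.sum(np.abs(x - y))                      # = m10 + m01
smc      = (m11 + m00) / (m11 + m00 + m01 + m10)
jaccard  = m11 / (m11 + m01 + m10)
tanimoto = (x @ y) / (x @ x + y @ y - x @ y)          # equals Jaccard for binary vectors

print(m11, m10, m01, m00)                 # -> 2 2 1 2
print(hamming, smc, jaccard, tanimoto)    # -> 3, 0.571..., 0.4, 0.4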
For Interval, Ratio (and Ordinal) Types
Definition: Distance Metric
For all vectors x, y and z, the function d is a metric iff:
  d(x, x) = 0, and d(x, y) = 0 iff x = y
  d(x, y) ≥ 0                     (non-negativity)
  d(x, y) = d(y, x)               (symmetry)
  d(x, y) ≤ d(x, z) + d(z, y)     (triangle inequality)
e.g. the Minkowski metric:
  d_k(x, y) = [ Σ_{i=1}^n |x_i − y_i|^k ]^(1/k)

Minkowski Metric
For k = 2, we get the l2 norm, or Euclidean distance.
For k = 1, we get the l1 norm, also called the absolute norm, city-block norm or Manhattan distance.
For k → ∞, we get the supremum or Chebyshev distance: d_∞(x, y) = max_i |x_i − y_i|.
For k → −∞, we get the min distance: min_i |x_i − y_i|.

Measures for Ordinal Types
Ordinal values are ranked values. They can be mapped to the interval type:
  replace x_if by its rank r_if ∈ {1, ..., M_f}
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if − 1) / (M_f − 1)
One can then use the methods for interval-scaled variables.
[From [2]]

Cosine Similarity
The similarity between two vectors x and y is measured as the cosine of the angle between them:
  cos(x, y) = xᵀy / (||x|| ||y||)
i.e. the dot product of the two vectors normalized by the product of their lengths.
Values range from −1 (opposite) to 1 (same direction, differing only in length).
It is a useful similarity measure widely used in information retrieval and data mining, especially for sparse vectors.
Note that the measure focuses on the shared non-zero attribute values and ignores any 0-* matches between the two vectors.
If the vectors are binary it reduces to the number of attributes (with value 1) shared by the two vectors, normalized by the geometric mean of their sizes.
[Figure: vectors x, y, z with angles φ1 < φ2; x is closer to y than to z using the cosine similarity measure]

Example
Let x = (0, 0, 1, 1, 3, 0, 1), y = (0, 4, 3, 1, 1, 0, 0), z = (1, 3, 1, 0, 0, 1, 0)
  ||x|| = √(1 + 1 + 9 + 1) = 3.5        xᵀy = 3 + 1 + 3 = 7
  ||y|| = √(16 + 9 + 1 + 1) = 5.2       xᵀz = 1
  ||z|| = √(1 + 9 + 1 + 1) = 3.5
  cos(x, y) = 7 / 18.2 = 0.38           ||x − y|| = 5
  cos(x, z) = 1 / 12.25 = 0.08          ||x − z|| = 4.6

Part IV
Basic Data Analysis: Parameter Estimation

Outline of Basic Data Analysis: Parameter Estimation
Descriptive vs. Inferential Analysis
Unbiased Estimators
Mean Squared Error
Reducing Bias
Resampling
Cross Validation
Maximum Likelihood Estimation
Expectation Maximization (EM)
Descriptive versus Inferential Analysis
We have data (samples). This data is a sample of a population (more than just the measured or observed sample).
Descriptive Analysis is the type of analysis and measures that seeks to describe and summarize the data, i.e. the available samples. We cannot in general use it to interpret unobserved data.
Inferential Analysis (predictive) is the type of analysis that can describe measures over the population of data, that is, both observed and unobserved.

Example: the mean of a sample
Take for example calculating the mean of a sample. The mean is correct for just the sample, as it is descriptive of the sample. For inferential analysis it can only be an estimate of the mean of a population that has some variation from the sample.

Point Estimation (Parameter Estimation)
We assume there is a true set of parameters defining a function θ(X_1, X_2, ..., X_n) that defines our data.
We don't know θ, so we estimate it and call the estimate θ̂.
This is usually done by calculating the parameter value for the population sample.
For example, we can estimate the mean and variance of a population; then θ = {X̄, σ_X}.

Bias vs. Variance Tradeoff
[Figure from http://scott.fortmann-roe.com/docs/BiasVariance.html]

Unbiased Estimators
If the expected value of the estimate (over repeated samples of the population) equals the actual value of the parameter, then the estimator is called unbiased; otherwise the bias is the difference. An unbiased estimator has a bias of 0.
  Bias = B(θ̂) = E(θ̂) − θ
This can be seen as measuring the quality of the estimate.
For a discrete random variable, the expected value is defined as
  E(x) = Σ_{i=1}^n x_i p(x_i)
For equal probabilities, p(x_i) = 1/n, the expectation equals the mean:
  E(x) = (1/n) Σ_{i=1}^n x_i
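A quick numerical illustration of bias (not from the slides): the sample variance computed with a 1/n factor is a biased estimator of the population variance, while the 1/(n−1) version is unbiased; averaging the estimate over many simulated samples approximates E(θ̂):

import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0                     # population variance (sigma = 2)
n, trials = 10, 200_000

samples = rng.normal(0.0, 2.0, size=(trials, n))
var_biased   = samples.var(axis=1, ddof=0)   # 1/n      version
var_unbiased = samples.var(axis=1, ddof=1)   # 1/(n-1)  version

print(var_biased.mean())     # ~3.6  -> bias of about -0.4 = -true_var/n
print(var_unbiased.mean())   # ~4.0  -> unbiased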
Mean Squared Error
We talked about quality measures for data last week; the simplest was variance:
  V(θ̂) = E(θ̂ − E(θ̂))²
One measure of the effectiveness of an estimator is a combination of bias and variance, the Mean Squared Error (MSE):
  MSE(θ̂) = E(θ̂ − θ)² = B²(θ̂) + V(θ̂)
Alternative: Root Mean Squared Error, RMSE = √MSE, gives a better sense of the magnitude of the error.

MSE and Bias
If the estimator θ̂ is unbiased then MSE = Variance.

Leave-one-out Resampling
If not, then you want different samples which are not biased. What if you can't get new data?
Solution: resampling; a simple approach is the Jackknife Estimate.

Jackknife Estimate
The estimate is obtained by omitting one sample (i) from the sample set:
  θ̂_(i) = µ̂_(i) = ( Σ_{j≠i} x_j ) / (n − 1)
Then take the mean of these as the overall parameter estimate:
  θ̂ = ( Σ_{j=1}^n θ̂_(j) ) / n

K-Cross Validation
More generally, we can leave out multiple samples:
  1. Split the data into K equal-sized parts.
  2. For ℓ = 1, 2, ..., K: leave part ℓ out and build the estimator on the other K − 1 parts.
  3. Combine the results into K statistics.
If K = n then this is just leave-one-out validation.

Sampling with Replacement
Select n' from the samples at random with replacement (i.e. the data always maintains n samples).
Create B samples this way, then use the B samples to build your estimator.
The B samples will contain duplicates. B can be large (e.g. 1,000 or 10,000).
One can also select n' < n (from your original A samples):
  use B for training,
  use A − B for testing.
[From [4]]
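A compact NumPy sketch of the jackknife and bootstrap (sampling with replacement) ideas above; this is an illustration only and the data values are made up:

import numpy as np

rng = np.random.default_rng(2)
x = np.array([3.0, 7.0, 8.0, 12.0, 15.0])
n = len(x)

# Jackknife: mean of the n leave-one-out estimates
leave_one_out = np.array([np.delete(x, i).mean() for i in range(n)])
jackknife_mean = leave_one_out.mean()

# Bootstrap: B resamples of size n drawn with replacement
B = 1000
boot_means = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])

print(jackknife_mean)                        # equals x.mean() = 9.0 for the mean estimator
print(boot_means.mean(), boot_means.std())   # bootstrap estimate and its spread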
Maximum Likelihood Estimation
Maximum Likelihood Estimate (MLE)
Assumes that the samples are from a specific distribution with some unknown parameter(s).
Likelihood is the probability that the samples observed come from the given distribution:
  L(θ|X) = p(X|θ)
We can estimate the parameter θ by maximizing the likelihood function.

Given a known distribution f(x_i, θ) and the sample x = {x1, ..., xn}, the likelihood can be formulated as
  L(θ|x1, ..., xn) = Π_{i=1}^n f(x_i|θ)

MLE: the value of the parameter θ that maximizes the likelihood L is the estimate.
This can be obtained by finding the derivative of the log-likelihood with respect to the parameter θ and equating the resulting formula to zero.
Note: if there are multiple parameters then you can maximize each one separately or attempt to find a multidimensional maximum.
Since maximizing L is the same as maximizing log L, we can use
  ℓ = log L = log Π_{i=1}^n f(x_i|θ) = Σ_{i=1}^n log f(x_i|θ)

Example for MLE
Suppose a coin is tossed in the air 5 times, giving (H,H,H,H,T). Assuming each toss follows a Bernoulli distribution,
  f(x_i|P) = P^(x_i) (1 − P)^(1 − x_i)
  P: probability of H;   x_i: observation (sample), with H = 1 and T = 0
[From [1] page 48]

  L(P|x1, ..., x5) = Π_{i=1}^5 P^(x_i) (1 − P)^(1 − x_i) = P^(Σ x_i) (1 − P)^(5 − Σ x_i)
Taking the log of both sides we get
  ℓ(P) = log L(P) = (Σ_{i=1}^5 x_i) log P + (5 − Σ_{i=1}^5 x_i) log(1 − P)

To maximize the likelihood, take the derivative and equate it to zero:
  ∂ℓ(P)/∂P = (Σ x_i)/P − (5 − Σ x_i)/(1 − P) = 0
  (Σ x_i)/P = (5 − Σ x_i)/(1 − P)
  P = (Σ x_i)/5
If we take H = 1, T = 0 then P̂ = 4/5 = 0.8.
This P̂ is the estimate for P that maximizes the likelihood that the given sequence will be observed.

General MLE Formulas
Bernoulli distribution:  P̂ = (Σ x_i)/n
Normal distribution:     µ̂ = (Σ x_i)/n,   σ̂ = √( Σ (x_i − µ̂)² / n )
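A short numerical check of the coin example (illustration only, not from the slides): the closed-form MLE P̂ = 0.8 also maximizes the Bernoulli log-likelihood over a grid of candidate values.

import numpy as np

tosses = np.array([1, 1, 1, 1, 0])          # H,H,H,H,T with H = 1, T = 0

p_hat = tosses.mean()                       # closed-form MLE: 4/5 = 0.8

# Grid check: evaluate the Bernoulli log-likelihood l(P) on a grid of P values
grid = np.linspace(0.01, 0.99, 99)
loglik = tosses.sum() * np.log(grid) + (len(tosses) - tosses.sum()) * np.log(1 - grid)

print(p_hat, grid[np.argmax(loglik)])       # both are approximately 0.8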
Expectation Maximization (EM)
The Problem of Missing Data
This analytical MLE method requires that we can solve the equations exactly. What if we have missing data?

Expectation Maximization (EM)
Simple and general approach for parameter estimation with incomplete data.
Obtain initial estimates for the parameters; then iteratively use estimates for the missing data and continue until convergence.
Pros: easy to implement (other solutions are more powerful but often require gradients).
Cons: slow to converge, might find local maxima, sensitive to the initial guess, bad in high dimensions?

The EM Algorithm
Given initial estimates and training data, repeat these two steps:
  1. Estimation - calculate a value for the missing data.
  2. Maximization - use the data and the new values of the missing data to find new estimates using MLE.
Repeat steps 1 and 2 until convergence, i.e. until the change in values is less than some small ε.

Example for EM
Input: a partial data set of size k = 4, X = {1, 5, 10, 4}, with 2 data items missing, so the full data size is n = 6.
Problem: assume the data has a normal distribution; find the mean µ of the data.
For a Gaussian distribution the MLE is µ̂ = (Σ_{i=1}^n x_i)/n.
Give an initial guess µ̂0 = 3 and use it to fill in the missing values.
[From [1] p. 50]

  µ̂1 = (Σ_{i=1}^k x_i)/n + (Σ_{i=k+1}^n µ̂0)/n = (1 + 5 + 10 + 4)/6 + (3 + 3)/6 = 4.33
Use µ̂1 as the new estimate:
  µ̂2 = (1 + 5 + 10 + 4)/6 + (4.33 + 4.33)/6 = 4.77
  µ̂3 = 4.92,   |µ̂2 − µ̂3| = 0.15
  µ̂4 = 4.97,   |µ̂3 − µ̂4| = 0.05
Let ε = 0.05; then convergence occurs at µ̂4.
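The iteration in this example is easy to reproduce; a minimal Python sketch (illustration only):

# Minimal sketch of the EM iteration from the example above
observed = [1, 5, 10, 4]        # k = 4 observed values
n_missing = 2                   # two values are missing, so n = 6
n = len(observed) + n_missing

mu = 3.0                        # initial guess mu_0
eps = 0.05
while True:
    # E-step: fill each missing value with the current estimate mu
    # M-step: MLE of the mean on the completed data
    new_mu = (sum(observed) + n_missing * mu) / n
    if abs(new_mu - mu) < eps:  # converged
        break
    mu = new_mu

print(round(new_mu, 2))         # 4.98 here (the slides round intermediate values and get 4.97)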
Summary
Examination of Data Quality: accuracy, completeness, consistency, interpretability
Data Preprocessing:
  Data Cleaning: missing values, outliers, noise
  Transformation: smoothing with bins and windows, normalization and scaling
Measuring Data Similarity
Parameter Estimation:
  bias vs. variance, unbiased estimators
  sampling, cross validation
  MLE, EM

Closing Data Example

References
[Dunham, Data Mining Intro and Advanced Topics, 2003] Margaret Dunham, Data Mining Introductory and Advanced Topics, ISBN 0130888923, Prentice Hall, 2003.
[Han, Kamber and Pei. Data Mining, 2011] Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann Publishers, May 2011.
[Duda, Pattern Classification, 2001] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification (2nd ed.), John Wiley and Sons, 2001.
[Jain and Dubes. Algs for Clustering Data, 1988] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, ISBN 0-13-022278-X, Prentice Hall, 1988.
[Cohen, Empirical Methods for Artificial Intelligence, 1995] P. Cohen, Empirical Methods for Artificial Intelligence, ISBN 0-262-03225-2, MIT Press, 1995.
[Ackoff, From Data to Wisdom, 1989] Ackoff, From Data to Wisdom, Journal of Applied Systems Analysis, 1989.
[Sima and Dougherty, 2008] C. Sima and E. R. Dougherty, The Peaking Phenomenon in the Presence of Feature Selection, Pattern Recognition Letters, 29, 1667-1674, 2008.