Data Cleansing — 3
Faculty of Information Technology
Monash University, Australia
FIT5196 week 8
(Monash) FIT5196 1 / 29
Outline
1 Recap
2 Outlier
Types of outliers
Univariate Outlier Detection
Multivariate Outlier Detection
3 Summary
Recap
Missing Data Mechanisms
Describe relationships between measured variables and the probability of
missing data
Deciding upon the method for analysing missing values requires
understanding about both the reasons for the missing values and the nature
of the data for the missing observations.
Three different missingness mechanisms:
• Missing at random (MAR)
• Missing completely at random (MCAR)
• Missing not at random (MNAR)
MAR, MCAR vs. MNAR?
Example adapted from “Applied Missing Data Analysis” by Craig K. Enders.
Missing data pattern
A missing data pattern refers to the configuration of observed and missing
values in a data set.
The univariate pattern has missing values isolated to a single variable.
A monotone missing data pattern is typically associated with a
longitudinal study where participants drop out and never return.
A general pattern has missing values dispersed throughout the data matrix
in a haphazard fashion.
Methods for handling missing values
Deletion methods
• Listwise deletion
• Pairwise deletion
Imputation methods
• Mean imputation
• Regression imputation
• Stochastic regression imputation
• Hot-deck imputation
• Last observation carried forward
Outlier
Outliers: the definition
What is an outlier?
• Definition of Hawkins: “An outlier is an observation which deviates so much
from the other observations as to arouse suspicions that it was generated by
a different mechanism” (Hawkins, D. 1980. Identification of Outliers. Chapman and Hall.)
• Definition of Pearson: “An outlier is a data point that appears to be
inconsistent with the nominal behavior exhibited by most of the other data
points in a specified collection.”
Figure is from Chapter 2 of “Mining Imperfect Data”. Outliers detected with the Hampel identifier are catalyst samples 4, 13, and 21, marked
with solid circles. The median value and the upper and lower Hampel identifier detection limits are shown as dashed lines.
Outliers
An outlier often contains useful information about abnormal characteristics
of the systems and entities that impact the data generation process.
• Intrusion detection systems: unusual behaviour shown in the operating
system calls, network traffic, or other user action.
• Credit-card fraud: Unauthorized use of a credit card may show different
patterns, such as buying sprees from particular locations or very large
transactions.
• Medical Analysis: Unusual patterns in MRI, PET and ECT data typically
reflect disease conditions.
• Law enforcement: Determining fraud in financial transactions, trading
activity, or insurance claims typically requires the identification of unusual
patterns in the data generated by the actions of the criminal entity.
Outliers: the impact
Outliers can increase the error variance and reduce the power of statistical
tests.
If the outliers are non-randomly distributed, they can decrease normality.
Outliers can bias or influence estimates that may be of substantive interest.
Outliers can also violate the basic assumptions of regression, ANOVA, and
other statistical models.
Example:
          Without outlier            With outlier
data      8,7,9,9,6,5,8,9,8,8,9      8,7,9,9,6,5,8,9,8,8,9,100
mean      7.8                        15.5
median    8                          8
mode      8 and 9 (tie)              8 and 9 (tie)
sd        1.328                      26.641
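The figures in this example can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the helper name `summarise` is ours, and `stdev` is the sample standard deviation (divisor n − 1), which is what matches the values on the slide.

```python
import statistics

clean = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9]
contaminated = clean + [100]          # same data plus one extreme value


def summarise(values):
    """Return the location and spread statistics used in the example."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),   # all most-frequent values
        "sd": statistics.stdev(values),          # sample standard deviation
    }


summary_clean = summarise(clean)
summary_contaminated = summarise(contaminated)

for label, summary in [("without outlier", summary_clean),
                       ("with outlier", summary_contaminated)]:
    print(label, summary)
```

The mean and standard deviation change substantially when the single value 100 is added, while the median and mode do not move.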
Outlier Types of outliers
Types of outliers
Univariate outlier: concerns the distribution of a single variable
Multivariate outlier: concerns outliers in an n-dimensional space.
Figure is from “A Comprehensive Guide to Data Exploration”
Univariate Outliers
Based on the notion that “most” of the data should exhibit approximately
the same value $c$, the observed sequence of data $\{x_k\}$ can be modelled as
$x_k = c + e_k$
where $\{e_k\}$ is a sequence of deviations about the nominal value $c$.
Figure is from Chapter 2 of “Mining Imperfect Data”
Distinguish between lower outliers and upper outliers
Multivariate outliers
Multivariate outlier: concerns outliers in an n-dimensional space.
• A multivariate outlier in a sequence $\{x_k\}$ of vectors corresponds to a vector
$x_j$ whose individual components are significantly discordant with the
intercomponent relations exhibited by the majority of the other data values.
Figure is from Chapter 2 of “Mining Imperfect Data”
• The intercomponent relation: $y \simeq \sqrt{1 - x^2}$
• $0 \le x, y \le 1$
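To make the idea concrete, the sketch below builds hypothetical points that follow the relation $y = \sqrt{1 - x^2}$ and adds one point, (0.5, 0.5), whose coordinates are each unremarkable on their own but which breaks the relation. The data and the helper `relation_residual` are assumptions made for illustration, not taken from the source figure.

```python
import math

# Hypothetical points lying on the quarter circle y = sqrt(1 - x^2).
xs = [0.05 * i for i in range(1, 20)]              # 0.05, 0.10, ..., 0.95
points = [(x, math.sqrt(1 - x * x)) for x in xs]

# One extra point: each coordinate is well inside the observed range,
# but the pair does not satisfy the intercomponent relation.
suspect_point = (0.5, 0.5)
points.append(suspect_point)


def relation_residual(point):
    """Absolute deviation of a point from y = sqrt(1 - x^2)."""
    x, y = point
    return abs(y - math.sqrt(1 - x * x))


residuals = [relation_residual(p) for p in points]
worst = max(range(len(points)), key=residuals.__getitem__)
print(points[worst], residuals[worst])
```

A univariate check on x alone or on y alone would not single out this point; only the joint view does.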
Outlier Univariate Outlier Detection
How to detect Univariate Outliers
Problem formulation: given a sequence of observed data $\{x_k\}$, a reference
value $x_0$, and a measure of variation $\zeta$ computed from $\{x_k\}$, detect outliers
according to
$|x_k - x_0| > t\zeta$
where $t$ is a threshold parameter.
Questions
• How do we define the nominal data reference value $x_0$?
• How do we define the scale of natural variation $\zeta$?
• How do we choose the threshold parameter $t$?
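The formulation can be written as one small function in which the reference value, the scale, and the threshold are all pluggable. This is only a sketch: the function name `detect_outliers` and the `readings` data are hypothetical, and the mean and sample standard deviation are used simply as one possible choice.

```python
import statistics


def detect_outliers(data, reference, scale, t):
    """Flag x_k with |x_k - x0| > t * zeta.

    reference: function computing the nominal value x0 from the data
    scale:     function computing the measure of variation zeta
    t:         threshold parameter
    """
    x0 = reference(data)
    zeta = scale(data)
    return [x for x in data if abs(x - x0) > t * zeta]


readings = [10, 11, 9, 10, 12, 10, 11, 50]     # hypothetical data

flag_t2 = detect_outliers(readings, statistics.mean, statistics.stdev, t=2)
flag_t3 = detect_outliers(readings, statistics.mean, statistics.stdev, t=3)
print(flag_t2, flag_t3)
```

On this data the value 50 is flagged with t = 2 but not with t = 3, which shows that all three choices (reference, scale, threshold) affect the result.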
Three Outlier Detection Methods
Choices for the nominal reference value $x_0$
• mean: $\bar{x}$
• median: $x^\dagger$
Choices for the measure of variation $\zeta$
• The standard deviation: $\sigma$
• The median absolute deviation (MAD) scale estimator $S$:
$S = 1.4826 \times \mathrm{median}\{|x_k - x^\dagger|\}$
• The interquartile range (IQR):
$\mathrm{IQR} = Q_3 - Q_1$
Combine the choices
• The 3σ edit rule: $x_0 = \bar{x}$, $\zeta = \sigma$
• The Hampel identifier: $x_0 = x^\dagger$, $\zeta = S$
• The standard boxplot outlier rule: $x_0 = x^\dagger$, $\zeta = \mathrm{IQR}$
The 3σ edit rule
Basic idea: if a data sequence $\{x_k\}$ is well approximated by an i.i.d.
sequence of Gaussian random variables with mean $\mu$ and standard deviation
$\sigma$, the probability of observing a value $x_k$ farther than three standard
deviations from the mean is only about 0.3%.
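The 0.3% figure can be checked numerically. For a standard normal variable Z, P(|Z| > t) = erfc(t / √2), which the sketch below evaluates with the standard library; the function name `two_sided_tail` is ours.

```python
import math


def two_sided_tail(t):
    """P(|Z| > t) for a standard normal random variable Z."""
    return math.erfc(t / math.sqrt(2))


p3 = two_sided_tail(3.0)
print(f"P(|Z| > 3) = {p3:.4%}")
```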
$x_k$ is an outlier if
$|x_k - \bar{x}| > 3\sigma$
Also known as the extreme studentized deviation (ESD) identifier (Davies
and Gather, 1993)
Problems?
• The presence of outliers in the dataset can cause substantial errors in
estimating
− the mean
− the standard deviation
          Without outlier            With outlier
data      8,7,9,9,6,5,8,9,8,8,9      8,7,9,9,6,5,8,9,8,8,9,100
mean      7.8                        15.5
avedev    0.99                       14.08
sd        1.328                      26.641
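The sketch below applies the 3σ edit rule to the example data and then to a variant with a second value of 100 appended. The second dataset is our own assumption, added only to show how inflated estimates of the mean and standard deviation can hide the very outliers that caused the inflation.

```python
import statistics


def three_sigma_outliers(values):
    """Return the values farther than 3 sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) > 3 * sd]


one_outlier = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9, 100]
two_outliers = one_outlier + [100]      # hypothetical second extreme value

flagged_one = three_sigma_outliers(one_outlier)
flagged_two = three_sigma_outliers(two_outliers)
print(flagged_one, flagged_two)
```

With one extreme value the rule still flags it, but with two the standard deviation grows enough that neither is flagged.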
The Hampel Identifier
Basic idea:
• $x_0 = x^\dagger$
• $\zeta = S = 1.4826 \times \mathrm{median}\{|x_k - x^\dagger|\}$
• $x_k$ is an outlier if
$|x_k - x^\dagger| > 3S$
Why use the median and MAD?
• They are less sensitive to outliers than the mean and standard deviation
          Without outlier            With outlier
data      8,7,9,9,6,5,8,9,8,8,9      8,7,9,9,6,5,8,9,8,8,9,100
median    8                          8
MAD       1                          1
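A minimal sketch of the Hampel identifier on the example data follows. The function name `hampel_outliers` is ours; the constant 1.4826 and the threshold of 3 follow the definition given here.

```python
import statistics


def hampel_outliers(values, t=3.0):
    """Flag x_k with |x_k - median| > t * S, where S = 1.4826 * MAD."""
    median = statistics.median(values)
    mad = statistics.median([abs(x - median) for x in values])
    scale = 1.4826 * mad
    return [x for x in values if abs(x - median) > t * scale]


clean = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9]
contaminated = clean + [100]

flagged_clean = hampel_outliers(clean)
flagged_contaminated = hampel_outliers(contaminated)
print(flagged_clean, flagged_contaminated)
```

Because the median and MAD are the same for both datasets, the detection limits do not move when the extreme value is added.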
Drawbacks:
• The MAD scale estimate is identically zero if more than 50% of the data
observations $x_k$ have the same value, in which case every observation that
differs from the median is flagged as an outlier.
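The drawback is easy to reproduce. In the hypothetical data below, five of the seven values are identical, so the MAD is zero and the detection threshold collapses to zero.

```python
import statistics

values = [5, 5, 5, 5, 5, 6, 7]          # hypothetical: more than 50% identical

median = statistics.median(values)
mad = statistics.median([abs(x - median) for x in values])
scale = 1.4826 * mad                     # S is zero because the MAD is zero

# With S = 0 the rule |x - median| > 3 * S flags anything not equal to the median.
flagged = [x for x in values if abs(x - median) > 3 * scale]
print(scale, flagged)
```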
Quartile-based Detection and Boxplots
Q0: the minimum
Q1: greater than 25% of the data points
Q2: the median
Q3: greater than 75% of the data points
Q4: the maximum
For a symmetric distribution,
$\mathrm{IQR} = Q_3 - Q_1$
$x^\dagger = \frac{Q_3 + Q_1}{2}$
$Q_3 = x^\dagger + \mathrm{IQR}/2$
$Q_1 = x^\dagger - \mathrm{IQR}/2$
This observation suggests
• $x_0 = x^\dagger$
• $\zeta = \mathrm{IQR}$
Symmetric boxplot rule
$|x_k - x^\dagger| > t \times \mathrm{IQR}$
Asymmetric boxplot rule
$x_k > Q_3 + t \times \mathrm{IQR} \Rightarrow x_k$ is an upper outlier
$x_k < Q_1 - t \times \mathrm{IQR} \Rightarrow x_k$ is a lower outlier
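A sketch of the asymmetric boxplot rule follows. Two things are assumptions rather than part of the text above: the threshold t = 1.5, which is the common boxplot convention, and the quartile definition (`method="inclusive"` in `statistics.quantiles`), since several quartile conventions exist and they can give slightly different fences.

```python
import statistics


def boxplot_outliers(values, t=1.5):
    """Return (lower_outliers, upper_outliers) under the asymmetric boxplot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    upper = [x for x in values if x > q3 + t * iqr]
    lower = [x for x in values if x < q1 - t * iqr]
    return lower, upper


data = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9, 100]
lower, upper = boxplot_outliers(data)
print("lower outliers:", lower)
print("upper outliers:", upper)
```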
Outlier Multivariate Outlier Detection
Multivariate Outlier Detection
Linear models
• Residuals, i.e., the distances of the data points from the fitted hyperplane, are
used to quantify the outlier scores.
Proximity-based models
• Outliers are defined as those points that do not lie in the dense regions.
− Clustering methods: segment the data points.
− Density-based methods: segment the data space.
Linear models
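Linear regression model
$y_j = \sum_{i=1}^{d} w_i x_{j,i} + w_{d+1} + \epsilon_j$
• Learning objective: minimise the error between the true and the
predicted values of $y$
$\sum_j \epsilon_j^2 = \sum_j \left( \left( \sum_{i=1}^{d} w_i x_{j,i} + w_{d+1} \right) - y_j \right)^2$  (1)
$= \| D w - y \|^2$  (2)
where $D$ is the $N \times (d + 1)$ data matrix, $w$ is the vector of coefficients, and $y$ is the vector of $N$
true response values.
• Closed-form solution
$w = (D^T D + \alpha I)^{-1} D^T y$
where $\alpha \ge 0$ is a regularisation parameter ($\alpha = 0$ gives ordinary least squares).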
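For a single feature (d = 1) the closed-form solution reduces to a 2 × 2 linear system, which can be solved by hand. The sketch below does this in plain Python, assuming D has rows (x_j, 1) so that the second coefficient is the intercept. The function name `fit_line` and the data are hypothetical, and α is applied to both coefficients, exactly as the formula is written.

```python
def fit_line(xs, ys, alpha=0.0):
    """Solve (D^T D + alpha I) w = D^T y for D with rows (x_j, 1)."""
    n = len(xs)
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)

    # D^T D + alpha I = [[a, b], [c, d]] and D^T y = (sxy, sy).
    a, b, c, d = sxx + alpha, sx, sx, n + alpha
    det = a * d - b * c

    # Apply the inverse of the 2x2 matrix to D^T y.
    slope = (d * sxy - b * sy) / det
    intercept = (a * sy - c * sxy) / det
    return slope, intercept


xs = [-2, -1, 0, 1, 2]
ys = [2 * x + 0.5 for x in xs]           # noise-free samples of y = 2x + 0.5

w_ols = fit_line(xs, ys)                  # alpha = 0: ordinary least squares
w_ridge = fit_line(xs, ys, alpha=1.0)     # alpha > 0 shrinks the coefficients
print(w_ols, w_ridge)
```

With noise-free data and α = 0 the fit recovers the generating line; a positive α pulls the slope towards zero.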
Regression with and without outliers
Figure: y = 2x + 0.5 + ε. Two scatter-plot panels, “Without Outliers” and “With Outliers”; x-axis syn_data$x, y-axis syn_data$y.
Outliers are, after all, values that deviate from expected (or predicted)
values on the basis of a particular model.
Goal: find lower-dimensional subspaces, in which the outlier points behave
very differently from other points
• The residual $\epsilon_j$ provides useful information about the outlier score of the data
point $j$.
Figure: scatter plot of syn_data$y against syn_data$x (panel “Without Outliers”).
Using boxplot
Figure: scatter plot of syn_data$y against syn_data$x (panel “With Outliers”), shown with a boxplot whose axis runs from −10 to 10.
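Residual-based scoring combined with a boxplot rule can be sketched as follows. Everything about the data is an assumption made for illustration: the points follow y = 2x + 0.5 with a small deterministic perturbation standing in for noise, two outliers are planted at known indices, and the boxplot threshold t = 1.5 is the usual convention rather than a value given in the slides.

```python
import math
import statistics

# Synthetic data: y = 2x + 0.5 plus a small deterministic perturbation.
xs = [-3 + 0.1 * i for i in range(51)]
ys = [2 * x + 0.5 + 0.3 * math.sin(7 * i) for i, x in enumerate(xs)]

planted = [5, 40]                  # indices of the planted outliers
for i in planted:
    ys[i] += 8.0

# Ordinary least squares fit of a straight line.
x_bar = statistics.fmean(xs)
y_bar = statistics.fmean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residuals act as outlier scores; the boxplot rule picks the extreme ones.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
q1, _, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
iqr = q3 - q1
t = 1.5
flagged = [i for i, r in enumerate(residuals)
           if r > q3 + t * iqr or r < q1 - t * iqr]
print(flagged)
```

Note that the fit itself is computed with the outliers present, so the line is pulled slightly towards them; the residuals of the planted points are still far larger than the rest.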
Density-Based Outliers
Figure: from "Outlier Analysis", second edition, by Charu C. Aggarwal
Distance-based method:
• An object $p$ in a dataset $D$ is a DB(pct, dmin)-outlier if at least a percentage
pct of the objects in $D$ lie at a distance greater than dmin from $p$, i.e.
$|\{q \in D \mid d(p, q) \le d_{min}\}| \le \frac{100 - pct}{100} \times |D|$.
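The definition translates almost directly into code. The sketch below assumes Euclidean distance and treats pct as a percentage between 0 and 100; the function name `db_outliers` and the data (a 3 × 3 grid plus one distant point) are hypothetical. As in the formula, the count of nearby objects includes p itself.

```python
import math


def db_outliers(points, pct, dmin):
    """Return the DB(pct, dmin)-outliers of a list of points."""
    n = len(points)
    flagged = []
    for p in points:
        # Objects within distance dmin of p (p itself is counted, as in the formula).
        close = sum(1 for q in points if math.dist(p, q) <= dmin)
        # close <= (100 - pct) / 100 * n, written with integers to avoid rounding.
        if close * 100 <= (100 - pct) * n:
            flagged.append(p)
    return flagged


grid = [(x, y) for x in range(3) for y in range(3)]   # dense cluster
data = grid + [(10, 10)]                              # one isolated point

outliers = db_outliers(data, pct=90, dmin=3)
print(outliers)
```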
Local Outlier Factor (LOF)
• k-distance of an object $p$, denoted $d_k(p)$, is defined as the distance $d(p, o)$
between $p$ and an object $o \in D$ such that:
1 for at least $k$ objects $o' \in D \setminus \{p\}$ it holds that $d(p, o') \le d(p, o)$, and
2 for at most $k - 1$ objects $o' \in D \setminus \{p\}$ it holds that $d(p, o') < d(p, o)$.
• k-distance neighborhood of an object $p$
$N_k(p) = N_{d_k(p)}(p) = \{q \in D \setminus \{p\} \mid d(p, q) \le d_k(p)\}$
• Reachability distance of an object $p$ w.r.t. object $o$
$d_{r,k}(p, o) = \max(d_k(o), d(p, o))$
where $\text{reach-dist}_k(p_1, o) = d_{r,k}(p_1, o)$ and $k\text{-distance}(o) = d_k(o)$
• Local reachability density of an object $p$
$lrd_k(p) = 1 \Big/ \left( \frac{\sum_{o \in N_k(p)} d_{r,k}(p, o)}{|N_k(p)|} \right)$
• LOF of an object $p$
$LOF_k(p) = \frac{\sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}}{|N_k(p)|}$
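The definitions of k-distance, k-distance neighborhood, reachability distance, local reachability density and LOF can be chained into a short, unoptimised implementation. This is a sketch under stated assumptions: Euclidean distance, no duplicate points (duplicates could make a reachability sum zero), and a hypothetical dataset of a 3 × 3 grid plus one distant point. The function name `lof_scores` is ours.

```python
import math


def lof_scores(points, k):
    """Return LOF_k for every point, following the definitions step by step."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []        # d_k(p) for each point
    neighbours = []    # N_k(p) as lists of indices
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        dk = others[k - 1]                       # distance to the k-th nearest object
        k_dist.append(dk)
        neighbours.append([j for j in range(n)
                           if j != i and dist[i][j] <= dk])

    def reach(i, j):
        """Reachability distance d_{r,k}(p_i, o_j) = max(d_k(o_j), d(p_i, o_j))."""
        return max(k_dist[j], dist[i][j])

    # lrd_k(p) = 1 / (mean reachability distance from p to its neighbours)
    lrd = [len(neighbours[i]) / sum(reach(i, j) for j in neighbours[i])
           for i in range(n)]

    # LOF_k(p) = mean over neighbours o of lrd_k(o) / lrd_k(p)
    return [sum(lrd[j] for j in neighbours[i]) / (lrd[i] * len(neighbours[i]))
            for i in range(n)]


points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(points[most_outlying], round(scores[most_outlying], 2))
```

Points inside the grid get scores close to 1, while the isolated point gets a score several times larger, because its local density is much lower than that of its neighbours.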
Compare different outlier detection methods
Figure: Figures from http://scikit-learn.org/
Summary
Summary
Types of outliers
Univariate outlier detection methods
• the 3σ edit rule
• the Hampel identifier
• quartile-based detection
Multivariate outlier detection methods
• Linear models
• Local Outlier Factor (LOF)