Data Cleansing — 3

Faculty of Information Technology
Monash University, Australia

FIT5196 week 8

(Monash) FIT5196 1 / 29

Outline

1 Recap

2 Outlier
Types of outliers
Univariate Outlier Detection
Multivariate Outlier Detection

3 Summary

Recap

Missing Data Mechanisms

Missing data mechanisms describe the relationships between measured
variables and the probability of missing data.

Choosing a method for analysing missing values requires understanding both
the reasons the values are missing and the nature of the data for the
missing observations.
Three different missingness mechanisms:
- Missing at random (MAR)
- Missing completely at random (MCAR)
- Missing not at random (MNAR)

MAR and MCAR vs. MNAR?

Example adapted from “Applied Missing Data Analysis” by Craig K. Enders.

Missing Data Patterns

A missing data pattern refers to the configuration of observed and missing
values in a data set.

A univariate pattern has missing values isolated to a single variable.
A monotone missing data pattern is typically associated with a
longitudinal study where participants drop out and never return.
A general pattern has missing values dispersed throughout the data matrix
in a haphazard fashion.

Methods for handling missing values

Deletion methods
- Listwise deletion
- Pairwise deletion

Imputation methods
- Mean imputation
- Regression imputation
- Stochastic regression imputation
- Hot-deck imputation
- Last observation carried forward
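
As a reminder of how three of these methods behave, here is a minimal sketch using only the Python standard library. The series `values` is a hypothetical toy example (not from the unit's data), with `None` standing in for a missing value; real work would more likely use a data-frame library.

```python
from statistics import mean

# Hypothetical toy series; None marks a missing value.
values = [4.0, None, 6.0, 5.0, None, 9.0]

# Listwise deletion: drop every record that has a missing value.
observed = [v for v in values if v is not None]

# Mean imputation: replace each missing value with the mean of the observed ones.
fill = mean(observed)
mean_imputed = [fill if v is None else v for v in values]

# Last observation carried forward (LOCF): reuse the most recent observed value.
locf = []
last = None
for v in values:
    if v is not None:
        last = v
    locf.append(last)  # stays None if the series starts with a gap

print(observed)      # [4.0, 6.0, 5.0, 9.0]
print(mean_imputed)  # [4.0, 6.0, 6.0, 5.0, 6.0, 9.0]
print(locf)          # [4.0, 4.0, 6.0, 5.0, 5.0, 9.0]
```

Mean imputation leaves the mean unchanged but shrinks the variance, which is one reason the stochastic variants in the list exist.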

Outlier

Outliers: the definition

What is an outlier?
- Definition of Hawkins: “An outlier is an observation which deviates so much
  from the other observations as to arouse suspicions that it was generated by
  a different mechanism” (Hawkins, D. 1980. Identification of Outliers. Chapman and Hall.)

- Definition of Pearson: “An outlier is a data point that appears to be
  inconsistent with the nominal behavior exhibited by most of the other data
  points in a specified collection.”

Figure is from Chapter 2 of “Mining Imperfect Data”. Outliers detected with the
Hampel identifier are catalyst samples 4, 13, and 21, marked with solid circles.
The median value and the upper and lower Hampel identifier detection limits are
shown as dashed lines.

Outliers

An outlier often contains useful information about abnormal characteristics
of the systems and entities that impact the data generation process.
- Intrusion detection systems: unusual behaviour shown in operating
  system calls, network traffic, or other user actions.

- Credit-card fraud: unauthorized use of a credit card may show different
  patterns, such as buying sprees from particular locations or very large
  transactions.

- Medical analysis: unusual patterns in MRI, PET and ECT data typically
  reflect disease conditions.

- Law enforcement: determining fraud in financial transactions, trading
  activity, or insurance claims typically requires the identification of unusual
  patterns in the data generated by the actions of the criminal entity.

Outliers: the impact

Outliers can increase the error variance and reduce the power of statistical
tests.

If the outliers are non-randomly distributed, they can decrease normality.

Outliers can bias or influence estimates that may be of substantive interest.

Outliers can also violate the basic assumptions of regression, ANOVA and
other statistical models.

Example:

Statistic   Without outlier          With outlier
data        8,7,9,9,6,5,8,9,8,8,9    8,7,9,9,6,5,8,9,8,8,9,100
mean        7.8                      15.5
median      8                        8
mode        8 and 9 (tie)            8 and 9 (tie)
sd          1.328                    26.641
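
The figures in this example can be reproduced with a short sketch using Python's `statistics` module. The helper name `summarise` is an illustrative choice; `stdev` is the sample standard deviation (divisor n − 1), which is what matches the values quoted here.

```python
from statistics import mean, median, multimode, stdev

clean = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9]
contaminated = clean + [100]

def summarise(data):
    """Location and spread summaries, rounded as on the slide."""
    return {
        "mean": round(mean(data), 1),
        "median": median(data),
        "modes": multimode(data),      # every most-frequent value
        "sd": round(stdev(data), 3),   # sample standard deviation
    }

print(summarise(clean))
print(summarise(contaminated))
```

A single extreme value moves the mean from 7.8 to 15.5 and the standard deviation from 1.328 to 26.641, while the median and the modes do not change.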

Outlier Types of outliers

Types of outliers

Univariate outlier: concerns the distribution of a single variable

Multivariate outlier: concerns outliers in an n-dimensional space.

Figure is from “A Comprehensive Guide to Data Exploration”

Univariate Outliers

Based on the notion that “most” of the data should exhibit approximately
the same value c, the observed sequence of data {xk} can be modelled as

xk = c + ek

where {ek} is a sequence of deviations about the nominal value c.

Figure is from Chapter 2 of “Mining Imperfect Data”

Distinguish between lower outliers and upper outliers

Multivariate outliers

Multivariate outlier: concerns outliers in an n-dimensional space.
- A multivariate outlier in a sequence {xk} of vectors corresponds to a vector
  xj whose individual components are significantly discordant with the
  intercomponent relations exhibited by the majority of the other data values.

Figure is from Chapter 2 of “Mining Imperfect Data”

- The intercomponent relation in the figure: $y \simeq \sqrt{1 - x^2}$
- $0 \le x, y \le 1$

Outlier Univariate Outlier Detection

How to detect Univariate Outliers

Problem formulation: given a sequence of observed data {xk}, a reference
value x0, and a measure of variation ζ computed from {xk}, detect outliers
according to

|xk − x0| > tζ

where t is a threshold parameter.
Questions
- How do we define the nominal data reference value x0?
- How do we define the scale of natural variation ζ?
- How do we choose the threshold parameter t?
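
The problem formulation can be written as a single function in which x0, ζ and t are plain arguments. This is only a sketch: the name `flag_outliers` and the values passed in the usage line (x0 = 8, ζ = 1.5, t = 3) are illustrative assumptions, not choices prescribed here.

```python
def flag_outliers(data, x0, zeta, t):
    """Return the values x_k that satisfy |x_k - x0| > t * zeta."""
    return [x for x in data if abs(x - x0) > t * zeta]

data = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9, 100]

# Hypothetical reference value, scale and threshold, chosen only for illustration.
flagged = flag_outliers(data, x0=8, zeta=1.5, t=3)
print(flagged)  # [100]
```

Each detection rule that follows amounts to a different way of filling in `x0` and `zeta` from the data itself.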

Three Outlier Detection Methods

Choices for the nominal reference value x0
- mean: x̄
- median: x†

Choices for the measure of variation ζ
- The standard deviation: σ
- The median absolute deviation (MAD) scale estimator S:

  S = 1.4826 × median{|xk − x†|}

- The interquartile range (IQR):

  IQR = Q3 − Q1

Combining the choices
- The 3σ edit rule: x0 = x̄, ζ = σ
- The Hampel identifier: x0 = x†, ζ = S
- The standard boxplot outlier rule: x0 = x†, ζ = IQR
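
The three (x0, ζ) pairs can be computed side by side on the running example. In this sketch the helper names `mad_scale` and `iqr` are illustrative, and the `"inclusive"` quantile method is an assumption, since several quartile conventions exist and they give slightly different IQR values. The last line checks the known fact that 1.4826 ≈ 1/Φ⁻¹(0.75), which makes S comparable to σ for Gaussian data.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

data = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9, 100]

def mad_scale(xs):
    """MAD scale estimator S = 1.4826 * median(|x_k - median|)."""
    med = median(xs)
    return 1.4826 * median(abs(x - med) for x in xs)

def iqr(xs):
    """Interquartile range Q3 - Q1 (inclusive quantile convention, an assumption)."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return q3 - q1

choices = {
    "3-sigma edit rule": (mean(data), stdev(data)),
    "Hampel identifier": (median(data), mad_scale(data)),
    "boxplot rule": (median(data), iqr(data)),
}
for rule, (x0, zeta) in choices.items():
    print(f"{rule}: x0 = {x0:.3f}, zeta = {zeta:.3f}")

# Gaussian consistency constant behind the factor 1.4826.
print(1 / NormalDist().inv_cdf(0.75))
```

On this data the standard deviation is about 26.6, whereas S is about 1.48 and the IQR is 1.25: the two median-based scales are barely affected by the value 100.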

The 3σ edit rule

Basic idea: if a data sequence {xk} is well approximated by an i.i.d.
sequence of Gaussian random variables with mean µ and standard deviation
σ, the probability of observing a value xk farther than three standard
deviations from the mean is only about 0.3%.
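
The 0.3% figure can be checked numerically. For a standard Gaussian variable Z, P(|Z| > t) = erfc(t/√2); the function name `two_sided_tail` below is an illustrative choice.

```python
import math

def two_sided_tail(t):
    """P(|Z| > t) for a standard Gaussian Z."""
    return math.erfc(t / math.sqrt(2))

p3 = two_sided_tail(3)
print(f"{p3:.4%}")  # roughly 0.27%
```

The same function gives about 4.6% for t = 2, which is why a threshold of 2 would flag far too many nominal points in a large Gaussian sample.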

xk is an outlier if
|xk − x̄| > 3σ

Also known as the extreme Studentized deviation (ESD) identifier (Davies
and Gather, 1993).
Problems?
- The presence of outliers in the dataset can cause substantial errors in
  estimating
  - the mean
  - the standard deviation

Statistic   Without outlier          With outlier
data        8,7,9,9,6,5,8,9,8,8,9    8,7,9,9,6,5,8,9,8,8,9,100
mean        7.8                      15.5
avedev      0.99                     14.08
sd          1.328                    26.641
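
One consequence of these inflated estimates is masking: with enough contamination the rule can miss every outlier. The sketch below applies the 3σ edit rule to the example data, and then to a hypothetical variant with a second value of 100 appended (this variant is an assumption added for illustration). `stdev` is the sample standard deviation.

```python
from statistics import mean, stdev

def three_sigma_outliers(data):
    """Values farther than three sample standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > 3 * s]

base = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9]
one_outlier = base + [100]
two_outliers = base + [100, 100]   # hypothetical extra contamination

print(three_sigma_outliers(base))          # []
print(three_sigma_outliers(one_outlier))   # [100]
print(three_sigma_outliers(two_outliers))  # []
```

With two values of 100 the mean rises to 22 and the standard deviation to about 34.6, so the limit 3σ ≈ 104 exceeds the distance 78 between 100 and the mean, and neither outlier is flagged.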

The Hampel Identifier

Basic idea:
- x0 = x† (the median)
- ζ = S = 1.4826 × median{|xk − x†|}
- xk is an outlier if |xk − x†| > 3S

Why use the median and the MAD?
- They are less sensitive to outliers than the mean and standard deviation.

Statistic   Without outlier          With outlier
data        8,7,9,9,6,5,8,9,8,8,9    8,7,9,9,6,5,8,9,8,8,9,100
median      8                        8
MAD         1                        1

Drawbacks:
- The MAD scale estimate is identically zero if more than 50% of the data
  observations xk have the same value. In that case S = 0 and every
  observation that differs from the median is flagged as an outlier.
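
A minimal sketch of the Hampel identifier follows, applied to two inputs that are assumptions made for illustration: the example data with two values of 100 appended, and a small series in which more than half of the values are equal. The function name `hampel_outliers` is an illustrative choice.

```python
from statistics import median

def hampel_outliers(data, t=3.0):
    """Values with |x_k - median| > t * S, where S = 1.4826 * MAD."""
    med = median(data)
    s = 1.4826 * median(abs(x - med) for x in data)
    return [x for x in data if abs(x - med) > t * s]

# Hypothetical contaminated series: the slide data plus two values of 100.
contaminated = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9, 100, 100]
print(hampel_outliers(contaminated))  # [100, 100]

# Drawback: more than 50% identical values gives MAD = 0, so any other value is flagged.
mostly_equal = [1, 1, 1, 1, 2]
print(hampel_outliers(mostly_equal))  # [2]
```

In the first case the median stays at 8 and S at 1.4826, so both extreme values exceed the limit 3S ≈ 4.45. In the second case the value 2 is flagged only because the scale estimate has collapsed to zero.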

Quartile-based Detection and Boxplots

Q0: the minimum
Q1: the first quartile, greater than 25% of the data points
Q2: the median
Q3: the third quartile, greater than 75% of the data points
Q4: the maximum

For a symmetric distribution,

IQR = Q3 − Q1
x† = (Q3 + Q1) / 2
Q3 = x† + IQR/2
Q1 = x† − IQR/2

This observation suggests
- x0 = x†
- ζ = IQR

Symmetric boxplot rule

|xk − x†| > t × IQR
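
Quartile-based detection can be sketched as below. The function applies the symmetric rule around the median and, for comparison, the fence form that measures from Q1 and Q3 separately. The threshold t = 1.5 is a common convention rather than a value fixed here, and the `"inclusive"` quantile method is likewise an assumption; other quartile conventions move the limits slightly.

```python
from statistics import quantiles

def boxplot_outliers(data, t=1.5):
    """Return (symmetric, lower, upper) outlier lists for threshold t."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Symmetric rule: distance from the median measured in IQR units.
    symmetric = [x for x in data if abs(x - q2) > t * iqr]
    # Fence form: separate limits below Q1 and above Q3.
    lower = [x for x in data if x < q1 - t * iqr]
    upper = [x for x in data if x > q3 + t * iqr]
    return symmetric, lower, upper

data = [8, 7, 9, 9, 6, 5, 8, 9, 8, 8, 9, 100]
symmetric, lower, upper = boxplot_outliers(data)
print(symmetric)  # [6, 5, 100]
print(lower)      # [5]
print(upper)      # [100]
```

Here Q1 = 7.75, Q3 = 9 and IQR = 1.25. The symmetric rule also flags 6 because the data are skewed to the left of the median, which is the situation the fence form is meant to handle.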

Asymmetric boxplot rule

xk > Q3 + t × IQR ⇒ xk is an upper outlier
xk < Q1 − t × IQR ⇒ xk is a lower outlier

Outlier Multivariate Outlier Detection

Multivariate Outlier Detection

Linear models
- Residuals, i.e., the distances of the data points from the fitted
  hyperplane, are used to quantify the outlier scores.

Proximity-based models
- Outliers are defined as those points that do not lie in dense regions.
  - Clustering methods: segment the data points.
  - Density-based methods: segment the data space.

Linear models

Linear regression model

$y_j = \sum_{i=1}^{d} w_i x_{j,i} + w_{d+1} + \epsilon_j$

- Learning objective: minimise the error between the true value and the
  predicted value of y

$\sum_j \epsilon_j^2 = \sum_j \left( \left( \sum_{i=1}^{d} w_i x_{j,i} + w_{d+1} \right) - y_j \right)^2$   (1)
$= \lVert D w - y \rVert^2$   (2)

  where D is the N × (d + 1) data matrix (with a final column of ones for the
  intercept $w_{d+1}$), w is the vector of d + 1 coefficients, and y is the
  vector of N true response values.
- Closed-form solution

$w = (D^T D + \alpha I)^{-1} D^T y$

  where α ≥ 0 is a regularisation parameter; α = 0 gives the ordinary
  least-squares solution when $D^T D$ is invertible.

Regression with and without outliers

[Figure: two scatter plots of syn_data$y against syn_data$x, titled "Without Outliers" and "With Outliers". Caption: y = 2x + 0.5 + ε]

Outliers are, after all, values that deviate from expected (or predicted)
values on the basis of a particular model.
Goal: find lower-dimensional subspaces in which the outlier points behave
very differently from other points.
- The residual $\epsilon_j$ provides useful information about the outlier
  score of data point j.

Using boxplot

[Figure: the "With Outliers" scatter plot of syn_data$y against syn_data$x, shown next to a boxplot]

Density-Based Outliers

Figure: from "Outlier Analysis", second edition, by Charu C. Aggarwal

Distance-based method:
- An object p in a dataset D is a DB(pct, dmin)-outlier if at least
  percentage pct of the objects in D lie at a distance greater than dmin
  from p:

$|\{ q \in D \mid d(p, q) \le dmin \}| \le \frac{100 - pct}{100} \times |D|$

Local Outlier Factor (LOF)
- The k-distance of an object p, denoted $d_k(p)$, is defined as the
  distance d(p, o) between p and an object o ∈ D such that:
  1. for at least k objects $o' \in D \setminus \{p\}$ it holds that $d(p, o') \le d(p, o)$, and
  2. for at most k − 1 objects $o' \in D \setminus \{p\}$ it holds that $d(p, o') < d(p, o)$.
- k-distance neighborhood of an object p

$N_{d_k(p)}(p) = \{ q \in D \setminus \{p\} \mid d(p, q) \le d_k(p) \}$

  written $N_k(p)$ for short below.
- Reachability distance of an object p w.r.t. object o

$d_{r,k}(p, o) = \max(d_k(o), d(p, o))$

  Notation: $\text{reach-dist}_k(p_1, o) = d_{r,k}(p_1, o)$ and $k\text{-distance}(o) = d_k(o)$.
- Local reachability density of an object p

$\mathrm{lrd}_k(p) = 1 \Big/ \left( \frac{\sum_{o \in N_k(p)} d_{r,k}(p, o)}{|N_k(p)|} \right)$

- LOF of an object p

$\mathrm{LOF}_k(p) = \frac{\sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}}{|N_k(p)|}$

Compare different outlier detection methods

Figure: Figures from http://scikit-learn.org/

Summary

Types of outliers
Univariate outlier detection methods
- the 3σ edit rule
- the Hampel identifier
- quartile-based detection
Multivariate outlier detection methods
- Linear model
- Local Outlier Factor
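
To illustrate the linear-model approach from the multivariate section, the sketch below fits a least-squares line and then applies a boxplot rule to the residuals. It covers only the simplest case (one predictor, no regularisation), and the data are a hypothetical noise-free sample from y = 2x + 0.5 with one response shifted by 15. The names `fit_line` and `residual_outliers`, the threshold t = 1.5 and the `"inclusive"` quantile method are all assumptions.

```python
from statistics import mean, quantiles

def fit_line(xs, ys):
    """Ordinary least squares for y = w1 * x + w2 (one predictor, alpha = 0)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    return w1, my - w1 * mx

def residual_outliers(xs, ys, t=1.5):
    """Indices whose residual falls outside the boxplot fences of all residuals."""
    w1, w2 = fit_line(xs, ys)
    residuals = [y - (w1 * x + w2) for x, y in zip(xs, ys)]
    q1, _, q3 = quantiles(residuals, n=4, method="inclusive")
    iqr = q3 - q1
    return [j for j, r in enumerate(residuals)
            if r < q1 - t * iqr or r > q3 + t * iqr]

# Hypothetical data: points on y = 2x + 0.5, with point 7 shifted upward by 15.
xs = list(range(10))
ys = [2 * x + 0.5 for x in xs]
ys[7] += 15

flagged = residual_outliers(xs, ys)
print(flagged)  # [7]
```

The shifted point also pulls the fitted line towards itself, so the other residuals are no longer zero. A robust scale such as the IQR keeps that distortion from hiding the outlier.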
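
The DB(pct, dmin) definition from the distance-based slide translates almost directly into code. This is a brute-force sketch with quadratic cost, using Euclidean distance as an assumed choice of d. The point set and the parameters pct = 80, dmin = 2.0 are hypothetical.

```python
import math

def db_outliers(points, pct, dmin):
    """Indices of DB(pct, dmin)-outliers; pct is a percentage between 0 and 100."""
    limit = (100 - pct) / 100 * len(points)
    flagged = []
    for i, p in enumerate(points):
        # Count objects within dmin of p; p itself is counted because d(p, p) = 0.
        near = sum(1 for q in points if math.dist(p, q) <= dmin)
        if near <= limit:
            flagged.append(i)
    return flagged

# Hypothetical data: a tight cluster of five points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
flagged = db_outliers(points, pct=80, dmin=2.0)
print(flagged)  # [5]
```

Each cluster point has five objects within distance 2, well above the limit of 1.2, whereas the distant point has only itself.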
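
The LOF definitions (k-distance, k-distance neighbourhood, reachability distance, local reachability density) can be followed step by step in a small brute-force sketch. It assumes Euclidean distance and no duplicate points (duplicates would make a reachability sum zero), and it is not how a library such as scikit-learn implements LOF. The data and k = 2 are hypothetical.

```python
import math

def lof_scores(points, k):
    """LOF_k for every point, computed directly from the definitions."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k-distance: distance to the k-th nearest other object.
    kdist = [sorted(dist[i][j] for j in range(n) if j != i)[k - 1]
             for i in range(n)]

    # k-distance neighbourhood: all other objects within the k-distance (ties kept).
    nbrs = [[j for j in range(n) if j != i and dist[i][j] <= kdist[i]]
            for i in range(n)]

    def reach(i, o):
        # Reachability distance of point i with respect to point o.
        return max(kdist[o], dist[i][o])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [len(nbrs[i]) / sum(reach(i, o) for o in nbrs[i]) for i in range(n)]

    # LOF: average ratio of the neighbours' density to the point's own density.
    return [sum(lrd[o] for o in nbrs[i]) / (lrd[i] * len(nbrs[i]))
            for i in range(n)]

# Hypothetical data: a tight cluster of five points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Points inside the cluster score close to 1 because their density matches that of their neighbours, while the distant point scores far above 1.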