
EM623-Week3

Carlo Lipizzi
clipizzi@stevens.edu

SSE

Machine Learning and Data Mining
Data management: generalized tools and techniques

Knowledge Discovery Process, in practice

[Diagram: knowledge discovery pipeline; the Data Preparation stage is estimated to take 70-80% of the time and effort]

2

Data Processing Flow

[Diagram: Data → Analysis → Decisions; Quality of Data → Quality of Analysis → Quality of Decisions]

• Types of Data Quality Problems:
– Ambiguity
– Uncertainty
– Erroneous data values
– Missing values
– Duplication
– etc.

3

Approaching Data Quality

We need a multi-disciplinary approach to attack data quality problems
– No one approach solves all problems

• Process management
– Ensure proper procedures

• Statistics
– Focus on analysis: find and repair anomalies in the data

• Database
– Focus on relationships: ensure consistency

• Metadata / domain expertise
– What does it mean? Interpretation

4

Metadata

• Data about the data
• Data types, domains, and constraints help, but are often not enough
• Interpretation of values
– Scale, units of measurement, meaning of labels
• Interpretation of tables
– Frequency of refresh, associations, view definitions
• Most work done for scientific databases
– Metadata can include programs for interpreting the data set

5

Process Management

Business processes which encourage data quality
– Standardization of content and formats
– Enter data once, enter it correctly (incentives for sales, customer care)
– Automation
– Assign responsibility: data stewards
– End-to-end data audits and reviews

• Transitions between organizations
– Data Monitoring
– Data Publishing
– Feedback loops

6

Feedback Loops

Data processing systems are often thought of as open-loop systems
– Do your processing, then throw the results over the fence
– Computers don't make mistakes, do they?

Analogy to control systems: feedback loops
– Monitor the system to detect differences between actual and intended behavior
– Feed back to correct the behavior of earlier components
– Of course, data processing systems are much more complicated than linear control systems

7

Example

Sales, provisioning, and billing for telecommunications service
– Many stages involving handoffs between organizations and databases
– Simplified picture

Transitions between organizational boundaries are a common cause of problems

Natural feedback loops
– Customer complains if the bill is too high

Missing feedback loops
– No complaints if we undercharge

8

Example

[Diagram: data flows among Customer, Sales Order, Billing, Customer Account Information, Provisioning, and Customer Care, showing both existing and missing data flows]

9

Monitoring

Use data monitoring to add missing feedback loops

Methods:
– Data tracking / auditing
• Follow a sample of transactions through the workflow
• Build a secondary processing system to detect possible problems
– Reconciliation of incrementally updated databases with original sources
– Mandated consistency with a Database of Record
– Feedback loop sync-up
– Data Publishing

10

Statistical Approaches

No explicit DQ methods
– Traditional statistical data are collected from carefully designed experiments, often tied to the analysis
– But there are methods for finding anomalies and repairing data
– Existing methods can be adapted for DQ purposes

Four broad categories can be adapted for DQ
– Missing, incomplete, ambiguous or damaged data, e.g. truncated or censored
– Suspicious or abnormal data, e.g. outliers
– Testing for departure from models
– Goodness-of-fit

11

Statistics has two major branches:

• Descriptive statistics
– Gives numerical and graphic procedures to summarize a collection of data in a clear and understandable way

• Inferential statistics
– Provides procedures to draw inferences about a population from a sample

12

Descriptive Measures

• Central Tendency measures
– They are computed to give a "center" around which the measurements in the data are distributed

• Variation or Variability measures
– They describe the "data spread", or how far away the measurements are from the center

• Relative Standing measures
– They describe the relative position of specific measurements in the data

13

Measures of Central Tendency

• Mean
– Sum of all measurements divided by the number of measurements

• Median
– A number such that at most half of the measurements are below it and at most half of the measurements are above it

• Mode
– The most frequent measurement in the data

14

Example of Mean

Measurements (x)    Deviation (x − mean)
       3                   −1
       5                    1
       5                    1
       1                   −3
       7                    3
       2                   −2
       6                    2
       7                    3
       0                   −4
       4                    0
Sum:  40                    0

• MEAN = 40/10 = 4
• Notice that the sum of the "deviations" is 0
• Notice that every single observation enters into the computation of the mean

Excel Example
=AVERAGE(B72:B81)

15

Example of Median

• Median: (4 + 5)/2 = 4.5
• Notice that only the two central values are used in the computation
• The median is not sensitive to extreme values

Measurements (x)    Measurements ranked
       3                    0
       5                    1
       5                    2
       1                    3
       7                    4
       2                    5
       6                    5
       7                    6
       0                    7
       4                    7
Sum:  40                   40

Excel Example
=MEDIAN(B72:B81)

16

Example of Mode

First data set (x): 3, 5, 5, 1, 7, 2, 6, 7, 0, 4

• The mode of a list of numbers is the value (or values) that occurs most frequently
• In this case the data have two modes: 5 and 7
• Both measurements are repeated twice

Second data set (x): 3, 5, 1, 1, 4, 7, 3, 8, 3

• Mode: 3
• Notice that it is also possible for a dataset not to have any mode
Excel Example

=MODE(B72:B81)
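
Python example – a minimal sketch (standard library only) that reproduces the mean, median, and mode results above for the first data set:

from statistics import mean, median, multimode

x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

print(mean(x))       # 4   -> sum 40 divided by 10 measurements
print(median(x))     # 4.5 -> average of the two central ranked values (4 and 5)
print(multimode(x))  # [5, 7] -> both values occur twice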

17

Maximum, Minimum, and Range

Excel Example:

– =MIN(cellrange) =MIN(D2:D81)
– =MAX(cellrange) =MAX(D2:D81)

– There is no explicit command to find the range
– However, it can be easily calculated

– = MAX(D2:D81) – MIN(D2:D81)
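
Python example – the same three quantities for the small data set used above, using built-ins:

x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
print(min(x), max(x), max(x) - min(x))   # 0 7 7  (minimum, maximum, range)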

18

Exercise – Companies Values

• Data set: Companies1.xlsx

– 25 companies

• For the 3 numeric variables calculate:

– Mean, Mode, Median, Max, Min, Range

• Can you get any non-explicit info from the values you calculated?

• Can you create any new variables to get more from your data?
What is your goal?

19

Variance

• Variance is the average of the squared differences from
the Mean

• Steps:

– Compute each deviation
– Square each deviation
– Sum all the squares
– Divide by the data size (sample size) minus one: n-1

20

Population variance:   \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2

Sample variance:       s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

Example of Variance

Measurements (x)   Deviation (x − mean)   Squared deviation
       3                  −1                     1
       5                   1                     1
       5                   1                     1
       1                  −3                     9
       7                   3                     9
       2                  −2                     4
       6                   2                     4
       7                   3                     9
       0                  −4                    16
       4                   0                     0
Sum:  40                   0                    54

• Variance = 54/9 = 6
• It is a measure of "spread"
• Notice that the larger the deviations (positive or negative), the larger the variance

21

Excel Example
=VAR.P(B72:B81) calculates variance based on the entire population
=VAR.S(B72:B81) calculates variance based on a sample

The standard deviation

• It is defined as the square root of the variance
• In the previous example, Variance = 6
• Standard deviation = square root of the variance = square root of 6 = 2.45
• We use n−1 instead of N (Bessel's correction) to compensate for the fact that the xi in a sample tend to be closer to their own average

22

Excel Example
=STDEV.P(B72:B81)
=STDEV.S(B72:B81)

Population standard deviation:   \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}

Sample standard deviation:       s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
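
Python example – a minimal standard-library sketch that reproduces the variance and standard deviation results above, in both the sample and population versions:

from statistics import variance, pvariance, stdev, pstdev

x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

print(variance(x))   # sample variance: 54/9 = 6 (like =VAR.S)
print(pvariance(x))  # population variance: 54/10 = 5.4 (like =VAR.P)
print(stdev(x))      # sample standard deviation, about 2.449 (like =STDEV.S)
print(pstdev(x))     # population standard deviation, about 2.324 (like =STDEV.P)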

The standard deviation: Sample vs Population

• The standard deviation is a measure of the spread of scores within a set of data

• Usually, we are interested in the standard deviation of a population. However, as we are often presented with data from a sample only, we can estimate the population standard deviation from the sample standard deviation

• A population includes each element from the set of observations that can be made, while a sample consists only of observations drawn from the population

• These two standard deviations – sample and population – are calculated differently. In statistics, we are usually presented with having to calculate sample standard deviations

23

Shape – Patterns of Frequency

[Figure: example frequency-distribution shapes – bimodal, symmetrical, skewed, truncated]

24

Percentiles

• The pth percentile is a number such that at most p% of the measurements are below it and at most (100 − p)% of the measurements are above it

• For example, if in a certain data set the 85th percentile is 340, this means that 15% of the measurements are above 340 and 85% of the measurements are below it

• Notice that the median is the 50th percentile (see the sketch below)
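
Python example – a minimal NumPy sketch on the small data set used earlier (the exact 85th-percentile value depends on the interpolation rule; NumPy's default is linear):

import numpy as np

x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
print(np.percentile(x, 50))   # 4.5 -> the 50th percentile is the median
print(np.percentile(x, 85))   # about 6.65 with linear interpolation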

25

For any data

• At least 75% of the measurements differ from the mean by less than twice the standard deviation

• At least 89% of the measurements differ from the mean by less than three times the standard deviation

Note: this is a general property called Tchebysheff's Inequality: given a number k >= 1 and a population with n measurements, at least 1 − 1/k² of the measurements will lie within k standard deviations of their mean. It is true for every dataset (a small check follows below)
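
Python example – a small sketch that checks the inequality empirically on an arbitrary data set:

import statistics

def chebyshev_check(data, k):
    """Fraction of measurements within k standard deviations of the mean,
    compared with Tchebysheff's lower bound 1 - 1/k**2."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    within = sum(1 for v in data if abs(v - mu) <= k * sigma) / len(data)
    return within, 1 - 1 / k**2

x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
print(chebyshev_check(x, 2))   # (1.0, 0.75): all values lie within 2 standard deviations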

26

Example of Tchebysheff’s Inequality

• Suppose that for a certain data set: Mean = 20, Standard deviation = 3

Then:
• At least 75% of the measurements are between 14 and 26
• At least 89% of the measurements are between 11 and 29

27

• Bottom line: the rule guarantees that in any probability distribution, "nearly all" values are close to the mean

Further Notes

• When the Mean is greater than the Median, the data distribution is skewed to the Right

• When the Median is greater than the Mean, the data distribution is skewed to the Left

• When the Mean and Median are very close to each other, the data distribution is approximately symmetric

28

Exercise – Starting Salaries – 15’

• Data set: StartSalary.xlsx

– 12 data points

• Calculate:

– Mean, Mode, Median, Standard Deviation, Sample Variance, Skewness, Max, Min, Range

29

Knowledge Discovery Process, in practice

[Diagram: knowledge discovery pipeline; the Data Preparation stage is estimated to take 70-80% of the time and effort]

30

Data Cleaning / Quality

• Individual measurements
– Random noise in individual measurements
• Outliers
• Random data entry errors
• Noise in label assignment (e.g., class labels in medical data sets)
• Can be corrected or smoothed out
– Systematic errors
• E.g.: all ages > 99 recorded as 99
• More individuals aged 20, 30, 40, etc. than expected

• Missing information
– Missing at random
• Questions on a questionnaire that people randomly forget to fill in
– Missing systematically
• Questions that people don't want to answer
• Patients who are too ill for a certain test

31

Handling Missing Data: 3 alternatives

• Replace Missing Values with User-defined Constants
– Missing numeric values replaced with 0.0
– Missing categorical values replaced with "Missing"

• Replace Missing Values with the Mode or Mean

• Replace Missing Values with Random Values
– Values randomly drawn from the underlying distribution
– This method is generally superior to mean substitution (see the sketch below)
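
Python example – a minimal pandas sketch of the three alternatives; the separator and all column names ("mpg", "origin", "weight", "hp") are assumptions, not taken from the course data set:

import numpy as np
import pandas as pd

df = pd.read_csv("cars.txt", sep="\t")                 # assumes a tab-separated file

# 1. User-defined constants
df["mpg"] = df["mpg"].fillna(0.0)                      # numeric field -> 0.0
df["origin"] = df["origin"].fillna("Missing")          # categorical field -> "Missing"

# 2. Mean (or mode) substitution
df["weight"] = df["weight"].fillna(df["weight"].mean())

# 3. Random values drawn from the observed (non-missing) distribution
observed = df["hp"].dropna().to_numpy()
mask = df["hp"].isna()
df.loc[mask, "hp"] = np.random.choice(observed, size=mask.sum())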

32

Exercise on Handling Missing Data

• Examine the cars.txt dataset, containing records for 261 automobiles manufactured in the 1970s and 1980s

• Examine the file and handle the missing data
– Use one or more of the three alternative methods

33

Identifying Outliers

• Outliers are values that lie near the extreme limits of the data range
• Outliers may represent errors in data entry
• Certain statistical methods may produce unstable results in their presence
• Some data mining algorithms benefit from normalized data

34

Graphical Methods for Identifying Outliers

• A histogram examines the values of a numeric field
• It gives us the possibility to identify the outliers and then decide what to do with them (see the sketch below)
• Multidimensional graphs could provide more insights
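
Python example – a short pandas/matplotlib sketch that plots a histogram of one numeric field so extreme values stand out; the separator and the column name "weight" are assumptions:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cars_full.txt", sep="\t")    # assumes a tab-separated file

df["weight"].plot(kind="hist", bins=30, edgecolor="black")
plt.xlabel("weight")
plt.title("Inspect the tails of the distribution for possible outliers")
plt.show()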

35

Exercise on Handling Outliers

• Examine the cars_full.txt dataset, containing full records for 261 automobiles manufactured in the 1970s and 1980s

• Examine the file for outliers

36

Data Transformation – Normalization

• Variables tend to have ranges different from each other

• Some data mining algorithms are adversely affected by differences in variable ranges

• Variables with greater ranges tend to have a larger influence on the data model's results

• Therefore, numeric field values should be normalized

37

Normalization – Min-Max

• Determines how much greater the selected field value is than the minimum value for the field

• Scales this difference by the field's range

X^* = \frac{X - \min(X)}{\mathrm{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}

38

Min-Max – Example

• From the cars dataset, normalize the value for a vehicle taking 25 seconds to reach 60 mph

• Max(time-to-60) = 25

• Maximum field values have a Min-max Normalization value = 1

• Min-max Normalization values range over [0, 1]

X^* = \frac{X - \min(X)}{\max(X) - \min(X)} = \frac{25 - 8}{25 - 8} = 1.0

39

Z-score Standardization

• Widely used in statistical analysis

• Takes the difference between the field value and the field mean

• Scales this difference by the field's standard deviation

X^* = \frac{X - \mathrm{mean}(X)}{\mathrm{SD}(X)}

40

Z-score – Example

• Same: from the cars dataset, normalize the value for a vehicle taking 25 seconds to reach 60 mph

• Data values that lie below the mean have negative Z-score Standardization values

41

X^* = \frac{X - \mathrm{mean}(X)}{\mathrm{SD}(X)} = 2.593 for the 25-second vehicle
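
Python example – a minimal pandas sketch of both normalizations on the cars data; the separator and the column name "time_to_60" are assumptions about the file layout:

import pandas as pd

df = pd.read_csv("cars_full.txt", sep="\t")                 # assumes a tab-separated file
x = df["time_to_60"]

df["time_minmax"] = (x - x.min()) / (x.max() - x.min())     # Min-Max: values in [0, 1]
df["time_zscore"] = (x - x.mean()) / x.std()                # Z-score: mean 0, sample SD 1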
Z-score – Key points

• Z-score Standardization values typically range over [−4, 4]
• Field values below the field mean → negative Z-score Standardization values
• Field values equal to the field mean → Z-score Standardization value = 0
• Field values above the field mean → positive Z-score Standardization values

42

Exercise on Normalization

• Using the cars_full.txt dataset, normalize the "time-to-60" values
• Use either the Min-Max or the Z-score method

43

Data Transformation – Data Reduction

• Dimension Reduction
– In general, incurs loss of information about x
– If the dimensionality p is very large (e.g., 1000's), representing the data in a lower-dimensional space may make learning more reliable
– e.g.: clustering example
• 100-dimensional data
• if cluster structure is only present in 2 of the dimensions, the others are just noise
• if the other 98 dimensions are just noise (relative to the cluster structure), then the clusters will be much easier to discover if we focus on just that 2-d space
– Dimension reduction can also provide interpretation/insight (e.g.: for 2-d visualization purposes)

44

Data Reduction - Methods

• Sampling
– Choose a representative subset of the data
– Simple random sampling may be ok, but beware of skewed variables
• Principal Component
– One of several projection methods
– Idea: find a projection of your data into a lower dimension that maximizes the amount of information retained

45

Data Reduction - Principal Component

Using orthogonal transformations, PCA converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

[Figure: scatter plots of x1 vs x2 showing the direction of the 1st principal component vector (the highest-variance projection) and the direction of the 2nd principal component vector]

46

Data Reduction - Consolidation

• Consolidating variables to create new logical variables
– This is very domain-dependent and may create new insights on the data
• In the cars dataset, creating the variable "hp/weight" can provide an indication of power per unit weight and make vehicles more comparable to one another (see the sketch below)

47
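
Python example – a small sketch of the last two ideas (PCA with scikit-learn and a consolidated "hp/weight" variable with pandas); the separator and the column names "hp" and "weight" are assumptions:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("cars_full.txt", sep="\t")              # assumes a tab-separated file

# Consolidation: a new logical variable built from two existing ones
df["hp_per_weight"] = df["hp"] / df["weight"]

# Dimension reduction: project the numeric fields onto the first 2 principal components
numeric = df.select_dtypes("number").dropna()
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(numeric))
print(scores[:5])                                        # first 5 cars in the reduced 2-d space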