Data Mining: Concepts and Techniques
— Chapter 3 —
Qiang (Chan) Ye, Faculty of Computer Science, Dalhousie University
Chapter 3: Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Quality
• Data quality criteria: a multidimensional view
  – Accuracy: correct or wrong, accurate or not
  – Completeness: not recorded, unavailable, …
  – Consistency: some entries modified but others not, dangling references, …
  – Timeliness: is the data updated in a timely manner?
  – Believability: how much do users trust that the data are correct?
  – Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data reduction
  – Dimensionality reduction
  – Numerosity reduction
  – Data compression
• Data transformation and data discretization
  – Normalization
  – Concept hierarchy generation
Data Cleaning
• Real-world data tends to be dirty: much of it is potentially incorrect, e.g., due to instrument faults, human or computer error, or transmission errors
  – incomplete (i.e., missing): lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., Occupation = “ ” (missing data)
  – noisy: containing noise, errors, or outliers
    e.g., Salary = “−10” (an error)
  – inconsistent: containing discrepancies in codes or names, e.g.,
    Age = “42” but Birthday = “03/07/2010”
    previous ratings “1, 2, 3”, current ratings “A, B, C”
Incomplete (Missing) Data
• Data is not always available
  – For example, suppose you need to analyze AllElectronics sales and customer data, and you find that many tuples have no recorded value for several attributes such as customer income.
• Missing data may result from:
  – equipment malfunction
  – values that were inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding
  – certain data not being considered important at the time of entry
• Missing data may need to be inferred
How to Handle Incomplete (Missing) Data?
• Ignore the tuple:
  – For classification problems, this is usually done when the class label is missing.
  – By ignoring the tuple, we do not make use of the values of the remaining attributes in the tuple, even though such data could have been useful to the task at hand. Thus, ignoring tuples is normally not preferred.
• Fill in the missing values manually:
  – time-consuming
  – infeasible for large data sets
How to Handle Incomplete (Missing) Data?
• Fill in the missing values automatically with:
  – a constant, e.g., “unknown”; however, this could effectively introduce a new class or value that is not valid
  – the attribute mean
  – the attribute mean for all samples belonging to the same class
  – the most probable value, using inference-based methods such as decision trees (we will learn about decision trees later)
• Note: these methods can bias the data. Nevertheless, filling in the most probable value automatically is popular and widely used because it tends to be the most effective. The first three strategies are sketched below.
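As an illustration of the constant, mean, and class-mean strategies, here is a minimal sketch using pandas; the DataFrame, its columns ("income", "label"), and all values are hypothetical, chosen only to show the mechanics.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: 'income' has missing values, 'label' is the class.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan, 48000, 75000],
    "label":  ["A", "A", "B", "B", "A", "B"],
})

# Option 1: fill with a constant (may introduce an artificial value/class).
df["income_const"] = df["income"].fillna(-1)

# Option 2: fill with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Option 3: fill with the mean of the samples in the same class.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("label")["income"].transform("mean")
)

print(df)
```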
Noisy Data
• Noisy data: data containing abnormal values (noise, errors, or outliers)
• Noisy data may result from:
  – faulty data collection instruments
  – data entry problems
  – data transmission problems
  – technology limitations
How to Handle Noisy Data?
• Binning: smooth a sorted data value by consulting its “neighborhood,” that is, the values around it.
  Step 1: Sort the data and partition it into equal-frequency bins (i.e., each bin contains the same number of values).
  Step 2: Use one of the following methods to smooth the values (a sketch follows below):
    – Smooth by bin means: each value in a bin is replaced by the mean value of the bin.
    – Smooth by bin medians: each value in a bin is replaced by the median value of the bin.
    – Smooth by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
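Here is a minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the nine sorted values are made up for illustration.

```python
import numpy as np

# Hypothetical sorted data (e.g., prices); values are made up for illustration.
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Step 1: partition into equal-frequency bins (3 values per bin).
bins = data.reshape(3, 3)

# Step 2a: smooth by bin means.
by_means = np.repeat(bins.mean(axis=1), 3)

# Step 2b: smooth by bin boundaries (replace each value with the
# closer of its bin's minimum and maximum).
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds)  # [ 4  4 15 21 21 24 25 25 34]
```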
How to Handle Noisy Data?
• Regression: smooth data with regression functions
  – Regression is a technique that uses a function to represent/describe data values.
  – Linear regression involves finding the “best” line to fit two attributes (i.e., variables) so that one attribute can be used to predict the other.
  – For example, a linear function can be used to describe the relationship between age and height.
  – The details of regression will be discussed in Section 3.4. A sketch follows below.
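As a minimal sketch of smoothing via linear regression, the following fits a least-squares line with numpy and replaces each noisy value with the line's prediction; the (x, y) observations are hypothetical.

```python
import numpy as np

# Hypothetical noisy observations of two attributes; made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])

# Fit the "best" line y = w1*x + w0 (least squares).
w1, w0 = np.polyfit(x, y, deg=1)

# Replace each noisy value with the value predicted by the line.
y_smoothed = w1 * x + w0

print(f"y = {w1:.2f}x + {w0:.2f}")
print(y_smoothed)
```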
How to Handle Noisy Data?
• Clustering: detect and remove noisy data (i.e., outliers)
  – Noisy data can be detected by clustering, for example, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the clusters may be considered outliers. A sketch follows below.
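Here is a minimal sketch using scikit-learn's KMeans, assuming the values form two natural groups; the data and the distance threshold (three times the median distance to the assigned cluster center) are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D values: two natural groups plus one stray value (10.0).
values = np.array([1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 10.0]).reshape(-1, 1)

# Group similar values into clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)

# Distance of each value to its own cluster center.
centers = km.cluster_centers_[km.labels_]
dist = np.abs(values - centers).ravel()

# Flag values that fall far outside their cluster (hypothetical
# threshold: 3x the median distance).
outliers = values.ravel()[dist > 3 * np.median(dist)]
print(outliers)  # expected to flag 10.0
```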
Data Integration: Entity Identification Problem
• Data integration:
  – Combines data from multiple sources into a coherent data set
• Entity identification problem: identify real-world entities across multiple data sources
  – When we combine data from multiple sources, we often need to perform schema integration and object matching.
  – Schema integration: integrate the schemas of different databases into one consistent schema, e.g., A.cust-id ≡ B.cust-#
  – For a database, the metadata for each attribute includes details such as the name, meaning, data type, and range of permitted values. Such metadata can be used to help avoid errors in schema integration.
Data Integration: Redundancy
• Redundancy is another important issue in data integration.
  – An attribute may be redundant if it can be “derived” from another attribute or set of attributes.
  – Redundancy can be detected by correlation analysis.
• Correlation analysis: given two attributes, correlation analysis can measure how strongly one attribute implies the other, based on the available data.
  – For nominal (i.e., non-numeric) data, we use the χ² (chi-square) test.
  – For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute’s values vary from those of another.
Correlation Analysis (Nominal Data)
χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected    (Eq. 3.1)

where the sum is computed over all cells of the contingency table, and the expected frequency of the cell for attribute values (a_i, b_j) is

  Expected_ij = (count(A = a_i) × count(B = b_j)) / n    (Eq. 3.2)

with n the number of data tuples.

• With the χ² test, we need to calculate the χ² value.
• The larger the χ² value, the higher the likelihood that the variables are related.
Correlation Analysis (Nominal Data)
• The χ² test can verify the hypothesis that A and B are independent, that is, that there is no correlation between them.
• The test is based on a significance level, with (r−1)×(c−1) degrees of freedom, where r and c are the numbers of rows and columns in the contingency table.
• If the hypothesis can be rejected, then we say that A and B are statistically correlated.
• We will illustrate how to carry out the χ² test using the following example.
Chi-Square Test: An Example
• Example scenario:
  – Suppose that a group of 1500 people was surveyed.
  – The gender of each person was noted.
  – Each person was polled as to whether his or her preferred type of reading material was fiction or non-fiction.
  – Thus, we have two attributes: gender and preferred reading.
• The observed frequency (or count) of each possible joint event is summarized in the contingency table below.
               | fiction   | non-fiction | Total (row)
  male         | 250 (90)  | 50 (210)    | 300
  female       | 200 (360) | 1000 (840)  | 1200
  Total (col.) | 450       | 1050        | 1500

The numbers in parentheses are the corresponding expected frequencies.
Chi-Square Test: An Example
• Using Eq. (3.2), we can calculate the expected frequencies for each cell. For example, the expected frequency for the cell (male, fiction) is

  Expected = (count(male) × count(fiction)) / n = (300 × 450) / 1500 = 90

• Note that in any row, the sum of the expected frequencies must equal the total observed frequency for that row, and the sum of the expected frequencies in any column must also equal the total observed frequency for that column.
Chi-Square Test: An Example
• Using Eq. (3.1) for the χ² calculation, we arrive at (see the sketch below):

  χ² = (250−90)²/90 + (50−210)²/210 + (200−360)²/360 + (1000−840)²/840 = 507.93

• For the 2×2 table in this example, the degrees of freedom are (2−1)×(2−1) = 1.
• For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828. (This value can be taken from a table of upper percentage points of the χ² distribution, available in any statistics textbook.)
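To verify the computation, here is a minimal sketch using scipy's chi2_contingency; correction=False disables Yates' continuity correction so that the statistic matches the plain Eq. (3.1) value for this 2×2 table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from the contingency table above
# (rows: male, female; columns: fiction, non-fiction).
observed = np.array([[250, 50],
                     [200, 1000]])

# correction=False so the statistic matches the plain Eq. (3.1) computation.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print(chi2)       # ~507.93
print(dof)        # 1
print(expected)   # [[ 90. 210.] [360. 840.]]
print(p < 0.001)  # True -> reject independence at the 0.001 level
```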
Chi-Square Test: An Example
• Chi-square table (excerpt), upper percentage points of the χ² distribution:

  df | α = 0.05 | α = 0.01 | α = 0.001
  1  | 3.841    | 6.635    | 10.828
Chi-Square Test: An Example
• Since our computed value (507.93) is greater than 10.828, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.
• Note that correlation does not imply causality:
  – The number of hospitals and the number of car thefts in a city can be correlated.
  – Both are causally linked to a third variable: population.
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product-moment coefficient):

  r_{A,B} = Σ (a_i − Ā)(b_i − B̄) / (n σ_A σ_B) = (Σ a_i b_i − n Ā B̄) / (n σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the products a_i b_i (for i = 1 to n).

• If r_{A,B} > 0: A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
  – A high value may indicate that A (or B) is redundant and can be removed.
• If r_{A,B} = 0: A and B are uncorrelated (no linear relationship); note that this alone does not imply independence.
• If r_{A,B} < 0: A and B are negatively correlated. (A computational sketch follows below.)
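Here is a minimal sketch that computes r from the formula above and cross-checks it against numpy's built-in corrcoef; the paired attribute values are hypothetical.

```python
import numpy as np

# Hypothetical paired attribute values; made up for illustration.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson correlation from the definition, using population standard
# deviations (ddof=0) to match the formula above.
n = len(a)
r = (np.sum(a * b) - n * a.mean() * b.mean()) / (
    n * a.std(ddof=0) * b.std(ddof=0)
)

# numpy's built-in version for comparison.
r_np = np.corrcoef(a, b)[0, 1]

print(r, r_np)  # both ~0.77 -> positively correlated
```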
Visually Evaluating Correlation
[Figure: scatter plots showing correlation coefficients ranging over [−1, 1].]

Note that:
1) The correlation coefficient is a measure of the linear correlation between two variables.
2) The absolute value of the correlation coefficient is always less than or equal to 1: 1 means perfect positive linear correlation, 0 means no linear correlation, and −1 means perfect negative linear correlation.
Covariance (Numeric Data)
• Covariance is similar to correlation:

  Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σ (a_i − Ā)(b_i − B̄)

  Correlation coefficient:  r_{A,B} = Cov(A, B) / (σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective mean (or expected) values of A and B, σ_A and σ_B are the respective standard deviations of A and B, and E denotes the expected value (i.e., the average).
• Positive covariance: if Cov(A, B) > 0, then A and B are positively correlated (i.e., A and B tend to show similar behavior).
• Negative covariance: if Cov(A, B) < 0, then A and B are negatively correlated (i.e., A and B tend to show opposite behavior).
• Independence ⇒ Cov(A, B) = 0, but the converse is not true: some pairs of random variables have a covariance of 0 yet are not independent (see the sketch below). Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
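A minimal sketch of the classic counterexample: with X symmetric about 0 and Y = X², the covariance is exactly 0 even though Y is a deterministic function of X; the values are chosen only for illustration.

```python
import numpy as np

# Classic counterexample: X symmetric around 0 and Y = X^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Population covariance: E[XY] - E[X]E[Y].
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)  # 0.0 -- yet Y is completely determined by X,
            # so X and Y are not independent.
```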
Covariance (Numeric Data)
• The sign of the covariance shows the tendency of the linear relationship between the variables.
• The magnitude of the covariance is hard to interpret because it is not normalized and hence depends on the magnitudes of the variables.
• The normalized version of the covariance, the correlation coefficient, can indicate the strength of the linear relationship via its magnitude.
Covariance: An Example
• The covariance can be simplified in computation as

  Cov(A, B) = E(A·B) − Ā B̄ = (1/n) Σ a_i b_i − Ā B̄

  (Σ a_i b_i is the dot product of the two value vectors.)

• Suppose the stocks AllElectronics and HighTech have the following prices at five time points in one week:

  Time point:      t1   t2   t3   t4   t5
  AllElectronics:   6    5    4    3    2
  HighTech:        20   10   14    5    5
Covariance: An Example
• Question: if the two stocks are affected by the same industry trends, will their prices rise or fall together?
  – E(AllElectronics) = (6 + 5 + 4 + 3 + 2) / 5 = 20/5 = 4
  – E(HighTech) = (20 + 10 + 14 + 5 + 5) / 5 = 54/5 = 10.8
  – Cov(AllElectronics, HighTech) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5) / 5 − 4 × 10.8 = 50.2 − 43.2 = 7
• Thus, AllElectronics and HighTech rise or fall together, since Cov(A, B) > 0. A sketch of this computation follows below.
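The following minimal sketch reproduces the covariance computation with numpy, using the simplified formula and the library's population covariance for comparison.

```python
import numpy as np

# Weekly prices from the example above.
all_electronics = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
high_tech = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

# Simplified computation: Cov(A, B) = E(A*B) - mean(A)*mean(B).
cov = (np.mean(all_electronics * high_tech)
       - all_electronics.mean() * high_tech.mean())
print(cov)  # 7.0 -> positive: the two stocks tend to rise and fall together

# numpy equivalent (bias=True gives the population covariance used here).
print(np.cov(all_electronics, high_tech, bias=True)[0, 1])  # 7.0
```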