
COMP3430 / COMP8430 Data wrangling
Lecture 7: Data transformation, aggregation and reduction
(Lecturer: )

Lecture outline
● Data transformation
  – Attribute/feature construction
● Data aggregation
● Data reduction
● Summary

Data transformation
● Generalisation
  – Using a concept hierarchy
● Normalisation
  – Scale data to fall within a small (specified) range
  – Min-max normalisation, z-score normalisation, decimal scaling, logarithm transformation
● Attribute/feature construction
  – New attributes constructed by applying a function on existing attributes

Generalisation (1)
● Based on a concept hierarchy or a value generalisation hierarchy
● Concept hierarchy – specifies an ordering of attributes explicitly at the schema level (as discussed in the data warehousing lecture)
  – For example, Street < City < State < Country
● Value generalisation hierarchy – specifies a hierarchy for the values of an attribute by explicit data grouping
  – For example, {Dickson, Lyneham, Watson} < Canberra (a code sketch follows the normalisation slides below)

Generalisation (2)
● Some concept hierarchies can be automatically generated
  – Based on the number of distinct values in each attribute
  – The attribute with the most distinct values is placed at the lowest level of the hierarchy
  – For example: Country (7 distinct values) < State (179 distinct values) < City (1,578 distinct values) < Street (235,412 distinct values)
  – Day, month, year and time attributes are exceptions!

Normalisation (1)
● Min-max [0-1] normalisation
  – Subtracting the minimum value and dividing by the difference between the maximum and minimum values
● Z-score normalisation
  – Subtracting the mean value and dividing by the standard deviation
● Robust normalisation
  – Subtracting the median value and dividing by the median absolute deviation

Normalisation (2)

Values         Min-max  Z-score  Robust
5  (smallest)   0       -1.13    -1.21
27              0.23    -0.28    -0.05
100 (largest)   1        2.54     3.79
59              0.57     0.95     1.63
28 (median)     0.24    -0.24     0
48              0.45     0.53     1.05
50              0.47     0.61     1.16
39              0.36     0.18     0.58
9               0.04    -0.98    -1
7               0.02    -1.06    -1.11
20              0.16    -0.55    -0.42
63              0.61     1.11     1.84
10              0.05    -0.94    -0.95
41              0.38     0.26     0.68
9               0.04    -0.98    -1

Normalisation (3)
● Logarithm normalisation
  – For attributes with a skewed distribution (such as income)
  – Transforms a broad range of numeric values to a narrower range of numeric values
  – Useful when data have outliers with extremely large variance
  – For example, using a base 10 logarithm function, a list of income values [$10,000, $100,000, $150,000, $1,000,000] is transformed into [4, 5, 5.18, 6]

Normalisation (4)
● Logarithm normalisation on the WindGustSpeed attribute in the Rattle Weather data set
  [Figure: distribution of WindGustSpeed under the logarithm transformation – not reproduced here]
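As a small illustration of value generalisation from the earlier slides, here is a minimal Python sketch that maps suburbs up the hierarchy {Dickson, Lyneham, Watson} < Canberra < ACT < Australia. The dictionaries and the `generalise` helper are hypothetical, not from the lecture:

```python
# Hand-built value generalisation hierarchy (hypothetical mappings)
suburb_to_city = {"Dickson": "Canberra", "Lyneham": "Canberra", "Watson": "Canberra"}
city_to_state = {"Canberra": "ACT", "Sydney": "NSW"}
state_to_country = {"ACT": "Australia", "NSW": "Australia"}

def generalise(value, level):
    """Replace a suburb with a coarser value: 'city', 'state' or 'country'."""
    city = suburb_to_city.get(value, value)
    if level == "city":
        return city
    state = city_to_state.get(city, city)
    if level == "state":
        return state
    return state_to_country.get(state, state)

print(generalise("Dickson", "city"))     # Canberra
print(generalise("Watson", "country"))   # Australia
```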
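The three normalisation methods can be reproduced in a few lines of Python. This sketch (my own illustration, not course code) recomputes the table above; note that the z-scores in the table correspond to the population standard deviation (`statistics.pstdev`):

```python
import statistics

values = [5, 27, 100, 59, 28, 48, 50, 39, 9, 7, 20, 63, 10, 41, 9]

# Min-max [0-1]: subtract the minimum, divide by (maximum - minimum)
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Z-score: subtract the mean, divide by the standard deviation
mean = statistics.mean(values)
std = statistics.pstdev(values)   # population std dev matches the table above
z_score = [(x - mean) / std for x in values]

# Robust: subtract the median, divide by the median absolute deviation
med = statistics.median(values)
mad = statistics.median([abs(x - med) for x in values])
robust = [(x - med) / mad for x in values]

for x, m, z, r in zip(values, min_max, z_score, robust):
    print(f"{x:3d}  {m:5.2f}  {z:5.2f}  {r:5.2f}")
# First line printed: "  5   0.00  -1.13  -1.21" - the first row of the table
```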
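Likewise, the base 10 logarithm example from the slide as a runnable sketch:

```python
import math

incomes = [10_000, 100_000, 150_000, 1_000_000]
log_incomes = [round(math.log10(x), 2) for x in incomes]
print(log_incomes)   # [4.0, 5.0, 5.18, 6.0]
```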
Attribute / feature selection (1)
● Reduce the number of features/attributes that are not significant for a certain data science project
● Select a minimum set of features/attributes such that
  – The probability of the different classes (or the information gain) given the values of these features is as close as possible to that given all the features
● Exponential number of choices
  – 2^d possible combinations of sub-features from d features

Attribute / feature selection (2)
● Step-wise forward selection
  – The best feature is selected first, then the next best feature conditional on the first is selected, and so on
● Step-wise backward elimination
  – Repeatedly eliminate the least useful feature
● Combining forward selection and backward elimination
  – Repeatedly select the best and eliminate the worst features
● Decision-tree induction (machine learning-based)

Attribute / feature construction
● A process of adding derived features to data (also known as constructive induction or attribute discovery)
● Construct new attributes/features based on existing attributes/features (see the sketch after the summary)
  – Combining or splitting existing raw attributes into new ones that have higher predictive power
  – For example, splitting a date attribute into month and year attributes for monthly and annual processing
  – Generating a new attribute of tax-exclusive price values

Data aggregation
● Compiling and summarising data to prepare new aggregated data
● The aim is to get more information about particular groups based on specific attributes, such as age, income, and location
  – For example, aggregated phone usage of customers by age and location in a phone call list data set
● Data can also be aggregated from multiple sources

Data reduction
● The volume of data increases with the growth of Big data
● Data reduction is a process of reducing data volume by choosing smaller forms of representation
  – Parametric methods: Construct a model fitting the data, estimate the model parameters, store only the parameters, and discard the data
  – Non-parametric methods: Based on histograms, clustering, and sampling

Parametric methods (1)
● Linear regression: fit the data to a straight line (Y = wX + b); the regression coefficients w and b determine the line using the data
● Multiple regression: Y = b0 + b1X1 + b2X2; allows many non-linear functions to be transformed into a linear form
● Log-linear models: approximate discrete multi-dimensional probability distributions
(To be covered in more detail in the data mining course)

Parametric methods (2)
● Example: Linear regression
  [Figure: data points with the fitted line y = x + 2 – not reproduced here]
(To be covered in more detail in the data mining course)

Histograms
● Binning: Divides data into buckets and stores a summary of each bucket (total, average, median)
● Binning methods (see the sketch after the summary):
  – Equal width – bins with an equal value range
  – Equal frequency/depth – bins with equal frequency (the same number of data points in each bin)

Clustering (1)
● Clustering: Partition/group data into clusters based on similarity, and store only a representation of each cluster (for example, only its centroid and diameter)
● Clustering techniques (a k-means sketch follows the summary):
  – Centroid-based – K-means: assigns data to the nearest of k cluster centres, such that the squared distances from the centres are minimised
  – Connectivity-based – Hierarchical clustering: data that belong to a child cluster also belong to the parent cluster
  – Density-based – DBSCAN: clusters data that satisfy a density criterion
  – Distribution-based – Gaussian mixture models: model data with a fixed number of Gaussian distributions that are iteratively optimised

Clustering (2)
● Example: Centroid-based clustering (k-means with k = 4)
  [Figure: data points and cluster centroids – not reproduced here]
(To be covered in more detail in the data mining course)

Sampling (1)
● Sampling: Generate a small sample to represent the whole data set
  – Choose a representative subset of the data
● Sampling methods (a stratified sampling sketch follows the summary):
  – Simple random sampling does not perform well on skewed data (for example, only a few people with a high salary)
  – Stratified sampling is an adaptive sampling method that divides the data into groups (known as strata), and a probability sample is drawn from each group

Sampling (2)
● Example: Random sampling
  [Figure: random sampling without replacement versus random sampling with replacement – not reproduced here]

Sampling (3)
● Example: Stratified sampling (sample 5 data points per group / cluster)
  [Figure not reproduced here]

Summary
● Data transformation, aggregation, and reduction are used in data science applications to improve the effectiveness and quality of data analysis and mining
● Data pre-processing includes:
  – Data cleaning, transformation, aggregation, and reduction
  – Data standardisation and parsing (will be covered in lecture 8)
  – Data integration (will be covered later in the course)
● Various methods have been developed for data pre-processing, but this is still an active area of research
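The following sketches illustrate several of the methods above; all are my own minimal illustrations rather than code from the course. First, attribute construction and aggregation with pandas; the records, column names, and the 10% tax rate are hypothetical:

```python
import pandas as pd

# Made-up phone-usage records (hypothetical column names and values)
calls = pd.DataFrame({
    "date":     pd.to_datetime(["2024-01-03", "2024-01-15", "2024-02-07"]),
    "location": ["ACT", "NSW", "ACT"],
    "price":    [11.0, 22.0, 33.0],   # tax-inclusive prices
    "minutes":  [12, 30, 7],
})

# Attribute construction: split the date attribute, derive a tax-exclusive price
calls["year"] = calls["date"].dt.year
calls["month"] = calls["date"].dt.month
calls["price_ex_tax"] = calls["price"] / 1.10   # assuming a 10% tax rate

# Aggregation: summarise phone usage per location and month
summary = calls.groupby(["location", "month"])["minutes"].agg(["count", "sum", "mean"])
print(summary)
```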
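Next, the two binning methods from the Histograms slide, applied to the 15 values used in the normalisation table:

```python
values = sorted([5, 27, 100, 59, 28, 48, 50, 39, 9, 7, 20, 63, 10, 41, 9])
k = 3  # number of bins

# Equal width: every bin covers the same range of values
width = (max(values) - min(values)) / k
equal_width = [[] for _ in range(k)]
for x in values:
    i = min(int((x - min(values)) / width), k - 1)  # clamp the maximum into the last bin
    equal_width[i].append(x)

# Equal frequency/depth: every bin holds (roughly) the same number of points
size = len(values) // k
equal_freq = [values[i * size:(i + 1) * size] for i in range(k - 1)]
equal_freq.append(values[(k - 1) * size:])  # the last bin takes any remainder

print(equal_width)  # [[5, 7, 9, 9, 10, 20, 27, 28], [39, 41, 48, 50, 59, 63], [100]]
print(equal_freq)   # [[5, 7, 9, 9, 10], [20, 27, 28, 39, 41], [48, 50, 59, 63, 100]]
```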
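A plain k-means implementation (centroid-based clustering). The toy points and the `kmeans` helper are hypothetical; for data reduction, only the returned centres need to be stored:

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Assign each 2-D point to its nearest centre, then move each
    centre to the mean of its assigned points; repeat."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialise with k random data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:             # assignment step
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centres]
            clusters[dists.index(min(dists))].append(p)
        for i, cl in enumerate(clusters):  # update step
            if cl:
                centres[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centres, clusters

# Two well-separated groups of made-up points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centres, clusters = kmeans(points, k=2)
print(centres)  # roughly [(1.33, 1.33), (8.33, 8.33)]
```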
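Finally, stratified sampling, drawing 5 records per stratum as in the last sampling slide. The records and the `stratified_sample` helper are made up for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(records, key, n_per_stratum, seed=1):
    """Divide records into strata by `key`, then draw a simple random
    sample (without replacement) of up to n_per_stratum from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(n_per_stratum, len(group))))
    return sample

# Made-up customer records, heavily skewed towards one location
records = [{"id": i, "location": loc}
           for i, loc in enumerate(["ACT"] * 20 + ["NSW"] * 70 + ["NT"] * 10)]
sample = stratified_sample(records, "location", 5)
print(len(sample))  # 15 records: 5 from each of the 3 strata
```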