
COMP3115 Exploratory Data Analysis and Visualization
Lecture 4: Data Preprocessing
 Why we need data preprocessing
 What is data preprocessing


 Structured data preprocessing
– Data Cleaning
– Data Transformation
 Text data (unstructured) preprocessing
– Text documents and text mining
– Tokenization
– Stemming
– TF-IDF Scheme

Why do we need to preprocess the data?
 Real world data is dirty
– Incomplete: lacking attribute values, e.g., Occupation = “ ”
– Noisy: containing errors or outliers, e.g., Salary = “-10”
– Inconsistent: containing discrepancies in codes or names
E.g., Age = “40” but Birthday = ’03/02/1990’
E.g., rating was “1, 2, 3”, now rating is “A, B, C”
E.g., discrepancy between duplicate records
[Example table of customer records illustrating a noisy value (Salary = “-10”), an inconsistent value (Age = “40” vs. Birthday = ’03/02/1990’), and a missing value (Occupation = “ ”)]

Why is Data Dirty?
 Incomplete data may come from
– “Not applicable” data value
E.g., annual income is not applicable to children
– People do not want to disclose the information
E.g., age, birthday
– Human/hardware/software problems
E.g., data was accidentally deleted
 Noisy data may come from
– Faulty data collection instruments
– Human/computer errors
 Inconsistent data may come from
– Different data sources

Why Data Preprocessing is important
 No quality data, no quality results
– Quality decisions must be based on quality data
E.g., missing data or incorrect data may cause incorrect or even misleading statistics
 Garbage In, Garbage Out
 In general, data pre-processing consumes more than 60% of a data analytics project effort.

Typical tasks in data preprocessing
 Data Cleaning
– Fill in missing values
– Smooth noisy data
– Identify and remove outliers
– Resolve inconsistencies
 Data transformation and data discretization
– Feature type conversion
– Normalization
Scaling attribute values to fall within a specific range (e.g., [0, 1])

Data Cleaning: Handling Incomplete (Missing) Values
 Data is not always available
– E.g., many data samples do not have recorded values for several attributes, such as customer age and customer income in sales data
 Missing data may be due to
– Equipment malfunction
– Data is not entered
– Certain data may not be considered at the time of data collection

Missing Data Example (Titanic Data)
 “titanic.csv” (https://www.kaggle.com/c/titanic/data)
Description for each feature contained in this dataset:
• Survival: Survival 0 = No, 1 = Yes
• Pclass: A proxy for economic status (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
• SibSp: number of siblings / spouses aboard the Titanic
• Parch: number of parents / children aboard the Titanic
• Ticket: Ticket number
• Fare: Passenger fare
• Cabin: Cabin number
• Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Missing Data Example (Titanic Data)
 Age, Cabin and Embarked contain missing values
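A quick way to check this in pandas is to count the null entries per column (a minimal sketch; it assumes the Kaggle file has been saved locally as titanic.csv):

import pandas as pd

# Load the Titanic data (assumes the Kaggle CSV is saved as titanic.csv)
df = pd.read_csv('titanic.csv')

# Count the missing values in each column; Age, Cabin and Embarked
# are the columns with null entries in this dataset
print(df.isnull().sum())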

How to handle missing data?
 Ignore the data sample with missing values
– Not a good solution, especially when data is scarce
 Ignore attributes with missing values
– Use only attributes (features) with all values
– May leave out important features
 Fill it in by
– A global constant: e.g., “unknown”
– Attribute mean/median/mode
– Predict the missing value (data imputation)
Estimate gender based on first name (name gender)
Estimate age based on first name (name popularity)
Build a predictive model based on other attributes/features

Handling missing value by mean/median/mode
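For example, the missing Age values in the Titanic data can be filled with a single summary statistic using pandas fillna (a minimal sketch, assuming the Kaggle column names and a local titanic.csv):

import pandas as pd

df = pd.read_csv('titanic.csv')

# Fill missing Age values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# The median or mode can be used in the same way,
# e.g. the mode for the categorical Embarked column
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])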

A smarter way to fill with mean/median/mode
 Suppose the record with the missing Age has Pclass = ‘3’ and Sex = ‘female’. Replacing the missing value with the mean Age of that group (21.75) may be better than using the overall mean.
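One way to implement this group-wise filling is to compute the mean Age within each (Pclass, Sex) group and use it where Age is missing (a sketch, again assuming the Kaggle column names):

import pandas as pd

df = pd.read_csv('titanic.csv')

# Replace each missing Age with the mean Age of passengers that share
# the same Pclass and Sex (e.g., 3rd-class female passengers)
df['Age'] = df['Age'].fillna(
    df.groupby(['Pclass', 'Sex'])['Age'].transform('mean')
)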

Handing missing value by prediction model
 Replace missing value by predicted values by a prediction model (e.g., a regression model)
 Requires attribute dependencies
 Can be used for handling both missing data and noisy data.
 Prediction models will be discussed in depth in future lectures.

Example of handling missing value by regression
weight = 0.327*height + 0.6
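A minimal sketch of this idea with scikit-learn's LinearRegression; the height/weight values below are made up for illustration and are not the slide's dataset:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: weight is missing for one record (illustrative values only)
df = pd.DataFrame({'height': [150, 160, 170, 180, 165],
                   'weight': [49.6, 52.9, 56.2, 59.5, None]})

# Fit the regression model on the complete records only
known = df.dropna(subset=['weight'])
model = LinearRegression().fit(known[['height']], known['weight'])

# Predict the missing weight from height
missing = df['weight'].isnull()
df.loc[missing, 'weight'] = model.predict(df.loc[missing, ['height']])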

Example of handling missing value by mean
Each missing value is filled with the same number: the mean value of weight.

Data Cleaning: Handling Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
– Errors in data collection devices
– Wrong input
– Technology limitation

How to Handle Noisy Data
 Binning
– First sort the data and partition it into bins
– Smooth by bin mean/median/boundaries
 Regression
– Smooth by fitting the data into regression functions
 Clustering
– Detect and remove outliers

Simple Discretization Methods: Binning
 Equal-width (distance) Partitioning
– Divides the range into N intervals of equal size: uniform grid
– Suppose min and max are the lowest and highest values of the attribute, the width of intervals will be: w = (max – min)/N
– The most straightforward method
– Outliers may dominate the presentation
– Skewed data is not handled well
 Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approximately the same number of samples
– Skewed data is handled well

Example of Equal-width Binning for data smoothing
 Suppose we have the following values for temperature and we want to divide them into 7 bins
[64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
 Partition data into bins
– Compute the width w = (85-64)/7 = 3
[64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
[64, 67) [67, 70) [70, 73) [73, 76) [76, 79) [79, 82) [82, 85]

Example of Equal-width Binning for data smoothing
[64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
[64, 67) [67, 70) [70, 73) [73, 76) [76, 79) [79, 82) [82, 85]
 Smoothing by bin means
– Each value in a bin is replaced by the mean value of the bin
[64.5, 64.5, 68.5, 68.5, 71.25, 71.25, 71.25, 71.25, 75, 75, 80.5, 80.5, 84, 84]
– Similarly, smoothing by bin medians can be used, in which each bin value is replaced by the bin median.

Example of Equal-width Binning for data smoothing: Smoothing by bin boundaries
[64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
[64, 67) [67, 70) [70, 73) [73, 76) [76, 79) [79, 82) [82, 85]
 Smoothing by bin boundaries
– Bin boundaries are the minimum and maximum values in a given bin.
– Each bin value is then replaced by the closest boundary value
[64, 65, 68, 69, 70, 70, 72, 72, 75, 75, 80, 81, 83, 85]
In general, the larger the width, the greater the effect of the smoothing.
Note that the bin boundaries for the first bin are 64 and 65, NOT 64 and 67.
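The equal-width partitioning and bin-mean smoothing above can be reproduced with NumPy and pandas (a minimal sketch; np.digitize assigns each value to its left-closed bin):

import numpy as np
import pandas as pd

temps = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

# Equal-width partitioning with w = (85 - 64) / 7 = 3; the interior edges
# 67, 70, 73, 76, 79, 82 give the bins [64,67), [67,70), ..., [82,85]
bin_ids = np.digitize(temps, bins=[67, 70, 73, 76, 79, 82])

# Smoothing by bin means: replace each value by the mean of its bin
smoothed = temps.groupby(bin_ids).transform('mean')
print(smoothed.tolist())
# [64.5, 64.5, 68.5, 68.5, 71.25, 71.25, 71.25, 71.25, 75.0, 75.0, 80.5, 80.5, 84.0, 84.0]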

Equal-width Binning
 Advantage
– Simple and easy to implement
– Produce a reasonable abstraction of data
 Disadvantage
– Where does N come from?
– Sensitive to outliers

Example of Equal-depth Binning
 Divides the range into N intervals, each containing approximately the same number of samples
 E.g., we have the following values for prices and we want to divide them into 3 bins using equal-depth binning
[4, 8, 15, 21, 21, 24, 25, 28, 34]
 Partition into 3 bins (equal frequency)
[4, 8, 15] [21, 21, 24] [25, 28, 34]
 Smooth by bin means
[9, 9, 9, 22, 22, 22, 29, 29, 29]
 Smooth by bin boundaries
[4, 4, 15, 21, 21, 24, 25, 25, 34]
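A sketch of the same equal-depth smoothing in pandas, where qcut performs the frequency-based partitioning:

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-depth (frequency) partitioning into 3 bins of 3 values each
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: each value is replaced by the mean of its bin
smoothed = prices.groupby(bins).transform('mean')
print(smoothed.tolist())   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]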

Handling Noisy Data by Regression Analysis
 Data smoothing can also be done by regression analysis.

Handling Noisy Data by Clustering Analysis
 Outliers may be detected by clustering analysis.
 Detect and remove outliers
 Clustering algorithms will be discussed in depth in future lectures.

Data Transformation
 Data Transformation
– A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new values
– Feature type conversion
– Normalization
– Feature construction

Feature Type Conversion
 Some tools can only deal with nominal values while others can only deal with numeric values.
 Features have to be converted to satisfy the requirements of different tools
– Numeric -> Nominal: binning
– Nominal -> Numeric: one-hot encoding
– Ordinal -> Numeric (order matters), e.g., grades:
A -> 4.0, A- -> 3.7, B+ -> 3.3, B -> 3.0
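A tiny pandas sketch of the ordinal conversion, using a made-up Grade column:

import pandas as pd

df = pd.DataFrame({'Grade': ['A', 'B+', 'B', 'A-']})

# Ordinal -> numeric: map each grade to a grade point, preserving the order
grade_points = {'A': 4.0, 'A-': 3.7, 'B+': 3.3, 'B': 3.0}
df['GradePoint'] = df['Grade'].map(grade_points)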

Nominal to Numeric (one-hot encoding)
[Example table: a nominal Color attribute expanded into binary indicator columns such as Color_green and Color_Blue]
• One of the ways to encode the nominal variable to numeric is one-hot encoding
• With one-hot encoding, a nominal feature becomes a vector whose size is the number of possible choices (categories) for that feature

One-hot encoding in Python
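A minimal sketch with pandas get_dummies, using a made-up Color column like the one in the table above:

import pandas as pd

df = pd.DataFrame({'Color': ['green', 'blue', 'red', 'green']})

# One-hot encoding: one binary indicator column per possible category
one_hot = pd.get_dummies(df, columns=['Color'], dtype=int)
print(one_hot)
#    Color_blue  Color_green  Color_red
# 0           0            1          0
# 1           1            0          0
# 2           0            0          1
# 3           0            1          0

scikit-learn's OneHotEncoder can be used instead when the same encoding must be reapplied to new data later.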

Normalization
 Data have attribute values
 Can we compare these attribute values?
 E.g., considering the following two records, which one is more similar to (1.65m, 50kg)?
– (1.40m, 55kg)
– (1.70m, 56kg)
 Different attributes take very different ranges of values. When computing distances/similarities, differences in an attribute with a small range (height) are swamped by differences in an attribute with a large range (weight). We need to normalize the data to make different attributes comparable.

Normalization
 For distance-based methods, normalization helps prevent attributes with large ranges from outweighing attributes with small ranges
 Scale the attribute values to a small specified range
 Normalization Methods
– Min-Max normalization (normalized by range)
– Z-score normalization
– Normalization by decimal scaling

Normalization: min-max Normalization
 Min-max normalization
– Performs a linear transformation on the original data.
– Suppose min and max are the minimum and maximum values of an attribute, and we want to normalize the attribute values to [min_new, max_new]. Min-max normalization maps a value x_i to x_i' by
x_i' = (x_i − min) / (max − min) × (max_new − min_new) + min_new
 E.g., suppose that the minimum and maximum values for the feature income are $12,000 and $98,000. We would like to map income to the range [0.0, 1.0]. By min-max normalization, what is the mapped value for $73,600?
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716

Deriving the formula for min-max Normalization
Find a linear transform x' = a*x + b
 We know min is mapped to min_new and max is mapped to max_new:
– min_new = a*min + b
– max_new = a*max + b
– Therefore, a = (max_new − min_new)/(max − min);
b = min_new − (max_new − min_new)*min/(max − min)
 Substituting a and b gives the min-max normalization formula:
x_i' = (x_i − min) / (max − min) × (max_new − min_new) + min_new

min-max Normalization problem
 Min-max normalization will encounter an “out-of-bounds” error if a future input value falls outside of the original data range.
 In some cases, we may not know the minimum and maximum values of an attribute.

min-max normalization in Python
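A minimal sketch of min-max normalization applied directly with pandas, plus the equivalent scikit-learn call (the income values are illustrative):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

income = pd.Series([12000, 30000, 54000, 73600, 98000])

# Min-max normalization to [0, 1] using the formula from the previous slides
normalized = (income - income.min()) / (income.max() - income.min())
print(normalized.round(3).tolist())   # 73,600 maps to about 0.716

# Equivalent with scikit-learn (expects a 2-D array)
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(income.to_frame())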

Normalization: Z-score Normalization
 Z-score normalization
– The values of an attribute are normalized to a scale with a mean of 0 and standard deviation of 1.
– Z-score normalization maps a value x_i to x_i' by
x_i' = (x_i − x̄) / σ
where x̄ is the mean and σ is the standard deviation of the original values of the attribute.
 E.g., suppose that the mean and standard deviation for the feature income are $54,000 and $16,000. What is the mapped value for $73,600?
(73,600 − 54,000) / 16,000 = 1.225

Z-score normalization in Python
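A minimal sketch of z-score normalization with pandas and scikit-learn (illustrative values; note that pandas .std() uses the sample standard deviation by default, while StandardScaler uses the population standard deviation):

import pandas as pd
from sklearn.preprocessing import StandardScaler

income = pd.Series([12000, 30000, 54000, 73600, 98000])

# Z-score normalization: subtract the mean, divide by the standard deviation
z = (income - income.mean()) / income.std(ddof=0)

# Equivalent with scikit-learn (expects a 2-D array)
z_sklearn = StandardScaler().fit_transform(income.to_frame())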

Normalization: Decimal Scaling
 Decimal Scaling
– The values of an attribute are normalized by moving the decimal point.
– The number of decimal points moved depends on the maximum absolute value of the attribute.
– Decimal scaling maps a value x_i to x_i' by
x_i' = x_i / 10^j
where j is the smallest integer such that max(|x_i'|) < 1
 E.g., suppose that the recorded values of an attribute range from −986 to 917. The maximum absolute value is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.

Summary
 Real world data is dirty
– Incomplete: lacking attribute values
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes or names
 Garbage In, Garbage Out
– Data pre-processing consumes more than 60% of a data analytics project effort.
 Data Cleaning
– Fill in missing values
– Smooth noisy data
– Identify and remove outliers
 Data Transformation
– Feature type conversion
– Normalization

Text Documents
 A digital text document consists of a sequence of words and other symbols, e.g., punctuation.
 The individual words and other symbols are known as tokens or terms.
 A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.

Text Mining
 The process of deriving information from text.
 It usually requires a preprocessing of the input data.
 “...finding interesting regularities in large textual datasets...” (adapted from )
– ...where interesting means: non-trivial, hidden, previously unknown and potentially useful
 “...finding semantic and abstract information from the surface form of textual data...”

Tokenization
 Fundamental to Natural Language Processing (NLP), Information Retrieval, Deep Learning and AI
 Parsing (chopping up) the document into basic units that are candidates for later analysis
– What parts of text to use and what not
 Issues with
– Punctuation
– Special characters
– Equations
– Languages
– Normalization (often by stemming)

Stemming
 Different forms of the same word are usually problematic for text data analysis, because they have different spelling and similar meaning (e.g., learns, learned, learning, ...)
 Stemming is a process of transforming a word into its stem (normalized form)
– ...stemming provides an inexpensive mechanism to merge different forms of the same word

Vector Space Model
 Given a collection of documents D, let V = {t_1, t_2, ..., t_|V|} be the set of distinctive terms in the collection, where t_i is a term.
 The set V is usually called the vocabulary of the collection, and |V| is its size, i.e., the number of terms in V.
 A weight w_ij > 0 is associated with each term t_i of a document d_j ∈ D, quantifying the level of importance of t_i in document d_j.
 Each document d_j is thus represented with a document vector
d_j = (w_1j, w_2j, ..., w_|V|j)

Simple Term Frequency Scheme
 Term Frequency (TF) Scheme
– w_ij = the number of times that t_i appears in document d_j, denoted by f_ij.
– Shortcoming: if a term appears in a large number of documents in the collection, it is probably not important or not discriminative. But this is not considered in the TF scheme.

Simple Term Frequency Scheme
d1: “Text analysis is fun. I like doing text analysis.”
d2: “I like doing text analysis. I like this. Puppies like this.”
d3: “I like puppies, they are fun.”
d4: “I like this blog post. This is fun.”
d1 = (2, 2, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
d2 = (1, 1, 0, 0, 2, 3, 1, 1, 0, 0, 2, 0, 0)
…
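Raw term-frequency vectors like these can be built with scikit-learn's CountVectorizer (a sketch; its default tokenizer lowercases the text and drops single-character tokens such as “I”, so the counts will not match the slide exactly):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Text analysis is fun. I like doing text analysis.",
        "I like doing text analysis. I like this. Puppies like this.",
        "I like puppies, they are fun.",
        "I like this blog post. This is fun."]

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)           # documents x vocabulary term counts
print(vectorizer.get_feature_names_out())     # the vocabulary V
print(tf.toarray())                           # raw term frequencies f_ij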

VSM: TF-IDF Scheme
 N = total number of documents
 df_i = number of documents in which term t_i appears at least once (document frequency)
 f_ij = raw frequency count of term t_i in document d_j
 Normalized term frequency of t_i in d_j:
– tf_ij = f_ij / max{f_1j, f_2j, ..., f_|V|j}
 Inverse document frequency (IDF) of term t_i:
– idf_i = log(N / df_i)
 Final TF-IDF term weight for document d_j:
– w_ij = tf_ij × idf_i
 TF-IDF term weight for query q:
– w_iq = [f_iq / max{f_1q, f_2q, ..., f_|V|q}] × log(N / df_i)
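A small NumPy sketch of this weighting scheme, using a made-up term-by-document count matrix (note that library implementations such as scikit-learn's TfidfVectorizer use slightly different TF and IDF variants):

import numpy as np

# Raw frequency counts f_ij: rows are terms, columns are documents (toy values)
f = np.array([[2, 1, 0],
              [2, 1, 0],
              [1, 3, 1]], dtype=float)

N = f.shape[1]                          # total number of documents
df = np.count_nonzero(f, axis=1)        # document frequency df_i of each term
tf = f / f.max(axis=0, keepdims=True)   # normalized TF: f_ij / max_k f_kj
idf = np.log(N / df)                    # inverse document frequency log(N / df_i)
w = tf * idf[:, None]                   # final TF-IDF weights w_ij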
