COMP3115 Exploratory Data Analysis and Visualization
Lecture 4: Data Preprocessing
Why we need data preprocessing
What is data preprocessing
Structured data preprocessing
– Data cleaning
– Data transformation
Text data (unstructured) preprocessing
– Text documents and text mining
– Tokenization
– Stemming
– TF-IDF scheme
Why do we need to preprocess the data?
Real world data is dirty
– Incomplete: lacking attribute values e.g., Occupation = “ ”
– Noisy: containing errors or outliers E.g., Salary = “-10”
– Inconsistent: containing discrepancies in codes or names
E.g., Age = “40” but Birthday = ’03/02/1990’
E.g., ratings were recorded as “1, 2, 3” and are now recorded as “A, B, C”
E.g., discrepancies between duplicate records
(Example table: customer records with Occupation and Birthday attributes, highlighting a noisy value, an inconsistent value and a missing value.)
Why Data is Dirty?
Incomplete data may come from
– “Not applicable” data value
E.g., annual income is not applicable to children
– People do not want to disclose the information
E.g., age, birthday
– Human/hardware/software problems
E.g., data was accidentally deleted
Noisy Data may come from
– Faulty data collection instruments
– Human/computer errors
Inconsistent data may come from
– Different data sources
Why Data Preprocessing is important
No quality data, no quality results
– Quality decisions must be based on quality data
E.g., missing data or incorrect data may cause incorrect or even misleading statistics
Garbage In, Garbage Out
In general, data pre-processing consumes more than 60% of a data analytics project effort.
Typical tasks in data preprocessing
Data Cleaning
– Fill in missing values
– Smooth noisy data
– Identify and remove outliers
– Resolve inconsistencies
Data transformation and data discretization
– Feature type conversion
– Normalization
Scaling attribute values to fall within a specific range (e.g., [0, 1])
Data Cleaning: Handling Incomplete (Missing) Values
Data is not always available
– E.g., many data samples do not have recorded values for several attributes, such as
customer age, customer income in sales data
Missing data may be due to
– Equipment malfunction
– Data is not entered
– Certain data may not be considered at the time of data collection
Missing Data Example (Titanic Data)
“titantic.csv” (https://www.kaggle.com/c/titanic/data)
Description for each feature contained in this dataset:
• Survival: Survival 0 = No, 1 = Yes
• Pclass: A proxy for economic status (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
• SibSp: number of siblings / spouses aboard the Titanic
• Parch: number of parents / children aboard the Titanic
• Ticket: Ticket number
• Fare: Passenger fare
• Cabin: Cabin number
• Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
Missing Data Example (Titanic Data)
Age, Cabin and Embarked contain missing values
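A quick pandas check (assuming the dataset has been downloaded locally, e.g. as "titanic.csv") shows which columns contain missing values:

```python
import pandas as pd

# Load the Titanic data (file name assumed; adjust the path to your local copy)
df = pd.read_csv("titanic.csv")

# Count missing values in each column
print(df.isnull().sum())
```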
How to handle missing data?
Ignore the data sample with missing values
– Not a good solution, especially when data is scarce
Ignore attributes with missing values
– Use only attributes (features) with all values
– May leave out important features
Fill in it by
– A global constant: e.g., “unknown”
– Attribute mean/median/mode
– Predict the missing value (data imputation)
Estimate gender based on first name (name gender)
Estimate age based on first name (name popularity)
Build a predictive model based on other attributes/features
Handling missing values by mean/median/mode
A smarter way to fill with mean/median/mode
If the passenger’s Pclass is ‘3’ and Sex is ‘female’, replacing the missing Age with the mean of that group (21.75) may be better than using the overall mean.
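A minimal pandas sketch of both strategies (the file name "titanic.csv" and the derived column names are assumptions; the Kaggle column names Age, Pclass and Sex are used):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed local copy of the Kaggle data

# Naive fill: overall mean age for every missing Age
df["Age_overall"] = df["Age"].fillna(df["Age"].mean())

# Smarter fill: mean age within each (Pclass, Sex) group, so a 3rd-class
# female passenger is filled with the mean age of 3rd-class female passengers
group_mean = df.groupby(["Pclass", "Sex"])["Age"].transform("mean")
df["Age_by_group"] = df["Age"].fillna(group_mean)
```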
Handling missing values by a prediction model
Replace missing values with values predicted by a prediction model (e.g., a regression model)
Requires attribute dependencies
Can be used for handling both missing data and noisy data.
Prediction models will be discussed in depth in future lectures.
Example of handling missing value by regression
weight = 0.327*height + 0.6
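A minimal sketch of regression-based imputation (the height/weight values below are made up for illustration, so the fitted coefficients will not be exactly 0.327 and 0.6; scikit-learn's LinearRegression is assumed to be available):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical height (m) / weight (kg) data; two weights are missing
df = pd.DataFrame({
    "height": [1.50, 1.60, 1.65, 1.70, 1.80, 1.75],
    "weight": [48.0, 53.0, None, 56.0, 60.0, None],
})

known = df[df["weight"].notna()]

# Fit weight as a linear function of height using the complete rows
model = LinearRegression().fit(known[["height"]], known["weight"])

# Impute the missing weights with the model's predictions
missing_mask = df["weight"].isna()
df.loc[missing_mask, "weight"] = model.predict(df.loc[missing_mask, ["height"]])
print(df)
```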
Example of handling missing value by mean
Mean value of weight
Data Cleaning: Handling Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
– Errors in data collection devices
– Wrong input
– Technology limitation
How to Handle Noisy Data
Binning
– First sort data and partition into bins
– Smooth by bin mean/median/boundaries
Regression
– Smooth by fitting the data into regression functions
Clustering
– Detect and remove outliers
Simple Discretization Methods: Binning
Equal-width (distance) Partitioning
– Divides the range into N intervals of equal size: uniform grid
– Suppose min and max are the lowest and highest values of the attribute, the width of intervals will be: w = (max – min)/N
– The most straightforward method
– Outliers may dominate presentation
– Skewed data is not handled well
Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approximately the same number of samples
– Skewed data is also handled well
Example of Equal-width Binning for data smoothing
Suppose we have the following values for temperature and we want to divide them into 7 bins
[64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
Partition data into bins
– Compute the width w = (85-64)/7 = 3
[64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
[64, 67) [67, 70) [70, 73) [73, 76) [76, 79) [79, 82) [82, 85]
Example of Equal-width Binning for data smoothing
[64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
[64, 67) [67, 70) [70, 73) [73, 76) [76, 79) [79, 82) [82, 85]
Smoothing by bin means
– Each value in a bin is replaced by the mean value of the bin
[64.5, 64.5, 68.5, 68.5, 71.25, 71.25, 71.25, 71.25, 75, 75, 80.5, 80.5, 84, 84]
– Similarly, smoothing by bin medians can be used, in which each bin value is replaced by the bin median.
Example of Equal-width Binning for data smoothing: Smoothing by bin boundaries
[64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
[64, 67) [67, 70) [70, 73) [73, 76) [76, 79) [79, 82) [82, 85]
Smoothing by bin boundaries
– Bin boundaries are the minimum and maximum values in a given bin.
– Each bin value is then replaced by the closest boundary value.
[64, 65, 68, 69, 70, 70, 72, 72, 75, 75, 80, 81, 83, 85]
In general, the larger the width, the greater the effect of the smoothing.
Note that the bin boundaries for the first bin are 64 and 65, NOT 64 and 67.
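A minimal pandas sketch of this example (the bin edges are written out explicitly so that they match the left-closed bins above; pd.cut's default equal-width bins would be right-closed and slightly different):

```python
import pandas as pd

temps = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

# Bin edges reproducing the lecture's bins [64, 67), [67, 70), ..., [82, 85];
# the last edge is 86 so that 85 falls inside the final bin
edges = [64, 67, 70, 73, 76, 79, 82, 86]
bins = pd.cut(temps, bins=edges, right=False)

# Smoothing by bin means: each value is replaced by the mean of its bin
smoothed = temps.groupby(bins, observed=True).transform("mean")
print(smoothed.tolist())
# [64.5, 64.5, 68.5, 68.5, 71.25, 71.25, 71.25, 71.25, 75.0, 75.0, 80.5, 80.5, 84.0, 84.0]
```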
Equal-width Binning
Advantage
– Simple and easy to implement
– Produce a reasonable abstraction of data
Disadvantage
– Where does N come from?
– Sensitive to outliers
Example of Equal-depth Binning
Divides the range into N intervals, each containing approximately the same number of samples
E.g., we have the following values for prices and we want to divide them into 3 bins using equal-depth binning
[4, 8, 15, 21, 21, 24, 25, 28, 34]
Partition into 3 bins (equal frequency)
[4, 8, 15, 21, 21, 24, 25, 28, 34]
Smooth by bin means
[9, 9, 9, 22, 22, 22, 29, 29, 29]
Smooth by bin boundaries
[4, 4, 15, 21, 21, 24, 25, 25, 34]
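A minimal pandas sketch of this example using pd.qcut for equal-depth partitioning (for these nine prices the resulting bins match the hand partition above, but qcut's quantile-based edges can differ from a hand partition for other data):

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-depth (equal-frequency) partitioning into 3 bins of ~3 values each
bins = pd.qcut(prices, q=3)

# Smoothing by bin means
print(prices.groupby(bins, observed=True).transform("mean").tolist())
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]

# Smoothing by bin boundaries: replace each value with the closer of the
# bin's minimum and maximum value (ties go to the lower boundary)
def to_boundary(group):
    lo, hi = group.min(), group.max()
    return group.apply(lambda v: lo if (v - lo) <= (hi - v) else hi)

print(prices.groupby(bins, observed=True).transform(to_boundary).tolist())
# [4, 4, 15, 21, 21, 24, 25, 25, 34]
```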
Handling Noisy Data by Regression Analysis
Data smoothing can also be done by regression analysis.
Handling Noisy Data by Clustering Analysis
Outliers may be detected by clustering analysis.
Detect and remove outliers
Clustering algorithms will be discussed in depth in future lectures.
Data Transformation
Data Transformation
– A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new values
– Feature type conversion
– Normalization
– Feature construction
Feature Type Conversion
Some tools can only deal with nominal values, while others can only deal with numeric values.
Features have to be converted to satisfy the requirements of different tools
– Numeric -> nominal: binning
– Nominal -> numeric: one-hot encoding
– Ordinal -> numeric (order matters), e.g.,
  A -> 4.0
  A- -> 3.7
  B+ -> 3.3
  B -> 3.0
Nominal to Numeric (one-hot encoding)
(Example table: a nominal Color feature expanded into indicator columns such as Color_green and Color_Blue.)
• One of the ways to encode the nominal variable to numeric is one-hot encoding
• With one-hot encoding, a nominal feature becomes a vector whose size is the number of possible choices for that feature
One-hot encoding in Python
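A minimal sketch with pandas.get_dummies (the Color values are hypothetical; scikit-learn's OneHotEncoder is a common alternative):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["green", "blue", "red", "green"]})

# Each possible Color value becomes its own indicator column
one_hot = pd.get_dummies(df, columns=["Color"])
print(one_hot)
```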
Normalization
Data attributes take values on very different scales.
Can we compare these attribute values directly?
E.g., considering the following two records, which one is more similar to (1.65 m, 50 kg)?
– (1.40 m, 55 kg)
– (1.70 m, 56 kg)
Different attributes take very different ranges of values. When computing distances/similarities, attributes with small values will be dominated by those with large values. We need to normalize the data to make different attributes comparable.
Normalization
For distance-based methods, normalization helps to prevent attributes with large ranges from outweighing attributes with small ranges
Scale the attribute values to a small specified range
Normalization Methods
– Min-max normalization (normalization by range)
– Z-score normalization
– Normalization by decimal scaling
Normalization: min-max Normalization
Min-max normalization
– Performs a linear transformation on the original data.
– Suppose min and max are the minimum and maximum values of an attribute, and we want to normalize the attribute values to [min_new, max_new]. Min-max normalization maps a value x_i to x_i' by
x_i' = (x_i − min) / (max − min) × (max_new − min_new) + min_new
E.g., suppose that the minimum and maximum values for the feature income are $12,000 and $98,000. We would like to map income to the range [0.0, 1.0]. By min-max normalization, what is the mapped value for $73,600?
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716
Deriving the formula for min-max Normalization
(Figure: the original range [min, max] is mapped linearly onto [min_new, max_new].)
Find a linear transform x' = a*x + b
We know min is mapped to min_new and max is mapped to max_new:
– min_new = a*min + b
– max_new = a*max + b
– Therefore, a = (max_new − min_new) / (max − min)
– and b = min_new − (max_new − min_new) * min / (max − min)
Substituting a and b back into x' = a*x + b gives
x_i' = (x_i − min) / (max − min) × (max_new − min_new) + min_new
min-max Normalization problem
Min-max normalization will encounter an “out-of-bounds” error if a future input value falls outside the original data range.
In some cases, we may not know the minimum and maximum values of an attribute.
min-max normalization in Python
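A minimal sketch (the income values are made up apart from the $12,000 / $98,000 / $73,600 figures from the example above; scikit-learn's MinMaxScaler is assumed to be available):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

income = pd.DataFrame({"income": [12000, 45000, 73600, 98000]})

# scikit-learn: fit learns min/max, transform applies the min-max formula
scaler = MinMaxScaler(feature_range=(0.0, 1.0))
print(scaler.fit_transform(income))   # 73600 maps to roughly 0.716

# Equivalent computation by hand
col = income["income"]
print((col - col.min()) / (col.max() - col.min()))
```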
Normalization: Z-score normalization
Z-score normalization
– The values of an attribute are normalized to a scale with a mean value of 0 and standard deviation of 1.
– Z-score normalization maps a value xi to xi’ by
x_i' = (x_i − x̄) / σ
where x̄ is the mean and σ is the standard deviation of the original attribute x.
E.g., suppose that the mean and standard deviation for the feature income
are $54,000 and $16,000. What is the mapped value for $73,600?
(73,600 − 54,000) / 16,000 = 1.225
Z-score normalization in Python
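A minimal sketch (the income values are made up; note that scikit-learn's StandardScaler uses the population standard deviation, ddof=0):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

income = pd.DataFrame({"income": [30000, 42000, 54000, 73600, 98000]})

# scikit-learn standardization: mean 0, standard deviation 1
print(StandardScaler().fit_transform(income))

# Equivalent computation by hand (ddof=0 matches StandardScaler)
col = income["income"]
print((col - col.mean()) / col.std(ddof=0))
```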
Normalization: Decimal Scaling
Decimal Scaling
– The values of an attribute are normalized by moving the decimal point.
– The number of places the decimal point is moved depends on the maximum absolute value of the attribute.
– Decimal scaling maps a value xi to xi’ by
x_i' = x_i / 10^j
where j is the smallest integer such that max(|x_i|) / 10^j < 1.
E.g., suppose that the recorded values of an attribute range from − 986 to 917. The maximum absolute value is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., 𝑗 = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
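A minimal NumPy sketch of this example:

```python
import numpy as np

# Attribute values ranging from -986 to 917, as in the example above
x = np.array([-986, 917])

# j is the smallest integer such that max(|x|) / 10**j < 1
j = int(np.ceil(np.log10(np.abs(x).max() + 1)))   # +1 keeps exact powers of 10 below 1
x_scaled = x / 10**j
print(j, x_scaled)   # 3 [-0.986  0.917]
```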
Real world data is dirty
– Incomplete: lacking attribute values
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes or names
Garbage In, Garbage Out
– Data pre-processing consumes more than 60% of a data analytics project effort.
Data Cleaning
– Fill in missing values;
– Smooth noisy data
– Identify and remove outliers
Data Transformation
– Feature type conversion
– Normalization
Text Documents
A digital text document consists of a sequence of words and other symbols, e.g., punctuation.
The individual words and other symbols are known as tokens or terms. A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
Text Mining
The process of deriving information from text.
It usually requires a preprocessing of the input data.
“...finding interesting regularities in large textual datasets...” (adapted from )
– ...where interesting means: non-trivial, hidden, previously unknown and potentially useful
“...finding semantic and abstract information from the surface form of textual data...”
Tokenization
Fundamental to Natural Language Processing (NLP), Information Retrieval, Deep Learning and AI
Parsing (chopping up) the document into basic units that are candidates for later analysis
– What parts of text to use and what not
Issues with tokenization:
– Punctuation
– Special characters
– Equations
– Languages
– Normalization (often by stemming)
Stemming
Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g., learns, learned, learning, ...)
Stemming is a process of transforming a word into its stem (normalized form)
– Stemming provides an inexpensive mechanism to merge these different forms of a word.
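A minimal NLTK sketch of tokenization plus Porter stemming (the sentence is made up; the punkt tokenizer model must be downloaded once):

```python
import nltk
from nltk.stem import PorterStemmer

# nltk.download("punkt")  # one-time download of the tokenizer model

text = "She learns quickly, she learned fast, and she is still learning."

tokens = nltk.word_tokenize(text.lower())   # chop the text into word/punctuation tokens
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]   # learns, learned, learning all become 'learn'

print(tokens)
print(stems)
```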
Vector Space Model
Given a collection of documents D, let V = {t_1, t_2, ..., t_|V|} be the set of distinct terms in the collection, where t_i is a term.
The set V is usually called the vocabulary of the collection, and |V| is its size, i.e., the number of terms in V.
A weight w_ij > 0 is associated with each term t_i of a document d_j ∈ D, quantifying the level of importance of t_i in document d_j.
Each document d_j is thus represented with a document vector
d_j = (w_1j, w_2j, ..., w_|V|j)
Simple Term Frequency Scheme
Term Frequency (TF) Scheme
– w_ij = the number of times that t_i appears in document d_j, denoted by f_ij.
– Shortcoming: If a term appears in a large number of documents in the collection, it is probably not important or not discriminative. But this is not considered in TF Scheme.
Simple Term Frequency Scheme
Text analysis is fun. I like doing text analysis.
I like doing text analysis. I like this. Puppies like this.
I like puppies, they are fun.
I like this blog post. This is fun.
d_1 = (2, 2, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
d_2 = (1, 1, 0, 0, 2, 3, 1, 1, 0, 0, 2, 0, 0)
...
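A minimal scikit-learn sketch that builds such term-frequency vectors for the four example documents (CountVectorizer's default tokenizer drops single-character tokens such as "I", so its vocabulary ordering and counts may differ slightly from the vectors above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Text analysis is fun. I like doing text analysis.",
    "I like doing text analysis. I like this. Puppies like this.",
    "I like puppies, they are fun.",
    "I like this blog post. This is fun.",
]

# Build the vocabulary and the document-term count matrix (TF scheme)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```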
VSM: TF-IDF Scheme
N = total number of documents in the collection
df_i = number of documents in which term t_i appears at least once (document frequency)
f_ij = raw frequency count of term t_i in document d_j
Normalized term frequency of t_i in d_j:
– tf_ij = f_ij / max{f_1j, f_2j, ..., f_|V|j}
Inverse document frequency (IDF) of term t_i:
– idf_i = log(N / df_i)
Final TF-IDF term weight of t_i in document d_j:
– w_ij = tf_ij × idf_i
TF-IDF term weight for a query q:
– w_iq = (f_iq / max{f_1q, f_2q, ..., f_|V|q}) × log(N / df_i)
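A minimal sketch that computes TF-IDF weights exactly as defined above for the four example documents (the documents are tokenized by hand here; note that scikit-learn's TfidfVectorizer uses a slightly different IDF definition, so its weights would differ):

```python
import math
from collections import Counter

# The four example documents, already tokenized and lowercased (punctuation removed)
docs = [
    "text analysis is fun i like doing text analysis".split(),
    "i like doing text analysis i like this puppies like this".split(),
    "i like puppies they are fun".split(),
    "i like this blog post this is fun".split(),
]

N = len(docs)
counts = [Counter(d) for d in docs]                        # f_ij: raw term counts per document
vocab = sorted({t for d in docs for t in d})
df = {t: sum(1 for c in counts if t in c) for t in vocab}  # df_i: document frequency

def tfidf(c):
    max_f = max(c.values())
    # tf_ij = f_ij / max_j f_ij,  idf_i = log(N / df_i),  w_ij = tf_ij * idf_i
    return [(c.get(t, 0) / max_f) * math.log(N / df[t]) for t in vocab]

for j, c in enumerate(counts, start=1):
    print(f"d{j}:", [round(w, 3) for w in tfidf(c)])
```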