COMP3115 Exploratory Data Analysis and Visualization
Lecture 1: Introduction to Exploratory Data Analysis and Visualization
What Is Exploratory Data Analysis And Visualization?
Why To Study Exploratory Data Analysis And Visualization?
How To Do Exploratory Data Analysis And Visualization?
Exploratory Data Analysis by Simple Summary Statistics
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA)
– EDA is an area of statistics and data analysis, where the idea is to first explore the data set, often using methods from descriptive statistics, scientific visualization, data tours, dimensionality reduction, and others.
– The data exploration is done without any pre-conceived notions or hypotheses.
– Closely related to the field of data mining.
Exploratory Data Analysis vs. Confirmatory Data Analysis
Exploratory Data Analysis
– The researcher examines the data without any pre-conceived ideas in order to discover what the data can tell them about the phenomenon being studied
– Finds a good description
– Raises new questions
Confirmatory Data Analysis
– Mostly concerned with statistical hypothesis testing, confidence intervals, estimation
– Tests a hypothesis
– Settles questions
Confirmatory Data Analysis: Independent Samples t Test
I tend to believe that very few differences exist between males and females in cognitive abilities but there is some evidence that there are gender differences in, for example, humor appreciation.
In this hypothetical study we ask: what percentage of cartoons do men and women consider funny? We recruited 9 people from the psychology subject pool and asked them to view a cartoon. After the cartoon, each participant gave us a humor rating of the cartoon, from 0-100 (100 being the funniest possible). Here are those data.
Independent Samples t Test:
Gender Differences in Humor Appreciation
Null and Alternative hypotheses
H0: Women will categorize the same number of cartoons as funny as will men.
Ha: Women will categorize a different number of cartoons funny than will men.
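A minimal sketch of how the independent-samples t statistic for two groups could be computed, assuming equal variances in both groups (the pooled form). The ratings below are made up for illustration and are not the data from the study; the function name `pooled_t_statistic` is hypothetical.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for an equal-variance two-sample t test."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, n_a + n_b - 2

# Hypothetical humor ratings (0-100); NOT the data from the study.
men = [60, 70, 80, 90]
women = [50, 55, 65, 70, 60]

t_stat, dof = pooled_t_statistic(men, women)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

With 7 degrees of freedom the two-sided 5% critical value is about 2.365, so for this made-up sample H0 would not be rejected.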
T-test assumption – normality
Assumption 1: The sampling distribution is normally distributed
the most extreme histogram interval does not have the highest frequency
the most extreme histogram interval usually has the highest frequency
Exploratory Data Analysis: Mann-Whitney U test (nonparametric test)
Imagine two samples of scores drawn at random from the same population.
The two samples are combined into one larger group and then ranked from lowest to highest.
In this case, there should be a similar number of high- and low-ranked scores in each original group.
If, however, the two samples are from different populations with different medians, then most of the scores from one sample will be lower in the ranked list than most of the scores from the other sample.
– The data are not required to be normally distributed
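The ranking idea can be sketched in a few lines, assuming the rank-based test described here is the Mann-Whitney U test. The scores are hypothetical, and the helper names `average_ranks` and `mann_whitney_u` are introduced only for this illustration. The sketch computes the U statistics only, not a p-value.

```python
def average_ranks(values):
    """Rank values from lowest (rank 1) to highest, averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U_a, U_b) computed from the ranks of the combined sample."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

# Hypothetical scores, chosen only to illustrate the ranking idea.
group_a = [12, 15, 11, 18]
group_b = [22, 25, 17, 30, 28]

u_a, u_b = mann_whitney_u(group_a, group_b)
print(u_a, u_b)
```

A U value close to 0 for one group means that almost all of its scores sit below the other group's scores in the ranked list.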
Other closely related concepts
Many fancy words are used for data analysis
– Exploratory Data Analysis
– Data Mining
– Data Analytics
– Data Science
– Machine Learning
– Big Data Analytics
– Deep Learning
– Artificial Intelligence
– …
Even though the definitions of these words are different, most of the time they all refer to learning from data.
Disclaimer: I will use the terms data analysis, data mining, and machine learning interchangeably in my lectures.
The “rebranding” effect of learning from data
Source: google trends
Why to study exploratory data analysis and visualization?
If “sexy” means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills.
Source: Harvard Business Review. Data Scientist: The Sexiest Job of the 21st Century. October 2012
Why to learn exploratory data analysis and visualization?
Top 10 JOBS IN AMERICA FOR 2020
Source: glassdoor
Data-related job roles have dominated comparable job reports released over the past few years.
Why to learn exploratory data analysis and visualization?
Huge amounts of data are collected from different domains
“We are drowning in information but starving for knowledge”-
The amount and the complexity of the collected data do not allow for manual analysis: we need automated analysis of massive data.
The 4th paradigm for scientific discovery
https://en.wikipedia.org/wiki/The_Fourth_Paradigm
Real-life Applications
Input Data
Classification
Is it a banana (or an apple)?
Movie ratings
Recommendation System
Recommend which movies to which users?
News articles
Clustering
What topics did people discuss in the news today?
English and Chinese sentences
Classification
Translation
How do computers learn from data?
Machine Learning: “gives computers the ability to learn without being explicitly programmed”.
We do not code the solution (we do not even know it).
We design the algorithms and let the algorithms learn from the data.
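An example of learning from data
Learn to recognize an apple or a banana
[Figure: training examples labelled apple -> Feature Engineering (axis: shape (x2)) -> Model Building with training labels -> the new instance is predicted as apple]
Task: predict whether a new instance is an apple or a banana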
Different Types of Learning Tasks
– Supervised Learning: data with labels
– Unsupervised Learning: data without labels
Supervised Learning
– A collection of labelled data (feature representations of instances + class labels) is available (training data)
– The objective is to learn a mapping from feature representations to class labels. The learnt model can then predict the class label of a new instance.
Different Types of Supervised Learning
– Classification: The output value (label) is a categorical value, e.g., ‘apple’ vs. ‘banana’
– Regression: The output value is a continuous value, e.g., predicting a house price
Supervised Learning: Classification Example
The goal of classification is to learn a mapping function from the feature (‘color’, ‘shape’) space to the class label (‘apple’, ‘banana’).
For a new instance, the output of a classification model is its predicted class label (‘apple’ or ‘banana’).
[Figure: model building in the feature space (axis: shape (x2)); the new instance is predicted as apple]
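A minimal sketch of this mapping from features to labels, using a nearest-centroid classifier as a stand-in for model building. The classifier choice, the numeric encodings of color and shape, and the names `fit_centroids` and `predict` are assumptions made for illustration only.

```python
import math

# Hypothetical training data: (color x1, shape x2) feature pairs with labels.
# The numeric encodings are made up for illustration.
training_data = [
    ((0.9, 0.9), "apple"),
    ((0.8, 1.0), "apple"),
    ((0.2, 0.3), "banana"),
    ((0.1, 0.2), "banana"),
]

def fit_centroids(samples):
    """Model building: average the feature vectors of each class."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(dim) / len(rows) for dim in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Predict the label whose centroid is closest to the new instance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

centroids = fit_centroids(training_data)
print(predict(centroids, (0.85, 0.8)))
```

The labels are used only during `fit_centroids`; at prediction time the model sees just the features of the new instance.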
Classification application: Fraud Detection
Goal: predict fraudulent cases in credit card transactions.
Approach:
– Use credit card transactions and information about the account holder as features: time of transaction; location of transaction; time and distance since the last transaction, …
– Use historical records of fraudulent and normal transactions as the labels.
– Learn a model for the class of the transactions.
– Use the learnt model to detect fraud by observing credit card transactions.
Classification application: Predictive Maintenance for ATM
Goal: predict whether an ATM will fail in the next week.
Approach:
– Use information about the ATM as features: location; number of transactions per day; indoor/outdoor, …
– Extract labels (fail or not) from historical maintenance records
– Learn a model for predicting whether an ATM will fail in the next week
– Use the learnt model for predictive maintenance: from “break and then fix” to “fix before it breaks”
Supervised Learning: Regression
Similar to classification; the only difference is that the output of a regression problem is a continuous value instead of a categorical value.
E.g., housing price prediction
– Input: the size of the house
– Output: the price of the house
Predict the price of a new instance
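As a sketch of the housing example, a straight line can be fitted by least squares and then used to predict the price of a new house. The sizes and prices are hypothetical, and simple linear regression is only one possible regression model; `fit_line` is a name introduced for this illustration.

```python
import statistics

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    x_mean, y_mean = statistics.mean(x), statistics.mean(y)
    slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
             / sum((xi - x_mean) ** 2 for xi in x))
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical training data: house size and price (arbitrary units).
sizes = [400, 600, 800, 1000]
prices = [4.0, 6.5, 8.0, 10.5]

slope, intercept = fit_line(sizes, prices)

# Predict the price of a new instance with size 700.
predicted_price = slope * 700 + intercept
print(round(predicted_price, 2))
```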
Unsupervised Learning: Clustering
Clustering
– Only the feature representation of instances is available.
– No label information
– The goal is to discover groups of similar instances from the data
[Figure: clustering of data points in the feature space (axis: shape (x2))]
Each data point describes an instance in terms of shape and color
No information on the actual class is available to the clustering algorithm
Clustering application: Market Segmentation
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct promotion.
Approach:
– Collect different features of customers based on their geographical and lifestyle-related information: age, income, education, …
– Find clusters of similar customers
Clustering application: news articles clustering
Goal: Find groups (topics) of news articles that are similar to each other based on the important terms appearing in them.
Approach:
– Use words in the news articles for feature representation
– Find clusters of news articles
News Articles Clustering Example: Google News
Clustering Algorithms
Partitional clustering (e.g., K-means)
– Partitions are independent of each other
Hierarchical clustering (e.g., agglomerative clustering, divisive clustering)
– Partitions can be visualized using a tree structure (a dendrogram)
– Does not need the number of clusters as input
– Possible to view partitions at different levels of granularity (i.e., can refine/coarsen clusters) using different K.
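To make partitional clustering concrete, here is a minimal K-means sketch that alternates between assigning points to the nearest center and moving each center to the mean of its cluster. The points are hypothetical, and starting from the first k points is a simplification chosen to keep the example deterministic; practical implementations usually initialise the centers randomly.

```python
import math

def k_means(points, k, iterations=10):
    """Plain K-means: alternate between assigning points and updating centers."""
    centers = list(points[:k])  # simplistic initialisation: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical 2-D feature vectors forming two visible groups; no labels are used.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]

centers, clusters = k_means(points, k=2)
print(centers)
```

Note that K-means needs the number of clusters k as input, in contrast to hierarchical clustering.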
How to do exploratory data analysis and visualization?
Data Acquisition
Data Preprocessing
Data Exploration
Data Modelling
How To Do Exploratory Data Analysis And Visualization?
Data Analyst / Data Scientist
Python for Data Analytics
Python is one of the most popular programming languages used by data analysts and data scientists
How To Do Data Analytics in Python?
Data Acquisition
• Beautiful Soup • Scrapy
• Pandas •…•…•…
Data Preprocessing
Python provides many useful open-source libraries for data analytics. We will learn how to use Python for practical data analytics projects.
Data Exploration
• Matplotlib
Data Modeling
• Scikit-learn • Tensorflow • Keras
Quick Survey on Python Programming
– I have experience with Python programming
– I have experience with programming, but not with Python
– I have no experience with programming
Exploratory Data Analysis by Simple Summary Statistics
What is Data? Iris Data set
Many of the exploratory data techniques are illustrated with the Iris Plant data set.
– Details of this data set can be found at https://en.wikipedia.org/wiki/Iris_flower_data_set
– Three flower types (classes): Setosa, Versicolor, Virginica
Iris-setosa
Iris-versicolor
Iris-virginica
What is Data? Iris Data set
Each column is a feature (i.e., attribute, variable) to describe a characteristic of a data sample.
Each row represents one data sample (i.e., data point, data object). It represents an entity in real world (e.g., flower).
[Table: sample rows of the Iris data set with columns sepal_length, sepal_width, petal_length, petal_width and the species (e.g., versicolor)]
• A data set is a collection of data samples
• This iris data set contains 150 samples (50 samples from each of three species)
• 5 features: sepal length, sepal width, petal length, petal width, species
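The row and column view can be sketched with plain Python dictionaries. The three rows are illustrative samples, one per species, with values as commonly listed for the Iris data set. Loading the full 150-sample file is outside this sketch.

```python
# Three illustrative rows of the Iris data set (one per species).
iris_rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4, "species": "versicolor"},
    {"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0, "petal_width": 2.5, "species": "virginica"},
]

# A row is one data sample (one flower).
first_sample = iris_rows[0]

# A column is one feature collected across all samples.
sepal_lengths = [row["sepal_length"] for row in iris_rows]

# The feature names are the keys shared by every row.
feature_names = list(iris_rows[0].keys())
print(feature_names)
```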
Features (Attribute, Variable)
Attribute: a data field representing a characteristic of a data sample
– E.g., sepal length
Types of features
– Categorical: nominal, ordinal
– Numerical
Categorical Feature Types
Nominal
– The values of nominal features are symbols or “names of things”.
– E.g., hair_color = {black, brown, grey, red}
– The values do not have any meaningful order.
– Nominal data can be represented by numerical values (e.g., 1 for female and 0 for male), but these numbers do not have mathematical meaning.
Ordinal
– Similar to the nominal type, but the order matters.
– E.g., size = {small, medium, large}; grades = {A, B, C}
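A short sketch of how the two categorical types might be encoded in Python. The integer codes are arbitrary choices for illustration: for the ordinal feature only their order carries meaning, and for the nominal feature they are pure labels.

```python
# Ordinal feature: the mapping preserves the order small < medium < large.
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = ["large", "small", "medium", "small"]
sorted_sizes = sorted(sizes, key=size_order.get)

# Nominal feature: codes are just labels, so sorting or averaging them is meaningless.
hair_colors = ["black", "brown", "grey", "red"]
hair_code = {color: code for code, color in enumerate(hair_colors)}

print(sorted_sizes)
print(hair_code)
```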
Numerical Feature Types
A numerical feature is quantitative.
It is a measurable quantity represented in integer or real values.
E.g., a measurement
– Person’s height
– Person’s weight
– Blood pressure
E.g., a count
– Number of stock shares a person owns
– How many courses you’ve taken this semester
Feature Types
The features ‘sepal_length’, ‘sepal_width’, ‘petal_length’, and ‘petal_width’ are numerical.
The feature ‘Species’ is nominal.
Basic Statistical Descriptions of Data
Motivation
– To better understand the data: have an overall picture of the data.
Central Tendency
– To measure the location of the middle or center of a data distribution: mean, median, mode
Spread of the Data
– To measure how the data spread out: variance and standard deviation; range, quartiles and interquartile range
Mean is the most common and effective numeric measure of the “center” of a set of data. Also known as the average.
Let [x1, x2, …, xn] be a set of n values. The mean is computed as
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
E.g., suppose we have the following values for examination scores [30, 40, 35, 60, 80, 60]. The mean score is computed as
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{30 + 40 + 35 + 60 + 80 + 60}{6} = 50.83$
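The same calculation in Python, written once directly from the formula and once with the standard library's `statistics.mean`, using the examination scores from the example:

```python
import statistics

scores = [30, 40, 35, 60, 80, 60]

# Mean from the formula: sum of the values divided by the number of values.
mean_by_formula = sum(scores) / len(scores)

# The standard library gives the same result.
mean_by_library = statistics.mean(scores)

print(round(mean_by_formula, 2))
```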
Mean is not always the best way of measuring the center of the data.
Mean is commonly denoted as $\bar{x}$
A major problem with the mean is its sensitivity to extreme (e.g., outlier) values
– E.g., the mean salary at a company may be substantially pushed up by a few highly paid managers
For skewed data, the median is a better measure of the center.
Median: splits the data in half
The median, denoted m, is the middle value when the data are ordered.
If there is an even number of values, the median is the average of the two middle values.
Median is the middle value in a set of ordered data values. Median is commonly denoted as m
How to compute the median
– Sort the data values
– Median is the middle value if there is an odd number of values, or the average of the middle two values otherwise
– Odd number of values: the middle one
[30, 40, 35, 60, 80, 60, 70] -> [30, 35, 40, 60, 60, 70, 80], median = 60
– Even number of values: the average of the middle two values
[30, 40, 35, 60, 80, 60] -> [30, 35, 40, 60, 60, 80], median = (40 + 60)/2 = 50
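The two cases can be written as a small function; this is a sketch of the steps listed above, and the standard library's `statistics.median` returns the same values.

```python
import statistics

def median(values):
    """Sort the values, then take the middle one or the average of the middle two."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_scores = [30, 40, 35, 60, 80, 60, 70]
even_scores = [30, 40, 35, 60, 80, 60]

print(median(odd_scores), median(even_scores))
```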
Resistance to extreme values (outliers)
A statistic is resistant if it is relatively unaffected by extreme values. The median is resistant, while the mean is not.
The mean and the median heart rate for n = 5 patients in their twenties are $\bar{x} = 82.2$ and m = 80.
Suppose that the patient with a heart rate of 108 instead had an extremely high heart rate of 200.
The median doesn’t change at all, since 80 is still the middle value. The mean increases to $\bar{x} = 100.6$.
The extreme value of 200 has a large effect on the mean but little effect on the median.
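The effect can be reproduced with a small sketch. The five heart rates below are hypothetical values chosen so that they match the summary numbers quoted here (mean 82.2, median 80, one patient at 108); the patients' actual values are not listed in the text.

```python
import statistics

# Hypothetical heart rates consistent with mean 82.2 and median 80.
heart_rates = [66, 74, 80, 83, 108]

# Replace the patient at 108 with an extreme value of 200.
with_outlier = [200 if rate == 108 else rate for rate in heart_rates]

print(statistics.mean(heart_rates), statistics.median(heart_rates))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```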
Mode is another measure of central tendency. It is the value that occurs most frequently in the data.
[30, 40, 35, 60, 80, 60]: the mode is 60
It is possible that the largest frequency corresponds to several different values.
– Unimodal: dataset with one mode
– Bimodal: dataset with two modes
– Trimodal: dataset with three modes
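A quick way to find every mode is `statistics.multimode` (Python 3.8 or later), which returns all values that share the largest frequency. The second list is a hypothetical bimodal example.

```python
import statistics

scores = [30, 40, 35, 60, 80, 60]
unimodal_modes = statistics.multimode(scores)

# Hypothetical data in which two values share the largest frequency.
bimodal_data = [1, 1, 2, 2, 3]
bimodal_modes = statistics.multimode(bimodal_data)

print(unimodal_modes, bimodal_modes)
```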
Symmetric vs. Skewed Data
Symmetric data: With perfect symmetric data distribution, the mean, median and mode are all at the same center value.
Skewed data (data in real-world applications are often not symmetric)
– Positively skewed: mode is smaller than median
– Negatively skewed: mode is greater than median
[Figure: positively skewed and negatively skewed distributions]
Mean and median for different shaped distributions
For symmetric distributions, the mean and the median will be the same
For skewed distributions, the mean is pulled further in the direction of the skew than the median
Which measure should we use? Mean or median?
Wealth per adult
Mean wealth per adult in : 1.445 million HKD
Median wealth per adult in : 268,000
Why are they so different?
The median is often used for income-related issues because of the impact of outliers (i.e., extreme values) on the mean.
When we give a statistical summary of the values in a dataset, we are interested in not just the center of the data but also how spread out the data are.
Example — Two sets of exam scores
Consider the following two distributions of exam scores
Both distributions have a median of 74.5. Which distribution has more variability?
The answer to this question depends on how we measure variability.
Measures of spread
For Des Moines the mean temperature is 54.49°F and the median is 54.50°F. For San Francisco the mean temperature is 54.01°F and the median is
The dotplots show that, while the centers may be similar, the distributions are very different.
Measuring the Spread of the Data: Variance and Standard Deviation
The variance of n values [x1, x2, …, xn] is computed as
$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{\ldots}$
where
– $\bar{x}$ is the mean value
– $x_i - \bar{x}$ is the deviation of each data value $x_i$ from the mean value
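A sketch of the variance calculation on the examination scores used earlier. Two conventions for the divisor are in common use: n (population variance) and n − 1 (sample variance). Which one applies here is left open as an assumption, so the sketch computes both.

```python
import math
import statistics

scores = [30, 40, 35, 60, 80, 60]
mean = statistics.mean(scores)

# Sum of squared deviations of each value from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in scores)

population_variance = sum_sq_dev / len(scores)    # divisor n
sample_variance = sum_sq_dev / (len(scores) - 1)  # divisor n - 1

# The standard deviation is the square root of the variance.
sample_std = math.sqrt(sample_variance)

print(round(population_variance, 2), round(sample_variance, 2), round(sample_std, 2))
```

The standard library offers the same two conventions as `statistics.pvariance` and `statistics.variance`.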