Recap from Week 2
Population and Samples
Measure data based on location (Mean, Median, Mode)
Measure data based on dispersion (SD, Variance, Skewness, Coefficient of Variation, z-score)
Relationship between two variables (Covariance, Correlation Coefficient)
Sampling method & Sampling distribution
Confidence Interval & Hypothesis Testing
Statistical Symbols for Populations and Samples
Measure              Sample    Population
Mean                 x̄         µ (mu)
Standard Deviation   s         σ (sigma)
Median               MED       MED
Variance             s²        σ²
Size                 n         N
Measure of Location – Mean, Median, and Mode
Mean: The average value; appropriate for interval and ratio scale data.
Median: The middle value when the data are arranged in ascending or descending order (i.e., the 50th percentile).
Mode: The value that occurs most frequently in the data; it represents the highest peak of the distribution.
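As a minimal sketch of the three measures of location, the snippet below uses Python's standard `statistics` module. The `scores` list is a hypothetical sample invented for illustration; it is not from the lecture.

```python
import statistics

# Hypothetical sample of exam scores (illustrative values only).
scores = [55, 60, 60, 70, 75, 80, 90]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequently occurring value

print(mean_score, median_score, mode_score)
```

Here the mode is lower than the mean because the value 60 repeats while a few larger scores pull the average up.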
Measures of Dispersion: Shape of Distribution
Variance is an average of the squared deviations from the mean (uses all data values).
Standard Deviation is the square root of the variance.
z-score indicates how many standard deviations a raw score is below or above the population mean
Skewness describes lack of symmetry
Negatively skewed: Mean < Median < Mode
Positively skewed: Mode < Median < Mean
Coefficient of Variation (CV): the standard deviation expressed relative to the mean (CV = standard deviation / mean)
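The dispersion measures above can be sketched in a few lines of Python. The `data` list is hypothetical and is treated as a whole population, so the population formulas (dividing by N) are used. Skewness is computed here as the average cubed z-score, which is one common definition and an assumption on our part, and CV is taken as the standard deviation divided by the mean.

```python
import math
import statistics

# Hypothetical values, treated here as a complete population.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(data)             # population mean
variance = statistics.pvariance(data)  # average squared deviation from the mean
sigma = math.sqrt(variance)            # standard deviation = sqrt(variance)

# z-score: how many standard deviations the raw score 9 lies above the mean.
z_of_9 = (9 - mu) / sigma

# Coefficient of variation: standard deviation relative to the mean.
cv = sigma / mu

# Moment-based skewness: the mean of the cubed z-scores (assumed definition).
skewness = sum(((x - mu) / sigma) ** 3 for x in data) / len(data)

print(variance, sigma, z_of_9, cv, skewness)
```

The skewness comes out positive, and the data follow the ordering stated above for a positively skewed distribution: mode (4) < median (4.5) < mean (5).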
Relationship between two variables: Covariance
Covariance is a measure of how much two random variables, X and Y, vary together.
Correlation is a measure of the linear association between two variables, X and Y.
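A small sketch may help to show how the two measures relate: correlation is covariance rescaled by the two standard deviations. The helper functions and the `hours` and `marks` lists are hypothetical, and the sample (n − 1) form of covariance is assumed.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Correlation: covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: hours studied and the mark obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

cov_hm = covariance(hours, marks)
corr_hm = correlation(hours, marks)
print(cov_hm, round(corr_hm, 4))
```

Covariance depends on the units of both variables, whereas correlation always lies between −1 and 1, which makes it easier to compare across pairs of variables.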
Confidence Intervals
Confidence Interval for the Mean with Known Population Standard Deviation
Sample mean ± margin of error
Sample mean ± z_{α/2} × (standard error)
where z_{α/2} is the value of the standard normal random variable for an upper tail area of α/2 (or a lower tail area of 1 − α/2).
The probability that the population mean falls into this interval is 1 − α.
z_{α/2} = |NORM.S.INV(α/2)|
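The Excel formula above can be mirrored in Python, where `statistics.NormalDist().inv_cdf` plays the role of NORM.S.INV. This is a sketch under stated assumptions: the function names and the numbers in the usage lines (sample mean 50, σ = 10, n = 100) are hypothetical, and the standard error is taken as σ/√n.

```python
import math
from statistics import NormalDist

def z_alpha_over_2(alpha):
    """Critical value z_{alpha/2} = |inverse standard normal CDF at alpha/2|."""
    return abs(NormalDist().inv_cdf(alpha / 2))

def mean_confidence_interval(sample_mean, sigma, n, alpha):
    """Sample mean +/- z_{alpha/2} * (sigma / sqrt(n)), with sigma known."""
    margin_of_error = z_alpha_over_2(alpha) * sigma / math.sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical inputs for a 95% interval (alpha = 0.05).
z = z_alpha_over_2(0.05)
low, high = mean_confidence_interval(sample_mean=50.0, sigma=10.0, n=100, alpha=0.05)
print(round(z, 2), round(low, 2), round(high, 2))
```

With these inputs the critical value is about 1.96 and the interval is roughly 48.04 to 51.96.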
Exercise: “find the z”
[Figure: sampling distribution with a central area of 1 − α and a tail area of α/2 on each side]
Hypothesis Testing
Involves drawing inferences about two contrasting propositions (each called a hypothesis) relating to the value of one or more population parameters.
H0 Null hypothesis: describes an existing theory or common understanding (the existing state)
H1 Alternative hypothesis: the complement of H0
Using sample data, we either:
reject H0 and conclude the sample data provides
sufficient evidence to support H1, or
fail to reject H0 and conclude the sample data does not support H1.
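The reject / fail-to-reject decision can be illustrated with a one-sample, two-tailed z-test. The slides do not name a specific test, so this choice is an assumption made for illustration, as are the function name and the numbers (H0: population mean = 50, σ = 10 known, n = 100, sample mean 52.5).

```python
import math
from statistics import NormalDist

def z_test_two_tailed(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z statistic, p-value, reject H0?) for H0: mean == mu0."""
    z_stat = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Two-tailed p-value: probability of a result at least this extreme under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value, p_value < alpha

# Hypothetical sample summary.
z_stat, p_value, reject_h0 = z_test_two_tailed(sample_mean=52.5, mu0=50.0, sigma=10.0, n=100)

if reject_h0:
    print(f"z = {z_stat:.2f}, p = {p_value:.4f}: reject H0")
else:
    print(f"z = {z_stat:.2f}, p = {p_value:.4f}: fail to reject H0")
```

Note that the second branch says "fail to reject H0" rather than "accept H0": the sample data simply does not provide sufficient evidence to support H1.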
Week-3 Lecture Outcomes
The learning outcomes from this week’s lecture are:
Define attributes of datasets, such as missing values, outliers, and probability distributions.
Explain uses of various chart types
Explain why big data has led to a growth in the importance of data visualization
Identify organizational benefits achieved from modern visualization software
Select the best chart type to visually address descriptive analytic questions
Enrich datasets in SAS VA by creating hierarchies, groups, and calculations
Apply the fundamentals of data exploration using SAS VA to a dataset from your organization.
Data Characteristics
Data Types
Categorical (nominal) data
- employee classification (manager, supervisor, associate)
Ordinal data
- survey responses (poor, average, good, very good, excellent)
Interval data
- date (23-09-2020; 25-09-2020)
Ratio data
- sales ($110.23; $0.00)
Cardinality
The number of unique values contained within a variable (e.g., a column of data).
Full cardinality
Every record has a unique identifier
Use as a key in a dataset
Lowest cardinality
All rows contain the same information
This provides little value in the current dataset but may be meaningful when joined with another dataset
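Cardinality can be checked by counting the distinct values in each column. The sketch below uses a hypothetical list of employee records (column names and values are invented) and a small helper function.

```python
# Hypothetical employee records (illustrative values only).
rows = [
    {"employee_id": 101, "role": "manager", "country": "AU"},
    {"employee_id": 102, "role": "associate", "country": "AU"},
    {"employee_id": 103, "role": "associate", "country": "AU"},
    {"employee_id": 104, "role": "supervisor", "country": "AU"},
]

def cardinality(records, column):
    """Number of distinct values found in one column."""
    return len({record[column] for record in records})

for column in ("employee_id", "role", "country"):
    print(column, cardinality(rows, column))
```

In this example `employee_id` has full cardinality (one distinct value per row, so it can serve as a key), while `country` has the lowest cardinality (a single value shared by all rows).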
Data Distributions
FIGURE 3.7 & 3.8, Vidgen et al. 2019
Outliers
An outlier is an observation that is distinctly different from the majority of the data
It can result from a procedural error or from a genuine anomaly in the event that the data captures
Removing outliers can produce a better model fit, but it could also lead to an overfitted model
Missing Data
Missing data can be characterized as:
Missing completely at random (MCAR)
There is no relationship between the missingness of the data and any values, observed or missing
Missing at random (MAR)
There is a systematic relationship between the propensity of missing values and the observed data, but not the missing data
e.g., if men are more likely to tell you their weight than women, weight is MAR
gender <-> weight (missing data)
Missing Data (Cont.)
Missing data can be characterized as:
Missing not at random (MNAR)
There is a relationship between the missingness of the data and the unobserved (missing) values themselves
E.g., people with higher incomes are less likely to report their income, so higher values of “income” tend to be missing
income <-> income (missing data)
Missing Data (Cont.)
Any record with missing data can be omitted from the model
For smaller datasets, however, we will want to preserve as many records as possible
We can fill the missing records with the mean, median or mode of that variable
We might also want to capture the missing data as a feature
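The last two options can be sketched together: fill the gaps with a summary statistic and keep an indicator that records which values were originally missing. The `weights` list is hypothetical, and the median is used here purely as an example; the mean or mode could be substituted.

```python
import statistics

# Hypothetical weights in kg; None marks a missing response.
weights = [70.0, None, 82.5, 64.0, None, 77.5]

# Compute the fill value from the observed values only.
observed = [w for w in weights if w is not None]
fill_value = statistics.median(observed)

# Capture missingness as its own feature before overwriting the gaps.
was_missing = [w is None for w in weights]
imputed = [fill_value if w is None else w for w in weights]

print(fill_value)
print(imputed)
print(was_missing)
```

Keeping `was_missing` as a separate column lets a later model learn from the missingness itself, which matters when the data is not missing completely at random.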
Chart Types
Pie Chart
FIGURE 4.27, Vidgen et al. 2019
When proportion is important and the number of the categories is relatively small
Be Cautious with Pie Charts
FIGURE 4.27, Vidgen et al. 2019
Bar Chart
FIGURE 4.21, Vidgen et al. 2019
Quantitative difference between categorical or continuous data that has been segmented into different groups.
Use an insurance dataset as an example
The vertical axis shows the aggregated measure (e.g., sum or average) for each category
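To make the role of aggregation concrete, the sketch below groups values by category and reduces each group to the single number that a bar would display. The `claims` pairs are hypothetical stand-ins for an insurance dataset, and the `aggregate` helper is illustrative only; it is not how any particular visualisation tool implements this.

```python
from collections import defaultdict

# Hypothetical (region, charges) pairs standing in for an insurance dataset.
claims = [
    ("northeast", 1200.0),
    ("southwest", 800.0),
    ("northeast", 1800.0),
    ("southwest", 1000.0),
    ("southeast", 1500.0),
]

def aggregate(pairs, how="sum"):
    """Group values by category and reduce each group to one bar height."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    if how == "sum":
        return {c: sum(v) for c, v in groups.items()}
    if how == "average":
        return {c: sum(v) / len(v) for c, v in groups.items()}
    if how == "count":
        return {c: len(v) for c, v in groups.items()}
    raise ValueError(f"unknown aggregation: {how}")

totals = aggregate(claims, "sum")
averages = aggregate(claims, "average")
print(totals)
print(averages)
```

The same data can tell a different story depending on the aggregation: by total, northeast has the tallest bar, but by average it ties with southeast.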
Bar Chart
FIGURE 4.22, Vidgen et al. 2019
Colour can be useful for an additional dimension, when the dimension has low cardinality
Histograms
FIGURE 4.23, Vidgen et al. 2019
A histogram is a particular type of bar chart that focuses on a single variable
All observations within an individual bin are treated as having the same value
A histogram offers information such as the central tendency and distribution of values
BMI: Body Mass Index
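The binning behind a histogram can be sketched without a plotting library. The `bmi` values and the bin width of 5 are assumptions chosen for illustration; each value is mapped to the start of its bin, and the counts per bin give the bar heights.

```python
from collections import Counter

# Hypothetical BMI values (illustrative only).
bmi = [18.2, 21.5, 22.1, 24.9, 25.3, 27.8, 28.4, 29.9, 31.0, 36.5]
bin_width = 5  # assumed bin width

def bin_start(value, width):
    """Lower edge of the bin containing value; all values in a bin share it."""
    return int(value // width) * width

counts = Counter(bin_start(v, bin_width) for v in bmi)

# Print a simple text histogram, one row per bin.
for start in sorted(counts):
    print(f"{start:>2}-{start + bin_width:<2} {'#' * counts[start]}")
```

Changing `bin_width` changes the apparent shape of the distribution, so it is worth trying more than one width before drawing conclusions.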
Line Chart
FIGURE 4.24, Vidgen et al. 2019
Compare trends among different categories;
The correlational relationship between x and y value;
How values across different categories change over time;
Can represent relationships among at least three variables.
Scatter Plot
FIGURE 4.25, Vidgen et al. 2019
An x value can correspond to multiple y values.
Trends are still visible.
Bubble Chart
FIGURE 4.26, Vidgen et al. 2019
Capture relationships among at least three variables (e.g., age, number of children, charges)
Geo Map
FIGURE 4.32, Vidgen et al. 2019
Box Plot
FIGURE 4.29, Vidgen et al. 2019
Easy to identify outliers.
Top of the box: 75th percentile
Bottom of the box: 25th percentile
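The box edges and the points drawn as outliers can be computed directly. The sketch below assumes the common 1.5 × IQR convention for flagging outliers, which the slides do not state, and uses hypothetical `values`. Quartile definitions differ slightly between software packages; the "inclusive" method of `statistics.quantiles` is used here.

```python
import statistics

# Hypothetical measurements with one unusually large value.
values = [12, 15, 16, 18, 19, 20, 22, 24, 25, 60]

# Quartiles: q1 is the bottom of the box, q3 is the top of the box.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, i.e. the height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, median, q3, outliers)
```

Whether a flagged point such as 60 is a procedural error or a genuine anomaly still has to be judged from the context of the data.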
Tree Map
FIGURE 4.30, Vidgen et al. 2019
Customer lifetime value by education level and marital status
Heat Map
FIGURE 4.31, Vidgen et al. 2019
Variables are not hierarchical.
Correlation Matrix
FIGURE 4.33, Vidgen et al. 2019
Linear relationship between different measure variables.
Colour denotes the degree of correlation.
Keep it Simple
FIGURE 4.28, Vidgen et al. 2019
Data Exploration
Benefits of Visualisation
For large datasets it becomes impracticable for analysts to understand the data by inspection of the raw data in its tabular form
Visualisations are powerful because they present the data in a more digestible format
It is critical to understand the structure by investigating patterns, trends and relationships of the data before commencing statistical analysis
Exploration guides the predictive model development process
Anscombe’s Quartet – why is visualization needed?
FIGURE 4.1, Vidgen et al. 2019
Descriptive Statistics (11 data points)
Property                 Value
Mean of X                9
Variance of X (σ²)       11
Mean of Y                7.5
Variance of Y (σ²)       4.125
Correlation(X, Y)        0.816
Linear regression line   y = 0.5x + 3
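The summary statistics above can be checked against the first two datasets of the quartet. The x and y values below are the standard published values for Anscombe's datasets I and II (typed in here, not taken from the slides), and the `summarize` helper is a hypothetical name. Sample variance (dividing by n − 1) is assumed, which is what reproduces the value of 11 for X.

```python
import math
import statistics

# Anscombe's quartet, datasets I and II (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summarize(xs, ys):
    """Means, sample variances, correlation and least-squares line for (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    var_x, var_y = statistics.variance(xs), statistics.variance(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)
    slope = cov / var_x
    return {
        "mean_x": mean_x,
        "var_x": var_x,
        "mean_y": mean_y,
        "var_y": var_y,
        "corr": cov / math.sqrt(var_x * var_y),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }

s1 = summarize(x, y1)
s2 = summarize(x, y2)
for name, stats in (("I", s1), ("II", s2)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Both datasets print nearly identical summaries, yet dataset I is roughly linear while dataset II follows a curve, a difference that only becomes obvious once the points are plotted.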
Visualisation Software
There are many data visualisation tools available to help perform data exploration and visualisation on big data
These tools include SAS VA, Tableau, and Microsoft Power BI
Data visualisation tools can:
Quickly generate figures from big data sources
Automatically select the best visualisation based on the input data and the user’s objective
Collapse results such that the graphs convey meaning without losing valuable information
SAS Visual Analytics (SAS VA)
SAS VA is a browser-based analytics platform that uses proprietary technology to analyse large data sets
SAS VA enables users to prepare, explore and communicate data all in a single platform
Users can perform data mining tasks and build predictive analytic models while taking advantage of SAS’s powerful in-memory data capabilities, which allow rapid model development and refinement by a single user or by multiple users simultaneously
SAS Visual Analytics (SAS VA) Demo
Log in details
Please navigate to sasva.business.unsw.edu.au
Log-in using zID and zPass
Note: Ensure Flash Player is enabled.