Recap from Week 2

Population and Samples
Measure data based on location (Mean, Median, Mode)
Measure data based on dispersion (SD, Variance, Skewness, Coefficient of Variation, z-score)
Relationship between two variables (Covariance, Correlation)
Sampling method & Sampling distribution
Confidence Interval & Hypothesis Testing

Statistical Symbols for Populations and Samples

Measure              Sample   Population
Mean                 x̄        µ (mu)
Standard Deviation   s        σ (sigma)
Median               MED      MED
Variance             s²       σ²
Size                 n        N

Measure of Location – Mean, Median, and Mode

Mean: The average value; appropriate for interval and ratio scale data.

Median: The middle value when the data are arranged in ascending or descending order (i.e., the 50th percentile).

Mode: The value that occurs most frequently in the data; it represents the highest peak of the distribution.
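As a minimal sketch of the three measures of location, the example below uses Python's standard `statistics` module. The list `scores` is a hypothetical sample made up for illustration; it is not data from the lecture.

```python
import statistics

# Hypothetical sample: seven exam scores (made up for illustration).
scores = [55, 60, 60, 65, 70, 75, 98]

mean_score = statistics.mean(scores)      # arithmetic average of all values
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequently occurring value

print(mean_score, median_score, mode_score)
```

In this made-up sample the single high score of 98 pulls the mean above the median, while the mode sits at the most common score.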

Measures of Dispersion: Shape of Distribution

Variance is an average of the squared deviations from the mean (uses all data values).

Standard Deviation is the square root of the variance.

z-score indicates how many standard deviations a raw score is below or above the population mean
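The definitions above can be sketched with Python's standard `statistics` module. The list `values` is hypothetical and is treated here as a complete population; the helper name `z_score` is also an assumption introduced for illustration.

```python
import statistics

# Hypothetical data treated as a complete population (made up for illustration).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(values)              # population mean
var_pop = statistics.pvariance(values)    # average squared deviation (divides by N)
sd_pop = statistics.pstdev(values)        # square root of the population variance
var_sample = statistics.variance(values)  # sample variance (divides by n - 1)


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


print(mu, var_pop, sd_pop, z_score(9, mu, sd_pop))
```

The sample variance is slightly larger than the population variance because it divides by n − 1 rather than N.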

Skewness describes lack of symmetry
Negatively skewed: Mean < Median < Mode
Positively skewed: Mode < Median < Mean

Coefficient of Variation (CV)

Relationship between two variables: Covariance

Covariance is a measure of how much two random variables, X and Y, vary together.

Correlation is a measure of the linear association between two variables, X and Y.

Confidence Interval for the Mean with Known Population Standard Deviation

Sample mean ± margin of error
Sample mean ± zα/2 × (standard error)
where zα/2 is the value of the standard normal random variable for an upper tail area of α/2 (or a lower tail area of 1 − α/2).

Confidence Intervals

The probability that the population mean µ falls into this interval is 1 − α.
zα/2 = |norm.s.inv(α/2)|

Exercise: “find the z”
[Figure: sampling distribution with area 1 − α in the centre and α/2 in each tail]

Hypothesis Testing

Involves drawing inferences about two contrasting propositions (each called a hypothesis) relating to the value of one or more population parameters.
H0 Null hypothesis: describes an existing theory or common understanding (the existing state)
H1 Alternative hypothesis: the complement of H0
Using sample data, we either:
reject H0 and conclude the sample data provides sufficient evidence to support H1, or
fail to reject H0 and conclude the sample data does not support H1.

Week-3 Lecture Outcomes

The learning outcomes from this week’s lecture are:
Define attributes of datasets, such as missing values, outliers, and probability distributions.
Explain uses of various chart types
Explain why big data has led to a growth in the importance of data visualization
Identify organizational benefits achieved from modern visualization software
Select the best chart type to visually address descriptive analytic questions
Enrich datasets in SAS VA by creating hierarchies, groups, and calculations
Apply the fundamentals of data exploration using SAS VA to a dataset from your organization.
Data Characteristics

Data Types
Categorical (nominal) data - employee classification (manager, supervisor, associate)
Ordinal data - survey responses (poor, average, good, very good, excellent)
Interval data - date (23-09-2020; 25-09-2020)
Ratio data - sales ($110.23; $0.00)

Cardinality
The number of unique values contained within a variable (e.g., a column of data).
Full cardinality
Every record has a unique identifier
Use as a key in a dataset
Lowest cardinality
All rows contain the same information
This provides little value in the current dataset but may be meaningful when joined with another

Data Distributions

FIGURE 3.7 & 3.8, Vidgen et al. 2019

Outliers
An outlier is an observation that is distinctly different from the majority of the data
This can result from a procedural error or from a genuine anomaly in the event being captured
Outliers can be removed from the data to produce a better model fit, but this could lead to an overfitted model

Missing Data

Missing data can be characterized as:
Missing completely at random (MCAR)
There is no relationship between the missingness of the data and any values, observed or missing
Missing at random (MAR)
There is a systematic relationship between the propensity of missing values and the observed data, but not the missing data
E.g., men are more likely to tell you their weight than women, so weight is MAR
gender <-> weight (missing data)

Missing Data (Cont.)

Missing data can be characterized as:
Missing not at random (MNAR)
There is a relationship between the missingness of the data and its values, including the missing values themselves
E.g., people with higher incomes tend not to report them, so the variable “income” is more often missing for them
income <-> income (missing data)

Missing Data (Cont.)

Any record with missing data can be omitted from the model
Although for smaller datasets, we will want to preserve as many records as possible
We can fill the missing records with the mean, median or mode of that variable
We might also want to capture the missing data as a feature
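The options listed above can be sketched in plain Python. The `records` list, its field names, and the choice of the median as the fill value are all assumptions made for illustration; the slides do not prescribe a specific implementation.

```python
import statistics

# Hypothetical records; None marks a missing income (values made up).
records = [
    {"id": 1, "income": 52000},
    {"id": 2, "income": None},
    {"id": 3, "income": 61000},
    {"id": 4, "income": 48000},
    {"id": 5, "income": None},
]

# Option 1: omit any record with a missing value.
complete = [r for r in records if r["income"] is not None]

# Option 2: fill missing values with the median of the observed values,
# and keep a flag so that the missingness itself is available as a feature.
observed = [r["income"] for r in complete]
fill_value = statistics.median(observed)

imputed = [
    {
        "id": r["id"],
        "income": r["income"] if r["income"] is not None else fill_value,
        "income_missing": r["income"] is None,
    }
    for r in records
]

print(len(complete), fill_value, imputed[1])
```

Omission keeps only three of the five records here, which illustrates why imputation is often preferred for small datasets.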

Chart Types

Pie Chart

FIGURE 4.27, Vidgen et al. 2019
When proportion is important and the number of the categories is relatively small

Be Cautious with Pie Charts

FIGURE 4.27, Vidgen et al. 2019

Bar Chart

FIGURE 4.21, Vidgen et al. 2019
Shows quantitative differences across categorical data, or across continuous data that has been segmented into groups.
The example uses an insurance dataset
The vertical axis shows the measure after the chosen aggregation method has been applied to each group

Bar Chart

FIGURE 4.22, Vidgen et al. 2019
Colour can be useful for an additional dimension, when the dimension has low cardinality

Histograms

FIGURE 4.23, Vidgen et al. 2019
A histogram is a particular type of bar chart that focuses on a single variable
All observations within an individual bin are treated as having the same value
A histogram offers information such as the central tendency and the distribution of values
BMI: Body Mass Index
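To illustrate how a histogram groups a single variable into bins, here is a minimal sketch in plain Python. The BMI values and the bin width of 5 are assumptions made for illustration; they are not taken from the figure.

```python
from collections import Counter

# Hypothetical BMI values (made up for illustration).
bmi = [18.2, 21.5, 22.9, 24.1, 24.8, 26.3, 27.7, 29.0, 31.4, 33.6]

bin_width = 5  # assumed bin width; the slide does not specify one


def bin_start(value, width):
    """Lower edge of the bin that contains `value`."""
    return int(value // width) * width


# Every value in a bin is counted the same, regardless of where it falls inside.
counts = Counter(bin_start(v, bin_width) for v in bmi)

# Text rendering of the histogram, one row per bin.
for start in sorted(counts):
    print(f"{start}-{start + bin_width}: {'#' * counts[start]}")
```

A different bin width would change the shape of the histogram, so the width is worth experimenting with during exploration.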

Line Chart

FIGURE 4.24, Vidgen et al. 2019
Compare trends among different categories;

The correlational relationship between x and y value;

How values across different categories change over time;

Can represent relationships among at least three variables.

Scatter Plot

FIGURE 4.25, Vidgen et al. 2019
An x value can correspond to multiple y values.

Trends are still visible.

Bubble Chart

FIGURE 4.26, Vidgen et al. 2019
Capture relationships among at least three variables (e.g., age, number of children, charges)

Geo Map

FIGURE 4.32, Vidgen et al. 2019

Box Plot

FIGURE 4.29, Vidgen et al. 2019
Easy to identify outliers.

Top of the box: 75th percentile
Bottom of the box: 25th percentile
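The box edges can be computed directly, as in the sketch below. The data are hypothetical, and the 1.5 × IQR rule used to flag outliers is a common convention assumed here; the slide does not state which rule the figure uses. Quartile definitions also vary slightly between software packages.

```python
import statistics

# Hypothetical observations (made up for illustration); 45 is far from the rest.
values = [12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 45]

# Quartiles: q1 is the bottom of the box, q3 is the top of the box.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1  # interquartile range, i.e. the height of the box

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, q2, q3, outliers)
```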

Tree Map

FIGURE 4.30, Vidgen et al. 2019
Customer lifetime value by education level and marital status

Heat Map

FIGURE 4.31, Vidgen et al. 2019
Variables are not hierarchical.

Correlation Matrix

FIGURE 4.33, Vidgen et al. 2019
Linear relationship between different measure variables.

Colour denotes the degree of correlation.
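The numbers behind a correlation matrix can be sketched as pairwise Pearson correlations. The variable names and values below are hypothetical, and the helper `pearson` is written out by hand for illustration rather than taken from any particular tool.

```python
import math

# Hypothetical measure variables (values made up for illustration).
data = {
    "age": [25, 32, 47, 51, 62],
    "bmi": [22.0, 27.5, 24.0, 30.5, 28.0],
    "charges": [2100, 3900, 5200, 7400, 8100],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# One correlation per pair of variables; the diagonal is always 1.
matrix = {a: {b: pearson(data[a], data[b]) for b in data} for a in data}

for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

In a chart, each cell of this matrix would be coloured according to its value, typically on a scale from −1 to 1.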

Keep it Simple

FIGURE 4.28, Vidgen et al. 2019

Data Exploration

Benefits of Visualisation
For large datasets it becomes impracticable for analysts to understand the data by inspecting the raw data in its tabular form
Visualisations are powerful, as they present the data in a more digestible format
It is critical to understand the structure by investigating patterns, trends and relationships of the data before commencing statistical analysis
Exploration guides the predictive model development process

Anscombe’s Quartet – why is visualization needed?

FIGURE 4.1, Vidgen et al. 2019
Descriptive Statistics (11 data points)

Property                 Value
Mean of X                9
Variance of X (σ²)       11
Mean of Y                7.5
Variance of Y (σ²)       4.125
Correlation(X, Y)        0.816
Linear regression line   y = 0.5x + 3
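These summary statistics can be checked numerically. The sketch below uses the widely published values of the first of Anscombe's four datasets (Anscombe, 1973); the numbers are not reproduced from the figure itself. The variances on the slide appear to correspond to the sample variance (dividing by n − 1), which is what the code computes.

```python
import math

# First dataset of Anscombe's quartet, as commonly published.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and cross-products about the means.
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

var_x = sxx / (n - 1)               # sample variance of X
var_y = syy / (n - 1)               # sample variance of Y
r = sxy / math.sqrt(sxx * syy)      # Pearson correlation
slope = sxy / sxx                   # least-squares slope
intercept = mean_y - slope * mean_x # least-squares intercept

print(round(mean_x, 2), round(var_x, 2), round(mean_y, 2), round(var_y, 3))
print(round(r, 3), round(slope, 2), round(intercept, 2))
```

The other three datasets in the quartet give nearly identical summary statistics while looking very different when plotted, which is the point of the example.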

Visualisation Software
Many data visualisation tools are available to help perform data exploration and visualisation on big data
These tools include SAS VA, Tableau and Microsoft Power BI
Data visualisation tools can:
Quickly generate figures from big data sources
Automatically select the best visualisation based on the input data and the user’s objective
Collapse results such that the graphs convey meaning without losing valuable information

SAS Visual Analytics (SAS VA)
SAS VA is a browser-based analytics platform that uses proprietary technology to analyse large data sets
SAS VA enables users to prepare, explore and communicate data all in a single platform
Users can perform data mining tasks and build predictive analytic models while taking advantage of SAS’s powerful in-memory data capabilities
This allows for rapid model development and refinement by a single user or by multiple users simultaneously

SAS Visual Analytics (SAS VA) Demo

Log in details

Please navigate to sasva.business.unsw.edu.au
Log-in using zID and zPass
Note: Ensure Flash Player is enabled.
