Basics Tables Statistical tests Figures
CORPFIN 2503 – Business Data Analytics: Descriptive statistics and data exploration
Week 3: August 9th, 2021
£ius CORPFIN 2503, Week 3 1/69
Descriptive statistics
Descriptive statistics give an overall picture of the data.
Descriptive statistics provide:
• means, medians, and other statistical properties
• various graphs, plots, ...
• correlation matrix
• basic statistical tests.
Descriptive statistics II
There is no universal to-do list for descriptive statistics.
The aspects of the data that should be reported depend on:
• data properties
• the purpose of the project.
Example
Suppose we would like to analyze whether the discount on new cars depends on the car's origin (USA, Asia, Europe).
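/* Creating data file: */
DATA work.car_data;
SET SAShelp.Cars;
/* Assumption: the definition of discount did not survive extraction; one plausible choice is the gap between sticker and invoice price as a share of MSRP */
discount = (MSRP - Invoice) / MSRP;
RUN;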
Example II
First, we look at the data, either by:
• opening the data file or
• 'printing' the data:
PROC PRINT DATA=car_data(obs=20);
RUN;
Example: Frequency distribution
Next, we should look at the frequency distribution of car origin (USA, Asia, Europe).
proc freq data=work.car_data;
tables origin;
run;
Example: Frequency distribution III
Alternatively, we can create a pie chart:
proc gchart data=work.car_data;
PIE origin / type=percent;
run;
Descriptive statistics III
Next, we should identify the other potential determinants (besides origin) of the discount. Let’s assume they are:
• car manufacturer ('make')
• car type ('type')
• drivetrain type ('drivetrain')
• car sticker price ('MSRP')
• engine size ('enginesize')
• car length ('length').
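Before computing any statistics, it can help to check which of these variables are stored as numbers and which as text. A minimal sketch, assuming the work.car_data file created earlier; PROC CONTENTS lists each variable with its type and length:

```sas
/* List variable names, types (Num/Char) and lengths */
proc contents data=work.car_data;
run;
```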
Descriptive statistics IV
Then we should provide a table with key statistics of our numerical variables:
• mean
• standard deviation
• minimum and maximum values
• median
• 25th and 75th percentile values.
Descriptive statistics V
What about non-numerical variables such as car manufacturer, car type, and drivetrain type?
One should code them as dummy variables (also known as indicator variables).
Then compute their key statistics as well.
Dummy variables
If we have a gender variable, then its coding is very simple:
• gender=1 if female
• gender=0 if male.
What about variables that can take more than 2 values such as drivetrain type or car type?
Dummy variables II
proc freq data=work.car_data;
tables drivetrain type;
run;
Dummy variables IV
We should create a dummy variable for each value of a non-numerical variable.
E.g., for drivetrain, we should generate 3 dummy variables:
• all=1 if 'drivetrain' is equal to All, 0 otherwise
• front=1 if 'drivetrain' is equal to Front, 0 otherwise
• rear=1 if 'drivetrain' is equal to Rear, 0 otherwise.
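The three drivetrain dummies could be coded as in the sketch below. It assumes the work.car_data file created earlier. The names dt_all, dt_front and dt_rear are illustrative choices (the prefix avoids confusion with the ALL keyword that some procedures use) and do not appear elsewhere in these slides. In SAS, a comparison in parentheses evaluates to 1 when true and 0 when false.

```sas
/* Sketch: one 0/1 indicator per drivetrain value.
   dt_all, dt_front and dt_rear are illustrative names. */
DATA work.car_data;
SET work.car_data;
dt_all   = (drivetrain = 'All');
dt_front = (drivetrain = 'Front');
dt_rear  = (drivetrain = 'Rear');
RUN;

/* The mean of a 0/1 dummy is the share of cars in that category */
proc means data=work.car_data mean maxdec=3;
var dt_all dt_front dt_rear;
run;
```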
Dummy variables V
For car type, this might not be practical, as there are 6 possible values and some of them occur infrequently.
E.g., there are only three hybrid cars in the sample.
In this case, one can generate dummy variables only for the more frequent values.
If there are 6 possible values, in general, it is sufficient to code only 3.
Dummy variables VI
We will code only one.
DATA work.car_data;
SET work.car_data;
/* Start at 0 so that non-sedans are coded 0 rather than missing */
sedan=0;
IF type='Sedan' then sedan=1;
RUN;
Example: Summary statistics
We can use either PROC MEANS or PROC UNIVARIATE.
proc means data = work.car_data n mean std min p25
median p75 max;
var discount msrp enginesize length sedan;
run;
proc univariate data = work.car_data;
var discount msrp enginesize length sedan;
run;
Example: Summary statistics II
The results from PROC UNIVARIATE are less user-friendly: the statistics are spread over several output tables per variable, so one needs to compile a summary table manually.
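One way to reduce the manual work is to have PROC UNIVARIATE write the statistics of interest to a data set. The sketch below does this for discount only; the output data set name (work.disc_stats) and the variable names (n_disc, mean_disc, and so on) are illustrative assumptions.

```sas
/* Sketch: save selected PROC UNIVARIATE statistics to a data set
   instead of copying them by hand. All output names are illustrative. */
proc univariate data=work.car_data noprint;
var discount;
output out=work.disc_stats n=n_disc mean=mean_disc std=std_disc
       min=min_disc q1=p25_disc median=med_disc q3=p75_disc max=max_disc;
run;

/* Show the one-row summary */
proc print data=work.disc_stats;
run;
```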
Example: Summary statistics IV
One should also provide summary statistics by car origin.
proc means data = work.car_data n mean std min p25
median p75 max;
var discount msrp enginesize length sedan;
class origin;
run;
Example: Summary statistics VI
Let's limit the number of decimal places in the output.
proc means data = work.car_data n mean std min p25
median p75 max maxdec=2;
var discount msrp enginesize length sedan;
class origin;
run;
Example: Summary statistics VIII
The results suggest that discount might be more or less the same regardless of the car origin.
What about drivetrain type? Or car type?
Example: Summary statistics IX
proc means data = work.car_data mean std min p25
median p75 max maxdec=3;
var discount;
class origin drivetrain;
run;
Example: Summary statistics XI
proc means data = work.car_data mean std min p25
median p75 max maxdec=3;
var discount;
class origin drivetrain type;
run;
Example: Summary statistics XIII
The table produced by the code above (discount by origin, drivetrain, and type) is incomplete.
The data seem to be too granular at this level.
It might be better not to report such a table.
Example: Summary statistics XIV
One can also use PROC TABULATE to display descriptive statistics in tabular format.
For example, to display the number of observations and the mean of discount:
proc tabulate data=work.car_data;
var discount;
table discount*N discount*MEAN;
run;
Example: Summary statistics XV
Or to display the means of discount and MSRP by car type.
proc tabulate data=work.car_data;
var discount MSRP;
class type;
table type, discount*MEAN MSRP*MEAN;
run;
Example: Summary statistics XVI
Or to display the means of discount and MSRP by car type and drivetrain.
proc tabulate data=work.car_data;
var discount MSRP;
class type drivetrain;
table type, discount*drivetrain*MEAN MSRP*drivetrain*MEAN;
run;
Two-way tables
Two-way tables show how the observations are distributed across the combinations of two categorical variables.
proc freq data=work.car_data;
tables origin*type / norow nocol nopercent;
run;
Correlation matrix
A correlation matrix shows the correlation coefficient for each pair of variables.
PROC CORR DATA=work.car_data;
var discount msrp enginesize length sedan;
RUN;
Statistical tests
One should also report basic statistical tests.
The two-sample t-test is very common. One can also test whether medians are statistically different across the sub-samples.
Let’s test whether discount depends on car origin.
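For the comparison of medians, one option is PROC NPAR1WAY. The sketch below is an illustration rather than part of the original example; it assumes the work.car_data file with the discount variable and compares all three origins at once.

```sas
/* Sketch: nonparametric comparison of discount across the three origins.
   MEDIAN requests the median test; WILCOXON requests the rank-based test
   (Kruskal-Wallis when there are more than two groups). */
proc npar1way data=work.car_data median wilcoxon;
class origin;
var discount;
run;
```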
Example: Statistical tests
First, we need to create dummy variables for car origin.
DATA work.car_data;
SET work.car_data;
origin_usa=0;
IF origin='USA' then origin_usa=1;
origin_asia=0;
IF origin='Asia' then origin_asia=1;
origin_europe=0;
IF origin='Europe' then origin_europe=1;
RUN;
Example: Statistical tests II
T-test for cars produced in the USA vs cars produced elsewhere:
proc ttest data=work.car_data;
class origin_usa;
var discount;
run;
Example: Statistical tests IV
T-test for cars produced in the USA vs cars produced in Europe:
proc ttest data=work.car_data;
class origin_usa;
var discount;
where origin ne 'Asia';
run;
Example: Statistical tests VI
In both cases, we fail to reject the null hypothesis that the discount is the same in both sub-samples.
In our example, one should probably conduct t-tests for all possible combinations of car origin:
• USA vs Europe
• USA vs Asia
• Europe vs Asia.
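The two remaining comparisons can be sketched in the same way as the USA vs Europe test. The version below is one possible approach: it uses origin itself as the class variable and a WHERE statement to leave exactly two origins in the sample, which is what PROC TTEST requires.

```sas
/* USA vs Asia: drop European cars so that origin has two levels */
proc ttest data=work.car_data;
class origin;
var discount;
where origin ne 'Europe';
run;

/* Europe vs Asia: drop US cars */
proc ttest data=work.car_data;
class origin;
var discount;
where origin ne 'USA';
run;
```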
Figures
One can produce a few figures to better describe the data.
Refer to Workshop #2 regarding key types of figures and plots.
Histogram
A histogram is a visual representation of the distribution of the numerical data.
Steps to construct a histogram:
1. sort the data from smallest to largest value
2. divide the entire range of values into a series of intervals (bins)
3. count how many values fall into each bin
4. plot the results using bar charts.
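The steps above can be sketched by hand for MSRP. This is only an illustration: the bin width of 10,000 dollars is an arbitrary assumption, and work.msrp_bins and bin_start are illustrative names. Sorting (step 1) is not needed here because counting does not depend on the order of the observations.

```sas
/* Step 2: assign each car to a bin, identified by the bin's lower edge */
data work.msrp_bins;
set work.car_data;
bin_start = floor(msrp / 10000) * 10000;
run;

/* Step 3: count how many cars fall into each bin */
proc freq data=work.msrp_bins;
tables bin_start;
run;

/* Step 4: plot the counts as a bar chart */
proc sgplot data=work.msrp_bins;
vbar bin_start;
run;
```

In practice the histogram statements shown later do all of this in one step.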
Histogram: Bin width
Bin width is important:
• if too small, then the histogram is too messy
• if too big, then a lot of information is lost.
SAS generally does a good job of selecting the bin width automatically.
Histogram: Bin width II
Let's create a few histograms using the following data:
1, 1.1, 1.15, 1.2, 1.4, 1.45, 1.5, 1.55, 1.6, 1.7, 1.75, 1.8, 1.9
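To experiment with the bin width, the data can be entered in SAS and plotted with PROC SGPLOT. This is a sketch: the data set name work.bin_demo, the variable name x, and the three bin widths (0.05, 0.5 and 0.2) are assumptions chosen for illustration.

```sas
/* Enter the 13 values; @@ lets INPUT read several values per line */
data work.bin_demo;
input x @@;
datalines;
1 1.1 1.15 1.2 1.4 1.45 1.5 1.55 1.6 1.7 1.75 1.8 1.9
;
run;

/* Narrow bins: most bins hold zero or one observation */
proc sgplot data=work.bin_demo;
title 'Bin width 0.05';
histogram x / binwidth=0.05 scale=count;
run;

/* Wide bins: almost all observations end up in one or two bins */
proc sgplot data=work.bin_demo;
title 'Bin width 0.5';
histogram x / binwidth=0.5 scale=count;
run;

/* Intermediate bin width */
proc sgplot data=work.bin_demo;
title 'Bin width 0.2';
histogram x / binwidth=0.2 scale=count;
run;

title; /* clear the title for later output */
```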
Histogram: Bin width III
[Figure: three histograms of the data above. The first panel, "Bins too narrow", uses a bin width of 0.05; the second, "Bins too wide", uses a bin width of 0.5; the third uses a bin width of 0.2.]
Example: Histogram
proc univariate data = work.car_data plots;
var discount;
run;
Example: Histogram III
Alternatively, one can use procedure SGPLOT:
proc sgplot data=work.car_data;
histogram discount;
run;
Example: Histogram V
To produce histograms of discount by origin:
proc univariate data=work.car_data;
class origin;
var discount;
histogram discount / nrows=3;
run;
Box plots
Box plots illustrate data distribution and key statistical properties.
Box plots are not popular nowadays.
PROC UNIVARIATE or PROC BOXPLOT can be used to produce box plots.
Example: Box plots II
Source: SAS User’s Guide, p. 796.
Example: Box plots
To produce box plots of discount by origin:
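/* BY-group processing requires the data to be sorted by origin */
proc sort data=work.car_data; by origin; run;
proc univariate data = work.car_data plots;
var discount;
by origin;
run;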
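PROC BOXPLOT, mentioned earlier as an alternative, draws the box plots for all origins side by side in one figure. A minimal sketch, assuming the work.car_data file with the discount variable; the sorted copy work.car_sorted is an illustrative name.

```sas
/* PROC BOXPLOT expects the data sorted by the group variable */
proc sort data=work.car_data out=work.car_sorted;
by origin;
run;

/* One box per origin: analysis variable * group variable */
proc boxplot data=work.car_sorted;
plot discount*origin;
run;
```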
Scatter plots
A scatter plot is a figure in which the values of two variables are plotted along two axes.
Scatter plots help reveal whether there is any relation (linear or non-linear) between the two variables.
Example: Scatter plots
SAS procedure GPLOT can be used to produce scatter plots:
proc gplot data=work.car_data;
title 'Scatter plot of car length and discount';
plot length* discount=1;
run;
Example: Scatter plots III
Scatter plots by origin:
SYMBOL1 V=circle C=black I=none height=2;
SYMBOL2 V=star C=red I=none;
SYMBOL3 V=square C=blue I=none height=2;
proc gplot data=work.car_data;
title 'Scatter plot of car length and discount by origin';
plot length* discount=origin;
run;
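A similar plot can also be drawn with PROC SGPLOT, which assigns a marker style to each group automatically, so no SYMBOL statements are needed. This is an alternative sketch, not the approach used in the slides; as above, length is on the vertical axis and discount on the horizontal axis.

```sas
proc sgplot data=work.car_data;
title 'Scatter plot of car length and discount by origin';
scatter x=discount y=length / group=origin;
run;

title; /* clear the title for later output */
```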
We covered a lot of different types of tables and figures.
However, it does not mean that one needs to use all of them.
Recall from the previous lecture that one should not overload a report or presentation with plots and tables.
Required reading
Konasani, V. R. and Kadre, S. (2015). Practical Business Analytics Using SAS: A Hands-on Guide: chapters 6, 7, and 8.