Basics Tables Statistical tests Figures
CORPFIN 2503 – Business Data Analytics:
Descriptive statistics and data exploration
£ius
Week 3: August 9th, 2021
£ius CORPFIN 2503, Week 3 1/69
Outline
Basics
Tables
Statistical tests
Figures
Descriptive statistics
Descriptive statistics give an overall picture of the data.
Descriptive statistics provide:
• means, medians, and other statistical properties
• various graphs and plots
• correlation matrix
• basic statistical tests.
Descriptive statistics II
There is no universal to-do list for descriptive statistics.
Which aspects of the data to report depends on:
• data properties
• the purpose of the project.
Example
Suppose we would like to analyze whether the discount on new cars
depends on the car's origin (USA, Asia, Europe).
/* Creating data file: */
DATA work.car_data;
SET SAShelp.Cars;
RUN;
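The later examples use a variable named discount, but SASHELP.CARS does not contain one and the slides do not show how it is built. A minimal sketch, assuming the discount is the percentage gap between the sticker price (MSRP) and the dealer invoice price (Invoice); this definition is an assumption for illustration, not taken from the slides:

```sas
/* Assumption: discount = percentage by which the invoice price
   lies below the sticker price. Not defined in the slides. */
DATA work.car_data;
    SET work.car_data;
    discount = 100 * (MSRP - Invoice) / MSRP;
RUN;
```

With this definition, a discount of 8 means the invoice price is 8% below MSRP.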
Example II
First, we look at the data, either by:
• opening the data file or
• 'printing' the data:
PROC PRINT DATA=car_data(obs=20);
RUN;
Example III
Example: Frequency distribution
Next, we should look at the frequency distribution of car origin
(USA, Asia, Europe).
proc freq data=work.car_data;
tables origin;
run;
Example: Frequency distribution II
Example: Frequency distribution III
Alternatively, we can create a pie chart:
proc gchart data=work.car_data;
PIE origin / type=percent;
run;
Example: Frequency distribution IV
Descriptive statistics III
Next, we should identify the other potential determinants (besides
origin) of the discount. Let’s assume they are:
• car manufacturer ('make')
• car type ('type')
• drivetrain type ('drivetrain')
• car sticker price ('MSRP')
• engine size ('enginesize') and
• car length ('length').
Descriptive statistics IV
Then we should provide a table with key statistics of our numerical
variables:
• mean
• standard deviation
• minimum and maximum values
• median
• 25th and 75th percentile values.
Descriptive statistics V
What about non-numerical variables such as car manufacturer, car
type, and drivetrain type?
One should code them as dummy variables (also known as indicator
variables).
Then compute their key statistics as well.
Dummy variables
If we have a gender variable, then its coding is very simple:
• gender=1 if female
• gender=0 if male.
What about variables that can take more than 2 values such as
drivetrain type or car type?
Dummy variables II
proc freq data=work.car_data;
tables drivetrain type;
run;
Dummy variables III
Dummy variables IV
We should create a dummy variable for each value of a non-numerical
variable.
E.g., for drivetrain, we should generate 3 dummy variables:
• all=1 if 'drivetrain' is equal to All, 0 otherwise
• front=1 if 'drivetrain' is equal to Front, 0 otherwise
• rear=1 if 'drivetrain' is equal to Rear, 0 otherwise.
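The three drivetrain dummies described above could be coded as follows. This is a sketch: it relies on the fact that a SAS comparison evaluates to 1 when true and 0 when false, and the dt_ prefix is an illustrative naming choice (not from the slides) that avoids confusion with the ALL keyword used by some procedures such as PROC TABULATE.

```sas
/* One 0/1 indicator per drivetrain value.
   Variable names dt_all, dt_front, dt_rear are hypothetical. */
DATA work.car_data;
    SET work.car_data;
    dt_all   = (drivetrain = 'All');
    dt_front = (drivetrain = 'Front');
    dt_rear  = (drivetrain = 'Rear');
RUN;
```

The means of these indicators, as reported by PROC MEANS, are the shares of cars with each drivetrain type.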
Dummy variables V
For car type, this might not be practical as there are 6 possible
values and some of them have low frequency.
E.g., there are only three hybrid cars in the sample.
In this case, one can generate dummy variables only for more
frequent values.
If there are 6 possible values, in general, it is sufficient to code only
3.
Dummy variables VI
We will code only one.
DATA work.car_data;
SET work.car_data;
sedan=0;
IF type='Sedan' then sedan=1;
RUN;
Example: Summary statistics
We can either use PROC MEANS or PROC UNIVARIATE.
proc means data = work.car_data n mean std min p25
median p75 max;
var discount msrp enginesize length sedan;
run;
proc univariate data = work.car_data;
var discount msrp enginesize length sedan;
run;
Example: Summary statistics II
The results from PROC UNIVARIATE are less user-friendly, and one
needs to compile a table manually.
Example: Summary statistics III
Example: Summary statistics IV
One should also provide summary statistics by car origin.
proc means data = work.car_data n mean std min p25
median p75 max;
var discount msrp enginesize length sedan;
class origin;
run;
Example: Summary statistics V
Example: Summary statistics VI
Let’s limit the number of decimal places in output.
proc means data = work.car_data n mean std min p25
median p75 max maxdec=2;
var discount msrp enginesize length sedan;
class origin;
run;
Example: Summary statistics VII
Example: Summary statistics VIII
The results suggest that discount might be more or less the same
regardless of the car origin.
What about drivetrain type? Or car type?
Example: Summary statistics IX
proc means data = work.car_data mean std min p25
median p75 max maxdec=3;
var discount;
class origin drivetrain;
run;
Example: Summary statistics X
Example: Summary statistics XI
proc means data = work.car_data mean std min p25
median p75 max maxdec=3;
var discount;
class origin drivetrain type;
run;
Example: Summary statistics XII
Example: Summary statistics XIII
The table on the previous slide is incomplete.
The three-way breakdown seems to be too granular.
It might be better not to report the table on the previous slide.
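One less granular alternative is to report the discount by each classification variable separately rather than by every combination. A sketch using the WAYS statement of PROC MEANS (not shown in the slides), which restricts the output to combinations of the stated number of CLASS variables:

```sas
proc means data = work.car_data mean std min p25
    median p75 max maxdec=3;
    var discount;
    class origin drivetrain type;
    ways 1;   /* one-way breakdowns only: by origin, by drivetrain, by type */
run;
```

This produces three short tables instead of one sparse three-way table.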
Example: Summary statistics XIV
One can also use PROC TABULATE to display descriptive statistics
in tabular format.
For example, to display the number of observations and the mean:
proc tabulate data=work.car_data;
var discount;
table discount*N discount*MEAN;
run;
Example: Summary statistics XV
Or to display the means of discount and MSRP by car type.
proc tabulate data=work.car_data;
var discount MSRP;
class type;
table type, discount*MEAN MSRP*MEAN;
run;
Example: Summary statistics XVI
Or to display the means of discount and MSRP by car type and
drivetrain.
proc tabulate data=work.car_data;
var discount MSRP;
class type drivetrain;
table type, discount*drivetrain*MEAN MSRP*drivetrain*MEAN;
run;
Two-way tables
Two-way tables are used to illustrate the distribution of
observations across two categorical variables.
proc freq data=work.car_data;
tables origin*type / norow nocol nopercent;
run;
Two-way tables II
Correlation matrix
A correlation matrix shows the correlation coefficients for all
pairs of variables.
PROC CORR DATA=work.car_data;
var discount msrp enginesize length sedan;
RUN;
Correlation matrix II
Statistical tests
One should also report basic statistical tests.
The two-sample t-test is very common. One can also test whether
medians are statistically different across the sub-samples.
Let’s test whether discount depends on car origin.
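The slides do not show code for comparing medians. One possible approach, offered here as a sketch rather than the lecture's method, is PROC NPAR1WAY (part of SAS/STAT): its MEDIAN option requests a median test and its WILCOXON option a rank-based test (Kruskal-Wallis when there are more than two groups).

```sas
/* Nonparametric comparison of discount across the three origins.
   Assumes the variable discount exists in work.car_data. */
proc npar1way data=work.car_data median wilcoxon;
    class origin;
    var discount;
run;
```

Small p-values would suggest that the location of the discount distribution differs across origins.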
Example: Statistical tests
First, we need to create dummy variables for car origin.
DATA work.car_data;
SET work.car_data;
origin_usa=0;
IF origin='USA' then origin_usa=1;
origin_asia=0;
IF origin='Asia' then origin_asia=1;
origin_europe=0;
IF origin='Europe' then origin_europe=1;
RUN;
Example: Statistical tests II
T-test for cars produced in the USA vs cars produced elsewhere:
proc ttest data=work.car_data;
class origin_usa;
var discount;
run;
Example: Statistical tests III
Example: Statistical tests IV
T-test for cars produced in the USA vs cars produced in Europe:
proc ttest data=work.car_data;
class origin_usa;
var discount;
where origin ne 'Asia';
run;
Example: Statistical tests V
Example: Statistical tests VI
In both cases, we fail to reject the null hypothesis that the mean
discount is the same in both sub-samples.
In our example, one should probably conduct t-tests for all possible
combinations of car origin:
• USA vs Europe
• USA vs Asia
• Europe vs Asia.
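The two comparisons not yet run can be sketched in the same way. The version below is an assumption about how one might do it: it uses origin directly as the CLASS variable, which works because the WHERE statement leaves exactly two origin values, as PROC TTEST requires.

```sas
/* USA vs Asia: exclude European cars */
proc ttest data=work.car_data;
    class origin;
    var discount;
    where origin ne 'Europe';
run;

/* Europe vs Asia: exclude US cars */
proc ttest data=work.car_data;
    class origin;
    var discount;
    where origin ne 'USA';
run;
```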
Figures
One can produce a few figures to better describe the data.
Refer to Workshop #2 regarding key types of figures and plots.
Histogram
A histogram is a visual representation of the distribution of the
numerical data.
Steps to construct a histogram:
1. sort the data from smallest to largest value
2. divide the entire range of values into a series of intervals (bins)
3. count how many values fall into each bin
4. plot the counts as a bar chart.
Histogram: Bin width
Bin width is important:
• if too small, the histogram is too noisy
• if too big, a lot of information is lost.
SAS usually does a good job of selecting the bin width.
Histogram: Bin width II
Let’s create a few histograms using the following data:
Data: 1, 1.1, 1.15, 1.2, 1.4, 1.45, 1.5, 1.55, 1.6, 1.7, 1.75, 1.8, 1.9
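The effect of bin width on these 13 values can be reproduced in SAS. A minimal sketch, assuming PROC SGPLOT with the BINWIDTH= option of the HISTOGRAM statement; the dataset name work.bin_demo, the variable name x, and the three widths (0.05, 0.5, 0.2) are illustrative choices:

```sas
/* Toy data from the slide */
data work.bin_demo;
    input x @@;
    datalines;
1 1.1 1.15 1.2 1.4 1.45 1.5 1.55 1.6 1.7 1.75 1.8 1.9
;
run;

/* Narrow bins: most bars have height 0 or 1 */
proc sgplot data=work.bin_demo;
    histogram x / binwidth=0.05 scale=count;
run;

/* Wide bins: nearly all values fall into very few bars */
proc sgplot data=work.bin_demo;
    histogram x / binwidth=0.5 scale=count;
run;

/* Intermediate bins */
proc sgplot data=work.bin_demo;
    histogram x / binwidth=0.2 scale=count;
run;
```

Omitting BINWIDTH= lets SAS choose the width automatically.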
Histogram: Bin width III
[Figure: three histograms of the data from the previous slide. Panels: "Bins too narrow" (bin width 0.05), "Bins too wide" (bin width 0.5), "Just fine" (bin width 0.2).]
Example: Histogram
proc univariate data = work.car_data plots;
var discount;
run;
Example: Histogram II
Example: Histogram III
Alternatively, one can use procedure SGPLOT:
proc sgplot data=work.car_data;
histogram discount;
run;
Example: Histogram IV
Example: Histogram V
To produce histograms of discount by origin:
proc univariate data=work.car_data;
class origin;
var discount;
histogram discount / nrows=3;
run;
Example: Histogram VI
Box plots
Box plots illustrate data distribution and key statistical properties.
Box plots are not popular nowadays.
PROC UNIVARIATE or PROC BOXPLOT can be used to produce
box plots.
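A sketch of the PROC BOXPLOT route (part of SAS/STAT), assuming the variable discount exists in work.car_data. The procedure expects the input to be sorted by the group variable, so the data is sorted first; the output dataset name work.car_sorted is illustrative.

```sas
/* Sort by the group variable, as PROC BOXPLOT expects */
proc sort data=work.car_data out=work.car_sorted;
    by origin;
run;

/* One box per origin: analysis variable * group variable */
proc boxplot data=work.car_sorted;
    plot discount*origin;
run;
```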
Box plots II
Source: SAS User’s Guide, p. 796.
Example: Box plots
To produce box plots of discount by origin:
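/* BY-group processing requires data sorted by the BY variable */
proc sort data=work.car_data;
by origin;
run;
proc univariate data = work.car_data plots;
var discount;
by origin;
run;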
Example: Box plots II
Scatter plots
A scatter plot is a figure in which the values of two variables are
plotted along two axes.
Scatter plots help reveal whether there is any relation (linear or
non-linear) between the two variables.
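As an illustration, a scatter plot of car length against discount, colored by origin, could be drawn with PROC SGPLOT. This is a sketch that assumes the variable discount exists in work.car_data; the title text is illustrative.

```sas
proc sgplot data=work.car_data;
    title 'Car length against discount, by origin';
    /* discount on the horizontal axis, length on the vertical axis */
    scatter x=discount y=length / group=origin;
run;
title;  /* clear the title so it does not carry over to later output */
```

The GROUP= option assigns a different marker color to each origin and adds a legend automatically.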
Example: Scatter plots
SAS procedure GPLOT can be used to produce scatter plots:
proc gplot data=work.car_data;
title 'Scatter plot of car length and discount';
plot length* discount=1;
run;
Example: Scatter plots II
Example: Scatter plots III
Scatter plots by origin:
SYMBOL1 V=circle C=black I=none height=2;
SYMBOL2 V=star C=red I=none;
SYMBOL3 V=square C=blue I=none height=2;
proc gplot data=work.car_data;
title 'Scatter plot of car length and discount by origin';
plot length* discount=origin;
run;
Example: Scatter plots IV
Summary
We covered a lot of different types of tables and figures.
However, it does not mean that one needs to use all of them.
Recall from the previous lecture that one should not overload a
report or presentation with plots and tables.
Required reading
Konasani, V. R. and Kadre, S. (2015). "Practical Business
Analytics Using SAS: A Hands-on Guide": chapters 6, 7, and 8.