Basics Tables Statistical tests Figures

CORPFIN 2503 – Business Data Analytics:

Descriptive statistics and data exploration

£ius

Week 3: August 9th, 2021

£ius CORPFIN 2503, Week 3 1/69

Outline

Basics

Tables

Statistical tests

Figures

Descriptive statistics

Descriptive statistics give an overall picture of the data.

Descriptive statistics provide:

• means, medians, and other statistical properties
• various graphs and plots
• correlation matrix
• basic statistical tests.

Descriptive statistics II

There is no universal to-do list for descriptive statistics.

Which aspects of the data to report depends on:

• data properties
• the purpose of the project.

Example

Suppose we would like to analyze whether the discount on new cars depends on the car's origin (USA, Asia, Europe).

/* Create a working copy of the data file */
DATA work.car_data;
    SET sashelp.cars;
RUN;
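
The later examples use a variable named discount, which is not a column of SASHELP.CARS and is not created on these slides. A minimal sketch, assuming (this definition is an assumption, not taken from the slides) that the discount is the percentage gap between the sticker price and the dealer invoice price:

```sas
/* Assumption: discount = (MSRP - Invoice) / MSRP, expressed in percent. */
/* MSRP and Invoice are existing columns of SASHELP.CARS.                */
DATA work.car_data;
    SET work.car_data;
    discount = 100 * (msrp - invoice) / msrp;
RUN;
```

If the course defines discount differently, only the assignment line needs to change.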

Example II

First, we look at the data, either by:

• opening the data file or
• 'printing' the data:

PROC PRINT DATA=car_data(obs=20);
RUN;

Example: Frequency distribution

Next, we should look at the frequency distribution of car origin (USA, Asia, Europe).

proc freq data=work.car_data;
    tables origin;
run;

Example: Frequency distribution III

Alternatively, we can create a pie chart:

proc gchart data=work.car_data;
    PIE origin / type=percent;
run;

Descriptive statistics III

Next, we should identify the other potential determinants of the discount (besides origin). Let’s assume they are:

• car manufacturer ('make')
• car type ('type')
• drivetrain type ('drivetrain')
• car sticker price ('MSRP')
• engine size ('enginesize') and
• car length ('length').

Descriptive statistics IV

Then we should provide a table with key statistics of our numerical variables:

• mean
• standard deviation
• minimum and maximum values
• median
• 25th and 75th percentile values.

Descriptive statistics V

What about non-numerical variables such as car manufacturer, car type, and drivetrain type?

One should code them as dummy variables (also known as indicator variables).

Then compute their key statistics as well.

Dummy variables

If we have a gender variable, then its coding is very simple:

• gender=1 if female
• gender=0 if male.

What about variables that can take more than two values, such as drivetrain type or car type?

Dummy variables II

proc freq data=work.car_data;
    tables drivetrain type;
run;

Dummy variables IV

We should create a dummy variable for each value of a non-numerical variable.

E.g., for drivetrain, we should generate 3 dummy variables:

• all=1 if 'drivetrain' is equal to All, 0 otherwise
• front=1 if 'drivetrain' is equal to Front, 0 otherwise
• rear=1 if 'drivetrain' is equal to Rear, 0 otherwise.
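
The three drivetrain dummies listed above could be created as follows. This is a sketch rather than code from the slides; the dt_ prefix on the variable names is a hypothetical naming choice.

```sas
/* One 0/1 indicator per drivetrain value.                    */
/* In SAS, a comparison evaluates to 1 if true and 0 if false. */
DATA work.car_data;
    SET work.car_data;
    dt_all   = (drivetrain = 'All');
    dt_front = (drivetrain = 'Front');
    dt_rear  = (drivetrain = 'Rear');
RUN;
```

The mean of each dummy then equals the share of cars with that drivetrain.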

Dummy variables V

For car type, this might not be practical, as there are 6 possible values and some of them occur infrequently.

E.g., there are only three hybrid cars in the sample.

In this case, one can generate dummy variables only for the more frequent values.

If there are 6 possible values, in general, it is sufficient to code only 3.

Dummy variables VI

We will code only one.

DATA work.car_data;
    SET work.car_data;
    sedan=0;
    IF type='Sedan' then sedan=1;
RUN;

Example: Summary statistics

We can either use PROC MEANS or PROC UNIVARIATE.

proc means data = work.car_data n mean std min p25 median p75 max;
    var discount msrp enginesize length sedan;
run;

proc univariate data = work.car_data;
    var discount msrp enginesize length sedan;
run;

Example: Summary statistics II

The results from PROC UNIVARIATE are less user-friendly, and one needs to compile a table manually.

Example: Summary statistics IV

One should also provide summary statistics by car origin.

proc means data = work.car_data n mean std min p25 median p75 max;
    var discount msrp enginesize length sedan;
    class origin;
run;

Example: Summary statistics VI

Let’s limit the number of decimal places in the output.

proc means data = work.car_data n mean std min p25 median p75 max maxdec=2;
    var discount msrp enginesize length sedan;
    class origin;
run;

Example: Summary statistics VIII

The results suggest that the discount might be more or less the same regardless of the car origin.

What about drivetrain type? Or car type?

Example: Summary statistics IX

proc means data = work.car_data mean std min p25 median p75 max maxdec=3;
    var discount;
    class origin drivetrain;
run;

Example: Summary statistics XI

proc means data = work.car_data mean std min p25 median p75 max maxdec=3;
    var discount;
    class origin drivetrain type;
run;

Example: Summary statistics XIII

The table produced by the previous example is incomplete.

The data seem to be too granular.

It might be better not to report that table.

Example: Summary statistics XIV

One can also use PROC TABULATE to display descriptive statistics in tabular format.

For example, to display the number of observations and the mean:

proc tabulate data=work.car_data;
    var discount;
    table discount*N discount*MEAN;
run;

Example: Summary statistics XV

Or to display the means of discount and MSRP by car type:

proc tabulate data=work.car_data;
    var discount MSRP;
    class type;
    table type, discount*MEAN MSRP*MEAN;
run;

Example: Summary statistics XVI

Or to display the means of discount and MSRP by car type and drivetrain:

proc tabulate data=work.car_data;
    var discount MSRP;
    class type drivetrain;
    table type, discount*drivetrain*MEAN MSRP*drivetrain*MEAN;
run;

Two-way tables

Two-way tables are used to illustrate the distribution of observations across two categorical variables.

proc freq data=work.car_data;
    tables origin*type / norow nocol nopercent;
run;

Correlation matrix

A correlation matrix shows the correlation coefficients for each pair of variables.

PROC CORR DATA=work.car_data;
    var discount msrp enginesize length sedan;
RUN;

Statistical tests

One should also report basic statistical tests.

The two-sample t-test is very common. One can also test whether medians are statistically different across the sub-samples.

Let’s test whether discount depends on car origin.
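
For the median comparison mentioned above, one option is PROC NPAR1WAY. The sketch below is an illustration rather than part of the original slides; it assumes that work.car_data already contains the discount variable, and it compares USA and Europe only.

```sas
/* Nonparametric two-sample tests on discount: USA vs Europe.        */
/* WILCOXON requests the rank-sum test, MEDIAN the median test.      */
/* The WHERE statement leaves exactly two levels of the CLASS variable. */
proc npar1way data=work.car_data wilcoxon median;
    class origin;
    var discount;
    where origin in ('USA', 'Europe');
run;
```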

Example: Statistical tests

First, we need to create dummy variables for car origin.

DATA work.car_data;
    SET work.car_data;
    origin_usa=0;
    IF origin='USA' then origin_usa=1;
    origin_asia=0;
    IF origin='Asia' then origin_asia=1;
    origin_europe=0;
    IF origin='Europe' then origin_europe=1;
RUN;

Example: Statistical tests II

T-test for cars produced in the USA vs cars produced elsewhere:

proc ttest data=work.car_data;
    class origin_usa;
    var discount;
run;

Example: Statistical tests IV

T-test for cars produced in the USA vs cars produced in Europe:

proc ttest data=work.car_data;
    class origin_usa;
    var discount;
    where origin ne 'Asia';
run;

Example: Statistical tests VI

In both cases, we fail to reject the null hypothesis that the mean discount is the same in both sub-samples.

In our example, one should probably conduct t-tests for all possible pairs of car origins:

• USA vs Europe
• USA vs Asia
• Europe vs Asia.
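
The three pairwise comparisons listed above can be sketched as follows. This variant is an assumption about how one might do it, not the slides' own code: it uses origin itself as the CLASS variable and restricts each run to two origins with a WHERE statement, so the dummy variables are not needed.

```sas
/* PROC TTEST needs a CLASS variable with exactly two levels, */
/* so each WHERE statement keeps two of the three origins.    */
proc ttest data=work.car_data;
    class origin;
    var discount;
    where origin in ('USA', 'Europe');
run;

proc ttest data=work.car_data;
    class origin;
    var discount;
    where origin in ('USA', 'Asia');
run;

proc ttest data=work.car_data;
    class origin;
    var discount;
    where origin in ('Europe', 'Asia');
run;
```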

Figures

One can produce a few figures to better describe the data.

Refer to Workshop #2 regarding key types of figures and plots.

Histogram

A histogram is a visual representation of the distribution of numerical data.

Steps to construct a histogram:

1. sort the data from smallest to largest value
2. divide the entire range of values into a series of intervals (bins)
3. count how many values fall into each bin
4. plot the counts as a bar chart.

Histogram: Bin width

Bin width is important:

• if too small, the histogram is too messy
• if too big, lots of information is lost.

SAS does a good job of selecting the bin width.

Histogram: Bin width II

Let’s create a few histograms using the following data:

Data: 1, 1.1, 1.15, 1.2, 1.4, 1.45, 1.5, 1.55, 1.6, 1.7, 1.75, 1.8, 1.9

Histogram: Bin width III

[Figure: three histograms of the same data. Panel 1, "Bins too narrow": bins 0.05 wide. Panel 2, "Bins too wide": bins 0.5 wide. Panel 3, "Just fine": bins 0.2 wide.]
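
The effect of bin width can be explored in SAS with the BINWIDTH= option of the HISTOGRAM statement in PROC SGPLOT. The sketch below is not from the slides: the data set name work.bins_demo and the variable name x are hypothetical, and the bin boundaries SAS picks may differ from those in the original figure.

```sas
/* The 13 values from the slide; work.bins_demo and x are hypothetical names. */
DATA work.bins_demo;
    INPUT x @@;
    DATALINES;
1 1.1 1.15 1.2 1.4 1.45 1.5 1.55 1.6 1.7 1.75 1.8 1.9
;
RUN;

/* BINWIDTH= overrides the bin width that SAS would choose automatically. */
proc sgplot data=work.bins_demo;
    title 'Bins too narrow (width 0.05)';
    histogram x / binwidth=0.05 scale=count;
run;

proc sgplot data=work.bins_demo;
    title 'Bins too wide (width 0.5)';
    histogram x / binwidth=0.5 scale=count;
run;

proc sgplot data=work.bins_demo;
    title 'Just fine (width 0.2)';
    histogram x / binwidth=0.2 scale=count;
run;

title;  /* clear the title for later output */
```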

Example: Histogram

proc univariate data = work.car_data plots;
    var discount;
run;

Example: Histogram III

Alternatively, one can use the SGPLOT procedure:

proc sgplot data=work.car_data;
    histogram discount;
run;

Example: Histogram V

To produce histograms of discount by origin:

proc univariate data=work.car_data;
    class origin;
    var discount;
    histogram discount / nrows=3;
run;

Box plots

Box plots illustrate data distribution and key statistical properties.

Box plots are not popular nowadays.

PROC UNIVARIATE or PROC BOXPLOT can be used to produce box plots.

Example: Box plots II

Source: SAS User’s Guide, p. 796.
Example: Box plots

To produce box plots of discount by origin:

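/* BY-group processing requires the data to be sorted by the BY variable */
proc sort data=work.car_data;
    by origin;
run;

proc univariate data = work.car_data plots;
    var discount;
    by origin;
run;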
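
As an alternative sketch (not from the slides), PROC SGPLOT can draw the box plots for all origins side by side in a single figure, which makes the groups easier to compare and does not require sorting the data first.

```sas
/* One vertical box plot of discount per value of origin. */
proc sgplot data=work.car_data;
    vbox discount / category=origin;
run;
```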

Scatter plots

A scatter plot is a figure in which the values of two variables are plotted along two axes.

Scatter plots help reveal whether there is any relation (linear or non-linear) between the two variables.

Example: Scatter plots

The SAS procedure GPLOT can be used to produce scatter plots:

proc gplot data=work.car_data;
    title 'Scatter plot of car length and discount';
    plot length*discount=1;
run;

Example: Scatter plots III

Scatter plots by origin:

SYMBOL1 V=circle C=black I=none height=2;
SYMBOL2 V=star C=red I=none;
SYMBOL3 V=square C=blue I=none height=2;

proc gplot data=work.car_data;
    title 'Scatter plot of car length and discount by origin';
    plot length*discount=origin;
run;
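
A grouped scatter plot can also be sketched with PROC SGPLOT, which assigns a distinct marker to each group automatically, so no SYMBOL statements are needed. This alternative is not part of the original slides; the axes match the GPLOT example (length on the vertical axis, discount on the horizontal axis).

```sas
/* GROUP= gives each origin its own marker style and a legend entry. */
proc sgplot data=work.car_data;
    title 'Scatter plot of car length and discount by origin';
    scatter x=discount y=length / group=origin;
run;

title;  /* clear the title for later output */
```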

Summary

We covered many different types of tables and figures.

However, this does not mean that one needs to use all of them.

Recall from the previous lecture that one should not overload a report or presentation with plots and tables.

Required reading

Konasani, V. R. and Kadre, S. (2015). "Practical Business Analytics Using SAS: A Hands-on Guide": chapters 6, 7, and 8.
