Basics Tables Statistical tests Figures Predictions
BUSN 7001 – Predictive and Visual Analytics for Business
Week 3: Descriptive statistics and predictions
£ius BUSN 7001, Week 3 1/80
Descriptive statistics
Descriptive statistics give an overall picture of the data.
Descriptive statistics provide:
• means, medians, and other statistical properties
• various graphs, plots, etc.
• correlation matrix
• basic statistical tests.
Descriptive statistics II
There is no universal to-do list for descriptive statistics.
Which aspects of the data to report depends on:
• data properties
• the purpose of the project.
Suppose we would like to analyze whether the discount on new cars depends on car origin (USA, Asia, Europe).
/* Creating data file: */
DATA work.car_data;
SET SAShelp.Cars;
RUN;
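The later examples use a variable called discount, but the code shown on this slide does not create it. A minimal sketch of one way it could be defined, assuming (this is not stated in the slides) that the discount is the percentage by which the dealer invoice price falls below the sticker price; MSRP and Invoice are existing variables in SASHELP.CARS:

```sas
/* Assumption: discount = percentage gap between MSRP and Invoice.
   The slides do not show the actual definition used in the lecture. */
DATA work.car_data;
    SET sashelp.cars;
    discount = 100 * (msrp - invoice) / msrp;
RUN;
```

If the lecture used a different definition (for example, a dollar amount rather than a percentage), only the assignment line would need to change.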
Example II
First, we look at the data, either by:
• opening the data file or
• ‘printing’ the data:
PROC PRINT DATA=car_data(obs=20);
RUN;
Example: Frequency distribution
Next, we should look at the frequency distribution of car origin (USA, Asia, Europe).
proc freq data=work.car_data;
tables origin;
run;
Example: Frequency distribution III
Alternatively, we can create a pie chart:
proc gchart data=work.car_data;
PIE origin / type=percent;
run;
quit;
Descriptive statistics III
Next, we should identify the other potential determinants (besides origin) of the discount. Let’s assume they are:
• car manufacturer (‘make’)
• car type (‘type’)
• drivetrain type (‘drivetrain’)
• car sticker price (‘MSRP’)
• engine size (‘enginesize’) and
• car length (‘length’).
Descriptive statistics IV
Then we should provide a table with key statistics of our numerical variables:
• mean
• standard deviation
• minimum and maximum values
• median
• 25th and 75th percentile values.
Descriptive statistics V
What about non-numerical variables such as car manufacturer, car type, and drivetrain type?
One should code them as dummy variables (also known as indicator variables).
Then compute their key statistics as well.
Dummy variables
If we have a gender variable, then its coding is very simple:
• gender=1 if female
• gender=0 if male.
What about variables that can take more than 2 values, such as drivetrain type or car type?
Dummy variables II
proc freq data=work.car_data;
tables drivetrain type;
run;
Dummy variables IV
We should create a dummy variable for each value of a non-numerical variable.
E.g., for drivetrain, we should generate 3 dummy variables:
• all=1 if ‘drivetrain’ is equal to All, 0 otherwise
• front=1 if ‘drivetrain’ is equal to Front, 0 otherwise
• rear=1 if ‘drivetrain’ is equal to Rear, 0 otherwise.
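The three drivetrain dummies described above could be created along the following lines. This is only a sketch: the output dataset name work.car_dummies and the drive_ prefix on the variable names are my own choices, not the slides'. It relies on the fact that a comparison in a SAS DATA step evaluates to 1 when true and 0 when false.

```sas
/* Sketch: one 0/1 dummy per drivetrain value (All, Front, Rear). */
DATA work.car_dummies;               /* hypothetical dataset name */
    SET sashelp.cars;
    drive_all   = (drivetrain = 'All');    /* 1 if All, 0 otherwise   */
    drive_front = (drivetrain = 'Front');  /* 1 if Front, 0 otherwise */
    drive_rear  = (drivetrain = 'Rear');   /* 1 if Rear, 0 otherwise  */
RUN;

/* The mean of each dummy is the share of cars in that category. */
PROC MEANS DATA=work.car_dummies n mean maxdec=3;
    VAR drive_all drive_front drive_rear;
RUN;
```

Because every car falls into exactly one of the three categories, the three means should add up to 1, which makes a quick sanity check.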
Dummy variables V
For car type, this might not be practical, as there are 6 possible values and some of them have low frequency.
E.g., there are only three hybrid cars in the sample.
In this case, one can generate dummy variables only for the more frequent values.
If there are 6 possible values, it is, in general, sufficient to code only 3.
Dummy variables VI
We will code only one.
DATA work.car_data;
SET work.car_data;
sedan=0;
IF type='Sedan' THEN sedan=1;
RUN;
Example: Summary statistics
We can either use PROC MEANS or PROC UNIVARIATE.
proc means data = work.car_data n mean std min p25
median p75 max;
var discount msrp enginesize length sedan;
run;
proc univariate data = work.car_data;
var discount msrp enginesize length sedan;
run;
Example: Summary statistics II
The results from PROC UNIVARIATE are less user-friendly, and one needs to compile a table manually.
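One way to reduce the manual work might be to have PROC UNIVARIATE write the statistics to a dataset with its OUTPUT statement and then print that dataset. The sketch below uses MSRP from SASHELP.CARS so that it runs on its own; the dataset name work.msrp_stats and the msrp_ variable names are hypothetical.

```sas
/* Sketch: collect selected statistics for one variable in a dataset. */
PROC UNIVARIATE DATA=sashelp.cars NOPRINT;
    VAR msrp;
    OUTPUT OUT=work.msrp_stats           /* hypothetical dataset name */
        N=msrp_n MEAN=msrp_mean STD=msrp_std
        MIN=msrp_min Q1=msrp_p25 MEDIAN=msrp_median
        Q3=msrp_p75 MAX=msrp_max;
RUN;

/* One row, one column per statistic. */
PROC PRINT DATA=work.msrp_stats NOOBS;
RUN;
```

With several analysis variables, each keyword would need one output name per variable, so for a full table PROC MEANS is usually the shorter route.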
Example: Summary statistics IV
One should also provide summary statistics by car origin.
proc means data = work.car_data n mean std min p25
median p75 max;
var discount msrp enginesize length sedan;
class origin;
run;
Example: Summary statistics VI
Let’s limit the number of decimal places in output.
proc means data = work.car_data n mean std min p25
median p75 max maxdec=2;
var discount msrp enginesize length sedan;
class origin;
run;
Example: Summary statistics VIII
The results suggest that discount might be more or less the same regardless of the car origin.
What about drivetrain type? Or car type?
Example: Summary statistics IX
proc means data = work.car_data mean std min p25
median p75 max maxdec=3;
var discount;
class origin drivetrain;
run;
Example: Summary statistics XI
proc means data = work.car_data mean std min p25
median p75 max maxdec=3;
var discount;
class origin drivetrain type;
run;
Example: Summary statistics XIII
The table on the previous slide is incomplete.
The data seem to be too granular.
It might be better not to report that table.
Example: Summary statistics XIV
One can also use PROC TABULATE to display descriptive statistics in tabular format.
For example, to display the number of observations and the mean:
proc tabulate data=work.car_data;
var discount;
table discount*N discount*MEAN;
run;
Example: Summary statistics XV
Or to display the means of discount and MSRP by car type:
proc tabulate data=work.car_data;
var discount MSRP;
class type;
table type, discount*MEAN MSRP*MEAN;
run;
Example: Summary statistics XVI
Or to display the means of discount and MSRP by car type and drivetrain:
proc tabulate data=work.car_data;
var discount MSRP;
class type drivetrain;
table type, discount*drivetrain*MEAN MSRP*drivetrain*MEAN;
run;
Summary statistics with SAS Visual Analytics
Not easy to get: click on ‘Data’, then ‘Actions’, then ‘View measure details…’.
Two-way tables
Two-way tables illustrate how observations are distributed across the combinations of two categorical variables.
proc freq data=work.car_data;
tables origin*type / norow nocol nopercent;
run;
Two-way tables with SAS Visual Analytics
Correlation matrix
A correlation matrix shows the correlation coefficients for every pair of variables.
PROC CORR DATA=work.car_data;
var discount msrp enginesize length sedan;
RUN;
Statistical tests
One should also report basic statistical tests.
The two-sample t-test is very common. One can also test whether medians are statistically different across the sub-samples.
Let’s test whether discount depends on car origin.
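For the comparison of medians mentioned above, one possible route is PROC NPAR1WAY, whose MEDIAN option requests a median test and whose WILCOXON option requests a rank-sum type test. This is a sketch rather than the lecture's own method, and it assumes that work.car_data already contains a numeric variable discount.

```sas
/* Assumes work.car_data already holds a numeric variable discount. */
PROC NPAR1WAY DATA=work.car_data MEDIAN WILCOXON;
    CLASS origin;      /* three groups: Asia, Europe, USA */
    VAR discount;
RUN;
```

With three origin groups these are k-sample tests; adding a WHERE statement that drops one origin would reduce the comparison to two groups.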
Example: Statistical tests
First, we need to create dummy variables for car origin.
DATA work.car_data;
SET work.car_data;
origin_usa=0;
IF origin='USA' THEN origin_usa=1;
origin_asia=0;
IF origin='Asia' THEN origin_asia=1;
origin_europe=0;
IF origin='Europe' THEN origin_europe=1;
RUN;
Example: Statistical tests II
T-test for cars produced in the USA vs cars produced elsewhere:
proc ttest data=work.car_data;
class origin_usa;
var discount;
run;
Example: Statistical tests IV
T-test for cars produced in the USA vs cars produced in Europe:
proc ttest data=work.car_data;
class origin_usa;
var discount;
where origin ne 'Asia';
run;
Example: Statistical tests VI
In both cases, we fail to reject the null hypothesis that the mean discount is the same in both sub-samples.
In our example, one should probably conduct t-tests for all possible combinations of car origin:
• USA vs Europe
• USA vs Asia
• Europe vs Asia.
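The three pairwise comparisons listed above could be run as sketched below. Instead of the dummy variables, this version uses origin itself as the CLASS variable and a WHERE statement to leave exactly two groups in each test. It assumes that work.car_data already contains a numeric variable discount.

```sas
/* USA vs Europe: exclude Asia */
proc ttest data=work.car_data;
    class origin;
    var discount;
    where origin ne 'Asia';
run;

/* USA vs Asia: exclude Europe */
proc ttest data=work.car_data;
    class origin;
    var discount;
    where origin ne 'Europe';
run;

/* Europe vs Asia: exclude USA */
proc ttest data=work.car_data;
    class origin;
    var discount;
    where origin ne 'USA';
run;
```

PROC TTEST requires the CLASS variable to have exactly two levels, which is why each step filters out one origin.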
One can produce a few figures to better describe the data.
Refer to Workshop #2 regarding key types of figures and plots.
A histogram is a visual representation of the distribution of numerical data.
Steps to construct a histogram:
1. sort the data from smallest to largest value
2. divide the entire range of values into a series of intervals (bins)
3. count how many values fall into each bin
4. plot the results using a bar chart.
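To make the steps concrete, here is a sketch that builds a histogram 'by hand' for MSRP in SASHELP.CARS. The bin width of 10,000 and the dataset names work.msrp_bins and work.bin_counts are arbitrary choices for illustration. An explicit sort is not needed for counting, so step 1 is skipped.

```sas
/* Step 2: assign each car to a bin of width 10,000 (arbitrary choice).
   bin_start is the lower edge of the bin that contains msrp. */
data work.msrp_bins;
    set sashelp.cars;
    bin_start = floor(msrp / 10000) * 10000;
run;

/* Step 3: count how many cars fall into each bin. */
proc freq data=work.msrp_bins noprint;
    tables bin_start / out=work.bin_counts;   /* creates COUNT and PERCENT */
run;

/* Step 4: plot the counts as a bar chart. */
proc sgplot data=work.bin_counts;
    vbar bin_start / response=count;
run;
```

Bins that contain no cars produce no row in work.bin_counts, so the bar chart shows no gap for them, whereas a true histogram statement would leave the empty bin visible.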
Histogram: Bin width
Bin width is important:
• if too small, then the histogram is too messy
• if too big, then lots of information is lost.
SAS does a good job of selecting the bin width.
Histogram: Bin width II
Let’s create a few histograms using the following data:
1 1.1 1.15 1.2 1.4 1.45 1.5 1.55 1.6 1.7 1.75 1.8 1.9
Histogram: Bin width III
[Figure: three histograms of the same data. Panel 1, ‘Bins too narrow’: bin width 0.05. Panel 2, ‘Bins too wide’: bin width 0.5. Panel 3: bin width 0.2.]
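A sketch of how histograms with different bin widths could be drawn in SAS for the thirteen values listed earlier. The dataset name work.bins_demo is hypothetical, and the widths 0.05, 0.5, and 0.2 are read off the axis labels of the three panels; the exact bin boundaries SAS chooses may differ from those in the slide.

```sas
/* The thirteen example values from the slide. */
DATA work.bins_demo;                 /* hypothetical dataset name */
    INPUT x @@;
    DATALINES;
1 1.1 1.15 1.2 1.4 1.45
1.5 1.55 1.6 1.7 1.75 1.8 1.9
;
RUN;

TITLE 'Bins too narrow (width 0.05)';
PROC SGPLOT DATA=work.bins_demo;
    HISTOGRAM x / BINWIDTH=0.05 SCALE=COUNT;
RUN;

TITLE 'Bins too wide (width 0.5)';
PROC SGPLOT DATA=work.bins_demo;
    HISTOGRAM x / BINWIDTH=0.5 SCALE=COUNT;
RUN;

TITLE 'Intermediate bin width (0.2)';
PROC SGPLOT DATA=work.bins_demo;
    HISTOGRAM x / BINWIDTH=0.2 SCALE=COUNT;
RUN;

TITLE;   /* clear the title for later output */
```

Leaving out BINWIDTH= lets SAS pick the width itself, which is the default behaviour the previous slide refers to.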
Example: Histogram
proc univariate data = work.car_data plots;
var discount;
run;
Example: Histogram III
Alternatively, one can use procedure SGPLOT:
proc sgplot data=work.car_data;
histogram discount;
run;
Example: Histogram V
To produce histograms of discount by origin:
proc univariate data=work.car_data;
class origin;
var discount;
histogram discount / nrows=3;
run;
Box plots illustrate data distribution and key statistical properties.
Box plots are not popular nowadays.
PROC UNIVARIATE or PROC BOXPLOT can be used to produce box plots.
Example: Box plots II
Source: SAS User’s Guide, p. 796.
Example: Box plots
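To produce box plots of discount by origin:
/* BY-group processing requires the data to be sorted by origin */
proc sort data=work.car_data; by origin; run;
proc univariate data = work.car_data plots;
var discount;
by origin;
run;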
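As an alternative sketch, PROC SGPLOT can draw the box plots for all three origins side by side in a single graph, and it does not need the data to be sorted. This assumes that work.car_data already contains a numeric variable discount.

```sas
/* One box per origin, all in the same graph. */
proc sgplot data=work.car_data;
    vbox discount / category=origin;
run;
```

Having the boxes on a common axis makes it easier to compare medians and spreads across origins than reading three separate PROC UNIVARIATE outputs.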
Scatter plots
A scatter plot is a figure in which the values of two variables are plotted along two axes.
Scatter plots help reveal whether there is any relation (linear or non-linear) between the two variables.
Example: Scatter plots
SAS procedure GPLOT can be used to produce scatter plots:
proc gplot data=work.car_data;
title 'Scatter plot of car length and discount';
plot length*discount=1;
run;
quit;
Example: Scatter plots III
Scatter plots by origin:
SYMBOL1 V=circle C=black I=none height=2;
SYMBOL2 V=star C=red I=none;
SYMBOL3 V=square C=blue I=none height=2;
proc gplot data=work.car_data;
title 'Scatter plot of car length and discount by origin';
plot length*discount=origin;
run;
quit;
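A comparable grouped scatter plot can probably be produced with PROC SGPLOT, which assigns a marker and colour to each origin automatically, so no SYMBOL statements are needed. The sketch assumes that work.car_data already contains a numeric variable discount; the axis assignment mirrors the GPLOT request (length on the vertical axis, discount on the horizontal axis).

```sas
title 'Car length and discount by origin';
proc sgplot data=work.car_data;
    scatter x=discount y=length / group=origin;
run;
title;   /* clear the title for later output */
```

The GROUP= option also adds a legend, which the reader would otherwise have to infer from the symbol definitions.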
Figures with SAS Visual Analytics
Figures are easier to make than in SAS, as no coding is needed.
Lots of options.
Figures with SAS Visual Analytics: Box plots
Suppose you are a portfolio manager and you have been asked by your client to identify ‘value’ stocks traded on the ASX.
‘Value’ stocks have stable cash flows, low growth, and less risk.
‘Growth’ stocks feature volatile cash flows, high earnings or sales growth, and are riskier.
Historically, ‘value’ stocks outperformed ‘growth’ stocks.
To keep things simple, let’s assume that value stocks have:
1. non-zero dividends
2. leverage below mean
3. beta below mean
4. ROA positive and above median
5. non-negative revenue growth but smaller than median.
This is a simplified version of the definition from Piotroski (2000) ‘The Use of Historical Financial Statement Information to Separate Winners from Losers’ (https://www.jstor.org/stable/2672906).
Solution: Import data and compute means and medians
/* Importing the data: */
PROC IMPORT OUT= WORK.ASX
DATAFILE= "C: … \ASX.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
/* Computing mean and median values: */
PROC MEANS DATA=work.ASX n min mean median max std maxdec=2;
VAR leverage beta roa revenue_growth;
RUN;
Means and medians
Solution: Removing firms with missing dividend yield etc.
DATA work.ASX;
SET work.ASX;
IF dividend_yield ne .;
IF dividend_yield>0;
IF leverage<110.12; /* Mean */
IF beta<1.1378009; /* Mean */
IF roa>-10.92; /* Median */
IF revenue_growth>0;
IF revenue_growth<10.40; /* Median */
RUN;
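The thresholds above are typed in by hand from the PROC MEANS output. A sketch of how they might instead be computed and applied in code, so that they update automatically if the data change: the dataset names work.cutoffs and work.value_stocks and the cutoff variable names are hypothetical, and the filters mirror those in the code above. It is meant to be run on the freshly imported work.ASX, before that dataset is overwritten by the filtering step.

```sas
/* Store the mean and median cutoffs in a one-row dataset. */
PROC MEANS DATA=work.ASX NOPRINT;
    VAR leverage beta roa revenue_growth;
    OUTPUT OUT=work.cutoffs (DROP=_TYPE_ _FREQ_)
        MEAN(leverage beta)=lev_mean beta_mean
        MEDIAN(roa revenue_growth)=roa_med rg_med;
RUN;

/* Attach the cutoffs to every row and keep only the 'value' stocks. */
DATA work.value_stocks;              /* hypothetical dataset name */
    IF _N_ = 1 THEN SET work.cutoffs;    /* cutoffs are retained for all rows */
    SET work.ASX;
    IF dividend_yield > 0;               /* also drops missing dividend yield */
    IF leverage < lev_mean;
    IF beta < beta_mean;
    IF roa > roa_med;
    IF revenue_growth > 0;
    IF revenue_growth < rg_med;
    DROP lev_mean beta_mean roa_med rg_med;
RUN;
```

Note that SAS treats a missing numeric value as smaller than any number, so the 'less than' filters here, like those in the code above, do not by themselves remove firms with missing leverage or beta.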
32 stocks are left. Let's 'print' them.
PROC PRINT DATA=work.ASX;
RUN;
Several value stocks
We covered a lot of different types of tables and figures.
However, this does not mean that one needs to use all of them.
Recall from the previous lecture that one should not overload a report or presentation with plots and tables.
Lastly, we identified 'value' stocks.
Required reading
Konasani, V. R. and Kadre, S. (2015). Practi