CORPFIN 2503 – Business Data Analytics
2021 S2, Workshop 3: Descriptive statistics and data exploration
£ius
1 Financial data of ASX �rms
Let’s download �nancial data for all stocks traded on ASX using Eikon:
1. Search for Screener
2. Filter for stocks traded in Australia (country of exchange)
3. select the following variables:
� Dividend per Share – Actual (FY0, AUD)
� Total Assets, Reported (FY0, AUD)
� Cash Flow (FY0, AUD)
� Cash Flow (FY-1, AUD)
� IPO date
� Beta
� Retained Earnings (Accumulated De�cit)(FY0, AUD)
� Total Equity (FY0, AUD)
� Total Debt to Total Equity, Percent (FY0).
4. Click on Excel icon to save the data as Excel �le.
1
2 Frequency distribution
The FREQ procedure generates frequency distribution in SAS:
Let’s �nd the frequency distribution of car type (i.e., the percentage of each car
type) using a data�le provided by SAS:
/* Creating data file: */
DATA work.car_data;
SET SAShelp.Cars;
RUN;
proc freq data=work.car_data;
tables type;
run;
Now let’s �nd cross-tabulation of two variables (two-way table):
proc freq data=work.car_data;
tables type*Cylinders;
run;
/* Cross-tabulation of both variables (two-way table) with additional options: */
proc freq data=work.car_data;
tables type*Cylinders / norow nocol nopercent;
run;
3 Summary statistics tables (in the lab and at home)
The TABULATE procedure generates tables with one or more statistics: average,
minimum, maximum values etc. (please refer to SAS help �les for more information).
Let’s �nd average discount for a new car using a data�le provided by SAS:
/* Creating a new variable: */
DATA work.car_data;
SET work.car_data;
discount=MSRP/Invoice-1;
2
RUN;
/* Finding average values of discount for different car types etc. */
/* using procedure TABULATE: */
proc tabulate data=work.car_data;
var discount;
table discount*N discount*MEAN;
run;
/* or */
proc tabulate data=work.car_data;
var discount;
table discount*(N MEAN);
run;
Let’s �nd average discount for a new car for di�erent car types:
proc tabulate data=work.car_data;
var discount;
class type;
table discount*MEAN*type;
run;
proc tabulate data=work.car_data;
var discount MSRP;
class type;
table type, discount*MEAN MSRP*MEAN;
run;
We can also create 2-dimensional table:
proc tabulate data=work.car_data;
var discount MSRP;
class type drivetrain;
table type, discount*drivetrain*MEAN MSRP*drivetrain*MEAN;
run;
3
or
proc tabulate data=work.car_data;
var discount MSRP;
class type drivetrain;
table drivetrain, discount*type*MEAN MSRP*type*MEAN;
run;
Let’s format the numbers more appropriately:
proc tabulate data=work.car_data
format=6.3; /* 6 is format (total) width, 3 is a decimal width. */
var discount MSRP;
class type drivetrain origin;
table type*origin, discount*drivetrain*MEAN;
run;
/* To make table look better. */
proc tabulate data=work.car_data
format=6.3; /* 6 is format (total) width, 3 is a decimal width. */
var discount MSRP;
class type drivetrain origin;
table type*origin, discount=”*drivetrain*MEAN=”
/ BOX=’Average discount’;
run;
One can also apply one of the built-in SAS styles.
PROC TEMPLATE;
LIST STYLES;
RUN;
For SAS in the Financial Markets Lab:
ODS HTML STYLE=OCEAN;
proc tabulate data=work.car_data
format=6.3; /* 6 is format (total) width, 3 is a decimal width. */
var discount MSRP;
4
class type drivetrain origin;
table type*origin, discount=”*drivetrain*MEAN=”
/ BOX=’Average discount’;
run;
ODS HTML CLOSE;
/* You should get a table in a green background. You might need to run the
above code twice. */
For SAS onDemand for Academics:
ods html5 (id=web) style=Styles.OCEAN;
proc tabulate data=work.car_data
format=6.3; /* 6 is format (total) width, 3 is a decimal width. */
var discount MSRP;
class type drivetrain origin;
table type*origin, discount=”*drivetrain*MEAN=”
/ BOX=’Average discount’;
run;
ODS HTML CLOSE;
ods preferences;
4 SAS procedure MEANS
Let’s use SAS procedure MEANS to create a data �le with average values:
/* Creating another data file (this time, we need only a few variables): */
DATA work.car_data;
SET SAShelp.Cars (keep = make type msrp invoice);
RUN;
/* Calculating means of MSRP and Invoice: */
proc means data=work.car_data;
var msrp invoice;
output out=work.avg_value
mean(msrp)=mean_msrp
mean(invoice)=mean_invoice;
5
run;
/* Dropping redundant variables: */
data work.avg_value;
set work.avg_value (drop = _TYPE_ _FREQ_);
run;
Let’s merge the created �le (work.avg_value) with the original �le (work.car_data)
to create a new �le (work.car_data2). The easiest way to implement this is to create
the same variable (to be precise, a constant) in both �les. The variable must have
the same value (let’s say 1):
data work.avg_value;
set work.avg_value;
x=1;
run;
data work.car_data;
set work.car_data;
x=1;
run;
Now let’s merge the �les by `x’:
DATA work.car_data2;
MERGE work.car_data work.avg_value;
BY x;
RUN;
The last step is to drop x:
data work.car_data2;
set work.car_data2 (drop = x);
run;
6
5 Descriptive statistics (at home)
Suppose you are stock analyst. You would like to identify �rms that currently do
not pay dividends but might start paying dividends soon. The previous research
shows that stock price increases at the announcement of dividend initiation. Your
task is to do descriptive analysis using the Excel �le downloaded in Task 1. Assume
that dividend payers are �rms that:
1. are larger (in terms of total assets)
2. are elder (2020 minus IPO year):
� SAS code: age=2020-year(IPO_date);
3. are less risky (in terms of beta and debt-to-equity ratio)
4. more cash �ows (cash �ow to assets ratio)
5. feature low growth (in terms on annual cash �ow growth)
6. have greater retained earnings to total equity ratio.
Suggested steps:
1. remove stocks with missing �nancial information or IPO date (except for miss-
ing dividends)
2. change the format of all numerical variables (except for IPO date) to `General’
3. rename the variables (optional)
4. save data as CSV �le
5. import the data to SAS
6. generate additional variables: dividend-payer dummy, �rm age, cash �ow to
assets ratio, cash �ow growth, retained earnings to total equity ratio
7. create a pie chart to re�ect the percentage of dividend payers and non-payers
8. report average, median, and other key statistics for all �rms, then for dividend-
payers and non-payers separately
7
9. using the results from the previous step identify variables (you may also per-
form two-sample t-tests) that seem to be very di�erent across the two groups
� they might help identify dividend-non-payers that might initiate dividends
soon
10. use one of the identi�ed variables in the previous step and create histograms
and box plots separately for dividend-payers and non-payers
11. report the correlation matrix for all �rms, then for dividend-payers and non-
payers separately
12. generate 7 dummy (indicator) variables for each of the 6 criteria (�rm size,
�rm age etc.) whether a �rm is above/below the 75th percentile; for example,
if a �rm’s total assets are below the 75th percentile then the dummy variable
is 1 and 0 otherwise:
� use SAS procedure MEANS with option `OUTPUTOUT=’ and `P75(. . . )=’
instead of `MEAN(. . . )=’
� merge the ASX �rm �le with the newly generated �le
� generate dummy variables using `IF. . . THEN. . . ;’ statements
13. create 7 two-way tables for dividend-payers and non-payers using those 7
dummy variables
14. create a few scatter plots as well
15. try to identify a few �rms that currently do not pay dividends but might start
paying dividends soon; the easiest way is to consider factors identi�ed in step
9) and all �rms that currently do not pay dividends, then drop �rms which
assets (assuming that assets are among those identi�ed variables) are below
the 75th percentile, then drop �rms which age (assuming that age is among
those identi�ed variables) is below the 75th percentile and so on.
8
Financial data of ASX firms
Frequency distribution
Summary statistics tables (in the lab and at home)
SAS procedure MEANS
Descriptive statistics (at home)