CORPFIN 2503 – Business Data Analytics
2021 S2, Workshop 2: Visual analytics and data mining
£ius
1 Time series stock market data
Let’s download time series stock market data for all stocks traded on ASX using
Eikon:
1. Search for Screener
2. Filter for stocks traded in Australia (country of exchange)
3. Click button `Launch Data Item Library’ (on the right of `Add Column’ bar)
4. Search for `price close’
5. Choose the following parameters: select `Series’, calendar month (CM), last
26 observations, ordered from oldest to newest
6. Click on Excel icon to save the data as Excel �le.
2 Lags and leads
Unfortunately, creating lagged and lead values of variables is not an easy task. Let’s
create lagged and lead values using the example below.
/* Creating data file: */
DATA work.citi;
SET SAShelp.Citimon;
RUN;
1
/* Creating another file with smaller number of observations: */
DATA work.citi_sh;
SET work.citi (keep = date FSPCOM LHUR EEC EXVUS);
IF FSPCOM ne .;
IF LHUR ne .;
IF EEC ne .;
IF EXVUS ne .;
RUN;
/**** Let’s create a lagged value for FSPCOM: 2-step procedure: ****/
/* 1) Sorting data by date: */
proc sort data=work.citi_sh;
by date;
run;
/* 2) Creating a lagged value using a command ‘LAG(…)’: */
data work.citi_sh;
set work.citi_sh;
lag_FSPCOM=LAG(FSPCOM);
run;
/**** Let’s create a lead value for FSPCOM: 2-step procedure: ****/
/* 1) Sorting data by date but in descending order: */
proc sort data=work.citi_sh;
by descending date;
run;
/* 2) Creating a lead value using a command ‘LAG(…)’: */
data work.citi_sh;
set work.citi_sh;
lead_FSPCOM=LAG(FSPCOM);
run;
/* Let’s sort the data in the original order (i.e., by date): */
proc sort data=work.citi_sh;
by date;
run;
2
3 Procedure CORR
SAS procedure CORR generates correlation coe�cients. The procedure allows to
display correlation matrix in the HTML window as well as to save them as a separate
�le. Let’s consider example below.
/* Let’s find the correlation matrix: */
PROC CORR DATA=work.citi_sh;
RUN;
/* Let’s create a new file with the correlation coeficients of */
/* lead_FSPCOM with the rest of variables (the command ‘OUTP=’ */
/* saves Pearson correlation coefficients as a new file): */
PROC CORR DATA=work.citi_sh
OUTP=work.corr_citi_sh;
WITH lead_FSPCOM;
RUN;
/* Removing redundant variables */
data work.corr_citi_sh;
set work.corr_citi_sh;
if _NAME_=”lead_FSPCOM”;
drop _NAME_;
run;
/* Transposing the data in order we could sort the correlation */
/* coefficients if needed */
proc transpose data=work.corr_citi_sh
out=work.corr_citi_sh2;
run;
/* Renaming the variable */
data work.corr_citi_sh2;
set work.corr_citi_sh2;
rename col1=corr;
run;
3
4 Various plots (in the lab and at home)
For this task, you should familiarize yourself with key plots available on SAS. Below,
I provide the necessary code.
/* Generating a new variable (return on index) */
data work.citi_sh;
set work.citi_sh;
return=FSPCOM/lag_FSPCOM-1;
run;
/* Option PLOTS generates stem-and-leaf, box and qq plots of the variable(s) */
proc univariate data=work.citi_sh PLOTS;
var return;
RUN;
/* Option HISTOGRAM generates histogram */
proc univariate data=work.citi_sh;
var return;
HISTOGRAM / NORMAL (COLOR=RED);
RUN;
/* Creating data file: */
DATA work.car_data;
SET SAShelp.Cars;
RUN;
/* Vertical bar chart */
proc gchart data=work.car_data;
VBAR type / type=percent;
run;
quit;
/* Horizontal bar chart */
proc gchart data=work.car_data;
HBAR type / type=percent;
4
run;
quit;
/* Pie bar chart */
proc gchart data=work.car_data;
PIE type / type=percent;
run;
quit;
/* One can create a stacked bar chart using the SUBGROUP= option: */
proc gchart data=work.car_data;
VBAR type / type=percent
subgroup=DriveTrain;
run;
quit;
/* The same thing for horizontal bar chart: */
proc gchart data=work.car_data;
HBAR type / type=percent
subgroup=DriveTrain;
run;
quit;
/* or */
proc gchart data=work.car_data;
HBAR type / type=percent
group=DriveTrain;
run;
quit;
/* Scatterplot: */
proc gplot data=work.car_data;
title ‘Scatter plot of MSRP and Invoice’;
plot Horsepower* Invoice=1;
run;
quit;
5
/* Scatterplot by group: */
SYMBOL1 V=circle C=black I=none height=2;
SYMBOL2 V=star C=red I=none;
SYMBOL2 V=square C=blue I=none height=2;
proc gplot data=work.car_data;
title ‘Scatter plot of Horsepower and Invoice by Drivetrain’;
plot Horsepower* Invoice=drivetrain;
run;
quit;
/* Line: */
symbol1 color=red interpol=join;
title ‘Return over time’;
proc gplot data=work.citi_sh;
plot return*date / hminor=0;
run;
quit;
/* 2 lines: */
/* First, we need to make labels shorter (otherwise, the graph will be */
/* very small) */
DATA work.citi_sh;
SET work.citi_sh;
label EXVUS=”USD index”;
label FSPCOM=”Stock index”;
RUN;
symbol1 color=black interpol=join;
symbol2 color=green interpol=join;
title ‘USD index and stock market index over time’;
proc gplot data=work.citi_sh;
plot EXVUS*DATE / caxis=black hminor=0
6
noframe;
plot2 FSPCOM*DATE / caxis=green;
run;
quit;
You may also refer to the e-books by on how to make nice graphs:
http://robslink.com/SAS/Home.htm.
5 Data mining (at home)
Suppose today is 30 June 2021. We want to predict whether Telstra’s stock price
will increase or decrease next month (ending on 31/07/2021) using the Excel �le
downloaded in Task 1.
Suggested steps:
1. remove stocks with missing stock price information
2. keep stocks with average stock price across 26 months above or equal to $2
3. you should retain around 400 stocks, including Telstra
4. delete the column `country of exchange’, rename the remaining columns of the
data set to: ticker, name, June-2019 . . . July-2021
5. save the �le as ASX.csv �le and import it to SAS
6. compute the correlation coe�cients between the Telstra’s stock price for 31/07/2019
– 30/06/2021 and other stocks for 30/06/2019 – 31/05/2021:
� transpose the data using the code below (`id ticker;’ option let us retain
the stock tickers; however, ‘.’ are replaced by ‘_’.):
proc transpose data=work.asx
out=work.asx2;
id ticker;
run;
� generate a variable which is equal to observation number; this variable
can be used instead of date in order to create lagged and lead variables:
7
http://robslink.com/SAS/Home.htm
data work.asx2;
set work.asx2;
period=_n_;
run;
� create a variable which is equal to lead values of Telstra’s stock price
� compute correlation coe�cients between the lead values of Telstra’s stock
price and other stocks using the �rst 24 observations
� transpose the data set with correlation coe�cients
7. create the variable for absolute values of correlation coe�cients using command
`ABS()’ (e.g., abs_corr=abs(corr);)
8. identify the stock with the highest absolute correlation coe�cient (you may
sort the data in order to achieve this)
9. plot the lead values of Telstra’s stock price and the price of stock identi�ed in
the previous step using the �rst 25 observations to identify the prediction and
whether the prediction was accurate.
8
Time series stock market data
Lags and leads
Procedure CORR
Various plots (in the lab and at home)
Data mining (at home)