Classification
Mariia Okuneva

1. The Stock Market Data

You can find the lab in ISLR, p. 154 ff. Smarket is part of the ISLR library.

#install.packages("ISLR") # in case you haven't installed it yet
library(ISLR)
fix(Smarket)    # view the data frame in a spreadsheet-like window
attach(Smarket) # put the variable names in the search path

Some info on this data:

help(Smarket)
Daily percentage returns for the S&P 500 stock index between 2001 and 2005 (1250 days).
Our features/regressors: Lag1 through Lag5, the percentage returns for each of the five previous trading days, and Volume, the number of shares traded on the previous day (in billions).
Our dependent variable: Direction, whether the market was Up or Down on this date.
The first step is to get some summary of your data:
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
You can see that Direction is a factor variable (Up or Down). We want to predict the direction of the market using all the information we have (returns on the previous trading days, trading Volume).
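Before modeling, it is worth checking the class balance of the response; a quick sketch (not part of the original lab):

# Class counts and proportions: 602 Down days vs. 648 Up days
table(Direction)
prop.table(table(Direction))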
Now let's have a look at the correlations and a scatterplot matrix (two-dimensional data visualization). The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set.
pairs(Smarket, col=Direction) # scatter plot
[Figure: pairs() scatterplot matrix of the Smarket variables (Year, Lag1-Lag5, Volume, Today, Direction), colored by Direction]

#cor(Smarket)
The cor() command would give an error here because the Direction variable is qualitative, so we have to exclude the respective column of the data set.
cor(Smarket[,-9]) # correlation matrix
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
##                Lag5      Volume        Today
## Year    0.029787995  0.53900647  0.030095229
## Lag1   -0.005674606  0.04090991 -0.026155045
## Lag2   -0.003557949 -0.04338321 -0.010250033
## Lag3   -0.018808338 -0.04182369 -0.002447647
## Lag4   -0.027083641 -0.04841425 -0.006899527
## Lag5    1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315  1.00000000  0.014591823
## Today  -0.034860083  0.01459182  1.000000000
You can see that correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. Any large correlation coefficients?
abs(cor(Smarket[,-9]))>0.5 # pairs with high correlation
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## Year TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## Lag1 FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## Lag2 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## Lag3 FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## Lag4 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## Lag5 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## Volume TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## Today FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
The only substantial correlation is between Year and Volume. Let’s plot Volume:
plot(Volume)
[Figure: plot of Volume against the observation index (days 1 to 1250)]
By plotting the data we see that Volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.
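To make the time trend explicit (a small addition, not part of the original lab), you can average Volume by calendar year:

# Average daily trading volume (in billions of shares) per year
tapply(Volume, Year, mean)
# Or compare the per-year distributions
boxplot(Volume ~ Year, ylab = "Volume")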
2. Logistic Regression
Next, we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = binomial, data = Smarket)
summary(glm.fit)

##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
##     Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -1.446  -1.203   1.065   1.145   1.326
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
Which parameter estimate is associated with the smallest p-value?
coef(summary(glm.fit))
##                 Estimate Std. Error    z value  Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3         0.011085108 0.04993854  0.2219750 0.8243333
## Lag4         0.009358938 0.04997413  0.1872757 0.8514445
## Lag5         0.010313068 0.04951146  0.2082966 0.8349974
## Volume       0.135440659 0.15835970  0.8552723 0.3924004
coef(summary(glm.fit))[,4]
## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5
##   0.6006983   0.1452272   0.3983491   0.8243333   0.8514445   0.8349974
##      Volume
##   0.3924004
which.min(coef(summary(glm.fit))[,4]) # lowest p-value

## Lag1
##    2

min(coef(summary(glm.fit))[,4])

## [1] 0.1452272
The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15 the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.
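Because logistic regression models the log-odds, exponentiating the coefficients turns them into odds ratios; a quick illustration (not in the original lab):

# exp(coef) = multiplicative change in the odds of "Up" per unit increase in each predictor
exp(coef(glm.fit))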
Two ways to access coefficients of the fitted model:
coef(glm.fit)
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068
## Volume
## 0.135440659
summary(glm.fit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
##                 Estimate Std. Error    z value  Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3         0.011085108 0.04993854  0.2219750 0.8243333
## Lag4         0.009358938 0.04997413  0.1872757 0.8514445
## Lag5         0.010313068 0.04951146  0.2082966 0.8349974
## Volume       0.135440659 0.15835970  0.8552723 0.3924004
summary(glm.fit)$coef[,1] # choose only coefficients
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068
## Volume
## 0.135440659
You can use the contrasts() command to see whether 1 stands for Up or Down:

contrasts(Direction) # which direction was assigned 1

##      Up
## Down  0
## Up    1
The predict() function can be used to predict the probability that the market will go up, given values of the predictors.
The type="response" option returns conditional probabilities; if no data set is supplied, the probabilities are computed for the training data:

glm.probs=predict(glm.fit,type="response")
Let’s print the first ten probabilities:
glm.probs[1:10]
5
## 1 2 3 4 5 6 7 8
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292
## 9 10
## 0.5176135 0.4888378
In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these probabilities into class labels, Up or Down.
glm.pred = rep("Down",1250) # create a vector of 1250 "Down" elements
glm.pred[1:10]

## [1] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"

#glm.probs>0.5
glm.pred[glm.probs>0.5]="Up" # assign "Up" whenever the corresponding predicted probability is >0.5
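Equivalently (a small aside, not in the original lab), both steps collapse into one ifelse() call:

glm.pred = ifelse(glm.probs > 0.5, "Up", "Down") # same result in one line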
Given these predictions, the table() function can be used to produce a confusion matrix/contingency table:

table(glm.pred,Direction) # contingency table comparing true Directions and the predicted ones
## Direction
## glm.pred Down Up
## Down 145 141
## Up 457 507
How to compute the fraction of days for which the prediction was correct?
(507+145)/1250 # accuracy

## [1] 0.5216

Or you could use the mean() function:

mean(glm.pred==Direction)

## [1] 0.5216
At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of 1250 observations. In other words, 100 - 52.2 = 47.8% is the training error rate, and training error tends to underestimate the test error rate.
Therefore, we should split the data into training and test set. Then we can examine how well logistic regression predicts the held out data.
We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set (days in 2005).
Now we compare predictions for 2005 to the actual movements of the market over that time period.
train = (Year < 2005) # logical vector excluding year 2005
Smarket.2005 = Smarket[!train,] # test sample, year 2005
Direction.2005 = Direction[!train] # use ! to reverse all of the elements of a Boolean vector
dim(Smarket.2005)

## [1] 252 9

glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = binomial, data = Smarket, subset = train)
glm.probs = predict(glm.fit,Smarket.2005,type="response")
glm.pred = rep("Down",252)
glm.pred[glm.probs>0.5]="Up"
table(glm.pred,Direction.2005)

## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44

Hit rate?

(77+44)/252

## [1] 0.4801587

mean(glm.pred==Direction.2005)

## [1] 0.4801587

mean(glm.pred!=Direction.2005) # test set error rate, != means not equal to

## [1] 0.5198413
The results are rather disappointing: the test error rate is 52%, which is worse than random guessing! Unpredictability of stock market returns 🙁
Maybe, by removing the variables that appear not to be helpful in predicting Direction, we can obtain a more effective model. Predictors that have no relationship with the response lead to an increase in variance without a corresponding decrease in bias.
Looking again at the model fitted on the training data, Lag1 and Lag2 have the lowest p-values among the predictors:

glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = binomial, data = Smarket, subset = train)
summary(glm.fit)

##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
##     Volume, family = binomial, data = Smarket, subset = train)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -1.302  -1.190   1.079   1.160   1.350
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.191213   0.333690   0.573    0.567
## Lag1        -0.054178   0.051785  -1.046    0.295
## Lag2        -0.045805   0.051797  -0.884    0.377
## Lag3         0.007200   0.051644   0.139    0.889
## Lag4         0.006441   0.051706   0.125    0.901
## Lag5        -0.004223   0.051138  -0.083    0.934
## Volume      -0.116257   0.239618  -0.485    0.628
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.1 on 991 degrees of freedom
## AIC: 1395.1
##
## Number of Fisher Scoring iterations: 3

Refit the model only with Lag1 and Lag2:

glm.fit = glm(Direction ~ Lag1 + Lag2, family = binomial, data = Smarket, subset=train)
glm.probs = predict(glm.fit,Smarket.2005,type="response")
glm.pred = rep("Down",252)
glm.pred[glm.probs>0.5]="Up"
table(glm.pred,Direction.2005)

## Direction.2005
## glm.pred Down Up
## Down 35 35
## Up 76 106

mean(glm.pred==Direction.2005)

## [1] 0.5595238

Now the results appear to be more promising: 56% of the daily movements have been correctly predicted.

106/(106+76) # tp/(tp+fp), precision for "Up" predictions

## [1] 0.5824176

35/(35+35) # tn/(tn+fn), precision for "Down" predictions

## [1] 0.5

On days when logistic regression predicts that the market will decline, it is only correct 50% of the time. However, on days when it predicts an increase in the market, it has a 58% accuracy rate.

Model evaluation: the Receiver Operating Characteristic (ROC) curve.
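At the 0.5 threshold the classifier corresponds to a single point on the ROC curve plotted below. As a small sanity check (not part of the original lab), that point can be read off the confusion matrix:

# True positive rate and false positive rate at the 0.5 threshold
tab = table(glm.pred, Direction.2005)
tpr = tab["Up", "Up"] / sum(tab[, "Up"])     # 106/141
fpr = tab["Up", "Down"] / sum(tab[, "Down"]) # 76/111
c(TPR = tpr, FPR = fpr)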
#install.packages("ROCR")
library(ROCR)

# Define a function to plot a ROC curve
roc.plot = function(pred, truth, ...){
  pd = prediction(pred, truth)
  pf = performance(pd, "tpr", "fpr")
  plot(pf, ...)
}

# Define a function to compute the AUC
roc.auc = function(pred, truth, ...){
  pd = prediction(pred, truth)
  auc = performance(pd, measure = "auc")@y.values
  as.numeric(auc)
}

fit1 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = binomial, data = Smarket, subset=train)
glm.probs1 = predict(fit1,Smarket.2005,type="response")
fit2 = glm(Direction ~ Lag1 + Lag2, family = binomial, data = Smarket, subset=train)
glm.probs2 = predict(fit2,Smarket.2005,type="response")

roc.plot(glm.probs1, Direction.2005, main = "ROC curve")
par(new=TRUE)
roc.plot(glm.probs2, Direction.2005, main = "ROC curve", col="red")
legend("bottomright", c("logit_6_variables", "logit_2_variables"), lty=1, col = c("black", "red"))

[Figure: ROC curves (true positive rate vs. false positive rate) for the 6-variable and the 2-variable logit models]

#AUC
roc.auc(glm.probs1, Direction.2005)

## [1] 0.5197112

roc.auc(glm.probs2, Direction.2005)

## [1] 0.5584308
Suppose that we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and -0.8.
predict(glm.fit,newdata=data.frame(Lag1=c(1.2,1.5),Lag2=c(1.1,-0.8)),type="response")

##         1         2
## 0.4791462 0.4960939
We use the maximum likelihood (ML) estimator to fit logistic regression. Let's write a function that calculates the log-likelihood of the data at hand. Afterwards, we will use the optimization functions nlm and optim to find the parameters that maximize the log-likelihood.
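For reference, this is the quantity the function below computes (the formula is implied by the code, with $p_i$ the fitted probability of Up):

$$-\ell(b) = -\sum_{i=1}^{n}\left[y_i \log p_i + (1-y_i)\log(1-p_i)\right], \qquad p_i = \frac{1}{1+\exp\{-(b_1 + x_i^{\top} b_{-1})\}},$$

where $b_1$ is the intercept and $b_{-1}$ collects the slope coefficients.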
# this function calculates the negative log-likelihood of the data (matrix X and vector y)
# input: only the parameter vector!
loglikeli <- function(b) {
  p = length(b)-1 # the number of regressors excluding the intercept
  # Check conformability
  if (dim(X)[2] != p) stop("X and b are not conformable") # same number of parameters (without intercept)
  if (dim(X)[1] != length(y)) stop("X and y are not conformable") # same number of observations in X and y
  # Probabilities
  probs = 1/(1+exp(-(b[1]+X %*% b[2:(p+1)])))
  # Log-likelihood
  loglikeli = sum(y*log(probs)+(1-y)*log(1-probs))
  neg.ll = -loglikeli # nlm and optim are designed for minimization
  return(neg.ll)
}
The function is ready! Now create the X matrix with the regressors and the y vector with the dependent dummy variable:

X = cbind(Lag1,Lag2)
y = rep(0, times=length(Direction))
y[Direction=="Up"] = 1

Initial values:

start.b = c(1,1,1)

Use the optim function to maximize the log-likelihood:

opt.b = optim(start.b,loglikeli)$par
cat("The estimated coefficients are", opt.b)

## The estimated coefficients are 0.07401392 -0.07160812 -0.04447275

Compare with the glm output:

glm.fit = glm(Direction ~ Lag1 + Lag2, family = binomial, data = Smarket)
coef(glm.fit)

## (Intercept)        Lag1        Lag2
##  0.07424989 -0.07151013 -0.04449594

Compare with the nlm function:

nlm(loglikeli, start.b)$estimate

## Warning in nlm(loglikeli, start.b): NA/Inf replaced by maximum positive value
## Warning in nlm(loglikeli, start.b): NA/Inf replaced by maximum positive value

## [1] 0.07424950 -0.07151051 -0.04449632
3. Linear Discriminant Analysis
Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library. The syntax is identical to that of lm().
We fit the model using only the observations before 2005.
library(MASS)
lda.fit = lda(Direction ~ Lag1 + Lag2, data=Smarket, subset=train)
lda.fit

## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
The LDA output indicates that 49.2% of the training observations correspond to days during which the market went down.
Group means: the average of each predictor within each class, used by LDA as estimates of the class means mu_k. There is a tendency for the previous 2 days' returns to be negative on days when the market increases. It looks like there may be a slight difference between these two groups; the classes seem to be a little separated.
The coefficients of linear discriminants output provides the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule. If -0.642*Lag1 - 0.514*Lag2 is large, then the LDA classifier will predict a market increase.
The plot() function produces plots of the linear discriminants for each observation:

plot(lda.fit)
[Figure: histograms of the linear discriminant values for group Down and group Up]
These two histograms look very similar, which is not a big surprise, because it is very difficult (if at all possible) to predict the movement of the stock market.
Now we will see how well our rule predicts on year 2005:

lda.pred = predict(lda.fit,Smarket.2005) # prediction for test data
names(lda.pred)

## [1] "class" "posterior" "x"

class(lda.pred)

## [1] "list"

lda.pred$class[10:15]

## [1] Up Up Down Up Up Up
## Levels: Down Up

data.frame(lda.pred)[10:15,] # lda.pred is a list; use data.frame to display observations 10 to 15
##      class posterior.Down posterior.Up         LD1
## 1008    Up      0.4844026    0.5155974  0.35068642
## 1009    Up      0.4906963    0.5093037  0.05897298
## 1010  Down      0.5119988    0.4880012 -0.92794134
## 1011    Up      0.4895152    0.5104848  0.11370190
## 1012    Up      0.4706761    0.5293239  0.98783874
## 1013    Up      0.4744593    0.5255407  0.81206862
class: predictions about the movement of the market.
posterior: a matrix whose k-th column contains the posterior probability that the corresponding observation belongs to the k-th class.
x: the linear discriminants.
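The LD1 column can also be reproduced by hand from the stored scaling vector. A minimal sketch, assuming MASS centers the predictors at the prior-weighted mean of the group means (the names ctr, X.2005, and LD1.byhand are illustrative):

# Hand-computed LD1 scores; should match lda.pred$x up to numerical precision
ctr = colSums(lda.fit$prior * lda.fit$means) # assumed centering point
X.2005 = as.matrix(Smarket.2005[, c("Lag1","Lag2")])
LD1.byhand = scale(X.2005, center = ctr, scale = FALSE) %*% lda.fit$scaling
head(LD1.byhand)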
lda.class = lda.pred$class # predictions about the movement of the market
table(lda.class,Direction.2005) # confusion matrix

## Direction.2005
## lda.class Down Up
## Down 35 35
## Up 76 106

mean(lda.class==Direction.2005)

## [1] 0.5595238
Correct classification rate: 0.56. The LDA and logistic regression predictions are almost identical.
Applying a 50% threshold to the posterior probabilities allows us to recreate the predictions contained in lda.pred$class.
sum(lda.pred$posterior[,1]>=0.5)

## [1] 70

sum(lda.pred$posterior[,1]<0.5)

## [1] 182
More about output:
lda.pred$posterior[1:20,1] #posterior probability that the market will decrease for 20 observations
## 999 1000 1001 1002 1003 1004 1005 1006
## 0.4901792 0.4792185 0.4668185 0.4740011 0.4927877 0.4938562 0.4951016 0.4872861
## 1007 1008 1009 1010 1011 1012 1013 1014
## 0.4907013 0.4844026 0.4906963 0.5119988 0.4895152 0.4706761 0.4744593 0.4799583
## 1015 1016 1017 1018
## 0.4935775 0.5030894 0.4978806 0.4886331
lda.class[1:20] #predictions for 20 observations
## [1] Up Up Up Up Up Up Up Up Up Up Up Down Up Up Up
## [16] Up Up Down Up Up
## Levels: Down Up
Change the threshold: predict a market decrease only if the posterior probability of the market going down is at least 90%:
sum(lda.pred$posterior[,1]>0.9)
## [1] 0
No days in 2005 meet that threshold! Visualizations:
# conditional histograms of the classes
d1 <- Lag1[Direction=="Down"]
u1 <- Lag1[Direction=="Up"]
d2 <- Lag2[Direction=="Down"]
u2 <- Lag2[Direction=="Up"]
hist(d1, col="red",breaks=30)
hist(u1, col="blue",breaks=30, add=T)
[Figure: overlaid histograms of Lag1 on Down days (red) and Up days (blue)]

hist(d2, col=rgb(1,0,0,0.5),breaks=30)
hist(u2, col=rgb(0,0,1,0.5),breaks=30, add=T)
[Figure: overlaid histograms of Lag2 on Down days (red) and Up days (blue)]

There is a tendency for the previous 2 days' returns to be lower (negative) on days when the market increases.
plot(density(d1),xlim=c(-4,4),lwd=2)
lines(density(u1),col="red",lwd=2)

[Figure: kernel density estimates of Lag1 on Down days (black) and Up days (red); N = 602, bandwidth = 0.2318]

Just for the fun of it:
# 3D histogram of Lag1 and Lag2
#install.packages("plot3D")
library("plot3D")

l1.hist = hist(Lag1,50)

[Figure: histogram of Lag1 with 50 bins]

l2.hist = hist(Lag2,50)

[Figure: histogram of Lag2 with 50 bins]
# mids: the n cell midpoints
hist3D(x = l1.hist$mids, y = l2.hist$mids,
       z = l1.hist$density %*% t(l2.hist$density),
       bty = "g", scale = FALSE, expand = 20, border = "black",
       col = gg2.col(100), shade = 0.3, space = 0.1,
       theta = 20, phi = 20)

[Figure: 3D histogram of the product of the Lag1 and Lag2 densities]
hist(Lag1,50)$mids

[Figure: histogram of Lag1, drawn again as a side effect of the hist() call]

## [1] -4.9 -4.7 -4.5 -4.3 -4.1 -3.9 -3.7 -3.5 -3.3 -3.1 -2.9 -2.7 -2.5 -2.3 -2.1
## [16] -1.9 -1.7 -1.5 -1.3 -1.1 -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
## [31] 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9
## [46] 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7
l1.hist
## $breaks
## [1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4 -2.2
## [16] -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8
## [31] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
## [46] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8
##
## $counts
## [1]   1   0   0   1   1   1   0   3   3   2   3   3  11  11   5  17  24  37  31
## [20]  43  57  72  75  92 109 131 107 100  70  58  40  36  27  21  14   6  13   3
## [39]   3   3   0   2   3   2   3   2   1   0   1   0   0   0   1   1
##
## $density
## [1] 0.004 0.000 0.000 0.004 0.004 0.004 0.000 0.012 0.012 0.008 0.012 0.012
## [13] 0.044 0.044 0.020 0.068 0.096 0.148 0.124 0.172 0.228 0.288 0.300 0.368
## [25] 0.436 0.524 0.428 0.400 0.280 0.232 0.160 0.144 0.108 0.084 0.056 0.024
## [37] 0.052 0.012 0.012 0.012 0.000 0.008 0.012 0.008 0.012 0.008 0.004 0.000
## [49] 0.004 0.000 0.000 0.000 0.004 0.004
##
## $mids
## [1] -4.9 -4.7 -4.5 -4.3 -4.1 -3.9 -3.7 -3.5 -3.3 -3.1 -2.9 -2.7 -2.5 -2.3 -2.1
## [16] -1.9 -1.7 -1.5 -1.3 -1.1 -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
## [31] 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9
## [46] 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7
##
## $xname
## [1] "Lag1"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
l1.hist$mids
## [1] -4.9 -4.7 -4.5 -4.3 -4.1 -3.9 -3.7 -3.5 -3.3 -3.1 -2.9 -2.7 -2.5 -2.3 -2.1
## [16] -1.9 -1.7 -1.5 -1.3 -1.1 -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
## [31] 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9
## [46] 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7
cor(Lag1, Lag2)
## [1] -0.02629433
l1.hist$density
## [1] 0.004 0.000 0.000 0.004 0.004 0.004 0.000 0.012 0.012 0.008 0.012 0.012
## [13] 0.044 0.044 0.020 0.068 0.096 0.148 0.124 0.172 0.228 0.288 0.300 0.368
## [25] 0.436 0.524 0.428 0.400 0.280 0.232 0.160 0.144 0.108 0.084 0.056 0.024
## [37] 0.052 0.012 0.012 0.012 0.000 0.008 0.012 0.008 0.012 0.008 0.004 0.000
## [49] 0.004 0.000 0.000 0.000 0.004 0.004
dim(l1.hist$density %*% t(l2.hist$density))

## [1] 54 54
And a 'good-to-know' package for exploratory data analysis:

# ggvis package
#install.packages("ggvis")
library(ggvis)
Smarket %>% ggvis(x=~Lag1, y=~Lag2, fill=~Direction) %>% layer_points

[Figure: interactive ggvis scatterplot of Lag1 vs. Lag2, colored by Direction]
4. Quadratic Discriminant Analysis
We will now fit a QDA model to the Smarket data.
qda.fit = qda(Direction ~ Lag1 + Lag2, data=Smarket, subset=train)
qda.fit

## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
qda.class = predict(qda.fit,Smarket.2005)$class
table(qda.class,Direction.2005)

## Direction.2005
## qda.class Down Up
## Down 30 20
## Up 81 121

mean(qda.class==Direction.2005)

## [1] 0.5992063
The output contains the group means, but it does not contain the coefficients of linear discriminants, because a QDA classifier involves a quadratic, rather than a linear, function of the predictors.
Interestingly, the QDA predictions are accurate almost 60% of the time. This suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by LDA and logistic regression.
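As with logistic regression, the fitted QDA model can score specific predictor values; a small illustration (not in the original lab), reusing the two hypothetical days from above:

# Predicted class and posterior probabilities for Lag1/Lag2 = (1.2, 1.1) and (1.5, -0.8)
predict(qda.fit, newdata = data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8)))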
5. KNN
library(class)
train.X = cbind(Lag1,Lag2)[train,]
test.X = cbind(Lag1,Lag2)[!train,]
train.Direction = Direction[train]

set.seed(1)
knn.pred = knn(train.X,test.X,train.Direction,k=1)
table(knn.pred,Direction.2005)

## Direction.2005
## knn.pred Down Up
## Down 43 58
## Up 68 83

knn.pred = knn(train.X,test.X,train.Direction,k=3)
table(knn.pred,Direction.2005)

## Direction.2005
## knn.pred Down Up
## Down 48 54
## Up 63 87

With k=1 only (43+83)/252 = 50% of the test observations are correctly predicted; with k=3 the accuracy improves to (48+87)/252, roughly 53.6%.
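Rather than trying k=1 and k=3 by hand, one can loop over several candidate values; a minimal sketch (not in the original lab):

# Test-set accuracy of KNN for a few values of k
set.seed(1)
for (k in c(1, 3, 5, 10)) {
  knn.k = knn(train.X, test.X, train.Direction, k = k)
  cat("k =", k, " accuracy =", round(mean(knn.k == Direction.2005), 4), "\n")
}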