Machine Learning in Stock Price Trend Forecasting
Yuqing Dai, Yuning Zhang
yuqingd@stanford.edu, zyn@stanford.edu
I. INTRODUCTION
Predicting stock price trends by interpreting seemingly chaotic market data has
always been an attractive topic to both investors and researchers. Among the
methods that have been employed, Machine Learning techniques are very popular due to
their capacity to identify stock trends from massive amounts of data that capture the
underlying stock price dynamics. In this project, we applied supervised learning methods
to stock price trend forecasting.
According to market efficiency theory, the US stock market is a semi-strong efficient
market, which means all public information is already reflected in a stock’s current share
price, so neither fundamental nor technical analysis can be used to achieve superior
gains in the short term (a day or a week). Indeed, our initial next-day prediction has a very
low accuracy of around 50%. However, when we tried to predict the long-term stock price
trend, our models achieved a high accuracy (79%). Based on our prediction results, we built a
trading strategy on the stock, which significantly outperformed the stock itself.
II. IMPLEMENTATION
A. Data Collection
The training data used in our project were collected from the Bloomberg database. In
this project, we picked 3M stock to apply our method. The data contain daily stock
information ranging from 1/9/2008 to 11/8/2013 (1471 data points). There are 16
features that we can use to apply our learning theory. In addition, we labeled the daily
data as follows: label “1” if the closing price is higher than that of the previous day,
and label “-1” otherwise; each label is paired with the features of the previous day.
For example, if the closing price of stock A on 11/11/2013 is higher than that on
11/10/2013, and on 11/10/2013 the PE ratio, PX volume, PX ebitda, …, S&P 500 index
are X1, X2, …, X15, then the training example of A for 11/10/2013 is (X, Y),
where X = (X1, X2, …, X15)^T and Y = +1.
Stock         3M Co (NYSE: MMM)
Features      PE ratio, PX volume, PX ebitda, current enterprise value, 2-day
              net price change, 10-day volatility, 50-day moving average, 10-day
              moving average, quick ratio, alpha overridable, alpha for beta pm,
              beta raw overridable, risk premium, IS EPS, and corresponding
              S&P 500 index
Data Source   Bloomberg Data Terminal
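A minimal sketch of the labeling rule described above, assuming the daily closing prices are available as a plain array. The function name `next_day_labels` is hypothetical and does not come from the original project; it only illustrates how the features of day t are paired with the sign of the price move from day t to day t+1.

```python
import numpy as np


def next_day_labels(close):
    """Return +1/-1 labels for days 0..n-2 (hypothetical helper).

    The label for day t is +1 if the close on day t+1 is strictly higher
    than the close on day t, and -1 otherwise. The last day has no label
    because its next-day close is unknown.
    """
    close = np.asarray(close, dtype=float)
    return np.where(close[1:] > close[:-1], 1, -1)
```

The feature matrix would then be truncated to its first n-1 rows so that each row lines up with its label.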
B. Model Selection
1. Next-Day Model
In our project, we mainly applied supervised learning methods, namely Logistic
Regression, Gaussian Discriminant Analysis (GDA), Quadratic Discriminant Analysis (QDA),
and SVM. The most important result to watch is the prediction accuracy,
which we define as follows:
\[
\text{Accuracy} = \frac{\text{number of days that the model correctly classified in the testing data}}{\text{total number of testing days}}
\]
We used 70% of the data set as training data and tested our fitted models on the
remaining 30%.
Model      Logistic Regression   GDA     QDA     SVM
Accuracy   44.5%                 46.4%   58.2%   55.2%
It turned out that the next-day prediction has a very low accuracy, with the highest
accuracy (QDA) being only 58.2%. Flipping a coin would give an accuracy of roughly
50%, since the investing decision is binary. This result can be
explained by the semi-strong efficient market theory, which states that all public
information is already reflected in a stock’s current share price, meaning that neither
fundamental nor technical analysis can be used to achieve superior gains.
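The 70/30 split and the accuracy definition used above can be sketched as follows. This is only an illustration: the text does not say how the split was drawn, so the chronological split (earliest 70% for training) is an assumption, and the helper names `chronological_split` and `accuracy` are hypothetical.

```python
import numpy as np


def chronological_split(X, y, train_fraction=0.7):
    """Split time-ordered data into an early training part and a late test part.

    Assumption: rows are sorted by date and the split is chronological.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n_train = int(len(y) * train_fraction)
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


def accuracy(y_true, y_pred):
    """Fraction of testing days whose label was predicted correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))
```

A chronological split avoids training on days that come after the test days, which matters for time-series data.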
2. Long-Term Model
Although our next-day prediction results are not encouraging, we believe the financial data
of a particular stock can still provide some insight into the stock’s future movement. After
all, that’s why so many financial institutions and individual investors believe their work is
meaningful.
In particular, we think that because of market sentiment, some information is sometimes
not reflected in the stock price immediately. Besides, as investors, we also care about
longer-term predictions when designing our long-term investment strategy.
Here, we define our problem as predicting the sign of the difference between
tomorrow’s stock price and the price a certain number of days ago. Again, we used 70% of
the data set as training data and tested our fitted models on the remaining 30%.
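One possible reading of this long-term label is sketched below. It assumes that, for day t and a time window of w days, the label is the sign of the difference between the close on day t+1 and the close on day t-w; the text does not pin down the exact indexing, so this is an interpretation, and `window_labels` is a hypothetical helper name.

```python
import numpy as np


def window_labels(close, window):
    """Label day t with +1 if close[t+1] > close[t-window], else -1.

    Days without a price `window` days back, or without a next-day price,
    are left as NaN. The indexing convention is an assumption.
    """
    close = np.asarray(close, dtype=float)
    labels = np.full(len(close), np.nan)
    for t in range(window, len(close) - 1):
        labels[t] = 1.0 if close[t + 1] > close[t - window] else -1.0
    return labels
```

Repeating this for several values of `window` gives one labeled data set per time window, each of which can be split and scored in the same way as the next-day model.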
From the chart, we can see that for the SVM and QDA models, the accuracy increases
as the time window increases. Furthermore, SVM gives the highest accuracy (79.3%) when
the time window is 44 days. It is also the most stable model.
[Figure: Long-Term Prediction Accuracy. Accuracy (y-axis, 0.2 to 0.9) versus time window in days (x-axis, 0 to 60) for Accu_GDA, Accu_Logistics Reg., Accu_SVM, and Accu_QDA.]
C. Feature Selection
From the chart established by backward stepwise feature selection, we can see that
we get our highest prediction accuracy when we use all of the 16 features. This makes
sense because, with over 1400 data points but only 16 features, there is no need to reduce
the number of features.
[Figure: Feature Selection. Accuracy (y-axis, 50% to 80%) versus number of features (x-axis, 0 to 16).]
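Backward stepwise feature selection, as used for the chart above, can be sketched generically as follows. This is not the project's code: the `score` callable is a hypothetical stand-in for "train the model on this feature subset and return its test accuracy", and the function name `backward_stepwise` is ours.

```python
from typing import Callable, List, Sequence, Tuple


def backward_stepwise(
    features: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
) -> List[Tuple[int, Tuple[str, ...], float]]:
    """Greedy backward elimination.

    Start from all features; at each step drop the single feature whose
    removal leaves the highest-scoring subset. Returns one
    (number_of_features, subset, score) entry per step.
    """
    current = tuple(features)
    path = [(len(current), current, score(current))]
    while len(current) > 1:
        # Every subset obtained by removing exactly one feature.
        candidates = [tuple(f for f in current if f != drop) for drop in current]
        current = max(candidates, key=score)
        path.append((len(current), current, score(current)))
    return path
```

Plotting the score against the number of features along the returned path gives a curve of the same kind as the feature selection chart.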
III. TRADING STRATEGY
A. Predictor Characteristics
From the previous analysis, we have already determined that the best predicting
model for 3M stock is the SVM model. Here we use SVM as our predictor to
develop our trading strategy.
Predictor            SVM
Kernel               Polynomial
Number of Features   16
Time Window          44 days
B. Strategy Implementation
Initially, we used 990 of our 1470 data points to fit our model. We then used the
model to predict the stock price trend and made the corresponding investment decision on
a rolling basis, meaning that we take in new information and update our predictor every
trading day. Our back-test of the strategy runs from December 2011 to October
2013.
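The daily update described above resembles an expanding-window (walk-forward) back-test, sketched below under stated assumptions. The `fit` and `predict` callables are hypothetical placeholders for training and applying the predictor, and `label_lag` is our own addition: with a 44-day horizon, the most recent labels would not yet be known on the decision day, and the text does not say how that was handled.

```python
def walk_forward(X, y, initial_train, fit, predict, label_lag=0):
    """Refit on all data available before each day, then predict that day.

    X, y          : time-ordered features and labels (sliceable sequences)
    initial_train : number of points used for the first fit (e.g. 990)
    fit(X, y)     : hypothetical callable returning a fitted model
    predict(m, x) : hypothetical callable returning a prediction for one row
    label_lag     : number of most recent rows whose labels are not yet known
    """
    predictions = []
    for t in range(initial_train, len(y)):
        end = t - label_lag  # only rows with known labels are used for fitting
        model = fit(X[:end], y[:end])
        predictions.append(predict(model, X[t]))
    return predictions
```

Each prediction uses only information available before the day it refers to, which is what keeps the back-test free of look-ahead.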
On each of the first 44 days, we decide whether or not to buy the stock based on
our prediction of whether the stock price will go up after 44 days. After the first 44
days, we again make an investment decision on each day.
Equivalently, we can interpret the strategy as if there are 44 traders. Trader i is
responsible for trading his portfolio on days i, 44+i, …, 44n+i, …. Traders are
independent of each other.
The decision made on each day is illustrated in the following decision tree:
Prediction
    Increase
        Buy if not holding it
        Hold if already holding it
    Decrease
        Do nothing if not holding it
        Sell if already holding it
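The decision tree and the 44-trader interpretation can be sketched as follows. The function names are hypothetical, days and traders are numbered from 1 as in the text, and predictions are assumed to be encoded as +1 (increase) and -1 (decrease).

```python
def decide(prediction, holding):
    """Map a prediction (+1 increase, -1 decrease) and position to an action."""
    if prediction > 0:
        return "hold" if holding else "buy"
    return "sell" if holding else "do nothing"


def trader_for_day(day, n_traders=44):
    """Trader i acts on days i, 44+i, ..., 44n+i (1-based numbering)."""
    return (day - 1) % n_traders + 1


def run_trader(predictions):
    """Apply the decision rule to one trader's sequence of predictions."""
    holding = False
    actions = []
    for p in predictions:
        action = decide(p, holding)
        if action == "buy":
            holding = True
        elif action == "sell":
            holding = False
        actions.append(action)
    return actions
```

Because each trader only acts every 44 days, a position opened on one decision day is revisited exactly when the 44-day prediction it was based on has played out.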
From the P&L comparison plot, it is clear that our strategy outperformed
the stock, with an annualized return of 19.3% vs. 12.5%.
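For reference, one common way to annualize a back-test return is sketched below. The text does not state how the 19.3% and 12.5% figures were computed, so the compounding formula and the 252-trading-day year are assumptions, and `annualized_return` is a hypothetical helper.

```python
def annualized_return(start_value, end_value, n_trading_days, days_per_year=252):
    """Compound annualized return over a back-test period.

    Assumes geometric compounding and 252 trading days per year.
    """
    if start_value <= 0 or n_trading_days <= 0:
        raise ValueError("start_value and n_trading_days must be positive")
    total_growth = end_value / start_value
    return total_growth ** (days_per_year / n_trading_days) - 1.0
```

Applying the same function to the strategy's value series and to the stock's price series puts the two on a comparable footing.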
IV. CONCLUSION
In this project, we applied supervised learning techniques to predicting the stock
price trend of a single stock. Our findings can be summarized in three points:
1. Various supervised learning models have been used for the prediction, and we
found that the SVM model provides the highest prediction accuracy (79%) when
we predict the stock price trend on a long-term basis (44 days).
2. Our feature selection analysis indicates that we get the highest accuracy when we
use all of the 16 features. That’s because the number of data points is much
larger than the number of features.
3. The trading strategy based on our prediction achieves very positive results,
significantly outperforming the stock itself.
As for our future work, we believe we can make the following improvements:
1. Test our predictor on different stocks to assess its robustness, and try to develop a
“more general” predictor for the stock market.
2. Construct a portfolio of multiple stocks in order to diversify the risk, and take
transaction costs into account when evaluating the strategy’s effectiveness.
[Figure: P&L Comparison. Price level (y-axis, 60 to 140) versus dates (x-axis, 7/18/11 to 3/9/13) for 3M Stock Performance and Our Strategy.]