Machine learning for stock price prediction
Stock price prediction
Data mining method
COURSE: EM 623
ASSIGNMENT: FINAL PROJECT
INTRODUCTION AND BUSINESS UNDERSTANDING
Business requirements and objectives
Make accurate predictions of future stock prices
Make investment decisions based on those predictions
INTRODUCTION AND BUSINESS UNDERSTANDING
Data mining problem definition
Retrieve historical stock price data
Build models to learn price patterns from the data
Make predictions of future stock prices using the trained models
DATA UNDERSTANDING
Yahoo Finance provides an API to retrieve stock price data
We can retrieve daily price data for a given stock within a specified time interval
The data is accurate and is available for all stocks and time intervals
Open, high, low, close, volume, and adjusted close price are provided for each trading day
DATA UNDERSTANDING
Data format
Data preparation
Add a target variable that indicates whether the stock price goes up or down n days later.
Extract features from the original data
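The target variable described above could be built roughly as follows. This is a minimal sketch, not the project's actual code: the function name `make_target` is hypothetical, it assumes the closing price is the price being compared, and it assumes that "up" means strictly higher, so an unchanged price is labeled as down.

```python
from typing import List, Optional


def make_target(closes: List[float], n: int) -> List[Optional[int]]:
    """Label each day 1 if the close n days later is higher, otherwise 0.

    Assumption: ties (unchanged price) count as "down" (0).
    The last n days have no price n days later, so they are labeled None.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of trading days")
    labels: List[Optional[int]] = []
    for i, price in enumerate(closes):
        if i + n < len(closes):
            labels.append(1 if closes[i + n] > price else 0)
        else:
            labels.append(None)  # future price not available yet
    return labels


closes = [10.0, 11.0, 9.5, 12.0, 9.0]
print(make_target(closes, n=2))  # [0, 1, 0, None, None]
```

Rows labeled None would be dropped before training, since their outcome is not yet known.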
Data preparation
Feature extraction
MOM (Momentum)
Measures the amount by which the current stock price has changed from the price n days ago.
SMA (Simple Moving Average)
Average stock price over n consecutive trading days
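EMA (Exponential Moving Average)
Weighted average of past stock prices in which recent prices receive exponentially larger weights than older ones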
ROCR (Rate of change ratio)
The ratio of the current price to the price n days ago, which expresses the relative change between the two
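The four indicators on this slide can be sketched in plain Python. The function names are hypothetical, and a few details are assumptions the slides do not specify: ROCR is computed as the ratio current / previous (the percentage change is that ratio minus 1), and the EMA uses the common smoothing factor 2 / (n + 1) seeded with the first n-day SMA. Days without enough history are returned as None.

```python
from typing import List, Optional

Series = List[Optional[float]]


def _check(n: int) -> None:
    if n < 1:
        raise ValueError("n must be at least 1")


def momentum(prices: List[float], n: int) -> Series:
    """MOM: current price minus the price n days ago."""
    _check(n)
    return [None if i < n else prices[i] - prices[i - n]
            for i in range(len(prices))]


def rocr(prices: List[float], n: int) -> Series:
    """ROCR (assumed definition): current price divided by the price n days ago."""
    _check(n)
    return [None if i < n else prices[i] / prices[i - n]
            for i in range(len(prices))]


def sma(prices: List[float], n: int) -> Series:
    """SMA: mean of the latest n prices, including the current day."""
    _check(n)
    out: Series = []
    for i in range(len(prices)):
        if i + 1 < n:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(prices[i + 1 - n:i + 1]) / n)
    return out


def ema(prices: List[float], n: int) -> Series:
    """EMA with smoothing factor 2 / (n + 1), seeded with the first n-day SMA."""
    _check(n)
    out: Series = [None] * len(prices)
    if len(prices) < n:
        return out
    alpha = 2.0 / (n + 1)
    prev = sum(prices[:n]) / n
    out[n - 1] = prev
    for i in range(n, len(prices)):
        prev = alpha * prices[i] + (1 - alpha) * prev
        out[i] = prev
    return out


prices = [10.0, 11.0, 12.0, 11.0, 13.0]
print(momentum(prices, 3))
print(sma(prices, 3))
print(ema(prices, 3))
print(rocr(prices, 3))
```

Each function returns one value per trading day, so the outputs can be placed side by side as feature columns.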
Data preparation
Feature extraction
LINEARREG (Linear Regression)
Computes a predicted stock price by fitting a linear regression to the previous n days of stock prices.
BETA
The beta coefficient indicates whether the stock price is more volatile than the market
BBANDS (Bollinger Bands)
The middle band is the SMA (simple moving average)
The upper band is the middle band + 2 ∗ (standard deviation of the price over the same n days)
The lower band is the middle band − 2 ∗ (standard deviation of the price over the same n days)
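The three indicators on this slide can be sketched for a single n-day window as follows. The function names are hypothetical, and several choices are assumptions: the regression uses day indices 0..n-1 as the x values and by default returns the fitted value on the last day of the window, beta is computed as covariance(stock, market) / variance(market) on return series, and the bands use the population standard deviation.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def linreg_value(window: List[float], steps_ahead: int = 0) -> float:
    """Least-squares line through the window, evaluated at the last day
    (steps_ahead=0) or extrapolated steps_ahead days past it."""
    n = len(window)
    if n < 2:
        raise ValueError("need at least two prices to fit a line")
    xs = list(range(n))
    x_mean, y_mean = mean(xs), mean(window)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, window))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + steps_ahead)


def beta(stock_returns: List[float], market_returns: List[float]) -> float:
    """Covariance of stock and market returns divided by market variance.
    A value above 1 suggests the stock moves more than the market."""
    if len(stock_returns) != len(market_returns) or len(stock_returns) < 2:
        raise ValueError("need two return series of equal length >= 2")
    s_mean, m_mean = mean(stock_returns), mean(market_returns)
    cov = sum((s - s_mean) * (m - m_mean)
              for s, m in zip(stock_returns, market_returns))
    var = sum((m - m_mean) ** 2 for m in market_returns)
    if var == 0:
        raise ValueError("market returns have zero variance")
    return cov / var


def bollinger_bands(window: List[float],
                    width: float = 2.0) -> Tuple[float, float, float]:
    """Return (lower, middle, upper) bands for one window of prices."""
    middle = mean(window)
    sd = pstdev(window)  # assumption: population standard deviation
    return middle - width * sd, middle, middle + width * sd


print(linreg_value([1.0, 2.0, 3.0, 4.0], steps_ahead=1))
print(bollinger_bands([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Applying each function to a sliding window over the price history gives one feature value per trading day.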
Modeling
Supervised Learning using historical stock price data
Classification problem: predict whether the stock price will go up or down
Predictive Models
Decision Tree
Random Forest
Logistic Regression
Support Vector Machine
Extra Trees
Modeling
Decision Tree
Model Parameters
Min samples split: 2
Min samples leaf: 1
Max depth: 20
Modeling
Random Forest
Model Parameters
Number of trees: 10
Criterion: Gini
Min samples split: 2
Min samples leaf: 1
Max depth: 20
Modeling
Extra Trees
Model Parameters
Number of trees: 10
Criterion: Gini
Min samples split: 2
Min samples leaf: 1
Max depth: 20
Bootstrap: true
Modeling
Logistic Regression
Model Parameters
Penalty: L2
Regularization: 1.0
Max iteration: 100
Modeling
SVM
Model Parameters
Kernel: RBF
Penalty parameter: 1.0
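The parameter lists on the modeling slides could be expressed in code along these lines. The slides do not name a library, so the use of scikit-learn here is an assumption, as are the helper name `build_models` and the `random_state` argument (added only for reproducibility). "Regularization: 1.0" and "Penalty parameter: 1.0" are assumed to correspond to scikit-learn's `C` argument.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(random_state: int = 0) -> dict:
    """Return the five classifiers configured with the listed parameters."""
    # Settings shared by all three tree-based models on the slides.
    tree_params = dict(min_samples_split=2, min_samples_leaf=1, max_depth=20)
    return {
        "Decision Tree": DecisionTreeClassifier(
            random_state=random_state, **tree_params),
        "Random Forest": RandomForestClassifier(
            n_estimators=10, criterion="gini",
            random_state=random_state, **tree_params),
        "Extra Trees": ExtraTreesClassifier(
            n_estimators=10, criterion="gini", bootstrap=True,
            random_state=random_state, **tree_params),
        # L2 is scikit-learn's default penalty, so it is not passed explicitly.
        "Logistic Regression": LogisticRegression(C=1.0, max_iter=100),
        "SVM": SVC(kernel="rbf", C=1.0),
    }


models = build_models()
for name, model in models.items():
    print(name, type(model).__name__)
```

Keeping the models in one dictionary makes it easy to run the same evaluation loop over all of them.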
EVALUATION
4-fold cross-validation
Area under the ROC curve (AUC)
Mean AUC across the folds
Receiver Operating Characteristic (ROC) Curve Plot
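The evaluation procedure can be illustrated with a small, dependency-free sketch. The function names are hypothetical, and the slides do not say how the folds were formed, so the contiguous, unshuffled split below is an assumption. AUC is computed from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def kfold_indices(n_samples: int, k: int = 4) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n_samples-1 into k contiguous (train, test) folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def mean_auc(fold_aucs: Sequence[float]) -> float:
    """Average the per-fold AUC values into a single score."""
    return sum(fold_aucs) / len(fold_aucs)


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print([test for _, test in kfold_indices(8, k=4)])
```

For each fold, a model would be trained on the train indices, scored on the test indices with `auc`, and the four values combined with `mean_auc`.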
EVALUATION
Results by model
Decision Tree
Logistic Regression
SVM
Random Forest
Extra Trees
CONCLUSION
Extra trees is the best model, with a mean AUC of 0.875
Random forest is the second-best model, with a mean AUC of 0.848
Decision tree has decent performance
Logistic regression and SVM do not perform well
By using multiple decision trees, ensemble methods such as random forest and extra trees
show a large improvement over a single decision tree
LOOKING AHEAD
Try more kinds of features and feature combinations
Use feature selection techniques to keep the most useful features
Try more complex models such as artificial neural networks