Improved Stock-Price Predictions via Pre-Processing
INFORMS Data Mining Contest 2010 (2nd Place)
Christopher Hefele
www.linkedin.com/in/christopherhefele
INFORMS Annual Meeting 2010, Austin, Texas (Nov. 9, 2010)
Contest Description
• Goal: Predict whether an unnamed stock
will go up or down over the next hour
• Dataset Description
– 609 variables provided
• Other stock prices, sectoral data,
economic data, experts’ predictions,
indices
– Data given for each 5-minute period
– 5922 periods in training set
– 2539 periods in test set
Solution Overview
• Create returns variables from prices
• Time-of-Day normalization of returns
• Percentile transform of returns
• Forward stepwise variable selection
• Classifier
– Logistic regression with L2 regularization
– SVM with RBF kernel (used only briefly)
Create Returns Variables from Prices
• Target variable is the 1-hour price change in an unknown stock…but…
– Are those changes in OPEN or CLOSE prices after 1 hour? Something else?
• Created new returns variables from each stock’s prices, for later variable selection
– Return(t, Lag) = log Price1(t + Lag) − log Price2(t + 60 + Lag), where:
• Price1 & Price2 are OPEN, HIGH, LOW or LAST prices for a given stock
• Lag = one of: -5, 0 or 5 minutes
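Since the exact price field behind the target was unknown, candidate return series can be generated for every combination of price field and lag. A minimal sketch of that construction (the function name and array layout are my own; the data are 5-minute periods, so the 60-minute horizon is 12 periods):

```python
import numpy as np

def make_returns(prices1, prices2, lag):
    """Candidate return series: log Price1(t+lag) - log Price2(t+60+lag).

    prices1, prices2: 1-D arrays of 5-minute OPEN/HIGH/LOW/LAST prices.
    lag: offset in 5-minute periods (here -1, 0, or +1 for -5/0/+5 min).
    Returns an array aligned to t; entries are NaN where data is missing.
    """
    n = len(prices1)
    out = np.full(n, np.nan)
    for t in range(n):
        i = t + lag       # index of Price1 at t + Lag
        j = t + 12 + lag  # 60 minutes later = 12 five-minute periods
        if 0 <= i < n and 0 <= j < len(prices2):
            out[t] = np.log(prices1[i]) - np.log(prices2[j])
    return out
```

Each (Price1, Price2, Lag) combination yields one candidate variable, which the variable-selection step can then keep or discard.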
[Figure: timeline of stock price over time (minutes), marking samples at t-5, t, t+5 and at t+55, t+60, t+65, illustrating the candidate lags around the 60-minute prediction horizon.]
Variable Selection
• Used forward stepwise logistic regression
• Included top 3 selected variables in the final model
(to minimize overfitting)
• L2 regularization also used with an automatic
parameter tuner & K-fold cross-validation
• Why not use L1 regularization to select variables?
– Forward stepwise variable selection + L2 regularization
seemed to outperform L1 regularization on this data
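The selection loop described above can be sketched as a greedy forward search scored by K-fold ROC AUC (all names are illustrative; the fixed `C` stands in for the automatic parameter tuner mentioned on the slide):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_stepwise(X, y, n_select=3, C=1.0):
    """Greedy forward selection: at each step, add the variable that most
    improves K-fold ROC AUC of an L2-regularized logistic regression."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        best_var, best_auc = None, -np.inf
        for v in remaining:
            cols = selected + [v]
            clf = LogisticRegression(penalty="l2", C=C, max_iter=1000)
            auc = cross_val_score(clf, X[:, cols], y, cv=5,
                                  scoring="roc_auc").mean()
            if auc > best_auc:
                best_var, best_auc = v, auc
        selected.append(best_var)
        remaining.remove(best_var)
    return selected
```

Stopping at 3 variables, as the slide notes, trades a little in-sample fit for less overfitting.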
[Figure: Stock price volatility (standard deviation of 5-minute-period price changes) vs. hour of day, 9:00–16:00. Unusually large price swings at market open (>3x the average standard deviation); smallest price swings near the market close / overnight (~50% of the average standard deviation).]
Time-of-Day Normalization
• Problem: Volatility variations degrade classifiers’ accuracy
– Total error (& fit) may be dominated by the largest swings (aka ‘outliers’)
– Smallest swings may be partially ‘ignored’ when using L1 or L2
regularization (or any other penalty on larger regression weights)
• Solution: Normalize each 5-minute time period separately
– Bin each variable’s values by 5-minute time period
– Divide each bin’s values by that bin’s standard deviation
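A minimal sketch of this per-bin normalization (the bin labels would come from each observation's 5-minute time-of-day slot, e.g. all 9:30 observations across days share a bin; the function name is my own):

```python
import numpy as np

def normalize_by_time_of_day(values, period_of_day):
    """Divide each value by the standard deviation of its own
    time-of-day bin, so every bin ends up with unit volatility.

    values: 1-D array of a variable's returns.
    period_of_day: integer bin label (5-minute slot) per observation.
    """
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for p in np.unique(period_of_day):
        mask = (period_of_day == p)
        sd = values[mask].std()
        # Guard against a degenerate (constant) bin.
        out[mask] = values[mask] / sd if sd > 0 else 0.0
    return out
```

In practice the bin standard deviations would be estimated on the training set and reused on the test set, to avoid leaking test-period information.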
*Volatility = standard deviation of the set of price changes or returns
[Figure: Q-Q plot of price-change distribution quantiles vs. normal distribution quantiles (both in standard deviations; normal axis from -4 to +4, price-change axis from -10 to +10). The center (~98%) of the distribution closely follows a normal distribution; the tails (~2%) show larger-than-expected price changes, e.g. a 9-standard-deviation price move (vs. ~3.5 for a normal distribution).]
Price Jumps
• Price-change distribution is mostly normal,
but there are infrequent, large price jumps
• “Long-tail” / leptokurtic distributions of returns
often reported in the financial literature
• Power-law distribution of returns often seen in tails
• Typical causes of unusually large price swings include
earnings announcements, press releases, crashes, etc.
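One quick way to see this leptokurtosis numerically is excess kurtosis, which is 0 for a normal distribution and positive for long-tailed ones (synthetic data here; a Student-t sample stands in for heavy-tailed returns):

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: ~0 for normal data, positive ('long-tailed'
    / leptokurtic) when extreme moves are more common than a normal
    distribution predicts."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
heavy_sample = rng.standard_t(df=3, size=100_000)  # much heavier tails
```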
Percentile Transform
• Large price jumps can degrade
classifier accuracy
– Total error (& fit) driven by large swings
• Percentile Transform clamps jumps
– Values in a distribution are replaced by their
percentile in that same distribution
• E.g. (-10, 4, 3, 11, 80) → (0, .5, .25, .75, 1)
– Clamps large price swings to [0, 1]
– Provided just a small increase in classifier
accuracy when combined with variable
selection + logistic regression
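A sketch of the transform, reproducing the slide's example (a rank-based mapping to [0, 1]; `scipy.stats.rankdata` averages the ranks of tied values):

```python
import numpy as np
from scipy.stats import rankdata

def percentile_transform(x):
    """Replace each value by its percentile within the same distribution,
    mapped onto [0, 1], e.g. (-10, 4, 3, 11, 80) -> (0, .5, .25, .75, 1).
    Extreme jumps are clamped: the largest value maps to 1 no matter
    how large it is."""
    x = np.asarray(x, dtype=float)
    ranks = rankdata(x, method="average")  # ranks run 1..n
    return (ranks - 1) / (len(x) - 1)
```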
[Figure: Prediction accuracy (k-fold ROC AUC, ~0.964–0.974) vs. number of variables selected (1–6), comparing three pipelines: no preprocessor, the percentile preprocessor, and the percentile + per-period-normalization preprocessor. Logistic regression was the classifier used in this example.]
[Figure: Summary of improvements. Bar chart of accuracy (K-fold ROC AUC, 0.93–0.98) for successive steps: logistic regression on all variables; logistic regression on the most correlated variable only; adding time-lags of variable(s); time-period normalization; percentile transform; adding 2 more variables via the variable-selection algorithm; using an SVM instead of logistic regression.]
Summary of Improvements
Implementation Details
• Coded in Python utilizing:
– Scikits.Learn (machine learning library)
– SciPy & NumPy (C-extension math libraries)
• OS:
– Ubuntu Linux (Release 10.04, 64-bit)
• Hardware:
– Intel Core2 Quad (2.33 GHz)
– 8 GB memory (only ~400 MB used in the competition)
Questions?