
INFORMS Data Mining Contest 2010 (2nd Place)

Improved Stock Price Predictions via Pre-Processing

Christopher Hefele

www.linkedin.com/in/christopherhefele

Nov. 9, 2010
INFORMS Annual Meeting 2010, Austin, Texas


Contest Description

• Goal: Predict if an unnamed stock will go up or down in one hour

• Dataset Description
– 609 variables provided
• Other stock prices, sectoral data, economic data, experts’ predictions, indices
– Data given for each 5-minute period
– 5922 periods in training set
– 2539 periods in test set


Solution Overview

• Create returns variables from prices
• Time-of-Day normalization of returns
• Percentile transform of returns
• Forward stepwise variable selection
• Classifier

– Logistic regression with L2 regularization
– SVM with RBF kernel (used only briefly)


Create Returns Variables from Prices

• Target variable is the 1-hour price change in an unknown stock… but…
– Are those changes in OPEN or CLOSE prices after 1 hr? Something else?

• Created new returns variables from each stock’s prices, for later variable selection (a code sketch follows the figure below)
– Return(t, Lag) = log Price1(t + Lag) − log Price2(t + 60 + Lag), where:

• Price1 & Price2 are OPEN, HIGH, LOW or LAST prices for a given stock
• Lag = one of −5, 0, or +5 minutes


[Figure: timeline in minutes (… t−5, t, t+5 … t+55, t+60, t+65 …) marking candidate start and end prices, with question marks indicating which stock prices define the 1-hour change.]
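The slides don’t include the feature-construction code, so the following is a minimal NumPy sketch of building these candidate returns variables; the dict-of-price-arrays layout, the function name, and the 5-minute index grid are all assumptions:

```python
import numpy as np

def make_returns(prices, lags=(-1, 0, 1), horizon=12):
    """Build candidate 1-hour returns variables for one stock.

    prices  : dict mapping price type ('OPEN', 'HIGH', 'LOW', 'LAST')
              to a 1-D array sampled once per 5-minute period
    lags    : lags in 5-minute steps (-1, 0, +1 <-> -5, 0, +5 minutes)
    horizon : 12 five-minute periods = 60 minutes
    """
    n = len(next(iter(prices.values())))
    t = np.arange(n)
    returns = {}
    for name1, p1 in prices.items():
        for name2, p2 in prices.items():
            for lag in lags:
                i1, i2 = t + lag, t + horizon + lag
                ok = (i1 >= 0) & (i2 < n)   # both prices must exist
                r = np.full(n, np.nan)
                r[ok] = np.log(p1[i1[ok]]) - np.log(p2[i2[ok]])
                returns[(name1, name2, 5 * lag)] = r
    return returns
```

With 4 price types for each of Price1 and Price2 and 3 lags, this yields 48 candidate returns series per stock, all fed into the later variable-selection step.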

Variable Selection

• Used forward stepwise logistic regression
• Included the top 3 selected variables in the final model (to minimize overfitting)
• L2 regularization also used, with an automatic parameter tuner & K-fold cross-validation

• Why not use L1 regularization to select variables?
– Forward stepwise variable selection + L2 regularization seemed to outperform L1 regularization on this data (a sketch of the selection loop follows)
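A sketch of what the greedy selection loop might look like. It uses the modern scikit-learn API (LogisticRegression, cross_val_score) rather than the 2010-era Scikits.Learn the author used, and it fixes the regularization strength C, whereas the author tuned it automatically:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_stepwise(X, y, n_select=3, cv=5):
    """Greedily add the variable that most improves K-fold ROC AUC."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best_auc, best_j = -np.inf, None
        for j in remaining:
            clf = LogisticRegression(penalty='l2', C=1.0)  # C was auto-tuned
            auc = cross_val_score(clf, X[:, selected + [j]], y,
                                  scoring='roc_auc', cv=cv).mean()
            if auc > best_auc:
                best_auc, best_j = auc, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```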


[Figure: Stock Price Volatility vs. Hour of Day — stock price volatility (standard deviation of 5-minute-period price changes, 0.00–0.10) plotted against hour of day (9:00–16:00). Unusually large price swings at the market open (>3x avg. std. dev); smallest price swings at the market close / overnight (~50% of avg. std. dev).]

Time-of-Day Normalization

• Problem: Volatility variations degrade classifiers’ accuracy
– Total error (& fit) may be dominated by the largest swings (aka ‘outliers’)
– The smallest swings may be partially ‘ignored’ when using L1 or L2 regularization (or any other penalty on larger regression weights)

• Solution: Normalize each 5-min time period separately (see the sketch below)
– Bin each variable’s values by 5-min time period
– Divide each bin’s values by that bin’s standard deviation


*Volatility = standard deviation of the set of price changes or returns
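A minimal sketch of this per-bin normalization, assuming each observation carries a time-of-day label for its 5-minute period (the array layout and function name are assumptions):

```python
import numpy as np

def normalize_by_time_of_day(x, tod):
    """Divide each return by the std. dev. of its 5-minute time-of-day bin.

    x   : 1-D float array of returns, one per 5-minute period
    tod : 1-D array of time-of-day labels (e.g. '09:35'), aligned with x
    """
    x, tod = np.asarray(x, dtype=float), np.asarray(tod)
    out = np.empty_like(x)
    for label in np.unique(tod):
        mask = tod == label              # all observations in this bin
        sigma = x[mask].std()
        out[mask] = x[mask] / sigma if sigma > 0 else 0.0
    return out
```

In a strict train/test setup, the per-bin standard deviations would be estimated on the training periods and reused on the test periods.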

[Figure: Price Change Distribution vs. Normal Distribution — Q-Q plot of price-change distribution quantiles (std. deviations, −10 to +10) against normal distribution quantiles (std. deviations, −4 to +4). The center (~98%) of the distribution closely follows a normal distribution; the tails (~2%) show larger-than-expected price changes, e.g. a 9-standard-deviation price move (vs. 3.5 for a normal distribution).]

Price Jumps
• Price-change distribution is mostly normal, but… there are infrequent, large price jumps
• “Long-tail” / leptokurtic distributions of returns are often reported in the financial literature
• Power-law distributions of returns are often seen in the tails
• Typical causes of unusually large price swings include earnings announcements, press releases, crashes, etc.
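As an illustrative check (not part of the original solution), the heavy tails can be quantified with the sample excess kurtosis, which is 0 for a normal distribution and positive for a leptokurtic one:

```python
import numpy as np
from scipy.stats import kurtosis

def tail_report(returns):
    """Print excess kurtosis and the largest move in std.-dev. units."""
    r = np.asarray(returns, dtype=float)
    r = r[~np.isnan(r)]
    z = (r - r.mean()) / r.std()         # standardize
    print("excess kurtosis: %.2f (0 for a normal distribution)"
          % kurtosis(z))
    print("largest |move|:  %.1f std. devs." % np.abs(z).max())
```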


Percentile Transform
• Large price jumps can degrade classifier accuracy
– Total error (& fit) driven by the largest swings

• Percentile Transform clamps jumps (see the sketch below)
– Values in a distribution are replaced by their percentile in that same distribution
• E.g. (−10, 4, 3, 11, 80) → (0, .5, .25, .75, 1)
– Clamps large price swings to [0, 1]
– Provided just a small increase in classifier accuracy when combined with variable selection + logistic regression
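A minimal sketch of the transform using scipy.stats.rankdata; the exact ranking convention the author used is not specified in the slides, but this one reproduces the example above:

```python
import numpy as np
from scipy.stats import rankdata

def percentile_transform(x):
    """Replace each value by its percentile within the same distribution,
    mapping the smallest value to 0 and the largest to 1."""
    ranks = rankdata(x)                  # 1 .. n (average ranks for ties)
    return (ranks - 1) / (len(x) - 1)
```

Applied to (−10, 4, 3, 11, 80) this returns (0, .5, .25, .75, 1), matching the slide’s example.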


[Figure: Prediction Accuracy vs. Preprocessors Used with Variable Selection — prediction accuracy (k-fold ROC AUC, 0.964–0.974) vs. number of variables selected (stepwise, 1–6), for three options: no preprocessor, percentile preprocessor, and percentile + per-period normalization preprocessor. Logistic regression was the classifier used in this example.]

[Figure: Summary of Improvements — accuracy (K-fold ROC AUC, 0.93–0.98) for each successive modeling step, from baseline upward: logistic regression on all variables; logistic regression on the most correlated variable only; add time-lags of variable(s); time-period normalization; percentile transform; add 2 more variables via the variable selection algorithm; use SVM instead of logistic regression.]

Implementation Details

• Coded in Python utilizing:
– Scikits.Learn (machine learning library)
– SciPy & NumPy (C-extension math libraries)

• OS:
– Ubuntu Linux (Release 10.04, 64-bit)

• Hardware:
– Intel Core2 Quad (2.33 GHz)
– 8 GB memory (only ~400 MB used in the competition)



Questions?