Lecture 11
Course Review
Open-book Exam
Five Questions
The methods tested will not go beyond what you learned in this class (Week 1 ~ Week 10).
Review the lecture notes, tutorial notes, assignments, and quizzes!
Course Map
Course Map (Week 2&3)
Course Map (Week 4)
Course Map (Week 5&6)
Course Map (Week 8)
Association Rule
• Rule Measurement: Support, Confidence, Lift
• Rule Mining: Frequent rule set mining, Frequent rule mining
Course Map (Week 9)
Course Map (Week 10)
Descriptive Analytics
Descriptive Analytics
Recall text analytics
Descriptive Analytics
Measures of Central Tendency
• Mean, Median, Mode
• Minimum, Maximum, Percentiles, and Quartiles
Measures of Dispersion
• Range
• Interquartile range (IQR)
• Variance / Standard Deviation / Coefficient of Variation
Descriptive Analytics
Measures of Distribution Shape
• Skewness (Coefficient of Skewness, CS)
• Kurtosis
Descriptive Analytics
Relationship Between Multiple Variables
• Covariance (Population/Sample)
• Correlation
Descriptive Analytics
Feature Selection & Dimension Reduction
Choosing Variables to Remove
• Remove non-informative feature vectors directly.
Principal Component Analysis (PCA)
• Find a linear combination of the quantitative variables that contains most, even if not all, of the information.
PCA
[Figure: example data table with attributes X(1) through X(6).]
PCA
[Figure: 2-D data points projected onto a line; the perpendicular distances are the projection errors.]
• For 2D → 1D, we must find a vector u(1) onto which to project the data so as to minimize the projection error.
  (u(1) and -u(1) give the same projection.)
PCA
• To reduce from n dimensions to k dimensions: find k vectors (u(1), u(2), …, u(k)) onto which to project the data so as to minimize the projection error.
PCA
• We need to compute two things:
  The u vectors (the new planes/directions to project onto).
  The z vectors: the new, lower-dimensional feature vectors.
PCA
PCA finds a transformation of the data that satisfies the following properties:
• Each pair of new attributes has 0 covariance (for distinct attributes).
• The attributes are ordered with respect to how much of the variance of the data each attribute captures.
• The first attribute captures as much of the variance of the data as possible.
• Each successive attribute captures as much of the remaining variance as possible.
PCA: Key Concepts
Covariance Matrix S:
• Given an m-by-n data matrix D, whose m rows are data objects and whose n columns are attributes: if D is preprocessed so that the mean of each attribute is 0, then S = D^T D.
Eigenvalues of S:
• Let λ1, …, λn be the eigenvalues of S. The eigenvalues are all non-negative and can be ordered such that λ1 ≥ λ2 ≥ … ≥ λn-1 ≥ λn.
Eigenvectors of S:
• Let U = [u(1), …, u(n)] be the matrix of eigenvectors of S, ordered so that the ith eigenvector corresponds to the ith largest eigenvalue.
PCA: Data Transformation
• The data matrix D′ = DU is the set of transformed data that satisfies the conditions posed above.
• U is an [n x n] matrix, and it turns out the columns of U are the u vectors we want. So to reduce a system from n dimensions to k dimensions, just take the first k columns of U, denoted U_reduce.

D′ = DU
• D is [m x n], U is [n x n]
• D′ is [m x n]

D′ = D U_reduce
• D is [m x n], U_reduce is [n x k]
• D′ is [m x k]

z = (U_reduce)^T x
• x is [n x 1] (original space)
• z is [k x 1] (principal space)
• z1 = (u(1))^T x
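A minimal NumPy sketch of this transformation (the small data matrix D below is made up for illustration; in class this step was done with a software package):

```python
import numpy as np

# Toy data matrix D: m objects (rows) x n attributes (columns)
D = np.array([[2.5, 2.4, 1.1],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.0],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

D = D - D.mean(axis=0)            # center each attribute so its mean is 0
S = D.T @ D                       # covariance matrix (up to a 1/(m-1) factor)

eigvals, U = np.linalg.eigh(S)    # eigen-decomposition of the symmetric matrix S
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]   # sort so u(1) has the largest eigenvalue

k = 2
U_reduce = U[:, :k]               # first k columns of U
D_prime = D @ U_reduce            # transformed data D' = D * U_reduce, [m x k]

x = D[0]                          # one object in the original n-dimensional space
z = U_reduce.T @ x                # its k-dimensional representation, z = (U_reduce)^T x
print(D_prime.shape, z)
```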
Use a software package to run PCA
1. Original dataset [77 x 13]
2. Correlation matrix (using it is equivalent to standardizing the data)
Use a software package to run PCA
3. Get the principal components
• The dimension is reduced from 13 to 4.
• Only four of them have an eigenvalue greater than 1.
4. Get the PCs (new features z) from the original features
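The "eigenvalue greater than 1" rule can be reproduced from the correlation matrix, which is equivalent to running PCA on standardized data. A sketch (the random matrix below merely stands in for the 77 x 13 dataset, which is not shown here):

```python
import numpy as np

def components_with_eigenvalue_above_one(X):
    """Count principal components of the correlation matrix with eigenvalue > 1."""
    R = np.corrcoef(X, rowvar=False)                # correlation matrix (self-standardized)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # eigenvalues, largest first
    return int((eigvals > 1).sum()), eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(77, 13))                       # stand-in for the original [77 x 13] data
k, eigvals = components_with_eigenvalue_above_one(X)
print(k, np.round(eigvals[:4], 2))
```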
Time Series
Time Series Analysis
• Simple Models (Week 5): Moving Average, Exponential Smoothing, Random Walk Model
• Component Decomposition (Weeks 5 & 6): Trend Analysis, Seasonal Effect
• Regression-based Models (Weeks 5 & 6): Auto-regression, Other Causal Effects, Additional topics
Course Map
Simple Moving Average
• Average out random fluctuations in a time series to infer short-term changes in direction.
• Assumption: future observations will be similar to the recent past.
• Forecast = average of the most recent m observations:
  F_{t+1} = (Y_t + … + Y_{t-m+1}) / m
• So, how do you decide m? What if m = 1?
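A minimal sketch of the m-period moving-average forecast (the series y is made up; m is the window you have to choose):

```python
import numpy as np

def moving_average_forecast(y, m):
    # Forecast F_{t+1} as the mean of the most recent m observations.
    return float(np.asarray(y, dtype=float)[-m:].mean())

y = [112, 118, 132, 129, 121, 135, 148, 148]
print(moving_average_forecast(y, m=3))   # average of the last three observations
print(moving_average_forecast(y, m=1))   # m = 1 is the naive forecast F_{t+1} = Y_t
```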
Weighted Moving Average
• A natural extension of the moving average: weight the most recent m observations, with weights that add to 1.0.
• Higher weights on more recent observations generally give forecasts that respond more quickly to rapidly changing time series.
• Rationale: recent observations are more similar to what will happen next.
  F_{t+1} = a_1 Y_t + … + a_m Y_{t-m+1},  with a_1 + … + a_m = 1
• Of course, it is not easy to decide the weights; one solution is to use linear programming optimization. (The forecast itself is sketched below.)
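A sketch of the weighted forecast (the weights below are illustrative and simply sum to 1; finding "good" weights, e.g. by optimization, is the separate problem mentioned above):

```python
import numpy as np

def weighted_ma_forecast(y, weights):
    # F_{t+1} = a_1*Y_t + ... + a_m*Y_{t-m+1}, where the a_i sum to 1.
    w = np.asarray(weights, dtype=float)
    assert abs(w.sum() - 1.0) < 1e-9, "weights must sum to 1"
    recent = np.asarray(y, dtype=float)[-len(w):][::-1]  # most recent observation first
    return float(w @ recent)

y = [112, 118, 132, 129, 121, 135, 148, 148]
print(weighted_ma_forecast(y, weights=[0.5, 0.3, 0.2]))  # heaviest weight on Y_t
```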
Exponential Smoothing
• So: an easier approach (with one parameter):
  F_{t+1} = aY_t + a(1 - a)Y_{t-1} + a(1 - a)^2 Y_{t-2} + a(1 - a)^3 Y_{t-3} + …
• Exponential smoothing model:
  F_{t+1} = L_t = (1 - a)L_{t-1} + aY_t = L_{t-1} + a(Y_t - L_{t-1})
  (higher weights on more recent observations)
  Estimate for period t = last prediction + adjustment for the most recent forecast error.
• F_{t+1} is the forecast for time period t+1.
• L_t is the level of the series at time t (L_0 = Y_0): an estimate of where the series would be at time t if there were no random noise.
• Y_t is the observed value in period t.
• a is a constant between 0 and 1, called the smoothing constant. (How do we get it?)
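A sketch of the recursive form L_t = L_{t-1} + a(Y_t - L_{t-1}) (the series and the value of a are illustrative; in practice a is chosen, for example, by minimizing forecast error):

```python
def exponential_smoothing_forecast(y, a, L0=None):
    # Smooth the whole series and return the one-step-ahead forecast F_{t+1} = L_t.
    L = y[0] if L0 is None else L0     # L_0 = Y_0, as on the slide
    for Y_t in y[1:]:
        L = L + a * (Y_t - L)          # L_t = L_{t-1} + a*(Y_t - L_{t-1})
    return L

y = [112, 118, 132, 129, 121, 135, 148, 148]
print(exponential_smoothing_forecast(y, a=0.3))
```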
Random Walk Model
Y_t = Y_{t-1} + m + e_t
Here:
• Y_t – the observed value at time period t
• Y_{t-1} – the observed value at time period t-1
• m – the average of the differences (not necessarily 0; check autocorrelation)
• e_t – a random time series with a mean of 0 and a standard deviation of 1
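Under this model the natural forecast is the last value plus the drift m, estimated as the average one-period difference. A small sketch with made-up numbers:

```python
import numpy as np

def random_walk_forecast(y):
    # Forecast Y_{t+1} under Y_t = Y_{t-1} + m + e_t, with m = mean of the differences.
    y = np.asarray(y, dtype=float)
    m = np.diff(y).mean()            # average of Y_t - Y_{t-1}
    return float(y[-1] + m)

y = [100, 102, 101, 105, 107, 110]
print(random_walk_forecast(y))       # last observation plus the estimated drift
```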
Autoregression
• First-order autoregressive model AR(1): Y_t = a_0 + a_1 Y_{t-1} + e_t
• Second-order autoregressive model AR(2): Y_t = a_0 + a_1 Y_{t-1} + a_2 Y_{t-2} + e_t
• AR(p): which order p should we choose? Recall model assessment.
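An AR(p) fit can be sketched as ordinary least squares on lagged values (in class the model was estimated with a statistical package; this NumPy version is only an illustration):

```python
import numpy as np

def fit_ar(y, p):
    # Fit Y_t = a0 + a1*Y_{t-1} + ... + ap*Y_{t-p} + e_t by least squares.
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - p)] +
                        [y[p - j - 1:len(y) - j - 1] for j in range(p)])  # lag-(j+1) columns
    coeffs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coeffs                     # [a0, a1, ..., ap]

rng = np.random.default_rng(1)
y = [0.0]
for _ in range(300):                  # simulate an AR(1) with a0 = 1.0, a1 = 0.6
    y.append(1.0 + 0.6 * y[-1] + rng.normal())
print(fit_ar(y, p=1))                 # estimates should be close to [1.0, 0.6]
```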
Sample Question
We are going to use a time series variable and do some analysis:
Using the original value vs. using the difference(1) of the original value:

Augmented Dickey-Fuller Unit Root Tests (difference(1) of the original value)
Type         Lags   Rho        Pr < Rho   Tau     Pr < Tau   F       Pr > F
Zero Mean    0      -59.1439   <.0001     -5.97   <.0001
Zero Mean    1      -45.1897   <.0001     -4.66   <.0001
Zero Mean    2      -25.1469   0.0002     -3.27   0.0012
Single Mean  0      -74.3553   0.0013     -6.86   <.0001     23.53   0.0010
Single Mean  1      -64.7748   0.0013     -5.49   <.0001     15.08   0.0010
Single Mean  2      -38.7515   0.0013     -3.85   0.0031     7.45    0.0010
Trend        0      -74.3509   0.0005     -6.84   <.0001     23.41   0.0010
Trend        1      -64.5966   0.0005     -5.47   <.0001     15.03   0.0010
Trend        2      -38.2834   0.0008     -3.81   0.0184     7.51    0.0197

Augmented Dickey-Fuller Unit Root Tests (original value)
Type         Lags   Rho       Pr < Rho   Tau     Pr < Tau   F       Pr > F
Zero Mean    0      0.9750    0.9071     5.66    0.9999
Zero Mean    1      0.9132    0.8965     2.47    0.9969
Zero Mean    2      0.8823    0.8908     2.10    0.9916
Single Mean  0      -0.1024   0.9513     -0.26   0.9272     21.27   0.0010
Single Mean  1      -0.3804   0.9346     -0.51   0.8853     4.88    0.0425
Single Mean  2      -0.4918   0.9269     -0.61   0.8643     3.96    0.0911
Trend        0      -1.4094   0.9819     -0.79   0.9635     0.32    0.9900
Trend        1      -4.9336   0.8221     -1.45   0.8407     1.08    0.9570
Trend        2      -5.4808   0.7807     -1.47   0.8348     1.13    0.9503
If you are going to run an ARIMA model, which time series is more appropriate for the analysis: the original value or the difference(1) of the original value?
Based on the Dickey-Fuller tests, the original value is non-significant (p-values for lag 1 and lag 2 of 0.957 and 0.950, respectively). In contrast, the difference of ppi (Y_t - Y_{t-1}) has a significant Dickey-Fuller test, with a p-value of 0.001 for lag 1 and 0.02 for lag 2. Thus, the model should use the difference(1) of ppi.
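The same kind of check can be sketched outside SAS with statsmodels' adfuller (the ppi series below is simulated purely as a stand-in; only the pattern of p-values matters):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
ppi = 100 + np.cumsum(0.2 + rng.normal(size=170))   # trending, non-stationary stand-in series

for name, series in [("original", ppi), ("difference(1)", np.diff(ppi))]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF statistic = {stat:.2f}, p-value = {pvalue:.4f}")
# Expect a large p-value for the original series and a small one for the difference,
# matching the conclusion above: model the difference(1) of ppi.
```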
Sample Question
Run the model with AR(1) (on the difference(1) series) and write the equation for the AR(1) model.
Maximum Likelihood Estimation
Parameter   Estimate   Standard Error   t Value   Approx Pr > |t|   Lag
MU          0.45576    0.13136          3.47      0.0005            0
AR1,1       0.55220    0.06466          8.54      <.0001            1

Constant Estimate     0.204089
Variance Estimate     0.591113
Std Error Estimate    0.768839
AIC                   390.7895
SBC                   397.0374
Number of Residuals   168

Forecasts for variable ppi
Obs   Forecast   Std Error   95% Confidence Limits
170   103.4440   0.7688      101.9371   104.9508
diff(Y_t) = 0.204 + 0.552 * diff(Y_{t-1}) + e_t. Because diff(Y_t) = Y_t - Y_{t-1}, the formula in terms of the original series is:
Y_t = 0.204 + 1.552 Y_{t-1} - 0.552 Y_{t-2} + e_t
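The substitution can be sanity-checked numerically: computing Y_t via the differenced form and via the expanded form from the same inputs gives the same value (the starting values and shock below are arbitrary):

```python
c, phi = 0.204, 0.552                   # constant estimate and AR(1) coefficient from the output

Y_tm2, Y_tm1, e_t = 100.0, 101.0, 0.3   # arbitrary Y_{t-2}, Y_{t-1} and shock e_t

diff_form = Y_tm1 + (c + phi * (Y_tm1 - Y_tm2) + e_t)   # Y_t = Y_{t-1} + diff(Y_t)
expanded  = c + (1 + phi) * Y_tm1 - phi * Y_tm2 + e_t   # Y_t = 0.204 + 1.552*Y_{t-1} - 0.552*Y_{t-2} + e_t
print(diff_form, expanded)              # both print the same number
```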
Association Rule
• What are their support, confidence & lift?
• How do we interpret them?
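A small sketch of how the three measures are computed from transactions (the transactions below are made up for illustration):

```python
def rule_measures(transactions, antecedent, consequent):
    # Return (support, confidence, lift) of the rule antecedent -> consequent.
    n = len(transactions)
    A, C = set(antecedent), set(consequent)
    n_A  = sum(A <= t for t in transactions)          # transactions containing A
    n_C  = sum(C <= t for t in transactions)          # transactions containing C
    n_AC = sum((A | C) <= t for t in transactions)    # transactions containing both
    support    = n_AC / n
    confidence = n_AC / n_A
    lift       = confidence / (n_C / n)
    return support, confidence, lift

transactions = [{"bread", "milk"}, {"bread", "beer", "eggs"},
                {"milk", "beer", "cola"}, {"bread", "milk", "beer"},
                {"bread", "milk", "cola"}]
print(rule_measures(transactions, {"bread"}, {"milk"}))   # (0.6, 0.75, 0.9375)
```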
Association Rule Mining
[Figure: itemset lattice over items A, B, C, D, E. The itemset AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, …, ABCDE) are pruned.]
Sample Question
If we set minsup = 0.2 and minconf = 0.45, which rules are selected? Interpret the rule with the largest support/confidence.
Text Mining – Document Representation
Term Frequency (TF)
Term Frequency (TF): the frequency with which a word appears in a document.
• Let Count(i, j) be the count of term i in document j.
• Let TF(i, j) be the proportion of the count of term i in document j:
  TF(i, j) = Count(i, j) / (total number of words in document j)
Note: the total number of words is used to normalize the raw count Count(i, j). If one document has many more words than another, we need to normalize Count(i, j) by the document size.
The size of a document can also be replaced by the total frequency of terms (keywords) in the document.
Inverse Document Frequency (IDF)
Inverse document frequency: based on the number of documents that contain a specific word.
• Let N be the count of distinct documents in the corpus.
• Let Count(i) be the count of documents in the corpus in which term i is present.
• The inverse document frequency is a measure of how much information the word provides, i.e., whether it is common or rare across all documents:
  IDF(i) = log(N / Count(i))
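Putting TF and IDF together (a minimal sketch with made-up documents; note that software packages often use slightly different IDF variants):

```python
import math

docs = [["data", "mining", "is", "fun"],
        ["text", "mining", "uses", "data"],
        ["fun", "with", "text"]]

def tf(term, doc):
    return doc.count(term) / len(doc)                 # Count(i, j) / words in document j

def idf(term, docs):
    n_containing = sum(term in doc for doc in docs)   # Count(i)
    return math.log(len(docs) / n_containing)         # log(N / Count(i))

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("mining", docs[0], docs))   # "mining" appears in 2 of the 3 documents
print(tfidf("fun", docs[0], docs))
```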
Document Mining
• Association rules
• Clustering
• Classification
• Topic models
NLP Example
Model Assessment Overview
Evaluation Metrics
Prediction
• MAE or MAD (mean absolute error/deviation)
• Average error
• MAPE (mean absolute percentage error)
• RMSE (root-mean-squared error)
• Total SSE (total sum of squared errors)
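These prediction metrics map directly to code (y_true and y_pred below are illustrative):

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    return {
        "MAE":      np.mean(np.abs(err)),
        "AvgError": np.mean(err),
        "MAPE":     np.mean(np.abs(err / y_true)) * 100,   # in percent
        "RMSE":     np.sqrt(np.mean(err ** 2)),
        "TotalSSE": np.sum(err ** 2),
    }

print(prediction_metrics([10, 12, 15, 11], [9, 13, 14, 12]))
```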
Classification
• Classification matrix (also known as confusion matrix)
• Precision
• Recall
• F-measure
• Misclassification errors
Model Evaluation (CV with random subsampling)
[Figure: the dataset is randomly split into training and validation subsets K times (1st evaluation, 2nd evaluation, …, last evaluation); each split yields a validation error e1, e2, …, eK.]
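A sketch of the procedure: repeat K random train/validation splits and average the validation errors (the "model" here is just a training-mean predictor standing in for whatever model is being assessed):

```python
import numpy as np

def random_subsampling_cv(y, K=5, val_frac=0.3, seed=0):
    # Average validation error e_1, ..., e_K over K random splits.
    rng = np.random.default_rng(seed)
    n_val = int(len(y) * val_frac)
    errors = []
    for _ in range(K):
        idx = rng.permutation(len(y))
        val, train = idx[:n_val], idx[n_val:]
        y_hat = y[train].mean()                        # stand-in "model": predict the training mean
        errors.append(np.mean((y[val] - y_hat) ** 2))  # validation error e_k
    return float(np.mean(errors))

y = np.array([3.1, 2.9, 3.4, 3.0, 2.8, 3.3, 3.2, 2.7, 3.5, 3.0])
print(random_subsampling_cv(y))
```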
Model Evaluation (Rolling Forecasting)
[Figure: a time series dataset with columns T, X, Y (T = 0, 1, …, 19). Each evaluation fits the model on one portion of the series and checks the forecast on later observations, rolling forward through time until the last evaluation.]
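A sketch of rolling (rolling-origin) evaluation: fit on the observations up to time t, forecast the next period, record the error, then roll forward (the "model" here is a simple drift forecast used only for illustration; the Y values are taken from the figure):

```python
import numpy as np

def rolling_forecast_errors(y, initial_window=5):
    # One-step-ahead forecast errors from an expanding training window.
    y = np.asarray(y, dtype=float)
    errors = []
    for t in range(initial_window, len(y)):
        train = y[:t]
        forecast = train[-1] + np.diff(train).mean()   # naive drift forecast as a stand-in model
        errors.append(y[t] - forecast)
    return np.array(errors)

y = [10.3, 14.8, 16.8, 19.8, 21.8, 25.2, 29.6, 31.8, 35.1, 38.7, 39.9, 44.2]
errs = rolling_forecast_errors(y)
print(np.sqrt(np.mean(errs ** 2)))   # RMSE across the rolling evaluations
```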
2-Class Classification: Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

Precision (p) = a / (a + c)
Recall (r) = a / (a + b) = TPR
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
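These formulas translate directly into code (the counts a, b, c, d follow the matrix above; the numbers are made up):

```python
def classification_measures(a, b, c, d):
    # a = TP, b = FN, c = FP, d = TN, laid out as in the matrix above.
    precision = a / (a + c)
    recall    = a / (a + b)                                      # = TPR
    f_measure = 2 * recall * precision / (recall + precision)    # = 2a / (2a + b + c)
    return precision, recall, f_measure

print(classification_measures(a=40, b=10, c=20, d=30))
```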
ROC Curve
ROC (Receiver Operating Characteristic) curves:
• The ROC curve plots the pairs (False Positive Rate on the X axis, True Positive Rate on the Y axis) obtained as the decision threshold varies.
• The diagonal is the baseline: a random classifier.
• ROC curves that are closer to the top-left corner reflect better performance.
• A model with high predictive accuracy will rise quickly (moving from left to right), indicating that higher levels of sensitivity can be achieved without sacrificing much specificity.
• Researchers sometimes use the area under the ROC curve as a performance measure (AUC). AUC is, by this definition, between 0.5 and 1.
Comparing two models M1 and M2 on the same plot:
• Neither model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
• Area under the ROC curve: ideal classifier, Area = 1; random guess, Area = 0.5.
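A minimal sketch of tracing an ROC curve and its AUC from predicted scores (labels and scores are made up; real work would normally use a library routine):

```python
import numpy as np

def roc_points(y_true, scores):
    # (FPR, TPR) pairs obtained by sweeping the decision threshold from high to low.
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    P, N = (y_true == 1).sum(), (y_true == 0).sum()
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        pred = scores >= thr
        tpr = (pred & (y_true == 1)).sum() / P     # sensitivity
        fpr = (pred & (y_true == 0)).sum() / N     # 1 - specificity
        points.append((fpr, tpr))
    return points

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
pts = roc_points(y_true, scores)
auc = np.trapz([tpr for _, tpr in pts], [fpr for fpr, _ in pts])   # area under the curve
print(auc)
```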