Lecture 11
Course Review
Open-book Exam
Five Questions
The methods tested will not go beyond what you learned in this class (Week 1 ~ Week 10).
Review the lecture notes, tutorial notes, assignments, and quizzes!
Course Map
Course Map (Week 2&3)
Course Map (Week 4)
Course Map (Week 5&6)
Course Map (Week 8)
Association Rule
• Rule Measurement: Support, Confidence, Lift
• Rule Mining: Frequent rule set mining, Frequent rule mining
Course Map (Week 9)
Course Map (Week 10)
Descriptive Analytics
Descriptive Analytics
Recall text analytics
Descriptive Analytics
Measures of Central Tendency
• Mean, Median, Mode
• Minimum, Maximum, Percentiles, and Quartiles
Measures of Dispersion
• Range
• Interquartile range (IQR)
• Variance / Standard Deviation / Coefficient of Variation
Descriptive Analytics
Measures of Distribution Shape
• Skewness (Coefficient of Skewness, CS)
• Kurtosis
Descriptive Analytics
Relationship Between Multiple Variables
• Covariance (Population/Sample)
• Correlation
Descriptive Analytics
Feature Selection & Dimension Reduction
Choosing Variables to Remove
• Remove non-informative feature vectors directly.
Principal Component Analysis (PCA)
• Find a linear combination of the quantitative variables that contains most, even if not all, of the information.
PCA
[Figure: example data table with attributes X(1) through X(6).]
PCA
[Figure: 2-D data points projected onto a line; the perpendicular distances are the projection errors.]
• For 2D → 1D, we must find a vector u(1) onto which to project the data so as to minimize the projection error.
  (u(1) and -u(1) give the same projection.)
PCA
• To reduce from n dimensions to k dimensions: find k vectors (u(1), u(2), …, u(k)) onto which to project the data so as to minimize the projection error.
PCA
• We need to compute two things:
  The u vectors (the new planes/directions to project onto).
  The z vectors: the new, lower-dimensional feature vectors.
PCA
PCA finds a transformation of the data that satisfies the following properties:
• Each pair of new attributes has 0 covariance (for distinct attributes).
• The attributes are ordered with respect to how much of the variance of the data each attribute captures.
• The first attribute captures as much of the variance of the data as possible.
• Each successive attribute captures as much of the remaining variance as possible.
PCA: Key Concepts
Covariance Matrix S:
• Given an m-by-n data matrix D, whose m rows are data objects and whose n columns are attributes: if D is preprocessed so that the mean of each attribute is 0, then S = D^T D.
Eigenvalues of S:
• Let λ1, …, λn be the eigenvalues of S. The eigenvalues are all non-negative and can be ordered such that λ1 ≥ λ2 ≥ … ≥ λn-1 ≥ λn.
Eigenvectors of S:
• Let U = [u(1), …, u(n)] be the matrix of eigenvectors of S, ordered so that the ith eigenvector corresponds to the ith largest eigenvalue.
PCA: Data Transformation
• The data matrix D′ = DU is the set of transformed data that satisfies the conditions posed above.
• U is an [n x n] matrix, and it turns out the columns of U are the u vectors we want. So to reduce a system from n dimensions to k dimensions, just take the first k columns of U, denoted U_reduce.

D′ = DU
• D is [m x n], U is [n x n]
• D′ is [m x n]

D′ = D U_reduce
• D is [m x n], U_reduce is [n x k]
• D′ is [m x k]

z = (U_reduce)^T x
• x is [n x 1] (original space)
• z is [k x 1] (principal space)
• z1 = (u(1))^T x
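A minimal NumPy sketch of this transformation (the small data matrix D below is made up for illustration; in class this step was done with a software package):

```python
import numpy as np

# Toy data matrix D: m objects (rows) x n attributes (columns)
D = np.array([[2.5, 2.4, 1.1],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.0],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

D = D - D.mean(axis=0)            # center each attribute so its mean is 0
S = D.T @ D                       # covariance matrix (up to a 1/(m-1) factor)

eigvals, U = np.linalg.eigh(S)    # eigen-decomposition of the symmetric matrix S
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]   # sort so u(1) has the largest eigenvalue

k = 2
U_reduce = U[:, :k]               # first k columns of U
D_prime = D @ U_reduce            # transformed data D' = D * U_reduce, [m x k]

x = D[0]                          # one object in the original n-dimensional space
z = U_reduce.T @ x                # its k-dimensional representation, z = (U_reduce)^T x
print(D_prime.shape, z)
```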
Use a software package to run PCA
1. Original dataset [77 x 13]
2. Correlation matrix (using it is equivalent to standardizing the data)
Use a software package to run PCA
3. Get the principal components
• The dimension is reduced from 13 to 4.
• Only four of them have an eigenvalue greater than 1.
4. Get the PCs (new features z) from the original features
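The "eigenvalue greater than 1" rule can be reproduced from the correlation matrix, which is equivalent to running PCA on standardized data. A sketch (the random matrix below merely stands in for the 77 x 13 dataset, which is not shown here):

```python
import numpy as np

def components_with_eigenvalue_above_one(X):
    """Count principal components of the correlation matrix with eigenvalue > 1."""
    R = np.corrcoef(X, rowvar=False)                # correlation matrix (self-standardized)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # eigenvalues, largest first
    return int((eigvals > 1).sum()), eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(77, 13))                       # stand-in for the original [77 x 13] data
k, eigvals = components_with_eigenvalue_above_one(X)
print(k, np.round(eigvals[:4], 2))
```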
Time Series
Time Series Analysis
• Simple Models (Week 5): Moving Average, Exponential Smoothing, Random Walk Model
• Component Decomposition (Weeks 5 & 6): Trend Analysis, Seasonal Effect
• Regression-based Models (Weeks 5 & 6): Auto-regression, Other Causal Effects, Additional topics
Course Map
Simple Moving Average
• Average out random fluctuations in a time series to infer short-term changes in direction.
• Assumption: future observations will be similar to the recent past.
• Forecast = average of the most recent m observations:
  F_{t+1} = (Y_t + … + Y_{t-m+1}) / m
• So, how do you decide m? What if m = 1?
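A minimal sketch of the m-period moving-average forecast (the series y is made up; m is the window you have to choose):

```python
import numpy as np

def moving_average_forecast(y, m):
    # Forecast F_{t+1} as the mean of the most recent m observations.
    return float(np.asarray(y, dtype=float)[-m:].mean())

y = [112, 118, 132, 129, 121, 135, 148, 148]
print(moving_average_forecast(y, m=3))   # average of the last three observations
print(moving_average_forecast(y, m=1))   # m = 1 is the naive forecast F_{t+1} = Y_t
```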
Weighted Moving Average
• A natural extension of the moving average: weight the most recent m observations, with weights that add to 1.0.
• Higher weights on more recent observations generally give forecasts that respond more quickly to rapidly changing time series.
• Rationale: recent observations are more similar to what will happen next.
  F_{t+1} = a_1 Y_t + … + a_m Y_{t-m+1},  with a_1 + … + a_m = 1
• Of course, it is not easy to decide the weights; one solution is to use linear programming optimization. (The forecast itself is sketched below.)
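A sketch of the weighted forecast (the weights below are illustrative and simply sum to 1; finding "good" weights, e.g. by optimization, is the separate problem mentioned above):

```python
import numpy as np

def weighted_ma_forecast(y, weights):
    # F_{t+1} = a_1*Y_t + ... + a_m*Y_{t-m+1}, where the a_i sum to 1.
    w = np.asarray(weights, dtype=float)
    assert abs(w.sum() - 1.0) < 1e-9, "weights must sum to 1"
    recent = np.asarray(y, dtype=float)[-len(w):][::-1]  # most recent observation first
    return float(w @ recent)

y = [112, 118, 132, 129, 121, 135, 148, 148]
print(weighted_ma_forecast(y, weights=[0.5, 0.3, 0.2]))  # heaviest weight on Y_t
```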
Exponential Smoothing
• So: an easier approach (with one parameter):
  F_{t+1} = aY_t + a(1 - a)Y_{t-1} + a(1 - a)^2 Y_{t-2} + a(1 - a)^3 Y_{t-3} + …
• Exponential smoothing model:
  F_{t+1} = L_t = (1 - a)L_{t-1} + aY_t = L_{t-1} + a(Y_t - L_{t-1})
  (higher weights on more recent observations)
  Estimate for period t = last prediction + adjustment for the most recent forecast error.
• F_{t+1} is the forecast for time period t+1.
• L_t is the level of the series at time t (L_0 = Y_0): an estimate of where the series would be at time t if there were no random noise.
• Y_t is the observed value in period t.
• a is a constant between 0 and 1, called the smoothing constant. (How do we get it?)
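A sketch of the recursive form L_t = L_{t-1} + a(Y_t - L_{t-1}) (the series and the value of a are illustrative; in practice a is chosen, for example, by minimizing forecast error):

```python
def exponential_smoothing_forecast(y, a, L0=None):
    # Smooth the whole series and return the one-step-ahead forecast F_{t+1} = L_t.
    L = y[0] if L0 is None else L0     # L_0 = Y_0, as on the slide
    for Y_t in y[1:]:
        L = L + a * (Y_t - L)          # L_t = L_{t-1} + a*(Y_t - L_{t-1})
    return L

y = [112, 118, 132, 129, 121, 135, 148, 148]
print(exponential_smoothing_forecast(y, a=0.3))
```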
Random Walk Model
Y_t = Y_{t-1} + m + e_t
Here:
• Y_t – the observed value at time period t
• Y_{t-1} – the observed value at time period t-1
• m – the average of the differences (not necessarily 0; check autocorrelation)
• e_t – a random time series with a mean of 0 and a standard deviation of 1
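Under this model the natural forecast is the last value plus the drift m, estimated as the average one-period difference. A small sketch with made-up numbers:

```python
import numpy as np

def random_walk_forecast(y):
    # Forecast Y_{t+1} under Y_t = Y_{t-1} + m + e_t, with m = mean of the differences.
    y = np.asarray(y, dtype=float)
    m = np.diff(y).mean()            # average of Y_t - Y_{t-1}
    return float(y[-1] + m)

y = [100, 102, 101, 105, 107, 110]
print(random_walk_forecast(y))       # last observation plus the estimated drift
```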
Autoregression
• First-order autoregressive model AR(1): Y_t = a_0 + a_1 Y_{t-1} + e_t
• Second-order autoregressive model AR(2): Y_t = a_0 + a_1 Y_{t-1} + a_2 Y_{t-2} + e_t
• AR(p): which order p should we choose? Recall model assessment.
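An AR(p) fit can be sketched as ordinary least squares on lagged values (in class the model was estimated with a statistical package; this NumPy version is only an illustration):

```python
import numpy as np

def fit_ar(y, p):
    # Fit Y_t = a0 + a1*Y_{t-1} + ... + ap*Y_{t-p} + e_t by least squares.
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - p)] +
                        [y[p - j - 1:len(y) - j - 1] for j in range(p)])  # lag-(j+1) columns
    coeffs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coeffs                     # [a0, a1, ..., ap]

rng = np.random.default_rng(1)
y = [0.0]
for _ in range(300):                  # simulate an AR(1) with a0 = 1.0, a1 = 0.6
    y.append(1.0 + 0.6 * y[-1] + rng.normal())
print(fit_ar(y, p=1))                 # estimates should be close to [1.0, 0.6]
```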
Sample Question
We are going to use a time series variable and do some analysis:
Using the original value vs. using the difference(1) of the original value:

Augmented Dickey-Fuller Unit Root Tests (difference(1) of the original value)
Type         Lags   Rho        Pr < Rho   Tau     Pr < Tau   F       Pr > F
Zero Mean    0      -59.1439   <.0001     -5.97   <.0001
Zero Mean    1      -45.1897   <.0001     -4.66   <.0001
Zero Mean    2      -25.1469   0.0002     -3.27   0.0012
Single Mean  0      -74.3553   0.0013     -6.86   <.0001     23.53   0.0010
Single Mean  1      -64.7748   0.0013     -5.49   <.0001     15.08   0.0010
Single Mean  2      -38.7515   0.0013     -3.85   0.0031     7.45    0.0010
Trend        0      -74.3509   0.0005     -6.84   <.0001     23.41   0.0010
Trend        1      -64.5966   0.0005     -5.47   <.0001     15.03   0.0010
Trend        2      -38.2834   0.0008     -3.81   0.0184     7.51    0.0197

Augmented Dickey-Fuller Unit Root Tests (original value)
Type         Lags   Rho       Pr < Rho   Tau     Pr < Tau   F       Pr > F
Zero Mean    0      0.9750    0.9071     5.66    0.9999
Zero Mean    1      0.9132    0.8965     2.47    0.9969
Zero Mean    2      0.8823    0.8908     2.10    0.9916
Single Mean  0      -0.1024   0.9513     -0.26   0.9272     21.27   0.0010
Single Mean  1      -0.3804   0.9346     -0.51   0.8853     4.88    0.0425
Single Mean  2      -0.4918   0.9269     -0.61   0.8643     3.96    0.0911
Trend        0      -1.4094   0.9819     -0.79   0.9635     0.32    0.9900
Trend        1      -4.9336   0.8221     -1.45   0.8407     1.08    0.9570
Trend        2      -5.4808   0.7807     -1.47   0.8348     1.13    0.9503
If you are going to run an ARIMA model, which time series is more appropriate for the analysis: the original value or the difference(1) of the original value?
Based on the Dickey-Fuller tests, the original value is non-significant (p-values for lag 1 and lag 2 of 0.957 and 0.950, respectively). In contrast, the difference of ppi (Y_t - Y_{t-1}) has a significant Dickey-Fuller test, with a p-value of 0.001 for lag 1 and 0.02 for lag 2. Thus, the model should use the difference(1) of ppi.
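The same kind of check can be sketched outside SAS with statsmodels' adfuller (the ppi series below is simulated purely as a stand-in; only the pattern of p-values matters):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
ppi = 100 + np.cumsum(0.2 + rng.normal(size=170))   # trending, non-stationary stand-in series

for name, series in [("original", ppi), ("difference(1)", np.diff(ppi))]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF statistic = {stat:.2f}, p-value = {pvalue:.4f}")
# Expect a large p-value for the original series and a small one for the difference,
# matching the conclusion above: model the difference(1) of ppi.
```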
Sample Question
Run the model with AR(1) (on the difference(1) series) and write the equation for the AR(1) model.
Maximum Likelihood Estimation
Parameter   Estimate   Standard Error   t Value   Approx Pr > |t|   Lag
MU          0.45576    0.13136          3.47      0.0005            0
AR1,1       0.55220    0.06466          8.54      <.0001            1

Constant Estimate     0.204089
Variance Estimate     0.591113
Std Error Estimate    0.768839
AIC                   390.7895
SBC                   397.0374
Number of Residuals   168

Forecasts for variable ppi
Obs   Forecast   Std Error   95% Confidence Limits
170   103.4440   0.7688      101.9371   104.9508
diff(Y_t) = 0.204 + 0.552 * diff(Y_{t-1}) + e_t. Because diff(Y_t) = Y_t - Y_{t-1}, the formula in terms of the original series is:
Y_t = 0.204 + 1.552 Y_{t-1} - 0.552 Y_{t-2} + e_t
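The substitution can be sanity-checked numerically: computing Y_t via the differenced form and via the expanded form from the same inputs gives the same value (the starting values and shock below are arbitrary):

```python
c, phi = 0.204, 0.552                   # constant estimate and AR(1) coefficient from the output

Y_tm2, Y_tm1, e_t = 100.0, 101.0, 0.3   # arbitrary Y_{t-2}, Y_{t-1} and shock e_t

diff_form = Y_tm1 + (c + phi * (Y_tm1 - Y_tm2) + e_t)   # Y_t = Y_{t-1} + diff(Y_t)
expanded  = c + (1 + phi) * Y_tm1 - phi * Y_tm2 + e_t   # Y_t = 0.204 + 1.552*Y_{t-1} - 0.552*Y_{t-2} + e_t
print(diff_form, expanded)              # both print the same number
```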
Association Rule
• What are their support, confidence & lift?
• How do we interpret them?
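A small sketch of how the three measures are computed from transactions (the transactions below are made up for illustration):

```python
def rule_measures(transactions, antecedent, consequent):
    # Return (support, confidence, lift) of the rule antecedent -> consequent.
    n = len(transactions)
    A, C = set(antecedent), set(consequent)
    n_A  = sum(A <= t for t in transactions)          # transactions containing A
    n_C  = sum(C <= t for t in transactions)          # transactions containing C
    n_AC = sum((A | C) <= t for t in transactions)    # transactions containing both
    support    = n_AC / n
    confidence = n_AC / n_A
    lift       = confidence / (n_C / n)
    return support, confidence, lift

transactions = [{"bread", "milk"}, {"bread", "beer", "eggs"},
                {"milk", "beer", "cola"}, {"bread", "milk", "beer"},
                {"bread", "milk", "cola"}]
print(rule_measures(transactions, {"bread"}, {"milk"}))   # (0.6, 0.75, 0.9375)
```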
Association Rule Mining
[Figure: itemset lattice over items A, B, C, D, E. The itemset AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, …, ABCDE) are pruned.]
Sample Question
If we set minsup = 0.2 and minconf = 0.45, which rules are selected? Interpret the rule with the largest support/confidence.
Text Mining – Document Representation
Term Frequency (TF)
Term Frequency (TF): the frequency with which a word appears in a document.
• Let Count(i, j) be the count of term i in document j.
• Let TF(i, j) be the proportion of the count of term i in document j:
  TF(i, j) = Count(i, j) / (total number of words in document j)
Note: the total number of words is used to normalize the raw count Count(i, j). If one document has many more words than another, we need to normalize Count(i, j) by the document size.
The size of a document can also be replaced by the total frequency of terms (keywords) in the document.
Inverse Document Frequency (IDF)
Inverse document frequency: based on the number of documents that contain a specific word.
• Let N be the count of distinct documents in the corpus.
• Let Count(i) be the count of documents in the corpus in which term i is present.
• The inverse document frequency is a measure of how much information the word provides, i.e., whether it is common or rare across all documents:
  IDF(i) = log(N / Count(i))
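Putting TF and IDF together (a minimal sketch with made-up documents; note that software packages often use slightly different IDF variants):

```python
import math

docs = [["data", "mining", "is", "fun"],
        ["text", "mining", "uses", "data"],
        ["fun", "with", "text"]]

def tf(term, doc):
    return doc.count(term) / len(doc)                 # Count(i, j) / words in document j

def idf(term, docs):
    n_containing = sum(term in doc for doc in docs)   # Count(i)
    return math.log(len(docs) / n_containing)         # log(N / Count(i))

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("mining", docs[0], docs))   # "mining" appears in 2 of the 3 documents
print(tfidf("fun", docs[0], docs))
```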
Document Mining
• Association rules
• Clustering
• Classification
• Topic models
NLP Example
Model Assessment Overview
Evaluation Metrics
Prediction
• MAE or MAD (mean absolute error/deviation)
• Average error
• MAPE (mean absolute percentage error)
• RMSE (root-mean-squared error)
• Total SSE (total sum of squared errors)
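These prediction metrics map directly to code (y_true and y_pred below are illustrative):

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    return {
        "MAE":      np.mean(np.abs(err)),
        "AvgError": np.mean(err),
        "MAPE":     np.mean(np.abs(err / y_true)) * 100,   # in percent
        "RMSE":     np.sqrt(np.mean(err ** 2)),
        "TotalSSE": np.sum(err ** 2),
    }

print(prediction_metrics([10, 12, 15, 11], [9, 13, 14, 12]))
```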
Classification
• Classification matrix (also known as confusion matrix)
• Precision
• Recall
• F-measure
• Misclassification errors
Model Evaluation (CV with random subsampling)
[Figure: the dataset is randomly split into training and validation subsets K times (1st evaluation, 2nd evaluation, …, last evaluation); each split yields a validation error e1, e2, …, eK.]
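A sketch of the procedure: repeat K random train/validation splits and average the validation errors (the "model" here is just a training-mean predictor standing in for whatever model is being assessed):

```python
import numpy as np

def random_subsampling_cv(y, K=5, val_frac=0.3, seed=0):
    # Average validation error e_1, ..., e_K over K random splits.
    rng = np.random.default_rng(seed)
    n_val = int(len(y) * val_frac)
    errors = []
    for _ in range(K):
        idx = rng.permutation(len(y))
        val, train = idx[:n_val], idx[n_val:]
        y_hat = y[train].mean()                        # stand-in "model": predict the training mean
        errors.append(np.mean((y[val] - y_hat) ** 2))  # validation error e_k
    return float(np.mean(errors))

y = np.array([3.1, 2.9, 3.4, 3.0, 2.8, 3.3, 3.2, 2.7, 3.5, 3.0])
print(random_subsampling_cv(y))
```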
Model Evaluation (Rolling Forecasting)
[Figure: a time series dataset with columns T, X, Y (T = 0, 1, …, 19). Each evaluation fits the model on one portion of the series and checks the forecast on later observations, rolling forward through time until the last evaluation.]
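A sketch of rolling (rolling-origin) evaluation: fit on the observations up to time t, forecast the next period, record the error, then roll forward (the "model" here is a simple drift forecast used only for illustration; the Y values are taken from the figure):

```python
import numpy as np

def rolling_forecast_errors(y, initial_window=5):
    # One-step-ahead forecast errors from an expanding training window.
    y = np.asarray(y, dtype=float)
    errors = []
    for t in range(initial_window, len(y)):
        train = y[:t]
        forecast = train[-1] + np.diff(train).mean()   # naive drift forecast as a stand-in model
        errors.append(y[t] - forecast)
    return np.array(errors)

y = [10.3, 14.8, 16.8, 19.8, 21.8, 25.2, 29.6, 31.8, 35.1, 38.7, 39.9, 44.2]
errs = rolling_forecast_errors(y)
print(np.sqrt(np.mean(errs ** 2)))   # RMSE across the rolling evaluations
```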
2-Class Classification: Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

Precision (p) = a / (a + c)
Recall (r) = a / (a + b) = TPR
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
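These formulas translate directly into code (the counts a, b, c, d follow the matrix above; the numbers are made up):

```python
def classification_measures(a, b, c, d):
    # a = TP, b = FN, c = FP, d = TN, laid out as in the matrix above.
    precision = a / (a + c)
    recall    = a / (a + b)                                      # = TPR
    f_measure = 2 * recall * precision / (recall + precision)    # = 2a / (2a + b + c)
    return precision, recall, f_measure

print(classification_measures(a=40, b=10, c=20, d=30))
```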
ROC Curve
ROC (Receiver Operating Characteristic) curves:
• The ROC curve plots the pairs (False Positive Rate on the X axis, True Positive Rate on the Y axis) obtained as the decision threshold varies.
• The diagonal is the baseline: a random classifier.
• ROC curves that are closer to the top-left corner reflect better performance.
• A model with high predictive accuracy will rise quickly (moving from left to right), indicating that higher levels of sensitivity can be achieved without sacrificing much specificity.
• Researchers sometimes use the area under the ROC curve as a performance measure (AUC). AUC is, by this definition, between 0.5 and 1.
Comparing two models M1 and M2 on the same plot:
• Neither model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
• Area under the ROC curve: ideal classifier, Area = 1; random guess, Area = 0.5.
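A minimal sketch of tracing an ROC curve and its AUC from predicted scores (labels and scores are made up; real work would normally use a library routine):

```python
import numpy as np

def roc_points(y_true, scores):
    # (FPR, TPR) pairs obtained by sweeping the decision threshold from high to low.
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    P, N = (y_true == 1).sum(), (y_true == 0).sum()
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        pred = scores >= thr
        tpr = (pred & (y_true == 1)).sum() / P     # sensitivity
        fpr = (pred & (y_true == 0)).sum() / N     # 1 - specificity
        points.append((fpr, tpr))
    return points

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
pts = roc_points(y_true, scores)
auc = np.trapz([tpr for _, tpr in pts], [fpr for fpr, _ in pts])   # area under the curve
print(auc)
```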