
MFIN 290 Application of Machine Learning in Finance: Lecture 4

Yujie He

7/17/2021

Agenda

Recap of last lecture

Unsupervised approach

Dimension Reduction

Overview of different approach families, PCA, SVD

Clustering

Common methods

Evaluation

Real-world use case

Neural Network

Lab: Auto-encoder for Fraud Detection

2

Last Lecture

Classification (Supervised approach)

Introduction

K Nearest Neighbor

Logistic Regression

Multi-class classification

Evaluations

Gradient Descent

3

Unsupervised Learning

4

What is unsupervised learning?

Explore intrinsic characteristics of X

Unsupervised learning is often used as a means of exploratory data analysis, or to facilitate predictive modeling by segmentation

Dimension reduction

Visualization

Reduce feature space to facilitate training

Example: PCA, SVD, t-SNE (mainly for visualization)

Clustering

K-Means

Density-based clustering

EM Clustering with Gaussian Mixture Models

Hierarchical clustering

Evaluation

5

Why dimension reduction?

Curse of dimensionality

the more features, the more data are needed to represent all combinations of features

Model is more likely to overfit

Removes redundant features and noise

Extracts signal over noise and “helps” the model learn

Helps model interpretability

Less storage, faster training, wider applicability and scalability (some models do not scale well to high-dimensional training data)

6

Dimension Reduction

Can be done through both feature selection and feature engineering

Feature selection (manually and programmatically)

Visual check, domain knowledge

LASSO regression

Variance threshold

Univariate selection

Correlation (captures linear relationships only!)

ANOVA (continuous)

Chi-square (categorical)

7
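To make the programmatic selection options above concrete, here is a minimal scikit-learn sketch; the toy dataset, threshold, and k below are illustrative assumptions, not values from the course.

```python
# Minimal feature-selection sketch (illustrative data and parameters).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Variance threshold: drop near-constant features
X_vt = VarianceThreshold(threshold=0.1).fit_transform(X)

# Univariate selection: ANOVA F-test of each feature against the label
# (chi2 would be the analogue for non-negative / categorical features)
X_kbest = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# LASSO-style L1 penalty (logistic counterpart for a classification target):
# uninformative coefficients are driven to exactly zero
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(clf.coef_.ravel())   # indices of surviving features

print(X_vt.shape, X_kbest.shape, kept)
```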

Dimension Reduction

Linear methods

PCA

LDA

Nonlinear methods

Manifold assumption in high-dimensional space

Multi-dimensional scaling (MDS) – preserves Euclidean distances between points

Isometric Feature Mapping (IsoMap) – preserves geodesic distances

t-SNE – preserves conditional probability distributions and local neighbors

8
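As a small illustration of the nonlinear methods above, a scikit-learn sketch on a standard toy dataset; the dataset and parameter values (n_neighbors, perplexity) are illustrative choices, not recommendations.

```python
# Minimal manifold-learning sketch: Isomap and t-SNE embeddings for visualization.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, TSNE

X, y = load_digits(return_X_y=True)               # 64-dimensional digit images

iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)    # preserves geodesic distances
tsne = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=0).fit_transform(X)                     # preserves local neighborhoods

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(iso[:, 0], iso[:, 1], c=y, s=5, cmap="tab10")
axes[0].set_title("Isomap")
axes[1].scatter(tsne[:, 0], tsne[:, 1], c=y, s=5, cmap="tab10")
axes[1].set_title("t-SNE")
plt.show()
```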

9

10

Dimension Reduction

Auto-encoder (good 6min explanation)

A neural network that compresses the input into a latent-space representation, then reconstructs the input from it

Applications

Image reconstruction, image denoising

Anomaly detection (commonly used)

11
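Since the lab at the end of this lecture builds an auto-encoder, here is a minimal Keras sketch of the idea; the layer sizes, toy data, and training settings are illustrative assumptions, not the lab's architecture.

```python
# Minimal auto-encoder sketch: compress to a bottleneck, reconstruct the input,
# and use reconstruction error as an anomaly score (illustrative only).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 30, 4                       # e.g. 30 numeric features

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(16, activation="relu")(inputs)
latent = layers.Dense(latent_dim, activation="relu")(encoded)    # bottleneck
decoded = layers.Dense(16, activation="relu")(latent)
outputs = layers.Dense(input_dim, activation="linear")(decoded)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")   # learn to reconstruct the input

X = np.random.normal(size=(1000, input_dim)).astype("float32")   # toy data
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)        # target == input

# Samples that reconstruct poorly are candidate anomalies
errors = np.mean((autoencoder.predict(X, verbose=0) - X) ** 2, axis=1)
```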

Principal Component Analysis

Given X, PCA finds a low-dimensional representation of it that contains as much of the variation as possible => transforms X into a different space (Z)

Each of the dimensions found (PCs, Zk) is a linear combination of the original p features (k <= p)

𝜙1 is the first principal component (largest variance)

𝜙1 = (𝜙11, …, 𝜙p1) is the loading vector of the 1st PC

Z1 = (z11, …, zn1) is the score vector of the 1st PC

12

Principal Component Analysis

Plotting the score vectors (e.g. Z1 vs Z2) is the same as plotting the original data points projected down onto the subspace spanned by 𝜙1 and 𝜙2, i.e. the projected sample points

Good for continuous initial variables

Need to standardize each feature/variable to 𝜇 = 0, 𝜎 = 1

Because PCA is a variance-maximization algorithm (variance is scale dependent)

High-variance variables will dominate the algorithm

Normalization gives all variables equal weight

13

Principal Component Analysis

Steps

Normalize X to 𝜇 = 0, 𝜎 = 1 to get X_std

Calculate the covariance matrix

In finance, the correlation matrix is typically used, which can be understood as a “normalized covariance matrix”

Do eigenvalue decomposition

Eigenvectors are the components (PCs)

Eigenvalues are the explained variance ratios

OR do SVD on X_stdᵀ, i.e. U, S, Vᵀ = SVD(X_stdᵀ); the columns of U are the PCs (more efficient computationally, because there is no need to calculate XᵀX)

X = V S Uᵀ ⇒ Xᵀ = (V S Uᵀ)ᵀ = U S Vᵀ ⇒ Xᵀ X = U S Vᵀ V S Uᵀ = U S² Uᵀ, so the columns of U are the eigenvectors of XᵀX (the covariance matrix of the standardized data)

HW2

14

PCA Bi-Plots

Bi-plot = score plot + loading plot

Scores (dots): projected/transformed samples (bottom + left axes)

Loadings (vectors): how much weight each feature has on that PC (top + right axes)

Angles between loading vectors tell us how features correlate

15

PCA Scree Plot

How much variation each principal component captures from the data

Use a scree plot to select the principal components to keep

Ideal curve is “elbow”-like

16

PCA Use Case in Industry

As part of exploratory data analysis

Data redundancy

Dimension reduction

Use score vectors (projected data) as new features

Compression

Visualization

Clustering + PCA visualization

Preprocessing step for later model training

Reduce correlation

Reduce data size

17

Singular Value Decomposition

A matrix factorization method, 𝑋 = Uᵣ Σᵣ Vᵣᵀ

A low-rank representation (k < m)

Removes noise: a linear combination of (hidden) abstract concepts that minimally span the original data matrix

From sparse matrix to dense embeddings: saves storage and inference time by shrinking the dimension

Computing the SVD of X immediately yields the PCA of X

Use cases in industry:

Recommender systems (collaborative filtering, user-item rating matrix)

Latent semantic analysis (word-document co-occurrence matrix) in Natural Language Processing

Entries in the matrix should be bounded, otherwise the decomposition can be numerically unstable

18
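To make the PCA steps above concrete, here is a minimal NumPy/scikit-learn sketch; the toy data and variable names are illustrative, not the lecture's code. It standardizes the data, then compares the eigendecomposition route with the SVD route.

```python
# Minimal PCA sketch: standardize, covariance matrix, eigendecomposition vs. SVD.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # n=200 samples, p=5 features (toy data)

# Step 1: standardize each feature to mean 0, std 1
X_std = StandardScaler().fit_transform(X)

# Step 2: covariance matrix (equals the correlation matrix after standardization)
cov = np.cov(X_std, rowvar=False)

# Step 3a: eigenvalue decomposition -> eigenvectors are the PCs
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                 # sort by explained variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3b: equivalently, SVD of X_std.T -> left singular vectors U are the PCs
U, S, Vt = np.linalg.svd(X_std.T, full_matrices=False)

# Cross-check with scikit-learn (components_ are the loading vectors)
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)                     # Z1, Z2: the projected samples
print(pca.explained_variance_ratio_)
```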
19

Comparison of different methods

20

Clustering

Most important unsupervised learning problem

Aims to find structure in unlabeled data

Why do clustering?

Partition is the goal

Find intrinsic clusters/segments within data and aid predictive modeling

Find true structure in the data

As part of exploratory analysis and for sanity checks

Used to segment customers, products, etc. for profiling and better targeting

21

Clustering – K-Means Clustering

Goal is to minimize intra-cluster distance and maximize inter-cluster distance

Steps

Randomly select k initial centers and assign each sample to the closest cluster

Recompute each center by taking the mean of its cluster

Repeat until assignments no longer change

Assumes the variance of each feature is spherical and equal

22

Clustering – K-Means Clustering

Pros: fast, O(n)

Cons: assumes the variance of each feature is spherical and equal; sensitive to initialization; need to select k

Variants: k-median, k-medoids, k-means++

To select k

Domain knowledge

Visualization: “elbow method”

23

Clustering – DBSCAN

Density-based method

Data points in regions of similar density are clustered together

Pros: no assumption about topology, good for manifold spaces, no assumption about K

Cons: needs a distance threshold 𝜀 (cluster vs. noise) and minSamples

24

Clustering – Expectation-Maximization (EM) Clustering with GMM

Assumes data are Gaussian-distributed

Pros: less restrictive than the spherical-cluster assumption, since each cluster has its own mean and covariance; supports mixed membership through probabilities

Cons: need K; Gaussian assumption

25

Clustering – Hierarchical Clustering

Bottom-up: merge pairs of clusters (agglomerative) until one cluster remains (dendrogram)

Pros: no need to specify k; flexible choice of distance metric; intrinsic hierarchical structure

Cons: less efficient, O(n³)

26

27

Clustering – Spectral Clustering

28

2 broad approaches for clustering

Compactness

Points that lie close to each other fall in the same cluster and are compact around the cluster center. The closeness can be measured by the distance between the observations. E.g.: K-Means clustering

Connectivity

Points that are connected or immediately next to each other are put in the same cluster. Even if the distance between 2 points is small, if they are not connected, they are not clustered together. Spectral clustering is a technique that follows this approach.

29

Clustering – Evaluation

https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

Have labels? Then use them!

Mutual information

Otherwise…

Key idea: inter-cluster vs. intra-cluster distance

Silhouette Coefficient

Pick some clusters and examine the members to see if the clusters make sense – domain knowledge

As a metric to select K

30

Tips for doing clustering

Know your data

Distance metric

Distribution assumptions

Topology of your data: compactness vs. connectivity

Know model assumptions

Some methods are restricted to Euclidean distance (e.g. K-means)

Evaluate, apply expert judgement, and iterate

31
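To tie the methods and the evaluation above together, a minimal scikit-learn sketch; the blob data and parameter values (eps, min_samples, the range of K) are illustrative assumptions.

```python
# Minimal clustering sketch: K-Means over several K, evaluated without labels,
# plus a density-based alternative (DBSCAN).
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)            # K-Means is distance/scale sensitive

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k,
          round(km.inertia_, 1),                          # input to the "elbow" plot
          round(silhouette_score(X, km.labels_), 3))      # intra- vs. inter-cluster distance

# Density-based alternative: no K, but needs eps and min_samples; label -1 marks noise
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
```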
Clustering Example – Time Series in Cloud Supply Chain

Demand forecasts are used to drive CapEx (capital expenditure) decisions

Small percentage improvements in forecast accuracy lead to $$$ in inventory savings

Challenges

Thousands of demand usage time series to forecast

Different patterns under one resource

Fitting one model per time series can overfit

Fitting one model per resource group can underfit

32

Distance metrics for time series

Dynamic time warping (DTW)

Can match points between 2 time series at different time steps

Variants

Global alignment kernel

Soft-DTW

Shape-based distance

33

Centroid Extraction

34

Clustering Algorithms

Hierarchical clustering: grouping based on hierarchies

K-Medoid (Hastie et al. 2009): select random initial time series as centroids; each iteration updates the centroids using PAM and the corresponding distance metric

TADPole clustering (Begum et al. 2015): uses the lower and upper bounds of DTW to find series with dense neighbors as centroids, and prunes

Fuzzy clustering: assigns a time series membership probabilities over the groups, e.g. one time series can have belonging probabilities [0.5, 0.3, 0.2] for clusters [1, 2, 3] respectively

35

Alternative: Feature-based clustering

Instead of using time-series DTW distance metrics, use structural characteristics extracted from the time series (Wang et al., 2006)

Features

Date features (day, month, year, weekOfYear, dayOfWeek)

Lag features (10 lags)

Rolling window features (mean, std, median, kurtosis, skewness, etc.)

Features fed into clustering algorithms

K-means

Hierarchical

In this example, shape-based clustering yielded similar results to feature-based clustering

36

Model selection and evaluation

How to select the number of clusters (K)?

Cluster Validity Indices

How to select the clustering algorithm?

PCA score

Test using a benchmark dataset

37

Optimal Clustering Number Selection

38

Clustering Algorithm Selection – PCA Score

39

Backtest – Select Robust Clustering Method

40

Business Impact

Incorporate clustering methods for short- and mid-term demand forecasting

Improved forecast accuracy through automated cluster selection

Capacity Planning

Leverage short/mid-term forecasts for capacity execution decisions (e.g. whether to expedite computing cluster processing)

41

Business Impact

42

Neural Network

43

Building Blocks: Neurons

The basic unit of a neural network is the neuron

An activation function (e.g. sigmoid) turns an unbounded input into a predictable (bounded) form

44

Structure of a Neural Network

2-layer neural network, 1 hidden layer

Can have any number of layers and any number of neurons in those layers

Weight matrix (matrices if multiple layers): W has dimension [resulting_layer_dim, input_layer_dim]

Hidden layer output with activation function: H1 = f(W1 * X.T + b1), where f is the activation function

Output layer: O1 = f(W2 * H1 + b2)

Feedforward, fully connected

45

Compact Representation in Linear Algebra

X: [n, 3+1], W1: [4, 4], W2: [3, 4], output W3: [2, 3]

f(W1 * X.T) has dim [4, n] = a1

f(W2 * a1) has dim [3, n] = a2

f(W3 * a2) has dim [2, n] = logits; then apply softmax for each sample to get 2 probabilities for a 2-class classification

46

Activation Functions

Sigmoid is the same as the logistic function

Not symmetric/normalized; flat gradient in the tails

Tanh

Often used, normalized

ReLU and Leaky ReLU

Leaky ReLU fixes the flat (zero) gradient for negative inputs

47

Activation Functions

GELU (Gaussian Error Linear Unit)

Used in many SOTA Transformer models in NLP

Avoids the vanishing gradient problem

48

Loss function

Regression

Mean squared error

Classification

Cross-entropy (applied on top of a softmax output layer)

Softmax turns real-valued numbers into probabilities

Can add a regularization term

49

Backpropagation

Chain rule to calculate the partial derivative of the loss with respect to each parameter

Intuitively, this backpropagates the error to facilitate weight updates

50

Gradient Descent

Calculate the gradient for every layer (W, weight matrix)

Update each layer’s W while backpropagating through to the 1st hidden layer – 1 iteration

One pass through the full training set is one epoch

51

Gradient Descent Variants

Stochastic gradient descent: one sample per gradient calculation; the mini-batch version is better

Variants for more stable and accelerated SGD:

Momentum (momentum-based)

NAG (Nesterov accelerated gradient; momentum-based)

Adagrad (scales gradient/step size)

Adadelta (scales gradient/step size)

RMSprop (scales gradient/step size)

Adam (adaptive moment estimation)

52
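To make the forward pass, backpropagation, and gradient-descent update above concrete, here is a minimal NumPy sketch of a 2-layer classifier; the dimensions, tanh activation, and learning rate are illustrative choices, not the lecture's code. It follows the slides' column-vector convention (W1 @ X.T).

```python
# Minimal 2-layer network: forward pass, softmax cross-entropy, backprop, one GD update.
import numpy as np

rng = np.random.default_rng(0)
n, d, h, c = 64, 3, 4, 2                  # samples, input dim, hidden units, classes
X = rng.normal(size=(n, d))
y = rng.integers(0, c, size=n)            # integer class labels (toy data)

W1, b1 = rng.normal(scale=0.1, size=(h, d)), np.zeros((h, 1))
W2, b2 = rng.normal(scale=0.1, size=(c, h)), np.zeros((c, 1))

def softmax(z):                           # turn real-valued logits into probabilities
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# Forward pass: H1 = f(W1 @ X.T + b1), logits = W2 @ H1 + b2
Z1 = W1 @ X.T + b1                        # [h, n]
H1 = np.tanh(Z1)                          # activation f
logits = W2 @ H1 + b2                     # [c, n]
P = softmax(logits)

onehot = np.eye(c)[y].T                   # [c, n]
loss = -np.mean(np.sum(onehot * np.log(P + 1e-12), axis=0))   # cross-entropy

# Backpropagation (chain rule), gradients averaged over the batch
dlogits = (P - onehot) / n
dW2 = dlogits @ H1.T
db2 = dlogits.sum(axis=1, keepdims=True)
dH1 = W2.T @ dlogits
dZ1 = dH1 * (1 - H1 ** 2)                 # tanh'(z) = 1 - tanh(z)^2
dW1 = dZ1 @ X
db1 = dZ1.sum(axis=1, keepdims=True)

# One (full-batch) gradient-descent update; loop over the data for more epochs
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(round(loss, 4))
```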
Regularization in Neural Networks

L1/L2 regularization

Modifies the loss function

Dropout

Only at training time, to avoid over-reliance on some nodes

Disable dropout at inference time

Usually set (ratio) to 0.1

53

Regularization in Neural Networks

Data Augmentation

Heuristic sample perturbation

Model-generated adversarial samples

Early stopping

Avoid overfitting

Batch norm

Normalization within a batch provides some regularization, reducing generalization error

54

Neural Network Intuition Interactive Session

Neural network playground

Interact and answer the following questions (10 min):

Which feature(s) are most effective in reducing both train and test error?

How does the number of hidden layers impact performance?

How does the number of neurons in each layer impact performance?

How does noise impact performance?

55

https://playground.tensorflow.org/

Lab: Auto-encoder for Fraud Detection

Colab

56

https://colab.research.google.com/drive/1q_AuysUon2QB8V55MdzXoXHqPpqZ55WR?usp=sharing

Next Step

Homework 2 release

Mid-term

Lecture 5: Prof. Edward Sheng will review Homework 1, review lectures 1 to 4, and give the in-class midterm

57

58