Lead Research Scientist, Financial Risk Quantitative Research, SS&C Algorithmics; Adjunct Professor, University of Toronto
MIE1624H – Introduction to Data Science and Analytics
Lecture 10 – Advanced Machine Learning
University of Toronto, March 22, 2022
Machine learning
Machine learning gives computers the ability to learn without being explicitly programmed
■ Supervised learning: decision trees, ensembles (bagging, boosting, random forests), k-NN, linear regression, Naive Bayes, neural networks, logistic regression, SVM
❑ Classification
❑ Regression (prediction)
■ Unsupervised learning: k-means, c-means, hierarchical clustering, DBSCAN
❑ Clustering
❑ Dimensionality reduction (PCA, LDA, factor analysis, t-SNE)
❑ Association rules (market basket analysis)
■ Reinforcement learning
❑ Dynamic programming
■ Neural nets: deep learning, multilayer perceptron, recurrent neural network (RNN), convolutional neural network (CNN), generative adversarial network (GAN)
Unsupervised Machine Learning – Clustering
Cluster analysis (segmentation)
▪ Unsupervised learning algorithm
o Unlabeled data and no “target” variable
▪ Frequently used for segmentation (to identify natural groupings of customers)
o Market segmentation, customer segmentation
▪ Most cluster analysis methods involve the use of a distance measure to calculate
the closeness between pairs of items
o Data points in one cluster are more similar to one another
o Data points in separate clusters are less similar to one another
[Figure: sample data partitioned into Cluster #1, Cluster #2, and Cluster #3]
K-means clustering
Source: , Unsupervised Machine Learning
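The k-means workflow above can be sketched in a few lines of scikit-learn. This is an illustrative example on made-up data (three synthetic blobs); note that the number of clusters K must be chosen in advance, as the slides point out.

```python
# Minimal k-means sketch with scikit-learn (synthetic, illustrative data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs around different centres
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(4, 4), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 4), scale=0.3, size=(50, 2)),
])

# K must be given up front; n_init restarts guard against a bad initialisation
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment for each point
centres = km.cluster_centers_  # learned centroids
```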
Clustering: LinkedIn
Cluster analysis – K-means clustering
Source: , Cluster Analysis
Cluster analysis – Fuzzy C-means clustering (FCM)
Source: , Cluster Analysis
Source: , Cluster Analysis
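Fuzzy C-means differs from k-means in that each point gets a degree of membership in every cluster rather than a hard assignment. A toy NumPy implementation is sketched below (the function name and structure are my own, not a library API); memberships per point sum to 1, and the fuzzifier m > 1 controls how soft the clusters are.

```python
# Toy fuzzy c-means in plain NumPy (illustrative sketch, not a library API).
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Return (centres, U), where U[i, k] is point i's membership in cluster k."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m                              # fuzzified memberships
        centres = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted means
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))          # closer centre -> higher membership
        U /= U.sum(axis=1, keepdims=True)
    return centres, U

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2))])
centres, U = fuzzy_c_means(X, c=2)
```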
Cluster analysis – Hierarchical clustering
Source: , Cluster Analysis
Source: , Cluster Analysis
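Agglomerative (bottom-up) hierarchical clustering merges the closest pair of clusters at each step, producing a tree that can be cut at any level. A minimal SciPy sketch on synthetic data:

```python
# Agglomerative hierarchical clustering sketch with SciPy (illustrative data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])

Z = linkage(X, method="ward")                    # merge history of the tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree of clusters
```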
AI for course curriculum design – clustering of skills
Deep Learning and Computer Vision (neural nets, deep learning, AI, Python, TensorFlow)
Data Management
(databases, data structures, SQL, noSQL, web-scraping, APIs, intro to Big Data)
Statistical Analysis for Business (statistical modeling, hypothesis testing, SPSS)
Distributed Computing, Big Data Analytics (distributed computing, Cloud, Hadoop, Spark)
Creative Thinking, Design Thinking
Cluster analysis – DBSCAN
Source: , Cluster Analysis
Source: , Cluster Analysis
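DBSCAN groups points that lie in dense regions and labels isolated points as noise; unlike k-means it does not need the number of clusters in advance. A short scikit-learn sketch on synthetic data (the eps/min_samples values are illustrative):

```python
# DBSCAN sketch with scikit-learn: density-based clustering that finds
# arbitrary-shaped clusters and marks sparse points as noise (label -1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.1, (40, 2)),   # dense blob 1
    rng.normal(3, 0.1, (40, 2)),   # dense blob 2
    [[10.0, 10.0]],                # isolated outlier, far from both blobs
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_   # cluster ids; -1 means noise / outlier
```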
Main clustering algorithms
■ Partition based (K-means):
❑ Medium and large sized databases (relatively efficient)
❑ Produces sphere-like clusters
❑ Needs number of clusters (K)
■ Partition based (FCM):
❑ Produces fuzzy clusters
❑ Long computational time
■ Hierarchical based (agglomerative):
❑ Produces trees of clusters
■ Density based (DBSCAN):
❑ Produces arbitrary shaped clusters
❑ Good when dealing with spatial clusters (maps)
Cluster analysis – comparison
Source: , Cluster Analysis
Applications of clustering
■ Retail / Marketing:
❑ Identifying buying patterns of customers
❑ Finding associations among customers' demographic characteristics
❑ Recommending a new book to a new customer by identifying clusters of books or clusters of customer preferences
■ Education:
❑ Education professionals may want to know the likes and dislikes of their students; by creating and understanding the different student groups, they can package and market the various courses accordingly
■ Banking:
❑ Clustering normal transactions to find patterns of fraudulent credit card use
❑ Identifying clusters of customers, e.g., loyal customers
❑ Determining credit card spending by customer groups
Applications of clustering
■ Insurance:
❑ Fraud detection in claims analysis
❑ Assessing insurance risk of customers
■ Publishing / Media:
❑ Automatically categorizing news based on their content
❑ Recommending similar news articles
❑ Tagging news
❑ Automatic fact checking
■ Medicine:
❑ Characterizing patient behaviour based on similar characteristics
❑ Identifying successful medical therapies for different illnesses
■ Biology:
❑ Clustering genetic markers to identify family trees
Other Machine Learning Algorithms
Source: Rahul “Artificial Intelligence Demystified”, http://www.analyticsvidhya.com/blog/2016/12/artificial-intelligence-demystified/
Association rules – unsupervised machine learning
▪ Frequently called Market Basket Analysis; it is an unsupervised learning algorithm (no target variable)
▪ Detects associations (affinities) between variables (items or events)
▪ Example: if a customer purchased bread and bananas, he or she has an 80% probability of purchasing milk during the same trip
▪ Multiple applications:
o Cross-sell and up-sell
o Targeted promotions
o Product bundling
o Store planograms
o Assortment optimization
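The bread-and-bananas example boils down to two numbers: support (how often the full itemset appears) and confidence (how often the consequent appears given the antecedent). A toy calculation, with a made-up transaction list chosen so the confidence comes out to the 80% quoted above (this is not a full Apriori implementation):

```python
# Hand-rolled support/confidence for one association rule:
# {bread, bananas} -> {milk}. Transactions are made up for illustration.
transactions = [
    {"bread", "bananas", "milk"},
    {"bread", "bananas", "milk"},
    {"bread", "bananas", "milk"},
    {"bread", "bananas", "milk"},
    {"bread", "bananas"},
    {"milk", "eggs"},
]

antecedent, consequent = {"bread", "bananas"}, {"milk"}
n = len(transactions)
n_ante = sum(antecedent <= t for t in transactions)                  # baskets with bread+bananas
n_both = sum((antecedent | consequent) <= t for t in transactions)   # ...that also have milk

support = n_both / n          # share of all baskets containing the full itemset
confidence = n_both / n_ante  # P(milk | bread, bananas) = 0.80 here
```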
Ensemble Learning
Source: , Ensemble Learning

Basic ensemble – regression:
▪ Multiple models are built on the same training data (e.g., linear fits such as 1.6 + 0.79*x and 1.94 + 0.64*x)
▪ Due to random weight initialization and data partition, the trained models usually turn out to be slightly different
▪ To predict, average the predictions of the individual models, e.g., mean(3.18, 3.23, 3.57) or mean(5.55, 5.17, 4)

The Wisdom of Crowds:
▪ In 1907, 787 villagers tried to guess the weight of an ox
▪ None of them guessed it correctly, but the average guess (542.9 kg) was very close to the actual weight of the ox (543.4 kg)
Source: , The Wisdom of Crowds

Basic ensemble – classification:
▪ A similar approach can be applied for classification
▪ Multiple models are built on the training data, and the ensemble prediction is decided by majority voting
▪ Example: three models classify an image as Dog, Dog, and Cat; Dog – 2, Cat – 1, so Dog wins!
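Both ensemble rules above fit in a few lines. The numbers below mirror the slide's examples (mean(3.18, 3.23, 3.57) for regression, "Dog – 2, Cat – 1" for voting) and are purely illustrative:

```python
# Basic ensemble sketch: average predictions for regression,
# majority vote for classification.
from statistics import mean
from collections import Counter

# Regression: three slightly different fitted models predict for one input
regression_preds = [3.18, 3.23, 3.57]
ensemble_pred = mean(regression_preds)      # average the predictions

# Classification: three models vote; the majority class wins
votes = ["Dog", "Dog", "Cat"]
winner, count = Counter(votes).most_common(1)[0]   # Dog wins with 2 votes
```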
Ensemble Learning – Random Forest
What is Random Forest?
▪ A supervised learning algorithm
▪ Builds predictive models for both classification and regression
▪ Works as a large collection of uncorrelated decision trees
Applications:
Random Forest
Advantages:
▪ Can be used for both regression and classification tasks
▪ With enough trees the classifier won't overfit the model
▪ Wider diversity among trees generally yields a better model
Disadvantage:
▪ A large number of trees may slow down the algorithm, making it ineffective for real-time predictions
How it works – Stage 1
▪ Randomly select “m” features out of the total “M” features (m < M)
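The random feature-subset idea from Stage 1 corresponds to the max_features parameter in scikit-learn's implementation. A minimal sketch on synthetic data (feature counts and target are made up for illustration):

```python
# Random forest sketch with scikit-learn: each tree sees a bootstrap sample
# of the data and tries only a random subset of m features at every split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                # M = 4 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # simple separable target

clf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # m = sqrt(M) features tried at each split
    random_state=0,
).fit(X, y)

acc = clf.score(X, y)      # training accuracy of the ensemble
```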