
Lead Research Scientist, Financial Risk Quantitative Research, SS&C Algorithmics Adjunct Professor, University of Toronto
MIE1624H – Introduction to Data Science and Analytics Lecture 10 – Advanced Machine Learning
University of Toronto March 22, 2022


Machine learning
Machine learning gives computers the ability to learn without being explicitly programmed
■ Supervised learning: decision trees, ensembles (bagging, boosting, random forests), k-NN, linear regression, Naive Bayes, neural networks, logistic regression, SVM
❑ Classification
❑ Regression (prediction)
■ Unsupervised learning: k-means, c-means, hierarchical clustering, DBSCAN
❑ Clustering
❑ Dimensionality reduction (PCA, LDA, factor analysis, t-SNE)
❑ Association rules (market basket analysis)
■ Reinforcement learning
❑ Dynamic programming
■ Neural nets: deep learning, multilayer perceptron, recurrent neural network (RNN), convolutional neural network (CNN), generative adversarial network (GAN)

Machine learning
[Figure: taxonomy of machine learning methods, annotated with the lectures covering each branch (Lectures #5, #7, #10, #11). Source: Intro to Machine Learning]

 Unsupervised Machine Learning – Clustering

Cluster analysis (segmentation)
▪ Unsupervised learning algorithm
o Unlabeled data and no “target” variable
▪ Frequently used for segmentation (to identify natural groupings of customers)
o Market segmentation, customer segmentation
▪ Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items
o Data points in one cluster are more similar to one another
o Data points in separate clusters are less similar to one another
[Figure: scatter plot of data points grouped into Cluster #1, Cluster #2, and Cluster #3]
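The distance-measure idea above can be sketched in a few lines: assign each point to the nearest of two cluster centres using Euclidean distance. This is a minimal illustration with made-up data, not an example from the lecture; only NumPy is assumed.

```python
import numpy as np

# Toy data (illustrative, not from the lecture): four points and two
# candidate cluster centres.
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centres = np.array([[1.0, 1.5], [8.5, 9.0]])

# Pairwise Euclidean distances, shape (n_points, n_centres)
dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)

# Each point joins the cluster whose centre is closest
labels = dists.argmin(axis=1)
print(labels)  # first two points -> cluster 0, last two -> cluster 1
```

The same "closest centre" rule is the assignment step inside k-means, covered next.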

K-means clustering
[Figure series illustrating the k-means algorithm step by step. Source: Unsupervised Machine Learning]
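A minimal k-means sketch with scikit-learn on synthetic data (not the lecture's dataset): the familiar loop — pick K centroids, assign each point to the nearest centroid, recompute centroids as cluster means, repeat until convergence — is what `KMeans.fit` runs internally.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0, 0), 0.5, (50, 2)),   # cluster around (0, 0)
    rng.normal((5, 5), 0.5, (50, 2)),   # cluster around (5, 5)
    rng.normal((0, 5), 0.5, (50, 2)),   # cluster around (0, 5)
])

# K must be supplied up front — a key limitation of k-means
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # approximately the three blob centres
print(km.inertia_)           # within-cluster sum of squared distances
```

`inertia_` is the quantity k-means minimizes; plotting it against K is the usual "elbow" heuristic for choosing the number of clusters.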

Clustering: LinkedIn

Cluster analysis – K-means clustering
Source: , Cluster Analysis

Cluster analysis – Fuzzy C-means clustering (FCM)
[Figure series illustrating fuzzy c-means clustering. Source: Cluster Analysis]
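A minimal fuzzy c-means sketch in plain NumPy (an illustrative implementation, not the lecture's code). Unlike k-means, each point receives a degree of membership in every cluster; the fuzzifier m > 1 controls how soft the assignments are.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)       # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]   # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)            # avoid division by zero
        # Standard FCM update: u_ik proportional to d_ik^(-2/(m-1))
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centres

# Two tight toy groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
U, centres = fuzzy_c_means(X, c=2)
print(U.round(3))   # soft memberships; each row sums to 1
```

Taking `U.argmax(axis=1)` recovers hard labels, but the soft memberships themselves are the point of FCM — points near a boundary get split membership.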

Cluster analysis – Hierarchical clustering
[Figure series illustrating hierarchical clustering. Source: Cluster Analysis]
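An agglomerative hierarchical clustering sketch with SciPy on toy data (not the lecture's example): each point starts as its own cluster, the two closest clusters are merged repeatedly, and the resulting tree can be cut at any level to yield a flat clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs plus one isolated point
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 0.0]])

Z = linkage(X, method="ward")                    # the full merge tree
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself; cutting at different heights trades off cluster count against cluster tightness without re-running the algorithm.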

AI for course curriculum design – clustering of skills

Deep Learning and Computer Vision (neural nets, deep learning, AI, Python, TensorFlow)
Data Management
(databases, data structures, SQL, noSQL, web-scraping, APIs, intro to Big Data)
Statistical Analysis for Business (statistical modeling, hypothesis testing, SPSS)
Distributed Computing, Big Data Analytics (distributed computing, Cloud, Hadoop, Spark)
Creative Thinking, Design Thinking

Cluster analysis – DBSCAN
[Figure series illustrating DBSCAN density-based clustering. Source: Cluster Analysis]
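A DBSCAN sketch with scikit-learn on synthetic data (not the lecture's example): clusters are dense regions, points in sparse regions are labelled -1 (noise), and no number of clusters is required — only a neighbourhood radius (`eps`) and a density threshold (`min_samples`).

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0, 0), 0.3, (40, 2)),   # dense blob
    rng.normal((5, 5), 0.3, (40, 2)),   # dense blob
    [[20.0, 20.0]],                     # an isolated outlier
])

# A point is "core" if it has >= min_samples neighbours within eps
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(set(db.labels_))   # two clusters plus -1 for the noise point
```

This also illustrates the comparison slide's point: DBSCAN found both blobs and flagged the outlier without being told K, which k-means cannot do.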

Main clustering algorithms
■ Partition based (K-means):
❑ Medium and large sized databases (relatively efficient)
❑ Produces sphere-like clusters
❑ Needs number of clusters (K)
■ Partition based (FCM):
❑ Produces fuzzy clusters
❑ Long computational time
■ Hierarchical based (agglomerative):
❑ Produces trees of clusters
■ Density based (DBSCAN):
❑ Produces arbitrary shaped clusters
❑ Good when dealing with spatial clusters (maps)

Cluster analysis – comparison
Source: , Cluster Analysis

Applications of clustering
■ Retail / Marketing:
❑ Identifying buying patterns of customers
❑ Finding associations among customers' demographic characteristics
❑ Recommending a new book to a customer by identifying clusters of books or clusters of customer preferences
■ Education:
❑ Education professionals may want to know the likes and dislikes of their students; by creating and understanding the different groups, they can package and market the various courses accordingly
■ Banking:
❑ Clustering normal transactions to find patterns of fraudulent credit card use
❑ Identifying clusters of customers, e.g., loyal customers
❑ Determining credit card spending by customer groups

Applications of clustering
■ Insurance:
❑ Fraud detection in claims analysis
❑ Assessing insurance risk of customers
■ Publishing / Media:
❑ Automatically categorizing news based on content
❑ Recommending similar news articles
❑ Tagging news
❑ Automatic fact checking
■ Medicine:
❑ Characterizing patient behaviour based on similar characteristics
❑ Identifying successful medical therapies for different illnesses
■ Biology:
❑ Clustering genetic markers to identify family ties

Other Machine Learning Algorithms

Machine learning
Source: Rahul “Artificial Intelligence Demystified”, http://www.analyticsvidhya.com/blog/2016/12/artificial-intelligence-demystified/


Association rules – unsupervised machine learning
▪ Frequently called market basket analysis; an unsupervised learning algorithm (no target variable)
▪ Detects associations (affinities) between variables (items or events)
▪ Example: if a customer purchased bread and bananas, there is an 80% probability that they will purchase milk during the same trip
▪ Multiple applications:
o Cross-sell and up-sell
o Targeted promotions
o Product bundling
o Store planograms
o Assortment optimization
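The bread-and-bananas example above rests on two quantities: support (how often an itemset appears) and confidence (how often the consequent appears given the antecedent). A minimal sketch in plain Python with toy transactions (the figures here are illustrative, not the lecture's 80% example):

```python
# Toy transaction data (each basket is a set of items)
transactions = [
    {"bread", "bananas", "milk"},
    {"bread", "bananas", "milk"},
    {"bread", "bananas", "milk"},
    {"bread", "bananas", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)

conf = confidence({"bread", "bananas"}, {"milk"})
print(conf)   # 3 of the 4 {bread, bananas} baskets also contain milk -> 0.75
```

Algorithms such as Apriori simply search for all rules whose support and confidence exceed chosen thresholds, pruning itemsets that are already too rare.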

Ensemble Learning

[Figure series: a basic ensemble for regression. Multiple models (e.g., linear fits such as 1.6 + 0.79·x) are built on the training data; for a new input, each model makes a prediction and the ensemble averages them, e.g., mean(3.18, 3.23, 3.57) and mean(5.55, 5.17, 4). Source: Ensemble Learning]
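The basic regression ensemble above can be sketched with scikit-learn: fit a few different models on the same training data, then average their predictions for a new input. The data below is synthetic, generated from the slide's 1.6 + 0.79·x line plus noise; the model choices are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data around the slide's line y = 1.6 + 0.79*x
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (100, 1))
y = 1.6 + 0.79 * X[:, 0] + rng.normal(0, 0.1, 100)

# Several different models built on the same training data
models = [
    LinearRegression().fit(X, y),
    DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y),
    DecisionTreeRegressor(max_depth=5, random_state=1).fit(X, y),
]

# Average the predictions for a new point
x_new = np.array([[2.0]])
preds = [m.predict(x_new)[0] for m in models]
ensemble = float(np.mean(preds))
print(preds, ensemble)   # individual predictions and their mean, near 3.18
```

Averaging reduces the variance of the individual models' errors — the same mechanism behind the ox-weighing anecdote that follows.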

In 1907, 787 villagers tried to guess the weight of an ox.
None of them guessed it correctly, but the average guess (542.9 kg) was very close to the actual weight of the ox (543.4 kg).
, The Wisdom of Crowds Source: , Ensemble Learning

A similar approach can be applied for classification. Due to random weight initialization and data partitioning, trained models (e.g., 1.6 + 0.79·x vs. 1.94 + 0.64·x) usually turn out to be slightly different.
[Figure series: a basic ensemble for classification. Multiple models are built on the training data; their predictions are combined by majority voting, e.g., Dog – 2, Cat – 1, so Dog wins. Source: Ensemble Learning]
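Majority voting for classification, as in the Dog-vs-Cat figure, reduces to counting votes. A minimal sketch (toy labels, standard library only):

```python
from collections import Counter

# Each model in the ensemble casts one vote for a class label
votes = ["Dog", "Dog", "Cat"]   # Dog – 2, Cat – 1

# The most common label wins
winner, count = Counter(votes).most_common(1)[0]
print(f"{winner} wins with {count} of {len(votes)} votes")
```

scikit-learn packages this pattern as `VotingClassifier` (hard voting counts labels; soft voting averages predicted probabilities instead).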

Ensemble Learning – Random Forest

Random Forest
What is Random Forest?
▪ A supervised learning algorithm
▪ Builds predictive models for both classification and regression
▪ Works as a large collection of uncorrelated decision trees
Applications:

Random Forest
Advantages:
▪ Can be used for both regression and classification tasks
▪ With enough trees, the classifier won't overfit the model
▪ Wide diversity leads to a better model
Disadvantages:
▪ A large number of trees may slow down the algorithm, making it ineffective for real-time predictions

Random Forest
How it works – Stage 1
▪ Randomly selecting “m” features out of a total of “M” features (m < M)
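A random forest sketch with scikit-learn on synthetic data (not the lecture's example): each tree is trained on a bootstrap sample and, at every split, considers only a random subset of m features out of M (`max_features`), which decorrelates the trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification problem with M = 10 features
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# max_features="sqrt" picks m ≈ sqrt(M) features at each split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))   # held-out accuracy
```

The feature subsampling is exactly the "m out of M" step above; combined with bootstrap sampling, it keeps the trees uncorrelated so their averaged votes generalize well.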