Tutorial 6
Machine Learning
Motivation
Focus more on the project.
What kinds of questions can you answer with ML?
How to choose the appropriate ML model?
If time allows, we will go over an exercise from a previous assignment.
Project Analytics/ML
Decide what statistics will be displayed on the webpage.
Student:
• You can use student experiences from previous semesters.
• Create scores to measure: how well students adhere to their schedule,
how often they are behind/ahead.
• Correlate with their expected vs. attained grades.
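As a rough illustration of the student-side ideas above, the sketch below computes one possible adherence score and correlates it with the gap between attained and expected grades. It is written in Python purely for illustration; the score definition, the function names, and the records are made-up assumptions, not part of the project specification.

```python
from math import sqrt

def adherence_score(planned_hours, actual_hours):
    """Hypothetical score: fraction of planned hours completed, capped at 1.0."""
    if planned_hours <= 0:
        return 0.0
    return min(actual_hours / planned_hours, 1.0)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up records: (planned hours, actual hours, expected grade, attained grade)
records = [(10, 10, 80, 82), (10, 7, 85, 78), (8, 4, 75, 60), (12, 12, 90, 91)]

scores = [adherence_score(p, a) for p, a, _, _ in records]
grade_gap = [attained - expected for _, _, expected, attained in records]

# Correlation between schedule adherence and (attained - expected) grade
r = pearson(scores, grade_gap)
print(scores, grade_gap, round(r, 3))
```

A real score would be recomputed from the database as new entries arrive, so that the statistic adapts to new data.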
Department:
• Track students’ time management.
• Find patterns that correlate with learning outcomes/grades and
work-life balance.
• Trends in grades over time.
• Trends in specific courses/assignments for expected workload vs.
actual workload.
It is NOT enough to display histograms, pie charts, etc.!
Data Mining vs. Machine Learning
The main difference is that Machine Learning includes prediction, whereas Data Mining does not.
For project purposes:
• You can do either, as long as the analysis automatically adapts to new data as the system is used.
• You can often use the same methods in either context.
Machine Learning
1. Supervised Learning
1.1 Classification
Naive Bayes
Decision Trees
Neural Networks
1.2 Regression
Linear Regression
Logistic Regression
Neural Networks
2. Unsupervised Learning
2.1 Clustering
Hierarchical Agglomerative Clustering
2.2 Dimensionality reduction
Principal Component Analysis (PCA)
What does the data look like?
Note: Just because it *may* be possible to predict GPA doesn't mean it adds value to the system.
Supervised Learning – Classification
Naive Bayes
For y = 0, 1, compare:
$$P(y = 0)\prod_{i=1}^{m} P(o_i \mid y = 0) \quad \text{vs.} \quad P(y = 1)\prod_{i=1}^{m} P(o_i \mid y = 1)$$
Classify y = 0 if the left-hand side is greater, or y = 1 otherwise.
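The comparison above can be sketched in a few lines. This is a minimal illustration in Python, not the tutorial's own implementation: it assumes categorical features, estimates the probabilities from counts, and adds add-one smoothing (an extra assumption, used here only to avoid zero probabilities). The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(y) and the counts needed for P(o_i = v | y)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {y: c / n for y, c in class_counts.items()}
    # value_counts[(i, y)][v] = number of class-y rows whose feature i equals v
    value_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
    return priors, value_counts, class_counts

def score(row, y, priors, value_counts, class_counts, n_values=2):
    """P(y) * prod_i P(o_i | y), with add-one smoothing (an added assumption)."""
    p = priors[y]
    for i, v in enumerate(row):
        p *= (value_counts[(i, y)][v] + 1) / (class_counts[y] + n_values)
    return p

def classify(row, priors, value_counts, class_counts):
    """Return 0 if the y = 0 side is greater, otherwise 1."""
    s0 = score(row, 0, priors, value_counts, class_counts)
    s1 = score(row, 1, priors, value_counts, class_counts)
    return 0 if s0 > s1 else 1

# Made-up binary features and labels
rows = [(1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1)]
labels = [1, 1, 0, 0, 0, 1]
model = train_naive_bayes(rows, labels)

print(classify((1, 1), *model), classify((0, 0), *model))
```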
Huge assumption: features are independent given the class!
Decision Trees
• Split training data into branches based on some logical rule.
• Maximize information gain at each split.
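To make "maximize information gain" concrete, here is a small Python sketch that scores candidate thresholds on one numeric feature. It assumes entropy as the impurity measure and a rule of the form "value <= threshold"; the function names and data are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Try each midpoint between sorted distinct values; return (threshold, gain)."""
    best = (None, 0.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = information_gain(labels, left, right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Made-up feature values and labels
threshold, gain = best_threshold([2, 4, 6, 8], [0, 0, 1, 1])
print(threshold, gain)
```

A full tree would apply the same search recursively to each branch until the branches are pure or too small to split.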
Supervised Learning – Regression
Linear Regression
y is a scalar:
$$y = \beta_0 + \beta_1 o_1 + \beta_2 o_2 + \dots + \beta_m o_m + \varepsilon$$
Huge assumption: the error ε (and hence y given the features) has a normal distribution!
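For the special case of a single feature (m = 1), the coefficients have a closed-form least-squares solution. The Python sketch below assumes that simplification; the data are made up, and the residuals it returns are what one would inspect (e.g., with a histogram) when checking the normality assumption.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x + error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

def residuals(xs, ys, beta0, beta1):
    """Observed minus fitted values; these estimate the error term."""
    return [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

# Made-up data, roughly y = 1 + 2x with small noise
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]

beta0, beta1 = fit_simple_linear_regression(xs, ys)
res = residuals(xs, ys, beta0, beta1)
print(round(beta0, 3), round(beta1, 3))
```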
Logistic Regression
For y = 0, 1:
$$P(y = 1 \mid o_1, o_2, \dots, o_m) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 o_1 + \beta_2 o_2 + \dots + \beta_m o_m)}}$$
Regression gives the probability of success, y = 1.
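The formula translates directly into code. In this Python sketch the coefficients are assumed to be already fitted (the values shown are invented), and the 0.5 cut-off used to turn the probability into a class is a common convention rather than something the formula prescribes.

```python
from math import exp

def predict_probability(betas, features):
    """P(y = 1 | o_1..o_m) = 1 / (1 + exp(-(b0 + b1*o1 + ... + bm*om)))."""
    z = betas[0] + sum(b * o for b, o in zip(betas[1:], features))
    return 1.0 / (1.0 + exp(-z))

def classify(betas, features, threshold=0.5):
    """Turn the probability into a 0/1 label using a chosen cut-off."""
    return 1 if predict_probability(betas, features) >= threshold else 0

# Invented coefficients: beta0 = -1, beta1 = 2
betas = [-1.0, 2.0]

p_mid = predict_probability(betas, [0.5])   # z = 0
p_high = predict_probability(betas, [2.0])  # z = 3
print(p_mid, round(p_high, 4), classify(betas, [2.0]))
```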
Unsupervised Learning – Clustering
Hierarchical Agglomerative Clustering
How to decide which model to pick?
Decide on the question you would like to answer.
• What would you like to predict? Is it scalar or binary/categorical?
• Or would you like to cluster similar samples?
Query the necessary data from the database in R.
Look at the data first to find patterns (e.g., whether the data is normally distributed, whether relationships are linear) by plotting columns of the data, histograms, etc.
Choose a model based on the question and data characteristics.
Make sure the data satisfies the assumptions for the chosen model.
Picking a method and hoping it works wastes time and can produce incorrect results.
Steps for Classification
STEP 1: Clean the dataset.
Check for NAs. Remove or fill in missing data. Check for outliers and remove them.
STEP 2: (Optional) Apply PCA to the data to reduce dimensionality.
STEP 3: Partition the dataset into training and testing data.
STEP 4: Learn the parameters using training data.
STEP 5: Evaluate the performance of the models on testing data (unseen data).
Confusion Matrix; ROC curve.
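The steps above can be sketched end to end as follows. This Python example is only an outline under simplifying assumptions: `None` stands in for NA, the optional PCA step is skipped, and the "model" in STEP 4 is a deliberately trivial one-feature threshold (the midpoint between the two class means). All names and data are hypothetical.

```python
import random

def clean(rows):
    """STEP 1: drop rows containing missing values (None stands in for NA)."""
    return [r for r in rows if None not in r]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """STEP 3: shuffle reproducibly, then split into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """STEP 4: toy model, the midpoint between the two class means."""
    mean0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    mean1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    return (mean0 + mean1) / 2

def confusion_matrix(actual, predicted):
    """STEP 5: counts of TP, FP, TN, FN for binary labels 0/1."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["TP" if a == 1 else "FP"] += 1
        else:
            counts["TN" if a == 0 else "FN"] += 1
    return counts

# Made-up rows: (feature, label); one row has a missing feature
rows = [(1.0, 0), (2.0, 0), (None, 1), (2.5, 0), (6.0, 1),
        (7.0, 1), (8.0, 1), (1.5, 0), (6.5, 1)]

data = clean(rows)
train, test = train_test_split(data)
threshold = fit_threshold(train)
actual = [y for _, y in test]
predicted = [1 if x > threshold else 0 for x, _ in test]
matrix = confusion_matrix(actual, predicted)
print(matrix)
```

An ROC curve would be obtained by repeating the last three lines over a range of thresholds and plotting the true-positive rate against the false-positive rate.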
Steps for Clustering
STEP 1: Clean the dataset.
Check for NAs. Remove or fill in missing data. Check for outliers and remove them.
STEP 2: (Optional) Apply PCA to the data to reduce dimensionality.
STEP 3: Decide on the parameters of the model (e.g., distance functions).
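To show where the choice of distance function enters, here is a minimal Python sketch of hierarchical agglomerative clustering. It assumes single linkage (cluster distance = distance between the closest pair of points) and stops at a requested number of clusters; both are choices made for this illustration, and the points are invented.

```python
def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative(points, n_clusters, distance=euclidean):
    """Single-linkage clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Invented 2-D points forming two obvious groups
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters = agglomerative(points, n_clusters=2)
print(clusters)
```

Passing a different `distance` callable (for example, Manhattan distance) changes which clusters merge first, which is why the distance function is decided before running the method.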