Model evaluation
1. Classification evaluation
a) Model evaluation
i. Assessing how well a trained model performs, within a supervised machine learning framework
b) Model selection
i. Choosing the best model from a number of possibilities
2. Evaluation strategies
a) Holdout
i. Definition
1. Each instance is randomly assigned as either a training instance or a testing instance.
2. No overlap between the datasets: the data is partitioned
3. Common splits: 90/10, 80/20, 50/50
ii. Advantage
1. Simple to work with and implement
2. Fairly high reproducibility
a) Reproducibility: it is easy to re-run the experiment and obtain the same result as before
iii. Disadvantage
1. Size of the split affects the estimate of the model’s behaviour
a) Lots of test instances, few training instances
i. The learner doesn’t have enough information to build an accurate model
b) Lots of training instances, few test instances
i. The learner builds an accurate model, but the test data might not be representative
b) Repeated random subsampling
i. Definition
1. Like holdout, but repeated multiple times
2. Evaluate by averaging (chosen metric) across the iterations
ii. Advantages
1. Averaging over multiple holdout iterations tends to produce more reliable results
iii. Disadvantage
1. Harder to reproduce: the randomness in the repeated splits makes exact replication difficult
2. Slower than holdout (by a constant factor)
3. A wrong choice of training set / test set sizes can still lead to highly misleading results
c) Cross-validation (usually preferred alternative)
i. Definition
1. The data is progressively split into a number of partitions m (m > 2)
2. Iteratively, one partition is used as test data and the remaining m-1 partitions as training data
3. The evaluation metric is aggregated across the m test partitions (a code sketch comparing holdout, repeated random subsampling and cross-validation follows this section)
ii. Advantage: why is this better than holdout/repeated random subsampling
1. Every instance is a test instance, for some partition
a) Without dataset overlap
b) Evaluation metrics are calculated with respect to the entire dataset, so the estimate generalises better than repeated random subsampling
2. Takes roughly the same amount of time as repeated random subsampling
3. Very reproducible
4. Can be shown to minimise the bias and variance of our estimate of the classifier’s performance (every instance is used as a test instance, so the error of the estimate decreases)
iii. How big is m?
1. Number of folds directly impacts runtime and size of datasets
a) Small m: fewer folds, more instances per partition, more variance in the estimates
b) Large m: more folds, fewer instances per partition, less variance but slower
2. Most common choice of m: 10
3. Best choice: m = N (known as leave-one-out-cross-validation)
a) Maximises training data for the model
b) Mimics actual testing behaviour
c) Way too slow
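A minimal sketch (not from the lecture) comparing the three evaluation strategies with scikit-learn; the iris dataset and the decision-tree classifier are placeholder choices, used only to make the code runnable.

```python
# Sketch: holdout, repeated random subsampling, and m-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: a single random 80/20 partition of the data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_acc = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))

# Repeated random subsampling: average accuracy over several random 80/20 splits.
rss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
rss_acc = cross_val_score(clf, X, y, cv=rss).mean()

# m-fold cross-validation (m = 10): every instance is a test instance exactly once.
cv_acc = cross_val_score(clf, X, y, cv=10).mean()

print(holdout_acc, rss_acc, cv_acc)
```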
Inductive biases and no free lunch
a) The no free lunch theorem
a) There is NO universally superior classifier, unless we make assumptions about the data
b) Without such assumptions, we cannot determine a suitable classifier
b) Inductive learning hypothesis
a) A model that performs well on the training set is assumed to also perform well on the test set (i.e. on unseen instances)
c) Inductive biases
a) Definition: assumptions must be made about the data in order to build a model and make predictions
b) Different assumptions will lead to different predictions
c) Example:
i. NB- maximum conditional independence
ii. KNN – nearest neighbour
iii. Generic assumption-Stratification
1. Assume that the class distribution of unseen instances will be the same as the distribution of seen instances
Evaluation measures
1. Error rate E: the fraction of incorrect predictions, E = (FP + FN) / N
2. Accuracy: the fraction of correct predictions, Accuracy = 1 - E
3. Two types of errors
a) Contingency table (rows: actual class y, columns: predicted class):
                      Predicted positive     Predicted negative
  Actual positive     True positive (TP)     False negative (FN)
  Actual negative     False positive (FP)    True negative (TN)
b) Not all error is equal
i. Some problems want to strongly penalize false negative error
1. e.g. COVID / medical diagnosis: avoid Type II errors (false negatives)
ii. Other problems want to strongly penalize false positive errors
1. Spam filter: positive is spam, negative is an important email; we need to avoid false positives (Type I errors)
iii. Attach a ‘cost’ to each type of error
c) Accuracy = (TP + TN) / (TP + FP + FN + TN)
4. Precision and recall
i. With respect to just the interesting class:
1. Precision: how often are we correct when we predict that an instance is interesting? Precision = TP / (TP + FP)
a) When FN is less of a concern, precision is the better measure
2. Recall: what proportion of the truly interesting instances have we correctly identified as interesting? Recall = TP / (TP + FN)
a) For a cancer-detection model, where we want to avoid FNs, recall is a better measure than precision
ii. Problem
1. Precision / recall are in an inverse relationship when we set up our classifier:
a) The classifier has high precision, but low recall
b) The classifier has high recall, but low precision
2. How can we set up a classifier with both precision and recall high?
a) A popular metric that combines them is the F-score
iii. F-beta score: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). Beta indicates how many times more important recall is than precision; when beta = 1 (the F1 score, F1 = 2PR / (P + R)), recall and precision are equally important.
3. Question: in what scenarios do we need both precision and recall to be high?
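A small sketch of these metrics using scikit-learn; the label vectors below are made-up values, not lecture data.

```python
# Sketch: accuracy, precision, recall and F-scores for a binary task.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, fbeta_score, confusion_matrix)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = the "interesting" class
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))     # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))       # (TP + TN) / N
print(precision_score(y_true, y_pred))      # TP / (TP + FP)
print(recall_score(y_true, y_pred))         # TP / (TP + FN)
print(f1_score(y_true, y_pred))             # beta = 1: P and R equally weighted
print(fbeta_score(y_true, y_pred, beta=2))  # beta = 2: recall weighted more heavily
```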
5. Capturing trade-offs between different objectives
a) Receiver operating characteristics (ROC- curve)
i. Goal: maximise the area under the ROC curve (AUC)
1. Minimise FP (the false positive rate)
2. Maximise TP (the true positive rate)
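A minimal sketch of computing the ROC curve and AUC, assuming a classifier that outputs a score or probability for the positive class; the labels and scores below are invented.

```python
# Sketch: ROC curve points and AUC from classifier scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]  # e.g. P(class = 1) from some model

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))             # points on the ROC curve (FP rate vs TP rate)
print(roc_auc_score(y_true, y_score))  # area under the curve: 1.0 is perfect, 0.5 is random
```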
6. Metrics
7. Multi-class evaluation
a) Confusion matrix
i. Assume an interesting class I and several uninteresting classes (U1, U2…)
b) Accuracy
i. One vs rest
c) Precision/ recall/F-score
i. Step1: Calculated per-class
ii. Step2: Must be averaged
1. Macro-averaging
2. Micro-averaging
3. Weighted averaging
d) How to average your predictions?
i. Macro-averaging treats all classes equally: it emphasises small classes.
ii. Micro-averaging is dominated by large classes.
Question: Macro-averaged F-score: is it the F-score of the macro-averaged P (over classes) and the macro-averaged R (over classes)? Or the macro-average (over classes) of the F-score for each class?
• If we are doing Repeated Random Subsampling, and want weighted-averaged Precision, do we average the weighted Precision (over classes) for each iteration of Random Subsampling? or do we take the weighted average (over classes) of the Precision averaged over the iterations of Subsampling? Or the weighted average over the instances aggregated over the iterations?
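A small sketch of per-class versus macro/micro/weighted averaging with scikit-learn; the three-class city labels are illustrative only.

```python
# Sketch: per-class vs averaged precision on a 3-class problem.
from sklearn.metrics import precision_score

y_true = ['LA', 'LA', 'LA', 'NY', 'NY', 'SF']
y_pred = ['LA', 'NY', 'LA', 'NY', 'NY', 'NY']

# Step 1: precision calculated per class
print(precision_score(y_true, y_pred, average=None,
                      labels=['LA', 'NY', 'SF'], zero_division=0))
# Step 2: averaged in three different ways
print(precision_score(y_true, y_pred, average='macro', zero_division=0))     # unweighted mean over classes
print(precision_score(y_true, y_pred, average='micro', zero_division=0))     # pooled over all instances
print(precision_score(y_true, y_pred, average='weighted', zero_division=0))  # weighted by class frequency
```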
8. Model comparison
a) Baselines vs Benchmarks
i. Baseline = a naïve method that we would expect any reasonably well-developed method to beat (a simple reference algorithm)
1. e.g. for a novice marathon runner, the time to walk 42km
ii. Benchmark = an established rival technique that we are pitching our method against (a competing algorithm that our method aims to surpass)
1. e.g. for a marathon runner, the time of our last marathon run/the world record time/3 hours/…
b) The importance of baselines
i. Baselines are valuable for getting a sense of the intrinsic difficulty of a given task
ii. Importance: they help us judge whether our model is effective, how hard the problem is, and how much room for improvement remains
iii. If our method cannot even beat the baseline, it is of little value
iv. In formulating a baseline, we need to be sensitive to the importance of positives and negatives in the classification task
v. If a simple baseline already achieves 99% accuracy, there is little room for improvement; if it only achieves 20%, there is plenty of room, so a more sophisticated method is worthwhile
c) Random baseline
i. Method 1: randomly assign a class to each test instance
1. Often the only option in unsupervised / semi-supervised contexts
2. e.g. c1, c2, c3 = 0.33, 0.33, 0.33
ii. Method 2: randomly assign a class to each test instance, weighting the class assignment according to the prior class distribution
1. e.g. c1, c2, c3 = 0.1, 0.2, 0.7
2. Requirement: we need to know the prior class probabilities (see the DummyClassifier sketch after the Zero-R item)
d) Zero-R (zero rules)
i. Uses no attributes at all: every instance is predicted as the class that occurs most often in the training data
ii. The most commonly used baseline in machine learning
iii. Inappropriate if the majority class is FALSE and the learning task is to identify needles in the haystack (in such a task most training labels are "haystack", so the majority-class baseline will always predict "haystack")
iv. Method: assign every test instance to the majority class in the training data
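A minimal sketch of the random and Zero-R baselines using scikit-learn's DummyClassifier; the toy data is invented.

```python
# Sketch: random baselines and Zero-R via DummyClassifier.
from sklearn.dummy import DummyClassifier
import numpy as np

X = np.zeros((10, 1))                        # attributes are ignored by these baselines
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 2])

uniform    = DummyClassifier(strategy='uniform', random_state=0)     # Method 1: uniformly random class
stratified = DummyClassifier(strategy='stratified', random_state=0)  # Method 2: random, weighted by class priors
zero_r     = DummyClassifier(strategy='most_frequent')               # Zero-R: always the majority class

for name, clf in [('uniform', uniform), ('stratified', stratified), ('Zero-R', zero_r)]:
    clf.fit(X, y)
    print(name, clf.predict(X))
```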
e) One-R (One rule) – decision stump
i. Definition: for each attribute, build a one-level rule that maps each attribute value to its majority class; compute each rule's error rate and keep the attribute with the lowest error (know the procedure and how the error rate is computed)
ii. Advantages:
1. Easy to implement
2. Easy to comprehend
3. Good results (it selects the single most useful attribute)
iii. Disadvantages:
1. Cannot capture interactions between attributes (attributes are used one at a time, never in combination)
2. Biased towards high-arity attributes (attributes that can take many values)
Because an attribute with many distinct values tends to get a very small error rate, One-R prefers it, even though such an attribute is often not genuinely meaningful: the intent is to find a low-error attribute that helps prediction, but the selection ends up favouring whichever attribute simply has the most values.
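A rough sketch of the One-R procedure on an invented toy dataset (not the lecture's example): each attribute value predicts its majority class, and the attribute whose rule has the lowest training error is kept.

```python
# Sketch: One-R (decision stump over a single nominal attribute).
from collections import Counter

def one_r(instances, labels):
    n_attrs = len(instances[0])
    best = None
    for a in range(n_attrs):
        # Collect the labels seen for each value of attribute a
        by_value = {}
        for inst, lab in zip(instances, labels):
            by_value.setdefault(inst[a], []).append(lab)
        # Each value predicts its majority class
        rule = {v: Counter(labs).most_common(1)[0][0] for v, labs in by_value.items()}
        # Error rate of this one-attribute rule on the training data
        errors = sum(rule[inst[a]] != lab for inst, lab in zip(instances, labels))
        if best is None or errors < best[1]:
            best = (a, errors, rule)
    return best  # (attribute index, error count, value -> class rule)

# Toy data: attributes = (outlook, windy), label = play
X = [('sunny', 'no'), ('sunny', 'yes'), ('rain', 'no'), ('rain', 'yes'), ('overcast', 'no')]
y = ['no', 'no', 'yes', 'no', 'yes']
print(one_r(X, y))
```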
Lecture 8: feature selection and analysis
Feature selection
What makes features good?
1. Lead to better models: better performance according to some evaluation metric
2. Side-goal 1
a) Seeing important features can suggest other important features
b) Tell us interesting things about the problem
3. Side-goal 2
a) Fewer features -> smaller models -> faster answer
i. More accurate answer >> faster answer
Iterative feature selection: Wrappers
1. Definition: choose the subset of attributes that gives the best performance on the development data
a) Features are added one at a time
2. Advantage: feature set with optimal performance on the development data
3. Disadvantage: takes a long time
4. How long does the full wrapper method take?
a) Too long: only practical for very small datasets (there are 2^m possible subsets of m attributes)
5. Greedy approach (more practical wrapper methods)
First find the best single attribute, then the best pair containing it, and so on (see the sketch after this list)
a) Definition:
1. Train and evaluate a model on each single attribute (m attributes, e.g. {A, B, C, D})
2. Choose the best attribute
3. Iterate, adding one attribute at a time, until performance (e.g. accuracy) stops improving noticeably
b) Advantage: much faster than the full wrapper
c) Disadvantage:
1. Might find a sub-optimal solution
2. Needs to assume feature independence: interactions between features are not considered
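A minimal sketch of the greedy forward wrapper, assuming cross-validated accuracy as the evaluation metric; the iris dataset and Gaussian naive Bayes classifier are placeholder choices.

```python
# Sketch: greedy forward wrapper feature selection.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB()

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # Try adding each remaining attribute to the current subset
    scores = {a: cross_val_score(clf, X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    a, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:      # stop once performance no longer improves
        break
    selected.append(a)
    remaining.remove(a)
    best_score = score

print(selected, best_score)
```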
6. Ablation (more practical wrapper methods)
a) Definition:
1. Start with all attributes
2. Remove one attribute at a time, train and evaluate the model
3. Repeat until divergence
4. Terminate when performance starts to degrade too much
b) Advantage: mostly removes irrelevant attributes at the start
c) Disadvantage:
1. Assumes independence of attributes
2. Time consuming
3. Not feasible on non-trivial datasets (e.g. where all the features matter)
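A sketch of ablation-style backward selection using scikit-learn's SequentialFeatureSelector; this off-the-shelf routine removes attributes one at a time until a fixed number remain, which is comparable to, but not exactly, the performance-based termination described above. The dataset and classifier are placeholders.

```python
# Sketch: backward elimination with SequentialFeatureSelector.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=2,
                                direction='backward', cv=5)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the attributes that were kept
```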
Feature filtering
Evaluate the usefulness of each single feature independently of the others
1. Definition: evaluate the 'goodness' of each feature, separately from the other features
2. Advantage:
a) Considers each feature separately: linear time in the number of attributes (i.e. comparatively fast)
b) Possible to control for inter-dependence of features
c) Typically the most popular strategy
3. Common filtering methods:
a) Pointwise mutual information (PMI)
i. Goal: find attributes that are correlated with the class; the higher the PMI, the better
ii. Formula: PMI(A, C) = log2( P(A, C) / (P(A) * P(C)) ) (a worked computation for PMI, MI and chi-square follows the chi-square item below)
iii. Judgement:
1. PMI >> 0: the attribute and the class occur together much more often than by chance -> useful attribute
2. PMI ≈ 0: the attribute and the class are independent -> useless attribute
3. PMI << 0: the attribute and the class are negatively correlated
b) Mutual information
i. Definition: the weighted average of PMI over all attribute/class value combinations
ii. Contingency tables: compact representation of these frequency counts
iii. Formula: MI(A, C) = sum over a in {0,1} and c in {0,1} of P(a, c) * log2( P(a, c) / (P(a) * P(c)) )
c) Chi-square
i. Assume the attribute and the class are independent, compute the expected counts under that assumption, and compare them with the observed counts
1. A large chi-square value means the independence assumption does not hold: the attribute is correlated with the class
2. A chi-square value of (roughly) 0 means the assumption holds: they are independent
ii. Formula: chi-square = sum over cells of (O - E)^2 / E, where O is the observed count and E is the expected count under independence
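A worked sketch computing PMI, MI and chi-square for one binary attribute A and a binary class C, from an invented 2x2 contingency table of counts.

```python
# Sketch: PMI, MI and chi-square from a 2x2 contingency table.
import math

#                 C=1   C=0
counts = {(1, 1): 30, (1, 0): 10,    # rows for A=1
          (0, 1): 20, (0, 0): 40}    # rows for A=0
N = sum(counts.values())

def p(a=None, c=None):
    """Joint / marginal probability estimated from the counts."""
    return sum(v for (ai, ci), v in counts.items()
               if (a is None or ai == a) and (c is None or ci == c)) / N

# PMI for the attribute-present, class-present cell
pmi = math.log2(p(1, 1) / (p(a=1) * p(c=1)))

# MI: weighted average of PMI over all four cells
mi = sum(p(a, c) * math.log2(p(a, c) / (p(a=a) * p(c=c)))
         for a in (0, 1) for c in (0, 1) if p(a, c) > 0)

# Chi-square: compare observed counts with the counts expected under independence
chi2 = sum((counts[(a, c)] - N * p(a=a) * p(c=c)) ** 2 / (N * p(a=a) * p(c=c))
           for a in (0, 1) for c in (0, 1))

print(pmi, mi, chi2)
```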
4. What makes a single feature good?
a) Well correlated with the class
i. Knowing the attribute is present lets us predict the class with more confidence
b) Reverse (inversely) correlated with the class
i. Knowing the attribute is absent lets us predict the class with more confidence
c) Well correlated with 'not the class'
i. Knowing the attribute is present lets us predict 'not the class' with more confidence
5. What makes a feature bad?
a) Irrelevant
b) Correlated with other features
c) Good at predicting only one class
Common issue: so far we have only used binary (True/False) attributes; now we consider how to handle other raw data types
##multi-features
a. Type of attribute:
a) Nominal attributes
b) Continuous attributes
c) Ordinal attributes
b. Nominal attributes
a) Two common strategies
i. Treat as multiple binary attributes
ii. Modify the contingency tables and formulae
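A minimal sketch of the first strategy (one binary attribute per value, i.e. one-hot encoding) using pandas; the "city" attribute and its values are an invented example.

```python
# Sketch: converting one nominal attribute into multiple binary attributes.
import pandas as pd

df = pd.DataFrame({'city': ['LA', 'NY', 'SF', 'NY']})
# One 0/1 indicator column per distinct value of the nominal attribute
print(pd.get_dummies(df, columns=['city'], dtype=int))
```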
c. Continuous attributes
a) Usually dealt with by estimating probability based on a gaussian distribution
b) With large number of values, most random variables are normally distributed due to the central limit theorem
c) For small datasets or pathological (badly behaved) features, we may need to use binomial / multinomial distributions
d. Ordinal attributes
a) Three possibilities, roughly in order of popularity
i. Treat as binary
ii. Treat as continuous
iii. Treat as nominal
##multi-class
Multiclass problems (note: the slides on this were hard to follow)
So far, we’ve only looked at binary (Y/N) classification tasks.
Multiclass (e.g. LA, NY, C, At, SF) classification tasks are usually much more difficult.
PMI, MI and chi-square are calculated per class -> still valid, but less intuitive
Some feature selection metrics, such as information gain, work for all classes at once
In the multi-class MI example:
1. Intuitive features: features that genuinely indicate the class
2. Features for predicting 'not the class': features that indicate an instance is not in the class
3. Unintuitive features: features that occur rarely (possibly typos)
This example illustrates a weakness of MI:
Mutual information is biased toward rare, uninformative features
A few other common approaches to feature selection
1. A common unsupervised alternative
a) Term frequency inverse document frequency (TFIDF)
i. Used to detect important words, e.g. in NLP
ii. TF = the term is frequent enough (within the document)
iii. IDF = the term is distinctive enough (rare across the other documents)
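A small TF-IDF sketch with scikit-learn's TfidfVectorizer over a few made-up documents.

```python
# Sketch: TF-IDF weighting of terms in a toy document collection.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are animals"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)        # rows = documents, columns = terms
print(vec.get_feature_names_out())     # the learned vocabulary
print(tfidf.toarray().round(2))        # words common across documents (e.g. "the") get lower IDF weight
```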
2. Embedded methods
a) Decision trees
b) Regression models with regularization
i. Regularization: nudges the weight of unimportant features towards zero
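A sketch of an embedded method: L1-regularized logistic regression drives the weights of unimportant features to (or towards) zero. The breast-cancer dataset and the regularization strength C=0.1 are placeholder choices, not from the lecture.

```python
# Sketch: L1 regularization as an embedded feature selector.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
model.fit(X, y)

coefs = model.named_steps['logisticregression'].coef_[0]
print((coefs == 0).sum(), "of", len(coefs), "feature weights driven exactly to zero")
```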
3. More strategies
a) https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection