Lecture 8: Feature Selection and Analysis
Introduction to Machine Learning Semester 1, 2022
Copyright @ University of Melbourne 2021. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
Features in Machine Learning
Machine Learning Workflow
[Workflow diagram: task → model → evaluation. How well the model does depends on the features.]
Data Preparation vs Feature Selection
GIGO: Garbage In, Garbage Out
Data Preparation and Cleaning (discussed before)
• Data Cleaning
• Data Aggregation
• Dealing with missing values
• Transformation (e.g., log transform)
• Binarization
• Scaling or Normalization
Feature Selection (this lecture)
• Wrapper methods (aka recursive elimination)
• Filtering (aka univariate filtering)
• Glance into some other common approaches
Data Preparation vs Feature Selection
Our job as Machine Learning experts:
• Inspect / clean the data
• Choose a model suitable for classifying the data according to the attributes
• Choose attributes suitable for classifying the data according to the model
How do we choose attributes?
• Inspection
• Intuition
• In practice, neither is always possible, which motivates the systematic methods in this lecture
Feature Selection
What makes features good?
Lead to better models
• Better performance according to some evaluation metric
Side-goal 1
• Seeing important features can suggest other important features
• Tell us interesting things about the problem
Side-goal 2
• Fewer features → smaller models → faster answers
• But a more accurate answer matters more than a faster answer
Iterative feature selection: Wrappers
Choosing a good feature set
“Wrapper” methods
• Choose the subset of attributes that gives the best performance on the development data
• For example, for the Weather data set, train and evaluate a model on each candidate subset:
• {Outlook}
• {Temperature}
• {Outlook, Temperature}
• {Outlook, Temperature, Humidity}
• {Outlook, Temperature, Humidity, Windy}
• ... and so on, for every possible subset
• Best performance on the development data → best feature set
• Advantages:
• Feature set with optimal performance on development data
• Disadvantages:
• Takes a long time
Aside: how long does the full wrapper method take?
Assume we have a fast method (e.g. Naive Bayes) over a data set of non-trivial size (∼10K instances):
• Assume a train–evaluate cycle takes 10 seconds to complete
How many cycles? For m features:
• 2^m subsets = 2^m/6 minutes
• m = 10 → about 3 hours
• m = 60 → heat death of the universe
Only practical for very small data sets.
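As a concrete (if wildly impractical) illustration, here is a minimal sketch of the exhaustive wrapper search. It assumes scikit-learn-style estimators, NumPy feature matrices and a pre-split development set; the function name and the choice of Naive Bayes are illustrative, not from the lecture.

```python
from itertools import combinations

from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB


def exhaustive_wrapper(X_train, y_train, X_dev, y_dev, feature_names):
    """Train and evaluate a model on every non-empty feature subset; keep the best."""
    best_score, best_cols = -1.0, None
    n = len(feature_names)
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            cols = list(subset)
            model = GaussianNB()                                   # any fast classifier works here
            model.fit(X_train[:, cols], y_train)
            score = accuracy_score(y_dev, model.predict(X_dev[:, cols]))
            if score > best_score:
                best_score, best_cols = score, cols
    return [feature_names[i] for i in best_cols], best_score
```

The nested loop over `combinations` is exactly why the method is exponential: for m features it fits 2^m − 1 models.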
More practical wrapper methods: Greedy search
Greedy approach
• Train and evaluate the model on each single attribute
• Choose the best attribute
• Then, until convergence:
• Train and evaluate the model on the best attribute(s) so far, plus each remaining single attribute
• Add the best of the remaining attributes to the selected set
• Iterate until performance (e.g. accuracy) stops increasing (a minimal sketch follows below)
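A minimal sketch of this greedy forward selection, under the same assumptions as the exhaustive version above (the function name and the Naive Bayes choice are again illustrative):

```python
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB


def greedy_forward_selection(X_train, y_train, X_dev, y_dev, n_features):
    """Add one feature at a time, always keeping the addition that helps dev accuracy most."""
    selected, best_score = [], -1.0
    remaining = list(range(n_features))
    while remaining:
        scores = {}
        for f in remaining:
            cols = selected + [f]
            model = GaussianNB().fit(X_train[:, cols], y_train)
            scores[f] = accuracy_score(y_dev, model.predict(X_dev[:, cols]))
        best_f = max(scores, key=scores.get)
        if scores[best_f] <= best_score:          # stop once accuracy stops increasing
            break
        best_score = scores[best_f]
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score
```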
More practical wrapper methods: Greedy search
Greedy approach
• Bad news:
• Takes about m²/2 cycles, for m attributes
• In theory, 386 attributes → days
• Good news:
• In practice, it converges much more quickly than this
• Bad news again:
• It converges to a sub-optimal (and often very bad) solution
More practical wrapper methods: Ablation
“Ablation” approach
• Start with all attributes
• Remove one attribute, train and evaluate the model
• Then, repeatedly:
• From the remaining attributes, remove each attribute in turn, training and evaluating the model each time
• Permanently remove the attribute whose removal causes the least performance degradation
• The termination condition is usually: performance (e.g. accuracy) starts to degrade by more than ε
More practical wrapper methods: Ablation
“Ablation” approach
• Good news:
• Mostly removes irrelevant attributes (at the start)
• Bad news:
• Assumes independence of attributes (actually, both approaches do this)
• Takes O(m²) time, and the cycles are slower with more attributes
• Not feasible on non-trivial data sets (a library-based variant is sketched below)
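For completeness, scikit-learn ships a related backward-elimination wrapper, recursive feature elimination (RFE). Unlike the ablation described above, it drops the weakest features according to the model's own weights rather than development-set accuracy; the estimator and the choice of 5 features below are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data purely for illustration: 20 features, only a handful informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=4, random_state=0)

selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)

print(selector.support_)   # boolean mask over the 20 features: True = kept
print(selector.ranking_)   # 1 = kept; higher numbers were eliminated earlier
```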
Feature Filtering
Feature filtering
Intuition: Evaluate the “goodness” of each feature, separate from other features
• Consider each feature separately: linear time in the number of attributes
• Possible (but difficult) to control for inter-dependence of features
• Typically the most popular strategy
Feature “goodness”
What makes a single feature good?
Toy example

  a1   a2   c
  Y    Y    Y
  Y    N    Y
  N    Y    N
  N    N    N

Which of a1, a2 is good?
Pointwise Mutual Information
Discrepancy between the observed joint probability of two random variables A and C and the expected joint probability if A and C were independent.
Recall independence: P(C|A) = P(C), i.e. P(A, C) = P(A)P(C)
PMI is defined as

  PMI(A, C) = log2 [ P(A, C) / (P(A)P(C)) ]

We want to find attributes that are not independent of the class.
• If PMI >> 0, the attribute and class occur together much more often than randomly.
• If PMI ≈ 0, the attribute and class occur together as often as we would expect from random chance.
• If PMI << 0, the attribute and class are negatively correlated. (More on that later!)
Attributes with the greatest PMI: best attributes
Toy example, revisited

  a1   a2   c
  Y    Y    Y
  Y    N    Y
  N    Y    N
  N    N    N

Calculate the PMI of a1 and a2 with respect to c (writing a1, a2, c for the events a1=Y, a2=Y, c=Y):

P(a1) = 2/4    P(c) = 2/4    P(a1, c) = 2/4
PMI(a1, c) = log2 [ (2/4) / ((2/4)(2/4)) ] = log2(2) = 1

P(a2) = 2/4    P(c) = 2/4    P(a2, c) = 1/4
PMI(a2, c) = log2 [ (1/4) / ((2/4)(2/4)) ] = log2(1) = 0
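As a sanity check, a tiny helper (purely illustrative, not from the slides) that computes PMI from raw co-occurrence counts reproduces these two numbers:

```python
from math import log2


def pmi(n_ac, n_a, n_c, n_total):
    """PMI of events a and c from raw counts: log2( P(a,c) / (P(a)P(c)) )."""
    p_ac = n_ac / n_total
    p_a = n_a / n_total
    p_c = n_c / n_total
    return log2(p_ac / (p_a * p_c))


print(pmi(2, 2, 2, 4))  # a1 vs c -> 1.0
print(pmi(1, 2, 2, 4))  # a2 vs c -> 0.0
```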
Feature “goodness”, revisited
What makes a single feature good?
• Well correlated with class
• Knowing a lets us predict c with more confidence
• Reverse correlated with class
• Knowing ā lets us predict c with more confidence
• Well correlated (or reverse correlated) with not class
• Knowing a lets us predict c̄ with more confidence
• Usually not quite as good, but still useful
Mutual Information
• Expected value of PMI over all possible events
• For our example: combine the PMI of all possible combinations: a, ā, c, c̄
Aside: Contingency tables
Contingency tables: compact representation of these frequency counts
          a          ā          Total
  c       σ(a, c)    σ(ā, c)    σ(c)
  c̄       σ(a, c̄)    σ(ā, c̄)    σ(c̄)
  Total   σ(a)       σ(ā)       N

P(a, c) = σ(a, c) / N, etc.
Aside: Contingency tables
Contingency tables for toy example:
  a1      a=Y   a=N   Total
  c=Y     2     0     2
  c=N     0     2     2
  Total   2     2     4

  a2      a=Y   a=N   Total
  c=Y     1     1     2
  c=N     1     1     2
  Total   2     2     4
Mutual Information
Combine the PMI of all possible combinations: a, ā, c, c̄

MI(A, C) = P(a, c) PMI(a, c) + P(ā, c) PMI(ā, c) + P(a, c̄) PMI(a, c̄) + P(ā, c̄) PMI(ā, c̄)

         = P(a, c) log2 [ P(a, c) / (P(a)P(c)) ] + P(ā, c) log2 [ P(ā, c) / (P(ā)P(c)) ]
         + P(a, c̄) log2 [ P(a, c̄) / (P(a)P(c̄)) ] + P(ā, c̄) log2 [ P(ā, c̄) / (P(ā)P(c̄)) ]

Often written more compactly as:

MI(A, C) = Σ_{i ∈ {a, ā}} Σ_{j ∈ {c, c̄}} P(i, j) log2 [ P(i, j) / (P(i)P(j)) ]

We define 0 log 0 ≡ 0.
Mutual Information Example
Contingency Table for attribute a1:

  a1      a=Y   a=N   Total
  c=Y     2     0     2
  c=N     0     2     2
  Total   2     2     4

P(a, c) = 2/4;   P(a) = 2/4;   P(c) = 2/4;   P(a, c̄) = 0
P(ā, c̄) = 2/4;   P(ā) = 2/4;   P(c̄) = 2/4;   P(ā, c) = 0

MI(A1, C) = P(a, c) log2 [ P(a, c) / (P(a)P(c)) ] + P(ā, c) log2 [ P(ā, c) / (P(ā)P(c)) ]
          + P(a, c̄) log2 [ P(a, c̄) / (P(a)P(c̄)) ] + P(ā, c̄) log2 [ P(ā, c̄) / (P(ā)P(c̄)) ]

          = (1/2) log2 [ (1/2) / ((1/2)(1/2)) ] + 0 + 0 + (1/2) log2 [ (1/2) / ((1/2)(1/2)) ]

          = (1/2)(1) + 0 + 0 + (1/2)(1) = 1
Mutual Information Example continued
Contingency Table for attribute a2:

  a2      a=Y   a=N   Total
  c=Y     1     1     2
  c=N     1     1     2
  Total   2     2     4

P(a, c) = 1/4;   P(a) = 2/4;   P(c) = 2/4;   P(a, c̄) = 1/4
P(ā, c̄) = 1/4;   P(ā) = 2/4;   P(c̄) = 2/4;   P(ā, c) = 1/4

MI(A2, C) = P(a, c) log2 [ P(a, c) / (P(a)P(c)) ] + P(ā, c) log2 [ P(ā, c) / (P(ā)P(c)) ]
          + P(a, c̄) log2 [ P(a, c̄) / (P(a)P(c̄)) ] + P(ā, c̄) log2 [ P(ā, c̄) / (P(ā)P(c̄)) ]

          = (1/4) log2 [ (1/4) / ((1/2)(1/2)) ] + (1/4) log2 [ (1/4) / ((1/2)(1/2)) ]
          + (1/4) log2 [ (1/4) / ((1/2)(1/2)) ] + (1/4) log2 [ (1/4) / ((1/2)(1/2)) ]

          = (1/4)(0) + (1/4)(0) + (1/4)(0) + (1/4)(0) = 0
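A small helper (illustrative, not from the slides) that computes MI in bits directly from a 2×2 contingency table reproduces both results:

```python
from math import log2


def mutual_information(table):
    """MI in bits from a 2x2 table with rows = class values (c, c_bar) and columns = attribute values (a, a_bar)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]                         # sigma(c), sigma(c_bar)
    col_totals = [sum(row[j] for row in table) for j in range(2)]    # sigma(a), sigma(a_bar)
    mi = 0.0
    for i in range(2):
        for j in range(2):
            p_ij = table[i][j] / n
            if p_ij > 0:                                             # 0 log 0 is defined as 0
                mi += p_ij * log2(p_ij / ((row_totals[i] / n) * (col_totals[j] / n)))
    return mi


print(mutual_information([[2, 0], [0, 2]]))  # a1 -> 1.0
print(mutual_information([[1, 1], [1, 1]]))  # a2 -> 0.0
```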
χ2 (“Chi-square”)
Similar idea, different solution. Contingency table (shorthand):

          a        ā        Total
  c       W        X        W + X
  c̄       Y        Z        Y + Z
  Total   W + Y    X + Z    N = W + X + Y + Z

where W = σ(a, c), X = σ(ā, c), Y = σ(a, c̄), Z = σ(ā, c̄).

If a and c were independent (uncorrelated), what value would you expect in W?
Denote the expected value as E(W).

χ2 (“Chi-square”)
If a and c were independent, then P(a, c) = P(a)P(c), i.e.

  σ(a, c) / N = (σ(a) / N) × (σ(c) / N)

so

  E(W) = σ(a)σ(c) / N = (W + Y)(W + X) / (W + X + Y + Z)
χ2 (“Chi-square”)
Compare the value we actually observed O(W) with the expected value E(W):
• If the observed value is much greater than the expected value, a occurs more often with c than we would expect at random — predictive
• If the observed value is much smaller than the expected value, a occurs less often with c than we would expect at random — predictive
• If the observed value is close to the expected value, a occurs as often with c as we would expect randomly — not predictive
Similarly with X, Y, Z
χ2 (“Chi-square”)
Actual calculation (to fit to a chi-square distribution):

χ² = (O(W) − E(W))² / E(W) + (O(X) − E(X))² / E(X) + (O(Y) − E(Y))² / E(Y) + (O(Z) − E(Z))² / E(Z)

In general, for a contingency table with r rows and c columns:

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{i,j} − E_{i,j})² / E_{i,j}

• i sums over rows and j sums over columns.
• Because the differences are squared, χ² becomes much greater when |O − E| is large, even if E is also large.
Chi-square Example
Contingency table for the toy example (observed values):

  a1      a=Y   a=N   Total
  c=Y     2     0     2
  c=N     0     2     2
  Total   2     2     4

Contingency table for the toy example (expected values):

  a1      a=Y   a=N   Total
  c=Y     1     1     2
  c=N     1     1     2
  Total   2     2     4

Chi-square Example

χ²(A1, C) = (Oa,c − Ea,c)² / Ea,c + (Oā,c − Eā,c)² / Eā,c + (Oa,c̄ − Ea,c̄)² / Ea,c̄ + (Oā,c̄ − Eā,c̄)² / Eā,c̄
          = (2 − 1)²/1 + (0 − 1)²/1 + (0 − 1)²/1 + (2 − 1)²/1
          = 1 + 1 + 1 + 1 = 4

χ²(A2, C) is obviously 0, because all observed values are equal to the expected values.
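In practice you would rarely compute this by hand; SciPy's chi2_contingency does the same calculation (correction=False disables Yates' continuity correction so the result matches the hand calculation above):

```python
from scipy.stats import chi2_contingency

observed = [[2, 0],   # rows: c=Y, c=N; columns: a1=Y, a1=N
            [0, 2]]

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # 4.0, as in the hand calculation
print(expected)  # [[1. 1.], [1. 1.]]
```

scikit-learn's sklearn.feature_selection.chi2 (typically used inside SelectKBest) applies the same statistic to score every feature of a non-negative feature matrix at once.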
Common Issues
Types of Attribute
So far, we’ve only looked at binary (Y/N) attributes. What about:
• Nominal attributes
• Continuous attributes
• Ordinal attributes
Types of Attributes: Nominal
Two common strategies
1. Treat as multiple binary attributes:
• e.g. sunny=Y, overcast=N, rainy=N, etc.
• Can just use the formulae as given
• Results are sometimes difficult to interpret
• For example, Outlook=sunny is useful, but Outlook=overcast and Outlook=rainy are not useful... Should we use Outlook?
2. Modify contingency tables (and formulae):

  O       s   o   r
  c=Y     U   V   W
  c=N     X   Y   Z
Types of Attributes: Nominal
Modified MI:

MI(O, C) = Σ_{i ∈ {s, o, r}} Σ_{j ∈ {c, c̄}} P(i, j) log2 [ P(i, j) / (P(i)P(j)) ]

         = P(s, c) log2 [ P(s, c) / (P(s)P(c)) ] + P(s, c̄) log2 [ P(s, c̄) / (P(s)P(c̄)) ]
         + P(o, c) log2 [ P(o, c) / (P(o)P(c)) ] + P(o, c̄) log2 [ P(o, c̄) / (P(o)P(c̄)) ]
         + P(r, c) log2 [ P(r, c) / (P(r)P(c)) ] + P(r, c̄) log2 [ P(r, c̄) / (P(r)P(c̄)) ]

• Biased towards attributes with many values.
Types of Attributes: Nominal
Chi-square can be used as normal, with 6 observed/expected values.
• To control for score inflation, we need to consider “number of degrees of freedom”, and then use the significance test explicitly (beyond the scope of this subject)
Types of Attributes: Continuous
Continuous attributes
• Usually dealt with by estimating probability based on a Gaussian (normal) distribution
• With a large number of values, most random variables are normally distributed due to the Central Limit Theorem
• For small data sets or pathological features, we may need to use binomial/multinomial distributions
All of this is beyond the scope of this subject
Types of Attributes: Ordinal
Three possibilities, roughly in order of popularity:
1. Treat as binary
• Particularly appropriate for frequency counts where events are low-frequency (e.g. words in tweets)
2. Treat as continuous
• The fact that we haven’t seen any intermediate values is usually not important
• Does have all of the technical downsides of continuous attributes, however
3. Treat as nominal (i.e. throw away the ordering)
Multi-class problems
So far, we’ve only looked at binary (Y/N) classification tasks.
Multiclass (e.g. LA, NY, Chicago, Atlanta, SF) classification tasks are usually much more difficult.
What makes a single feature good?
• Highly correlated with class
• Highly reverse correlated with class
• Highly correlated (or reverse correlated) with not class
... What if there are many classes?
What makes a feature bad?
• Irrelevant
• Correlated with other features
• Good at only predicting one class (but is this truly bad?)
Multi-class problems
• PMI, MI and χ² are all calculated per class
• (Some other feature selection metrics, e.g. Information Gain, work for all classes at once)
• We need to make a point of selecting (hopefully uncorrelated) features for each class, to give our classifier the best chance of predicting everything correctly (a sketch of per-class selection follows below)
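One common pattern (a sketch under assumed scikit-learn conventions; the function name and the choice of 10 features per class are illustrative) is to score features against a one-vs-rest version of each class and take the union of the top features:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def per_class_top_features(X, y, classes, k=10):
    """Score features against a one-vs-rest target for each class; return the union of the top k per class."""
    selected = set()
    for c in classes:
        y_binary = (np.asarray(y) == c).astype(int)                 # 1 = this class, 0 = "not class"
        scores = mutual_info_classif(X, y_binary, random_state=0)
        selected.update(int(i) for i in np.argsort(scores)[-k:])    # indices of the k highest-scoring features
    return sorted(selected)
```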
Multi-class problems
Actual example: the top features per class by MI, for tweets from five cities (LA, NY, Chicago, Atlanta, SF).

[Table: highest-MI words for each city. Entries include city names and abbreviations (la, angeles, los, nyc, york, chicago, chi, il, atlanta, atl, ga, georgia, sf, san, francisco), sports teams (lakers, yankees, cubs, bears), and tokens such as hollywood, lol, u, save, httpbitlyczmk, httpdealnaycom.]

• Some of the selected features are intuitive: the city names and their abbreviations.
• Some features are good for predicting “not class”: for example, another city’s name is evidence that a tweet is not from the city in question.
• Some features are unintuitive: rare, noisy tokens such as the URL fragments (the next slide explains why MI selects these).
What’s going on with MI?
Mutual Information is biased toward rare, uninformative features
• All probabilities: no notion of the raw frequency of events
• If a feature is seen rarely, but always with a given class, it will be seen as “good”
• Best features in the Twitter dataset only had MI of about 0.01 bits; 100th best for a given class had MI of about 0.002 bits
Glance into a few other common approaches to feature selection
A common (unsupervised) alternative
Term Frequency Inverse Document Frequency (TFIDF)
• Detects important words; widely used in Natural Language Processing
• Finds words that are relevant to a document in a given document collection
• To be relevant, a word should be
• Frequent enough in the corpus (TF): a word that occurs only 5 times in a corpus of 5,000,000 words is probably not too interesting
• Special enough (IDF): a very common word (“the”, “you”, ...) that occurs in (almost) every document is probably not too interesting

tfidf(d, t, D) = tf × idf
tf = log(1 + freq(t, d))
idf = log( |D| / count(d ∈ D : t ∈ d) )

d = document, t = term, D = document collection; |D| = number of documents in D
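A minimal sketch of these formulas (illustrative only; libraries such as scikit-learn's TfidfVectorizer use slightly different smoothing and normalisation):

```python
import math


def tfidf(term, doc, docs):
    """TF-IDF of a term in one document (a list of tokens), following the formulas above."""
    tf = math.log(1 + doc.count(term))
    df = sum(1 for d in docs if term in d)     # number of documents containing the term
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
print(tfidf("cat", docs[0], docs))   # > 0: "cat" appears in only 2 of the 3 documents
print(tfidf("the", docs[0], docs))   # 0.0: "the" appears in every document, so idf = log(1) = 0
```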
Embedded Methods
Some ML models include feature selection inherently:
1. Decision trees: a generalization of 1-R
2. Regression models with regularization, e.g.

   house price = β0 + β1 × size + β2 × location + β3 × age

Regularization (or ‘penalty’) nudges the weights β of unimportant features towards zero (a sketch follows below).
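For instance, L1 (Lasso) regularization drives many coefficients to exactly zero, so the surviving non-zero coefficients act as the selected features. A minimal sketch with scikit-learn (the synthetic data and the alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data: 20 features, only 3 of which actually influence the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=3, noise=5.0, random_state=0)

model = Lasso(alpha=1.0)                 # larger alpha = stronger penalty = more weights pushed to zero
model.fit(X, y)

selected = np.flatnonzero(model.coef_)   # features whose weights survived the penalty
print(selected)
```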
https://towardsdatascience.com/a-beginners-guide-to-decision-tree-classification-6d3209353ea?gi=e0