
COMP20008 Elements of Data Processing Experimental Design
Semester 1, 2021
Chris Ewin

Where are we now?
2

Supervised vs Unsupervised Learning
3

Experimental Design
• Feature Selection
• Dimensionality Reduction
• Performance Evaluation
4

Feature Selection
5

Feature selection for machine learning
• A series of data wrangling steps result in a set of features
• Features are inputs to a model
• Choose attributes suitable for classifying the data according to the model
• Train the model
• Classify the data
• Evaluate the results

7

Good feature sets for machine learning
• Throw everything we can think of at the model and let the algorithm decide?
• Good features give better model performance
• Choose features suitable for classifying the data according to the model:
• Inspection
• Intuition
• Neither may be possible in practice
• Feature Selection Methods

Feature selection 1 – wrapper
“Wrapper” method:
Choose subset of attributes that give best performance on the data
https://www.analyticsvidhya.com/

Feature selection 1 – wrapper
Example: classification of Iris data, train and evaluate a model on each subset of attributes:
– {petal length}; {petal width}; {sepal width}; {sepal length}
– {petal length, petal width}; {petal length, sepal length}; …
– {petal length, petal width, sepal length}; {petal length, petal width, sepal width}; …
– {petal length, petal width, sepal length, sepal width}
• Choose the subset giving the best performance
• Disadvantages: time consuming (the number of subsets grows exponentially with the number of features)
• Only practical for very small data sets
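A minimal sketch of the exhaustive wrapper on the Iris data (the k-NN classifier and 5-fold cross-validated accuracy used for scoring are illustrative assumptions, not prescribed by the slides):

# Exhaustive wrapper: score every non-empty subset of the four Iris features
# and keep the best. Sketch only; k-NN + 5-fold accuracy are assumed choices.
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X, y, names = iris.data, iris.target, iris.feature_names
best_score, best_subset = 0.0, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(KNeighborsClassifier(), X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset
print([names[i] for i in best_subset], best_score)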

Practical wrapper approach – Greedy
“Greedy” approach:
• Train and evaluate the model on each single feature
• Choose the best feature
• Repeat until performance stops increasing:
• Train and evaluate the model on the best feature(s) plus each remaining single feature
• Add the best feature to the selected feature set

Practical wrapper approach – Greedy
“Greedy” approach:
• Assumes independence of features
• Quicker than the naïve wrapper approach
• May converge to a sub-optimal (and sometimes very bad) solution
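A hand-rolled sketch of the greedy (forward) wrapper, under the same illustrative assumptions (k-NN classifier, 5-fold cross-validated accuracy):

# Greedy forward selection: add one feature at a time while the score improves.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
def score(cols):
    return cross_val_score(KNeighborsClassifier(), X[:, cols], y, cv=5).mean()
selected, remaining, best = [], list(range(X.shape[1])), 0.0
while remaining:
    candidate = max(remaining, key=lambda f: score(selected + [f]))
    candidate_score = score(selected + [candidate])
    if candidate_score <= best:      # stop once performance no longer improves
        break
    selected.append(candidate)       # add the best remaining feature
    remaining.remove(candidate)
    best = candidate_score
print(selected, best)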

Practical wrapper approach – Decremental
“Decremental” approach:
• Start with all features
• Remove one feature, train and evaluate model
• Repeat until performance degrades too much:
• From the remaining features, remove one at a time, train and evaluate the model
• Remove the feature that causes the least performance degradation

Practical wrapper approach – Decremental
“Decremental” approach:
• Assumes independence of features
• Mostly removes irrelevant features
• Slower than the greedy approach when there are many features
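scikit-learn ships a generic sequential wrapper that can also run in the decremental direction; a sketch using SequentialFeatureSelector (available from scikit-learn 0.24 onwards; the k-NN estimator and the target of two features are assumptions):

# Backward (decremental) selection: start from all features, drop one at a time.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=2,
                                direction='backward', cv=5)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask over the original features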

Feature selection 2 – filtering
Intuition: it is possible to evaluate the “goodness” of each feature separately from the other features
• Consider each feature separately: linear time in the number of attributes
• Typically the most popular strategy
• Possible (but difficult) to control for inter-dependence of features

Feature filtering
What makes a single feature good?
• Well correlated with class (ratio, interval, ordinal)
• Not independent of class (ordinal, nominal)
• A simple example: which of a1, a2 is a good feature?

Feature filtering – Mutual information
What makes a single feature good?
• Well correlated with class
Mutual Information:
• MI(X, Y) = H(Y) − H(Y|X)
• MI(c, a1) = H(c) − H(c|a1) = 1 − 0 = 1
• H(c) = −Σ_{i∈{Y,N}} p_i log2(p_i) = −(1/2 × log2(1/2) + 1/2 × log2(1/2)) = 1
• H(c|a1) = Σ_{x∈{Y,N}} p(a1=x) H(c|a1=x) = 1/2 × 0 + 1/2 × 0 = 0
• High information gain: a1 can strongly predict c; keep a1 in the feature set

Feature filtering – Mutual information
What makes a single feature good?
• Well correlated with class
Mutual Information:
• MI(X, Y) = H(Y) − H(Y|X)
• MI(c, a2) = H(c) − H(c|a2) = 1 − 1 = 0
• H(c) = −Σ_{i∈{Y,N}} p_i log2(p_i) = −(1/2 × log2(1/2) + 1/2 × log2(1/2)) = 1
• H(c|a2) = Σ_{x∈{Y,N}} p(a2=x) H(c|a2=x) = 1/2 × 1 + 1/2 × 1 = 1
• No information gain: a2 cannot predict c; remove a2 from the feature set
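The calculation can be checked in a few lines of numpy. The four rows below are an assumed toy dataset consistent with the numbers above: a1 matches the class exactly, while a2 tells us nothing about it:

# MI(c, a) = H(c) - H(c|a); expect 1 bit for a1 and 0 bits for a2.
import numpy as np
c  = np.array(['Y', 'Y', 'N', 'N'])   # class labels (assumed toy data)
a1 = np.array(['Y', 'Y', 'N', 'N'])   # perfectly predicts c
a2 = np.array(['Y', 'N', 'Y', 'N'])   # independent of c
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()
def conditional_entropy(labels, given):
    values, counts = np.unique(given, return_counts=True)
    return sum((n / len(given)) * entropy(labels[given == v])
               for v, n in zip(values, counts))
for name, a in [('a1', a1), ('a2', a2)]:
    print(name, entropy(c) - conditional_entropy(c, a))   # a1 -> 1.0, a2 -> 0.0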

Feature filtering – Mutual Information
• Can select the group of features that have the highest MI with the class label
• Difficult to understand the significance of the MI number
• Difficult to assess whether features are ‘good’ or just better than other features

Feature filtering – Chi-square 2
What makes a single feature good?
• Not independent of class
• Remove a feature if it is independent of the output class
Chi-square test is a statistical hypothesis test for quantifying the independence of pairs of categorical (ordinal, nominal) variables.
• H0: two variables are independent
• Takes into account sample size & has a significance test (MI does not)

Feature filtering – Chi-square 2
• Summarise frequencies of feature and the class in a contingency table
• Is there a difference between observed and expected frequencies?
Observed frequencies of a1 and c, e.g. O(c=Y, a1=Y) = 2

Chi-square 2 – Null Hypothesis
H0:twovariablesareindependent:𝑃 𝑥∩𝑦 =𝑃 𝑥 ×𝑃 𝑦 • Find expected frequencies for all (a1, c) pairs
P:(𝑐=𝑌and𝑎 =𝑌)=2×2 =1;Expectedvalue𝐸
1 4 4 4 (𝑐=𝑌,𝑎1=𝑌)
P:(𝑐=𝑌and𝑎1 =𝑁)=1;Expectedvalue=1 4
P:(𝑐=𝑁and𝑎1 =𝑌)=1;Expectedvalue=1 4
P:(𝑐=𝑁and𝑎1 =𝑁)=1;Expectedvalue=1 4
=4×1 =1 4
11
1
1

Chi-square 2 – calculate for a1
• Calculate Difference between observed and expected values
•𝜒2 𝑎1,𝑐 =σ σ (𝑂𝑖,𝑗−𝐸𝑖,𝑗)2 𝑖𝜖{𝑐= 𝑌,𝑐=𝑁} 𝑗𝜖{𝑎1=𝑌,𝑎1=𝑁} 𝐸𝑖,𝑗
•𝝌𝟐𝒂𝟏,𝒄= 2−12+0−12+0−12+2−12 =𝟒 1111
Contingency table (observed)
Contingency table (expected)
11
1
1

Chi-square 2 – test for a1
The Chi-square statistic for (𝑎1, 𝑐) is: 𝜒2 𝑎1, 𝑐 = 4
Are 𝑎1 and 𝑐 independent? (the null hypothesis H0) • With 95% confidence (alpha = 0.05)
• Degree of freedom = 2 − 1 × 2 − 1 = 1 (𝑎1 has 2 values, 𝑐 has 2 values) • The Chi-Square critical value is 3.84 (lookup this value)
• Reject H0
• 𝑎1 is NOT independent of c • 𝑎1 is part of the feature set

Chi-square test in Python
from scipy.stats import chi2
Will practise in workshop
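A minimal sketch verifying the hand calculation above: scipy's chi2_contingency computes the statistic, degrees of freedom and expected counts from the observed table, and chi2.ppf looks up the 3.84 critical value quoted on the earlier slide:

# Chi-square test for the (c, a1) contingency table: rows c=Y/N, columns a1=Y/N.
import numpy as np
from scipy.stats import chi2, chi2_contingency
observed = np.array([[2, 0],
                     [0, 2]])
# correction=False turns off Yates' continuity correction so the statistic
# matches the hand calculation (chi-square = 4.0 with 1 degree of freedom).
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(stat, dof, expected)               # 4.0, 1, [[1. 1.] [1. 1.]]
critical = chi2.ppf(0.95, df=dof)        # 3.84
print(stat > critical, p_value < 0.05)   # True, True -> reject H0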

Feature Selection 3 – embedded
“Embedded” methods:
• Some models perform feature selection as part of the algorithm
• Linear classifiers, e.g. regression
• To some degree: Decision Trees
• Often still benefit from other feature selection approaches

Feature Selection – potential issues
Difficult to control for inter-dependence of features
• Feature filtering of single features may remove important features. For example, where the class is the XOR of some features:
• Given all the features, the class is totally predictable.
• Given only one of them, the MI or chi-square statistic is 0.
• In practice, feature engineering is also used, i.e. constructing new features out of existing ones, e.g. the ratio of or difference between features

Income   Expenditure   i>e   Credit
120      100           1     Good
50       30            1     Good
50       70            0     Bad
200      40            1     Good
200      210           0     Bad
…        …             …     …
160      150           1     Good


Dimensionality Reduction
29

Dimensionality reduction
Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by algorithms
• Allow data to be more easily visualised
• May help to eliminate irrelevant features or reduce noise
• Fewer features → smaller machine learning models → faster answer
• Better feature set by transformation → more accurate answer

Motivation: High dimensional data
• The curse of dimensionality: “Data analysis techniques which work well at lower dimensions (fewer features), often perform poorly as the dimensionality of the analysed data increases (lots of features)”
• As dimensionality increases, data becomes increasingly sparse and all the distances between pairs of points begin to look the same. Impacts any algorithm that is based on distances between objects.
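A small numpy demonstration of this effect (the sample size and dimensions below are arbitrary choices): for uniformly random points, the relative gap between the nearest and the farthest point from a query shrinks as dimensionality grows.

# Distance concentration: nearest and farthest points become almost equally far.
import numpy as np
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))               # 1000 random points in [0, 1]^d
    q = rng.random(d)                       # a random query point
    dist = np.linalg.norm(X - q, axis=1)    # Euclidean distances to the query
    print(d, (dist.max() - dist.min()) / dist.min())   # relative contrast shrinks as d grows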

Dimensionality reduction
Input: a dataset with 𝑁 features and 𝐾 objects
Output: a transformed dataset with n ≪ 𝑁 features and 𝐾 objects
• If n = 2 or 3, the transformed dataset can be easily visualised
Example: n = 2
[Diagram: a table of K objects with columns Feature1 … FeatureN is transformed into a table of the same K objects with just two columns, NewFeatureA and NewFeatureB]

Transforming from N to 𝑛 ≪ 𝑁 dimensions
• The transformation should preserve the characteristics of the data, e.g. distances between pairs of objects
• If a pair of objects is close before the transformation, they should still be close after the transformation
• If a pair of objects is far apart before the transformation, they should still be far apart after the transformation
• The set of nearest neighbors of an object before the transformation should ideally be the same after the transformation

Question
• Suppose we are given a dataset with the following N features, describing individuals in this class. Which two features would you select to represent people, in such a way that “distances” between pairs of people are likely to be preserved in the reduced dataset?
• Input: N=7 features
1. Weighted average mark (WAM)
2. Age (years)
3. Height (cm)
4. Weight (kg)
5. Number of pets owned
6. Number of subjects passed so far
7. Amount of sleep last night (0=little, 1=medium, 2=a lot)
• Output: Select 2 of the above features

Dimensionality reduction
• Basic method: To reduce dimensionality perform feature selection.
• Scatter plots for Iris dataset shown earlier – 2D visualisations of a 4D dataset. 2 features were selected from the original 4.
• In general, when transforming a dataset from 𝑁 to 𝑛 (𝑛 ≪ 𝑁) features
• The output 𝑛 features do not need to be a subset of the input 𝑁 features. Rather, they can be new features whose values are constructed using some function applied to the input 𝑁 features

Principal components analysis (PCA)
• Find a new set of features that better captures the variability of the data
• First dimension chosen to capture as much of the variability as possible;
• The second dimension is orthogonal to the first, and subject to that constraint, captures as much of the remaining variability as possible,
• The third dimension is orthogonal to the first and second, and subject to that constraint, captures as much of the remaining variability as possible,
• Etc.
• Each new dimension is a linear weighted combination of the original features (feature engineering)
• We will not be covering the mathematical details
• Many tutorials available on the web. Nice application of linear algebra.
• A good visualisation: http://setosa.io/ev/principal-component-analysis

Contrast PCA against a scatter matrix
Scatter plots for iris dataset

Iris dataset: Reduced from 4 to 2 dimensions using PCA

AFL football dataset: From http://afltables.com/afl/stats/
2014 Player Stats – Adelaide [Game by Game]
[Table: one row per Adelaide player (Dangerfield, Sloane, Thompson, Smith, Jaensch, Douglas, Wright, Jacobs, Mackay, Betts, Podsiadly, Brown, Crouch, Martin, Talia, Laird, Jenkins, Walker, Reilly, Kerridge, …) with season statistics in columns GM KI MK HB DI DA GL BH HO TK RB IF CL CG FF FA BR CP UP CM MI 1% BO GA %P SU]

AFL football dataset: PCA in 2D

Principal components analysis in Python
sklearn.decomposition
Will practise in workshop
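A minimal sketch reducing the 4-dimensional Iris data to 2 principal components (standardising the features first is a common, but optional, choice):

# PCA: project the data onto the two directions of greatest variability.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)     # zero mean, unit variance per feature
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component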

Performance Evaluation
42

Performance Evaluation
• Metrics for Performance Evaluation
• Training versus Testing Data
• Sampling and Cross-Validation
43

Unsupervised algorithm evaluation
How would you evaluate the performance of a k-means clustering?
1. Does the clustering ‘look good’?
2. Compare to some ‘known points’?
Thoughts?
44

Supervised algorithm Evaluation
For an accurate decision tree classifier, we want to minimise both:
• False positives (saying yes when we should say no)
• False negatives (saying no when we should say yes)
Thoughts?
45

Metrics for Performance Evaluation
For supervised algorithms – can be summarized in a Confusion Matrix (contingency table)
• Actual class: {yes, no, yes, yes, …}
• Predicted class: {no, yes, yes, no, …}

                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS   Class=Yes  a (TP)       b (FN)
               Class=No   c (FP)       d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
46

Metrics for Performance Evaluation
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS   Class=Yes  a (TP)       b (FN)
               Class=No   c (FP)       d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
47

Metrics for Performance Evaluation
• Actual class: {yes, no, yes, yes, no, yes, no, no}
• Predicted: {no, yes, yes, no, yes, no, no, yes}

                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL CLASS   Class=Yes  a = 1 (TP)   b = 3 (FN)
               Class=No   c = 3 (FP)   d = 1 (TN)
Accuracy=?
48
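The same numbers can be obtained with sklearn.metrics (a sketch; the label ordering is chosen to match the slide layout):

# Confusion matrix and accuracy for the actual/predicted labels above.
from sklearn.metrics import accuracy_score, confusion_matrix
actual    = ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'no']
predicted = ['no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes']
print(confusion_matrix(actual, predicted, labels=['yes', 'no']))
# [[1 3]    -> a = 1 (TP), b = 3 (FN)
#  [3 1]]   -> c = 3 (FP), d = 1 (TN)
print(accuracy_score(actual, predicted))   # (1 + 1) / 8 = 0.25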

Limitations of accuracy
• Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10
• If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
• Accuracy is misleading because we do not detect any class 1 example
• Other metrics can be used instead of accuracy to address this problem
49

Imbalanced data – Bootstrapping
Procedure:
1. Draw k datasets from original data by sampling with replacement
2. Run the algorithm for each dataset and compute accuracy
3. Return the mean and standard deviation of accuracies
• Can estimate confidence in a prediction y=f(x)
• Can handle imbalanced data sets – biased sampling.
50
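A sketch of the bootstrap procedure (the decision tree, k = 100 resamples, and evaluating each model on the rows left out of its resample are illustrative choices):

# Bootstrap: resample with replacement, train, evaluate, report mean and std.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
X, y = load_iris(return_X_y=True)
n = len(y)
scores = []
for k in range(100):
    idx = resample(np.arange(n), replace=True, n_samples=n, random_state=k)
    out_of_bag = np.setdiff1d(np.arange(n), idx)   # rows not drawn -> test set
    model = DecisionTreeClassifier().fit(X[idx], y[idx])
    scores.append(accuracy_score(y[out_of_bag], model.predict(X[out_of_bag])))
print(np.mean(scores), np.std(scores))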

Why do we split the dataset into training and testing for evaluating accuracy?
51

Why do we split the dataset into training and testing for evaluating accuracy?
Why do we need to split? Overfitting: a model evaluated on the data it was trained on can look far more accurate than it will be on new data.
52

Decision tree: training and testing
• Divide data into:
• Training set (e.g. 2/3)
• Testing set (e.g. 1/3)
• Learn decision tree using the training set
• Evaluate performance of decision tree on the testing set
53
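A minimal hold-out sketch on the Iris data (the 2/3 : 1/3 split follows the slide; the fixed random_state is just for reproducibility):

# Learn a decision tree on 2/3 of the data, evaluate accuracy on the other 1/3.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
tree = DecisionTreeClassifier().fit(X_train, y_train)
print(tree.score(X_test, y_test))   # accuracy on the held-out test set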

Common Splitting Strategies
k-fold cross-validation
[Diagram: the dataset is split into k folds; each fold is used once as the test set while the remaining k−1 folds form the training set]
54

Common Splitting Strategies
Leave-one-out (n-fold cross validation)
[Diagram: each single object is used once as the test set while all remaining objects form the training set]
55
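Both splitting strategies are available directly in scikit-learn (a sketch; the decision tree and k = 10 are illustrative choices):

# k-fold and leave-one-out cross-validation on the Iris data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
kfold_scores = cross_val_score(DecisionTreeClassifier(), X, y,
                               cv=KFold(n_splits=10, shuffle=True, random_state=0))
loo_scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(np.mean(kfold_scores))   # mean accuracy over 10 folds
print(np.mean(loo_scores))     # mean accuracy over n = 150 leave-one-out folds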

Computational complexity
• k-fold cross validation requires
• k training steps on n(k-1)/k data points
• k testing steps on n/k data points
• We typically also repeat the above 10-100 times to reduce any bias from random number choices
• Report mean results and perform statistical comparisons, e.g. T-tests
56

Points to remember
• Understand the difference between supervised and unsupervised algorithms
• Understand the principles of experimental design: Feature Filtering, Dimensionality Reduction, Performance Evaluation
• Go from average to great data analysis
57