COMP20008 Elements of Data Processing Experimental Design
Semester 1, 2020
Contact: uwe.aickelin@unimelb.edu.au
Where are we now?
2
Supervised vs Unsupervised Learning
3
Experimental Design
• Feature Selection
• Dimensionality Reduction
• Performance Evaluation
• Real-world Example – Medical Datamining
4
Feature Selection
5
Feature selection for machine learning
• A series of data wrangling steps results in a set of features
• Features are inputs to a model
• Choose attributes suitable for classifying the data according to the model
• Train the model
• Classify the data
• Evaluate the results
Good feature sets for machine learning
• Throw everything we can think of at the model and let the algorithm decide?
• Good features give better model performance
• Choose features suitable for classifying the data according to the model
• Inspection
• Intuition
• Neither may be possible in practice
• Feature Selection Methods
Feature selection 1 – wrapper
“Wrapper” method:
Choose subset of attributes that give best performance on the data
https://www.analyticsvidhya.com/
Feature selection 1 – wrapper
Example: classification of Iris data, train and evaluate a model on each subset of attributes:
– {petal length}; {petal width}; {sepal width}; {sepal length}
– {petal length, petal width}; {petal length, sepal length}; …
– {petal length, petal width, sepal length}; {petal length, petal width, sepal width}; …
– {petal length, petal width, sepal length, sepal width}
• Choose the subset giving the best performance
• Disadvantage: time consuming – with N features there are 2^N − 1 non-empty subsets to train and evaluate
• Only practical for very small feature sets
Practical wrapper approach – Greedy
“Greedy” approach:
• Train and evaluate the model on each single feature
• Choose the best feature
• Repeat until performance stops increasing:
  • Train and evaluate the model on the best feature(s) plus each remaining single feature
  • Add the best feature to the best feature set (see the sketch below)
Practical wrapper approach – Greedy
“Greedy” approach:
• Assumes independence of features
• Quicker than the naïve wrapper approach
• May converge to a sub-optimal (and can be very bad) solution
Practical wrapper approach – Decremental
“Decremental” approach:
• Start with all features
• Remove one feature, train and evaluate model
• Repeat until performance degrades too much:
  • From the remaining features, remove one at a time, train and evaluate the model
  • Remove the feature that causes the least performance degradation (see the sketch below)
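For reference, recent versions of scikit-learn (0.24+) provide a generic wrapper, `SequentialFeatureSelector`, that supports both the greedy (forward) and decremental (backward) strategies. Note it removes features down to a chosen target number rather than stopping on a performance threshold; the Iris data, decision tree and target of 2 features below are illustrative assumptions.

```python
# Backward (decremental) wrapper selection using scikit-learn's built-in helper
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,      # keep 2 of the 4 features
    direction="backward",        # remove the least useful feature at each step
    cv=5,
)
selector.fit(X, y)
print("Kept features:", selector.get_support())
```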
Practical wrapper approach – Decremental
“Decremental” approach:
• Assumes independence of features
• Mostly removes irrelevant features
• Slower than the greedy approach when there are many features
Feature selection 2 – filtering
Intuition: possible to evaluate “goodness” of each feature, separate from other features
• Consider each feature separately: linear time in the number of attributes
• Typically the most popular strategy
Possible (but difficult) to control for inter-dependence of features
Feature filtering
What makes a single feature good?
• Well correlated with class (ratio, interval, ordinal)
• A simple example: which of a1, a2 is a good feature?
• Not independent of class (ordinal, nominal)
Feature filtering – Mutual information
What makes a single feature good?
• Well correlated with class
Mutual Information:
• $MI(X, Y) = H(Y) - H(Y \mid X)$
• $MI(c, a_1) = H(c) - H(c \mid a_1) = 1 - 0 = 1$
• $H(c) = -\sum_{i \in \{Y,N\}} p_i \log_2 p_i = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1$
• $H(c \mid a_1) = \sum_{x \in \{Y,N\}} p(a_1{=}x)\, H(c \mid a_1{=}x) = \frac{1}{2}\times 0 + \frac{1}{2}\times 0 = 0$
• High information gain: a1 can strongly predict c; keep a1 in the feature set
Feature filtering – Mutual information
What makes a single feature good?
• Well correlated with class
Mutual Information:
• $MI(X, Y) = H(Y) - H(Y \mid X)$
• $MI(c, a_2) = H(c) - H(c \mid a_2) = 1 - 1 = 0$
• $H(c) = -\sum_{i \in \{Y,N\}} p_i \log_2 p_i = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1$
• $H(c \mid a_2) = \sum_{x \in \{Y,N\}} p(a_2{=}x)\, H(c \mid a_2{=}x) = \frac{1}{2}\times 1 + \frac{1}{2}\times 1 = 1$
• No information gain: a2 cannot predict c; remove a2 from the feature set
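To make the arithmetic above concrete, here is a small hand-rolled entropy/MI calculation in base 2. The four-row values for c, a1 and a2 are assumed to match the worked numbers (a1 agrees with c exactly, a2 is uninformative); they are illustrative rather than taken from the original example table.

```python
# Mutual information MI(c, a) = H(c) - H(c|a), computed in bits
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a 1-D array of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_information(c, a):
    """MI(c, a) = H(c) - sum_x p(a=x) * H(c | a=x)."""
    h_c_given_a = 0.0
    for x in np.unique(a):
        mask = (a == x)
        h_c_given_a += mask.mean() * entropy(c[mask])
    return entropy(c) - h_c_given_a

c  = np.array(["Y", "Y", "N", "N"])   # class
a1 = np.array(["Y", "Y", "N", "N"])   # perfectly predicts c  -> MI = 1 bit
a2 = np.array(["Y", "N", "Y", "N"])   # independent of c      -> MI = 0 bits
print(mutual_information(c, a1), mutual_information(c, a2))  # 1.0 0.0
```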
Feature filtering – Chi-square χ2
What makes a single feature good?
• Not independent of class
• Remove a feature if it is independent of the output class
Chi-square test is a statistical hypothesis test for quantifying the independence of pairs of categorical (ordinal, nominal) variables.
• H0: two variables are independent
• Takes into account sample size & has a significance test (MI does not)
Feature filtering – Chi-square χ2
• Summarise frequencies of feature and the class in a contingency table
• Is there a difference between observed and expected frequencies?
Observed frequencies of a1 and c, e.g. $O(c{=}Y, a_1{=}Y) = 2$
Chi-square χ2 – Null Hypothesis
H0: the two variables are independent: $P(x \cap y) = P(x) \times P(y)$
• Find the expected frequencies for all $(a_1, c)$ pairs
• $P(c{=}Y \text{ and } a_1{=}Y) = \frac{2}{4} \times \frac{2}{4} = \frac{1}{4}$; expected value $E_{(c=Y,\,a_1=Y)} = 4 \times \frac{1}{4} = 1$
• $P(c{=}Y \text{ and } a_1{=}N) = \frac{1}{4}$; expected value $= 1$
• $P(c{=}N \text{ and } a_1{=}Y) = \frac{1}{4}$; expected value $= 1$
• $P(c{=}N \text{ and } a_1{=}N) = \frac{1}{4}$; expected value $= 1$
Chi-square χ2 – calculate for a1
• Calculate the difference between observed and expected values:
• $\chi^2(a_1, c) = \sum_{i \in \{c=Y,\,c=N\}} \sum_{j \in \{a_1=Y,\,a_1=N\}} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} = \frac{(2-1)^2}{1} + \frac{(0-1)^2}{1} + \frac{(0-1)^2}{1} + \frac{(2-1)^2}{1} = 4$

Contingency table (observed):          Contingency table (expected):
         a1=Y   a1=N                            a1=Y   a1=N
  c=Y      2      0                      c=Y      1      1
  c=N      0      2                      c=N      1      1
Chi-square χ2 – test for a1
The chi-square statistic for $(a_1, c)$ is: $\chi^2(a_1, c) = 4$
Are $a_1$ and $c$ independent? (the null hypothesis H0)
• With 95% confidence (alpha = 0.05)
• Degrees of freedom $= (2-1) \times (2-1) = 1$ ($a_1$ has 2 values, $c$ has 2 values)
• The chi-square critical value is 3.84 (look up this value)
• Since 4 > 3.84, reject H0
• $a_1$ is NOT independent of $c$
• $a_1$ is part of the feature set
Chi-square test in Python
from scipy.stats import chi2
Will practise in workshop
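A minimal sketch of the worked example using `scipy.stats.chi2_contingency`; note that for 2×2 tables SciPy applies Yates' continuity correction by default, so `correction=False` is needed here to reproduce the hand calculation of 4.0. The exact workshop code may differ.

```python
# Chi-square test of independence on the observed contingency table for (a1, c)
from scipy.stats import chi2, chi2_contingency

observed = [[2, 0],   # c=Y:  a1=Y, a1=N
            [0, 2]]   # c=N:  a1=Y, a1=N

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, dof, p_value)   # 4.0, 1 degree of freedom, p ~ 0.0455
print(expected)                  # all expected counts are 1

# Compare against the critical value at alpha = 0.05 with 1 degree of freedom
print(chi2.ppf(0.95, df=1))      # ~ 3.84 -> 4.0 > 3.84, so reject H0
```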
Feature Selection 3 – embedded
“Embedded” methods:
• Some models perform feature selection as part of the algorithm
• Linear classifiers, e.g. regression
• To some degree: Decision Trees
• Often still benefit from other feature selection approaches
Feature Selection – potential issues
Difficult to control for inter-dependence of features
• Feature filtering of single features may remove important features. For example, where the class is the XOR of some features:
  • Given all the features, the class is totally predictable.
  • Given only one of them, the MI or chi-square statistic is 0 (see the sketch after the table below).
• In practice, feature engineering is also used, i.e. constructing new features out of existing ones, e.g. the ratio of or difference between features
Income   Expenditure   i>e   Credit
120      100           1     Good
50       30            1     Good
50       70            0     Bad
200      40            1     Good
200      210           0     Bad
…        …             …     …
160      150           1     Good
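A small illustration of the XOR point above, assuming two synthetic binary features whose XOR defines the class; each feature on its own carries (essentially) zero mutual information with the class, even though together they determine it completely. `sklearn.metrics.mutual_info_score` reports MI in nats rather than bits, but zero still means "no information".

```python
# XOR example: each feature alone has zero MI with the class, but jointly they determine it
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
f1 = rng.integers(0, 2, size=1000)
f2 = rng.integers(0, 2, size=1000)
cls = f1 ^ f2                                   # class is the XOR of the two features

print(mutual_info_score(cls, f1))               # ~0: f1 alone says nothing about the class
print(mutual_info_score(cls, f2))               # ~0: f2 alone says nothing about the class
print(mutual_info_score(cls, 2 * f1 + f2))      # ~0.69 nats (= 1 bit): together they predict it
```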
Dimensionality Reduction
26
Dimensionality reduction
Purpose:
• Avoid curse of dimensionality
• Reduce the amount of time and memory required by algorithms
• Allow data to be more easily visualised
• May help to eliminate irrelevant features or reduce noise
• Fewer features → smaller machine learning models → faster answer
• Better feature set by transformation → more accurate answer
Motivation: High dimensional data
• The curse of dimensionality: “Data analysis techniques which work well at lower dimensions (fewer features), often perform poorly as the dimensionality of the analysed data increases (lots of features)”
• As dimensionality increases, data becomes increasingly sparse and all the distances between pairs of points begin to look the same. Impacts any algorithm that is based on distances between objects.
Dimensionality reduction
Input: a dataset with 𝑁𝑁 features and 𝐾𝐾 objects
Output: a transformed dataset with n ≪ 𝑁𝑁 features and 𝐾𝐾 objects
• If n = 2 or 3, the transformed dataset can be easily visualised. Example: n = 2

Original dataset:
Object id   Feature1   Feature2   …   FeatureN
1           …          …          …   …
…           …          …          …   …
K           …          …          …   …

        ↓ Transformation

Transformed dataset:
Object id   NewFeatureA   NewFeatureB
1           …             …
…           …             …
K           …             …
Transforming from N to 𝑛𝑛 ≪ 𝑁𝑁 dimensions
• The transformation should preserve the characteristics of the data,
e.g. distances between pairs of objects
• If a pair of objects is close before the transformation, they should still
be close after the transformation
• If a pair of objects is far apart before the transformation, they should
still be far apart after the transformation
• The set of nearest neighbors of an object before the transformation should ideally be the same after the transformation
Question
• Suppose we are given a dataset with the following N features, describing individuals in this class. Which two features would you select to represent people, in such a way that “distances” between pairs of people are likely to be preserved in the reduced dataset?
• Input: N=7 features
1. Weighted average mark (WAM)
2. Age (years)
3. Height (cm)
4. Weight (kg)
5. Number of pets owned
6. Number of subjects passed so far
7. Amount of sleep last night (0=little, 1=medium, 2=a lot)
• Output: Select 2 of the above features
Dimensionality reduction
• Basic method: To reduce dimensionality perform feature selection.
• Scatter plots for the Iris dataset shown earlier – 2D visualisations of a 4D dataset. 2 features were selected from the original 4.
• In general, when transforming a dataset from 𝑁𝑁 to 𝑛𝑛 (𝑛𝑛 ≪ 𝑁𝑁) features
• The output 𝑛𝑛 features do not need to be a subset of the input 𝑁𝑁 features. Rather, they can be new features whose values are constructed using some function applied to the input 𝑁𝑁 features
Principal components analysis (PCA)
• Find a new set of features that better captures the variability of the data
• First dimension chosen to capture as much of the variability as possible;
• The second dimension is orthogonal to the first, and subject to that constraint, captures as much of the remaining variability as possible,
• The third dimension is orthogonal to the first and second, and subject to that constraint, captures as much of the remaining variability as possible,
• Etc. Each new dimension is a linear weighted combination of the original features (feature engineering).
• We will not be covering the mathematical details
• Many tutorials available on the web. Nice application of linear algebra.
• A good visualisation: http://setosa.io/ev/principal-component-analysis
Iris dataset: Reduced from 4 to 2 dimensions using PCA
Contrast PCA against a scatter matrix
Scatter plots for iris dataset
AFL football dataset: From http://afltables.com/afl/stats/
AFL football dataset: PCA in 2D
Principal components analysis in Python
sklearn.decomposition
Will practise in workshop
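A minimal sketch of the Iris reduction shown above using `sklearn.decomposition.PCA`. Standardising the features first is a common preprocessing choice assumed here, not something prescribed by the slides, and the workshop code may differ.

```python
# Reduce the 4-dimensional Iris data to 2 principal components for plotting
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # give each feature zero mean, unit variance

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("Variance explained:", pca.explained_variance_ratio_)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()
```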
Performance Evaluation
39
Performance Evaluation
• Metrics for Performance Evaluation • Training versus Testing Data
• Sampling and Cross-Validation
40
Unsupervised algorithm evaluation
How would you evaluate the performance of a clustering algorithm, e.g. k-means?
1. If a clustering ‘looks good’?
2. Compare to some ‘known points’?
Thoughts?
41
Supervised algorithm Evaluation
For an accurate decision tree classifier, we want to minimise both:
• False positives (saying yes when we should say no)
• False negatives (saying no when we should say yes)
Thoughts?
42
Metrics for Performance Evaluation
For supervised algorithms – can be summarized in a Confusion Matrix (contingency table)
• Actual class: {yes, no, yes, yes, …}
• Predicted class: {no, yes, yes, no, …}

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
43
Metrics for Performance Evaluation
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

$\text{Accuracy} = \frac{a+d}{a+b+c+d} = \frac{TP+TN}{TP+TN+FP+FN}$
44
Metrics for Performance Evaluation
• Actual class: {yes, no, yes, yes, no, yes, no, no}
• Predicted: {no, yes, yes, no, yes, no, no, yes}

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes    a=1 (TP)    b=3 (FN)
CLASS     Class=No     c=3 (FP)    d=1 (TN)
Accuracy=?
45
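The counts for this example can be checked with scikit-learn; a minimal sketch, with the accuracy working out to (1 + 1) / 8 = 0.25:

```python
# Confusion matrix and accuracy for the 8-example illustration above
from sklearn.metrics import accuracy_score, confusion_matrix

actual    = ["yes", "no", "yes", "yes", "no", "yes", "no", "no"]
predicted = ["no", "yes", "yes", "no", "yes", "no", "no", "yes"]

# labels=["yes", "no"] makes the rows/columns match the slide layout:
# row = actual class, column = predicted class
print(confusion_matrix(actual, predicted, labels=["yes", "no"]))
# [[1 3]     a=1 (TP), b=3 (FN)
#  [3 1]]    c=3 (FP), d=1 (TN)

print(accuracy_score(actual, predicted))   # (1 + 1) / 8 = 0.25
```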
Limitations of accuracy
• Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10
• If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %
• Accuracy is misleading because we do not detect any class 1 example
• Other metrics can be used instead of accuracy to address this problem
46
Imbalanced data – Bootstrapping
Procedure:
1. Draw k datasets from the original data by sampling with replacement
2. Run the algorithm for each dataset and compute accuracy
3. Return the mean and standard deviation of the accuracies
• Can estimate confidence in a prediction y=f(x)
• Can handle imbalanced data sets – biased sampling.
47
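A minimal sketch of the bootstrap procedure, assuming a scikit-learn classifier and the Iris data. Evaluating each model on its "out-of-bag" objects (those not drawn in that resample) is one common choice made here for illustration; the slide itself does not prescribe how each accuracy is measured.

```python
# Bootstrap estimate of accuracy: k resampled datasets, mean and standard deviation
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
k, n = 100, len(y)
accuracies = []

for i in range(k):
    # 1. Draw a dataset of size n by sampling with replacement
    idx = resample(np.arange(n), replace=True, n_samples=n, random_state=i)
    # 2. Run the algorithm and compute accuracy on the out-of-bag objects
    oob = np.setdiff1d(np.arange(n), idx)
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    accuracies.append(model.score(X[oob], y[oob]))

# 3. Return the mean and standard deviation of the accuracies
print(np.mean(accuracies), np.std(accuracies))
```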
Why do we split the dataset into training and testing for evaluating accuracy?
48
Why do we split the dataset into training and testing for evaluating accuracy?
Why do we need to split? Overfitting
49
Decision tree: training and testing
• Divide data into:
• Training set (e.g. 2/3)
• Testing set (e.g. 1/3)
• Learn decision tree using the training set
• Evaluate performance of decision tree on the testing set
50
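A minimal sketch of this split-then-evaluate workflow with scikit-learn; the 2/3 : 1/3 proportions follow the slide, while the dataset, stratification and random seed are illustrative assumptions.

```python
# Hold-out evaluation: learn a decision tree on 2/3 of the data, test on the remaining 1/3
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```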
Common Splitting Strategies
k-fold cross-validation
[Figure: the dataset is divided into k folds; each fold serves as the Test set once while the remaining folds form the Train set]
51
Common Splitting Strategies
Leave-one-out (n-fold cross validation)
[Figure: each single object serves as the Test set once while the remaining objects form the Train set]
52
Computational complexity
• k-fold cross validation requires
• k training steps on n(k-1)/k data points
• k testing steps on n/k data points
• We typically also repeat the above 10-100 times to reduce any bias from random number choices
• Report mean results and perform statistical comparisons, e.g. T-tests
53
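A minimal sketch of repeated k-fold cross-validation with scikit-learn; `RepeatedStratifiedKFold` covers both the k folds and the repetitions mentioned above, and the particular values (k=10, 10 repeats, Iris, decision tree) are illustrative assumptions.

```python
# Repeated k-fold cross-validation: k training/testing steps, repeated to reduce random bias
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)  # k=10, repeated 10 times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Report the mean result (and its spread) across all 100 train/test runs
print(scores.mean(), scores.std())
```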
2017 exam question 3f
• (2 marks) What is the purpose of separating a dataset into training and test sets, when evaluating the performance of a classifier?
54
Points to remember
• Understand the difference between supervised and unsupervised algorithms
• Understand the principles of experimental design: Feature Filtering, Dimensionality Reduction, Performance Evaluation
• Go from average to great data analysis
55
Cutting-edge research: Semi-Supervised Learning
JM Reps, JM Garibaldi, U Aickelin, D Soria, JE Gibson, RB Hubbard: “A Novel Semi- Supervised Algorithm for Rare Prescription Side Effect Discovery (DRESS)”, IEEE Journal of Biomedical and Health Informatics 18 (2), 537-547, 2014.
Can we automate the detection of adverse prescription side effects by utilising large databases of electronic health records?
56
Motivation
• Help patients
• Existing methods (unsupervised) have a high false positive rate
• Supervised methods may have better performance – but labels are rare
• If we use knowledge of known data, can we generate ‘a few more’ labels?
• Semi-supervised methods?!
57
DRESS algorithm overview (1)
• The number of known Adverse Drug reactions for a specific drug is limited.
• When labels are limited, semi- supervised methods may be suitable.
• Semi-supervised clustering incorporates the limited additional knowledge to bias the clustering process
Semi-Supervised Algorithms
58
DRESS algorithm overview (2)
Step 1: Generate attributes (feature selection)
Step 2: Extract labels (special feature engineering)
Step 3: Metric learning (dimensionality reduction)
Step 4: Semi-supervised K-means (clustering)
Step 5: Filter and Rank (performance evaluation)
59
DRESS algorithm 1- Generate Attributes
Need to generate attributes that create a space where Adverse Drug Reactions are separable from non-Adverse Drug Reactions
60
DRESS algorithm 1 – Generate attributes
• The attributes are based on existing methods
• The Before/After ratios 1: how often the medical event occurs during the month before the drug compared to the month after
• The Before/After ratios 2: a comparison between how often the drug occurs before the medical event or after
61
DRESS algorithm 1 – Generate Attributes
Before and after ratio features: • month after / month before
• year after / year before
• same day / year before
• month after and not in month before / month after
• month after / month before (first 3 Read code elements)
• month after / month before (first 2 Read code elements)
(A Read code is a UK hierarchical classification code of a medical event)
62
DRESS algorithm 1 – Generate Attributes
What the data looks like at this point:
63
DRESS algorithm 2 – extract labels
Labels
64
DRESS algorithm 2 – extract labels
Extract known Adverse Drug Reactions by mining the web. Search for medical events listed as Adverse Drug Reactions to the drug of interest on medical websites (e.g. bnf.org, patient.com)
• Do a string match to find the corresponding Read Codes
• Validate
65
DRESS algorithm 2 – extract labels
Extract indicators by mining the web. Search for medical events listed as indications (cause of taking the drug) to the drug of interest on medical websites (e.g. bnf.org, patient.com)
• Do a string match to find the corresponding Read Codes
• Validate
Indication: Diabetes
66
DRESS algorithm 2 – extract labels
Extract noise events using the hierarchical Read Code structure. Find Read Code branches via a manual search that are highly unlikely to correspond to acute side effects (e.g. cancer, chronic illnesses, family history, …)
67
DRESS algorithm 2 – extract labels
What the data look like at this stage:
68
DRESS algorithm 3 – Metric learning
Use the attributes and labels generated previously to find a suitable metric that transforms the attribute space into one that optimises the class separation.
Ref: Y. Ying and P. Li, “Distance metric learning with eigenvalue optimization”, J. Mach. Learn. Res., vol. 13, pp. 1–26, 2012.
69
DRESS algorithm 3 – Metric Learning
Before Metric Learning
70
DRESS algorithm 3 – Metric Learning
After Metric Learning
71
DRESS algorithm 4 – Semi-supervised K-Means
• In the transformed space apply constrained k-means*
• The seeds are the known label sets (seed 1: Adverse drug reactions, seed 2: indications and seed 3: noise)
• The initial k-means centers are created using the labeled data averages for each seed
• K-means then assigns the unlabeled data point to the closest cluster and the labeled points remain fixed
*S. Basu, A. Banerjee and R. J. Mooney, “Semi-supervised clustering by seeding”, Proc. 19th Int. Conf. Mach. Learn., vol. 2, pp. 27–34, 2002.
72
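A simplified sketch of the seeding idea behind constrained k-means, assuming the labelled seed index sets and a transformed feature matrix are already available. The cluster centres are initialised from the labelled averages; the full algorithm of Basu et al. used in DRESS additionally keeps the labelled points fixed during the assignment step, which plain `KMeans` does not do, so this is only an approximation of the method.

```python
# Seeded k-means: initialise one cluster centre per seed set from the labelled averages
import numpy as np
from sklearn.cluster import KMeans

def seeded_kmeans(X, seed_indices):
    """X: transformed feature matrix; seed_indices: list of index arrays,
    one per seed set (e.g. known ADRs, indications, noise)."""
    centres = np.array([X[idx].mean(axis=0) for idx in seed_indices])
    km = KMeans(n_clusters=len(seed_indices), init=centres, n_init=1)
    return km.fit_predict(X)   # unlabelled points are assigned to the closest cluster

# Illustrative usage with random data and three small hypothetical seed sets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
labels = seeded_kmeans(X, [np.arange(0, 5), np.arange(5, 10), np.arange(10, 15)])
print(np.bincount(labels))
```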
DRESS algorithm 4 – Semi-supervised K-Means
Step 4: Apply Semi-supervised K-means
73
DRESS algorithm 4 – Semi-supervised K-Means
What the data look like at this stage:
74
DRESS algorithm 5 – Filter and Rank
• Filter Read codes that cannot be immediately occurring ADRs (cancers, admin, …)
• Filter Read codes in the indicator cluster
75
DRESS algorithm 5 – Filter and Rank
Order by: (month after and not month before) / (month before) × (1/weight), where weight = 1 for Read codes in the ADR cluster and weight = 3 for Read codes in the noise cluster
76
DRESS algorithm 5 – Filter and Rank
Final example of data:
77
DRESS – RESULTS
78
Acknowledgements
• Materials are partially adapted from …
• Previous COMP2008 slides including material produced by James Bailey,
Pauline Lin, Chris Ewin, Uwe Aickelin and others
• “Data Mining Concepts and Techniques”, Han et al, 2nd edition 2006.
• “Introduction to Data Mining”, Tan et al 2005.
• https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
• http://people.duke.edu/~kh269/teaching/b351/20evaluatinglearning.pptx
79