Where are we now?
[Figure (from Kandel et al.): The iterative process of wrangling and analysis. One or more initial data sets may be used, and new versions may arrive later. The wrangling and analysis phases overlap: wrangling tools tend to be separate from visual analysis tools, but an ideal system would integrate the two. Much wrangling may be needed before the data can be loaded into visualisation and analysis tools, which typically reveals new problems immediately; wrangling may take place at all stages of analysis, as users sort out interesting insights from dirty data or as new data become available.]
Supervised vs Unsupervised Learning
Experimental Design
• Feature Selection
• Dimensionality Reduction
• Performance Evaluation
• Real-world Example – Medical Datamining
Feature Selection
Feature selection for machine learning
• A series of data wrangling steps results in a set of features
• Features are inputs to a model
• Choose attributes suitable for classifying the data according to the model
• Train the model
• Classify the data
• Evaluate the results
Good feature sets for machine learning
• Throw everything we can think of at the model and let the algorithm decide?
• Good features give better model performance
• Choose features suitable for classifying the data according to the model
• Inspection
• Intuition
• Neither may be possible in practice
• Feature Selection Methods
Feature selection 1 – wrapper
“Wrapper” method:
Choose subset of attributes that give best performance on the data
https://www.analyticsvidhya.com/
Feature selection 1 – wrapper
Example: classification of Iris data, train and evaluate a model on each subset of attributes:
– {petal length}; {petal width}; {sepal width}; {sepal length}
– {petal length, petal width}; {petal length, sepal length}; …
– {petal length, petal width, sepal length}; {petal length, petal width, sepal width}; …
– {petal length, petal width, sepal length, sepal width}
• Choose the subset giving the best performance
• Disadvantages: time consuming (with N features there are 2^N − 1 non-empty subsets to train and evaluate)
• Only practical for very small feature sets
Practical wrapper approach – Greedy
“Greedy” approach:
• Train and evaluate the model on each single feature
• Choose the best feature
• Train and evaluate the model on the best feature(s) plus each remaining single feature
• Add the best feature to the best feature set
• Repeat until performance stops increasing
Practical wrapper approach – Greedy
“Greedy” approach:
• Assumes independence of features
• Quicker than the naïve wrapper approach
• May converge to a sub-optimal (and sometimes very bad) solution
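The greedy forward-selection loop above can be sketched in Python. A minimal illustration: scikit-learn, a k-NN classifier, and 5-fold cross-validation are assumed choices here, not part of the slides.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
all_features = list(range(X.shape[1]))

def score(features):
    # Mean cross-validated accuracy using only the given feature columns
    return cross_val_score(KNeighborsClassifier(), X[:, features], y, cv=5).mean()

selected, best_score = [], 0.0
while True:
    candidates = [f for f in all_features if f not in selected]
    if not candidates:
        break
    # Train and evaluate on the current best set plus each remaining feature
    scores = {f: score(selected + [f]) for f in candidates}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:
        break  # performance stopped increasing
    selected.append(f_best)  # add the best feature to the best feature set
    best_score = scores[f_best]

print(selected, round(best_score, 3))
```

Because each round only considers one feature at a time, the loop trains O(N²) models in the worst case rather than the 2^N − 1 of the exhaustive wrapper.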
Practical wrapper approach – Decremental
“Decremental” approach:
• Start with all features
• Repeat until performance degrades too much:
• From the remaining features, remove one at a time; train and evaluate the model
• Remove the feature that causes the least performance degradation
Practical wrapper approach – Decremental
“Decremental” approach:
• Assumes independence of features
• Mostly removes irrelevant features
• Slower than the greedy (incremental) approach when there are many features
Feature selection 2 – filtering
Intuition: possible to evaluate “goodness” of each feature, separate from other features
• Consider each feature separately: linear time in the number of attributes
• Typically the most popular strategy
Possible (but difficult) to control for inter-dependence of features
Feature filtering
What makes a single feature good?
• Well correlated with class (ratio, interval, ordinal)
• Not independent of class (ordinal, nominal)
• A simple example: which of a1, a2 is a good feature?
Feature filtering – Mutual information
What makes a single feature good?
• Well correlated with class → Mutual Information:
• MI(X, Y) = H(Y) − H(Y | X)
• MI(c, a1) = H(c) − H(c | a1) = 1 − 0 = 1
• H(c) = −Σ_{i ∈ {Y,N}} p_i log₂ p_i = −(½ log₂ ½ + ½ log₂ ½) = 1
• H(c | a1) = Σ_{x ∈ {Y,N}} p(a1 = x) H(c | a1 = x) = ½ × 0 + ½ × 0 = 0
• High information gain: a1 can strongly predict c; keep a1 in the feature set
Feature filtering – Mutual information
What makes a single feature good?
• Well correlated with class → Mutual Information:
• MI(X, Y) = H(Y) − H(Y | X)
• MI(c, a2) = H(c) − H(c | a2) = 1 − 1 = 0
• H(c) = −Σ_{i ∈ {Y,N}} p_i log₂ p_i = −(½ log₂ ½ + ½ log₂ ½) = 1
• H(c | a2) = Σ_{x ∈ {Y,N}} p(a2 = x) H(c | a2 = x) = ½ × 1 + ½ × 1 = 1
• No information gain: a2 cannot predict c; remove a2 from the feature set
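The two worked examples can be reproduced directly from the definitions. The 4-row toy data below is an assumed reconstruction of the slide's example: a1 matches the class exactly, a2 is independent of it.

```python
from math import log2
from collections import Counter

def H(labels):
    # Shannon entropy in bits
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def MI(x, y):
    # MI(X, Y) = H(Y) - H(Y|X)
    n = len(x)
    h_cond = 0.0
    for v in set(x):
        ys = [yi for xi, yi in zip(x, y) if xi == v]
        h_cond += (len(ys) / n) * H(ys)
    return H(y) - h_cond

c  = ["Y", "Y", "N", "N"]   # class
a1 = ["Y", "Y", "N", "N"]   # perfectly predicts c
a2 = ["Y", "N", "Y", "N"]   # independent of c

print(MI(a1, c))  # 1.0 bit: keep a1
print(MI(a2, c))  # 0.0 bits: remove a2
```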
Feature filtering – Chi-square χ²
What makes a single feature good?
• Not independent of class
• Remove a feature if it is independent of the output class
Chi-square test is a statistical hypothesis test for quantifying the independence of pairs of categorical (ordinal, nominal) variables.
• H0: two variables are independent
• Takes into account sample size & has a significance test (MI does not)
Feature filtering – Chi-square χ²
• Summarise frequencies of feature and the class in a contingency table
• Is there a difference between observed and expected frequencies?
• Observed frequencies of a1 and c, e.g. O(c=Y, a1=Y) = 2
Chi-square χ² – Null Hypothesis
H0: the two variables are independent: P(x ∩ y) = P(x) × P(y)
• Find the expected frequencies for all (a1, c) pairs:
• P(c=Y and a1=Y) = ½ × ½ = ¼; expected value E(c=Y, a1=Y) = 4 × ¼ = 1
• P(c=Y and a1=N) = ¼; expected value = 1
• P(c=N and a1=Y) = ¼; expected value = 1
• P(c=N and a1=N) = ¼; expected value = 1
Chi-square χ² – calculate for a1
• Calculate the difference between observed and expected values:
• χ²(a1, c) = Σ_{x ∈ {Y,N}} Σ_{y ∈ {Y,N}} (O_{x,y} − E_{x,y})² / E_{x,y}
• χ²(a1, c) = (2−1)²/1 + (0−1)²/1 + (0−1)²/1 + (2−1)²/1 = 4

Contingency table (observed), rows c = Y/N, columns a1 = Y/N: [2, 0; 0, 2]
Contingency table (expected): [1, 1; 1, 1]
Chi-square χ² – test for a1
The chi-square statistic for (a1, c) is χ²(a1, c) = 4
Are 𝑎1 and 𝑐 independent? (the null hypothesis H0)
• With 95% confidence (alpha = 0.05)
• Degrees of freedom = (2 − 1) × (2 − 1) = 1 (a1 has 2 values, c has 2 values)
• The chi-square critical value is 3.84 (look this value up in a table)
• Reject H0
• a1 is NOT independent of c
• a1 is part of the feature set
Chi-square test in Python
from scipy.stats import chi2              # critical values, e.g. chi2.ppf(0.95, dof)
from scipy.stats import chi2_contingency  # runs the full test on a contingency table
Will practise in workshop
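As a preview of the workshop exercise, the worked example above can be checked with SciPy. The 2×2 observed table is taken from the slides; `correction=False` disables Yates' continuity correction so the statistic matches the hand calculation.

```python
from scipy.stats import chi2, chi2_contingency

observed = [[2, 0], [0, 2]]  # rows: c = Y/N, columns: a1 = Y/N
stat, p, dof, expected = chi2_contingency(observed, correction=False)

print(stat)                        # 4.0, as calculated on the slide
print(dof)                         # (2-1) x (2-1) = 1
print(chi2.ppf(0.95, dof))         # critical value, about 3.84
print(stat > chi2.ppf(0.95, dof))  # True: reject H0, keep a1
```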
Feature Selection 3 – embedded
“Embedded” methods:
• Some models perform feature selection as part of the algorithm
• Linear classifiers, e.g. regression
• To some degree: Decision Trees
• Often still benefit from other feature selection approaches
Feature Selection – potential issues
Difficult to control for inter-dependence of features
• Feature filtering of single features may remove important features. For example, where the class is the XOR of some features:
• Given all the features, the class is totally predictable
• Given only one of them, the MI or chi-square statistic is 0
• In practice, feature engineering is also used, i.e. constructing new features out of existing ones, e.g. the ratio of, or difference between, features
Income | Expenditure | i>e | Credit
120    | 100         | 1   | Good
50     | 30          | 1   | Good
50     | 70          | 0   | Bad
200    | 40          | 1   | Good
200    | 210         | 0   | Bad
…      | …           | …   | …
160    | 150         | 1   | Good
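The XOR point above can be checked numerically: each input alone has zero mutual information with the class, while the two inputs taken together determine it completely. The toy data and helper functions below are illustrative, using the MI definition from the earlier slides (entropy in bits).

```python
from math import log2
from collections import Counter

def H(labels):
    # Shannon entropy in bits
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def MI(x, y):
    # MI(X, Y) = H(Y) - H(Y|X)
    n = len(x)
    h_cond = 0.0
    for v in set(x):
        ys = [yi for xi, yi in zip(x, y) if xi == v]
        h_cond += (len(ys) / n) * H(ys)
    return H(y) - h_cond

f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
cls = [a ^ b for a, b in zip(f1, f2)]  # class = XOR of the two features

print(MI(f1, cls))                 # 0.0: f1 alone looks useless to a filter
print(MI(f2, cls))                 # 0.0: f2 alone looks useless to a filter
print(MI(list(zip(f1, f2)), cls))  # 1.0: together they fully predict the class
```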
Dimensionality Reduction
Dimensionality reduction
Purpose:
• Avoid the curse of dimensionality
• Reduce the amount of time and memory required by algorithms
• Allow data to be more easily visualised
• May help to eliminate irrelevant features or reduce noise
• Fewer features → smaller machine learning models → faster answer
• Better feature set by transformation → more accurate answer
Motivation: High dimensional data
• The curse of dimensionality: “Data analysis techniques which work well at lower dimensions (fewer features), often perform poorly as the dimensionality of the analysed data increases (lots of features)”
• As dimensionality increases, data becomes increasingly sparse and all the distances between pairs of points begin to look the same. Impacts any algorithm that is based on distances between objects.
Dimensionality reduction
Input: a dataset with 𝑁 features and 𝐾 objects
Output: a transformed dataset with n ≪ 𝑁 features and 𝐾 objects
• If n = 2 or 3, the transformed dataset can be easily visualised. Example: n = 2
Object id | Feature1 | Feature2 | … | FeatureN
1         | …        | …        | … | …
…         | …        | …        | … | …
K         | …        | …        | … | …

        ↓ Transformation

Object id | NewFeatureA | NewFeatureB
1         | …           | …
…         | …           | …
K         | …           | …
Transforming from N to n ≪ N dimensions
• The transformation should preserve the characteristics of the data, e.g. distances between pairs of objects
• If a pair of objects is close before the transformation, they should still be close after the transformation
• If a pair of objects is far apart before the transformation, they should still be far apart after the transformation
• The set of nearest neighbours of an object before the transformation should ideally be the same after the transformation
Question
• Suppose we are given a dataset with the following N features, describing individuals in this class. Which two features would you select to represent people, in such a way that “distances” between pairs of people are likely to be preserved in the reduced dataset?
• Input: N=7 features
1. Weighted average mark (WAM)
2. Age (years)
3. Height (cm)
4. Weight (kg)
5. Number of pets owned
6. Number of subjects passed so far
7. Amount of sleep last night (0=little, 1=medium, 2=a lot)
• Output: Select 2 of the above features
Dimensionality reduction
• Basic method: to reduce dimensionality, perform feature selection
• The scatter plots for the Iris dataset shown earlier are 2D visualisations of a 4D dataset: 2 features were selected from the original 4
• In general, when transforming a dataset from 𝑁 to 𝑛 (𝑛 ≪ 𝑁) features
• The output 𝑛 features do not need to be a subset of the input 𝑁 features. Rather, they can be new features whose values are constructed using some function applied to the input 𝑁 features
Principal components analysis (PCA)
• Find a new set of features that better captures the variability of the data
• First dimension chosen to capture as much of the variability as possible;
• The second dimension is orthogonal to the first, and subject to that constraint, captures as much of the remaining variability as possible,
• The third dimension is orthogonal to the first and second, and subject to that constraint, captures as much of the remaining variability as possible,
• And so on. Each new dimension is a linear weighted combination of the original features (feature engineering)
• We will not be covering the mathematical details
• Many tutorials available on the web. Nice application of linear algebra.
• A good visualisation: http://setosa.io/ev/principal-component-analysis
Iris dataset: Reduced from 4 to 2 dimensions using PCA
Contrast PCA against a scatter matrix
Scatter plots for iris dataset
AFL football dataset: From http://afltables.com/afl/stats/
[Screenshot from afltables.com: the 2014 AFL player statistics table for Adelaide ([Game by Game] view). Each row is a player (Dangerfield, Sloane, Thompson, Smith, Jaensch, …); each of the ~26 columns (GM, KI, MK, HB, DI, DA, GL, BH, HO, TK, RB, IF, CL, CG, FF, FA, BR, CP, UP, CM, MI, 1%, BO, GA, %P, SU) is a season statistic. A typical high-dimensional dataset.]
AFL football dataset: PCA in 2D
Principal components analysis in Python
sklearn.decomposition
Will practise in workshop
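A minimal sketch of the workshop exercise: reducing Iris from 4 to 2 dimensions with sklearn.decomposition.PCA. Standardising first is a common but optional step.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # mean 0, variance 1 per feature

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance captured per component
```

X_2d can then be passed straight to a 2D scatter plot, coloured by y, to reproduce the figure on the previous slide.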
Performance Evaluation
Performance Evaluation
• Metrics for Performance Evaluation
• Training versus Testing Data
• Sampling and Cross-Validation
Unsupervised algorithm evaluation
How would you evaluate the performance of a k-nearest neighbour clustering?
1. If a clustering 'looks good'?
2. Compare to some 'known points'?
Thoughts?
Supervised algorithm Evaluation
For an accurate decision tree classifier, we want to minimise both:
• False positives (saying yes when we should say no)
• False negatives (saying no when we should say yes)
Thoughts?
Metrics for Performance Evaluation
For supervised algorithms, performance can be summarised in a Confusion Matrix (a contingency table)
• Actual class: {yes, no, yes, yes, …}
• Predicted class: {no, yes, yes, no, …}

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    a            b
CLASS      Class=No     c            d

a: TP (true positive); b: FN (false negative); c: FP (false positive); d: TN (true negative)
Metrics for Performance Evaluation
                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    a (TP)       b (FN)
CLASS      Class=No     c (FP)       d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Metrics for Performance Evaluation
• Actual class: {yes, no, yes, yes, no, yes, no, no}
• Predicted class: {no, yes, yes, no, yes, no, no, yes}

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL     Class=Yes    a = 1 (TP)   b = 3 (FN)
CLASS      Class=No     c = 3 (FP)   d = 1 (TN)

Accuracy = ?
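The quiz above can be checked with scikit-learn. With labels=["yes", "no"], the rows of the confusion matrix are the actual class and the columns the predicted class, matching the slide's layout.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

actual    = ["yes", "no", "yes", "yes", "no", "yes", "no", "no"]
predicted = ["no", "yes", "yes", "no", "yes", "no", "no", "yes"]

cm = confusion_matrix(actual, predicted, labels=["yes", "no"])
print(cm)                                 # [[1 3]   a=1 (TP), b=3 (FN)
                                          #  [3 1]]  c=3 (FP), d=1 (TN)
print(accuracy_score(actual, predicted))  # (1 + 1) / 8 = 0.25
```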
Limitations of accuracy
• Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10
• If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %
• Accuracy is misleading because we do not detect any class 1 example
• Other metrics (e.g. precision and recall) can be used instead of accuracy to address this problem
Imbalanced data – Bootstrapping
Procedure:
1. Draw k datasets from the original data by sampling with replacement
2. Run the algorithm for each dataset and compute the accuracy
3. Return the mean and standard deviation of the accuracies
• Can estimate confidence in a prediction y=f(x)
• Can handle imbalanced data sets – biased sampling.
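A minimal sketch of this bootstrap procedure. The model (k-NN on Iris) and k = 20 resamples are illustrative choices; accuracy is measured here on the out-of-bag points not drawn into each resample.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n, k = len(y), 20
accuracies = []

for _ in range(k):
    # 1. Draw a dataset by sampling with replacement
    idx = rng.integers(0, n, size=n)
    out_of_bag = np.setdiff1d(np.arange(n), idx)
    # 2. Run the algorithm and compute accuracy (here: out-of-bag accuracy)
    model = KNeighborsClassifier().fit(X[idx], y[idx])
    accuracies.append(model.score(X[out_of_bag], y[out_of_bag]))

# 3. Return the mean and standard deviation of the accuracies
print(np.mean(accuracies), np.std(accuracies))
```

For imbalanced data, the `idx` draw can be biased (e.g. oversampling the rare class) without changing the rest of the procedure.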
Why do we split the dataset into training and testing for evaluating accuracy?
Why do we need to split? Overfitting: a model evaluated on the same data it was trained on looks better than it will on unseen data.
Decision tree: training and testing
• Divide data into:
• Training set (e.g. 2/3)
• Testing set (e.g. 1/3)
• Learn decision tree using the training set
• Evaluate performance of decision tree on the testing set
Common Splitting Strategies
k-fold cross-validation
[Diagram: the dataset is divided into k folds; each fold serves once as the Test set while the remaining k−1 folds form the Train set]
Common Splitting Strategies
Leave-one-out (n-fold cross validation)
[Diagram: with n objects, each round holds out a single object as the Test set and trains on the remaining n−1]
Computational complexity
• k-fold cross validation requires
• k training steps on n(k−1)/k data points
• k testing steps on n/k data points
• We typically also repeat the above 10-100 times to reduce any bias from random number choices
• Report mean results and perform statistical comparisons, e.g. t-tests
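A sketch of repeated k-fold cross-validation as described above. The classifier (a decision tree on Iris) and settings are illustrative; scikit-learn's RepeatedStratifiedKFold handles the repetition with different random splits.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# k = 10 folds, repeated 10 times with different random splits
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(len(scores))                  # 100 train/test evaluations in total
print(scores.mean(), scores.std())  # report the mean result with its spread
```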
2017 exam question 3f
• (2 marks) What is the purpose of separating a dataset into training and test sets, when evaluating the performance of a classifier?
Points to remember
• Understand the difference between supervised and unsupervised algorithms
• Understand the principles of experimental design: Feature Selection, Dimensionality Reduction, Performance Evaluation
• Go from average to great data analysis
Cutting-edge research: Semi-Supervised Learning
JM Reps, JM Garibaldi, U Aickelin, D Soria, JE Gibson, RB Hubbard: “A Novel Semi- Supervised Algorithm for Rare Prescription Side Effect Discovery (DRESS)”, IEEE Journal of Biomedical and Health Informatics 18 (2), 537-547, 2014.
Can we automate the detection of adverse prescription side effects by utilising large databases of electronic health records?
Motivation
• Help patients
• Existing methods (unsupervised) have a high false positive rate
• Supervised methods may have better performance – but labels are rare
• If we use knowledge of known data, can we generate 'a few more' labels?
• Semi-supervised methods?!
DRESS algorithm overview (1)
• The number of known Adverse Drug reactions for a specific drug is limited.
• When labels are limited, semi- supervised methods may be suitable.
• Semi-supervised clustering incorporates the limited additional knowledge to bias the clustering process
Semi-Supervised Algorithms
DRESS algorithm overview (2)
Step 1: Generate attributes (feature selection)
Step 2: Extract labels (special feature engineering)
Step 3: Metric learning (dimensionality reduction)
Step 4: Semi-supervised K-means (clustering)
Step 5: Filter and Rank (performance evaluation)
DRESS algorithm 1- Generate Attributes
Need to generate attributes that create a space where Adverse Drug Reactions are separable from non-Adverse Drug Reactions
DRESS algorithm 1 – Generate attributes
• The attributes are based on existing methods
• The Before/After ratio 1: how often the medical event occurs during the month before the drug compared to the month after
• The Before/After ratio 2: a comparison between how often the drug occurs before the medical event versus after
DRESS algorithm 1 – Generate Attributes
Before and after ratio features:
• month after / month before
• year after / year before
• same day / year before
• month after and not in month before / month after
• month after / month before (first 3 Read code elements)
• month after / month before (first 2 Read code elements)
(A Read code is a UK hierarchical classification code for a medical event)
DRESS algorithm 1 – Generate Attributes
What the data looks like at this point:
DRESS algorithm 2 – extract labels
DRESS algorithm 2 – extract labels
Extract known Adverse Drug Reactions by mining the web. Search for medical events listed as Adverse Drug Reactions to the drug of interest on medical websites (e.g. bnf.org, patient.com)
• Do a string match to find the corresponding Read Codes
• Validate
DRESS algorithm 2 – extract labels
Extract indicators by mining the web. Search for medical events listed as indications (cause of taking the drug) to the drug of interest on medical websites (e.g. bnf.org, patient.com)
• Do a string match to find the corresponding Read Codes
• Validate
Indication: Diabetes
DRESS algorithm 2 – extract labels
Extract noise events using the hierarchical Read Code structure. Via a manual search, find Read Code branches that are highly unlikely to correspond to acute side effects (e.g. cancer, chronic illnesses, family history, …)
DRESS algorithm 2 – extract labels
What the data look like at this stage:
DRESS algorithm 3 – Metric learning
Use the attributes and labels generated previously to find the suitable metric that transforms the attribute space into one that optimises the class separations.
Ref: Y. Ying and P. Li “Distance metric learning with eigenvalue optimization”, J. Mach. Learn. Res., vol. 13, pp.1 -26 2012
DRESS algorithm 3 – Metric Learning
Before Metric Learning
DRESS algorithm 3 – Metric Learning
After Metric Learning
DRESS algorithm 4 – Semi-supervised K-Means
• In the transformed space apply constrained k-means*
• The seeds are the known label sets (seed 1: adverse drug reactions, seed 2: indications, seed 3: noise)
• The initial k-means centers are created using the labeled data averages for each seed
• K-means then assigns the unlabeled data point to the closest cluster and the labeled points remain fixed
*S. Basu , A. Banerjee and R. J. Mooney “Semi-supervised clustering by seeding”, Proc. 19th Int. Conf. Mach. Learn., vol. 2, pp.27 -34 2002
DRESS algorithm 4 – Semi-supervised K-Means
Step 4: Apply Semi-supervised K-means
DRESS algorithm 4 – Semi-supervised K-Means
What the data look like at this stage:
DRESS algorithm 5 – Filter and Rank
• Filter Read codes that cannot be immediately occurring ADRs (cancers, admin, …)
• Filter Read codes in the indicator cluster
DRESS algorithm 5 – Filter and Rank
Order by: (month after and not month before) / (month before) × (1 / weight), where weight = 1 for Read codes in the ADR cluster and weight = 3 for Read codes in the noise cluster
DRESS algorithm 5 – Filter and Rank
Final example of data:
DRESS – RESULTS