COMP20008
Elements of Data Processing
Semester 1, 2021
Lecture 5, Part 1: Correlation – what and why
Contact: pauline.lin@unimelb.edu.au
© University of Melbourne 2021
Where are we now?
[Figure from Kandel et al.: the iterative process of wrangling and analysis. One or more initial data sets may be used and new versions may come later. The wrangling and analysis phases overlap, with multiple back-and-forth steps; much wrangling may need to take place before the data can be loaded into visualisation and analysis tools, which typically immediately reveal new problems with the data.]
© University of Melbourne 2021 2
Plan
• Discuss correlations between pairs of features in a dataset
• Why useful and important
• Pitfalls
• Methods for computing correlation
• Pearson correlation
• Mutual information (another method to compute correlation)
© University of Melbourne 2021 3
Correlations in life
As a child grows, so does his clothing size.
People who suffer from depression have higher rates of suicide than those who do not.
As you drink more coffee, the number of hours you stay awake increases.
The longer amount of time you spend in the bath, the more wrinkly your skin becomes.
Pure water should have a constant boiling point – no correlation.
The older a man gets, the less hair that he has.
© University of Melbourne 2021
https://towardsdatascience.com/3-beautiful-real-life-correlations-fed9e855da52,
https://examples.yourdictionary.com/positive-correlation-examples.html,
https://examples.yourdictionary.com/negative-correlation-examples.html
What is Correlation?
Correlation is used to detect pairs of variables that might have some relationship, and to assess how strong that relationship is.
https://www.mathsisfun.com/data/correlation.html
© University of Melbourne 2021 6
What is Correlation?
Correlation can often be identified visually by inspecting scatter plots
https://www.mathsisfun.com/data/correlation.html
© University of Melbourne 2021
7
What is Correlation?
Linear relations
© University of Melbourne 2021
8
https://www.mathsisfun.com/data/correlation.html
What is Correlation?
Correlation strength
https://images.app.goo.gl/zZXtjBLR2BcjRpK79
© University of Melbourne 2021 9
Example of non-linear correlation
It gets so hot that people aren’t going near the shop, and sales start dropping
https://www.mathsisfun.com/data/correlation.html
© University of Melbourne 2021 10
Example of Correlated Variables
• Greater understanding of data
• Can hint at potential causal relationships
• Business decision based on correlation: increase electricity production when temperature increases
© University of Melbourne 2021 12
Example of Correlated Variables
Correlation does not necessarily imply causality!
© University of Melbourne 2021 13
Example rank correlation
“If a university has a higher-ranked football team, then is it likely to have a higher-ranked basketball team?”
Football ranking:
1. Melbourne
2. Monash
3. Sydney
4. New South Wales
5. Adelaide
6. Perth

Basketball ranking:
1. Sydney
2. Melbourne
3. Monash
4. New South Wales
5. Perth
6. Adelaide
© University of Melbourne 2021 14
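One common way to put a number on this kind of rank agreement (an aside; it is not developed further in this lecture) is Spearman's rank correlation, which is simply Pearson correlation applied to the ranks. A minimal scipy sketch, pairing the two rankings above by university (Melbourne, Monash, Sydney, New South Wales, Adelaide, Perth):

```python
from scipy.stats import spearmanr

# Ranks per university, in the order: Melbourne, Monash, Sydney, NSW, Adelaide, Perth
football_rank   = [1, 2, 3, 4, 5, 6]
basketball_rank = [2, 3, 1, 4, 6, 5]

rho, p_value = spearmanr(football_rank, basketball_rank)
print(rho)  # roughly 0.77 -> the two rankings agree fairly strongly
```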
Microarray data
Each chip contains thousands of tiny probes corresponding to the genes (20k – 30k genes in humans). Each probe measures the activity (expression) level of a gene
Gene 1 expression   Gene 2 expression   ...   Gene 20K expression
0.3                 1.2                 ...   3.1
© University of Melbourne 2021 15
Microarray dataset
         Gene 1   Gene 2   Gene 3   ...   Gene n
Time 1   2.3      1.1      0.3      ...   2.1
Time 2   3.2      0.2      1.2      ...   1.1
Time 3   1.9      3.8      2.7      ...   0.2
...      ...      ...      ...      ...   ...
Time m   2.8      3.1      2.5      ...   3.4
• Each row represents measurements at some time
• Each column represents levels of a gene
© University of Melbourne 2021 16
Correlation analysis on Microarray data
Can reveal genes that exhibit similar patterns ⇨ similar or related functions ⇨ Discover functions of unknown genes
© University of Melbourne 2021 17
Genetic network
Connect genes with high correlation – understand behaviour of groups of genes
© University of Melbourne 2021 18
Salt Causes High Blood Pressure
Intersalt: an international study of electrolyte excretion and blood pressure. Results for 24 hour urinary sodium and potassium excretion.
British Medical Journal; 297: 319-328, 1988.
[Scatter plot: median diastolic blood pressure (mmHg) against median urinary sodium excretion (mmol/24hr)]
© University of Melbourne 2021 19
Or Does It!?
Intersalt: an international study of electrolyte excretion and blood pressure. Results for 24 hour urinary sodium and potassium excretion.
British Medical Journal; 297: 319-328, 1988.
If we exclude these four ‘outliers’, which are non-industrialised countries with non-salt diets, we get a quite different result!
[Scatter plot: median diastolic blood pressure (mmHg) against median urinary sodium excretion (mmol/24hr), with the four outlying points marked]
© University of Melbourne 2021 20
Spurious Correlation
Correlation ≠ Causality
https://assets.businessinsider.com/real-maps-ridiculous-correlations-2014-11?jwsource=cl
© University of Melbourne 2021
21
https://images.app.goo.gl/FVr8BhxWmQMxCB5f7
Why is correlation important?
• Discover relationships
• One step towards discovering causality (A causes B)
Example: Smoking causes lung cancer
• Feature ranking: for building better predictive models
A good feature to use is one that has high correlation with the outcome we are trying to predict
© University of Melbourne 2021 22
Break
From the time before, 2019
© University of Melbourne 2021
COMP20008
Elements of Data Processing
Semester 1, 2021
Lecture 5, Part 2: Correlation – Pearson correlation
Contact: pauline.lin@unimelb.edu.au
© University of Melbourne 2021
Correlation methods
© University of Melbourne 2021 27
Problems of Euclidean distance
• Objects can be represented with different measure scales
               Day 1    Day 2    Day 3    ...   Day m
Temperature    20       22       16       ...   33
#Ice-creams    50223    55223    45098    ...   78008
#Electricity   102034   105332   88900    ...   154008

d(temp, ice-cream) = 540324
d(temp, electricity) = 12309388
• Euclidean distance: does not give a clear intuition about how well variables are correlated
© University of Melbourne 2021 29
Problems of Euclidean distance
Cannot discover variables with similar behaviours/dynamics but at different scales
© University of Melbourne 2021 30
Problems of Euclidean distance
Cannot discover variables with similar behaviours/dynamics but in the opposite direction (negative correlation)
© University of Melbourne 2021 31
Pearson
© University of Melbourne 2021 32
Assessing linear correlation – Pearson correlation
• We will define a correlation measure rxy, assessing samples from two features x and y
• Assess how close their scatter plot is to a straight line (a linear relationship)
Pearson Product Moment Correlation
• Range of rxy lies within [-1,1]:
• 1 for perfect positive linear correlation
• -1 for perfect negative linear correlation
• 0 means no linear correlation
• Absolute value |r| indicates strength of linear correlation
• http://www.bc.edu/research/intasc/library/correlation.shtml
© University of Melbourne 2021 33
Pearson’s correlation coefficient (r)
© University of Melbourne 2021 34
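For reference, the standard sample (product-moment) formula for the correlation between features x and y, with means x̄ and ȳ:

$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$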
Pearson coefficient example
Height (x)
Weight (y)
1.6
50
1.7
66
1.8
77
1.9
94
• How do the values of x and y move (vary) together?
• Big values of x with big values of y?
• Small values of x with small values of y?
© University of Melbourne 2021 35
Pearson coefficient example
© University of Melbourne 2021 36
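A minimal numpy sketch of this example (the variable names are ours), computing r for the four height/weight pairs above:

```python
import numpy as np

height = np.array([1.6, 1.7, 1.8, 1.9])
weight = np.array([50.0, 66.0, 77.0, 94.0])

# Pearson r from the definition: sum of products of deviations from the mean,
# divided by the square roots of the two sums of squared deviations
dx = height - height.mean()
dy = weight - weight.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r)                                  # approximately 0.997: strong positive linear correlation
print(np.corrcoef(height, weight)[0, 1])  # the same value via numpy's built-in
```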
Interpreting Pearson correlation values
In general it depends on your domain of application. Jacob Cohen has suggested
• 0.5 or more is large
• 0.3–0.5 is moderate
• 0.1–0.3 is small
• less than 0.1 is trivial
© University of Melbourne 2021 38
Properties of Pearson’s correlation
• Range within [-1,1]
• Scale invariant: r(x,y) = r(x, Ky) for a positive constant K
• Multiplying a feature’s values by a positive constant K makes no difference
• Location invariant: r(x,y) = r(x, K+y)
• Adding a constant K to one feature’s values makes no difference
• Can only detect linear relationships: y = a·x + b + noise
• Cannot detect non-linear relationships, e.g. y = x³ + noise
© University of Melbourne 2021 39
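A quick numpy check of the scale- and location-invariance properties, reusing the height/weight sample from the earlier example (any data would do):

```python
import numpy as np

x = np.array([1.6, 1.7, 1.8, 1.9])
y = np.array([50.0, 66.0, 77.0, 94.0])

r = np.corrcoef(x, y)[0, 1]
print(r)
print(np.corrcoef(x, 3.0 * y)[0, 1])   # same r: multiplying y by K = 3 makes no difference
print(np.corrcoef(x, y + 10.0)[0, 1])  # same r: adding K = 10 to y makes no difference
```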
Pearson correlation examples
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
© University of Melbourne 2021 40
2017 Exam question 3a
• http://www.bc.edu/research/intasc/library/correlation.shtml
© University of Melbourne 2021 41
2016 exam question 2a)
© University of Melbourne 2021 42
Break
From the time before, 2019
© University of Melbourne 2021
COMP20008
Elements of Data Processing
Semester 1, 2021
Lecture 5, Part 3: Correlation – Mutual Information (Entropy)
Contact: pauline.lin@unimelb.edu.au
© University of Melbourne 2021
Correlation – Mutual Information
Mutual Information
© University of Melbourne 2021 47
Recap: Pearson correlation – assess linear correlation between two features
© University of Melbourne 2021
48
https://www.mathsisfun.com/data/correlation.html
What about non-linear correlation?
Pearson correlation is not suitable for this scenario (value less than 0.1)
https://www.mathsisfun.com/data/correlation.html
© University of Melbourne 2021 49
Introduction to Mutual Information
A correlation measure that can detect non-linear relationships
• It operates with discrete features
• Pre-processing: continuous features are first discretised into bins (categories). E.g. small [0,1.4], medium (1.4,1.8), large [1.8,3.0]
Object   Height   Discretised Height
1        2.03     large
2        1.85     large
3        1.23     small
4        1.31     small
5        1.72     medium
6        1.38     small
7        0.94     small
© University of Melbourne 2021 50
A recap on logarithms (to the base 2)
• y = log2 x (y is the solution to the question “To what power do I need to raise 2, in order to get x?”)
• 2*2*2*2 = 16, which means log2 16 = 4 (16 is 2 to the power 4)
• log2 32 = 5
• log2 30 ≈ 4.9
• log2 1.2 ≈ 0.26
• log2 0.5 = -1
In what follows, we’ll write log instead of log2 to represent binary log
© University of Melbourne 2021 51
Entropy
Given a feature X, H(X) is its entropy, assuming X takes values from a number of categories (bins)
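In standard (Shannon) form, with p_i the proportion of objects in bin i:

$$ H(X) = -\sum_{i} p_i \log_2 p_i $$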
*sometimes written as p(i) instead of p
© University of Melbourne 2021 52
Entropy – intuition
• Entropy is a measure used to quantify the amount of uncertainty in an outcome
• Randomly select an element from
a. {1,1,1,1,1,1,1,2} versus b. {1,2,2,3,3,4,5}
• In which case are you more uncertain about the outcome of the selection? Why?
• More uncertain, less predictable => high entropy (b.)
• Less uncertain, more predictable => low entropy (a.)
© University of Melbourne 2021 53
Another example
• Consider the sample of all people in this subject. Each person is labelled young (<30 years) or old (>30 years)
• Randomly select a person and inspect whether they are young or old.
• How surprised am I likely to be by the outcome?
• Suppose I repeat the experiment using a random sample of people catching the train to the city in peak hour?
• How surprised am I likely to be by the outcome?
© University of Melbourne 2021 54
Entropy
Information theory – Claude Shannon
• A bit = 0 or 1
• 1 bit of information reduces uncertainty by a factor of 2.
Flipping a coin: head – 50%, tail – 50%
• the machine tells you that the outcome will be tail
• Uncertainty for tail reduced by a factor of 2.
• 1 bit of information.
• the machine tells you that the outcome will be head
• Uncertainty for head reduced by a factor of 2.
• 1 bit of information.
• On average, the machine transmits 0.5*1 + 0.5*1 = 1 bit (Entropy)
© University of Melbourne 2021 55
Entropy
Flipping a special coin: head – 75%, tail – 25%
• the machine tells you that the outcome will be tail
• Uncertainty for tail reduced by a factor of 4 (1/0.25 = 4)
• 2 bits of information (log2 4 = – log2 0.25 = 2)
• the machine tells you that the outcome will be head
• Uncertainty for head reduced by a factor of 4/3 (1/0.75 = 4/3)
• 0.41 bits of information (log2 4/3 = – log2 0.75 = 0.41)
• On average, the machine transmits 0.75*0.41 + 0.25*2 = 0.81 bits (Entropy)
© University of Melbourne 2021 56
Entropy
Given a feature X, H(X) is its entropy, assuming X takes values from a number of categories (bins)
*sometimes written as p(i) instead of p
© University of Melbourne 2021 57
Entropy example
Sample (9 objects): A, B, B, A, C, C, C, C, A
We have 3 categories/bins (A, B, C) for a feature X; each of the 9 objects falls in exactly one of the 3 bins
What is the entropy of this sample of 9 objects?
Answer: H(X) = 1.53
© University of Melbourne 2021 60
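A minimal Python sketch of this calculation (the entropy function here is our own helper, not a library call):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sample of categorical values."""
    n = len(values)
    return -sum((count / n) * log2(count / n) for count in Counter(values).values())

sample = ["A", "B", "B", "A", "C", "C", "C", "C", "A"]
print(entropy(sample))  # approximately 1.53 bits, matching the answer above
```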
How would you compute the entropy for the “Likes to sleep” feature?
Person   Likes to sleep
1        Yes
2        No
3        Maybe
4        Never
5        Yes
6        Yes
7        Never
8        Yes
© University of Melbourne 2021 61
Properties of entropy
• H(X) ≥ 0
• Entropy is maximized for uniform distribution (highly uncertain what value a randomly selected object will have)
• Entropy – when using log base 2 – measures uncertainty of the outcome in bits. This can be viewed as the information associated with learning the outcome
© University of Melbourne 2021 63
Variable discretisation: Techniques
• Domain knowledge: assign thresholds manually
Car speed:
• 0-40: slow
• 40-60: medium
• >60: high
• Equal-length bins
Divide the range of the continuous feature into equal-length intervals (bins). If speed ranges from 0-100, then the 10 bins are [0,10), [10,20), [20,30), ..., [90,100]
• Equal-frequency bins
Divide the range of the continuous feature into equal-frequency intervals (bins). Sort the values and divide so that each bin has the same number of objects
© University of Melbourne 2021
64
Discretisation example
• Given the values 2, 2, 3, 10, 13, 15, 16, 17, 19, 19, 20, 20, 21
• Show a 3-bin equal-length discretisation
• Show a 3-bin equal-frequency discretisation
© University of Melbourne 2021 65
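One possible pandas sketch of both discretisations (bin boundary conventions vary, so the exact intervals may differ slightly from a hand-worked answer):

```python
import pandas as pd

values = pd.Series([2, 2, 3, 10, 13, 15, 16, 17, 19, 19, 20, 20, 21])

# Equal-length: the range [2, 21] is split into 3 intervals of equal width
equal_length = pd.cut(values, bins=3)

# Equal-frequency: each bin receives (roughly) the same number of values
equal_frequency = pd.qcut(values, q=3)

print(equal_length.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```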
Break
From the time before, 2019
© University of Melbourne 2021
COMP20008
Elements of Data Processing
Semester 1, 2021
Lecture 5, Part 4: Correlation – Mutual Information – cont.
Contact: pauline.lin@unimelb.edu.au
© University of Melbourne 2021
Conditional entropy – intuition
Suppose I randomly sample a person. I check if they wear glasses – how surprised am I by their age?
Person   WearGlasses (X)   Age (Y)
1        No                young
2        No                young
3        No                young
4        No                young
5        Yes               old
6        Yes               old
7        Yes               old
© University of Melbourne 2021 68
Conditional entropy H(Y|X)
Measures how much information is needed to describe outcome Y, given that outcome X is known. Suppose X is Height and Y is Weight.
H(Y|X) = Σ_{x∈X} p(x) H(Y|X=x)
© University of Melbourne 2021 69
Object   Height (X)   Weight (Y)
1        big          light
2        big          heavy
3        small        light
4        small        light
5        small        light
6        small        light
7        small        heavy
Conditional entropy
Object   Height (X)   Weight (Y)
1        big          light
2        big          heavy
3        small        light
4        small        light
5        small        light
6        small        light
7        small        heavy

H(Y|X) = Σ_{x∈X} p(x) H(Y|X=x)
H(Y|X) = 2/7 * H(Y|X=big) + 5/7 * H(Y|X=small)
       = 2/7 * (-0.5 log 0.5 - 0.5 log 0.5) + 5/7 * (-0.8 log 0.8 - 0.2 log 0.2)
       = 0.801
© University of Melbourne 2021 70
Mutual information definition
MI(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)
• Where X and Y are features (columns) in a dataset
• MI (mutual information) is a measure of correlation
• The amount of information about Y we gain by knowing X, or
• The amount of information about X we gain by knowing Y
© University of Melbourne 2021 71
Mutual information example 1
Object   Height (X)   Weight (Y)
1        small        light
2        big          heavy
3        small        light
4        small        light
5        small        light
6        small        light
7        small        heavy

MI(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)
H(Y|X) = Σ_{x∈X} p(x) H(Y|X=x)

H(Y) = 0.8631205
H(Y|X) = 0.5572
H(Y) - H(Y|X) = 0.306 (how much information about Y is gained by knowing X)
© University of Melbourne 2021 72
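A minimal Python sketch of this calculation for the table above (entropy and cond_entropy are our own helper functions):

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def cond_entropy(y, x):
    """H(Y|X): entropy of y within each x value, weighted by how often that x value occurs."""
    n = len(x)
    return sum(
        (count / n) * entropy([yi for xi, yi in zip(x, y) if xi == x_val])
        for x_val, count in Counter(x).items()
    )

height = ["small", "big", "small", "small", "small", "small", "small"]
weight = ["light", "heavy", "light", "light", "light", "light", "heavy"]

print(entropy(weight))                                 # H(Y)     ~ 0.863
print(cond_entropy(weight, height))                    # H(Y|X)   ~ 0.557
print(entropy(weight) - cond_entropy(weight, height))  # MI(X, Y) ~ 0.306
```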
Mutual information example 2
MI(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)
H(Y|X) = Σ_{x∈X} p(x) H(Y|X=x)

Object   Height (X)   Weight (Y)
1        big          light
2        big          heavy
3        small        light
4        small        jumbo
5        medium       light
6        medium       light
7        small        heavy

H(Y) = 1.379
H(Y|X) = 0.965
H(Y) - H(Y|X) = 0.414
© University of Melbourne 2021 73
Properties of Mutual Information
• The amount of information shared between two variables X and Y
• MI(X,Y)
• large: X and Y are highly correlated (more dependent)
• small: X and Y have low correlation (more independent)
• 0 ≤ MI(X,Y) ≤ ∞
• Sometimes also referred to as ‘Information Gain’
© University of Melbourne 2021 74
Mutual information: normalisation
• MI(X,Y) is always at least zero, may be larger than 1
• In fact, one can show it is true that
• 0 ≤ MI(X,Y) ≤ min(H(X), H(Y))
• (where min(a,b) indicates the minimum of a and b)
• Thus if we want a measure in the interval [0,1], we can define normalised mutual information (NMI)
• NMI(X,Y) = MI(X,Y) / min(H(X),H(Y))
• NMI(X,Y)
• large: X and Y are highly correlated (more dependent)
• small: X and Y have low correlation (more independent)
© University of Melbourne 2021 75
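If scikit-learn is available, one possible sketch of the same idea: its clustering metric normalized_mutual_info_score computes this quantity when told to normalise by the minimum entropy (our reading of its average_method parameter), shown here on the data from mutual information example 1:

```python
from sklearn.metrics import normalized_mutual_info_score

# Example 1 data from the mutual information slides
height = ["small", "big", "small", "small", "small", "small", "small"]
weight = ["light", "heavy", "light", "light", "light", "light", "heavy"]

# average_method='min' should match NMI(X, Y) = MI(X, Y) / min(H(X), H(Y))
nmi = normalized_mutual_info_score(height, weight, average_method="min")
print(nmi)  # roughly 0.52 for this small sample
```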
Pearson correlation = -0.0864
Normalised mutual information (NMI) = 0.43 (3-bin equal-spread discretisation)
76
Pearson correlation = -0.1
Normalised mutual information (NMI) = 0.84
77
Pearson correlation = -0.05
Normalised mutual information (NMI) = 0.35
78
Examples
• Pearson?
• NMI?
79
Examples
• Pearson: 0.08
• NMI: 0.009
80
Computing MI with class features
Identifying features that are highly correlated with a class feature
HoursSleep   HoursExercise   HairColour   HoursStudy   Happy (class feature)
12           20              Brown        low          Yes
11           18              Black        low          Yes
10           10              Red          medium       Yes
10           9               Black        medium       Yes
10           10              Red          high         No
7            11              Red          high         No
6            15              Brown        high         No
2            13              Brown        high         No
Compute MI(HoursSleep, Happy), MI(HoursExercise, Happy),
and MI(HoursStudy, Happy), MI(HairColour, Happy). Retain most predictive feature(s)
81
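A possible end-to-end sketch of this feature-ranking step (the 2-bin discretisation of the numeric columns is an arbitrary illustrative choice, and the helper functions are our own):

```python
import pandas as pd
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_info(x, y):
    """MI(X, Y) = H(Y) - H(Y|X) for two categorical columns."""
    n = len(x)
    h_y_given_x = sum(
        (count / n) * entropy([yi for xi, yi in zip(x, y) if xi == x_val])
        for x_val, count in Counter(x).items()
    )
    return entropy(y) - h_y_given_x

df = pd.DataFrame({
    "HoursSleep":    [12, 11, 10, 10, 10, 7, 6, 2],
    "HoursExercise": [20, 18, 10, 9, 10, 11, 15, 13],
    "HairColour":    ["Brown", "Black", "Red", "Black", "Red", "Red", "Brown", "Brown"],
    "HoursStudy":    ["low", "low", "medium", "medium", "high", "high", "high", "high"],
    "Happy":         ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

# Discretise the continuous features so MI can be computed (2 equal-length bins here)
for col in ["HoursSleep", "HoursExercise"]:
    df[col] = pd.cut(df[col], bins=2).astype(str)

# Rank features by their mutual information with the class feature
scores = {col: mutual_info(list(df[col]), list(df["Happy"]))
          for col in df.columns if col != "Happy"}
for col, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(col, round(score, 3))
```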
Advantages and disadvantages of MI
• Advantages
• Can detect both linear and non-linear dependencies (unlike Pearson)
• Applicable and very effective for use with discrete features (unlike Pearson correlation)
• Disadvantages
• If a feature is continuous, it must first be discretised to compute mutual information. This involves making choices about what bins to use.
• This may not be obvious. Different bin choices will lead to different estimates of mutual information.
83
Choose the statement that is not true
• Mutual information can’t detect the existence of linear relationships in the data, but Pearson correlation can
• Mutual information can be used to assess how much one feature is associated with another
• Mutual information can detect the existence of non-linear relationships in the data
• Mutual information doesn’t indicate the “direction” of a relationship, just its strength
84
Question 2aii) from 2016 exam
a) Richard is a data wrangler. He does a survey and constructs a dataset recording average time/day spent studying and average grade for a population of 1000 students:
Student Name   Average time per day spent studying   Average Grade
...            ...                                   ...
ii) Richard separately discretises the two features Average time per day spent studying and Average grade, each into 2 bins. He then computes the normalised mutual information between these two features and obtains a value of 0.1, which seems surprisingly low to him. Suggest two reasons that might explain the mismatch between the normalised mutual information value of 0.1 and the Pearson Correlation coefficient of 0.85. Explain any assumptions made.
86
End of Correlation topic
From the time before, 2019
© University of Melbourne 2021
Acknowledgements
• Materials are partially adapted from:
• Previous COMP20008 slides including material produced by James Bailey, Pauline Lin, Chris Ewin, Uwe Aickelin and others
• Interactive correlation calculator:
http://www.bc.edu/research/intasc/library/correlation.shtml
• Correlation <> Causality: http://tylervigen.com/spurious-correlations
© University of Melbourne 2021 88