
COMP20008 Elements of Data Processing Correlation
Semester 1, 2020
Contact: uwe.aickelin@unimelb.edu.au

Where are we now?
2

Plan
• Discuss correlations between pairs of features in a dataset
  • Why useful and important
  • Pitfalls
• Methods for computing correlation
  • Euclidean distance
  • Pearson correlation
  • Mutual information (another method to compute correlation)
3

Correlation
4

What is Correlation?
Correlation is used to detect pairs of variables that might have some relationship
https://www.mathsisfun.com/data/correlation.html
5

What is Correlation?
Visually, correlation can be identified by inspecting scatter plots
https://www.mathsisfun.com/data/correlation.html
6

What is Correlation?
Linear relations
https://www.mathsisfun.com/data/correlation.html
7

Example of Correlated Variables
• Can hint at potential causal relationships (change in one variable is the result of change in the other)
• Business decision based on correlation: increase electricity production when temperature increases
8

Example of non-linear correlation
It gets so hot that people aren’t going near the shop, and sales start dropping
https://www.mathsisfun.com/data/correlation.html
9

Example of non-linear correlation
It gets so hot that people aren’t going near the shop, and sales start dropping
https://www.mathsisfun.com/data/correlation.html
10

Climate change [https://climate.nasa.gov/evidence/]
11

Climate Change? http://en.wikipedia.org/wiki/File:2000_Year_Temperature_Comparison.png
12

Climate Change?
13

Climate Change?
14

Climate Change?
15

Salt Causes High Blood Pressure
Intersalt: an international study of electrolyte excretion and blood pressure. Results for 24 hour urinary sodium and potassium excretion.
British Medical Journal; 297: 319-328, 1988.
16

Or Does It!?
Intersalt: an international study of electrolyte excretion and blood pressure. Results for 24 hour urinary sodium and potassium excretion.
British Medical Journal; 297: 319-328, 1988.
If we exclude these four ‘outliers’, which are non-industrialised countries with low-salt diets, we get a quite different result!
17

Example of Correlated Variables
Correlation does not necessarily imply causality!
18

Example of Correlated Variables
Correlation does not necessarily imply causality!
19

Example: Predicting Sales
https://www.kaggle.com/ c/rossmann-store- sales/data
20

Example: Predicting Sales
21

Example: Predicting Sales
22

Example: Predicting Sales
Other correlations
• Sales vs. holiday
• Sales vs. day of the week
• Sales vs. distance to competitors
• Sales vs. average income in area
23

Factors correlated with human performance
• Bedtime consistency
• High intensity exercise
• Intermittent fasting
• Optimism
24

Why is correlation important?
• Discover relationships
• One step towards discovering causality (A causes B)
  Example: Gene A causes lung cancer
• Feature ranking: select the best features for building better predictive models
  • A good feature to use is a feature that has high correlation with the outcome we are trying to predict
25

Microarray data
Each chip contains thousands of tiny probes corresponding to the genes (20k – 30k genes in humans). Each probe measures the activity (expression) level of a gene
Gene 1 expression | Gene 2 expression | … | Gene 20K expression
0.3 | 1.2 | … | 3.1
26

Microarray dataset
       | Gene 1 | Gene 2 | Gene 3 | … | Gene n
Time 1 | 2.3 | 1.1 | 0.3 | … | 2.1
Time 2 | 3.2 | 0.2 | 1.2 | … | 1.1
Time 3 | 1.9 | 3.8 | 2.7 | … | 0.2
…      | …   | …   | …   | … | …
Time m | 2.8 | 3.1 | 2.5 | … | 3.4
• Each row represents measurements at some time
• Each column represents levels of a gene
27

Correlation analysis on Microarray data
Can reveal genes that exhibit similar patterns ⇨ similar or related functions ⇨ Discover functions of unknown genes
28

Compare people
         | Gene 1 | Gene 2 | Gene 3 | … | Gene n
Person 1 | 2.3 | 1.1 | 0.3 | … | 2.1
Person 2 | 3.2 | 0.2 | 1.2 | … | 1.1
Person 3 | 1.9 | 3.8 | 2.7 | … | 0.2
…        | …   | …   | …   | … | …
Person m | 2.8 | 3.1 | 2.5 | … | 3.4
• Each row represents measurements about a person
• Each column represents levels of a gene
29

Genetic network
Connect genes with high correlation – understand behaviour of groups of genes
http://www.john.ranola.org/?page_id=116
30

Euclidean Distance
31

Problems of Euclidean distance
• Objects can be represented with different measurement scales
             | Day 1  | Day 2  | Day 3 | … | Day m
Temperature  | 20     | 22     | 16    | … | 33
#Ice-creams  | 50223  | 55223  | 45098 | … | 78008
#Electricity | 102034 | 105332 | 88900 | … | 154008
d(temp, ice-cr) = 540324
d(temp, elect) = 12309388
• Euclidean distance: does not give a clear intuition about how well variables are correlated
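A minimal sketch (my own, not from the slides; it uses only the four days visible in the table above, so the printed distances will not match the slide's values) of why raw Euclidean distance is dominated by the variable with the largest scale, and why standardising first helps:

```python
import numpy as np

temp = np.array([20, 22, 16, 33])                        # degrees
ice_cream = np.array([50223, 55223, 45098, 78008])       # units sold
electricity = np.array([102034, 105332, 88900, 154008])  # units produced

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean(temp, ice_cream))    # huge, even though the series move together
print(euclidean(temp, electricity))  # even bigger, purely because of the scale

# After z-scoring (subtract the mean, divide by the standard deviation),
# the distances become comparable and small distance really does mean
# "similar behaviour".
def zscore(a):
    return (a - a.mean()) / a.std()

print(euclidean(zscore(temp), zscore(ice_cream)))
print(euclidean(zscore(temp), zscore(electricity)))
```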
32

Problems of Euclidean distance
Cannot discover variables with similar behaviours/dynamics but at different scale
33

Problems of Euclidean distance
Cannot discover variables with similar behaviours/dynamics but in the opposite direction (negative correlation)
34

Pearson
35

Assessing linear correlation – Pearson correlation
• We will define a correlation measure rxy, assessing samples from two features x and y
• Assess how close their scatter plot is to a straight line (a linear relationship)
• Range of rxy lies within [-1,1]:
• 1 for perfect positive linear correlation
• -1 for perfect negative linear correlation
• 0 means no correlation
• Absolute value |r| indicates strength of linear correlation
• http://www.bc.edu/research/intasc/library/correlation.shtml
36

Pearson’s correlation coefficient (r)
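The formula image on this slide did not survive the conversion to text; for reference, the standard sample definition of the coefficient (consistent with how r is used on the following slides) is:

$$ r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$

where \(\bar{x}\) and \(\bar{y}\) are the sample means of the two features.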
37

Pearson coefficient example
Height (x) | Weight (y)
1.6 | 50
1.7 | 66
1.8 | 77
1.9 | 94
• How do the values of x and y move (vary) together?
• Big values of x with big values of y?
• Small values of x with small values of y?
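A quick check of this example (my own sketch, not part of the slides), computing r for the height/weight table with numpy:

```python
import numpy as np

height = np.array([1.6, 1.7, 1.8, 1.9])
weight = np.array([50, 66, 77, 94])

# Pearson correlation from the off-diagonal entry of the correlation matrix
r = np.corrcoef(height, weight)[0, 1]
print(round(r, 3))   # roughly 0.997: big x values go with big y values
```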
38

Pearson coefficient example
39

Example rank correlation
“If a university has a higher-ranked football team, then is it likely to have a higher-ranked basketball team?”
Football ranking | University team
1 | Melbourne
2 | Monash
3 | Sydney
4 | New South Wales
5 | Adelaide
6 | Perth

Basketball ranking | University team
1 | Sydney
2 | Melbourne
3 | Monash
4 | New South Wales
5 | Perth
6 | Adelaide
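One way to make the question concrete (my own sketch; the slide itself does not give an answer): since both columns are already ranks, apply Spearman's rank correlation, i.e. Pearson correlation computed on the ranks. scipy is assumed to be available.

```python
from scipy.stats import spearmanr

# Ranks aligned by university: Melbourne, Monash, Sydney, NSW, Adelaide, Perth
football   = [1, 2, 3, 4, 5, 6]
basketball = [2, 3, 1, 4, 6, 5]

rho, p_value = spearmanr(football, basketball)
print(round(rho, 2))   # about 0.77: a higher-ranked football team tends to go
                       # with a higher-ranked basketball team, but not perfectly
```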
40

Interpreting Pearson correlation values
In general it depends on your domain of application. Jacob Cohen has suggested:
• 0.5 or above is large
• 0.3-0.5 is moderate
• 0.1-0.3 is small
• less than 0.1 is trivial
41

Properties of Pearson’s correlation
• Range within [-1,1]
• Scale invariant: r(x,y) = r(x, Ky)
  • Multiplying a feature’s values by a constant K > 0 makes no difference
• Location invariant: r(x,y) = r(x, K+y)
  • Adding a constant K to one feature’s values makes no difference
• Can only detect linear relationships
  y = a·x + b + noise
• Cannot detect non-linear relationships
  y = x³ + noise
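A small sketch (my own, not from the slides) illustrating these properties with numpy; note the non-linear case uses y = x² rather than the slide's x³, because a symmetric relationship makes the failure of Pearson correlation especially clear:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noise = 0.1 * rng.normal(size=1000)

r = lambda a, b: np.corrcoef(a, b)[0, 1]

y = 2 * x + 3 + noise
print(r(x, y))          # close to 1: strong linear relationship
print(r(x, 5 * y))      # unchanged: scale invariant (for a positive constant)
print(r(x, 100 + y))    # unchanged: location invariant

# Perfect (but symmetric, non-linear) dependence, yet r is close to 0
print(r(x, x ** 2 + noise))
```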
42

Examples
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
43

2017 Exam question 3a
• http://www.bc.edu/research/intasc/library/correlation.shtml
44

2016 exam question 2a)
45

Mutual Information
46

Recap: Pearson correlation – assess linear correlation between two features
https://www.mathsisfun.com/data/correlation.html
47

What about non-linear correlation?
Pearson correlation is not suitable for this scenario (value less than 0.1)
https://www.mathsisfun.com/data/correlation.html
48

Introduction to Mutual Information
A correlation measure that can detect non-linear relationships
• It operates with discrete features
• Pre-processing: continuous features are first discretised into bins (categories). E.g. small [0,1.4], medium (1.4,1.8), large [1.8,3.0]
Object | Height | Discretised Height
1 | 2.03 | large
2 | 1.85 | large
3 | 1.23 | small
4 | 1.31 | small
5 | 1.72 | medium
6 | 1.38 | small
7 | 0.94 | small
49

Variable discretisation: Techniques
• Domain knowledge: assign thresholds manually
  • Car speed:
    • 0-40: slow
    • 40-60: medium
    • >60: high
• Equal-length bins
  • Divide the range of the continuous feature into equal-length intervals (bins). If speed ranges from 0-100, then the 10 bins are [0,10), [10,20), [20,30), …, [90,100]
• Equal-frequency bins
  • Divide the range of the continuous feature into equal-frequency intervals (bins). Sort the values and divide so that each bin has the same number of objects
50

Discretisation example
• Given the values 2, 2, 3, 10, 13, 15, 16, 17, 19, 19, 20, 20, 21
• Show a 3-bin equal-length discretisation
• Show a 3-bin equal-frequency discretisation (see the pandas sketch below)
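A sketch of the two automatic binning techniques using pandas, applied to the values above (my own illustration; pandas may place boundary values slightly differently from a by-hand discretisation, particularly for the equal-frequency case):

```python
import pandas as pd

values = pd.Series([2, 2, 3, 10, 13, 15, 16, 17, 19, 19, 20, 20, 21])

equal_length = pd.cut(values, bins=3)   # 3 equal-width intervals over the range
equal_freq   = pd.qcut(values, q=3)     # 3 bins with (roughly) equal counts

print(equal_length.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```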
51

A recap on logarithms (to the base 2)
• y = log2 x (y is the solution to the question “To what power do I need to raise 2 in order to get x?”)
• 2*2*2*2 = 16, which means log2 16 = 4 (16 is 2 to the power 4)
• log2 32 = 5
• log2 30 = 4.9
• log2 1.2 = 0.26
• log2 0.5 = -1
• In what follows, we’ll write log instead of log2
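A quick sanity check of these values with Python's standard library (my own addition):

```python
from math import log2

print(log2(16))    # 4.0
print(log2(32))    # 5.0
print(log2(30))    # about 4.91
print(log2(1.2))   # about 0.26
print(log2(0.5))   # -1.0
```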
52

Entropy
• Entropy is a measure used to assess the amount of uncertainty in an outcome
• E.g. Randomly select an element from {1,1,1,1,1,1,1,2}, versus randomly select an element from {1,2,2,3,3,4,5}
• In which case is the value selected more “predictable”? Why?
53

Entropy
• E.g. Randomly select an element from {1,1,1,1,1,1,1,2}, versus randomly select an element from {1,2,2,3,3,4,5}
• In which case is the value selected more “predictable”? Why?
• The former case is more certain => low entropy
• The latter case is less certain => higher entropy
• Entropy is used to quantify this degree of uncertainty (surprisingness)
54

Another example
• Consider the sample of all people in this lecture theatre. Each person is labelled young (<30 years) or old (≥30 years)
• Randomly select a person and inspect whether they are young or old
  • How surprised am I likely to be by the outcome?
• Suppose I repeat the experiment using a random sample of people catching the train to the city in peak hour?
• How surprised am I likely to be by the outcome?
55

Entropy
• Given a feature X, H(X) denotes its entropy, assuming X takes values in a number of categories (bins)
• We may sometimes write p(i) instead of p
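The entropy formula itself did not survive the slide conversion; the standard definition, which the worked examples that follow use (with logs to base 2), is:

$$ H(X) = -\sum_{i} p(i)\,\log_2 p(i) $$

where the sum runs over the categories (bins) of X and p(i) is the proportion of objects falling in bin i.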
56

Entropy of a Random Variable
• E.g. Suppose there are 3 bins and each bin contains exactly one third of the objects (points)
• H(X) = -[ 0.33 × log(0.33) + 0.33 × log(0.33) + 0.33 × log(0.33) ] = log 3 ≈ 1.58
57

Another Entropy example
A, B, B, A, C, C, C, C, A
We have 3 categories/bins (A, B, C) for a feature X, and 9 objects, each in exactly one of the 3 bins.
What is the entropy of this sample of 9 objects? Answer: H(X) = 1.53
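A small helper (my own sketch) that reproduces this calculation:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy (in bits) of a sample of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

print(round(entropy(['A', 'B', 'B', 'A', 'C', 'C', 'C', 'C', 'A']), 2))  # 1.53
```

The same helper answers the “Likes to sleep” question on the next slide: four Yes, two Never, one No and one Maybe give an entropy of 1.75 bits.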
58

How would you compute the entropy for the “Likes to sleep” feature?
Person | Likes to sleep
1 | Yes
2 | No
3 | Maybe
4 | Never
5 | Yes
6 | Yes
7 | Never
8 | Yes
59

Properties of entropy
• H(X) ≥ 0
• Entropy – when using log base 2 – measures uncertainty of the outcome in bits. This can be viewed as the information associated with learning the outcome
• Entropy is maximized for uniform distribution (highly uncertain what value a randomly selected object will have)
60

Conditional entropy – intuition
Suppose I randomly sample a person. I check if they wear glasses – how surprised am I by their age?
Person | WearGlasses (X) | Age (Y)
1 | No | young
2 | No | young
3 | No | young
4 | No | young
5 | Yes | old
6 | Yes | old
7 | Yes | old
61

Conditional entropy H(Y|X)
Measures how much information is needed to describe outcome Y, given that outcome X is known. Suppose X is Height and Y is Weight.
Object | Height (X) | Weight (Y)
1 | big | light
2 | big | heavy
3 | small | light
4 | small | light
5 | small | light
6 | small | light
7 | small | heavy
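The conditional entropy formula did not survive the conversion; the standard definition, matching the calculation on the next slide, is:

$$ H(Y \mid X) = \sum_{x} p(x)\,H(Y \mid X = x) = -\sum_{x} p(x)\sum_{y} p(y \mid x)\,\log_2 p(y \mid x) $$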
62

Conditional entropy
Object | Height (X) | Weight (Y)
1 | big | light
2 | big | heavy
3 | small | light
4 | small | light
5 | small | light
6 | small | light
7 | small | heavy

H(Y|X) = 2/7 × H(Y|X=big) + 5/7 × H(Y|X=small)
       = 2/7 × (-0.5 log 0.5 - 0.5 log 0.5) + 5/7 × (-0.8 log 0.8 - 0.2 log 0.2)
       = 0.801
63

Mutual information definition
• Where X and Y are features (columns) in a dataset
• MI (mutual information) is a measure of correlation
• The amount of information about X we gain by knowing Y, or
• The amount of information about Y we gain by knowing X
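The formula image did not survive the conversion; the definition used in the examples that follow is:

$$ MI(X, Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y) $$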
64

Mutual information example
Object | Height (X) | Weight (Y)
1 | small | light
2 | big | heavy
3 | small | light
4 | small | light
5 | small | light
6 | small | light
7 | small | heavy

H(Y) = 0.8631
H(Y|X) = 0.5572
H(Y) - H(Y|X) = 0.306 (how much information about Y is gained by knowing X)
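A sketch (my own) that reproduces these numbers from the table of 7 objects:

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X): weighted average of the entropy of y within each value of x."""
    n = len(x)
    h = 0.0
    for x_val in set(x):
        subset = [yi for xi, yi in zip(x, y) if xi == x_val]
        h += (len(subset) / n) * entropy(subset)
    return h

height = ['small', 'big', 'small', 'small', 'small', 'small', 'small']
weight = ['light', 'heavy', 'light', 'light', 'light', 'light', 'heavy']

h_y  = entropy(weight)                       # about 0.863
h_yx = conditional_entropy(weight, height)   # about 0.557
print(h_y, h_yx, h_y - h_yx)                 # mutual information: about 0.306
```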
65

Mutual information example 2
Object | Height (X) | Weight (Y)
1 | big | light
2 | big | heavy
3 | small | light
4 | small | jumbo
5 | medium | light
6 | medium | light
7 | small | heavy

H(Y) = 1.379
H(Y|X) = 0.965
H(Y) - H(Y|X) = 0.414
66

Properties of Mutual Information
• The amount of information shared between two variables X and Y
• MI(X,Y)
• large: X and Y are highly correlated (more dependent)
• small: X and Y have low correlation (more independent)
• 0 ≤ MI(X,Y)
• Sometimes also referred to as ‘Information Gain’
67

Mutual information: normalisation
• MI(X,Y) is always at least zero, may be larger than 1
• In fact, one can show it is true that
• 0 ≤ MI(X,Y) ≤ min(H(X), H(Y))
• (where min(a,b) indicates the minimum of a and b)
• Thus, if we want a measure in the interval [0,1], we can define normalised mutual information (NMI)
• NMI(X,Y) = MI(X,Y) / min(H(X), H(Y))
• NMI(X,Y)
• large: X and Y are highly correlated (more dependent)
• small: X and Y have low correlation (more independent)
68

Pearson correlation = -0.0864
Normalised mutual information (NMI) = 0.43 (3-bin equal-spread bins)
69

Pearson correlation = -0.1
Normalised mutual information (NMI) = 0.84
70

Pearson correlation = -0.05
Normalised mutual information (NMI) = 0.35
71

Examples
• Pearson?
• NMI?
72

Examples
• Pearson: 0.08
• NMI: 0.009
73

Computing MI with class features
Identifying features that are highly correlated with a class feature
HoursSleep | HoursExercise | HairColour | HoursStudy | Happy (class feature)
12 | 20 | Brown | low | Yes
11 | 18 | Black | low | Yes
10 | 10 | Red | medium | Yes
10 | 9 | Black | medium | Yes
10 | 10 | Red | high | No
7 | 11 | Red | high | No
6 | 15 | Brown | high | No
2 | 13 | Brown | high | No

Compute MI(HoursSleep, Happy), MI(HoursExercise, Happy), MI(HoursStudy, Happy) and MI(HairColour, Happy). Retain the most predictive feature(s).
74

Computing MI with class features
HoursSleep | HoursExercise | HairColour | HoursStudy | Happy (class feature)
12 | 20 | Brown | low | Yes
11 | 18 | Black | low | Yes
10 | 10 | Red | medium | No
10 | 9 | Black | medium | Yes
10 | 10 | Red | high | No
7 | 11 | Black | high | No
6 | 15 | Brown | high | No
2 | 13 | Brown | high | No

• MI(HairColour, Happy) = 0.27 (NMI = 0.28)
• MI(HoursStudy, Happy) = 0.70 (NMI = 0.74)
• ….
• Can rank features according to their predictiveness, then focus further on just these (see the sketch below)
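A feature-ranking sketch (my own) that reproduces the MI and NMI values above for the two categorical features; HoursSleep and HoursExercise would first need to be discretised, so they are omitted here:

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X): weighted average of the entropy of y within each value of x."""
    n = len(x)
    total = 0.0
    for v in set(x):
        subset = [yi for xi, yi in zip(x, y) if xi == v]
        total += (len(subset) / n) * entropy(subset)
    return total

def mi(x, y):
    return entropy(y) - conditional_entropy(y, x)

def nmi(x, y):
    return mi(x, y) / min(entropy(x), entropy(y))

happy       = ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No']
hair_colour = ['Brown', 'Black', 'Red', 'Black', 'Red', 'Black', 'Brown', 'Brown']
hours_study = ['low', 'low', 'medium', 'medium', 'high', 'high', 'high', 'high']

print(f"HairColour: MI={mi(hair_colour, happy):.2f}, NMI={nmi(hair_colour, happy):.2f}")
print(f"HoursStudy: MI={mi(hours_study, happy):.2f}, NMI={nmi(hours_study, happy):.2f}")
# HoursStudy ranks higher, so it is the more predictive of the two features
```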
75

Advantages and disadvantages of MI
• Advantages
• Can detect both linear and non-linear dependencies (unlike Pearson)
• Applicable and very effective for use with discrete features (unlike Pearson correlation)
• Disadvantages
• If a feature is continuous, it must first be discretised to compute mutual information. This involves making choices about what bins to use.
• The right choice may not be obvious, and different bin choices will lead to different estimates of mutual information.
76

Choose the statement that is not true
• Mutual information can’t detect the existence of linear relationships in the data, but Pearson correlation can
• Mutual information can be used to assess how much one feature is associated with another
• Mutual information can detect the existence of non-linear relationships in the data
• Mutual information doesn’t indicate the “direction” of a relationship, just its strength
77

Question 2ai) from 2016 exam
a) Richard is a data wrangler. He does a survey and constructs a dataset recording average time/day spent studying and average grade for a population of 1000 students:
Student Name | Average time per day spent studying | Average Grade
…            | …                                    | …
i) Richard computes the Pearson correlation coefficient between Average time per day studying and Average grade and obtains a value of 0.85. He concludes that more time spent studying causes a student’s grade to increase. Explain the limitations with this reasoning and suggest two alternative explanations for the 0.85 result.
78

Question 2aii) from 2016 exam
a) Richard is a data wrangler. He does a survey and constructs a dataset recording average time/day spent studying and average grade for a population of 1000 students:
Student Name | Average time per day spent studying | Average Grade
…            | …                                    | …
ii) Richard separately discretises the two features Average time per day spent studying and Average grade, each into 2 bins. He then computes the normalised mutual information between these two features and obtains a value of 0.1, which seems surprisingly low to him. Suggest two reasons that might explain the mismatch between the normalised mutual information value of 0.1 and the Pearson Correlation coefficient of 0.85. Explain any assumptions made.
79

Points to remember (1)
• be able to explain why identifying correlations is useful for data wrangling/analysis
• understand what correlation between a pair of features means
• understand how correlation can be identified using visualisation
• understand the concept of a linear relation, versus a nonlinear relation for a pair of features
• understand why the concept of correlation is important, where it is used and understand why correlation is not the same as causation
• understand the use of Euclidean distance for computing correlation between two features and its advantages/disadvantages
80

Points to remember (2)
• understand the use of Pearson correlation coefficient for computing correlation between two features and its advantages/disadvantages
• understand the meaning of the variables in the Pearson correlation coefficient formula and how they can be calculated. Be able to compute this coefficient on a simple pair of features. The formula for this coefficient will be provided on the exam.
• be able to interpret the meaning of a computed Pearson correlation coefficient
• understand the advantages and disadvantages of using the Pearson correlation coefficient for assessing the degree of relationship between two features
81

Points to remember (3)
• understand the advantages and disadvantages of using mutual information for computing correlation between a pair of features. Understand the main differences between this and Pearson correlation.
• understand the meaning of the variables for mutual information and how they can be calculated. Be able to compute this measure on a simple pair of features. The formula for mutual information will be provided in the exam.
• understand the role of data discretisation in computing mutual information
• understand the meaning of the entropy of a random variable and how to
interpret an entropy value. Understand its extension to conditional entropy
• be able to interpret the meaning of the mutual information between two features
82

Points to remember (4)
• understand the use of mutual information for computing correlation of some feature with a class feature and why this is useful. Understand how this provides a ranking of features, according to their predictiveness of the class
• understand that normalised mutual information can be used to provide a more interpretable measure of correlation than mutual information. The formula for normalised mutual information will be provided on the exam
83

Acknowledgements
• Materials are partially adapted from:
• Previous COMP20008 slides, including material produced by James Bailey, Pauline Lin, Chris Ewin, Uwe Aickelin and others
• Interactive correlation calculator: http://www.bc.edu/research/intasc/library/correlation.shtml
• Correlation <> Causality: http://tylervigen.com/spurious-correlations
• Google trends correlation
84