What is Correlation?
• Correlation is used to detect pairs of variables that might have some relationship.
• Can be identified visually by inspecting scatter plots
• Correlation does not necessarily imply causality!
• Feature ranking: select the best features for building better predictive models
• A good feature is one that has high correlation with the outcome one is trying to predict
https://rpsychologist.com/d3/correlation/
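The feature-ranking idea above can be sketched in a few lines. This is a minimal illustration rather than a prescribed method: the feature names and values are made up, and `pearson` and `rank_features` are hypothetical helper names. Features are ranked by the absolute value of the coefficient, on the assumption that a strong negative correlation is as useful for prediction as a strong positive one.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def rank_features(features, target):
    """Return (name, r) pairs sorted by |r|, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Made-up illustration data (not from the slides).
features = {
    "f_linear": [1, 2, 3, 4, 5, 6],
    "f_noisy": [2, 1, 4, 3, 6, 5],
    "f_unrelated": [3, 1, 2, 3, 1, 2],
}
target = [10, 20, 30, 40, 50, 60]

ranking = rank_features(features, target)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

A ranking like this only reflects linear association with the outcome, so a feature with a strong non-linear relationship could still score low.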
Assessing linear correlation – Pearson correlation
• Assess how close the scatter plot of two variables is to a straight line (a linear relationship)
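Calculation
$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
Micro Example (Warm Up)
• Calculate the Pearson coefficient between X and Y
[Working table columns: $x_i-\bar{x}$, $(x_i-\bar{x})^2$, $y_i-\bar{y}$, $(y_i-\bar{y})^2$, $(x_i-\bar{x})(y_i-\bar{y})$]
$$r_{xy} = \frac{5}{\ldots} = 0.9449$$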
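The calculation can be followed step by step in code by building the same working columns (deviations, squared deviations and cross-products) and then combining their sums. The sketch below uses hypothetical X and Y values, since the warm-up data are not reproduced here, so the resulting coefficient differs from 0.9449.

```python
import math

# Hypothetical data, NOT the X and Y from the warm-up example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# One row per observation: (x - x_bar, (x - x_bar)^2, y - y_bar, (y - y_bar)^2, product)
rows = []
for xi, yi in zip(x, y):
    dx = xi - x_bar
    dy = yi - y_bar
    rows.append((dx, dx ** 2, dy, dy ** 2, dx * dy))

sum_dx2 = sum(row[1] for row in rows)
sum_dy2 = sum(row[3] for row in rows)
sum_dxdy = sum(row[4] for row in rows)

# Pearson coefficient: sum of cross-products over the product of the two root sums of squares.
r_xy = sum_dxdy / (math.sqrt(sum_dx2) * math.sqrt(sum_dy2))

for row in rows:
    print("  ".join(f"{value:6.2f}" for value in row))
print(f"r_xy = {r_xy:.4f}")
```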
1. Compute the Pearson correlation between Average Steps per day and Average Resting Heart Rate. Show your working. How would you interpret this correlation value?
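[Table: Average Steps per day ($x$), Average Resting Heart Rate ($y$)]
$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i$$
$$r_{xy} = \frac{-1128833.3}{\sqrt{616166666.7}\times\sqrt{2736.2}} = -0.86937$$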
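[Working table columns: Average Steps per day ($x_i$), Average Resting Heart Rate ($y_i$), $(x_i-\bar{x})(y_i-\bar{y})$, $(x_i-\bar{x})^2$, $(y_i-\bar{y})^2$]
$(x_i-\bar{x})^2$: 96694444.44, 69444444.44, 61361111.11, 34027777.78, 23361111.11, 3361111.111, 27777.77778, 10027777.78, 51361111.11, 66694444.44, 75111111.11, 124694444.4
$\sum(x_i-\bar{x})(y_i-\bar{y}) = -1128833.3$
$\sum(x_i-\bar{x})^2 = 616166666.7$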
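The arithmetic of this answer can be checked from the sums alone. The sketch below assumes that the twelve values listed above are the squared deviations of Average Steps per day (they add up to the 616166666.7 total), and it takes the cross-product sum and the heart-rate sum of squares (2736.2) directly from the worked solution, since the individual values for those columns are not listed.

```python
import math

# Assumed to be the (x_i - x_bar)^2 column for Average Steps per day.
sq_dev_steps = [
    96694444.44, 69444444.44, 61361111.11, 34027777.78,
    23361111.11, 3361111.111, 27777.77778, 10027777.78,
    51361111.11, 66694444.44, 75111111.11, 124694444.4,
]

sum_sq_dev_steps = sum(sq_dev_steps)  # should match 616166666.7 up to rounding
sum_cross_products = -1128833.3       # sum of (x_i - x_bar)(y_i - y_bar), from the worked solution
sum_sq_dev_rate = 2736.2              # sum of (y_i - y_bar)^2, from the worked solution

r_xy = sum_cross_products / math.sqrt(sum_sq_dev_steps * sum_sq_dev_rate)

print(f"sum of squared deviations (steps) = {sum_sq_dev_steps:.1f}")
print(f"r_xy = {r_xy:.5f}")
```

The negative sign comes entirely from the cross-product sum: days with above-average steps tend to pair with below-average resting heart rate.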
2. Based on the Pearson correlation value, can one conclude that doing more steps per day will cause one’s average resting heart rate to decrease? How else might it be interpreted?
$r_{xy} = -0.86937$
• There is a relationship between the two factors, but one can’t conclude it is causal.
• The data sample is very small and could be biased.
• There could also be a third factor controlling both (e.g. high blood pressure could cause a high heart rate, and could also cause a person to be less physically active and thus take fewer steps).
• THM: Correlation does not imply causality
• Limitation of the coefficient
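One commonly cited limitation of the Pearson coefficient, which may be what the limitation point above refers to, is that it only captures linear relationships. The sketch below uses made-up data where Y is fully determined by X (Y = X²) and yet the coefficient is zero, because positive and negative deviations cancel.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data: a perfect but non-linear (quadratic) dependency.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_quadratic = pearson(x, y)
print(f"r for y = x^2 on a symmetric range: {r_quadratic:.4f}")
```

A coefficient near zero therefore says "no linear relationship", not "no relationship".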
How do we measure non-linear correlation?
Variable discretization: Techniques
• Domain knowledge
• Equal-length bin
• Equal-frequency bin
[Figure: examples of discretised values: High / Med / Low; Cold / Normal / Hot]
3. Discretise the data as follows: Apply 3 bin equal frequency discretisation to Average Steps per day and 4 bin equal frequency discretisation to Average Resting Heart Rate. Show the values of the discretised features.
• Discretization techniques: Manual thresholds (domain knowledge), Equal-width bin and Equal-frequency bin
[Table: Average Steps per day (column 1, sorted) | Disc Average Steps per day (column 2, sorted discrete values) | Average Resting Heart Rate | Disc Average Resting Heart Rate]
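The two automatic binning techniques can be sketched as follows. This is a minimal illustration on made-up step counts, not the data from the exercise; `equal_frequency_bins` and `equal_width_bins` are hypothetical helper names, bins are labelled 1..k from lowest to highest, and tied values are split arbitrarily in the equal-frequency version.

```python
from collections import Counter

def equal_frequency_bins(values, k):
    """Label each value 1..k so that each bin holds (roughly) the same number of values."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices from smallest to largest value
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // n + 1
    return labels

def equal_width_bins(values, k):
    """Label each value 1..k using k intervals of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [1] * len(values)  # all values identical: a single bin
    # The maximum value would fall into bin k + 1, so it is clamped into bin k.
    return [min(int((v - lo) / width), k - 1) + 1 for v in values]

# Made-up step counts (12 values), NOT the exercise data.
steps = [5000, 12000, 3000, 9000, 15000, 7000, 11000, 2000, 14000, 8000, 6000, 20000]

freq_labels = equal_frequency_bins(steps, 3)
width_labels = equal_width_bins(steps, 3)

print("equal frequency:", freq_labels, dict(Counter(freq_labels)))
print("equal width:    ", width_labels, dict(Counter(width_labels)))
```

On this data the equal-frequency bins each hold 4 values, while the equal-width bins hold 5, 4 and 3, because the single large value stretches the range.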
Entropy and Mutual Information
• Entropy quantifies the amount of uncertainty in an entire probability distribution
• The entropy of a variable is the “amount of information” contained in the variable
• It can be described in terms of randomness or surprise
• Conditional entropy
• H(Y|X) measures how much information is needed to describe outcome Y, given that outcome X is known
• Mutual Information
• A measure of correlation: the amount of information shared between two variables X and Y
• MI(X,Y) >= 0; a large value → highly correlated
4. Using the discretised features, compute the entropies:
• H(Average Steps per day)
• H(Average Resting Heart Rate)
• H(Average Steps per day | Average Resting Heart Rate)
• H(Average Resting Heart Rate | Average Steps per day)
[Table: Disc Average Steps per day, Disc Average Resting Heart Rate]
Entropy:
$$H(p) = -\sum_{i=1}^{k} p(i)\log_2 p(i)$$
Conditional Entropy:
$$H(Y|X) = \sum_{x \in X} p(x)\,H(Y|X=x)$$
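The two formulas translate almost directly into code. The sketch below assumes base-2 logarithms (entropy in bits) and estimates probabilities as relative frequencies of discrete labels; the toy labels and the function names `entropy` and `conditional_entropy` are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """H(p) = -sum p(i) log2 p(i), with p(i) estimated as relative frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(y, x):
    """H(Y|X) = sum over x of p(x) * H(Y | X = x)."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)  # collect the Y labels seen with each X value
    return sum(len(g) / n * entropy(g) for g in groups.values())

# Toy labels (not the exercise data).
x = ["a", "a", "b", "b"]
y = [0, 1, 0, 0]

h_y = entropy(y)
h_y_given_x = conditional_entropy(y, x)
print(f"H(Y) = {h_y:.3f}, H(Y|X) = {h_y_given_x:.3f}")
```

Knowing X removes some uncertainty about Y here: within X = "b" the value of Y is fixed, so only the X = "a" group contributes to H(Y|X).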
1. H(Average Steps per day)
• $= -\sum_{i=1}^{k} p(i)\log_2 p(i)$
• $= -\frac{4}{12}\log_2\frac{4}{12} - \frac{4}{12}\log_2\frac{4}{12} - \frac{4}{12}\log_2\frac{4}{12}$
• $= -3\left(\frac{4}{12}\log_2\frac{4}{12}\right)$
• $= -3\left(\frac{1}{3}\times -1.585\right) = 1.585$
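2. H(Average Resting Heart Rate)
• $= -\sum_{i=1}^{k} p(i)\log_2 p(i)$
• $= -\frac{3}{12}\log_2\frac{3}{12} - \frac{3}{12}\log_2\frac{3}{12} - \frac{3}{12}\log_2\frac{3}{12} - \frac{3}{12}\log_2\frac{3}{12}$
• $= -4\left(\frac{3}{12}\log_2\frac{3}{12}\right)$
• $= -4\left(\frac{1}{4}\times -2\right) = 2$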
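Both marginal entropies can be checked from the bin counts alone: 3 bins of 4 records for steps and 4 bins of 3 records for heart rate. The sketch below (with a hypothetical helper `entropy_from_counts`) also illustrates that a uniform distribution over k bins has entropy log2(k), which is why equal-frequency discretisation gives the largest entropy possible for a given number of bins.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of the distribution given by a list of bin counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

h_steps = entropy_from_counts([4, 4, 4])    # 3 equal-frequency bins over 12 records
h_rate = entropy_from_counts([3, 3, 3, 3])  # 4 equal-frequency bins over 12 records

print(f"H(S) = {h_steps:.3f}  (log2(3) = {math.log2(3):.3f})")
print(f"H(R) = {h_rate:.3f}  (log2(4) = {math.log2(4):.3f})")
```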
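3. H(Average Steps per day | Average Resting Heart Rate) → H(S|R)
• $H(S|R) = \sum_{r \in R} p(r)\,H(S|R=r)$
• $= p(R=4)H(S|R=4) + p(R=3)H(S|R=3) + p(R=2)H(S|R=2) + p(R=1)H(S|R=1)$
• $= \frac{3}{12}H(S|R=4) + \frac{3}{12}H(S|R=3) + \frac{3}{12}H(S|R=2) + \frac{3}{12}H(S|R=1)$
• $H(S|R=4) = -1\log_2 1 = 0$
• $H(S|R=3) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.918$
• $H(S|R=2) = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.918$
• $H(S|R=1) = -1\log_2 1 = 0$
• $H(S|R) = 0.25\,(0 + 0.918 + 0.918 + 0) = 0.459$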
4. H(Average Resting Heart Rate | Average Steps per day) → H(R|S)
• $H(R|S) = \sum_{s \in S} p(s)\,H(R|S=s)$
• $= p(S=1)H(R|S=1) + p(S=2)H(R|S=2) + p(S=3)H(R|S=3)$
• $= \frac{4}{12}H(R|S=1) + \frac{4}{12}H(R|S=2) + \frac{4}{12}H(R|S=3)$
• $H(R|S=1) = -0.75\log_2 0.75 - 0.25\log_2 0.25 = 0.311 + 0.5 = 0.811$
• $H(R|S=2) = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 0.5 + 0.5 = 1$
• $H(R|S=3) = -0.25\log_2 0.25 - 0.75\log_2 0.75 = 0.5 + 0.311 = 0.811$
• $H(R|S) = \frac{1}{3}\,(0.811 + 1 + 0.811) = 0.874$
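Both conditional entropies can be reproduced from a joint count table of the discretised features. The table below is an assumption: it is inferred from the conditional probabilities used in the worked steps (for example 0.75 / 0.25 within S = 1), and the pairing of particular S bins with particular R bins is a guess consistent with the negative correlation. The entropies do not depend on how the bins are labelled. Helper names are hypothetical.

```python
import math

# joint[s][r] = number of records with steps bin s and heart-rate bin r.
# ASSUMED table, inferred from the probabilities in the worked solution.
joint = {
    1: {4: 3, 3: 1},
    2: {3: 2, 2: 2},
    3: {2: 1, 1: 3},
}

def entropy_from_counts(counts):
    """Entropy in bits of the distribution given by a list of counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def conditional_entropy_given_rows(table):
    """H(column variable | row variable) for a nested dict of counts."""
    total = sum(sum(row.values()) for row in table.values())
    return sum(
        sum(row.values()) / total * entropy_from_counts(list(row.values()))
        for row in table.values()
    )

def transpose(table):
    """Swap the roles of the row and column variables."""
    flipped = {}
    for row_key, row in table.items():
        for col_key, count in row.items():
            flipped.setdefault(col_key, {})[row_key] = count
    return flipped

h_r_given_s = conditional_entropy_given_rows(joint)             # rows are S bins
h_s_given_r = conditional_entropy_given_rows(transpose(joint))  # rows are R bins

print(f"H(R|S) = {h_r_given_s:.3f}")
print(f"H(S|R) = {h_s_given_r:.3f}")
```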
5. Using the above information, compute the mutual information between Average Steps per day and Average Resting Heart Rate.
• H(Average Steps per day) = H(S) = 1.585
• H(Average Resting Heart Rate) = H(R) = 2
• H(Average Steps per day | Average Resting Heart Rate) = H(S|R) = 0.459
• H(Average Resting Heart Rate | Average Steps per day) = H(R|S) = 0.874
• $MI(R,S) = H(R) - H(R|S) = 2 - 0.874 = 1.126$
• $MI(R,S) = H(S) - H(S|R) = 1.585 - 0.459 = 1.126$
Mutual Information:
$$MI(R,S) = H(R) - H(R|S) = H(S) - H(S|R)$$
Normalised Mutual Information:
$$NMI(R,S) = \frac{MI(R,S)}{\min(H(S),\,H(R))}$$
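The final step is plain arithmetic on the four entropies, so it is easy to check in code. The sketch below takes the rounded values from the worked solution as inputs and also evaluates the normalised form; the NMI value is not stated in the solution, so treat the printed figure as a derived illustration.

```python
# Rounded entropies taken from the worked solution.
h_s = 1.585          # H(S): Average Steps per day
h_r = 2.0            # H(R): Average Resting Heart Rate
h_s_given_r = 0.459  # H(S|R)
h_r_given_s = 0.874  # H(R|S)

# Mutual information computed from either side should agree.
mi_from_r = h_r - h_r_given_s
mi_from_s = h_s - h_s_given_r

# Normalise by the smaller marginal entropy so the result lies in [0, 1].
nmi = mi_from_r / min(h_s, h_r)

print(f"MI via H(R) - H(R|S) = {mi_from_r:.3f}")
print(f"MI via H(S) - H(S|R) = {mi_from_s:.3f}")
print(f"NMI = {nmi:.3f}")
```

Dividing by min(H(S), H(R)) bounds the score by 1 because mutual information can never exceed the entropy of either variable.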