Statistical Inference STAT 431
Lecture 3: Summarizing Data – Multiple Variables

Multiple Numerical Variables: Scatter Plots
• Goal: To visualize the relationship between a pair of numerical variables
• Example: Pearson’s Father-Son Height Data

[Figure: scatter plot of son’s height against father’s height, in inches; both axes run from 60 to 75.]
Father’s height (inch)
65.05 63.25 64.95 65.75
Son’s height (inch)
59.78 63.21 63.34 62.79
…… 71.33 68.27 71.78 69.31 70.74 69.30 70.31 67.02

Example: House Prices
• 439 house prices from 2003 in ZIP code 30062
• Two numerical variables of interest: Building Area (in SQFT), Price (in $1000)
[Figure: scatter plot of Price ($1000, axis from 100 to 600) against Building Area (SQFT, axis from 1000 to 4000); a few points are marked with red circles.]
• Positive association between building area and price
• Points with red circles seem to be far away from the majority of data

Association between Numerical Variables
• Positively associated if increased values of one variable tend to occur with increased values of the other
• Negatively associated if increased values of one variable tend to occur with decreased values of the other
• Pearson’s Data: son’s height positively associated with father’s height
• House Price Data: house price positively associated with building area
• Caution: association is NOT proof of causation
E.g., Reading ability of teenagers is positively associated with shoe size
• Sometimes, associations in datasets are not just positive or negative, but also appear to be linear

Chocolate Consumption, Cognitive Function, and Nobel Laureates
Franz H. Messerli, M.D.
N Engl J Med 2012; 367:1562-1564 October 18, 2012

Linear Association between Two Numerical Variables
• Data: paired observations of two numerical variables
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
• Sample means and sample SDs: $\bar{x}$, $s_x$, $\bar{y}$, $s_y$
Two summary statistics of linear relationship:
• The sample covariance between x and y is
$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• The sample correlation (Pearson’s correlation) between x and y is
$$r = \frac{s_{xy}}{s_x \cdot s_y}$$
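The two summary statistics can be computed directly from their definitions. The following is a minimal sketch in plain Python; the function names and the toy data are assumptions made for illustration and are not part of the lecture.

```python
import math

def sample_mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sample_cov(x, y):
    """Sample covariance s_xy with the 1/(n-1) divisor."""
    x_bar, y_bar = sample_mean(x), sample_mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (len(x) - 1)

def sample_sd(values):
    """Sample SD, using the fact that s_x squared equals s_xx."""
    return math.sqrt(sample_cov(values, values))

def sample_corr(x, y):
    """Pearson's sample correlation r = s_xy / (s_x * s_y)."""
    return sample_cov(x, y) / (sample_sd(x) * sample_sd(y))

# Toy data (made up for illustration): y is exactly 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(sample_cov(x, y))   # covariance carries units of x times units of y
print(sample_corr(x, y))  # correlation is unit-free
```

Because the toy y values lie exactly on an increasing line in x, the correlation should come out as 1 up to floating-point rounding.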

Sample Correlation
$$r = \frac{s_{xy}}{s_x \cdot s_y}$$
• Properties
– Has no unit
– Always satisfies $-1 \le r \le 1$
• $r = 1$: the relationship between x and y is exactly positive linear
• $r = -1$: the relationship between x and y is exactly negative linear
• $r = 0$: no linear relationship (what if the $y_i$’s are all the same?)
– Symmetric: swapping x and y leaves r unchanged
– Invariant under linear transformations of x and y with positive slopes (a negative slope on exactly one of the two variables flips the sign of r)
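These properties can be checked numerically. The sketch below uses a small made-up data set and a hypothetical helper `corr`; the rescaling constants (2.54, 10, 3) are arbitrary choices for illustration.

```python
import math

def corr(x, y):
    """Pearson's sample correlation; the 1/(n-1) factors cancel in the ratio."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data (made up for illustration).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = corr(x, y)
r_swapped = corr(y, x)                                 # symmetry
r_rescaled = corr([2.54 * a for a in x],               # change of units in x
                  [10 + 3 * b for b in y])             # shift and rescale y
r_flipped = corr(x, [-b for b in y])                   # negative slope on y only

print(r, r_swapped, r_rescaled, r_flipped)
```

The first three values should agree, and the last should be the same number with the opposite sign.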

Pearson’s Data and Housing Data
[Figure: two scatter plots. Pearson’s Data: son’s height against father’s height (both axes 60 to 75). Housing Data: Price ($1000, 100 to 600) against Building Area (SQFT, 1000 to 4000).]
• Pearson’s Data: r = 0.50
• Housing Data: r = 0.82

Guessing Sample Correlations (1)
[Figure: four scatter plots of y against x; x axes run from −2 to 2, y axes from about −3 to 3.]

Guessing Sample Correlations (2)
[Figure: four scatter plots of the variables V1 to V8 (horizontal axes V1, V3, V5, V7; vertical axes V2, V4, V6, V8); all axes run from 0 to 20.]

Guessing Sample Correlations (3)
[Figure: three scatter plots of y against x; all axes run from −4 to 4.]

Use Sample Correlation with Caution
• A good summary statistic only if relationship is (roughly) linear
• Cannot be used to measure strength of nonlinear relationships
• Simpson’s paradox
• It is always a good idea to plot the data first!
– Linearity
– Groups
– Potential outliers (robustness)
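Two of these cautions can be illustrated with tiny made-up data sets: an exact but nonlinear relationship whose sample correlation is zero, and a single outlying point that reverses the sign of r. The helper `corr` and all numbers below are assumptions for illustration only.

```python
import math

def corr(x, y):
    """Pearson's sample correlation."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Exact quadratic relationship y = x^2 on x values symmetric around 0.
x_quad = [-2, -1, 0, 1, 2]
y_quad = [a ** 2 for a in x_quad]
r_quadratic = corr(x_quad, y_quad)   # y is determined by x, yet r is 0

# Roughly linear toy data, then the same data with one extreme point added.
x_lin = [1, 2, 3, 4, 5]
y_lin = [2, 1, 4, 3, 5]
r_clean = corr(x_lin, y_lin)
r_outlier = corr(x_lin + [6], y_lin + [-10])

print(r_quadratic, r_clean, r_outlier)
```

A scatter plot of either data set would reveal the curve or the outlier immediately, which is why plotting comes first.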

Put A Line on A Scatter Plot
• When the relationship is (roughly) linear, we can put a line on a scatter plot to indicate it.
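• Equation for a line:
$$y = \beta_0 + \beta_1 x$$
where $\beta_0$ is the intercept and $\beta_1$ is the slope (together, the regression coefficients)
• Q: How to find the “best” line?
A: Find $(\beta_0, \beta_1)$ that minimizes the objective function
$$\sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$$
[the least squares (LS) method]
• Solution:
$$\hat{\beta}_1 = r \cdot \frac{s_y}{s_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$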
• More details later in the course: simple linear regression
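The closed-form LS solution can be sketched in a few lines. The code below is an illustration under assumed names (`least_squares_line`, `sum_of_squares`) and toy data; it computes the slope as r times the ratio of sample SDs, then checks that nudging the line away from the solution increases the objective.

```python
import math

def least_squares_line(x, y):
    """Return (intercept, slope) from slope = r * s_y / s_x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    r = s_xy / (s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_of_squares(x, y, intercept, slope):
    """LS objective: sum of squared vertical distances to the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Toy data (made up for illustration).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

b0, b1 = least_squares_line(x, y)
best = sum_of_squares(x, y, b0, b1)
print(b0, b1, best)
print(sum_of_squares(x, y, b0 + 0.1, b1))  # shifted line: larger objective
print(sum_of_squares(x, y, b0, b1 - 0.1))  # flatter line: larger objective
```

Comparing against two perturbed lines is only a spot check, not a proof of optimality.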

Examples
[Figure: two scatter plots. Son’s height against father’s height (both axes 60 to 75), and Price ($1000, 100 to 600) against Building Area (SQFT, 1000 to 4000).]

Examples
• Which line is a better fit, 1) solid or 2) dashed?
[Figure: scatter plot of son’s height against father’s height (both axes 60 to 75) with two candidate lines, one solid and one dashed.]

Multiple Categorical Variables: Contingency Tables
• Relationships between several categorical variables can be examined with a contingency table
• Construction: display the frequency for each possible combination of categories
• Example: Berkeley graduate admission data (1973)
– Three variables: (1) gender, (2) admission status, (3) major applied to
– Only look at Gender vs. Admission status:

         Admitted   Denied   % Admitted
Men        1197      1494        44
Women       557      1278        30
(Reconstructed from Table 4.11, p.132 of textbook)
• Bias against women?
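Building a contingency table amounts to counting how often each combination of categories occurs. As a minimal sketch, the code below expands the slide’s four counts into one record per applicant (an artificial step, used only to mimic raw data) and tallies them with `collections.Counter`; the variable and function names are assumptions.

```python
from collections import Counter

# One (gender, status) record per applicant, expanded from the slide's counts.
records = (
    [("Men", "Admitted")] * 1197 + [("Men", "Denied")] * 1494
    + [("Women", "Admitted")] * 557 + [("Women", "Denied")] * 1278
)

# The contingency table: frequency of each combination of categories.
counts = Counter(records)

def percent_admitted(gender):
    """Row percentage: admitted applicants out of all applicants of that gender."""
    admitted = counts[(gender, "Admitted")]
    denied = counts[(gender, "Denied")]
    return 100 * admitted / (admitted + denied)

for gender in ("Men", "Women"):
    print(gender,
          counts[(gender, "Admitted")],
          counts[(gender, "Denied")],
          round(percent_admitted(gender)))
```

The rounded row percentages should match the 44 and 30 shown on the slide.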

An Interesting Phenomenon: Simpson’s Paradox
• Stratified by the third variable: major applied

               Men                  Women
Major    Admitted   Denied    Admitted   Denied
A           511       314         89        19
B           353       207         17         8
C           120       205        202       391
D           138       279        131       244
E            53       138         94       299
F            22       351         24       317
Total      1197      1494        557      1278

         % Admitted
Major    Men   Women
A         62     82
B         63     68
C         37     34
D         33     35
E         28     24
F          6      7
Total     44     30
• Simpson’s paradox: direction of association reversed after marginalization
• It is important to stratify!
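The reversal can be reproduced from the per-major counts. The sketch below takes the counts as read from the slide’s table, compares admission rates within each major, and then pools over majors; the dictionary layout and function names are assumptions for illustration.

```python
# (admitted, denied) by major, as read from the slide's table.
men = {"A": (511, 314), "B": (353, 207), "C": (120, 205),
       "D": (138, 279), "E": (53, 138), "F": (22, 351)}
women = {"A": (89, 19), "B": (17, 8), "C": (202, 391),
         "D": (131, 244), "E": (94, 299), "F": (24, 317)}

def rate(admitted, denied):
    """Percentage admitted."""
    return 100 * admitted / (admitted + denied)

def pooled_rate(table):
    """Admission rate after summing counts over all majors (marginalizing)."""
    admitted = sum(a for a, _ in table.values())
    denied = sum(d for _, d in table.values())
    return rate(admitted, denied)

# Majors in which women were admitted at a higher rate than men.
women_higher = [m for m in men if rate(*women[m]) > rate(*men[m])]

print(women_higher)
print(round(pooled_rate(men)), round(pooled_rate(women)))
```

Women have the higher rate in most majors, yet the pooled rate is higher for men, because the two groups applied to the majors in very different proportions.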

Another Example of Simpson’s Paradox
• Two treatments for kidney stones were compared
• Researchers reviewed hospital records and computed the success rate for each treatment

               Success   Failure   % Success
Treatment A      273        77        78
Treatment B      289        61        83

• Treatment B looks better, right?

Another Example of Simpson’s Paradox
• Next, researchers looked separately at patients with small stones, and patients with large stones…
                  Treatment A             Treatment B
Stone Size    Success   Failure       Success   Failure
Small            81         6           234        36
Large           192        71            55        25
Total           273        77           289        61

% Success
Stone Size    Treatment A   Treatment B
Small             93            87
Large             73            69
Total             78            83
C. R. Charig, D. R. Webb, S. R. Payne, O. E. Wickham (March 1986) Br Med J (Clin Res Ed) 292 (6524):879-882
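One way to see why the totals reverse: each treatment’s overall success rate is a weighted average of its small-stone and large-stone rates, with weights given by the share of patients in each group. The sketch below checks this with the slide’s counts; the function names are assumptions for illustration.

```python
# (success, failure) by stone size, as read from the slide's table.
treatment_a = {"Small": (81, 6), "Large": (192, 71)}
treatment_b = {"Small": (234, 36), "Large": (55, 25)}

def success_rate(success, failure):
    return success / (success + failure)

def overall_rate(table):
    """Success rate after pooling both stone sizes."""
    success = sum(s for s, _ in table.values())
    failure = sum(f for _, f in table.values())
    return success_rate(success, failure)

def weighted_rate(table):
    """Stratum rates averaged with weights equal to each stratum's patient share."""
    total = sum(s + f for s, f in table.values())
    return sum((s + f) / total * success_rate(s, f) for s, f in table.values())

def share_large(table):
    """Fraction of a treatment's patients who had large stones."""
    total = sum(s + f for s, f in table.values())
    s, f = table["Large"]
    return (s + f) / total

for name, table in (("A", treatment_a), ("B", treatment_b)):
    print(name, overall_rate(table), weighted_rate(table), share_large(table))
```

Treatment A was used mostly on large stones, which are harder to treat under either treatment, so its pooled rate is pulled down even though it does better within each stone size.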

Class Summary
• Key points of this class:
– Scatter plot
– Sample correlation / sample covariance
– The least squares method for putting a line on a scatter plot
– Contingency table & Simpson’s paradox
• Topics in R:
– Data input and manipulation
– R routines for sample mean / SD / median / quantile / IQR / histogram / box plot / normal plot / z-scores / correlation / covariance / scatter plot
• Reading: Section 4.4 of the textbook
• Next class: Basic Concepts of Inference (I) (Ch.6.1-6.2)