Why Multivariate Analysis: Simpson’s Paradox
• Question: Is there unfair discrimination in the acceptance rate below?
Acceptance Rate at a University
               Male          Female
Accepted       275 (45.8%)   150 (37.5%)
Not Accepted   325           250
Total          600           400
Why Multivariate Analysis: Simpson’s Paradox
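No relationship between gender and acceptance within either department: the acceptance rate is 50% in CS and 25% in BIA for both males and females.
– A higher percentage of female applicants apply to the BIA program.
– It is harder to get into the BIA program than the CS program.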
Acceptance Rate at a University By Department
CS Department
               Male   Female
Accepted       250    100
Not Accepted   250    100
Total          500    200
BIA Department
               Male   Female
Accepted       25     50
Not Accepted   75     150
Total          100    200
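The reversal can be checked directly from the counts in the two department tables. The sketch below is only an illustration (the names `counts`, `by_department`, and `pooled` are made up for this example); it computes the acceptance rate within each department and again after pooling the departments.

```python
from fractions import Fraction

# Counts from the department tables: (accepted, not accepted).
counts = {
    "CS":  {"Male": (250, 250), "Female": (100, 100)},
    "BIA": {"Male": (25, 75),   "Female": (50, 150)},
}

def acceptance_rate(accepted, rejected):
    """Share of applicants who were accepted, as an exact fraction."""
    return Fraction(accepted, accepted + rejected)

# Rates within each department.
by_department = {
    dept: {g: acceptance_rate(*cell) for g, cell in groups.items()}
    for dept, groups in counts.items()
}

# Rates after pooling the two departments.
pooled = {}
for gender in ("Male", "Female"):
    accepted = sum(counts[d][gender][0] for d in counts)
    rejected = sum(counts[d][gender][1] for d in counts)
    pooled[gender] = acceptance_rate(accepted, rejected)

for dept, rates in by_department.items():
    print(dept, {g: float(r) for g, r in rates.items()})
print("Pooled", {g: round(float(r), 3) for g, r in pooled.items()})
```

Within each department the two rates are equal, while the pooled rates are 11/24 (about 45.8%) for males and 3/8 (37.5%) for females, because the pooled figures mix departments with different difficulty and different applicant mixes.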
Why Multivariate Analysis: Another example
• A study of salaries showed a negative relationship between starting salary and the level of the degree.
– For example, employees with PhDs earned less than those with Masters Degrees.
Why?
• More in-depth analysis of the data organized by employer type (e.g., universities, government, private industry) showed that:
– There was a positive relationship between starting salary and the degree level, and
– Employer type was confounded with degree level.
Multivariate Analysis (MVA)
“…observation and analysis of more than one statistical outcome variable at a time. …. the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest. ….”
Source: Wikipedia
Variables and Variable types
Variable:
“A quantity that may assume any given value or set of values” Source: (Dictionary.com)
• Continuous variables – They have numerical values and can occur anywhere within some interval. There is no fixed number of values the variable can take on.
• Discrete variables – They can be numerical or nonnumerical. There are a fixed number of values the variable can take on.
Discrete Random Variable
• Takes on no more than a countable number of values
Examples:
– Roll a die twice. Let X be the number of times 4 comes up (then X could be 0, 1, or 2 times)
– Toss a coin 5 times. Let X be the number of heads (then X = 0, 1, 2, 3, 4, or 5)
Copyright © 2013 Pearson Education, Inc. Publishing as Prentice Hall
Continuous Random Variable
• Can take on any value in an interval
– Possible values are measured on a continuum
Examples:
– Weight of packages filled by a mechanical filling process
– Temperature of a cleaning solution
– Time between failures of an electrical component
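For the coin-toss example of a discrete random variable, the full distribution of X can be listed by enumerating the outcomes. This is a minimal sketch that assumes a fair coin (the slide does not say so); the names `outcomes` and `pmf` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**5 equally likely sequences of 5 tosses (assumption: fair coin).
outcomes = list(product("HT", repeat=5))

# X = number of heads in a sequence.
head_counts = Counter(seq.count("H") for seq in outcomes)

# Probability of each value of X, as an exact fraction.
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(head_counts.items())}
for x, p in pmf.items():
    print(x, p)
```

X takes only the six values 0 through 5, which is what makes it discrete; a continuous variable such as package weight has no such finite list of values.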
Stevens Classification of Variables
• Nominal: each observation belongs to one of several distinct categories.
• Ordinal: same as nominal, but there also exists a known order among the categories.
• Interval: a variable in which the difference between successive values is always the same.
• Ratio: an interval variable with a natural point representing the origin of measurement (natural zero point).
The difference between two values of an interval variable is a ratio variable.
Events
Any collection of outcomes of an experiment.
• Events consisting of single outcomes in the sample space are called elementary or simple events.
• Events consisting of more than one outcome are called compound events.
Classical Probability
• Assumes all outcomes in the sample space are equally likely to occur
Classical probability of event A:
$P(A) = \frac{N_A}{N} = \frac{\text{number of outcomes that satisfy the event } A}{\text{total number of outcomes in the sample space}}$
• Requires a count of the outcomes in the sample space
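The counting definition of classical probability can be tried on the "roll a die twice" example from the discrete random variable slide. The sketch below assumes a fair die, so that all 36 outcomes are equally likely; `classical_probability` is a hypothetical helper written for this illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling a die twice: 36 equally likely outcomes
# (assumption: the die is fair).
sample_space = list(product(range(1, 7), repeat=2))

def classical_probability(event):
    """P(A) = N_A / N, where `event` is a predicate on a single outcome."""
    n_a = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(n_a, len(sample_space))

# X = number of times 4 comes up.
p_no_four = classical_probability(lambda o: o.count(4) == 0)
p_one_four = classical_probability(lambda o: o.count(4) == 1)
p_two_fours = classical_probability(lambda o: o.count(4) == 2)
print(p_no_four, p_one_four, p_two_fours)
```

The three probabilities are 25/36, 5/18, and 1/36, and they add up to 1.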
Probability in discrete space
Probability Axioms:
$P(A) \ge 0$, $P(S) = 1$ (S is the sample space)
For Mutually Exclusive/Disjoint Events:
$P(A_1 \cup A_2 \cup \dots \cup A_n) = P(A_1) + P(A_2) + \dots + P(A_n)$
[Figure: events A1 through A20]
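Expectation of a function of a random variable
$E[f(X)] = \sum_x f(x)\,p(x)$
Vector representation: $E[f(X)] = f(X) \cdot p(X)$ (dot product of the vector of function values and the vector of probabilities)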
Function
“A relation between two sets in which one element of the second set is assigned to each element of the first set.”
Source: (Dictionary.com)
Examples:
z(x, y) 5x 7 y 25
f (x, y) x g(x, y)h(x, y)
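$E[g(X)] = \sum_x g(x)\,p(x)$
Vector representation: $E[g(X)] = g(X) \cdot p(X)$
Expectation of a random variable
$E[X] = \sum_x x\,p(x)$
Vector representation: $E[X] = X \cdot p(X)$
$E[a] = a$; $E[aX] = a\,E[X]$
Variance of a random variable
$\mathrm{Var}(X) = E[(X - \mu)^2] = \sigma^2 = \sum_x (x - \mu)^2\,p(x)$
Vector representation: $\mathrm{Var}(X) = (X - \mu)^2 \cdot p(X)$
$\mathrm{Var}(a) = 0$; $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$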
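The expectation and variance formulas can be worked through on a small distribution. The sketch below uses X = number of 4s in two rolls of a die, assuming a fair die; `expectation` and `variance` are illustrative helper names, not part of any library.

```python
from fractions import Fraction

# pmf of X = number of 4s in two rolls of a die (assumption: fair die).
pmf = {0: Fraction(25, 36), 1: Fraction(10, 36), 2: Fraction(1, 36)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - mu)^2]."""
    mu = expectation(pmf)
    return expectation(pmf, lambda x: (x - mu) ** 2)

mu = expectation(pmf)
var = variance(pmf)
print(mu, var)

# Scaling rules: E[aX] = a E[X] and Var(aX) = a^2 Var(X).
a = 3
pmf_scaled = {a * x: p for x, p in pmf.items()}
print(expectation(pmf_scaled) == a * mu, variance(pmf_scaled) == a**2 * var)
```

Here E[X] = 1/3 and Var(X) = 5/18, and both scaling rules hold exactly because the arithmetic is done with fractions.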
Probability in discrete space
Lemma:
$P(\bar{A}) = 1 - P(A)$
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
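Both parts of the lemma can be checked by counting outcomes. This sketch reuses the two-rolls example under the assumption of a fair die; the events A and B below are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Two rolls of a die (assumption: fair); events are sets of outcomes.
S = set(product(range(1, 7), repeat=2))
A = {o for o in S if o[0] == 4}   # first roll is a 4
B = {o for o in S if o[1] == 4}   # second roll is a 4

def P(event):
    """Classical probability: favourable outcomes over all outcomes."""
    return Fraction(len(event), len(S))

# Complement rule: P(not A) = 1 - P(A).
complement_rule = P(S - A) == 1 - P(A)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_union = P(A.union(B))
addition_rule = p_union == P(A) + P(B) - P(A.intersection(B))
print(p_union, complement_rule, addition_rule)
```

The subtraction in the addition rule removes the outcome (4, 4), which would otherwise be counted once in A and once in B.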
Events – class assignments
An insurance company offers four different deductible levels, none (N), low (L), medium (M), and high (H), for its homeowner’s policyholders, and three different levels, low (L), medium (M), and high (H), for its automobile policyholders. The table below gives a random sample of policyholders.
• What is the probability that the individual has a medium auto deductible and a high homeowner’s deductible?
• What is the probability that the individual has a medium auto deductible?
• What is the probability that the individual has a high homeowner’s deductible?
• What is the probability that the individual is in the same category for both auto and homeowner’s deductibles?
• What is the probability that the individual is in two different categories?
• What is the probability that the individual has a medium auto deductible given he/she has a high homeowner’s deductible?
• What is the probability that the individual has a high homeowner’s deductible given he/she has a medium auto deductible?
Auto \ Home   N     L     M     H
L             40    60    50    30
M             70    100   200   100
H             20    30    150   150
See the Excel File
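As one possible way to work the questions above without a spreadsheet, the sketch below copies the sample counts into a nested dictionary and computes each probability as a count divided by the sample size. The variable names are invented for this example, and "same category" is read here as the auto and home labels being identical (L-L, M-M, or H-H).

```python
from fractions import Fraction

home_levels = ["N", "L", "M", "H"]
# counts[auto][home], copied from the sample table above.
counts = {
    "L": dict(zip(home_levels, [40, 60, 50, 30])),
    "M": dict(zip(home_levels, [70, 100, 200, 100])),
    "H": dict(zip(home_levels, [20, 30, 150, 150])),
}
total = sum(sum(row.values()) for row in counts.values())

def prob(n):
    """Relative frequency of n policyholders in the sample."""
    return Fraction(n, total)

p_auto_m_and_home_h = prob(counts["M"]["H"])
p_auto_m = prob(sum(counts["M"].values()))
p_home_h = prob(sum(row["H"] for row in counts.values()))
# Assumption: "same category" means identical labels for auto and home.
p_same = prob(sum(counts[level][level] for level in counts))
p_different = 1 - p_same
p_auto_m_given_home_h = p_auto_m_and_home_h / p_home_h
p_home_h_given_auto_m = p_auto_m_and_home_h / p_auto_m

print(p_auto_m_and_home_h, p_auto_m, p_home_h)
print(p_same, p_different)
print(p_auto_m_given_home_h, p_home_h_given_auto_m)
```

The two conditional probabilities share the same numerator but differ in what they divide by, which is why they are not equal.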
Probability Distribution
Counts
Auto \ Home   0      1000   2000   5000   Row Total
250           40     60     50     30     180
500           70     100    200    100    470
1000          20     30     150    150    350
Col Total     130    190    400    280    1000
Probabilities
Auto \ Home   0      1000   2000   5000   Row Total
250           0.04   0.06   0.05   0.03   0.18
500           0.07   0.10   0.20   0.10   0.47
1000          0.02   0.03   0.15   0.15   0.35
Col Total     0.13   0.19   0.40   0.28   1.00
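The step from the count table to the probability table is a division by the sample size, and the row and column totals are the marginal distributions. A minimal sketch, with illustrative variable names, that also computes the expected deductible under each marginal:

```python
from fractions import Fraction

auto_values = [250, 500, 1000]
home_values = [0, 1000, 2000, 5000]
# counts[i][j]: policyholders with auto_values[i] and home_values[j].
counts = [
    [40, 60, 50, 30],
    [70, 100, 200, 100],
    [20, 30, 150, 150],
]
n = sum(map(sum, counts))

# Joint distribution: divide every count by the sample size.
joint = [[Fraction(c, n) for c in row] for row in counts]

# Marginals: row sums for auto, column sums for home.
p_auto = [sum(row) for row in joint]
p_home = [sum(col) for col in zip(*joint)]

# Expected deductibles under the marginal distributions.
e_auto = sum(a * p for a, p in zip(auto_values, p_auto))
e_home = sum(h * p for h, p in zip(home_values, p_home))
print([float(p) for p in p_auto], [float(p) for p in p_home])
print(e_auto, e_home)
```

The expected auto deductible is 630 and the expected homeowner's deductible is 2390.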
A Probability Table
Probabilities and joint probabilities for two events A and B are summarized in this table:
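              $B$                    $\bar{B}$
$A$           $P(A \cap B)$          $P(A \cap \bar{B})$          $P(A)$
$\bar{A}$     $P(\bar{A} \cap B)$    $P(\bar{A} \cap \bar{B})$    $P(\bar{A})$
              $P(B)$                 $P(\bar{B})$                 $P(S) = 1.0$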
Events – class assignments
Another insurance company offers four different deductible levels, none (N), low (L), medium (M), and high (H), for its homeowner’s policyholders, and three different levels, low (L), medium (M), and high (H), for its automobile policyholders. The tables below give a random sample of policyholders and the corresponding probabilities.
Counts
Auto \ Home   N      L      M      H      Row Total
L             20     40     80     60     200
M             50     100    200    150    500
H             30     60     120    90     300
Col Total     100    200    400    300    1000
Probabilities
Auto \ Home   N      L      M      H      Row Total
L             0.02   0.04   0.08   0.06   0.20
M             0.05   0.10   0.20   0.15   0.50
H             0.03   0.06   0.12   0.09   0.30
Col Total     0.10   0.20   0.40   0.30   1.00
Conditional Probability
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Independent Events
A and B are independent (knowing B gives no additional information about A) if: $P(A \mid B) = P(A)$ or $P(A \cap B) = P(A)\,P(B)$
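The product condition can be tested cell by cell on the two insurance samples from the class assignments. The helper `is_independent` below is a hypothetical name for this sketch; it compares each joint probability with the product of its row and column marginals.

```python
from fractions import Fraction

def is_independent(counts):
    """True if every joint probability equals the product of its marginals."""
    n = sum(map(sum, counts))
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    return all(
        Fraction(c, n) == Fraction(r, n) * Fraction(k, n)
        for row, r in zip(counts, row_totals)
        for c, k in zip(row, col_totals)
    )

# Rows are auto levels L, M, H; columns are home levels N, L, M, H.
first_company = [[40, 60, 50, 30], [70, 100, 200, 100], [20, 30, 150, 150]]
second_company = [[20, 40, 80, 60], [50, 100, 200, 150], [30, 60, 120, 90]]

print(is_independent(first_company), is_independent(second_company))
```

In the first sample the auto and homeowner's deductibles are related (for example 0.04 is not 0.18 × 0.13), while in the second sample every cell equals the product of its marginals.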
Covariance of x and y
$\mathrm{Cov}(x, y) = E[(x - E(x))(y - E(y))]$
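As an illustration of the definition, the covariance between the auto and homeowner's deductibles can be computed from the first insurance sample (auto deductibles 250, 500, 1000; home deductibles 0, 1000, 2000, 5000). Variable names below are assumptions made for this sketch.

```python
from fractions import Fraction

auto_values = [250, 500, 1000]
home_values = [0, 1000, 2000, 5000]
counts = [[40, 60, 50, 30], [70, 100, 200, 100], [20, 30, 150, 150]]
n = sum(map(sum, counts))

# Flatten the table into (auto, home, probability) triples.
joint = [
    (a, h, Fraction(c, n))
    for a, row in zip(auto_values, counts)
    for h, c in zip(home_values, row)
]

e_auto = sum(a * p for a, h, p in joint)
e_home = sum(h * p for a, h, p in joint)

# Cov(x, y) = E[(x - E(x)) (y - E(y))]
cov = sum((a - e_auto) * (h - e_home) * p for a, h, p in joint)
print(e_auto, e_home, cov)
```

The covariance is positive (151800), meaning higher auto deductibles tend to go with higher homeowner's deductibles in this sample.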
Conditional Probability
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
Counting the Possible Outcomes
• Use the Combinations formula to determine the number of combinations of n items taken k at a time
$C^n_k = \frac{n!}{k!\,(n-k)!}$
• where
– n! = n(n-1)(n-2)…(1)
– 0! = 1 by definition
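The combinations formula translates directly into code. The sketch below writes it out with factorials and compares the result with Python's built-in `math.comb` (available from Python 3.8); the function name `combinations` is just a label for this example.

```python
from math import comb, factorial

def combinations(n, k):
    """C(n, k) = n! / (k! (n - k)!) for integers 0 <= k <= n."""
    return factorial(n) // (factorial(k) * factorial(n - k))

# Number of ways to place 2 heads among 5 coin tosses.
print(combinations(5, 2))

# The factorial formula agrees with the standard-library function.
print(all(combinations(5, k) == comb(5, k) for k in range(6)))
```

The case k = 0 relies on 0! = 1, which gives exactly one way to choose nothing.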
Probability Distributions
• Discrete Probability Distributions
– Bernoulli
– Binomial
• Continuous Probability Distributions
– Uniform
– Normal
– Exponential
– Chi-square
– F
HW01
$E(X+Y) = E(X) + E(Y)$
$\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$
$\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ if X and Y are independent
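Before attempting a proof, the identities can be checked numerically. The sketch below does this on the first insurance sample, taking X as the auto deductible and Y as the homeowner's deductible; this is a numerical check under those assumptions, not a derivation, and the helper names are illustrative.

```python
from fractions import Fraction

auto_values = [250, 500, 1000]
home_values = [0, 1000, 2000, 5000]
counts = [[40, 60, 50, 30], [70, 100, 200, 100], [20, 30, 150, 150]]
n = sum(map(sum, counts))

# Joint distribution as (x, y, probability) triples: X = auto, Y = home.
joint = [
    (x, y, Fraction(c, n))
    for x, row in zip(auto_values, counts)
    for y, c in zip(home_values, row)
]

def expect(g):
    """E[g(X, Y)] under the joint distribution."""
    return sum(g(x, y) * p for x, y, p in joint)

def variance(g):
    """Var(g(X, Y)) = E[(g - E[g])^2]."""
    mean = expect(g)
    return expect(lambda x, y: (g(x, y) - mean) ** 2)

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))

lhs_mean = expect(lambda x, y: x + y)
lhs_var = variance(lambda x, y: x + y)
rhs_var = variance(lambda x, y: x) + variance(lambda x, y: y) + 2 * cov_xy
print(lhs_mean == mean_x + mean_y, lhs_var == rhs_var)
```

Because the covariance in this sample is not zero, Var(X) + Var(Y) alone would understate Var(X+Y); the shorter formula applies only when the covariance term vanishes, as it does for independent variables.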