Contingency Tables 1
Billie Anderson, Mark Newman
2020-08-22
Agenda
▪ Prior Homework
▪ Contingency Tables
▪ Next Homework
Introduction to Contingency Tables
Tables are like cobwebs, like the sieve of Danaides; beautifully reticulated, orderly to look upon, but which will hold no conclusion. Tables are abstractions, and the object a most concrete one, so difficult to read the essence of.
From Chartism by Thomas Carlyle (1840), Chapter II, Statistics
Many of our methods in statistics revolve around us trying to understand relationships among variables, whether they are response or explanatory variables.
With categorical variables, these relationships are often studied from data that has been summarized by a contingency table in table form or frequency form, giving the frequencies of observations cross-classified by two or more such variables.
Sometimes there is a third, stratifying variable, where we wish to determine if the relationship between two primary variables is the same or different for all levels of the stratifying variable.
The component of Chapter 4 that we are going to focus on deals with formal statistics and tests for association among variables.
Example – Berkeley Admissions Data
The table below, which we have seen before, presents admissions data from the University of California, Berkeley. This is an example of a 2×2 contingency table. Remember, we refer to contingency tables by the number of rows and then the number of columns (r × c).
Admissions to Berkeley graduate programs

Gender    Admitted   Rejected   Total   % Admit   Odds(Admit)
Male          1198       1493    2691     44.52         0.802
Female         557       1278    1835     30.35         0.436
Total         1755       2771    4526     38.78         0.633
It is always a good idea to make the distinction between the response variable and the explanatory variables.
This is an example of the simplest kind of contingency table, a 2 x 2 classification of individuals according to two dichotomous (binary) variables.
For such a table, the question of whether there is an association between admission and gender is equivalent to asking if the proportions of males and females who are admitted to graduate school are different, or whether the difference in proportions admitted is not zero.
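We can verify the % Admit and Odds(Admit) columns of the table directly in R. A quick sketch using the built-in UCBAdmissions data (the variable names here are our own):
UCB <- margin.table(UCBAdmissions, 2:1)     # collapse over Dept: rows = Gender, cols = Admit
p_admit <- prop.table(UCB, margin = 1)[, "Admitted"]   # proportion admitted, by gender
round(100 * p_admit, 2)                     # 44.52 (Male) and 30.35 (Female)
round(p_admit / (1 - p_admit), 3)           # odds of admission: 0.802 and 0.436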
Example – Berkeley Admissions Data (cont.)
Admissions to Berkeley graduate programs

Gender    Admitted   Rejected   Total   % Admit   Odds(Admit)
Male          1198       1493    2691     44.52         0.802
Female         557       1278    1835     30.35         0.436
Total         1755       2771    4526     38.78         0.633
\(H_0:\) the proportions of males and females admitted are the same
\(H_1:\) the proportions of males and females admitted are different
OR
Suppose \(\pi_m\) represents the population proportion of males admitted to Berkeley.
Suppose \(\pi_f\) represents the population proportion of females admitted to Berkeley.
\(H_0:\pi_m = \pi_f\)
\(H_1:\pi_m \neq \pi_f\)
If we were to think about modeling this data, it would make sense to treat admission as the response variable and gender as the explanatory variable.
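As a preview, one standard way to test \(H_0:\pi_m = \pi_f\) is the classical two-sample test of equal proportions; a minimal sketch (we will instead approach this through the odds ratio later in these notes):
prop.test(x = c(1198, 557), n = c(2691, 1835))   # admitted counts and totals for males, females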
Example – Arthritis Treatment Data
The table below displays a 2×2×3 contingency table.

This is a three-way table, with factors Treatment, Sex, and Improvement.
If the relationship between Treatment and Improvement is the same for both genders, an analysis of Treatment versus Improvement (collapsed over Sex) could be carried out, as in the sketch below.
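Assuming the table is the Arthritis data from the vcd package (where the improvement variable is named Improved), a sketch of building the three-way table and collapsing it over Sex:
library(vcd)
tab <- xtabs(~ Treatment + Sex + Improved, data = Arthritis)   # 2 x 2 x 3 three-way table
ftable(tab)                  # flat view of Treatment x Sex x Improved
margin.table(tab, c(1, 3))   # Treatment x Improved, collapsed over Sex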
Notation for Contingency Tables
Let’s prepare and start getting familiar with some of the notation.

Each observation is randomly sampled from some population and classified on two categorical variables, A and B.
Notation for Contingency Tables (cont.)

Let \(N=\{n_{ij}\}\) be the table of observed frequencies for variables A and B, with r rows and c columns, as shown in the table above.
A subscript is replaced by a “+” when summed over the corresponding variable, so \(n_{i+}=\sum_{j}n_{ij}\) gives the total frequency in row \(i\), \(n_{+j}=\sum_{i}n_{ij}\) gives the total frequency in column \(j\), and \(n_{++}=\sum_{i}\sum_{j}n_{ij}\) is the grand total; \(n_{++}\) is also written simply as \(n\).
Since we presume that each observation is randomly sampled from some population and classified on two categorical variables, A and B, these variables have a joint distribution. Let \(\pi_{ij}=\Pr(A=i, B=j)\) denote the population probability that an observation is classified in row \(i\), column \(j\) (cell \((i,j)\)) of the table.
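A quick sketch of these marginal and grand totals in R, using the Berkeley data (the name UCB is our own):
UCB <- margin.table(UCBAdmissions, 2:1)   # rows = Gender (A), cols = Admit (B)
rowSums(UCB)      # n_{i+}: 2691 (Male), 1835 (Female)
colSums(UCB)      # n_{+j}: 1755 (Admitted), 2771 (Rejected)
sum(UCB)          # n_{++} = n = 4526
addmargins(UCB)   # the table with marginal totals appended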
Notation for Contingency Tables (cont.)

The cell proportions, \(p_{ij}=\frac{n_{ij}}{n}\), give the sample joint distribution.
The row totals \(n_{i+}\) and column totals \(n_{+j}\) are called marginal frequencies for variables A and B, respectively. These describe the distribution of each variable ignoring the other.
For the population probabilities, the marginal distributions are defined analogously as the row and column totals of the joint probabilities: \(\pi_{i+}=\sum_{j}\pi_{ij}\) and \(\pi_{+j}=\sum_{i}\pi_{ij}\).
The sample marginal proportions are, correspondingly, \(p_{i+}=\sum_{j}p_{ij}=\frac{n_{i+}}{n}\) and \(p_{+j}=\sum_{i}p_{ij}=\frac{n_{+j}}{n}\).
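Correspondingly, the sample joint and marginal proportions, continuing the UCB sketch above:
p <- prop.table(UCB)   # p_{ij}: sample joint distribution (entries sum to 1)
rowSums(p)             # p_{i+}: 0.5946 (Male), 0.4054 (Female)
colSums(p)             # p_{+j}: 0.3878 (Admitted), 0.6122 (Rejected)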
Notation for Contingency Tables (cont.)

When one variable (the column variable, B, for example) is a response variable, and the other (A) is an explanatory variable, it is useful to examine the distribution of the response B for each level of A separately. These define the conditional distributions of B, given the level of A, and are defined for the population as \(\pi_{j|i}=\frac{\pi_{ij}}{\pi_{i+}}\).
Remember: \(\pi\) denotes a population proportion; \(p\) denotes a sample proportion.
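In R, the sample conditional distributions \(p_{j|i}\) come from dividing each row by its row total; a sketch, continuing with UCB from above:
prop.table(UCB, margin = 1)   # Male: 0.445 admitted, 0.555 rejected; Female: 0.304, 0.696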
Example
Use the 2×2 contingency table for the Berkeley Admissions data and, for each number in the table, label the following:
Berkley <- margin.table(UCBAdmissions, 2:1)   # collapse over Dept: rows = Gender, cols = Admit
library(gmodels)
CrossTable(Berkley, prop.chisq=FALSE, prop.c=FALSE, format="SPSS")   # counts with row and total percents
##
## Cell Contents
## |-------------------------|
## | Count |
## | Row Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 4526
##
## | Admit
## Gender | Admitted | Rejected | Row Total |
## -------------|-----------|-----------|-----------|
## Male | 1198 | 1493 | 2691 |
## | 44.519% | 55.481% | 59.456% |
## | 26.469% | 32.987% | |
## -------------|-----------|-----------|-----------|
## Female | 557 | 1278 | 1835 |
## | 30.354% | 69.646% | 40.544% |
## | 12.307% | 28.237% | |
## -------------|-----------|-----------|-----------|
## Column Total | 1755 | 2771 | 4526 |
## -------------|-----------|-----------|-----------|
##
##
Example (cont.)
##
## Cell Contents
## |-------------------------|
## | Count |
## | Row Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 4526
##
## | Admit
## Gender | Admitted | Rejected | Row Total |
## -------------|-----------|-----------|-----------|
## Male | 1198 | 1493 | 2691 |
## | 44.519% | 55.481% | 59.456% |
## | 26.469% | 32.987% | |
## -------------|-----------|-----------|-----------|
## Female | 557 | 1278 | 1835 |
## | 30.354% | 69.646% | 40.544% |
## | 12.307% | 28.237% | |
## -------------|-----------|-----------|-----------|
## Column Total | 1755 | 2771 | 4526 |
## -------------|-----------|-----------|-----------|
##
##
▪ Observed frequency: 1198
▪ Marginal frequencies: 2691
▪ Marginal probability: 59.456%
▪ Joint probability: 26.469%
2x2 Tables: Odds and Odds Ratio
There is a difference between odds and odds ratio!
It is a subtle, but important difference.
Let’s re-visit the Berkeley Admissions data.
##
## Cell Contents
## |-------------------------|
## | Count |
## | Row Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 4526
##
## | Admit
## Gender | Admitted | Rejected | Row Total |
## -------------|-----------|-----------|-----------|
## Male | 1198 | 1493 | 2691 |
## | 44.519% | 55.481% | 59.456% |
## | 26.469% | 32.987% | |
## -------------|-----------|-----------|-----------|
## Female | 557 | 1278 | 1835 |
## | 30.354% | 69.646% | 40.544% |
## | 12.307% | 28.237% | |
## -------------|-----------|-----------|-----------|
## Column Total | 1755 | 2771 | 4526 |
## -------------|-----------|-----------|-----------|
##
##
2x2 Tables: Odds and Odds Ratio (cont.)
Remember, \(p_{ij}=\frac{n_{ij}}{n}\) gives the sample joint probability in the 2x2 contingency table.
Notice that \(p_{11}+p_{12}+p_{21}+p_{22}=1\).
In the contingency table above, you would be interested in whether the column variable is independent of the row variable (for a 2x2 table); is gender independent of admission? You would formally state this as,
\(H_0\): gender and admission are independent of each other (there is no relationship)
\(H_1\): gender and admission are not independent of each other (there is some dependency, some relationship exists)
How do we go about testing a hypothesis like this? We use an odds ratio.
An odds ratio is a quantitative metric that measures the strength of an association between two variables.
We are going to study odds first and then the odds ratio.
Odds
For a binary variable, let \(\pi\) denote the probability of success, then the odds of success is \(\frac{\pi}{1-\pi}\).
From the rule of complements, \(1-\pi\) is the probability of a success not occurring, that is, the probability of failure.
\(1\) is the reference value for odds: odds of \(1\) mean success and failure are equally likely.
Another way to think of odds: \(\frac{\pi}{1-\pi}=\frac{\text{probability of event of interest (success) occurring}}{\text{probability of event of interest (success) not occurring}}=\frac{\text{probability of success}}{\text{probability of failure}}\)
There are three cases to consider:
1. If odds \(=1\), then the numerator and denominator are equal, so the probabilities of success and failure are the same.
2. If odds \(>1\), then the numerator > denominator, so the probability of success is greater than the probability of failure.
For example, if \({\pi}=.75\), then odds = \(\frac{.75}{.25}=3\). So, a success is three times as likely as a failure.
3. If odds \(<1\), then the numerator < denominator, so the probability of success is less than the probability of failure.
For example, if \({\pi}=.25\), then odds = \(\frac{.25}{.75}=\frac{1}{3}\). So, the probability of success is one-third that of failure; failure is more likely.
The odds of success are multiplicative around 1. Taking the log of the odds makes the scale additive and symmetric around 0; this quantity is called the log odds or logit, defined as
\[\text{logit}(\pi)=\log(\text{odds})=\log\left(\frac{\pi}{1-\pi}\right)\]
You will see, later in the class, that this is the quantity that is measured in logistic regression.
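A small sketch of odds and logit as R functions (the helper names are our own), illustrating the three cases above and the symmetry of the logit:
odds  <- function(p) p / (1 - p)
logit <- function(p) log(p / (1 - p))
odds(c(0.25, 0.50, 0.75))    # 0.333, 1, 3     -- multiplicative around 1
logit(c(0.25, 0.50, 0.75))   # -1.099, 0, 1.099 -- symmetric around 0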
Odds Ratio
An odds ratio is a quantitative metric that measures the strength of an association between two variables.
For a 2x2 table in which the variables are binary, suppose Group \(1\) is the row variable and Group 2 is the column variable. Let \(\pi_1\) be the success probability for Group \(1\) and \(\pi_2\) be the success probability for Group 2. The odds ratio is then the ratio of the odds for the two groups; the population odds ratio is denoted by \(\theta\).
population odds ratio = \(\theta=\frac{odds_1}{odds_2}=\frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}\)
You interpret the odds ratio in the same manner as the odds: compare its value to \(1\) (equal to, less than, or greater than \(1\)).
If \(\theta=1\), then the odds of success (and hence the probability of success) are the same for both groups, so there is no association between the row and column variables.
Going back to our hypothesis that we originally stated:
\(H_0\): gender and admission are independent of each other (there is no relationship)
\(H_1\): gender and admission are not independent of each other (there is some dependency, some relationship exists)
Is the same as testing:
\(H_0: \theta=1\)
\(H_1: \theta \ne 1\)
Remember, \(\theta\) is a parameter (the odds ratio calculated from the population), which is why it is what appears in \(H_0\) and \(H_1\).
How are we going to test this? The same way we test all statistical hypotheses in analytics!
Take a sample, compute a statistic (sample odds ratio in this case) and determine if the sample odds ratio is significantly different from \(1\).
If we have statistical support that the sample odds ratio is significantly different from \(1\), we will reject \(H_0\) and accept \(H_1\).
If we do not have statistical support that the odds ratio is significantly different from \(1\), we will not reject \(H_0\).
The sample odds ratio is denoted as \(\hat{\theta}\).
\[\hat{\theta}=\frac{p_1/(1-p_1)}{p_2/(1-p_2)}=\frac{n_{11}/n_{12}}{n_{21}/n_{22}}=\frac{n_{11}n_{22}}{n_{12}n_{21}}\]
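A minimal sketch of \(\hat{\theta}\) computed directly from the counts of a 2x2 table (the function name is ours):
odds_ratio_2x2 <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
UCB <- margin.table(UCBAdmissions, 2:1)   # rows = Gender, cols = Admit
odds_ratio_2x2(UCB)                       # 1.84108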
Hypothesis (cont.)
library(vcd)
library(grid)
library(gnm)
library(vcdExtra)
data("UCBAdmissions")
(UCB <- margin.table(UCBAdmissions, 1:2))
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
(OR <- oddsratio(UCB, log = FALSE))
## odds ratios for Admit and Gender
##
## [1] 1.84108
Hypothesis (cont.)
Let’s make sure we understand how the odds ratio was computed.
library(dplyr)   # for %>%, select, group_by, summarise, mutate
library(tidyr)   # for pivot_wider
data <-
  datasets::UCBAdmissions %>%
  as.data.frame() %>%
  select(Gender, Admit, Freq) %>%
  group_by(Gender, Admit) %>%
  summarise(n = sum(Freq)) %>%                           # collapse over Dept
  pivot_wider(names_from = 'Admit', values_from = 'n') %>%
  as.data.frame()
func <- function(z) if (is.numeric(z)) sum(z) else ''    # column sum for numeric columns, blank otherwise
total_row <- data %>% lapply(func) %>% as.data.frame()   # build the Total row
total_row[1] <- 'Total'
data <-
data %>%
rbind(total_row) %>%
mutate(Total = Admitted + Rejected)
data
## Gender Admitted Rejected Total
## 1 Male 1198 1493 2691
## 2 Female 557 1278 1835
## 3 Total 1755 2771 4526
Hypothesis (cont.)
This is a sample odds ratio because it was computed from sample data.
Presume Group \(1\) is males and Group 2 is females, with “success” meaning admission to Berkeley.
\[\hat{\theta}=\frac{p_1/(1-p_1)}{p_2/(1-p_2)}=\frac{n_{11}/n_{12}}{n_{21}/n_{22}}=\frac{n_{11}n_{22}}{n_{12}n_{21}}\] Let’s make sure we can compute \(\hat{\theta}\) using the proportions (probabilities) and the counts.
\[\hat{\theta}=\frac{p_1/(1-p_1)}{p_2/(1-p_2)}\]
Hypothesis (cont.)
(data <- data %>% mutate(p = Admitted / Total, `1-p` = 1 - p))
## Gender Admitted Rejected Total p 1-p
## 1 Male 1198 1493 2691 0.4451877 0.5548123
## 2 Female 557 1278 1835 0.3035422 0.6964578
## 3 Total 1755 2771 4526 0.3877596 0.6122404
\(p_1\) is the probability of success for Group \(1\) (the probability that a male is admitted to Berkeley).
\(1-p_1\) is the probability of failure for Group \(1\) (the probability that a male is rejected from Berkeley).
\(p_1=\frac{1198}{1198+1493}=\frac{1198}{2691}=.4452\)
\(1-p_1=1-.4452=.5548\)
\(p_2\) is the probability of success for Group 2 (the probability that a female is admitted to Berkeley).
\(1-p_2\) is the probability of failure for Group 2 (the probability that a female is rejected from Berkeley).
\(p_2=\frac{557}{557+1278}=\frac{557}{1835}=.3035\)
\(1-p_2=1-.3035=.6965\)
Hypothesis (cont.)
(data <- data %>% mutate(odds = p/`1-p`))
## Gender Admitted Rejected Total p 1-p odds
## 1 Male 1198 1493 2691 0.4451877 0.5548123 0.8024113
## 2 Female 557 1278 1835 0.3035422 0.6964578 0.4358372
## 3 Total 1755 2771 4526 0.3877596 0.6122404 0.6333454
(or <- data$odds[1]/data$odds[2])
## [1] 1.84108
Putting all this together to calculate \(\hat{\theta}\),
\[\hat{\theta}=\frac{.4452/.5548}{.3035/.6965}=\frac{.8025}{.4358}=1.84\] The odds of admission for males are 1.84 times the odds for females.
Now, compute \(\hat{\theta}\) using the counts.
\[\hat{\theta}=\frac{n_{11}/n_{12}}{n_{21}/n_{22}}=\frac{1198/557}{1493/1278}=\frac{2.15}{1.17}=1.84\]
The odds of being admitted to Berkeley for males are \(1.84\) times the odds for females.
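A one-line check in R that the cross-product ratio of the counts reproduces the oddsratio() value above:
(1198 * 1278) / (1493 * 557)   # 1.84108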
Hypothesis (cont.)
Now, is \(1.84\) significantly different from \(1\) so that we could conclude, based on this sample, that there is a dependency (relationship) between admission status and gender?
Perform the statistical test.
UCB <- margin.table(UCBAdmissions, 1:2)
OR <- oddsratio(UCB, log = FALSE)
summary(OR)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## Admitted:Rejected/Male:Female 1.84108 0.11763 15.651 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, based on the very small p-value (less than \(\alpha=.05\)), we reject \(H_0\) and accept \(H_1\).
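For intuition, here is a by-hand sketch of the conventional Wald test, computed on the log odds ratio (the scale on which the normal approximation is usually applied, so the z value differs from the OR-scale output above); it reaches the same conclusion:
n11 <- 1198; n12 <- 1493; n21 <- 557; n22 <- 1278
log_or <- log((n11 * n22) / (n12 * n21))        # log odds ratio: 0.6104
se     <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)   # standard error of the log odds ratio: 0.0639
z      <- log_or / se                           # z statistic: about 9.55
2 * pnorm(-abs(z))                              # two-sided p-value: essentially 0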
Hypothesis (cont.)
Recall,
\(H_0\): gender and admission are independent of each other (there is no relationship)
\(H_1\): gender and admission are not independent of each other (there is some dependency, some relationship exists)
Is the same as testing:
\(H_0: \theta=1\)
\(H_1: \theta \ne 1\)
So, we have support that the variables Admission and Gender are related: your gender is related to whether or not you are admitted to Berkeley.
Sidenote: The more straightforward conclusion is when we can reject \(H_0\) and accept \(H_1\).
Suppose p-value \(> \alpha\) and we had not rejected \(H_0\). How would we have made a conclusion back in the context of the problem we are studying?
We would have no statistical support that gender and admission status are dependent; note that we would not conclude they are independent, since a relationship may still exist.
Remember, we never accept \(H_0\). We never accept \(H_0\)! We never accept \(H_0\)!
Relationship Between Hypothesis Testing and Confidence Intervals
You can perform a statistical test to test the following hypotheses by constructing a confidence interval for the parameter, \(\theta\).
\(H_0\): \(\theta=1\) : gender and admission are independent of each other (there is no relationship)
\(H_1\): \(\theta \ne 1\) : gender and admission are not independent of each other (there is some dependency, some relationship exists)
The R code below provides a confidence interval for \(\theta\). In this class, you are not going to have to construct the confidence interval by hand; rather, know how to relate the confidence interval to the hypothesis test.
confint(OR)
## 2.5 % 97.5 %
## Admitted:Rejected/Male:Female 1.624377 2.086693
We are 95% confident that the true odds ratio relating admission status to gender is between \(1.62\) and \(2.09\).
Remember, a confidence interval is a set of plausible values for the parameter of interest, \(\theta\), in this case.
Since \(1\) is not in the confidence interval, \(1\) is not a plausible value for \(\theta\). So, we can reject \(H_0\) and accept \(H_1\). Notice this is the same conclusion that we made with our statistical test previously.
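For the curious, the interval above matches the standard Wald interval built on the log scale and then exponentiated; a sketch (you will not need to do this by hand):
log_or <- log((1198 * 1278) / (1493 * 557))      # log of the sample odds ratio
se     <- sqrt(1/1198 + 1/1493 + 1/557 + 1/1278) # standard error of the log odds ratio
exp(log_or + c(-1, 1) * qnorm(0.975) * se)       # 1.624 2.087, matching confint(OR)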
In summary
\(H_0\): independent
\(H_1\): not independent
If \(p \lt \alpha\): you reject \(H_0\) and say there is evidence to support acceptance of \(H_1\).
If \(p \ge \alpha\): you fail to reject \(H_0\).
If the confidence interval does not contain \(1\), you reject \(H_0\) and say there is evidence to support acceptance of \(H_1\).
If the confidence interval does contain \(1\), you fail to reject \(H_0\).
Next Homework
▪ Review Expectations