Inference for Counts in Two-Way Tables
• data is often presented as counts in two-way
tables
• in some cases the data consists of random
samples from each of several
subpopulations, and the goal is to assess
whether the possible outcomes have the
same distribution within each subpopulation
• in other cases the data consists of a single
sample from a population where each
sampled item is cross classified according to
two categorical variables
• in general there are r rows and c columns
and the counts are
Xij , i = 1, . . . , r, j = 1, . . . , c
                         Column Variable
Row Variable    1      . . .    c      Total
1               X11    . . .    X1c    X1.
. . .           . . .  . . .    . . .  . . .
r               Xr1    . . .    Xrc    Xr.
Total           X.1    . . .    X.c    X..
Test of Homogeneity
Example: Consider the following table, in
which the cells in 5 samples were classified
as to type
Cell type
Sample P Q R S T U Total
1 3 24 18 13 32 10 100
2 7 25 16 11 27 14 100
3 5 28 14 10 31 12 100
4 7 22 19 11 26 15 100
5 17 19 13 8 25 18 100
• we’ll call the values Xij , for i = 1, …, r and
j = 1, …, c
• in this case the row totals ri = Xi. were
fixed by the study design
• we can assume each row follows a
multinomial distribution, with number of
trials ri (all 100 in this example) and
probabilities pij, j = 1, . . . , c
• note that the probabilities sum to 1 in each
row,
$$\sum_{j=1}^{c} p_{ij} = 1$$
• the question of interest is whether the
distribution of cell types is the same across
all samples, that is, whether the multinomial
probabilities are the same in each row
• so the null hypothesis is
H0 : pij = pkj = p.j
for all i, j, k or, more simply
H0 : the distribution of cell types is
the same in each sample
• the alternative hypothesis is
Ha : pij ≠ pkj
for some i, j, k, or
Ha : the distribution of cell types is not
the same in each sample
• we can also say
H0 : the distributions of cell types are
homogeneous
Ha : the distributions of cell types are
not homogeneous
Test for Association
Example: 327 people were randomly selected and
their hair colour and handedness were recorded.
The results are
Hair Colour
Handedness Red Brown Other Total
Left 12 54 37 103
Right 29 75 66 170
Ambidex. 7 27 20 54
Total 48 156 123 327
• here there is a single sample, and the cells of
the multinomial distribution form a two-way
table
• the hypotheses of interest are
H0 : there is no association between
handedness and hair colour
Ha : there is an association between
handedness and hair colour
• the cells of the table follow a multinomial
distribution with probabilities pij,
i = 1, . . . , r, j = 1, . . . , c
• as with all multinomials, the probabilities
must sum to one,
$$\sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} = 1$$
• the null hypothesis of no association is
H0 : pij = pi.p.j
for all i, j
• here, pi. is the probability of being in the ith
row, and p.j is the probability of being in the
jth column
• recall that the joint probability of
independent events is the product of the
marginal probabilities, so the null hypothesis
assumes independence between the row and
column variables
• the alternative hypothesis is
Ha : pij ≠ pi.p.j
for some i, j
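As a rough illustration of the null hypothesis above, the sketch below builds joint probabilities as products of row and column marginals. The values in `row_probs` and `col_probs` are made up for illustration only; they are not estimates from any data in these notes.

```python
# Hypothetical marginal probabilities (illustrative values only).
row_probs = [0.3, 0.5, 0.2]   # p_i. for r = 3 rows
col_probs = [0.25, 0.75]      # p_.j for c = 2 columns

# Under H0 (no association) each joint probability is p_ij = p_i. * p_.j
joint = [[pi * pj for pj in col_probs] for pi in row_probs]

# As for any multinomial, the joint probabilities sum to one.
total = sum(sum(row) for row in joint)

# Conditional distribution of the column variable within each row.
# Under independence it equals the column marginal in every row.
conditionals = [[pij / sum(row) for pij in row] for row in joint]

print(total)
for row in conditionals:
    print(row)
```

Every row of `conditionals` reproduces `col_probs`, which is why the same hypothesis can also be read as homogeneity of the conditional distributions.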
Test for Homogeneity or Association
• the analysis of the data is the same for tests
of homogeneity and tests of association!
• this is fortunate, because the distinction
between the two is not always clear
• even if the data are collected as a single
sample, the null hypothesis of no association
can be phrased as homogeneity of
conditional distributions across the rows or
columns
– for example, with the handedness-hair
colour example we can ask whether the
distribution of hair colour is the same
for left-handed, right-handed and
ambidextrous people
• the test statistic is the goodness of fit
statistic
$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(X_{ij} - e_{ij})^2}{e_{ij}}$$
• the test statistic compares the observed
counts to those expected assuming H0 is
true in each cell of the table
• the expected counts are given by
$$e_{ij} = \frac{X_{i\cdot} X_{\cdot j}}{X_{\cdot\cdot}} = \frac{\text{row } i \text{ sum} \times \text{column } j \text{ sum}}{\text{overall sum}}$$
• X² has an approximate χ² distribution with
(r − 1)(c − 1) degrees of freedom
• for the approximation to be valid, all
expected counts should be greater than 5
• we also assume that the data consist of (a)
random sample(s)
• the P value is $P(\chi^2_{(r-1)(c-1)} > X^2)$, and as
usual smaller P values give stronger
evidence against H0
• in general
– observed and expected close:
little evidence of association
– observed and expected far apart:
strong evidence of association
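The steps above (expected counts, the statistic, and its degrees of freedom) can be sketched in a few lines of plain Python. The function name `chi_square_table` is an illustrative choice, not something from these notes, and the table used is the hair colour and handedness data given earlier.

```python
def chi_square_table(observed):
    """Return (expected counts, X^2, degrees of freedom) for a two-way table."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    # e_ij = (row i sum * column j sum) / overall sum
    expected = [[ri * cj / n for cj in col_sums] for ri in row_sums]
    # X^2 = sum over all cells of (observed - expected)^2 / expected
    x2 = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return expected, x2, df

# Rows: Left, Right, Ambidex.; columns: Red, Brown, Other
hair = [[12, 54, 37], [29, 75, 66], [7, 27, 20]]
expected, x2, df = chi_square_table(hair)

# Check the rule of thumb that all expected counts exceed 5.
smallest_expected = min(min(row) for row in expected)
print(round(x2, 3), df, round(smallest_expected, 2))
```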
• the expected counts are calculated as,
$$e_{ij} = \frac{\text{row sum} \times \text{column sum}}{\text{overall sum}} = \frac{X_{i\cdot} X_{\cdot j}}{X_{\cdot\cdot}}$$
where the terms have different meaning
depending on the hypothesis
• for homogeneity the row (or column) totals
are considered fixed, and the expected
counts are
$$e_{ij} = \frac{r_i X_{\cdot j}}{n} = r_i \hat{p}_{\cdot j}$$
where $n = X_{\cdot\cdot} = \sum_i r_i$ is the overall sum
• here p̂.j = X.j/n is the pooled estimate of
the common value of p.j under the null
hypothesis, and multiplying by ri = Xi.
gives the mean or expected value
• for association, when we calculate expected
counts using
$$e_{ij} = \frac{X_{i\cdot} X_{\cdot j}}{n}$$
we are using the formula np̂ij under the null
hypothesis, where p̂ij = p̂i.p̂.j, and
p̂i. = Xi./n and p̂.j = X.j/n
• so
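$$e_{ij} = n\hat{p}_{ij} = n\hat{p}_{i\cdot}\hat{p}_{\cdot j} = n \, \frac{X_{i\cdot}}{n} \, \frac{X_{\cdot j}}{n} = \frac{X_{i\cdot} X_{\cdot j}}{n}$$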
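A small numerical check that the two readings of the expected count agree, using the cell type table from the homogeneity example. This is only a sketch; the variable names are illustrative.

```python
# Cell type counts: 5 samples (rows) by 6 cell types P, Q, R, S, T, U
cells = [
    [3, 24, 18, 13, 32, 10],
    [7, 25, 16, 11, 27, 14],
    [5, 28, 14, 10, 31, 12],
    [7, 22, 19, 11, 26, 15],
    [17, 19, 13, 8, 25, 18],
]
row_totals = [sum(row) for row in cells]         # r_i = X_i.
col_totals = [sum(col) for col in zip(*cells)]   # X_.j
n = sum(row_totals)                              # X_..

p_col = [cj / n for cj in col_totals]   # pooled estimates p-hat_.j
p_row = [ri / n for ri in row_totals]   # estimates p-hat_i.

# Homogeneity view: e_ij = r_i * p-hat_.j
homog = [[ri * pj for pj in p_col] for ri in row_totals]
# Association view: e_ij = n * p-hat_i. * p-hat_.j
assoc = [[n * pi * pj for pj in p_col] for pi in p_row]

max_diff = max(abs(h - a)
               for h_row, a_row in zip(homog, assoc)
               for h, a in zip(h_row, a_row))
print([round(e, 2) for e in homog[0]], max_diff)
```

Both formulas give the same expected counts up to floating-point rounding, which is why one analysis serves both hypotheses.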
Examples
Cell types:
• here the null hypothesis is that the
distributions of cell types are the same for
each sample
• the data can be analyzed in Minitab, as
follows
MTB > chis c2-c7
Chi-Square Test: C2, C3, C4, C5, C6, C7
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected
counts
C2 C3 C4 C5 C6 C7 Total
1 3 24 18 13 32 10 100
7.80 23.60 16.00 10.60 28.20 13.80
2.954 0.007 0.250 0.543 0.512 1.046
2 7 25 16 11 27 14 100
7.80 23.60 16.00 10.60 28.20 13.80
0.082 0.083 0.000 0.015 0.051 0.003
3 5 28 14 10 31 12 100
7.80 23.60 16.00 10.60 28.20 13.80
1.005 0.820 0.250 0.034 0.278 0.235
4 7 22 19 11 26 15 100
7.80 23.60 16.00 10.60 28.20 13.80
0.082 0.108 0.563 0.015 0.172 0.104
5 17 19 13 8 25 18 100
7.80 23.60 16.00 10.60 28.20 13.80
10.851 0.897 0.563 0.638 0.363 1.278
Total 39 118 80 53 141 69 500
Chi-Sq = 23.802, DF = 20, P-Value = 0.251
• we have no evidence of a difference in
distribution of cell types in the different
samples
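For readers without Minitab, the same analysis can be approximated in plain Python. This is a sketch rather than a replacement for the Minitab command: `chi2_upper_tail` is a hypothetical helper that evaluates the chi-square tail probability from the standard series for the regularized lower incomplete gamma function, and it is written for moderate values of X² like those in these examples.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x), via a gamma series."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series: sum over n >= 0 of z^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-15 * total:
        term *= z / (a + k)
        total += term
        k += 1
    # Lower tail = z^a e^(-z) / Gamma(a) times the series
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower

cells = [
    [3, 24, 18, 13, 32, 10],
    [7, 25, 16, 11, 27, 14],
    [5, 28, 14, 10, 31, 12],
    [7, 22, 19, 11, 26, 15],
    [17, 19, 13, 8, 25, 18],
]
row_totals = [sum(row) for row in cells]
col_totals = [sum(col) for col in zip(*cells)]
n = sum(row_totals)

x2 = 0.0
for row, ri in zip(cells, row_totals):
    for o, cj in zip(row, col_totals):
        e = ri * cj / n              # expected count for this cell
        x2 += (o - e) ** 2 / e       # its contribution to X^2

df = (len(cells) - 1) * (len(cells[0]) - 1)
p_value = chi2_upper_tail(x2, df)
print(round(x2, 3), df, round(p_value, 3))
```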
Hair colour/handedness:
• the observed counts are
Hair Colour
Handedness Red Brown Other Total
Left 12 54 37 103
Right 29 75 66 170
Ambidex. 7 27 20 54
Total 48 156 123 327
• calculating proportions relative to a row or
column total, or to the overall total often
reveals interesting features of the data
• for example, division by the sample size gives
Hair Colour
Handedness Red Brown Other Total
Left .04 .17 .11 .32
Right .09 .23 .20 .52
Ambidex. .02 .08 .06 .16
Total .15 .48 .37 1.00
• from this we can see that most people have
brown hair and are right handed
• or we could calculate proportions relative to
the row or column totals
• using the row total gives
Hair Colour
Handedness Red Brown Other Total
Left .12 .52 .36 1.00
Right .17 .44 .39 1.00
Ambidex. .13 .50 .37 1.00
Total .15 .48 .37 1.00
• the three distributions of hair colour appear
to be similar for left-handed, right-handed and
ambidextrous people
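The row proportions above can be reproduced with a short sketch like the following, which divides each count by its row total and rounds to two decimals. The variable names are illustrative.

```python
labels = ["Left", "Right", "Ambidex."]
# Columns: Red, Brown, Other
hair = [[12, 54, 37], [29, 75, 66], [7, 27, 20]]

# Each count divided by its row total gives the distribution of
# hair colour conditional on handedness.
row_props = [[round(x / sum(row), 2) for x in row] for row in hair]

for label, props in zip(labels, row_props):
    print(label, props)
```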
• the expected counts are
Hair Colour
Handedness Red Brown Other Total
Left 15.12 49.14 38.74 103
Right 24.95 81.10 63.94 170
Ambidex. 7.93 25.76 20.31 54
Total 48 156 123 327
• for example e11 = 103(48)/327 = 15.12
• note that the expected counts add up to the
same row and column totals as the observed
counts
• note also that the expected counts are not
rounded to whole numbers
• the contributions to X² are
Hair Colour
Handedness Red Brown Other Total
Left .644 .481 .078
Right .656 .459 .066
Ambidex. .108 .060 .005
Total
• the total test statistic is X² = 2.557 on
(3 − 1)(3 − 1) = 4 degrees of freedom
• the P value is .63, indicating that there is no
evidence against the null hypothesis of no
association between handedness and hair
colour
• the calculations can be done in Minitab using
the chisquare command
MTB > print c1-c3
ROW C1 C2 C3
1 12 54 37
2 29 75 66
3 7 27 20
DATA> chisquare c1-c3
Expected counts are printed below observed
counts
C1 C2 C3 Total
1 12 54 37 103
15.12 49.14 38.74
2 29 75 66 170
24.95 81.10 63.94
3 7 27 20 54
7.93 25.76 20.31
Total 48 156 123 327
ChiSq = 0.644 + 0.481 + 0.078 +
0.656 + 0.459 + 0.066 +
0.108 + 0.060 + 0.005 = 2.557
df = 4
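The P value of .63 quoted for this table can be checked by hand. The worked calculation below uses the standard closed form for the upper tail of a chi-square distribution with an even number of degrees of freedom; that closed form is not stated in these notes and is brought in here only as a check.

```latex
% For even degrees of freedom 2k:
%   P(\chi^2_{2k} > x) = e^{-x/2} \sum_{i=0}^{k-1} \frac{(x/2)^i}{i!}
% With 4 degrees of freedom (k = 2) and X^2 = 2.557:
\begin{align*}
P(\chi^2_4 > 2.557)
  &= e^{-2.557/2} \left( 1 + \frac{2.557}{2} \right) \\
  &= e^{-1.2785} \times 2.2785 \\
  &\approx 0.2785 \times 2.2785 \\
  &\approx 0.63
\end{align*}
```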