Inference for Counts in Two-Way Tables

• data is often presented as counts in two-way
tables

• in some cases the data consists of random
samples from each of several
subpopulations, and the goal is to assess
whether the possible outcomes have the
same distribution within each subpopulation

• in other cases the data consists of a single
sample from a population where each
sampled item is cross classified according to
two categorical variables

• in general there are r rows and c columns
and the counts are
Xij , i = 1, . . . , r, j = 1, . . . , c

Column Variable

Row Variable 1 . . . c Total

1 X11 . . . X1c X1.

. . . . . . . . . . . . . . .

r Xr1 . . . Xrc Xr.

Total X.1 . . . X.c X..

Test of Homogeneity

Example: Consider the following table, in
which the cells in 5 samples were classified
as to type

Cell type

Sample P Q R S T U Total

1 3 24 18 13 32 10 100

2 7 25 16 11 27 14 100

3 5 28 14 10 31 12 100

4 7 22 19 11 26 15 100

5 17 19 13 8 25 18 100

• we’ll call the values Xij , for i = 1, …, r and
j = 1, …, c

• in this case the row totals ri = Xi. were
fixed by the study design

• we can assume each row follows a
multinomial distribution, with number of
trials ri (all 100 in this example) and
probabilities pij, j = 1, . . . , c

• note that the probabilities sum to 1 in each
row,

$\sum_{j=1}^{c} p_{ij} = 1$

• the question of interest is whether the
distribution of cell types is the same across
all samples, that is, whether the multinomial
probabilities are the same in each row

• so the null hypothesis is

H0 : pij = pkj = p.j

for all i, j, k or, more simply

H0 : the distribution of cell types is

the same in each sample

• the alternative hypothesis is

Ha : pij ≠ pkj

for some i, j, k, or

Ha : the distribution of cell types is not

the same in each sample

• we can also say

H0 : the distributions of cell types are

homogeneous

Ha : the distributions of cell types are

not homogeneous
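To make the homogeneity question concrete, the sample proportions Xij / ri in each row can be compared by eye. The following is a minimal sketch, assuming NumPy is available; the variable names (`counts`, `row_totals`, `p_hat`) are illustrative and not part of the original notes.

```python
import numpy as np

# Counts from the cell-type table: rows are samples 1-5, columns are types P-U.
counts = np.array([
    [3, 24, 18, 13, 32, 10],
    [7, 25, 16, 11, 27, 14],
    [5, 28, 14, 10, 31, 12],
    [7, 22, 19, 11, 26, 15],
    [17, 19, 13, 8, 25, 18],
])

# Row totals r_i, fixed at 100 by the study design.
row_totals = counts.sum(axis=1)

# Sample proportions X_ij / r_i: one estimated distribution per sample.
p_hat = counts / row_totals[:, None]

print(row_totals)
print(p_hat.round(2))
print(p_hat.sum(axis=1))  # each row of proportions sums to 1
```

Under H0 the rows of `p_hat` estimate the same distribution, so large differences between rows are what the test looks for.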

Test for Association

Example: 327 people were randomly selected and
their hair colour and handedness were determined.
The results are

Hair Colour

Handedness Red Brown Other Total

Left 12 54 37 103

Right 29 75 66 170

Ambidex. 7 27 20 54

Total 48 156 123 327

• here there is a single sample, and the cells of
the multinomial distribution form a two-way
table

• the hypotheses of interest are

H0 : there is no association between

handedness and hair colour

Ha : there is an association between

handedness and hair colour

• the cells of the table follow a multinomial
distribution with probabilities pij,
i = 1, . . . , r, j = 1, . . . , c

• as with all multinomials, the probabilities
must sum to one,

$\sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} = 1$

• the null hypothesis of no association is

H0 : pij = pi.p.j

for all i, j

• here, pi. is the probability of being in the ith
row, and p.j is the probability of being in the
jth column

• recall that the joint probability of
independent events is the product of the
marginal probabilities, so the null hypothesis
assumes independence between the row and
column variables

• the alternative hypothesis is

Ha : pij ≠ pi.p.j

for some i, j

Test for Homogeneity or Association

• the analysis of the data is the same for tests
of homogeneity and tests of association
(independence)!

• this is fortunate, because the distinction
between the two is not always clear

• even if the data are collected as a single
sample, the null hypothesis of no association
can be phrased as homogeneity of
conditional distributions across the rows or
columns

– for example, with the handedness-hair
colour example we can ask whether the
distribution of hair colour is the same
for right-handed, left-handed and
ambidextrous people

• the test statistic is the goodness of fit
statistic

$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(X_{ij} - e_{ij})^2}{e_{ij}}$

• the test statistic compares, in each cell of
the table, the observed count with the count
expected if H0 is true

• the expected counts are given by

$e_{ij} = \frac{X_{i.} X_{.j}}{X_{..}} = \frac{\text{row } i \text{ sum} \times \text{column } j \text{ sum}}{\text{overall sum}}$

• $X^2$ has an approximate $\chi^2$ distribution with
(r − 1)(c − 1) degrees of freedom

• for the approximation to be valid, all
expected counts should be greater than 5

• we also assume that the data consist of (a)
random sample(s)

• the P value is $P(\chi^2_{(r-1)(c-1)} > X^2)$, and as
usual smaller P values give stronger
evidence against H0

• in general

– observed and expected close:
little evidence of association

– observed and expected far apart:
strong evidence of association
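The pieces above (expected counts, test statistic, degrees of freedom, P value) can be put together in a few lines. This is a sketch only, assuming NumPy and SciPy are available; the function name `chi_square_two_way` is hypothetical, and the data are the hair colour and handedness counts from the earlier example.

```python
import numpy as np
from scipy.stats import chi2

def chi_square_two_way(counts):
    """Return (X2, df, p_value, expected) for an r x c table of counts."""
    counts = np.asarray(counts, dtype=float)
    row = counts.sum(axis=1, keepdims=True)   # X_i. as a column vector
    col = counts.sum(axis=0, keepdims=True)   # X_.j as a row vector
    n = counts.sum()                          # X_..
    expected = row * col / n                  # e_ij = X_i. X_.j / X_..
    x2 = ((counts - expected) ** 2 / expected).sum()
    df = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    p_value = chi2.sf(x2, df)                 # P(chi-square_df > X2)
    return x2, df, p_value, expected

# Handedness (rows) by hair colour (columns).
hair = [[12, 54, 37],
        [29, 75, 66],
        [7, 27, 20]]

x2, df, p_value, expected = chi_square_two_way(hair)
print(round(x2, 3), df, round(p_value, 2))
print(bool((expected > 5).all()))  # rule of thumb for the approximation
```

The same function applies to either hypothesis, since the calculation does not depend on whether the row totals were fixed by design.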

• the expected counts are calculated as

$e_{ij} = \frac{\text{row sum} \times \text{column sum}}{\text{overall sum}} = \frac{X_{i.} X_{.j}}{X_{..}}$

where the terms have different meanings
depending on the hypothesis

• for homogeneity the row (or column) totals
are considered fixed, and the expected
counts are

$e_{ij} = \frac{r_i X_{.j}}{n} = r_i \hat{p}_{.j}$

where $n = X_{..} = \sum_{i=1}^{r} r_i$ is the overall sum

• here p̂.j = X.j/n is the pooled estimate of
the common value of p.j under the null
hypothesis, and multiplying by ri = Xi.
gives the mean or expected value

• for association, when we calculate expected
counts using

$e_{ij} = \frac{X_{i.} X_{.j}}{n}$

we are using the formula $n\hat{p}_{ij}$ under the null
hypothesis, where $\hat{p}_{ij} = \hat{p}_{i.}\hat{p}_{.j}$, with
$\hat{p}_{i.} = X_{i.}/n$ and $\hat{p}_{.j} = X_{.j}/n$

• so

$e_{ij} = n\hat{p}_{ij} = n\hat{p}_{i.}\hat{p}_{.j} = n \cdot \frac{X_{i.}}{n} \cdot \frac{X_{.j}}{n} = \frac{X_{i.} X_{.j}}{n}$
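A quick numerical check that the two readings of the expected counts agree, using the hair colour and handedness counts. This is an illustrative sketch assuming NumPy; the names `e_homogeneity` and `e_association` are made up for the comparison.

```python
import numpy as np

# Handedness (rows) by hair colour (columns).
counts = np.array([[12, 54, 37],
                   [29, 75, 66],
                   [7, 27, 20]])

n = counts.sum()
row_totals = counts.sum(axis=1)   # X_i.
col_totals = counts.sum(axis=0)   # X_.j

# Homogeneity view: e_ij = r_i * p_hat_.j, with the row totals treated as fixed.
p_col = col_totals / n
e_homogeneity = row_totals[:, None] * p_col[None, :]

# Association view: e_ij = n * p_hat_i. * p_hat_.j
p_row = row_totals / n
e_association = n * np.outer(p_row, p_col)

print(np.allclose(e_homogeneity, e_association))
print(e_association.round(2))
```

Both arrays reduce to Xi.X.j / n, which is why one calculation serves both tests.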

Examples

Cell types:

• here the null hypothesis is that the
distributions of cell types are the same for
each sample

• the data can be analyzed in Minitab, as
follows

MTB > chis c2-c7

Chi-Square Test: C2, C3, C4, C5, C6, C7

Expected counts are printed below observed counts

Chi-Square contributions are printed below expected counts

C2 C3 C4 C5 C6 C7 Total

1 3 24 18 13 32 10 100

7.80 23.60 16.00 10.60 28.20 13.80

2.954 0.007 0.250 0.543 0.512 1.046

2 7 25 16 11 27 14 100

7.80 23.60 16.00 10.60 28.20 13.80

0.082 0.083 0.000 0.015 0.051 0.003

3 5 28 14 10 31 12 100

7.80 23.60 16.00 10.60 28.20 13.80

1.005 0.820 0.250 0.034 0.278 0.235

4 7 22 19 11 26 15 100

7.80 23.60 16.00 10.60 28.20 13.80

0.082 0.108 0.563 0.015 0.172 0.104

5 17 19 13 8 25 18 100

7.80 23.60 16.00 10.60 28.20 13.80

10.851 0.897 0.563 0.638 0.363 1.278

Total 39 118 80 53 141 69 500

Chi-Sq = 23.802, DF = 20, P-Value = 0.251

• with a P value of 0.251, we have no evidence
of a difference in the distribution of cell
types among the samples
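For readers without Minitab, the same analysis can be sketched in Python. This assumes SciPy's `chi2_contingency` function as a stand-in for the Minitab command; it is not the method used in the notes, but it should give the same statistic, degrees of freedom and P value up to rounding.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Cell-type counts: rows are samples 1-5, columns are types P-U.
counts = np.array([
    [3, 24, 18, 13, 32, 10],
    [7, 25, 16, 11, 27, 14],
    [5, 28, 14, 10, 31, 12],
    [7, 22, 19, 11, 26, 15],
    [17, 19, 13, 8, 25, 18],
])

# Returns the statistic, P value, degrees of freedom and expected counts.
stat, p_value, dof, expected = chi2_contingency(counts)

# Per-cell contributions (X_ij - e_ij)^2 / e_ij, as in the Minitab printout.
contributions = (counts - expected) ** 2 / expected

print(round(stat, 3), dof, round(p_value, 3))
print(expected[0].round(2))
print(contributions.round(3))
```

Printing the contributions shows which cells drive the statistic; here the largest single contribution comes from type P in sample 5.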

Hair colour/handedness:

• the observed counts are

Hair Colour

Handedness Red Brown Other Total

Left 12 54 37 103

Right 29 75 66 170

Ambidex. 7 27 20 54

Total 48 156 123 327

• calculating proportions relative to a row or
column total, or to the overall total often
reveals interesting features of the data

• for example, division by the sample size gives

Hair Colour

Handedness Red Brown Other Total

Left .04 .17 .11 .32

Right .09 .23 .20 .52

Ambidex. .02 .08 .06 .16

Total .15 .48 .37 1.00

• from this we can see that brown is the most
common hair colour (48%) and right-handedness
the most common handedness (52%)

• or we could calculate proportions relative to
the row or column totals

• using the row total gives

Hair Colour

Handedness Red Brown Other Total

Left .12 .52 .36 1.00

Right .17 .44 .39 1.00

Ambidex. .13 .50 .37 1.00

Total .15 .48 .37 1.00

• the three distributions of hair colour appear
to be similar for left-handed, right-handed and
ambidextrous people
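The proportion tables can be produced directly from the counts. A minimal sketch, assuming NumPy; the names `overall`, `by_row` and `by_col` are illustrative.

```python
import numpy as np

# Handedness (rows: left, right, ambidextrous) by hair colour
# (columns: red, brown, other).
counts = np.array([[12, 54, 37],
                   [29, 75, 66],
                   [7, 27, 20]])

# Proportions relative to the overall total (joint distribution).
overall = counts / counts.sum()

# Proportions relative to row totals: hair colour distribution
# within each handedness group.
by_row = counts / counts.sum(axis=1, keepdims=True)

# Proportions relative to column totals: handedness distribution
# within each hair colour group.
by_col = counts / counts.sum(axis=0, keepdims=True)

print(overall.round(2))
print(by_row.round(2))
print(by_col.round(2))
```

Comparing the rows of `by_row` with one another is the informal version of the homogeneity comparison that the chi-square test formalises.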

• the expected counts are

Hair Colour

Handedness Red Brown Other Total

Left 15.12 49.14 38.74 103

Right 24.95 81.10 63.94 170

Ambidex. 7.93 25.76 20.31 54

Total 48 156 123 327

• for example e11 = 103(48)/327 = 15.12

• note that the expected counts add up to the
same row and column totals as the observed
counts

• note also that the expected counts are not
rounded to whole numbers

• the contributions to $X^2$ are

Hair Colour

Handedness Red Brown Other

Left .644 .481 .078

Right .656 .459 .066

Ambidex. .108 .060 .005

• the total test statistic is $X^2$ = 2.557 on
(3 − 1)(3 − 1) = 2 × 2 = 4 degrees of freedom

• the P value is .63 indicating that there is no
evidence against the null hypothesis of no
association between handedness and hair
colour

• the calculations can be done in Minitab using
the chisquare command

MTB > print c1-c3

ROW C1 C2 C3

1 12 54 37

2 29 75 66

3 7 27 20

DATA> chisquare c1-c3

Expected counts are printed below observed counts

C1 C2 C3 Total

1 12 54 37 103

15.12 49.14 38.74

2 29 75 66 170

24.95 81.10 63.94

3 7 27 20 54

7.93 25.76 20.31

Total 48 156 123 327

ChiSq = 0.644 + 0.481 + 0.078 +

0.656 + 0.459 + 0.066 +

0.108 + 0.060 + 0.005 = 2.557

df = 4
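An equivalent calculation outside Minitab, as a sketch only: it assumes SciPy's `chi2_contingency` and adds a check of the expected-count rule of thumb stated earlier. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Handedness (rows: left, right, ambidextrous) by hair colour
# (columns: red, brown, other).
counts = np.array([[12, 54, 37],
                   [29, 75, 66],
                   [7, 27, 20]])

stat, p_value, dof, expected = chi2_contingency(counts)

# The chi-square approximation is considered adequate when
# every expected count exceeds 5.
approximation_ok = bool((expected > 5).all())

print(expected.round(2))
print(round(stat, 3), dof, round(p_value, 2))
print(approximation_ok)
```

The smallest expected count here is about 7.93 (red hair, ambidextrous), so the rule of thumb is satisfied for this table.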