Tongue cancer rates by DHB
Tongue cancer rates by DHB
Patrick Graham
09/09/2021
Contents
1 Tongue cancer rates by DHB 1
1.1 Setup the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Model 1: Common underlying rate across all DHBs . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Model 2: DHB specific underlying rates and inference for functions of these rates . . . . . . . 6
1 Tongue cancer rates by DHB
We consider rates of tongue cancer by DHB, based on cancers registered in 2018.
1.1 Setup the data
cancerdata <- read.csv("data/tongue_cancer.csv",header=TRUE)
str(cancerdata)
## 'data.frame': 20 obs. of 6 variables:
## $ Year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ Desc : chr "Tongue - C01-C02" "Tongue - C01-C02" "Tongue - C01-C02" "Tongue - C01-C02" ...
## $ Sex : chr "AllSex" "AllSex" "AllSex" "AllSex" ...
## $ Demography: chr "Northland" "Waitemata" "Auckland" "Counties Manukau" ...
## $ Cases : int 7 23 14 12 19 3 10 4 6 3 ...
## $ Population: int 185800 615100 493300 567000 421000 113400 249700 49500 172300 121300 ...
#only a small dataset so why not just print it to have a look
cancerdata
## Year Desc Sex Demography Cases Population
## 1 2018 Tongue - C01-C02 AllSex Northland 7 185800
## 2 2018 Tongue - C01-C02 AllSex Waitemata 23 615100
## 3 2018 Tongue - C01-C02 AllSex Auckland 14 493300
## 4 2018 Tongue - C01-C02 AllSex Counties Manukau 12 567000
## 5 2018 Tongue - C01-C02 AllSex Waikato 19 421000
## 6 2018 Tongue - C01-C02 AllSex Lakes 3 113400
## 7 2018 Tongue - C01-C02 AllSex Bay of Plenty 10 249700
## 8 2018 Tongue - C01-C02 AllSex Tairawhiti 4 49500
## 9 2018 Tongue - C01-C02 AllSex Hawke's Bay 6 172300
## 10 2018 Tongue - C01-C02 AllSex Taranaki 3 121300
## 11 2018 Tongue - C01-C02 AllSex MidCentral 9 66700
1
## 12 2018 Tongue - C01-C02 AllSex Whanganui 4 181700
## 13 2018 Tongue - C01-C02 AllSex Capital & Coast 12 153900
## 14 2018 Tongue - C01-C02 AllSex 7 315900
## 15 2018 Tongue - C01-C02 AllSex Wairarapa 1 46800
## 16 2018 Tongue - C01-C02 AllSex 4 155500
## 17 2018 Tongue - C01-C02 AllSex West Coast 3 32400
## 18 2018 Tongue - C01-C02 AllSex Canterbury 20 560800
## 19 2018 Tongue - C01-C02 AllSex South Canterbury 1 60900
## 20 2018 Tongue - C01-C02 AllSex Southern 15 307400
rates <- cancerdata$Cases / cancerdata$Population
summary(rates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.642e-05 2.409e-05 3.524e-05 4.371e-05 4.605e-05 1.349e-04
#may make more sense if expressed as rate per 100000
rates100000 <- rates*100000
summary(rates100000)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.642 2.409 3.524 4.371 4.605 13.493
rates.df <- data.frame(cancerdata$Demography,rates100000,rates,
cancerdata$Cases,cancerdata$Population)
names(rates.df) <- c("Dhb","rate_100000","rate","Cases", "Population")
rates.df
## Dhb rate_100000 rate Cases Population
## 1 Northland 3.767492 3.767492e-05 7 185800
## 2 Waitemata 3.739229 3.739229e-05 23 615100
## 3 Auckland 2.838030 2.838030e-05 14 493300
## 4 Counties Manukau 2.116402 2.116402e-05 12 567000
## 5 Waikato 4.513064 4.513064e-05 19 421000
## 6 Lakes 2.645503 2.645503e-05 3 113400
## 7 Bay of Plenty 4.004806 4.004806e-05 10 249700
## 8 Tairawhiti 8.080808 8.080808e-05 4 49500
## 9 Hawke's Bay 3.482298 3.482298e-05 6 172300
## 10 Taranaki 2.473207 2.473207e-05 3 121300
## 11 MidCentral 13.493253 1.349325e-04 9 66700
## 12 Whanganui 2.201431 2.201431e-05 4 181700
## 13 Capital & Coast 7.797271 7.797271e-05 12 153900
## 14 2.215891 2.215891e-05 7 315900
## 15 Wairarapa 2.136752 2.136752e-05 1 46800
## 16 2.572347 2.572347e-05 4 155500
## 17 West Coast 9.259259 9.259259e-05 3 32400
## 18 Canterbury 3.566334 3.566334e-05 20 560800
## 19 South Canterbury 1.642036 1.642036e-05 1 60900
## 20 Southern 4.879636 4.879636e-05 15 307400
#What is the overall rate
rawrate <- sum(cancerdata$Cases) /sum(cancerdata$Population)
rawrate*100000 #3.634198
## [1] 3.634198
2
1.2 Model 1: Common underlying rate across all DHBs
Let Yi and Ni be the number of cancer cases and population at risk for the ith DHB. Initially we consider a
simple model that assumes the underlying rate is the same for all DHBS:
[Yi|λ] ∼ Poisson(Yi|Niλ), i = 1 . . . , n (1)
so, for i ∈ 1, . . . , n, E(Yi|λ) = Niλ and E(Yi/Ni = λ). That is, λ is the expected tongue cancer rate, often
referred to as the underlying rate, in contrast to the observable rate ri = Yi/Ni. The populations at risk, or
“exposures,” {Ni, i ∈ 1, . . . , n} are regarded as known constants. It is arguable whether we should make the
conditioning on these values explicit by including them in the model definition and write
[Yi|Ni, λ] ∼ Poisson(Niλ), i = 1 . . . , n.
To simplify notation a little we will just regard the Ni as extra background information we know in advance
of observing the cancer counts and so we won’t explicitly condition on them.
Model (1) assumes the expected rate is the same for all DHBs. This means the observable rates ri = Yi/Ni
vary over DHBs only because of random variation and not because of variation in the underlying rates. The
model is a bit simplistic but can be a base to compare other models against. Here we are just using it as a
simple example to explore some aspects of Monte Carlo simulation of posterior distributions.
Let Y = (Y1, . . . , Yn). We assume conditional independence of the Yi given θ so the likelihood is
p(Y|λ) =
n∏
i=1
Poisson(Yi|Niλ).
The conjugate prior for the rate parameter of a Poisson model is Gamma(a, b). We set a = 3/100, 000, b = 1
and our prior expectation is therefore E(θ) = a/b = 3/100, 000 Thus, our prior guess at the underlying rate
is 3 per 100,000, but the weight we put on that prior guess is equivalent to the information that would be
provided by learning the number of cases for a DHB with a population of size 1 (impossible of course, because
the rate for such a DHB would be 0 or 1, but its a useful analogy.)
The posterior for λ is
p(λ|Y) =
∏n
i=1 Poisson(Yi|Niλ)Gamma(λ|3/100, 000, 1)∫ ∏n
i=1 Poisson(Yi|Niλ)Gamma(λ|3/100, 000, 1) dλ
∝
[
n∏
i=1
(Niλ)Yi exp(−Niλ)
] [
λ((3/100,000)−1) exp(−λ)
]
=
[
n∏
i=1
NYii
][
n∏
i=1
λYi
][
exp(λ
n∑
i=1
Ni)
] [
λ((3/100,000)−1)) exp(−λ)
]
∝ λ(
∑
n
i=1
Yi+((3/100,000)−1)) exp−
(
λ
(
n∑
i=1
Ni + 1
))
∝ Gamma
(
λ|
(
n∑
i=1
Yi + (3/100, 000)
)
, (Ni + 1)
)
(2)
Code to compuet the posterior for the assumed common rate for all DHBs is straightforward.
##Gamma prior
a <- 3/100000
b <- 1 #Like saying prior evidence is equivalent to
#one extra tiny DHB of size 1
3
totcases <- sum( cancerdata$Cases)
totpop <- sum(cancerdata$Population)
##update to get parameters of the posterior, using conjugacy
apost <- a + totcases
bpost <- b + totpop
##Compute posterior summaries
postmean <- apost/bpost
post_median <- qgamma(0.5,shape=apost,rate = bpost)
postmean * 100000
## [1] 3.634198
post_median * 100000
## [1] 3.627357
q025 <- qgamma(0.025,shape=apost,rate=bpost)
q975 <- qgamma(0.975,shape=apost,rate=bpost)
exact_quantiles <- 100000*c(q025,post_median,q975)
exact_quantiles
## [1] 3.118513 3.627357 4.188762
# Simulation approach - first look at modest size Monte Carlo sample
post_lambda100 <- rgamma(n=100,shape=apost,rate=bpost)
##check quantiles
post_quantiles100 <-
quantile(post_lambda100,probs=c(0.025,0.5,0.975))
post_quantiles100
## 2.5% 50% 97.5%
## 3.107372e-05 3.636848e-05 4.154266e-05
exact_quantiles
## [1] 3.118513 3.627357 4.188762
#check_mean
post_mean100 <- mean(post_lambda100)
100000*post_mean100 #3.662318
## [1] 3.65006
exact_mean <- apost / bpost
100000*exact_mean #3.634198
## [1] 3.634198
#check standard deviation
post_sd100 <- sd(post_lambda100)
The Monte Carlo error for the posterior mean is
4
MCerror <- post_sd100 / sqrt(100)
MCerror
## [1] 3.020686e-07
which is tiny. It makes more sense when multiplied by 100,000, as per the rates themselves. Recall
Var(Cθ) = C2Var(θ)
so
sd(Cθ) = Csd(θ)
100000 * MCerror #0.02695065,
## [1] 0.03020686
which is still pretty small. The MC mean is about 1 MC standard error from the exact mean.
Now, see what happens for a bigger posterior sample.
post_lambda1000 <- rgamma(n=1000,shape=apost,rate=bpost)
post_quantiles1000 <- quantile(post_lambda1000,probs=c(0.025,0.5,0.975))
##Compare true and simulation results
exact_quantiles
## [1] 3.118513 3.627357 4.188762
100000*post_quantiles100
## 2.5% 50% 97.5%
## 3.107372 3.636848 4.154266
100000*post_quantiles1000
## 2.5% 50% 97.5%
## 3.119403 3.618105 4.210571
#tail quantiles looking pretty good by the time Monte Carlo
#simulation size reaches 1000
#MC error for nsim=1000
post_sd1000 <- 100000*sd(post_lambda1000)
post_sd1000
## [1] 0.2733795
MC_error1000 <- post_sd1000/sqrt(1000)
MC_error1000
## [1] 0.00864502
post_mean1000 <- 100000*mean(post_lambda1000)
post_mean1000
## [1] 3.636131
100000*exact_mean
## [1] 3.634198
5
So the true mean is about on MC standard error from the exact mean. The MC error is fairly trivial though
and represents 1/sqrt(1000) = 3.2% of the posterior standard deviation
1.3 Model 2: DHB specific underlying rates and inference for functions of these
rates
We now turn our attention to Monte Carlo simulation for more challenging estimands (things to be estimated).
We let the underlying rates vary by DHB and address questions such as the rank of each DHB and the
probability that the underlying rate in each DHB is the largest of all DHBs. In reality, any such comparisons
should take account of other factors that vary by DHB and affect cancer rates, such as age structure. However,
to keep the illustration relatively simple we ignore such factors here. Consequently, the comparisons by DHB
presented below should not be taken too seriously. Further analysis would be needed to rigorously compare
tongue cancer rates by DHB.
Our model is now
[Yi|λi] ∼ Poisson(Niλi), i = 1, . . . , n (3)
which differs from (1) by allowing each DHB to have its own underlying rate parameter λi in contrast to
the common underlying rate assumed in (1). Assuming conditional independence over DHBs and letting
λ = (λ1, . . . λn)
′
), the likelihood is now
p(Y|λ) =
n∏
i=1
Poisson(Yi|Niλi). (4)
If we assume a priori independence,then p(λ) =
∏n
i=1 p(λi). We will assume independent Gamma priors
for the underlying tongue cancer rates, and, in particular, assume that each DHB parameter has the same
Gamma((3/100, 000), 1) prior as adopted for the assumed common λ in model (1). Therefore the prior for
the underlying rates is
p(λ) =
n∏
i=1
Gamma(λi|(3/100, 000), 1). (5)
From (4) and (5) the joint posterior for the underlying rate parameters is
p(λ|Y) =
∏n
i=1 Poisson(Yi|Niλi)
∏n
i=1 Gamma(λi|(3/100, 000), 1)∫ ∏n
i=1 Poisson(Yi|Niλi)
∏n
i=1 Gamma(λi|(3/100, 000), 1) dλ
∝
n∏
i=1
Poisson(Yi|Niλi)×Gamma(λi|(3/100, 000), 1)
∝
n∏
i=1
Gamma (λi|(Yi + (3/100, 000)), (Ni + 1)) (6)
where the last line (6) follows fron the conjugacy of the Poisson and Gamma distributions.
So, we can generate the joint posterior distribution for the underlying rates by drawing from independent
Gamma distributions. As well as summarising the posterior for the underlying rates for each DHB, we
can compare rates acroos DHBs and compute such this as the posterior ditribuion of each DHBs rank and
Pr(λi > λj ,∀j 6= i|Y), i.e the posterior probability that the underlying rate for DHB i is greater than the
underlying rate for all other DHBs.
fulla_post <- a + cancerdata$Cases #vector
fullb_post <- b + cancerdata$Population #vector
fulla_post
6
## [1] 7.00003 23.00003 14.00003 12.00003 19.00003 3.00003 10.00003 4.00003 6.00003
## [10] 3.00003 9.00003 4.00003 12.00003 7.00003 1.00003 4.00003 3.00003 20.00003
## [19] 1.00003 15.00003
fullb_post
## [1] 185801 615101 493301 567001 421001 113401 249701 49501 172301 121301 66701 181701
## [13] 153901 315901 46801 155501 32401 560801 60901 307401
## rgamma is partially vectorised; Easiest to loop
## over simulations and on each iteration generate the vector of lambda
##lambda values for the 20 DHBs
## also need to work out the maximum and rank for each set of lambdas
##generated
M <- 1000 ##number of draws from the posterior
n <- length(rates) #number of groups - DHBs in this case
##Set-up structures for storing output
post_fullambda <- matrix(nrow=M,ncol=n )
post_max <- matrix(nrow=M,ncol=n)
post_rank <- matrix(nrow=M,ncol=n)
for (i in 1:M ) {
fullambda <- rgamma(n,shape=fulla_post,rate=fullb_post)
ranks <- rank(fullambda)
ismax <- (ranks == max(ranks) )
post_fullambda[i,] <- fullambda
post_rank[i,] <- ranks
post_max[i,] <- ismax
}
##check results
##posterior quantiles for each DHB
fullpost_quantiles <- apply(post_fullambda,MARGIN=2,FUN=quantile,
probs=c(0.025,0.5,0.975))
fullpost_quantiles.df <- data.frame(rates.df$Dhb,t(100000*fullpost_quantiles))
fullpost_quantiles.df <-
cbind(fullpost_quantiles.df,rates.df$Cases)
names(fullpost_quantiles.df) <- c("DHB","q025","q50","q975","cases")
fullpost_quantiles.df
## DHB q025 q50 q975 cases
7
## 1 Northland 1.65444626 3.517522 6.920025 7
## 2 Waitemata 2.42320310 3.676191 5.359995 23
## 3 Auckland 1.51521882 2.796301 4.408050 14
## 4 Counties Manukau 1.06150977 2.096230 3.403149 12
## 5 Waikato 2.70014759 4.426788 6.961326 19
## 6 Lakes 0.63930037 2.367682 6.586044 3
## 7 Bay of Plenty 1.95621243 3.798187 6.804720 10
## 8 Tairawhiti 2.34942450 7.542662 17.904782 4
## 9 Hawke's Bay 1.23543386 3.275601 6.657502 6
## 10 Taranaki 0.48299559 2.211004 6.215087 3
## 11 MidCentral 6.21342678 12.854694 24.599970 9
## 12 Whanganui 0.62397582 2.005451 4.756890 4
## 13 Capital & Coast 3.72194130 7.517346 12.664989 12
## 14 0.87653985 2.104554 4.044580 7
## 15 Wairarapa 0.05356417 1.520950 7.653331 1
## 16 0.68726132 2.390353 5.757225 4
## 17 West Coast 1.88841422 8.491304 21.066409 3
## 18 Canterbury 2.18543470 3.519093 5.370977 20
## 19 South Canterbury 0.04188125 1.075816 5.969569 1
## 20 Southern 2.79727183 4.788989 7.546623 15
##posterior quantiles for each DHB's rank
fullpost_ranks_quantiles.df <-
apply(post_rank,MARGIN=2,FUN=quantile,
probs=c(0.025,0.5,0.975))
fullpost_ranks_quantiles.df <-
data.frame( t(apply(post_rank,MARGIN=2,FUN=quantile,
probs=c(0.025,0.5,0.975)) ) )
fullpost_ranks_quantiles.df$Dhb <- rates.df$Dhb
fullpost_ranks_quantiles.df
## X2.5. X50. X97.5. Dhb
## 1 3.000 11 18 Northland
## 2 5.975 12 16 Waitemata
## 3 3.000 8 15 Auckland
## 4 1.000 5 11 Counties Manukau
## 5 8.000 14 18 Waikato
## 6 1.000 6 17 Lakes
## 7 4.000 12 18 Bay of Plenty
## 8 6.000 18 20 Tairawhiti
## 9 2.000 10 17 Hawke's Bay
## 10 1.000 6 17 Taranaki
## 11 17.000 20 20 MidCentral
## 12 1.000 5 15 Whanganui
## 13 12.000 18 20 Capital & Coast
## 14 1.000 5 13
## 15 1.000 3 18 Wairarapa
## 16 1.000 6 16
## 17 5.000 18 20 West Coast
## 18 6.000 11 16 Canterbury
## 19 1.000 2 17 South Canterbury
## 20 8.000 15 18 Southern
8
names(fullpost_ranks_quantiles.df)[1:3] <- c("q025","q50","q975")
fullpost_ranks_quantiles.df
## q025 q50 q975 Dhb
## 1 3.000 11 18 Northland
## 2 5.975 12 16 Waitemata
## 3 3.000 8 15 Auckland
## 4 1.000 5 11 Counties Manukau
## 5 8.000 14 18 Waikato
## 6 1.000 6 17 Lakes
## 7 4.000 12 18 Bay of Plenty
## 8 6.000 18 20 Tairawhiti
## 9 2.000 10 17 Hawke's Bay
## 10 1.000 6 17 Taranaki
## 11 17.000 20 20 MidCentral
## 12 1.000 5 15 Whanganui
## 13 12.000 18 20 Capital & Coast
## 14 1.000 5 13
## 15 1.000 3 18 Wairarapa
## 16 1.000 6 16
## 17 5.000 18 20 West Coast
## 18 6.000 11 16 Canterbury
## 19 1.000 2 17 South Canterbury
## 20 8.000 15 18 Southern
##posterior probability that rate in each DHB is the maximum
fullpost_max <- colMeans(post_max)
fullpost_max.df <- data.frame(rates.df$Dhb,fullpost_max)
names(fullpost_max.df) <- c("Dhb","prob")
fullpost_max.df
## Dhb prob
## 1 Northland 0.003
## 2 Waitemata 0.000
## 3 Auckland 0.000
## 4 Counties Manukau 0.000
## 5 Waikato 0.000
## 6 Lakes 0.000
## 7 Bay of Plenty 0.001
## 8 Tairawhiti 0.113
## 9 Hawke's Bay 0.001
## 10 Taranaki 0.000
## 11 MidCentral 0.630
## 12 Whanganui 0.000
## 13 Capital & Coast 0.053
## 14 0.000
## 15 Wairarapa 0.003
## 16 0.001
## 17 West Coast 0.191
## 18 Canterbury 0.000
## 19 South Canterbury 0.003
## 20 Southern 0.001
9
##What about probability in the top 5
intop5 <- (post_rank >= 16)
Prtop5 <- colMeans((intop5) ) Prtop5.df <- data.frame(rates.df$Dhb,Prtop5) Prtop5.df ## rates.df. 5 ## 1 Northland 0.148 ## 2 Waitemata 0.075 ## 3 Auckland 0.003 ## 4 Counties Manukau 0.000 ## 5 Waikato 0.269 ## 6 Lakes 0.067 ## 7 Bay of Plenty 0.178 ## 8 Tairawhiti 0.768 ## 9 Hawke's Bay 0.123 ## 10 Taranaki 0.051 ## 11 MidCentral 0.993 ## 12 Whanganui 0.021 ## 13 Capital & Coast 0.887 ## 14 0.003 ## 15 Wairarapa 0.088 ## 16 0.041 ## 17 West Coast 0.787 ## 18 Canterbury 0.060 ## 19 South Canterbury 0.037 ## 20 Southern 0.401 We can see that very few of the DHBS are definitively in the top 5. We would be most confident about Mid-central, West Coast and Taranaki ranking in the top 5. The fact that we could only confidently assert that three of the 20 DHBS are in the top 5 of all DHBs, illustrates the uncertainty in the estimation and the difficulties that would be faced if some treatment or screening programme was to be targeted at the DHBs with the highest rates. However, the data analysed here pertain to a single year, 2018. Analysing more years of data would give more stable picture of variation by DHB in tongue cancer rates, though adjustment for differences in population structure between DHBs would be necessary before any firm conclusions could be drawn from an analysis of this sort. Later in the course we will consider models that are a compromise between model (1) and (3). 10 Tongue cancer rates by DHB Setup the data Model 1: Common underlying rate across all DHBs Model 2: DHB specific underlying rates and inference for functions of these rates