Generalized linear models
Bayesian Statistics
Statistics 4224/5224, Spring 2021
April 1, 2021
The following notes summarize Section 4.3 of Bayesian Statistical Methods, by Brian J. Reich and Sujit K. Ghosh.
For a more in-depth discussion of Bayesian methods for generalized linear models, see Chapter 16 of Bayesian Data Analysis, by Andrew Gelman et al.
Generalized linear models
Multiple linear regression assumes that the response variable is Gaussian and thus the mean response can be any real number.
Many analyses do not conform to this assumption; for example, binary responses with support {0, 1}, and count data with support {0,1,2,…}. Clearly these data are not Gaussian and their mean cannot be any real number.
The generalized linear model (GLM) extends linear regression concepts to handle these non-Gaussian outcomes.
The basic steps of a GLM are (1) determine the support of the response and select an appropriate parametric family, and (2) link the covariates to the parameters that define this family.
Let the linear combination of the covariates be ηi = β1xi1 + · · · + βkxik .
The linear predictor ηi can take any value in (−∞, ∞) depending on xi1,…,xik.
The function that links the linear predictor with a parameter is called the link function. If the parameter in the likelihood for the response is μi, then the link function g satisfies
g(μi) = ηi .
The link function must be an invertible function that is well-defined for all permissible values of the parameter.
Link functions are not unique, and must be selected by the user.
Bayesian fitting of a GLM requires selecting the prior and com- puting the posterior distribution.
The priors for the regression coefficients discussed for Gaussian data can be applied to GLMs.
The posterior distributions for GLMs are usually too complicated to derive in closed form; however, much of the intuition developed with Gaussian linear models carries over to GLMs.
With non-Gaussian responses the full conditional distributions for the βj are usually not conjugate and Metropolis sampling must be used.
We will use Stan to carry out the MCMC sampling.
Binary data
Binary outcomes yi ∈ {0, 1} occur frequently when the result of an experiment is recorded only as an indicator of a success or failure.
Binary variables must follow a Bernoulli distribution.
The Bernoulli distribution has one parameter, the success probability Pr(yi = 1) = μi ∈ (0, 1).
Since the parameter is a probability, the link function must have input range (0,1) and output range (−∞,∞).
Logistic regression
Logistic regression uses the logit link function, defined as
g(μ) = logit(μ) = log( μ/(1 − μ) ) .
The logit link function converts the success probability μ first to the odds of success, μ/(1 − μ) > 0, then to the log-odds, which can be any real number.
The logistic regression model is written

yi|β ∼ indep Bernoulli(μi) where logit(μi) = ηi = β1xi1 + · · · + βkxik .
The inverse logit, or logistic function, is

g−1(η) = eη/(1 + eη) ,

and so the model can also be expressed as

yi|β ∼ indep Bernoulli( eηi/(1 + eηi) ) .

A Stan template model is

data {
  int<lower=0> n;               // number of observations
  vector[n] x;                  // covariate
  int<lower=0, upper=1> y[n];   // binary response
}
parameters {
  real alpha;   // intercept
  real beta;    // slope
}
model {
  // logit link: bernoulli_logit(eta) is bernoulli(inv_logit(eta))
  y ~ bernoulli_logit(alpha + beta * x);
}
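As written, the template leaves alpha and beta with Stan's implicit flat priors. The Gaussian priors mentioned earlier can be added explicitly; a minimal sketch, where the normal(0, 10) scale is an illustrative weakly informative choice rather than anything prescribed in these notes:

model {
  alpha ~ normal(0, 10);   // illustrative Gaussian priors on the coefficients
  beta ~ normal(0, 10);
  y ~ bernoulli_logit(alpha + beta * x);
}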
Because the log odds that yi = 1 are linear in the covariates, βj is interpreted as the increase in the log odds corresponding to an increase of one unit in xj with all other covariates held fixed.
Equivalently, with all other covariates held fixed, increasing xj by one multiplies the odds by eβj.
Probit regression
There are many possible link functions from (0,1) to (−∞,∞). In fact, the quantile function (inverse cdf) of any continuous random variable with support (−∞,∞) would suffice.
The link function in logistic regression is the quantile function of the logistic distribution.
In probit regression, the link function is the Gaussian quantile function Φ−1, giving the model

yi|β ∼ indep Bernoulli(μi) where μi = Φ(ηi) ,

where Φ is the standard normal cdf.
Unfortunately, the regression coefficients βj in probit regression do not have a nice interpretation such as their log-odds interpretation in logistic regression.
Probit regression is equivalent to specifying latent variables ui ∼ indep Normal(ηi,1) and assuming that yi = 1 indicates that ui exceeds a threshold, ui > 0.
For example, ui might represent a patient’s blood pressure and yi is the corresponding binary indicator that the patient has high blood pressure (defined as exceeding 0). Then
Pr(yi = 1) = Pr(ui > 0) = 1 − Φ(0 − ηi) = Φ(ηi) ,

so

Pr(yi = 1|β) = Φ(β1xi1 + · · · + βkxik) .
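In Stan, the probit model is a one-line change to the earlier template; a sketch, assuming the same data and parameters blocks, where Phi is Stan's standard normal cdf:

model {
  // Phi() applies the standard normal cdf elementwise,
  // so Phi(alpha + beta * x) is the vector of success probabilities
  y ~ bernoulli(Phi(alpha + beta * x));
}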
Count data
Random variables with support yi ∈ {0, 1, 2, . . .} often arise as the number of events that occur in a time interval or spatial region.
We will focus on modeling the mean as a function of the covariates.
The link between the linear predictor and the mean must ensure that E(yi) = μi > 0. A natural link function is the log link,
log(μi) = β1xi1 + ··· + βkxik .
In this model, the mean is multiplied by eβj if xj increases by one
with all other covariates remaining fixed.
Unlike binary data, specifying the mean does not completely determine the likelihood for count data.
Poisson regression
The Poisson regression model is
yi|β ∼ indep Poisson(μi) where μi = eβ1xi1+···+βkxik .
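A Stan sketch of this model, assuming a single covariate as in the logistic template; poisson_log applies the log link internally:

data {
  int<lower=0> n;
  vector[n] x;
  int<lower=0> y[n];   // counts
}
parameters {
  real alpha;
  real beta;
}
model {
  // poisson_log(eta) is poisson(exp(eta)), i.e., the log link
  y ~ poisson_log(alpha + beta * x);
}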
A critical assumption of the Poisson model is that the mean and variance are equal, that is,
E(yi|β) = Var(yi|β) = μi .
The distribution of a count is over-dispersed (under-dispersed)
if its variance is greater than (less than) its mean.
If over-dispersion is present, then the Poisson model is inappropriate and another model should be considered, so that the posterior accurately reflects all sources of variability.
Negative binomial regression
One approach to accommodating over-dispersion is to incorporate gamma random variables ei ∼ iid gamma(m, m), with mean 1 and variance 1/m, for which
yi|λi, ei ∼ indep Poisson(λiei) .
Marginally over ei, yi follows the negative binomial distribution

yi|λi, m ∼ indep NegBinomial(m, qi) ,

with size m and success probability qi = m/(λi + m).
The size m need not be an integer, but if it is and we envision a sequence of independent Bernoulli trials each with success prob- ability qi, then yi can be interpreted as the number of failures that occur before the mth success.
More important for handling over-dispersion is that E(yi|λi, m) = λi and Var(yi|λi, m) = λi + λi²/m .
The parameter m controls over-dispersion.
If m is large, then ei ≈ 1 and the model reduces to the Poisson regression model with Var(yi) ≈ E(yi); if m is close to zero, then ei has large variance and thus Var(yi) > E(yi).
The over-dispersion parameter m can be given a gamma prior.
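A Stan sketch, again assuming a single covariate; the gamma(0.1, 0.1) hyperparameters are an illustrative vague choice, not one prescribed in these notes. Stan's neg_binomial_2_log distribution is parameterized by the log mean and the size m, so its variance is λ + λ²/m as above:

data {
  int<lower=0> n;
  vector[n] x;
  int<lower=0> y[n];
}
parameters {
  real alpha;
  real beta;
  real<lower=0> m;   // over-dispersion (size) parameter
}
model {
  m ~ gamma(0.1, 0.1);   // illustrative gamma prior on m
  // mean lambda = exp(alpha + beta * x), variance lambda + lambda^2/m
  y ~ neg_binomial_2_log(alpha + beta * x, m);
}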
Zero-inflated Poisson
Another common deviation from the Poisson distribution is an excess of zeros.
For example, say that yi is the number of fish caught by visitor i to a state park.
It may be that most of the yi are zero because only a small proportion (say π) of the visitors fished, but the distribution of yi for those that did fish is Poisson with mean λ.
The pmf corresponding to this scenario is

p(y|λ, π) = (1 − π) + π Pois(0|λ)   if y = 0
p(y|λ, π) = π Pois(y|λ)             if y > 0

where Pois(·|λ) is the Poisson pmf.
This is equivalent to the two-stage model

yi|zi, λ ∼ Poisson(ziλ) with zi ∼ Bernoulli(π) ,

where zi is the latent indicator that visitor i fished; thus the mean of yi is zero for the non-fishers with zi = 0.
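Stan cannot sample the discrete latent zi directly, so the standard approach marginalizes them out of the likelihood. A sketch with no covariates, where theta plays the role of π:

data {
  int<lower=0> n;
  int<lower=0> y[n];   // fish caught by each visitor
}
parameters {
  real<lower=0, upper=1> theta;   // probability a visitor fished (pi)
  real<lower=0> lambda;           // Poisson mean for visitors who fished
}
model {
  for (i in 1:n) {
    if (y[i] == 0)
      // a zero comes from a non-fisher or a fisher who caught nothing
      target += log_sum_exp(log1m(theta),
                            log(theta) + poisson_lpmf(0 | lambda));
    else
      // a positive count can only come from a fisher
      target += log(theta) + poisson_lpmf(y[i] | lambda);
  }
}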