STAT/BIOSTAT 534 Statistical Computing Final Project
Adrian Dobra
adobra@uw.edu
1 Rules
This is your final examination for this course. The work you will submit needs to reflect your own understanding of the material covered in class. It is fine to talk with each other about the project. However, when you actually do the work, you should do it alone. You are free to use any code you wrote for other assignments as well as any code I shared with you. Other than that, the code you will share with me should be entirely your own.
The background material you need to use is presented in Section 2. The full description of the project is given in Section 3. Successfully implementing the algorithm described in Section 3 will give you 300 points towards your final grade. Note: you do not need any knowledge of Bayesian statistics to complete this assignment.
2 Bayesian Inference in Univariate Logistic Regression
We assume we have observed the n independent samples
\[
D = \{(y_1, x_1), \ldots, (y_n, x_n)\}.
\]
The response variable Y is binary, i.e. yi ∈ {0, 1} for i = 1, 2, . . . , n. The explanatory variable X can be continuous or discrete. We consider the univariate logistic regression model
\[
\log \frac{P(y = 1 | x)}{P(y = 0 | x)} = \beta_0 + \beta_1 x. \tag{1}
\]
Our model assumptions say that each yi follows a Bernoulli distribution with probability of success πi = P(yi = 1|xi):
\[
y_i \sim \mathrm{Ber}(\pi_i).
\]
Since the samples are assumed to be independent, the likelihood is
\[
L(\beta_0, \beta_1 | D) = \prod_{i=1}^{n} [P(y_i = 1 | x_i)]^{y_i} [1 - P(y_i = 1 | x_i)]^{1 - y_i},
\]
where
\[
\pi_i = P(y_i = 1 | x_i) = \mathrm{logit}^{-1}(\beta_0 + \beta_1 x_i) \in (0, 1).
\]
We define the logit function
\[
\mathrm{logit} : (0, 1) \longrightarrow (-\infty, +\infty), \qquad \mathrm{logit}(p) = \log \frac{p}{1 - p}.
\]
Its inverse is
\[
\mathrm{logit}^{-1} : (-\infty, +\infty) \longrightarrow (0, 1), \qquad \mathrm{logit}^{-1}(x) = \frac{\exp(x)}{1 + \exp(x)}.
\]
The log-likelihood is
\[
l(\beta_0, \beta_1 | D) = \log L(\beta_0, \beta_1 | D) = \sum_{i=1}^{n} \left( y_i \log \pi_i + (1 - y_i) \log[1 - \pi_i] \right).
\]
Simple calculations show that
\[
\frac{\partial l(\beta_0, \beta_1 | D)}{\partial \beta_0} = \sum_{i=1}^{n} [y_i - \pi_i], \qquad
\frac{\partial l(\beta_0, \beta_1 | D)}{\partial \beta_1} = \sum_{i=1}^{n} [y_i x_i - \pi_i x_i].
\]
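For concreteness, the logit transforms and the log-likelihood above translate directly into a few lines of C++. This is only an illustrative sketch: the helper names inv_logit and log_lik are arbitrary choices, and the vectors y and x stand for the observed response and explanatory columns.

#include <cmath>
#include <vector>

// The inverse logit function defined above.
double inv_logit(double t) {
    return std::exp(t) / (1.0 + std::exp(t));
}

// l(beta0, beta1 | D) = sum_i ( y_i*log(pi_i) + (1 - y_i)*log(1 - pi_i) )
double log_lik(double b0, double b1,
               const std::vector<double>& y, const std::vector<double>& x) {
    double l = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        double pi = inv_logit(b0 + b1 * x[i]);   // pi_i = P(y_i = 1 | x_i)
        l += y[i] * std::log(pi) + (1.0 - y[i]) * std::log(1.0 - pi);
    }
    return l;
}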
We assume that the logistic regression coefficients follow independent N(0,1) priors. The joint posterior distribution of β0 and β1 is therefore given by
\[
P(\beta_0, \beta_1 | D) = \frac{1}{P(D)} \exp(l^*(\beta_0, \beta_1)), \tag{2}
\]
where
\[
l^*(\beta_0, \beta_1) = -\log(2\pi) - \frac{1}{2}\left(\beta_0^2 + \beta_1^2\right) + l(\beta_0, \beta_1 | D),
\]
and
\[
P(D) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp(l^*(\beta_0, \beta_1)) \, d\beta_0 \, d\beta_1. \tag{3}
\]
We call P(D) the marginal likelihood associated with the univariate logistic regression (1). The gradient of l∗(β0, β1) is
\[
\nabla l^*(\beta_0, \beta_1) =
\begin{pmatrix}
\dfrac{\partial l^*(\beta_0, \beta_1)}{\partial \beta_0} \\[2ex]
\dfrac{\partial l^*(\beta_0, \beta_1)}{\partial \beta_1}
\end{pmatrix}.
\]
The Hessian matrix associated with $l^*(\beta_0, \beta_1)$ is
\[
D^2 l^*(\beta_0, \beta_1) =
\begin{pmatrix}
\dfrac{\partial^2 l^*(\beta_0, \beta_1)}{\partial \beta_0^2} &
\dfrac{\partial^2 l^*(\beta_0, \beta_1)}{\partial \beta_0 \partial \beta_1} \\[2ex]
\dfrac{\partial^2 l^*(\beta_0, \beta_1)}{\partial \beta_1 \partial \beta_0} &
\dfrac{\partial^2 l^*(\beta_0, \beta_1)}{\partial \beta_1^2}
\end{pmatrix}.
\]
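These derivatives translate into code as follows. The sketch below builds on the previous one (it reuses inv_logit); the struct and function names are again arbitrary, and the second derivatives used are the standard logistic-regression ones plus the prior terms of $l^*$, which you should verify yourself.

#include <cmath>
#include <vector>

const double PI = 3.14159265358979323846;

struct Grad { double d0, d1; };            // (dl*/dbeta0, dl*/dbeta1)
struct Hess { double h00, h01, h11; };     // symmetric 2x2 Hessian

// l*(b0, b1) = -log(2*pi) - (b0^2 + b1^2)/2 + l(b0, b1 | D)
double lstar(double b0, double b1,
             const std::vector<double>& y, const std::vector<double>& x) {
    double l = -std::log(2.0 * PI) - 0.5 * (b0 * b0 + b1 * b1);
    for (std::size_t i = 0; i < y.size(); ++i) {
        double pi = inv_logit(b0 + b1 * x[i]);
        l += y[i] * std::log(pi) + (1.0 - y[i]) * std::log(1.0 - pi);
    }
    return l;
}

// Gradient: the prior contributes (-b0, -b1); the likelihood terms are the
// partial derivatives of l given above.
Grad grad_lstar(double b0, double b1,
                const std::vector<double>& y, const std::vector<double>& x) {
    Grad g{-b0, -b1};
    for (std::size_t i = 0; i < y.size(); ++i) {
        double pi = inv_logit(b0 + b1 * x[i]);
        g.d0 += y[i] - pi;
        g.d1 += (y[i] - pi) * x[i];
    }
    return g;
}

// Hessian: the prior contributes -1 on the diagonal; the likelihood part is
// the usual logistic-regression curvature -sum_i pi_i (1 - pi_i) x_i x_i^T.
Hess hess_lstar(double b0, double b1,
                const std::vector<double>& y, const std::vector<double>& x) {
    Hess h{-1.0, 0.0, -1.0};
    for (std::size_t i = 0; i < y.size(); ++i) {
        double pi = inv_logit(b0 + b1 * x[i]);
        double w = pi * (1.0 - pi);
        h.h00 -= w;
        h.h01 -= w * x[i];
        h.h11 -= w * x[i] * x[i];
    }
    return h;
}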
2.1 The Newton-Raphson Algorithm
We determine the mode of the posterior distribution (2), i.e.
\[
(\widehat{\beta}_0, \widehat{\beta}_1) = \operatorname{argmax}_{(\beta_0, \beta_1) \in \mathbb{R}^2} \, l^*(\beta_0, \beta_1),
\]
by employing the Newton-Raphson algorithm. The procedure starts with the initial values $(\beta_0^{(0)}, \beta_1^{(0)}) = (0, 0)$. At iteration $k$, we update our current estimate $(\beta_0^{(k-1)}, \beta_1^{(k-1)})$ of the mode $(\widehat{\beta}_0, \widehat{\beta}_1)$ to a new estimate $(\beta_0^{(k)}, \beta_1^{(k)})$ as follows:
\[
\begin{pmatrix} \beta_0^{(k)} \\ \beta_1^{(k)} \end{pmatrix}
= \begin{pmatrix} \beta_0^{(k-1)} \\ \beta_1^{(k-1)} \end{pmatrix}
- \left[ D^2 l^*\!\left(\beta_0^{(k-1)}, \beta_1^{(k-1)}\right) \right]^{-1}
\nabla l^*\!\left(\beta_0^{(k-1)}, \beta_1^{(k-1)}\right).
\]
The procedure stops when the estimates of the mode do not change after performing a new update, i.e. $|\beta_0^{(k)} - \beta_0^{(k-1)}| < \varepsilon$ and $|\beta_1^{(k)} - \beta_1^{(k-1)}| < \varepsilon$. Here $\varepsilon$ is some small positive number, e.g. 0.0001.
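A minimal sketch of this iteration, reusing the hypothetical grad_lstar and hess_lstar helpers from the previous sketch; the 2x2 Newton system is solved with the closed-form inverse.

#include <cmath>
#include <utility>
#include <vector>

// Newton-Raphson for the mode of (2), started at (0, 0).
std::pair<double, double> posterior_mode(const std::vector<double>& y,
                                         const std::vector<double>& x,
                                         double eps = 1e-4) {
    double b0 = 0.0, b1 = 0.0;               // (beta0^(0), beta1^(0)) = (0, 0)
    for (int k = 1; k <= 1000; ++k) {        // safety cap on iterations
        Grad g = grad_lstar(b0, b1, y, x);
        Hess h = hess_lstar(b0, b1, y, x);
        double det = h.h00 * h.h11 - h.h01 * h.h01;
        // step = [D^2 l*]^{-1} grad l*, via the closed-form 2x2 inverse
        double s0 = ( h.h11 * g.d0 - h.h01 * g.d1) / det;
        double s1 = (-h.h01 * g.d0 + h.h00 * g.d1) / det;
        b0 -= s0;
        b1 -= s1;
        if (std::fabs(s0) < eps && std::fabs(s1) < eps) break;
    }
    return {b0, b1};
}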
2.2 The Laplace Approximation
Since the integral (3) cannot be explicitly calculated, we need to approximate it numerically. We calculate the marginal likelihood P(D) using the Laplace approximation, i.e.
\[
P(D) = 2\pi \exp\!\left( l^*(\widehat{\beta}_0, \widehat{\beta}_1) \right)
\left[ \det\!\left( -D^2 l^*(\widehat{\beta}_0, \widehat{\beta}_1) \right) \right]^{-1/2}, \tag{4}
\]
where $(\widehat{\beta}_0, \widehat{\beta}_1)$ is the mode of the posterior distribution (2). We note that you should not actually calculate P(D). Instead, you should calculate the logarithm of the marginal likelihood log P(D).
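On the log scale, (4) reads $\log P(D) = \log(2\pi) + l^*(\widehat{\beta}_0, \widehat{\beta}_1) - \frac{1}{2} \log \det(-D^2 l^*(\widehat{\beta}_0, \widehat{\beta}_1))$. A sketch, again reusing the hypothetical helpers (and the constant PI) defined earlier:

#include <cmath>
#include <vector>

// log P(D) via the Laplace approximation (4), on the log scale.
double log_marglik_laplace(const std::vector<double>& y,
                           const std::vector<double>& x) {
    auto [b0, b1] = posterior_mode(y, x);    // mode of (2)
    Hess h = hess_lstar(b0, b1, y, x);
    // For a 2x2 matrix A, det(-A) = det(A), so the sign can be dropped here.
    double det = h.h00 * h.h11 - h.h01 * h.h01;
    return std::log(2.0 * PI) + lstar(b0, b1, y, x) - 0.5 * std::log(det);
}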
2.3 The Metropolis-Hastings Algorithm
Sampling from the posterior distribution (2) can be done using the Metropolis-Hastings algorithm. The procedure starts with the initial values $(\beta_0^{(0)}, \beta_1^{(0)}) = (\widehat{\beta}_0, \widehat{\beta}_1)$, i.e. we start right at the mode of the distribution (2). We update the current state $(\beta_0^{(k-1)}, \beta_1^{(k-1)})$ of the Markov chain to its next state $(\beta_0^{(k)}, \beta_1^{(k)})$ as follows. We generate a candidate state $(\widetilde{\beta}_0, \widetilde{\beta}_1)$ by sampling from the bivariate normal distribution
\[
N_2\!\left( \begin{pmatrix} \beta_0^{(k-1)} \\ \beta_1^{(k-1)} \end{pmatrix},
\; -\left[ D^2 l^*(\widehat{\beta}_0, \widehat{\beta}_1) \right]^{-1} \right). \tag{5}
\]
Note that the covariance matrix of the proposal (5) is the negative of the inverse of the Hessian matrix evaluated at the mode of (2).
We accept the move to the proposed state, i.e. we set $(\beta_0^{(k)}, \beta_1^{(k)}) = (\widetilde{\beta}_0, \widetilde{\beta}_1)$, with probability
\[
\min\left\{ 1, \exp\!\left( l^*(\widetilde{\beta}_0, \widetilde{\beta}_1) - l^*(\beta_0^{(k-1)}, \beta_1^{(k-1)}) \right) \right\}. \tag{6}
\]
Otherwise the Markov chain stays at its current state, i.e. we set $(\beta_0^{(k)}, \beta_1^{(k)}) = (\beta_0^{(k-1)}, \beta_1^{(k-1)})$.
We see that the proposal distribution (5) is symmetric, i.e. the probability of proposing $(\widetilde{\beta}_0, \widetilde{\beta}_1)$ if the chain is currently in $(\beta_0^{(k-1)}, \beta_1^{(k-1)})$ is equal to the probability of proposing $(\beta_0^{(k-1)}, \beta_1^{(k-1)})$ if the chain is currently in $(\widetilde{\beta}_0, \widetilde{\beta}_1)$. As such, the proposal distribution (5) cancels when we calculate the acceptance probability (6).
The implementation of an iteration of the Metropolis-Hastings algorithm proceeds as follows. If the proposed state leads to an increase of $l^*$, i.e.
\[
l^*(\widetilde{\beta}_0, \widetilde{\beta}_1) \geq l^*(\beta_0^{(k-1)}, \beta_1^{(k-1)}),
\]
we accept the move to the proposed state and set $(\beta_0^{(k)}, \beta_1^{(k)}) = (\widetilde{\beta}_0, \widetilde{\beta}_1)$. Otherwise, if the proposed state leads to a decrease in $l^*$, i.e.
\[
l^*(\widetilde{\beta}_0, \widetilde{\beta}_1) < l^*(\beta_0^{(k-1)}, \beta_1^{(k-1)}),
\]
we sample $u$ from a Uniform(0, 1) distribution. If
\[
\log(u) \leq l^*(\widetilde{\beta}_0, \widetilde{\beta}_1) - l^*(\beta_0^{(k-1)}, \beta_1^{(k-1)}),
\]
we accept the move to the proposed state and set $(\beta_0^{(k)}, \beta_1^{(k)}) = (\widetilde{\beta}_0, \widetilde{\beta}_1)$. If
\[
\log(u) > l^*(\widetilde{\beta}_0, \widetilde{\beta}_1) - l^*(\beta_0^{(k-1)}, \beta_1^{(k-1)}),
\]
we reject the move and the chain stays at the current state, i.e. we set $(\beta_0^{(k)}, \beta_1^{(k)}) = (\beta_0^{(k-1)}, \beta_1^{(k-1)})$.
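Putting these steps together gives one possible sampler, sketched below with the same hypothetical helpers. Drawing from the proposal (5) via the Cholesky factor of the 2x2 covariance is one standard choice, not the only one.

#include <cmath>
#include <random>
#include <utility>
#include <vector>

// Metropolis-Hastings sampler for the posterior (2).
std::vector<std::pair<double, double>> mh_sample(
        const std::vector<double>& y, const std::vector<double>& x,
        int niter = 10000, unsigned seed = 534) {
    auto [m0, m1] = posterior_mode(y, x);     // start right at the mode
    Hess h = hess_lstar(m0, m1, y, x);
    // Proposal covariance Sigma = -[D^2 l*(mode)]^{-1} and its Cholesky
    // factor C (Sigma = C C^T), used to turn N(0, I) draws into (5).
    double det = h.h00 * h.h11 - h.h01 * h.h01;
    double s00 = -h.h11 / det, s01 = h.h01 / det, s11 = -h.h00 / det;
    double c00 = std::sqrt(s00);
    double c10 = s01 / c00;
    double c11 = std::sqrt(s11 - c10 * c10);

    std::mt19937 gen(seed);
    std::normal_distribution<double> stdnorm(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    double b0 = m0, b1 = m1;
    double cur = lstar(b0, b1, y, x);
    std::vector<std::pair<double, double>> chain;
    chain.reserve(niter);
    for (int k = 0; k < niter; ++k) {
        double z0 = stdnorm(gen), z1 = stdnorm(gen);
        double t0 = b0 + c00 * z0;            // candidate state
        double t1 = b1 + c10 * z0 + c11 * z1;
        double cand = lstar(t0, t1, y, x);
        // Accept if l* increases, else with probability exp(cand - cur).
        if (cand >= cur || std::log(unif(gen)) <= cand - cur) {
            b0 = t0; b1 = t1; cur = cand;
        }
        chain.emplace_back(b0, b1);
    }
    return chain;
}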
2.4 Monte Carlo Integration
You need to evaluate the integral (3) by sampling from the two independent normal priors for the regression coefficients $\beta_0$ and $\beta_1$, as follows. Simulate $(\beta_0^{(1)}, \beta_1^{(1)}), \ldots, (\beta_0^{(10000)}, \beta_1^{(10000)})$ from the bivariate normal distribution
\[
N_2\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right).
\]
The Monte Carlo estimate of (3) is given by the average
\[
\frac{1}{10000} \sum_{j=1}^{10000} \exp\!\left( l\!\left( \beta_0^{(j)}, \beta_1^{(j)} \,\middle|\, D \right) \right).
\]
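A sketch of this estimator, reusing the hypothetical log_lik helper from the first sketch. Since the average of $\exp(l(\cdot))$ can underflow, the sketch returns the estimate on the log scale via the log-sum-exp trick; that stabilization is an implementation convenience, not part of the estimator above.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Monte Carlo estimate of log P(D): average exp(l(b0, b1 | D)) over 10000
// draws from the independent N(0, 1) priors.
double log_marglik_mc(const std::vector<double>& y,
                      const std::vector<double>& x,
                      int nsim = 10000, unsigned seed = 534) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> prior(0.0, 1.0);
    std::vector<double> ll(nsim);
    for (int j = 0; j < nsim; ++j)
        ll[j] = log_lik(prior(gen), prior(gen), y, x);
    // log of the average, computed stably with the log-sum-exp trick
    double m = *std::max_element(ll.begin(), ll.end());
    double s = 0.0;
    for (double v : ll) s += std::exp(v - m);
    return m + std::log(s / static_cast<double>(nsim));
}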
3 Project Description (300 points)
The dataset you need to use is contained in the file “534finalprojectdata.txt”. This file has 148 rows (samples) and 61 columns (variables). The first 60 columns are associated with 60 explanatory variables X, while column 61 (the last column) corresponds to the binary response variable Y.
You need to write a C/C++ parallel program using the Master/Slave scheme (a skeleton sketch follows the two lists below). The Master process should proceed as follows:
- The Master should initialize an empty singly-linked list L.
- The Master sends work requests to the Slaves and should record the results of these requests in the list L.
- A work request to a Slave should contain a single integer number i between 1 and 60. That is, the Master asks a Slave to analyze the logistic regression of Y (the last column in the data) on Xi (the i-th column in the data).
- The result of a work request should contain the following numbers:
(a) the index of the explanatory variable associated with the logistic regression processed by the Slave;
(b) the estimates of the log-marginal likelihood of this logistic regression. You need to produce two estimates: the first estimate is the Laplace approximation (4), while the second estimate is based on Monte Carlo integration as described in Section 2.4.
(c) estimates of the coefficients β0 and β1 of this logistic regression. These estimates are obtained as follows:
– Use the Metropolis-Hastings algorithm to simulate 10000 samples
\[
\left\{ \left( \beta_0^{(k)}, \beta_1^{(k)} \right) : k = 1, \ldots, 10000 \right\}
\]
from the posterior distribution (2).
– The estimates for $\beta_0$ and $\beta_1$ are given by the sample means:
\[
\bar{\beta}_0 = \frac{1}{10000} \sum_{k=1}^{10000} \beta_0^{(k)}, \qquad
\bar{\beta}_1 = \frac{1}{10000} \sum_{k=1}^{10000} \beta_1^{(k)}.
\]
- The Master records the information from the Slaves in the list L such that the univariate logistic regressions appear in decreasing order of their log-marginal likelihood as estimated by Monte Carlo integration. The length of L should not be bigger than 5 at any point during the execution of the program.
- The Master outputs the contents of the list L.
- The Master tells the Slaves to die.
Very important: please include the output from the Master in your submission. This way we will know whether your program returns the correct output. You will determine the top five logistic regressions together with the estimates of their coefficients.
The Slave process should proceed as follows:
- The Slave should read in the data (just once, at initialization) and should start listening for orders from the Master.
- Once an order is received, the Slave should perform all the required calculations associated with a univariate logistic regression.
- The Slave should return the results to the Master.
- The Slave should die once the Master indicates there is no further work to be executed.
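For orientation, here is a bare-bones sketch of this Master/Slave protocol using MPI, one natural choice for the parallel framework. The message tags, the five-double result layout, and the placeholder comments are illustrative assumptions, not requirements; the real program must also read the data, run the computations of Section 2, and maintain the sorted list L of at most five regressions.

#include <mpi.h>

// Message tags for the hypothetical protocol sketched here.
const int TAG_WORK = 1, TAG_RESULT = 2, TAG_DIE = 3;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {                          // --- Master ---
        int next = 1, dummy = 0;
        // Hand an initial variable index to each Slave ...
        for (int s = 1; s < size && next <= 60; ++s, ++next)
            MPI_Send(&next, 1, MPI_INT, s, TAG_WORK, MPI_COMM_WORLD);
        // ... then collect results and reassign until all 60 are done.
        for (int done = 0; done < 60; ++done) {
            double res[5];                    // index, 2 log marg. liks, 2 beta means
            MPI_Status st;
            MPI_Recv(res, 5, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &st);
            // update the sorted, at-most-5-element list L with res here
            if (next <= 60) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                ++next;
            }
        }
        // print the contents of L here, then tell the Slaves to die
        for (int s = 1; s < size; ++s)
            MPI_Send(&dummy, 1, MPI_INT, s, TAG_DIE, MPI_COMM_WORLD);
    } else {                                  // --- Slave ---
        // read "534finalprojectdata.txt" once here, then listen for orders
        while (true) {
            int var;
            MPI_Status st;
            MPI_Recv(&var, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DIE) break;
            double res[5] = {static_cast<double>(var), 0, 0, 0, 0};
            // fill res[1..4] with the Laplace/MC estimates and beta means
            MPI_Send(res, 5, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}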