
#########################################################

#############   BTRY/STSCI 4520  ########################

#############      Final Exam    ########################

############# Due: May 20, 2018  ########################

#########################################################

# Instructions: save this file in the format <NetID>_Final.R.

# Complete each question using code below the question number.

# You need only upload this file to CMS.

# Note, we assume your working directory contains any files

# that accompany this one.

# Further note: 10% will be deducted if your file produces

# an error when run. If your code produces an error and you

# cannot find it, comment out that portion of the code and we

# will give partial credit for it.

# Do not use either the function set.seed() or rm(list=ls())

# in your code.

#### IMPORTANT INSTRUCTIONS FOR THE FINAL

## The final is to be completed as though it were in-class.

## That means

##

## 1. You must complete this work independently, without

## collaboration or external assistance. Violations will

## be treated under the academic code.

##

## 2. We will not provide office hours. We will monitor

## Piazza and provide clarifications where questions are

## unclear.

##

## 3. While we will not, in general, debug your work for you,

## we are happy to try to explain error messages if you can

## isolate the line of code that produces them and, preferably,

## reproduce them with a few-line script.

##

## [Example: R tries to interpret dimensions, but doesn’t

## always get it right.  So the code

##

##   b = matrix(2,1,1)

##   t(1:3)/b

##

##  returns an error, but (1:3)/b does not.  We think it reasonable

##  to explain something like this to you if you get a message

##

##   Error in t(1:3)/b : non-conformable arrays

##

##  along with other instances of “I know where the error is, I just

##  don’t know why R has a problem or how to fix it”]

################################################

# Question 1: Control Variates and Functionals #

################################################

# This question covers some recent ideas in Monte Carlo

# integration first published just two years ago in

#

# https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12185

# Recall that if we are interested in estimating

#

#  mu = E g(X)   where  X ~ f(x)

#

# a Monte Carlo estimate is given by

#

#  gbar =  (1/n) sum  g(Xi)   where  Xi ~ f(x)

#

# Control Variates use a function h(x) where we

# know that

#

#  E h(X) = 0  if  X ~ f(x)

#

# (If we know Eh(X) = m, then use h*(x) = h(x)-m in place of h(x).)

#

# Then if h(x) looks like g(x) we can expect that

#

#  hbar =  1/n sum h(Xi)    

#

# (which “should” be zero) might tell us something about

# how far  gbar is from mu.

# That is, if hbar is too high, it is likely that gbar is also

# too high.  In particular, we can improve gbar by

#

#  gbar2 =  gbar – alpha hbar

#

# Note that since hbar has mean zero, the expectation of gbar2 is

# still mu.  We found in class that the optimal alpha was

#

#  alpha = cov(g(X),h(X))/var(h(X))

#

# which we can calculate from the g(Xi) and h(Xi) values.
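
# For concreteness, here is a minimal sketch of these formulas on toy
# inputs (the demo.* names are illustrative only, not part of the exam):

demo.X = rnorm(1000)                            # draws from f = N(0,1)
demo.g = sin(demo.X + 0.1)                      # g(Xi)
demo.h = -exp(demo.X) + exp(1/2)                # h(Xi); E h(X) = 0 under N(0,1)
demo.alpha = cov(demo.g, demo.h) / var(demo.h)  # estimated optimal alpha
c(Vanilla = mean(demo.g),
  ControlVariate = mean(demo.g) - demo.alpha * mean(demo.h))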

# a) Write a function to calculate Vanilla and Control Variate

# estimates of gbar from a sample X. It should take arguments

# X (a sample of data)  and functions g and h, and should

# return a 2-vector giving the Vanilla and Control Variate estimates.

MCfunc1 = function(X,g,h){

}

# b) i) Use this to write a function to conduct a simulation

# study in which you set g to be  sin(x+0.1) and h to be

# -exp(x) + exp(1/2)  and generate X from a standard normal

# (the expectation of exp(X) is exp(1/2) when X is N(0,1)).

# You should have length(X) be N = 100 and conduct R = 50

# replications and return the percentage variance reduction

# due to control variates

MCsim1 = function(N,R){

}

# ii) What variance reduction do you achieve?

# c) We can combine both control variates and antithetic

# sampling by looking at

#

#  (1/n) sum (g(Xi) + g(-Xi))/2 - alpha (h(Xi) + h(-Xi))/2

#

# (note that if f(x) is symmetric about zero, X and -X

# have the same distribution).
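
# A minimal sketch of the combined estimate (illustrative demo.* names
# again); note that the antithetic averaging of Xi with -Xi is done
# before alpha is estimated:

demo.X2 = rnorm(500)
demo.gA = (sin(demo.X2 + 0.1) + sin(-demo.X2 + 0.1)) / 2      # (g(Xi)+g(-Xi))/2
demo.hA = exp(1/2) - (exp(demo.X2) + exp(-demo.X2)) / 2       # (h(Xi)+h(-Xi))/2
demo.alphaA = cov(demo.gA, demo.hA) / var(demo.hA)
mean(demo.gA) - demo.alphaA * mean(demo.hA)                   # combined estimate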

# i) Write a function to conduct a simulation as above

# but with antithetic sampling, too.  You should return

# the Vanilla Monte Carlo variance and the variance

# reduction due to each of antithetic sampling, control

# variates and both.

#

# It makes a big difference that you do the antithetic

# average before the control variate calculation.

#

# In this case use the same number N of random samples

# for antithetic sampling rather than fixing the number

# of function evaluations.

MCsim2 = function(N,R){

  return( list( Vanilla.Var = ,

                Anti.Reduction = ,

                Control.Reduction = ,

                Both.Reduction =  ) )

}

# ii) Which adds most reduction? Does the joint reduction

# exceed what you would expect after observing

# the reduction for each individual strategy?

# Bonus:  The original question had you set g(x) = sin(x)

# and h(x) = x  (we’ll use this with Control Functionals below).

# Why is this an uninformative choice of functions for the

# simulation you just conducted?

# d) The better the agreement between h and g, the more

# effective the control variates strategy will be.  Eg,

# compare the h we used in parts b and c with using

# h(x) = x.

# Oates et al. (2016) propose to approximate g non-parametrically.

#

# Specifically, we’ll let

#

#  h(x) = sum_j d_j psi_j(x)

#

# To do this, we need to know the integral of h(x) against f(x). To accomplish

# this, the authors propose using a modified form of a kernel.

# Specifically, they use the first X_1,…,X_(N/2) samples and

# define

#

#   psi_j(x) =  phi'(x-X_j;s) + phi(x-X_j;s) f'(x)/f(x)

#

# where phi(x;s) is a normal density with standard deviation s and ‘

# represents a derivative with respect to x.

#

# The reason for this choice is that

#

#  int psi_j(x)f(x)dx = int [phi'(x-X_j;s) f(x) + phi(x-X_j;s) f'(x)]dx

#                     = int [phi(x-X_j;s) f(x)]’ dx

#                     = phi(Inf,s)f(Inf) – phi(-Inf,s)f(-Inf)

#                     = 0

#

# With this choice of psi, we get the d_j by simple least-squares regression

# of  g(Xi) on  psi_j(Xi) for  i = 1,…,N/2.

#

# To carry this out you should

#    - evaluate psi_j(X_i) for the X_i in your data

#    - estimate d_j by using the matrix of psi_j(X_i) to predict g(X_i)

#      using lm.  Some coefficients will be returned as NA: set them to 0.

#    - then use these d_j to give you sum_j d_j psi_j(x) for any x
#      (a short sketch of psi_j itself follows below).
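
# As a minimal sketch: for f a standard normal, f'(x)/f(x) = -x, and
# phi'(u;s) = -(u/s^2)*phi(u;s), so a single psi_j can be written as
# (demo.psi is an illustrative name; the full hfunc is asked for below):

demo.psi = function(x, Xj, s){
  u = x - Xj
  -(u / s^2) * dnorm(u, sd = s) - x * dnorm(u, sd = s)  # phi'(x-Xj;s) - x*phi(x-Xj;s)
}
demo.psi(seq(-1, 1, by = 0.5), Xj = 0.3, s = 0.2)        # psi_j at a few points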

# i) Write a function to evaluate this h(x) for a vector of values x. It

# should have inputs

#

# x — the vector of values at which to evaluate h(x)

# X — a vector of draws from f(x)

# s — the bandwidth

# g — the function we want to approximate

#

# In this case you can assume that f(x) is a standard normal. It should

# return a vector of values of length(x).  You may use

hfunc = function(x,X,s,g){

  

}

# ii) Using g(x) = sin(x), and s = 0.2, set

xvec = seq(-2,2,by = 0.1)

# iii) plot h(x) based on 100 standard normals.

# e) i) Write a function to implement

#

#  1.  Vanilla Monte Carlo

#  2.  Control Variates, using h(x) = x

#  3.  Control Functionals

#        – Use the first half of the values X_1,…,X_N

#          to produce hfunc as in 1d

#        – Use the second half to obtain the control variate

#          estimate with this hfunc

#

# These should be based on X being N(0,1)

#

# Your function should take arguments N, g and s and return

# the three estimates above as a 3-vector

ControlFunctionals = function(N,g,s){

  

}

# ii) When g(x) = sin(x) we know that E g(X) = 0. Based on

# 50 replicates at N = 100, using s = 0.2, what is the

# expected squared error of each of your estimates?

# f) One of the claims of the paper is that we can

# achieve a faster convergence rate than (1/sqrt(N))

# because as we get more data, h(x) gets closer to g(x).

#

# i) Write a function to conduct a simulation study so that

# for each value of

Ns = c(100,200,500,1000)

# you estimate the root-average-squared-error (RMSE) of

# each of the three estimates in 1e based on R = 20

# simulations using values of s in

svec = 2/sqrt(Ns)

# that correspond to each value in Ns.  (Here we have

# smaller s so that we can approximate g better when

# we have more data points).

# You should return the matrix of RMSE values.

# CAUTION: this may take a few minutes to run.

ControlFuncSim = function(Ns,R,s){

}

# ii) Plot log(RMSE) versus log(N) for the three estimates.

# This should give you approximately a straight line with

# the order of convergence given by the slope.
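
# One way to read off the slope (toy values here, decaying exactly like
# N^(-1/2), so the fitted slope is -0.5; demo.* names are illustrative):

demo.Ns   = c(100, 200, 500, 1000)
demo.rmse = 2 / sqrt(demo.Ns)                  # stand-in for an estimated RMSE column
coef(lm(log(demo.rmse) ~ log(demo.Ns)))[2]     # slope of log(RMSE) vs log(N)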

# iii) Confirm that vanilla and control variates approaches have

# MSE = O(1/Ns); what is the order for Control Functionals?

####################################################

# Question 2: Kernel Density Estimation and        #

#             Minimum Hellinger Distance estimates #

####################################################

# In this question we will use Kernel Density Estimation

# as an intermediate step in statistical procedures.

#

# But we will first need to deal with KDE’s. Recall that

# we can estimate a nonparametric density by

#

#  fhat(x,s) =  (1/n) sum phi(x-Xi;s)

#

# (using the same notation as in Question 1).
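
# The sum over data points, for every evaluation point at once, can be
# vectorized with outer(); a minimal sketch on toy inputs (demo.* names
# are illustrative only):

demo.X = rnorm(20)                  # toy data
demo.x = seq(-3, 3, by = 0.5)       # toy evaluation points
rowMeans(outer(demo.x, demo.X, function(a, b) dnorm(a - b, sd = 0.2)))  # fhat at demo.x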

# a) i) Write a function to evaluate fhat for a vector

# of evaluation points x, using data X with bandwidth s.

# You should not need to use either for loops or apply

# statements.

kde = function(x,X,s){

   

}

# The data we will use for this question comes from a study of

# the effectiveness of a drug pyrantel used to treat parasites

# in horses.  These data record the logit of the ratio of

# the number of parasite eggs found in horse feces before

# and after treatment.

eggrate = c(-1.1653, -0.7538, -1.3218, -2.3394, -1.9766, -1.8718, -1.5041)

# ii) Using s = 0.2 produce a plot of the kernel density estimate

# on values of x from -2.5 to -0.5.

# b) In order to apply this well, we need to choose a

# bandwidth. To do this we’ll use cross-validation.

# Specifically, we'll define a score for each bandwidth s as

#

# CV(s) =  sum log  fhat^(-i)(Xi;s)

#

# That is, for each Xi, we leave Xi out of the data

# and estimate fhat^(-i) without Xi, then we see how high

# fhat^(-i) thinks the density is at Xi.  The higher

# the density, the better we are at predicting where future

# data might be.  Hence the optimal s is the one that maximizes

# CV(s).
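
# The leave-one-out idea, sketched with an explicit loop on toy data
# (demo.* names are illustrative; the loop-free version is the bonus):

demo.X  = rnorm(10)
demo.s  = 0.3
demo.cv = 0
for(i in seq_along(demo.X)){
  fi = mean(dnorm(demo.X[i] - demo.X[-i], sd = demo.s))  # fhat^(-i)(Xi; s)
  demo.cv = demo.cv + log(fi)
}
demo.cv                                                   # CV(s) for this toy sample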

#

# i) Write a function to evaluate CV(s) for each s. Bonus 2 points

# for avoiding for loops and apply statements.

kde.cv = function(X,s){

}

# ii) Use this function to decide on the optimal s for the eggrate

# data above with possible bandwidths given in

ss = seq(0.1,1,by=0.1)

# to give the value sbest that maximizes the cross-validated

# likelihood.

sbest =

# c) We will also need to be able to simulate data from fhat.

# To do this, we observe that we can express fhat as being exactly

# the density that corresponds to the following

#

#   1. Choose one of the Xi at random

#   2. Simulate a new z from a N(Xi,s) distribution.

#
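
# A minimal sketch of this two-step recipe, vectorized over draws
# (demo.* names are illustrative):

demo.X   = rnorm(10)                                           # toy data
demo.sim = rnorm(1000,
                 mean = sample(demo.X, 1000, replace = TRUE),  # step 1: pick an Xi
                 sd   = 0.3)                                   # step 2: z ~ N(Xi, s)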

# i) Write a function to simulate N observations from fhat given

# X and s. 2 bonus points for avoiding for loops

kde.sim = function(N,X,s){

}

# ii) Use this to simulate 1000 data points from your estimate

# above (with optimal bandwidth). Draw a histogram of these points

# and add the original data as vertical red lines.

# d) One of the advantages of having a density estimate is

# that there are more ways that you can compare your data to

# a parametric family of models.

#

# In particular, one measure of how different two distributions

# are is Hellinger distance

#

#    HD(f,g) = int ( sqrt(f(x)) – sqrt(g(x)) )^2 dx

#            = 2 – 2 int sqrt(f(x)*g(x)) dx

#

# For a family of densities f(x,theta) and an estimate

# fhat(x), the minimum Hellinger Distance estimate (MHDE)

# is obtained by maximizing

#

#    A(theta) =  int sqrt( f(x,theta)*fhat(x) ) dx

#

# In our case, we will use a normal family for the mean

#

#  f(x,theta) = dnorm(x,mean=theta)

#

# and our KDE.

#

# We still need to approximate the integral. Here we will

# sample from fhat and use the approximation

#

#  A(theta,fhat) = int sqrt( f(x,theta)/fhat(x) ) fhat(x) dx

#           ~=  (1/N) sum  sqrt( f(Xi,theta)/fhat(Xi) )

#

# where the Xi are sampled from fhat as in part c.
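
# A minimal sketch of this approximation for a single theta (demo.* names
# are illustrative; fhat and the draws from it are built as in parts a and c):

demo.X    = rnorm(10)                                          # toy 'data'
demo.s    = 0.3
demo.samp = rnorm(500, mean = sample(demo.X, 500, replace = TRUE), sd = demo.s)
demo.fhat = rowMeans(outer(demo.samp, demo.X, function(a, b) dnorm(a - b, sd = demo.s)))
mean(sqrt(dnorm(demo.samp, mean = 0.5) / demo.fhat))           # A(theta) at theta = 0.5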

# i) Write a function to calculate A(theta), for f(x,theta) given

# by dnorm(x,mean=theta) using the original data X and a sample

# Xsamp generated from fhat with bandwidth s.

Afn = function(theta,X,Xsamp,s){

}

# ii) And a further function to find the optimum value based on

# Data X, bandwidth s and sample size N.

#

# You may use any optimization function you like (Giles

# used ‘optimize’) and can assume the maximum lies in

# (-10,10).

HD.opt = function(X,s,N){

}

# iii) Use this to obtain the value that minimizes Hellinger

# distance for the data using the optimal bandwidth

# above based on 1000 Monte Carlo samples.

hellinger.opt =

# e) i) One reason to look for the MHDE is robustness

# to outliers. To see this add an additional data point

# to your data set with values in

O = seq(-1,10,by=0.2)

# plot the value of the MHDE and the average

# value in your data versus the values in O above. Keep

# the optimal bandwidth you calculated in part b.

# ii) Which of these (average or MHDE) do you think is

# more reasonable?

# f) One of the reasons that Hellinger distance has

# received some attention is that it is supposed to be

# as precise as the mean, at least asymptotically.

#

# However, for a given sample, it can be difficult to

# work out how precise it is.  One suggestion was

# to use it within a Bayesian analysis. See

#

#    https://link.springer.com/article/10.1007%2Fs11749-014-0360-z

#

# The idea is to replace the posterior with

#

#    exp( n*A(theta) ) pi(theta)

#

# where pi(theta) is the prior.

#

# i) Using a N(0,10) prior, write a function to conduct a random

# walk MCMC algorithm. The function should take in

#

#    X — your data

#    s — bandwidth for the KDE

#    N — number of Monte Carlo samples to evaluate A

#    nmc — length of the MCMC chain

#    sig — random walk variance

#

# and return a set of posterior samples and the acceptance

# rate. You should keep the same Monte Carlo samples for

# each evaluation.
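
# The random walk Metropolis recipe, sketched on a stand-in log-posterior
# (a toy N(0,1) target rather than n*A(theta) + log pi(theta); demo.* names
# are illustrative):

demo.logpost = function(theta) dnorm(theta, log = TRUE)   # toy log-posterior
demo.nmc   = 200
demo.sig   = 1                                            # random walk step sd
demo.theta = numeric(demo.nmc)
demo.acc   = 0
for(k in 2:demo.nmc){
  prop = demo.theta[k-1] + rnorm(1, sd = demo.sig)        # propose a step
  if(log(runif(1)) < demo.logpost(prop) - demo.logpost(demo.theta[k-1])){
    demo.theta[k] = prop; demo.acc = demo.acc + 1         # accept
  } else {
    demo.theta[k] = demo.theta[k-1]                       # reject: keep current value
  }
}
demo.acc / (demo.nmc - 1)                                 # acceptance rate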

HD.MCMC = function(X,s,N,nmc,sig){

return( list( samples = ,  acc =  ) )

}

# ii) Experimentally choose sig to give you between 30% and 40%

# acceptance rate and produce a histogram of the posterior

# based on 1000 samples after 1000 samples burn-in and thinning

# to every 5th sample.

# Bonus: in fact, MCMC can be used with a stochastic likelihood.

# That is, it still works if you use new Monte Carlo samples

# each time.  Apply this using N = 100, but draw new samples

# for each step in the MCMC. How high can you get your

# acceptance rate?

HD.MCMC2 = function(X,s,nsamples,nmc,sig){

return( list( samples = ,  acc =  ) )

}

######################################

# Question 3: Random Effects Methods #

######################################

# In the additional exercises after Lecture 18, we

# saw how Poisson random effects models can be fit

# with MCMC. Here we’ll look at extensions of

# this and Frequentist alternatives.

# In this question we will consider a logistic model

# with random effects. That is, we observe outcomes that

# are either 0 or 1 and write out

#

#  P(Y=1|X,Z) = plogis( beta0 + beta1*x + z)

#

# where x is a covariate and z is a subject-specific

# random effect.

#

# We can turn this into an over-all probability by

#

# P(Y|X,z) = P(Y=1|X,Z)^Y*(1-P(Y=1|X,Z))^(1-Y)

#

# which evaluates to P(Y=1|X,Z) when Y = 1 and

# (1-P(Y=1|X,Z)) when Y = 0.

# Here we regard Z as a latent normal random variable that

# applies to several data points. By way of example, the

# data in

toenail = read.table('toenail.txt', header=TRUE)

toenail$Subject = as.factor(toenail$Subject)

# gives an indicator of nail health (good or bad) for patients

# under treatment for foot fungus. This is recorded at 7 visits

# for each of 12 Subjects, at times given by the number of Months.

#

# Here we will write Y_ij for the i-th measurement from the j-th

# patient and say

#

#  P(Y_ij=1|X_ij,Z_j,beta0,beta1) = plogis( beta0 + beta1*X_ij + Z_j)

#

# where X_ij is the time of the i-th visit of subject j. Z_j

# is a different value for each subject so that some subjects

# have consistently worse feet than others depending on the size

# of Z_j.

#

# We don’t get to see the Z_j, so we will specify that

#

#   Z_j ~ N(0, sigma^2)

#

# The likelihood of the observed data in group j is

# then

#

# P(Y_1j,...,Y_7j|X_j) = int prod_i( P(Y_ij|X_ij,Z_j,beta0,beta1) ) phi(Z_j;0,sigma) dZ_j

#

# and the likelihood for the parameters beta0, beta1, sigma is

#

# l(beta0,beta1,sigma) = sum_j log P(Y_1j,…,Y_7j|X_j)

# a) When only one random effect applies to each observation we can

# evaluate the integral above numerically.  To do this, we will

# use a Gauss-Hermite approximation. This can be obtained through

# the ecoreg package  (you will need to install it).

library(ecoreg)

# The function gauss.hermite produces Gauss-Hermite points and weights:

gh = gauss.hermite(21)

# Where we can approximate  \int f(x) phi(x) dx by

#

#   sum(gh[,’Weights’]*f(gh[,’Points’])

#

# Note that this is for a standard N(0,1), to make this N(0,s^2)

# you need to multiply gh[,’Points] by s.
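
# For example, following the rule above, E[X^2] under N(0, s^2) with s = 2
# (true value 4) can be approximated as:

demo.s = 2
sum(gh[,'Weights'] * (demo.s * gh[,'Points'])^2)   # should be very close to 4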

# i) Write a function to evaluate the negative of the log likelihood

# of the toenail data for any beta0, beta1 and sigma given

# the vector theta = c(beta0,beta1,sigma)

logistic.nll = function(theta,data){

}

# ii) Hence find the values of beta0, beta1 and sigma that

# maximize the log likelihood. You may use optim or any other

# optimizer that you wish.

# Good starting values are

theta = c(-0.2,-0.2,1.2)   

# b) As an alternative, we can use Monte Carlo integration for

# each random effect. To do this, we’ll replace the Gauss-Hermite

# points and weights with a sample of 1000 N(0,1) random values

# and weights 1/1000.  To change this to integrating with

# respect to a N(0,s^2) you can make the same changes as for

# Gauss-Hermite quadrature.  Use the same 1000 points for each

# group.

#

# i) Write a function to calculate the negative log likelihood based

# on this approximation. You may hard-code the use of N = 1000

# Monte Carlo points.

logistic.nll.mc = function(theta,data){

  

}

# ii) How would you go about optimizing this likelihood? Do so.

# c) Given a likelihood, we can always carry out an MCMC

# analysis to find a posterior. In this case we will

# fix sigma to be 1.2 to make the computation simpler.

# Setting your prior on beta0 and beta1 to be

#

# P(beta0,beta1) = N(0,10)*N(0,10)

#

#

# i) Carry out a random walk MCMC for 5000 steps using the

# likelihood you found in part a using a standard deviation

# of

sds = c(1,0.2)

# for the steps in each dimension of your random walk. Use the same

# initial value for beta0 and beta1 as in part a.

# ii) How far is your expected a-posteriori estimator from the maximum

# likelihood estimate in part a?

# iii) Produce 95% credible intervals for beta0 and beta1

# d) Rather than doing the integration numerically, MCMC also

# allows you to simply include the random effects as additional

# parameters. That is, we can write the joint distribution of

# all of the data, the Z's, and beta0 and beta1 (still keeping

# sigma fixed at 1.2) as

#

#  prod_ij P(Y_ij|X_ij,Z_j,beta0,beta1) * prod_j P(Z_j) * P(beta0) P(beta1)

#

# and then conduct a random walk on beta0, beta1, Z_1, ..., Z_K.
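
# On the log scale the joint above is a sum of terms; a minimal sketch on
# toy data (demo.* names are illustrative; sigma is fixed at 1.2 and the
# N(0,10) priors are read here as having standard deviation 10):

demo.y = c(1, 0, 1); demo.x = c(0, 3, 6); demo.j = c(1, 1, 2)  # toy outcomes, times, subjects
demo.logjoint = function(beta, Z){
  p = plogis(beta[1] + beta[2] * demo.x + Z[demo.j])           # P(Y_ij = 1 | X_ij, Z_j)
  sum(dbinom(demo.y, 1, p, log = TRUE)) +                      # data term
    sum(dnorm(Z, 0, 1.2, log = TRUE)) +                        # random effects term
    sum(dnorm(beta, 0, 10, log = TRUE))                        # priors on beta0 and beta1
}
demo.logjoint(c(-0.2, -0.2), c(0, 0))                          # evaluate at the starting values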

# i) Write a function to calculate this log posterior using the same

# prior as in part c.

joint.posterior = function(theta,Z,data){

}

# and carry out a random walk MCMC using step sizes with standard deviation

sds2 = sds/4

# for theta and steps of standard deviation 0.5 for Z.

# Initialize Z to be all zero.

# ii) Have your expected a posteriori estimates for beta0

# and beta1 moved outside the credible intervals found

# in part c?

# iii) Plot a histogram of the posterior distribution of Z_1

# based on every 5th draw from the chain after dropping the first

# 1000 draws

# e) i) Can we calculate the conditional distribution of Z_1

# as a frequentist? Here is one last way of generating random

# variables that takes inspiration from importance sampling

# in Monte Carlo integration.

#

# Specifically, suppose we want to sample from g(x), but can

# readily draw from f(x). The idea is to draw X_1,…,X_N but

# to then re-sample these X’s with weights

#

#  W_i = g(X_i)/f(X_i)

#

# (the sample() function allows you to specify a vector ‘prob’

# of weights). This means that the density of the resulting

# X’s is

#

#   P(X selected)*P(X in dx) = W f(x) = g(x)/f(x) * f(x) = g(x)
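
# A minimal sketch of the resampling trick: drawing from f = N(0,1) but
# reweighting so the resampled points behave like draws from g = N(1,1)
# (an illustrative target; demo.* names are not part of the exam):

demo.draws  = rnorm(2000)                                        # draws from f
demo.w      = dnorm(demo.draws, mean = 1) / dnorm(demo.draws)    # weights g(Xi)/f(Xi)
demo.resamp = sample(demo.draws, 2000, replace = TRUE, prob = demo.w)
mean(demo.resamp)                                                # should be close to 1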

#

# To get a posterior distribution of Z1, first, use the beta0,

# beta1 and sigma from part a.  Then

#

#   1. Generate 1000 samples of Z from N(0,sigma^2) (you may use rnorm)

#   2. Resample the Z with replacement, but with weights given by

#        prod(  P(Y_i1 | X_i1, Z,beta0,beta1) )

#

# Write a function to obtain these samples using beta0, beta1

# and sigma calculated in part a but using the data in

Subj1 = toenail[1:7,]

ranef.post = function(N,theta,Subj1){

}

# ii) Produce a histogram based on your 1000 resulting samples. How does

# this compare to the posterior distribution of Z_1 in part d?

# iii) This approach to generating random numbers is known as sequential

# Monte Carlo (SMC) and is particularly useful when you want to keep

# updating distributions — you can take a sample from f_1(x) to f_2(x)

# and then update again to f_3(x).  This arises, for example, if you

# are tracking a noisily observed process.

#

# Here, however, we are only generating from one distribution. Is it a

# good or a bad idea? Why?

# f)  Describe how you would go about constructing a bootstrap distribution

# for this model, accounting for the fact that you might have different

# subjects next time.  You do not need to carry this out.

#######################

# BONUS: Optimization #

#######################

# This repeats the final exercise in Lab 7 (which the labs

# did not get to).  Additional 10 marks for an answer within

# 10% of the true values

# The data in FhN.csv come from an example that Giles was

# working on for his most recent book. They arise from

# a model of the way neurons transmit signals and are given

# by an ordinary differential equation.

#

# This is a simplification of a more complex model that won

# Alan Hodgkin and Andrew Huxley a Nobel prize in 1963 (so

# it’s fitting that Giles is writing this lab while in

# Stockholm).

#

# The data is in

FHN = read.csv('fhn.csv', header=FALSE)

# It contains a set of times

t = FHN[,1]

# and two-columns of values

V = FHN[,2:3]

# so we can look at

matplot(t, V, type='l')

# These values were produced using the following function,

# which relies on the deSolve package to solve ODE equations

library(deSolve)

FHN.fn = function(t,p){
  x0 = p[4:5]
  fhnfunode = function(t,x,p)
  {
    r = x
    r[1] = p[3]*(x[1] - x[1]^3/3 + x[2])
    r[2] = -(x[1] - p[1] + p[2]*x[2])/p[3]
    return(list(dx=r))
  }
  res = lsoda(x0,t,fhnfunode,p)
  return( res[,2:3] )
}

# This depends on a vector p of 5 parameters,

# for instance, if you set

p = c(1,1,3,0.5,-1.2)

# you get

matplot(t, FHN.fn(t,p), type='l')

# The game here is to find the values of p that generated

# the data. There is no noise in the data, so the right

# p should produce V exactly.

# To do this, use the optim function in R, and experiment

# with different optimization solvers and/or starting points.

# You will need to define an objective function — I recommend

# squared error and trying to find where it is zero.

# Two things might help:

#

#  1. I will tell you the true parameters are in the ranges

#      [-1,1],  [-1,1],   [0.5,5],  [-2,2],  [-2, 2]

#

#  2. You will find that at bad values of p, FHN.fn will

#  produce an error. This will cause your optimization to

#  stop. However, you can get around this with the

#  try-catch construction.

#

#   Here’s an example

for(i in 1:3){

  try({

      m = b/8

  })

  print(i)

}

# Although there is an error (we have never defined b), the for

# loop is not interrupted.  You can put any number of lines of

# code inside the {} in try().

#

# You might use this to try evaluating at some p, for example,

# and return Inf, or a really big number if that p produces

# an error.
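
# For example, a squared-error objective can be wrapped in try() so that a
# failed ODE solve returns a large value instead of stopping the search
# (demo.sse is an illustrative name; it assumes t and V from fhn.csv are
# in the workspace):

demo.sse = function(p){
  out = 1e10                              # large fallback value if FHN.fn fails
  try({
    fit = FHN.fn(t, p)
    out = sum((fit - as.matrix(V))^2)     # squared error to the observed trajectories
  }, silent = TRUE)
  out
}
# e.g. optim(p, demo.sse) can then explore parameter values freely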

#

# You may search for good starting values, but describe how you

# decided to start from there, and include code to get from them

# to your final estimate.