CS代写 MAST20005/MAST90058"

title: “Week 3 Lab, MAST20005/MAST90058”
author: “School of Mathematics and Statistics, University of Melbourne”
date: “Semester 2, 2022”
institute: “University of Melbourne”

fontsize: 11pt
classoption: a4paper
html_document:
toc_float:
collapsed: false

**Goals:** (i) Explore data numerically and graphically; (ii) Getting familiar
with probability, density calculations and random number generation; (iii)
Parameter estimation and basic numerical optimization in R.

**Data:** In this lab we will analyse the Tasmanian Rainfall data
(`tasmania.csv`), obtained from the Australian Bureau of Meteorology.
Observations represent the maximum daily rainfall from 1995 to 2014 recorded by
54 weather stations in Tasmania.

“`{r echo=FALSE}
set.seed(1125)

# Exploratory data analysis

Load the dataset in R by using the command below. **You will need to copy the
file to your current working directory, or otherwise set your working directory
to the folder with the file (see the previous week’s lab notes).**
tasmania <- read.csv("tasmania.csv") # load data dim(tasmania) # check data dimension names(tasmania[, 1:5]) # names of the first 5 columns year <- tasmania[, 1] # create a vector for year s1 <- tasmania[, 2] # create a vector for station 1 (Burnie) s2 <- tasmania[, 3] # create a vector for station 2 (Cape Grim) Explore the distribution of extreme rainfall in Burnie and Cape Grim by computing some basic summary statistics. summary(s1) # 5-number summary plus sample mean sd(s1) # sample standard deviation IQR(s1) # interquartile range quantile(s1, type = 6) # sample percentiles using Type 6 approximations quantile(s1, type = 7) # sample percentiles using Type 7 approximations summary(s2) sd(s2, na.rm = TRUE) IQR(s2, na.rm = TRUE) The sample mean and median suggest that the location for the distribution `s1` is larger than that of `s2`. One observation is missing for `s2` (corresponding to year 2014). Note that we need to use the option `na.rm = TRUE` to remove missing values (shown as `NA` in R). Measures of spread are obtained by computing the sample interquartile range (IQR) and sample standard deviation. The variability in `s2` is slightly smaller than that of `s1`. Next, explore the data by graphic summaries. The following gives an annotated density histogram for `s1` and a smooth density estimate: ```{r fig.height=4.5, fig.width=6} hist(s1, freq = FALSE, xlab = "Extreme rainfall (Burnie, Tas)", col = 8) smooth.density = density(s1) # fits a smooth curve lines(smooth.density, lty = 2, lwd = 2, col = 2) # draws the curve By adding the argument `nclass` in `hist` you can control the number of bins for the histogram (e.g.\ try to add the argument `nclass = 15` or `breaks = 15`). Note that this is treated as a suggestion only; R will try to choose a similar number of bins while ensuring that the breakpoints are at 'pretty' numbers. From the above plot it is clear that the sample distribution for `s1` is not quite symmetric. Compare the main features of two or more distributions we can draw multiple boxplots on the same plotting window. ```{r fig.height=4, fig.width=6} boxplot(s1, s2, horizontal = TRUE, names = c(" ", "Cape Grim"), col = c("yellow", "orange")) The following computes and plots empirical cdfs: ```{r fig.height=4, fig.width=6} ecdf1 <- ecdf(s1) ecdf2 <- ecdf(s2) plot(ecdf1) plot(ecdf2, col = 2, add = TRUE) The QQ plot compares the empirical distribution of the data against some theoretical distribution. The following compares the sample distribution with the theoretical normal distribution: ```{r fig.height=4, fig.width=6} qqnorm(s1, main = "Normal QQ plot for S1") qqline(s1) Although the central part of the data distribution is compatible with the normality assumption, note that the right tail deviates from the straight line. Probability theory suggests that extremes such as rainfall maxima follow the so-called Extreme Value distribution with inverse cdf $\mu + \sigma F^{-1}(p)$, where $F^{-1}(p) = -\log(-\log(p))$ is the inverse cdf of the standard EV distribution. ```{r fig.height=4, fig.width=6} Finv <- function(p) {-log(-log(p))} # quantile function p <- (1:20) / 21 y <- sort(s1) # order statistics x <- Finv(p) # theoretical quantiles plot(x, y, ylab = "Sample quantiles", xlab = "EV quantiles") fit <- lm(y ~ x) # this computes the "line of best fit" # (more details later in the semester) abline(fit) # this plots the "line of best fit" From the last QQ plot the EV model seems to be more appropriate than the normal model, since the points in EV QQ plot are a little closer to the straight line compared to the previous normal QQ plot. As an exercise, repeat the same analysis for other weather stations. # Distributions and random numbers We look at some of the basic operations associated with probability distributions. There are a large number of probability distributions available, but we only look at a few. If you would like to know what distributions are available you can do a search using the command: ```{r eval=FALSE} help.search("distribution") Here we give details about the commands associated with the normal distribution and briefly mention the commands for other distributions. The functions for different distributions are very similar where the differences are noted below. To get a full list of the distributions available in R you can use the following command: ```{r eval=FALSE} help(Distributions) For every distribution there are four commands. The commands for each distribution are determined by a prefix indicating the functionality: `d-` returns the value of the probability density function (or probability mass function for discrete data), `p-` returns the cumulative density function, `q-` returns the inverse cumulative density function (quantiles), and `r-` returns randomly generated numbers. For the commands below, we use the normal distribution, but similar calculation can be applied to other distributions. Given a set of values `dnorm` returns the value of the pdf at each point. If you only give it the points, it assumes you want to use a mean of zero and standard deviation of one. There are options to use different values for the mean and standard deviation. Try the following: ```{r fig.keep="none"} dnorm(0) * sqrt(2 * pi) dnorm(0, mean = 4, sd = 2) dnorm(c(-1, 0, 1)) x <- seq(-20, 20, by = 0.1) y <- dnorm(x) plot(x, y) y <- dnorm(x, mean = 2.5, sd = 0.5) plot(x, y) The second function we examine is `pnorm()`. Given a number or a list it computes the probability that a normally distributed random number will be less than that number (i.e.\ it returns the normal cdf). It accepts the same options as `dnorm()`: ```{r fig.keep="none"} pnorm(0) # lower tail probability (cdf) pnorm(1, lower.tail = FALSE) # upper tail probability pnorm(0, mean = 2, sd = 3) x <- seq(-20, 20, by = 0.1) y <- pnorm(x) plot(x, y) y <- pnorm(x, mean = 3, sd = 4) plot(x, y) The next function we look at is `qnorm()` which is the inverse of `pnorm()`. The idea behind `qnorm()` is that you give it a probability, and it returns the number whose cumulative distribution matches the probability. For example, try: ```{r fig.keep="none"} qnorm(c(0.25, 0.5, 0.75), mean = 1, sd = 2) # quartiles for N(1,2) x <- seq(0, 1, by = 0.05) y <- qnorm(x) plot(x, y) y <- qnorm(x, mean = 3, sd = 2) plot(x, y) y <- qnorm(x, mean = 3, sd = 0.1) plot(x, y) The last function we examine is the `rnorm()` function which can generate random numbers whose distribution is normal. The argument that you give it is the number of random numbers that you want, and it has optional arguments to specify the mean and standard deviation: ```{r fig.keep="none"} rnorm(4, mean = 3, sd = 3) rnorm(4, mean = 3, sd = 3) y <- rnorm(200) y <- rnorm(200, mean = -2) # Maximum likelihood estimation In this task, we will look at maximum likelihood estimation. In simple cases the MLE is available in closed form. For example, if observations were iid from $\mathrm{N}(\mu, \sigma^2)$, then we already know that the MLE for $\mu$ and $\sigma^2$ are $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = (n-1)S^2/n$. Thus, the ML estimates for the extreme rainfall data can be quickly computed as: mu.hat = mean(s1) n = length(s1) sigma.hat = sqrt((n - 1) * var(s1) / n) Probability theory says that a better model for maxima is the extreme value (EV) distribution, or Gumbel distribution, with pdf \[ f(x \mid \mu, \sigma) = \frac{1}{\sigma} \exp\left\{-\frac{x - \mu}{\sigma}\right\} \exp\left[-\exp\left\{-\frac{x - \mu}{\sigma} \right], \quad x \in \mathbb{R}, \, \mu \in \mathbb{R}, \, \sigma > 0. \]
The log-likelihood function for this model is,
\[ \ell(\mu, \sigma) = – n \log\sigma
– \sum_i\frac{x_i – \mu}{\sigma}
– \sum_i\exp\left\{-\frac{x_i – \mu}{\sigma}
\right\}. \]
Unfortunately, as it is often the case in practice, there is no closed form
solution for the MLE and maximisation has to be carried out numerically.

There are number of routines available to maximize the log-likelihood function.
One simple approach is given by the function `fitdistr()` in the
`MASS` package, which can be used to fit a number of models (see the
help page for `fitdistr()` to see the available distributions). Since
the Gumbel distribution is not in the default list of distributions, we must
write its pdf before using `fitdistr()`.

library(MASS)

# Prepares the Gumbel pdf:
dgumbel <- function(x, mu, sigma) exp((mu - x) / sigma - exp((mu - x) / sigma)) / sigma # Fits the Gumbel distribution gumbel.fit <- fitdistr(x = s1, densfun = dgumbel, start = list(mu = 50, sigma = 10)) gumbel.fit Note that `fitdistr()` requires 3 main arguments: data, a pdf (or pmf) and a starting point to use for the optimisation algorithm. Choosing suitable starting parameter values is crucial to ensure a good solution. If preliminary estimates are available you should consider them as starting points. For the Normal, log-Normal, geometric, exponential and Poisson distributions, `fitdistr()` uses closed-form MLEs. For example: normal.fit <- fitdistr(x = s1, densfun = "normal") normal.fit For all other distributions, `fitdistr()` maximises the likelihood function numerically using the optimisation routine `optim()`. In practical applications, we frequently write the negative log-likelihood from scratch and directly use `optim()` or some other optimisation routine, depending on the likelihood at hand. Most numerical optimisation routines assume that you want to minimise a function. Note that maximising the log-likelihood $\ell$ is equivalent to minimising the negative log-likelihood function $-\ell$. neg.llk <- function(theta) { # negative log-likelihood mu <- theta[1] sigma <- theta[2] out <- -sum(log(dgumbel(s1, mu, sigma))) return(out) fit <- optim(c(50, 10), neg.llk) # fits MLEs theta.hat <- fit$par # returns estimates The arguments in `optim()` are the vector of starting values and the function to be minimised. A number of other tuning parameters can be used depending on the optimisation method specified in `optim()` (thus the slightly different result compared to that from `fitdistr()` is due to a different convergence tolerance definition). The function returns a number of useful outputs, including final parameter estimates and value of the minimised negative log-likelihood, and information about convergence (to see them type `fit`). A detailed explanation can be found in `Value` in `help(optim)`. Various classic optimisation algorithms are available within this function (see `Details` and references in `help(optim)`). When the parameters are known to be in some interval, optimisation can be carried out using the method `method = "L-BFGS-B"` for `optim()`, or the function `nlminb()`, which allow us to specify upper and lower bound for each parameter. Next, carry out a visual check on the fitted models. The following plots the fitted normal and Gumbel models over the histogram of the data: ```{r fig.height=4, fig.width=6} # Write fitted normal and Gumbel pdfs pdf1 <- function(x) dnorm(x, mean = mu.hat, sd = sigma.hat) pdf2 <- function(x) dgumbel(x, mu = theta.hat[1], sigma = theta.hat[2]) # Plot data and fitted models hist(s1, freq = FALSE, col = "gray", main = NULL, xlab = "x", xlim = c(0,100)) curve(pdf1, from = 0, to = 100, col = 2, lty = 2, lwd = 2, add = TRUE) curve(pdf2, from = 0, to = 100, col = 1, lty = 1, lwd = 2, add = TRUE) From the plot, the Gumbel distribution appears to be a better fit for the data compared to the normal distribution. # Method of moments estimation In this task we compute point estimates using the method of moments (MM). Assume that the data are generated by $X \sim \mathrm{Gumbel}(\mu, \sigma)$ with pdf given in the previous section. The first two moments of this distribution are: \begin{aligned} \mathbb{E}(X) &= a \sigma + \mu \\ \mathbb{E}(X^2) &= b \sigma^2 + 2 a \sigma \mu + \mu^2 \end{aligned} where $a = 0.577215$ and $b= 1.978$. Let $\bar{X} = \sum_i X_i /n$ and $M_2 = \sum_i X^2_i /n$ be the first two sample moments. The method of moment estimators for $\mu$ and $\sigma$ are: \begin{aligned} \widetilde{\sigma} &= \sqrt{\frac{M_2 - \bar{X}^2}{b - a^2}} \\ \widetilde{\mu} &= \bar{X} - a \widetilde{\sigma} \end{aligned} Compute estimates for the extreme rainfall data recorded at the Burnie station: x.bar <- mean(s1) x2.bar <- mean(s1^2) a <- 0.577215 b <- 1.978 sigma.tilde <- sqrt((x2.bar - x.bar^2) / (b - a^2)) sigma.tilde mu.tilde <- x.bar - a * sigma.tilde Note that the MM estimates are quite close to the maximum likelihood estimates obtained previously (above), but the MM estimates are much quicker to compute. How could you compare the accuracy of the two estimators? # Exercises 1. Compare the maximum rainfall in Marrawah, Low Head and Dover. Do this both numerically and graphically. # Insert answer here... 2. Create a box plot the compares the maximum rainfall from all 54 weather stations. (Hints: you might need to consult the in-built documentation for the `boxplot()` function; remember that a data frame is also a list.) # Insert answer here... 3. Which two stations have very different rainfall measurements to the others? Can you explain why that might be? (What is special about these two stations?) Plot histograms for both of these stations. Fit the Gumbel distribution to the data for these two stations, using maximum likelihood. # Insert answer here... 4. Use a simulation to explore the bias and variance of the MM estimators for the Gumbel distribution. (Hint: the R package `evd` has a function `rgumbel()` which is useful here.) # Insert answer here... 5. Do question 1b(ii) from the tutorial problems. In addition to calculating the MLE, also plot the likelihood function. # Insert answer here... 6. See question 1c(iii) from the tutorial problems. Plot the log-likelihood function for the sample given in the hint. Now do it for a larger sample (generate your own data). # Insert answer here... 7. See question 3c from the tutorial problems. For the $X$ sample, verify the given statistics and plot the likelihood function. Calculate the MLE using the formula derived in that question. Also calculate the MLE by using `fitdistr()`. # Insert answer here... 8. See question 4 from the tutorial problems. Plot the pdf for $\theta = 3$, using the already available function in R for the pdf of an exponential distribution (do not define your own). Do some simulations to verify that $\bar{X}$ is unbiased for $\theta$ and that its variance is $\theta^2 / n$ (do this for $n = 20$ and $\theta = 3$). For the given sample, calculate the MLE by using `fitdistr()`. # Insert answer here... 程序代写 CS代考加微信: powcoder QQ: 1823890830 Email: powcoder@163.com

Related Posts