
Point estimation
(Module 2)
Statistics (MAST20005) & Elements of Statistics (MAST90058) Semester 2, 2022
1 Estimation & sampling distributions
2 Estimators


3 Method of moments
4 Maximum likelihood estimation
Aims of this module
• Introduce the main elements of statistical inference and estimation, especially the idea of a sampling distribution
• Show the simplest type of estimation: that of a single number
• Show some general approaches to estimation, especially the method of maximum likelihood
1 Estimation & sampling distributions
Motivating example
On a particular street, we measure the time interval (in minutes) between each car that passes:
2.55 2.13 3.18 5.94 2.29 2.41 8.72 3.71
We believe these follow an exponential distribution:
Xi ∼ Exp(λ)
What can we say about λ?
Can we approximate it from the data?
Yes! We can do it using a statistic. This is called estimation.
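As a first taste (a minimal R sketch; the variable names are illustrative, and a fuller treatment of estimation methods comes later in the module), we can turn the data into a single number for λ by noting that E(X) = 1/λ for an exponential distribution:

x <- c(2.55, 2.13, 3.18, 5.94, 2.29, 2.41, 8.72, 3.71)  # observed intervals (minutes)
lambda.hat <- 1 / mean(x)  # since E(X) = 1/lambda, estimate lambda by 1/(sample mean)
lambda.hat                 # approximately 0.26 cars per minute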
Statistics: the big picture
We want to start learning how to do inference. First, we need a good understanding of the ‘sampling’ part.

Distributions of statistics
Consider sampling from X ∼ Exp(λ = 1/5).
Convenient simplification: set θ = 1/λ. This makes E(X) = θ and var(X) = θ².
Note: There are two common parameterisations:
f_X(x) = λ e^(−λx),  x ∈ [0, ∞)
f_X(x) = (1/θ) e^(−x/θ),  x ∈ [0, ∞)
λ is called the rate parameter (relates to a Poisson process). Be clear about which is being used!
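On that point, note that R's rexp() uses the rate parameterisation (a quick sketch to check; the sample size here is arbitrary):

x <- rexp(1000, rate = 1/5)  # rate = lambda = 1/5, i.e. theta = 5
mean(x)                      # should be close to theta = 5, not to 1/5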
Take a large number of samples, each of size n = 100:
1. 1.84 1.19 11.73 5.64 17.98 0.26 …
2. 2.67 7.15 5.99 1.03 0.65 3.18 …
3. 16.99 2.15 2.60 5.40 3.64 2.01 …
4. 2.21 1.54 4.27 5.29 3.65 0.83 …
5. 12.24 1.59 2.56 1.38 5.72 0.69 …
Then calculate some statistics (x̄, x_(1), x_(n), etc.) for each one:
Min. Median Mean
1. 0.02 4.10 5.17
2. 0.16 4.48 5.84
3. 0.17 3.39 4.38
4. 0.03 3.73 5.43
5. 0.01 3.12 4.71
As we continue this process, we get some information on the distributions of these statistics.
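A simulation along these lines could be run with a short R sketch like the following (the number of repetitions is illustrative, not taken from the slides):

theta <- 5
n <- 100
one.sample <- function() {
  x <- rexp(n, rate = 1/theta)  # one sample of size n
  c(min = min(x), median = median(x), mean = mean(x))
}
stats <- t(replicate(1000, one.sample()))  # one row of statistics per sample
head(stats)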
Sampling distribution (definition)
Recall that any statistic T = φ(X1 , . . . , Xn ) is a random variable.
The sampling distribution of a statistic is its probability distribution, given an assumed population distribution and a
sampling scheme (e.g. random sampling).
Sometimes we can determine it exactly, but often we might resort to simulation. In the current example, we know that:
X_(1) ∼ Exp(100λ) and Σ Xi ∼ Gamma(100, λ)
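We can check the first of these results by simulation (a minimal sketch; the number of repetitions is arbitrary):

n <- 100; lambda <- 1/5
mins <- replicate(5000, min(rexp(n, rate = lambda)))  # sample minima
mean(mins)  # should be close to 1/(n * lambda) = 0.05
var(mins)   # should be close to 1/(n * lambda)^2 = 0.0025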
How to estimate?
Suppose we want to estimate θ from the data. What should we do? Reminder:
• Population mean, E(X) = θ = 5
• Population variance, var(X) = θ² = 5²
• Population standard deviation, sd(X) = θ = 5

Can we use the sample mean, X̄, as an estimate of θ? Yes!
Can we use the sample standard deviation, S, as an estimate of θ? Yes!
Will these statistics be good estimates? Which one is better? Let’s see. . .
We need to know properties of their sampling distributions, such as their mean and variance.
Note: we are referring to the distribution of the statistic, T, rather than the population distribution from which we draw samples, X.
For example, it is natural to expect that:
• E(X̄) ≈ μ (sample mean ≈ population mean)
• E(S²) ≈ σ² (sample variance ≈ population variance)
Let’s see for our example:
[Figure: Left: distribution of X̄. Right: distribution of S². Vertical dashed lines: true values, E(X) = 5 and var(X) = 5².]
• Should we use X̄ or S to estimate θ? Which one is the better estimator?
• We would like the sampling distribution of the estimator to be concentrated as close as possible to the true value θ = 5.
• In practice, for any given dataset, we don’t know which estimate is the closest, since we don’t know the true value.
• We should use the one that is more likely to be the closest.
• Simulation: consider 250 samples of size n = 100 and compute x̄1, . . . , x̄250 and s1, . . . , s250:
> summary(x.bar)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.789   4.663   4.972   5.015   5.365   6.424
> sd(x.bar)
[1] 0.4888185
> summary(s)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.502   4.473   4.916   5.002   5.512   7.456
> sd(s)
[1] 0.7046119
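Output like this could be generated with a sketch along the following lines (the seed is not given in the slides, so the exact numbers will differ):

theta <- 5; n <- 100; nsim <- 250
x.bar <- s <- numeric(nsim)
for (i in 1:nsim) {
  x <- rexp(n, rate = 1/theta)
  x.bar[i] <- mean(x)  # estimate theta by the sample mean
  s[i] <- sd(x)        # estimate theta by the sample standard deviation
}
summary(x.bar); sd(x.bar)
summary(s); sd(s)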
From our simulation, sd(X̄) ≈ 0.49 and sd(S) ≈ 0.70. So, in this case it looks like X̄ is superior to S.

2 Estimators
Definitions
• A parameter is a quantity that describes the population distribution, e.g. μ and σ² for N(μ, σ²)

• The parameter space is the set of all possible values that a parameter might take, e.g. −∞ < μ < ∞ and 0 ≤ σ < ∞.
• An estimator (or point estimator) is a statistic that is used to estimate a parameter. It refers specifically to the random variable version of the statistic, e.g. T = u(X1, . . . , Xn).
• An estimate (or point estimate) is the observed value of the estimator for a given dataset. In other words, it is a realisation of the estimator, e.g. t = u(x1, . . . , xn), where x1, . . . , xn is the observed sample (data).
• 'Hat' notation: If T is an estimator for θ, then we usually refer to it by θ̂ for convenience.

Examples
We will now go through a few important examples:
• Sample mean
• Sample variance
• Sample proportion
In each case, we assume a sample of iid rvs, X1, . . . , Xn, with mean μ and variance σ².

Sample mean
X̄ = (1/n)(X1 + X2 + · · · + Xn) = (1/n) Σ Xi
Properties:
• E(X̄) = μ
• var(X̄) = σ²/n
Also, the Central Limit Theorem implies that usually:
X̄ ≈ N(μ, σ²/n)
Often used to estimate the population mean, μ̂ = X̄.

Sample variance
S² = (1/(n − 1)) Σ (Xi − X̄)²
Properties:
• E(S²) = σ²
• var(S²) = (a messy formula)
Often used to estimate the population variance, σ̂² = S².

Sample proportion
For a discrete random variable, we might be interested in how often a particular value appears. Counting this gives the sample frequency:
freq(a) = Σ I(Xi = a)
Let the population proportion be p = Pr(X = a). Then we have:
freq(a) ∼ Bi(n, p)
Divide by the sample size to get the sample proportion. This is often used as an estimator for the population proportion:
p̂ = freq(a)/n = (1/n) Σ I(Xi = a)
For large n, we can approximate this with a normal distribution:
p̂ ≈ N(p, p(1 − p)/n)
• The sample pmf and the sample proportion are the same; both of them estimate the probability of a given event or set of events.
• The pmf is usually used when the interest is in many different events/values, and is written as a function, e.g. p̂(a).
• The proportion is usually used when only a single event is of interest (getting heads for a coin flip, a certain candidate winning an election, etc.).

Examples for a normal distribution
If the sample is drawn from a normal distribution, Xi ∼ N(μ, σ²), we can derive exact distributions for these statistics.
Sample mean:
X̄ ∼ N(μ, σ²/n)
Sample variance:
S² ∼ (σ²/(n − 1)) χ²_{n−1}
E(S²) = σ², var(S²) = 2σ⁴/(n − 1)
χ²_k is the chi-squared distribution with k degrees of freedom.

Bias
Consider an estimator θ̂ of θ.
• If E(θ̂) = θ, the estimator is said to be unbiased.
• The bias of the estimator is E(θ̂) − θ.
• The sample variance is unbiased for the population variance, E(S²) = σ².
• What if we divide by n instead of n − 1 in the denominator?
E(((n − 1)/n) S²) = ((n − 1)/n) σ² < σ²

Transformations and biasedness
In general, if θ̂ is unbiased for θ, then it will usually be the case that g(θ̂) is biased for g(θ). Unbiasedness is not preserved under transformations.

Challenge problem
Is the sample standard deviation, S = √(S²), biased for the population standard deviation, σ? (more details in Module 3; problem 5 in week 3 tutorial)

Choosing between estimators
• Evaluate and compare the sampling distributions of the estimators.
• Generally, prefer estimators that have smaller bias and smaller variance (the relative importance of each can vary depending on the aim of your problem).
• Sometimes, we only know asymptotic properties of estimators (will see examples later).
Note: this approach to estimation is referred to as frequentist or classical inference. The same is true for most of the techniques we will cover.
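To illustrate how such comparisons work in practice, here is a minimal simulation sketch (not from the slides) contrasting the usual sample variance with the divide-by-n version discussed under 'Bias' above:

n <- 10; sigma2 <- 25; nsim <- 10000
s2 <- s2n <- numeric(nsim)
for (i in 1:nsim) {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  s2[i] <- var(x)                  # divides by n - 1 (unbiased)
  s2n[i] <- var(x) * (n - 1) / n   # divides by n (biased)
}
mean(s2)   # close to sigma^2 = 25
mean(s2n)  # close to ((n - 1)/n) * sigma^2 = 22.5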
We will also learn about an alternative approach, called Bayesian inference, later in the semester.

Challenge problem (uniform distribution)
Take a random sample of size n from the uniform distribution with pdf:
f(x) = 1,  θ − 1/2 < x < θ + 1/2
Can you think of some estimators for θ? What is their bias and variance?

Challenge problem (boundary problem)
Take a random sample of size n from the shifted exponential distribution, with pdf:
f(x) = e^(−(x−θ))  (x ≥ θ)
Equivalently: Xi ∼ θ + Exp(1)
Can you think of some estimators for θ? What is their bias and variance?

Coming up with (good) estimators?
How can we do this for any given problem? We will cover two general methods:
• Method of moments
• Maximum likelihood

3 Method of moments

Method of moments (MM)
– Make the population distribution resemble the empirical (data) distribution. . .
– . . . by equating theoretical moments with sample moments
– Do this until you have enough equations, and then solve them
• Example: if E(X) = θ, then the method of moments estimator of θ is X̄.
• General procedure (for r parameters):
1. X1, . . . , Xn i.i.d. f(x | θ1, . . . , θr).
2. kth moment is μk = E(X^k)
3. kth sample moment is Mk = (1/n) Σ Xi^k
4. Set μk = Mk, for k = 1, . . . , r and solve for (θ1, . . . , θr).
• Alternative: Can use the variance instead of the second moment (sometimes more convenient).

Some remarks on MM estimators:
• An intuitive approach to estimation
• Can work in situations where other approaches are too difficult
• Usually biased
• Usually not optimal (but may suffice)
• Note: some authors use a 'bar' (θ̄) or a 'tilde' (θ̃) to denote MM estimators rather than a 'hat' (θ̂). This helps to distinguish different estimators when comparing them to each other.

Example: Geometric distribution
• Sampling from: X ∼ Geom(p)
• The first moment:
E(X) = Σ x p(1 − p)^(x−1) = 1/p
• The MM estimator is obtained by solving
X̄ = 1/p
which gives
p̃ = 1/X̄

Example: Normal distribution
• Sampling from: X ∼ N(μ, σ²)
• Population moments: E(X) = μ and E(X²) = σ² + μ²
• Sample moments: M1 = X̄ and M2 = (1/n) Σ Xi²
• Equating them:
X̄ = μ  and  (1/n) Σ Xi² = σ² + μ²
Solving these gives:
μ̃ = X̄  and  σ̃² = (1/n) Σ (Xi − X̄)²
• This is not the usual sample variance!
• σ̃² = ((n − 1)/n) S²
• This one is biased, E(σ̃²) = ((n − 1)/n) σ² ≠ σ².

Example: Gamma distribution
• Sampling from: X ∼ Gamma(α, θ)
• The pdf is:
f(x | α, θ) = (1/(Γ(α) θ^α)) x^(α−1) exp(−x/θ)
• Population moments: E(X) = αθ and var(X) = αθ²
• Sample moments: M1 = X̄ and S² = (1/(n − 1)) Σ (Xi − X̄)²
• Equating them:
X̄ = αθ  and  S² = αθ²
Solving these gives:
θ̃ = S²/X̄  and  α̃ = X̄²/S²
• This is an example of using S² instead of M2.

4 Maximum likelihood estimation

Method of maximum likelihood (ML)
• Idea: find the 'most likely' explanation for the data
• More concretely: find parameter values that maximise the probability of the data

Example: Bernoulli distribution
• Sampling from: X ∼ Be(p)
• Data are 0's and 1's
• The pmf is
f(x | p) = p^x (1 − p)^(1−x),  x = 0, 1,  0 ≤ p ≤ 1
• Observe values x1, . . . , xn of X1, . . . , Xn (iid)
• The probability of the data (the random sample) is
Pr(X1 = x1, . . . , Xn = xn | p) = ∏ f(xi | p) = ∏ p^(xi) (1 − p)^(1−xi) = p^(Σ xi) (1 − p)^(n − Σ xi)
• Regard the sample x1, . . . , xn as known (since we have observed it) and regard the probability of the data as a function of p.
• When written this way, this is called the likelihood of p:
L(p) = L(p | x1, . . . , xn) = Pr(X1 = x1, . . . , Xn = xn | p) = p^(Σ xi) (1 − p)^(n − Σ xi)
• Want to find the value of p that maximizes this likelihood.
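Before maximising analytically, we can locate the maximum numerically on a grid (a minimal sketch; the values n = 100 and x = 50 are made up for illustration):

n <- 100; x <- 50                       # hypothetical Bernoulli data: x ones out of n
p.grid <- seq(0.001, 0.999, by = 0.001)
lik <- p.grid^x * (1 - p.grid)^(n - x)  # L(p) = p^x (1 - p)^(n - x)
p.grid[which.max(lik)]                  # numerical maximiser, close to x/n = 0.5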
• It often helps to find the value of p that maximizes the log of the likelihood rather than the likelihood itself.
• This is called the log-likelihood:
ln L(p) = ln p^(Σ xi) + ln (1 − p)^(n − Σ xi)
• The final answer (the maximising value of p) is the same, since the log is a strictly increasing (one-to-one) function on the positive numbers, so any value of p that maximises the log-likelihood also maximises the likelihood.
• Putting x = Σ xi, so that x is the number of 1's in the sample,
ln L(p) = x ln p + (n − x) ln(1 − p)
• Find the maximum of this log-likelihood with respect to p by differentiating and equating to zero,
∂ln L(p)/∂p = x (1/p) + (n − x) (−1/(1 − p)) = 0
• This gives p = x/n
• Therefore, the maximum likelihood estimator is p̂ = X/n = X̄

[Figure 1: Log-likelihoods for Bernoulli trials with parameter p, shown for x = 40, 50 and 80.]

Maximum likelihood: general procedure
• Random sample (iid): X1, . . . , Xn
• Likelihood function with m parameters θ1, . . . , θm and data x1, . . . , xn is:
L(θ1, . . . , θm) = ∏ f(xi | θ1, . . . , θm)
• If X is discrete, for f use the pmf
• If X is continuous, for f use the pdf
• The maximum likelihood estimates (MLEs) or the maximum likelihood estimators (MLEs) θ̂1, . . . , θ̂m are values that maximize L(θ1, . . . , θm).
• Note: same abbreviation and notation for both the estimators (random variables) and the estimates (realised values).
• Often (but not always) useful to take logs and then differentiate and equate derivatives to zero to find MLEs.
• Sometimes this is too hard, but we can maximise numerically. No closed-form expression in this case.

Example: Exponential distribution
Sampling (iid) from: X ∼ Exp(θ)
f(x | θ) = (1/θ) e^(−x/θ),  x > 0,  0 < θ < ∞
L(θ) = (1/θ^n) exp(−Σ xi / θ)
ln L(θ) = −n ln(θ) − Σ xi / θ
∂ln L(θ)/∂θ = −n/θ + Σ xi / θ² = 0
This gives: θ̂ = X̄

Example: Exponential distribution (simulated)
> x <- rexp(25)  # simulate 25 observations from Exp(1)
> x
[1] 0.009669867 3.842141708 0.394267770 0.098725403
[5] 0.386704987 0.024086824 0.274132718 0.872771164
[9] 0.950139285 0.022927997 1.538592014 0.837613769
[13] 0.634363088 0.494441270 1.789416017 0.503498224
[17] 0.000482703 1.617899321 0.336797648 0.312564298
[21] 0.702562098 0.265119483 3.825238461 0.238687987
[25] 1.752657238
> mean(x) # maximum likelihood estimate
[1] 0.8690201
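The log-likelihood curve shown below could be drawn with a sketch like this, continuing with the simulated sample x above (exact plotting settings are not given in the slides):

theta.grid <- seq(0.4, 2.5, length.out = 200)
loglik <- sapply(theta.grid,
                 function(theta) sum(dexp(x, rate = 1/theta, log = TRUE)))
plot(theta.grid, loglik, type = "l",
     xlab = expression(theta), ylab = "log-likelihood")
abline(v = mean(x), lty = 2)  # the MLE, theta.hat = x.bar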
[Figure: Log-likelihood curve for this sample.]

What if we repeat the sampling process several times?

[Figure: Log-likelihood curves for several repeated samples.]

Example: Geometric distribution
Sampling (iid) from: X ∼ Geom(p)
L(p) = ∏ p(1 − p)^(xi − 1) = p^n (1 − p)^(Σ xi − n)
∂ln L(p)/∂p = n/p − (Σ xi − n)/(1 − p) = 0
This gives: p̂ = 1/X̄

Example: Normal distribution
Sampling (iid) from: X ∼ N(θ1, θ2)
L(θ1, θ2) = ∏ (1/√(2πθ2)) exp(−(xi − θ1)²/(2θ2))
ln L(θ1, θ2) = −(n/2) ln(2πθ2) − (1/(2θ2)) Σ (xi − θ1)²
Take partial derivatives with respect to θ1 and θ2:
∂ln L(θ1, θ2)/∂θ1 = (1/θ2) Σ (xi − θ1)
∂ln L(θ1, θ2)/∂θ2 = −n/(2θ2) + (1/(2θ2²)) Σ (xi − θ1)²
Set both of these to zero and solve. This gives: θ̂1 = x̄ and θ̂2 = (1/n) Σ (xi − x̄)². The maximum likelihood estimators are therefore:
θ̂1 = X̄,  θ̂2 = (1/n) Σ (Xi − X̄)² = ((n − 1)/n) S²
Note: θ̂2 is biased.
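As a quick check of these formulas, we can also maximise the normal log-likelihood numerically (a minimal sketch using optim() on simulated data; none of this is from the slides):

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)  # illustrative data
n <- length(x)
negloglik <- function(par) {       # negative log-likelihood in (theta1, theta2)
  (n / 2) * log(2 * pi * par[2]) + sum((x - par[1])^2) / (2 * par[2])
}
fit <- optim(c(5, 5), negloglik, method = "L-BFGS-B", lower = c(-Inf, 1e-8))
fit$par                            # numerical MLEs
c(mean(x), (n - 1) / n * var(x))   # closed-form MLEs, for comparison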
Stress and cancer: VEGFC
> x <- c(0.97, 0.52, 0.73, 0.96, 1.26)
> n <- length(x)
> mean(x)  # MLE for the population mean
[1] 0.888
> sd(x) * sqrt((n - 1) / n)  # MLE for the population standard deviation
[1] 0.2492709
> qqnorm(x)  # Draw a QQ plot
> qqline(x)  # Fit line to QQ plot

[Figure: Normal Q-Q plot of the VEGFC data (axes: Theoretical Quantiles vs Sample Quantiles).]

Challenge problem (boundary problem)
Take a random sample of size n from the shifted exponential distribution, with pdf:
f(x | θ) = e^(−(x−θ))  (x ≥ θ)
Equivalently: Xi ∼ θ + Exp(1)
Derive the MLE for θ. Is it biased? Can you create an unbiased estimator from it?

Invariance property
Suppose we know θ̂ but are actually interested in φ = g(θ) rather than θ itself. Can we estimate φ? Yes! It is simply φ̂ = g(θ̂).
This is known as the invariance property of the MLE. In other words, the MLE of a transformed parameter is just the transformed MLE.
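For example (a small sketch on simulated exponential data; the transformations chosen here are just for illustration):

x <- rexp(25, rate = 1/5)  # simulated data, true theta = 5
theta.hat <- mean(x)       # MLE of theta
1 / theta.hat              # by invariance: MLE of the rate, lambda = 1/theta
exp(-3 / theta.hat)        # by invariance: MLE of Pr(X > 3) = exp(-3/theta)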
Consequence: MLEs are usually biased, since expectations are not invariant under transformations.

Is the MLE a good estimator?
Some useful results:
• Asymptotically unbiased
• Asymptotically optimal variance ('efficient')
• Asymptotically normally distributed
The proofs of these rely on the CLT. More details of the mathematical theory will be covered towards the end of the semester.
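As an informal preview (a simulation sketch, not a proof), the sampling distribution of the MLE in the exponential example looks close to normal for moderately large n:

theta <- 5; n <- 100
mle <- replicate(2000, mean(rexp(n, rate = 1/theta)))  # MLE = sample mean
hist(mle, breaks = 40, freq = FALSE)
curve(dnorm(x, mean = theta, sd = theta / sqrt(n)), add = TRUE, lwd = 2)  # normal approximation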
