
Chapter 11
Frequentist Techniques for Parameter Estimation
We use the data from Example 3.2 to motivate some of the issues which we address in this chapter.
Example 11.1. Consider the height-weight data from the 1975 World Almanac and Book of Facts [125], which we compile in Table 11.1 and plot in Figure 11.1. Based on the behavior, we consider the quadratic observation model


$$Y_i = \theta_0 + \theta_1(x_i/12) + \theta_2(x_i/12)^2 + \varepsilon_i, \quad i = 1, \ldots, 15, \qquad (11.1)$$

where $x_i$ is the height in inches and $Y_i$ is the corresponding weight. We denote the vector of parameters by $\theta = [\theta_0, \theta_1, \theta_2]^T$ and assume that the errors $\varepsilon_i$ are unbiased and identically distributed with variance $\sigma^2$.

Height (in)    58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
Weight (lbs)  115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
Table 11.1. Height-weight data from [125].
Figure 11.1. Behavior of height-weight data from [125]: weight (lbs) versus height (in).

We note that the model exhibits a nonlinear dependence on the independent variable $x_i$ but a linear dependence on the parameters $\theta$.
In this chapter, we address the following issues.
• Construct unbiased frequentist estimators $\hat\theta$ and $\hat\sigma^2$ for the parameters $\theta$ and error variance $\sigma^2$, and construct a covariance matrix $V$ for the parameters: Sections 11.1.1–11.1.3.
• Determine properties of the distribution for $\hat\theta$ and construct confidence intervals: Sections 11.1.4 and 11.1.5.
• Construct prediction intervals for heights $x^*$ not used for inference: Section 11.2.
• Provide a framework for nonlinear regression with scalar observations: Section 11.3.
To provide a framework for statistical inference, we employ the additive observation model
$$Y_i = f(\xi_i, \theta) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (11.2)$$
detailed in Section 6.2.1, which relates measurements $Y_i$ to model outputs $f(\xi_i, \theta)$. Here $\xi_i$ denotes values of independent variables, such as time $t_i$, polarization $P_i$, space $x_i$, or height $x_i$ in the previous example. As in previous chapters, $\theta = [\theta_1, \ldots, \theta_p]$ denotes calibration parameters, which can include initial or boundary conditions. In this chapter, we delineate between random variables and realizations. We respectively denote the random and realized measurement errors by $\varepsilon_i$ and $\epsilon_i$ and the resulting observations by $Y_i$ and $y_i$.
The model (11.2) can be expressed in vector form as
$$Y = f(\theta) + \varepsilon,$$
where $Y = [Y_1, \ldots, Y_n]$, $f(\theta) = [f(\xi_1,\theta), \ldots, f(\xi_n,\theta)]$, and $\varepsilon = [\varepsilon_1, \ldots, \varepsilon_n]$. Linearly parameterized models have the form
$$Y = X\theta + \varepsilon,$$
where $X$ denotes the $n \times p$ design matrix. We illustrate in subsequent examples how $X$ can encapsulate various dependencies on $\xi_i$.
The mathematical inverse problem associated with parameter estimation can then be formulated as follows: given values of $\xi_i$ and measurements $Y_i$, determine $\theta$ in a robust manner. The associated statistical inverse problem, sometimes referred to as inverse uncertainty quantification, is to additionally quantify uncertainties associated with $\theta$ due to the measurement errors. The assumptions required to approximate $\theta$ and quantify its uncertainty define frequentist and Bayesian techniques for parameter estimation.
As detailed in Section 4.8.1, a basic tenet of frequentist inference is the assumption that parameters are fixed but possibly unknown. We let $\theta_0$ denote the true but unknown parameter values that generated the observations $Y = [Y_1, \ldots, Y_n]^T$. The deterministic nature of $\theta$ dictates that $f(\xi_i, \theta)$ is a deterministic quantity and necessitates that we construct an estimator $\hat\theta$ that estimates $\theta_0$ in a statistically reasonable manner.
We detail linear regression in Section 11.1 and nonlinear regression for the observation model (11.2) in Section 11.3. We refer readers to [37] and Section 7.3.2 of [357] for discussion of nonlinear regression with multiple responses.
11.1 Linear Regression
We focus here on models that depend linearly on the parameters. This includes the linear Helmholtz energy detailed in Example 3.1, convolution models for acoustics, and models employed in image processing and X-ray tomography. We refer readers to [157] for details regarding linear regression.
We employ the statistical observation model
$$Y = X\theta_0 + \varepsilon, \qquad (11.3)$$
where $Y = [Y_1, \ldots, Y_n]^T$ and $\varepsilon = [\varepsilon_1, \ldots, \varepsilon_n]^T$ are random vectors and the $n \times p$ design matrix $X$ is deterministic and known. We let $\theta_0$ denote the vector of true but unknown parameters and let $y = [y_1, \ldots, y_n]^T$ denote realizations or observations from an experiment with observation errors $\epsilon = [\epsilon_1, \ldots, \epsilon_n]$. Throughout this discussion, we assume that there are more measurements than parameters so that $n > p$.
Example 11.2. Consider the quadratic model (11.1), which we employed in Example 11.1 to model height-weight data from [125]. Here $n = 15$, $p = 3$, and the $n \times p$ design matrix is
$$X = \begin{bmatrix} 1 & x_1/12 & (x_1/12)^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n/12 & (x_n/12)^2 \end{bmatrix}. \qquad (11.4)$$
We note that the weight exhibits a quadratic dependence on the height but depends linearly on the parameters.

Assumption 11.3. We assume that observation errors are unbiased and iid with fixed but unknown variance $\sigma_0^2$; hence for $i, j = 1, \ldots, n$,
(i) $E(\varepsilon_i) = 0$,
(ii) $\mathrm{var}(\varepsilon_i) = \sigma_0^2$, $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.
For initial analysis, we make no further assumption regarding a distribution for $\varepsilon_i$.
The first objective is to construct unbiased estimators $\hat\theta$ and $\hat\sigma^2$ for the unknown parameters $\theta_0$ and $\sigma_0^2$.
11.1.1 Parameter Estimator and Estimate
To construct an estimator $\hat\theta$ for $\theta_0$, we seek the value $\theta$ that minimizes the OLS functional
$$J(\theta) = (Y - X\theta)^T (Y - X\theta). \qquad (11.5)$$
For scalar parameters, we would minimize (11.5) by setting the derivative with respect to $\theta$ equal to 0 and solving for $\theta$. For vector-valued problems, this is achieved using the gradient operation
$$\nabla_\theta J = 2[\nabla_\theta(Y - X\theta)^T][Y - X\theta] = 0, \qquad \nabla_\theta(Y - X\theta)^T = -\nabla_\theta \theta^T X^T = -X^T.$$
Setting the gradient to zero gives the normal equations $X^TX\theta = X^TY$. This yields the least squares estimator
$$\hat\theta_{OLS} = (X^TX)^{-1}X^TY. \qquad (11.6)$$
The realization
$$\theta_{OLS} = (X^TX)^{-1}X^Ty \qquad (11.7)$$
is the least squares estimate for the unknown true parameter $\theta_0$.

Remark 11.4. Throughout this chapter, we will discuss only OLS estimators and estimates. To simplify notation, we thus drop the subscript OLS and let $\hat\theta = \hat\theta_{OLS}$ and $\theta = \theta_{OLS}$ denote the least squares estimator and estimate.

Whereas the normal equations (11.7) provide an analytic minimum for (11.5), they are typically ill-conditioned for moderate to large numbers of parameters. In practice, it is common to instead solve the minimization problem (11.5) numerically (for linear models, typically via a QR factorization) to avoid this ill-conditioning. We discuss optimization techniques in Section 11.3.1.
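To make this concrete, the following MATLAB sketch (the variable names are ours) fits the quadratic model (11.1) to the Table 11.1 data both with a QR-based least squares solve and with the normal equations. Both approaches should return values near the estimates reported in Example 11.7 below.

% Height-weight data from Table 11.1
x = (58:72)';                                  % heights (in)
y = [115 117 120 123 126 129 132 135 139 ...
     142 146 150 154 159 164]';                % weights (lbs)
X = [ones(15,1), x/12, (x/12).^2];             % design matrix (11.4)
theta_qr = X \ y;                              % QR-based minimization of (11.5)
theta_ne = (X'*X) \ (X'*y);                    % normal equations (11.7)
disp([theta_qr, theta_ne])                     % both near [261.88; -88.18; 11.96]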
11.1.2 Parameter Estimator Properties
Result 11.5. The parameter estimator $\hat\theta$ has the mean and covariance matrix
$$\text{(i) } E(\hat\theta) = \theta_0, \qquad \text{(ii) } V(\hat\theta) = \sigma_0^2(X^TX)^{-1}. \qquad (11.8)$$
Relation (i) follows directly from (11.6) since
$$E(\hat\theta) = E[(X^TX)^{-1}X^TY] = (X^TX)^{-1}X^T E(Y) = (X^TX)^{-1}X^TX\theta_0 = \theta_0.$$
Hence $\hat\theta$ provides an unbiased estimator for the true parameter. To establish the covariance relation, we let $A = (X^TX)^{-1}X^T$ and note that
$$\begin{aligned} V(\hat\theta) &= E[(\hat\theta - \theta_0)(\hat\theta - \theta_0)^T] \\ &= E[(\theta_0 + A\varepsilon - \theta_0)(\theta_0 + A\varepsilon - \theta_0)^T], \quad \text{since } \hat\theta = AY = A(X\theta_0 + \varepsilon) \\ &= A\,E(\varepsilon\varepsilon^T)\,A^T \\ &= \sigma_0^2(X^TX)^{-1}. \end{aligned}$$
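A brief Monte Carlo experiment illustrates Result 11.5; it is a sketch rather than a proof, and the error standard deviation sigma0 = 0.4 is an assumed value chosen purely for illustration. Reusing X from the previous listing and treating the Example 11.7 estimates as the true $\theta_0$, the sample mean and covariance of the OLS estimates should approach $\theta_0$ and $\sigma_0^2(X^TX)^{-1}$.

% Monte Carlo illustration of Result 11.5 (sigma0 is an assumed value)
rng(1);                                        % reproducibility
theta0 = [261.88; -88.18; 11.96];              % surrogate "true" parameters
sigma0 = 0.4;  M = 1e4;
Theta  = zeros(3, M);
for m = 1:M
    Ym = X*theta0 + sigma0*randn(15,1);        % synthetic data from (11.3)
    Theta(:,m) = X \ Ym;                       % OLS estimate for this realization
end
disp(mean(Theta, 2))                           % approx theta0 (unbiasedness)
disp(cov(Theta'))                              % approx sigma0^2*inv(X'*X)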
Since $\sigma_0^2$ is assumed to be fixed but unknown, we must construct an unbiased estimator $\hat\sigma^2$ for $\sigma_0^2$ before we can employ (11.8) to estimate the covariance matrix. (It is common in the statistics literature to represent estimates of $\sigma^2$ by $s^2$. However, we avoid that notation to prevent confusion with our use of $s$ to denote sensitivities.)
11.1.3 Error Variance Estimator
Result 11.6. The unbiased error variance estimator is
$$\hat\sigma^2 = \frac{1}{n-p}\hat R^T\hat R, \qquad (11.9)$$
where
$$\hat R = Y - X\hat\theta \qquad (11.10)$$
denotes the residual estimator.
To obtain this result, we first note that the residual can be expressed as
$$\hat R = (I_n - H)Y, \qquad (11.11)$$
where $I_n$ denotes the $n \times n$ identity matrix and
$$H \equiv X(X^TX)^{-1}X^T.$$
One can easily show that $H$ satisfies the properties
$$H^T = H \ \text{(symmetric)}, \quad H^2 = H \ \text{(idempotent)}, \quad (I_n - H)^2 = I_n - H, \quad (I_n - H)X = 0.$$
From (11.3) and (11.11), it follows that
$$\hat R = (I_n - H)\varepsilon, \qquad \hat R^T\hat R = \varepsilon^T(I_n - H)\varepsilon. \qquad (11.12)$$
If we denote the $ij$ entry of $I_n - H$ by $h_{ij}$, the quadratic form (11.12) can be expressed as
$$\hat R^T\hat R = \sum_{i,j=1}^n h_{ij}\varepsilon_i\varepsilon_j.$$
It then follows that
$$\begin{aligned} E[\hat R^T\hat R] &= \sum_{i,j=1}^n h_{ij}\,E(\varepsilon_i\varepsilon_j) \\ &= \sum_{i,j=1}^n h_{ij}\,\mathrm{cov}(\varepsilon_i, \varepsilon_j), \quad \text{from (4.15) with } E(\varepsilon_i) = E(\varepsilon_j) = 0 \\ &= \sum_{i=1}^n h_{ii}\,\mathrm{var}(\varepsilon_i), \quad \varepsilon_i \text{ independent} \\ &= \sigma_0^2\,\mathrm{tr}(I_n - H), \quad \varepsilon_i \text{ identically distributed with variance } \sigma_0^2. \end{aligned}$$
Since the trace operator satisfies the properties $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, it follows that
$$\mathrm{tr}(I_n - H) = n - \mathrm{tr}[X(X^TX)^{-1}X^T] = n - \mathrm{tr}[(X^TX)^{-1}X^TX] = n - p. \qquad (11.13)$$
Thus $\hat\sigma^2 = \frac{1}{n-p}\hat R^T\hat R$ is an unbiased estimator for $\sigma_0^2$. Furthermore, we can conclude from (11.13) that the eigenvalues of $H$ are 0 or 1.
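The hat-matrix identities used in this derivation are easy to check numerically; the sketch below reuses X from the earlier listing and verifies the symmetry, idempotency, trace relation (11.13), and 0/1 eigenvalue structure of $H$.

% Numerical check of the hat-matrix properties
H = X*((X'*X)\X');                             % H = X(X'X)^{-1}X'
disp(norm(H - H'))                             % ~0: symmetric
disp(norm(H*H - H))                            % ~0: idempotent
disp(trace(eye(15) - H))                       % = n - p = 12, cf. (11.13)
disp(sort(eig(H))')                            % eigenvalues numerically 0 or 1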
Example 11.7. We return to the height-weight data introduced in Example 3.2 and discussed in Example 11.1. Based on the data plotted in Figure 11.2(a), we employ the quadratic model
$$Y_i = \theta_0 + \theta_1(x_i/12) + \theta_2(x_i/12)^2 + \varepsilon_i, \qquad (11.14)$$
where $x_i$ is the height in inches and $Y_i$ is the corresponding weight. Solution of the normal equations (11.7) yields the parameter values $\theta = [261.88, -88.18, 11.96]^T$ and the model fit shown in the figure. The residuals in Figure 11.2(b) demonstrate that the assumption of identically distributed observation errors $\varepsilon_i$ is reasonable. Whereas this data set is fairly small, the residuals exhibit no clear trend that would indicate the assumption $\varepsilon_i \sim N(0, \sigma_0^2)$ is badly violated. One could further test this assumption using a Q-Q plot as described in Definition 4.19.

Figure 11.2. (a) Model fit to the height-weight data in Table 11.1 and (b) residuals for the 15 heights.
We note that the condition number of the $3 \times 3$ matrix $X^TX$ is $6.7 \times 10^7$, thus illustrating the ill-conditioning of the normal equations. The variance estimate provided by (11.9) is $\sigma^2 = 0.15$, which yields the covariance matrix estimate
$$V = \begin{bmatrix} 634.88 & -235.04 & 21.66 \\ -235.04 & 87.09 & -8.03 \\ 21.66 & -8.03 & 0.74 \end{bmatrix}.$$
The estimated parameter values, plus and minus two standard deviations, are thus
$$\begin{aligned} \theta_0 &= 261.88 \pm 50.39 &&\Rightarrow& \theta_0 &\in [211.48, 312.27] \\ \theta_1 &= -88.18 \pm 18.66 &&\Rightarrow& \theta_1 &\in [-106.84, -69.51] \\ \theta_2 &= 11.96 \pm 1.72 &&\Rightarrow& \theta_2 &\in [10.24, 13.68]. \end{aligned} \qquad (11.15)$$
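The quantities in this example follow from a few lines of MATLAB. The sketch below reuses X and y from the earlier listing and should reproduce, up to rounding, the values reported above.

% Variance and covariance estimates for Example 11.7
theta  = (X'*X) \ (X'*y);                      % normal equations (11.7)
disp(cond(X'*X))                               % approx 6.7e7: ill-conditioned
R      = y - X*theta;                          % realized residual
sigma2 = (R'*R)/(15 - 3);                      % approx 0.15, from (11.9)
V      = sigma2*inv(X'*X);                     % covariance matrix estimate
halfw  = 2*sqrt(diag(V));                      % two standard deviations
disp([theta - halfw, theta + halfw])           % the intervals (11.15)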
11.1.4 Distribution for $\hat\theta$

The estimator $\hat\theta$ has a distribution, which we will use to construct confidence intervals for the estimation process. The assumptions required to specify this distribution are more stringent than those in Assumption 11.3 and require either that errors be normally distributed or that samples be sufficiently large that the central limit theorem can be invoked.

Assumption 11.8. The distribution for $\hat\theta$ can be directly specified for problems in which errors are iid and $\varepsilon_i \sim N(0, \sigma_0^2)$.

Property 11.9 (Distribution for $\hat\theta$). With Assumption 11.8, $\hat\theta$ is normally distributed, $\hat\theta \sim N(\theta_0, \sigma_0^2(X^TX)^{-1})$. Furthermore, if we let $\kappa_k$ denote the $k$th diagonal element of $(X^TX)^{-1}$ and $\theta_k^0$ denote the $k$th element of the true parameter vector $\theta_0$, then $\hat\theta_k \sim N(\theta_k^0, \sigma_0^2\kappa_k)$.

To verify this property, we note that each component $\hat\theta_k$ is a linear combination of the independent random variables $Y_i$; see (11.6). It then follows that $\hat\theta$ has a joint multivariate normal distribution. When combined with the fact that $E(\hat\theta) = \theta_0$ and $\mathrm{cov}(\hat\theta) = \sigma_0^2(X^TX)^{-1}$, it follows that $\hat\theta \sim N(\theta_0, \sigma_0^2(X^TX)^{-1})$.

For many applications, errors may be iid with variance $\sigma_0^2$ but not normally distributed. For sufficiently large sample sizes, asymptotic theory yields a result similar to Property 11.9.

Property 11.10 (Asymptotic Distribution for $\hat\theta$). Consider the model (11.3) with iid errors having variance $\sigma_0^2$. For sufficiently large $n$, the distribution for $\hat\theta$ is asymptotically normal, which we denote by $\hat\theta \sim_a N(\theta_0, \sigma_0^2(X^TX)^{-1})$.
Rather than provide a complete proof of Property 11.10, we instead summarize the approach and refer the reader to [348] for additional details. We first note that substitution of (11.3) into (11.6) yields $\hat\theta - \theta_0 = (X^TX)^{-1}X^T\varepsilon$, so that
$$\sqrt{n}(\hat\theta - \theta_0) = \left(\frac{1}{n}X^TX\right)^{-1}\frac{1}{\sqrt{n}}X^T\varepsilon.$$
Because the first right-hand side term can be interpreted as an average, the law of large numbers, Theorem 4.49, is used to establish that $\frac{1}{n}X^TX \xrightarrow{P} \Upsilon$, where $\Upsilon$ is a $p \times p$ positive definite matrix and convergence in probability is defined in Definition 4.45. Since $E\left(\frac{1}{\sqrt{n}}X^T\varepsilon\right) = 0$, it follows that
$$\mathrm{var}\left(\frac{1}{\sqrt{n}}X^T\varepsilon\right) = E\left(\frac{1}{n}X^T\varepsilon\varepsilon^TX\right) \xrightarrow{P} \sigma_0^2\Upsilon.$$
The central limit theorem, discussed in Section 4.4, is then invoked to establish that $\frac{1}{\sqrt{n}}X^T\varepsilon \xrightarrow{D} Z$, where $Z \sim N(0, \sigma_0^2\Upsilon)$, so that $\sqrt{n}(\hat\theta - \theta_0) \sim_a N(0, \sigma_0^2\Upsilon^{-1})$. Finally, one shows that $\frac{1}{n}X^TX$ is a strongly consistent estimator of $\Upsilon$ to obtain the asymptotic result.
UQ Crime 11.11. An obvious practical question concerns the sample size n required to justify using these asymptotic results. This is problem-dependent, and alternative methods, such as bootstrapping or Bayesian analysis, may be required to establish the normality of distributions when sample sizes are small. It is common to see asymptotic results cited for small sample sizes n, which can constitute a UQ crime.
11.1.5 Confidence Intervals
We showed in Section 4.2 that chi-squared and t-distributions are required to construct confidence intervals. We use the next two properties to construct confidence intervals for the parameters.
Property 11.12. For $\hat\sigma^2$ given by (11.9), the random variable $\nu = \frac{(n-p)\hat\sigma^2}{\sigma_0^2}$ has a chi-squared distribution with $n - p$ degrees of freedom.

To establish this, we note that
$$\frac{(n-p)\hat\sigma^2}{\sigma_0^2} = \frac{1}{\sigma_0^2}\hat R^T\hat R = \frac{1}{\sigma_0^2}\varepsilon^T(I_n - H)\varepsilon = \frac{1}{\sigma_0^2}\left\langle \varepsilon, U\Lambda U^T\varepsilon\right\rangle = \frac{1}{\sigma_0^2}\left\langle U^T\varepsilon, \Lambda U^T\varepsilon\right\rangle,$$
where $I_n - H = U\Lambda U^T$ since $I_n - H$ is symmetric.
Since $\mathrm{tr}(I_n - H) = \mathrm{rank}(I_n - H) = n - p$, one can express $\Lambda$ as
$$\Lambda = \begin{bmatrix} I_{n-p} & 0 \\ 0 & 0 \end{bmatrix},$$
where $I_{n-p}$ is the $(n-p) \times (n-p)$ identity matrix. Moreover, it is proven in [157] that since $U^T$ is an orthogonal matrix and $\varepsilon_i \sim N(0, \sigma_0^2)$, then $u = U^T\varepsilon$ is a vector of independent $N(0, \sigma_0^2)$ random variables. Because
$$\nu = \frac{(n-p)\hat\sigma^2}{\sigma_0^2} = \frac{\langle u, \Lambda u\rangle}{\sigma_0^2} = \sum_{i=1}^{n-p}\frac{u_i^2}{\sigma_0^2}$$
is the sum of $n - p$ independent squared $N(0,1)$ random variables, it has a chi-squared distribution with $n - p$ degrees of freedom.
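As an illustrative check of Property 11.12, the following sketch simulates $\nu = (n-p)\hat\sigma^2/\sigma_0^2$ for normally distributed errors; sigma0 = 0.4 is again an assumed value, and X is reused from the earlier listing. A $\chi^2(12)$ random variable has mean 12 and variance 24.

% Monte Carlo illustration of Property 11.12
rng(3);
sigma0 = 0.4;  M = 1e4;  nu = zeros(M,1);
H = X*((X'*X)\X');                             % hat matrix
for m = 1:M
    e     = sigma0*randn(15,1);                % iid N(0, sigma0^2) errors
    Rm    = (eye(15) - H)*e;                   % residual realization, cf. (11.12)
    nu(m) = (Rm'*Rm)/sigma0^2;                 % (n-p)*sigmahat^2/sigma0^2
end
disp([mean(nu), var(nu)])                      % approx [12, 24] for chi2(12)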
Property 11.13. The random variable
$$T_k = \frac{\hat\theta_k - \theta_k^0}{\hat\sigma\sqrt{\kappa_k}}$$
has a t-distribution with $n - p$ degrees of freedom.

To verify Property 11.13, we note from Property 11.9 that $Z = \frac{\hat\theta_k - \theta_k^0}{\sigma_0\sqrt{\kappa_k}} \sim N(0,1)$. It then follows from Definition 4.13 that
$$T_k = \frac{\hat\theta_k - \theta_k^0}{\hat\sigma\sqrt{\kappa_k}} = \frac{\hat\theta_k - \theta_k^0}{\sigma_0\sqrt{\kappa_k}}\cdot\frac{\sigma_0}{\hat\sigma} = \frac{Z}{\sqrt{\nu/(n-p)}}, \quad Z \sim N(0,1), \ \nu \sim \chi^2(n-p),$$
has a t-distribution with $n - p$ degrees of freedom.
To construct a $(1-\alpha) \times 100\%$ confidence interval, we employ the techniques of Example 4.36, with $T_k = \frac{\hat\theta_k - \theta_k^0}{\hat\sigma\sqrt{\kappa_k}}$, to obtain
$$P\left(\hat\theta_k - t_{n-p,1-\alpha/2}\cdot\hat\sigma\sqrt{\kappa_k} < \theta_k^0 < \hat\theta_k + t_{n-p,1-\alpha/2}\cdot\hat\sigma\sqrt{\kappa_k}\right) = 1 - \alpha.$$
We then employ the parameter estimate $\theta = (X^TX)^{-1}X^Ty$ and variance estimate $\sigma^2 = \frac{1}{n-p}R^TR$, where $R = y - X\theta$, to obtain the interval
$$\left[\theta_k - t_{n-p,1-\alpha/2}\cdot\sigma\sqrt{\kappa_k}, \ \theta_k + t_{n-p,1-\alpha/2}\cdot\sigma\sqrt{\kappa_k}\right]. \qquad (11.16)$$
We note that this is often expressed as
$$\left[\theta_k - t_{n-p,1-\alpha/2}\cdot SE_k, \ \theta_k + t_{n-p,1-\alpha/2}\cdot SE_k\right], \qquad (11.17)$$
where $SE_k \equiv \sigma\sqrt{\kappa_k}$ is termed the standard error. To construct (11.16) or (11.17), one uses a table of t-distributions or a t-value calculator to look up or compute values of $t_{n-p,1-\alpha/2}$ for specified values of $n$, $p$, and $\alpha$; e.g., the MATLAB command
>> tcrit = tinv(1 - alpha/2, n-p).
We caution the reader that whereas most tables are compiled in terms of one tail ($1 - \alpha/2$), some provide values for both tails ($1 - \alpha$). Hence care must be taken to employ $\alpha$ consistent with the table or calculator.
The statistical model, estimators, and statistical properties of the linear re- gression model are summarized in Table 11.2.
Statistical Observation Model:
$$Y = X\theta_0 + \varepsilon, \qquad y = X\theta_0 + \epsilon \ \text{(realization)}$$

Assumptions: $E(\varepsilon_i) = 0$, $\varepsilon_i$ iid with $\mathrm{var}(\varepsilon_i) = \sigma_0^2$

Least Squares Estimator and Estimate:
$$\hat\theta = (X^TX)^{-1}X^TY, \quad E(\hat\theta) = \theta_0, \quad V(\hat\theta) = \sigma_0^2(X^TX)^{-1}, \qquad \theta = (X^TX)^{-1}X^Ty$$

Error Variance Estimator and Estimate:
$$\hat R = Y - X\hat\theta, \quad R = y - X\theta, \qquad \hat\sigma^2 = \frac{1}{n-p}\hat R^T\hat R, \quad \sigma^2 = \frac{1}{n-p}R^TR$$

Covariance Matrix Estimator and Estimate:
$$V(\hat\theta) = \hat\sigma^2(X^TX)^{-1}, \quad V = \sigma^2(X^TX)^{-1}$$

Statistical Properties: requires $\varepsilon_i \sim N(0, \sigma_0^2)$ or sufficiently large $n$
• $\hat\theta \sim N(\theta_0, \sigma_0^2(X^TX)^{-1})$
• $(1-\alpha) \times 100\%$ confidence interval, with $\kappa_k = [(X^TX)^{-1}]_{kk}$:
$$\left[\theta_k - t_{n-p,1-\alpha/2}\cdot\sigma\sqrt{\kappa_k}, \ \theta_k + t_{n-p,1-\alpha/2}\cdot\sigma\sqrt{\kappa_k}\right]$$

Table 11.2. Statistical observation model, estimators, and statistical properties of the linear regression model. As noted in Remark 11.4, $\hat\theta = \hat\theta_{OLS}$ and $\theta = \theta_{OLS}$ are the OLS estimator and estimate.

Example 11.14. We revisit Example 11.7 and use the t-distribution to construct 95% confidence intervals for the parameters $\theta_0$, $\theta_1$, and $\theta_2$ in the quadratic model (11.14). Here we have $n = 15$ observations and $p = 3$ parameters. For $\alpha = 0.05$, we obtain the value $t_{n-p,1-\alpha/2} = 2.18$. This yields the 95% confidence intervals
$$\theta_0 \in [206.98, 316.78], \quad \theta_1 \in [-108.51, -67.85], \quad \theta_2 \in [10.09, 13.84].$$
These intervals are slightly wider than those in (11.15): although the $\pm 2\sigma$ intervals in (11.15) reflect approximately 95.45% of a normal distribution, the t-distribution has heavier tails than the normal distribution, as illustrated in Figure 4.3(b).
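The intervals of Example 11.14 can be computed with a few commands; this sketch reuses X and y from the earlier listing, and tinv requires the Statistics and Machine Learning Toolbox.

% 95% confidence intervals (11.16) for Example 11.14
n = 15;  p = 3;  alpha = 0.05;
theta  = X \ y;                                % OLS estimate
R      = y - X*theta;                          % realized residual
sigma2 = (R'*R)/(n - p);                       % variance estimate (11.9)
SE     = sqrt(sigma2*diag(inv(X'*X)));         % standard errors SE_k in (11.17)
tcrit  = tinv(1 - alpha/2, n - p);             % approx 2.18 for 12 degrees of freedom
disp([theta - tcrit*SE, theta + tcrit*SE])     % rows approx the intervals above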
11.2 Prediction Intervals for Linear Problems
For the linear model
$$Y = X\theta + \varepsilon, \qquad (11.18)$$
(11.6) and (11.7) yield the estimator $\hat\theta$ and estimate $\theta$ for the true but unknown $p \times 1$ parameter vector $\theta_0$. Here we discuss predictions of the scalar response $Y_{x^*}$ at values $x^* = [x_1^*, \ldots, x_p^*]$ that are not among the data in the $n \times p$ design matrix $X$ used to infer $\theta$. Such predictions constitute a primary use of the model. Throughout this discussion, we assume that the observation errors $\varepsilon = [\varepsilon_1, \ldots, \varepsilon_n]^T$ are independent and identically distributed with $\varepsilon_i \sim N(0, \sigma_0^2)$.
We first consider predictions of the mean response $\mu_{x^*} = E(Y_{x^*})$ at $x^*$. This provides a basis for subsequently predicting a new observation $Y_{x^*}$, which is typically the goal. As we will illustrate, the interval estimates differ for the two cases.

Because $E(Y_{x^*})$ is a parameter, we employ the unbiased point estimator $\hat Y_{x^*} = x^*\hat\theta$.
It follows from (4.19) and property (ii) of (11.8) that
$$\mathrm{var}(\hat Y_{x^*}) = \sigma_0^2\left[x^*(X^TX)^{-1}x^{*T}\right]. \qquad (11.19)$$
Since $\sigma_0^2$ is typically unknown, we employ the variance estimator $\hat\sigma^2$ specified in (11.9), which yields the estimator
$$\hat\sigma^2(\hat Y_{x^*}) = \hat\sigma^2\left[x^*(X^TX)^{-1}x^{*T}\right]. \qquad (11.20)$$
From the assumption that $\varepsilon_i \sim N(0, \sigma_0^2)$, it follows from Property 11.9 that $\hat Y_{x^*}$ is a linear combination of jointly multivariate normal random variables, which implies that its distribution is normal with mean $\mu_{x^*}$ and variance (11.19), so that
$$\frac{\hat Y_{x^*} - \mu_{x^*}}{\sigma_0\sqrt{x^*(X^TX)^{-1}x^{*T}}} \sim N(0,1).$$
From Definition 4.13, (11.9), and the independence of $\hat\theta$ and $\hat\sigma$, it follows that
$$T = \frac{\hat Y_{x^*} - \mu_{x^*}}{\hat\sigma\sqrt{x^*(X^TX)^{-1}x^{*T}}}$$
has a t-distribution with $n - p$ degrees of freedom. The $(1-\alpha) \times 100\%$ confidence interval for $\mu_{x^*}$ is thus
$$\hat Y_{x^*} \pm t_{n-p,1-\alpha/2}\cdot\hat\sigma\sqrt{x^*(X^TX)^{-1}x^{*T}}. \qquad (11.21)$$
We now consider the construction of interval estimates for the new prediction $Y_{x^*}$, which is a random variable since it has not yet occurred. We again assume that the estimators $\hat\theta$ and $\hat\sigma^2$ have been computed using previous data $Y$, in which case $Y_{x^*}$ will be independent of $\hat\theta$ and $\hat\sigma$. It thus follows that the random variable $Y_{x^*} - \hat Y_{x^*}$ will be normally distributed with mean
$$E(Y_{x^*} - \hat Y_{x^*}) = 0$$
and variance
$$\mathrm{var}(Y_{x^*} - \hat Y_{x^*}) = \mathrm{var}(Y_{x^*}) + \mathrm{var}(\hat Y_{x^*}) = \sigma_0^2\left[1 + x^*(X^TX)^{-1}x^{*T}\right].$$
It follows that
$$\frac{Y_{x^*} - \hat Y_{x^*}}{\sigma_0\sqrt{1 + x^*(X^TX)^{-1}x^{*T}}} \sim N(0,1),$$
and replacing $\sigma_0$ by the estimator $\hat\sigma$, as in the construction of (11.21), yields a quantity with a t-distribution with $n - p$ degrees of freedom. The $(1-\alpha) \times 100\%$ prediction interval for $Y_{x^*}$ is thus
$$\hat Y_{x^*} \pm t_{n-p,1-\alpha/2}\cdot\hat\sigma\sqrt{1 + x^*(X^TX)^{-1}x^{*T}}.$$
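A short sketch illustrates the difference between the two intervals; here the height 65.5 in is a hypothetical prediction point not in Table 11.1, and X and y are reused from the earlier listing. The prediction interval is wider than the mean-response interval (11.21) because of the additional 1 under the square root.

% Mean-response CI (11.21) and prediction interval at a new height
xs     = 65.5;                                 % hypothetical height (in)
xstar  = [1, xs/12, (xs/12)^2];                % 1-by-p prediction row
n = 15;  p = 3;  alpha = 0.05;
theta  = X \ y;
R      = y - X*theta;
sigma  = sqrt((R'*R)/(n - p));                 % error standard deviation estimate
tcrit  = tinv(1 - alpha/2, n - p);
Yhat   = xstar*theta;                          % point prediction
g      = xstar*((X'*X)\xstar');                % x*(X'X)^{-1}x*^T
ci_mu  = Yhat + tcrit*sigma*sqrt(g)*[-1 1]     % CI for the mean response
pi_new = Yhat + tcrit*sigma*sqrt(1 + g)*[-1 1] % prediction interval for Y_x*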