Numerical Simulations
Pierre Chausse
15/09/2021
Introduction
In this document, I want to introduce you to numerical simulations. In the first lecture, we covered a lot of
topics related to sampling theory. You do have to understand the theory, but I don’t want you to focus too
much on it. This is meant to be a review to prepare you for the next topics. Instead, I want to show you how
to use simulations to illustrate the different results I presented. I think it is a good way to understand the
results and it is a very useful tool that we will often use in this course. In fact, it will be the topic of the next
assignment.
What is a numerical simulation?
Simulations are used to derive the properties of estimators when they cannot be derived analytically. For
example, we saw in class that the distribution of X̄ is unknown unless we assume that our sample comes from
a normally distributed population. Since the distribution is a property defined in repeated samples, we can simulate
what we would obtain if we had the opportunity to collect multiple samples. In general, suppose we are
interested in the properties of T(X), a function of the sample X = {x1, x2, …, xn}. To obtain these properties,
we draw many samples from a given population (to be determined) and we save all realizations of T(X). If
we compute a large enough number of realizations, we can use them to approximate the properties of T(X).
The general algorithm is:
Simulation algorithm
1. We determine a DGP (data generating process). The DGP is basically a specification for
our population of interest. It can be as simple as a distribution for the sample X or as
complex as a complete model for a causal study.
2. We generate N datasets (with a given sample size) from the selected DGP.
3. We compute the statistic of interest T(X) for each dataset, where X is either a random vector or a random
dataset.
4. We analyze the properties of T(X) by approximating the infinite population of T(X) by its
N realizations.
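To make these steps concrete, here is a minimal sketch (with an arbitrary DGP chosen for illustration): the DGP is a N(0,1) population, the statistic is T(X) = X̄, the sample size is n = 30 and the number of replications is N = 1000.
N <- 1000                            # Step 2: number of simulated datasets
n <- 30                              # sample size of each dataset
Tvec <- numeric(N)
for (i in 1:N)
{
x <- rnorm(n)                        # Steps 1-2: the DGP is N(0,1); draw one sample
Tvec[i] <- mean(x)                   # Step 3: compute T(X) = the sample mean
}
c(mean=mean(Tvec), sd=sd(Tvec))      # Step 4: approximate the mean and sd of T(X)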
How do we code simulations?
Since we are generating multiple samples and are computing T(X) each time, it is natural to construct a
simulation using loops. T(X) could be a scalar, a vector or even a matrix. Since we have to store the N realizations of T(X), we
need to create an appropriate object to store them. Here are the possible options (we assume that the function
f(x) computes T(X) and g(n) generates a sample of size n):
T(X) is a scalar:
Tvec <- numeric(N) # We create a numerical vector of N elements
# or #
Tvec <- numeric() # We create an empty numerical vector.
for (i in 1:N)
{
x <- g(n)
Tvec[i] <- f(x)
}
T(X) is a vector:
Here we want to store T(X) in a matrix. Each realization can be stored in a row of the matrix or in a
column; the choice is yours. In the following, they are stored in rows (we suppose that T(X) has l
elements).
Tmat <- matrix(nrow=N, ncol=l) # We create an N x l matrix (filled with NA)
for (i in 1:N)
{
x <- g(n)
Tmat[i,] <- f(x)
}
Another option is to stack the T(X)'s on top of each other using rbind:
Tmat <- numeric() # This command just creates Tmat so that it exists
for (i in 1:N)
{
x <- g(n)
Tmat <- rbind(Tmat, f(x))
}
T(X) is a matrix or a more complicated object:
This is not really something we will use in this course. But if you want to know, you have the option of
storing the results in an array (3D matrix) or a list. In the following, I show how to do it with a list:
Tmat <- list() # This command just creates an empty list
for (i in 1:N)
{
x <- g(n)
Tmat[[i]] <- f(x) # [[]] are used to select elements of a list
}
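Once the loop is done, the list can be combined into a single object. A small sketch (assuming every f(x) returns an object of the same dimensions):
Tarr <- simplify2array(Tmat)   # stacks the N results into one array
# or, if each f(x) returns a vector, bind them as rows of a matrix:
# Tmat2 <- do.call(rbind, Tmat)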
Can we avoid loops? (optional)
Unless you program in C, C++ or Fortran, it is recommended to avoid loops when possible. However,
it may require a little more thinking. Suppose we want to analyze the properties of X̄ for n = 40 and
xi ∼ U(−1, 1), and suppose we want N = 1,000. You can create a 40 × 1000 matrix in which each column is a
sample. This can be done using the following code:
X <- matrix(runif(40000, -1, 1), 40, 1000)
Then we can simply compute the mean of each column:
Tvec <- colMeans(X)
Of course, this approach requires more RAM because you need to store a 40 x 1000 matrix. This is certainly
not a problem here, but if you want n=2000 and N=10000, your computer may not have enough RAM. In that
case, you may have to rely on loops. Just to see the difference, you can compare the computational time of
the two approaches (each written in a function for simplicity):
f1 <- function()
{
Tvec <- numeric()
for (i in 1:1000)
Tvec[i] <- mean(runif(40, -1, 1))
}
f2 <- function()
{
Tvec <- colMeans(matrix(runif(40000, -1, 1), 40, 1000))
}
(t1 <- system.time(f1()))
## user system elapsed
## 0.037 0.000 0.036
(t2 <- system.time(f2()))
## user system elapsed
## 0.001 0.000 0.001
It is about 36 times faster without the loop. But who cares if it takes 0.036 seconds instead of 0.001 seconds?
In this particular case, it is not worth the effort of avoiding the loop. Note that the system time computed
above changes each time I recompile the Rmd file. It depends on many factors (e.g. which other programs
are running at the same time). It will also be different for you because you have a different system. If you
want a more precise measure of the relative time, the function benchmark from the rbenchmark package
measures the relative time by repeating each function 100 times (by default):
library(rbenchmark)
benchmark(loop=f1(), matrix=f2())
In the above function, I use colMeans to apply the function mean to each column. It works like the more
general apply function, which is very useful when you want to run a function on each column or each row of a matrix.
The following lines are equivalent:
A <- matrix(rnorm(1000),200,5)
colMeans(A)
## [1] -0.004863901 -0.054034170 -0.074240232 0.027931203 -0.064889274
apply(A,2,mean) # applies mean to the second dimension (the columns)
## [1] -0.004863901 -0.054034170 -0.074240232 0.027931203 -0.064889274
colSums(A)
## [1] -0.9727802 -10.8068340 -14.8480465 5.5862406 -12.9778547
apply(A,2,sum) # applies sum to the second dimension (the columns)
## [1] -0.9727802 -10.8068340 -14.8480465 5.5862406 -12.9778547
You can even write more complicated functions:
apply(A,2,function(x) c(mean(x), sd(x), length(x), min(x),max(x)))
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.004863901 -0.05403417 -0.07424023 0.0279312 -0.06488927
## [2,] 0.934366230 0.95020050 1.00383562 1.0327412 1.08296687
## [3,] 200.000000000 200.00000000 200.00000000 200.0000000 200.00000000
## [4,] -2.671902412 -2.98612968 -2.59447978 -2.4786914 -2.95567114
## [5,] 2.448701870 3.01068154 2.22828940 2.6380261 2.63241659
Can we use parallel programming?
If you have multiple cores on your computer or have access to a supercomputer, it may be beneficial to do
some parallel programming: running more than one estimation simultaneously. There is more than one way
to do it in R; I will show you the easiest one. Unfortunately, it does not work on Windows operating systems.
Consider the following function:
Consider the following function:
f <- function(n)
{
x <- runif(n, -1,1)
mean(x)
}
f(40)
## [1] -0.1271196
f(40)
## [1] 0.09172116
The function generates a random vector of size n from a U(−1, 1) distribution and returns its sample mean. You
can send that function to multiple cores simultaneously using the mclapply function from the parallel
package:
library(parallel)
res <- mclapply(1:1000, function(i) f(40), mc.cores=8) # my laptop has 8 cores
Since res is a list, we can convert it to a vector using the following:
Tvec <- simplify2array(res)
I use this function all the time in my research. When one estimation takes several seconds and I use my
32-core workstation, the simulation time is substantially reduced. Is it worth it in this case? Not really.
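If you are on Windows, a possible workaround (just a sketch; the number of workers is arbitrary) is to create a cluster of worker processes with the same parallel package and use parSapply:
library(parallel)
cl <- makeCluster(2)                   # start 2 worker processes (adjust to your machine)
clusterExport(cl, "f")                 # the workers need their own copy of f
Tvec <- parSapply(cl, 1:1000, function(i) f(40))
stopCluster(cl)                        # always stop the cluster when you are done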
Estimating the bias
The sample standard deviation
For this exercise, we want to estimate the bias of s. We saw that s² is an unbiased estimator of the variance,
but s is biased since it is a nonlinear transformation of s². For this purpose, our DGP is a chi-square
distribution with 8 degrees of freedom (the variance is 16, which implies σ = 4). In the following, I use a
function that returns the estimated bias for a given sample size using 2000 replications. You do not have to use a
function, but it is easier this way since we will try different sample sizes.
f <- function(n)
{
Tvec <- numeric()
for (i in 1:2000)
{
x <- rchisq(n, 8)
Tvec[i] <- sd(x)
}
mean(Tvec)-4 ## estimated bias
}
Then, we can try with n=10:
set.seed(112233) # To make it reproducible
f(10)
## [1] -0.1918991
Therefore, it underestimates the standard deviation. Let's see what happens if we increase the sample size:
n <- seq(10,200,by=5)
bias <- numeric()
for (i in 1:length(n))
bias[i] <- f(n[i])
plot(n, bias, type='l', main="Bias of the standard deviation S")
abline(h=0, lty=2, col="blue")
[Figure: Bias of the standard deviation S, plotted against the sample size n; the dashed blue line marks a bias of 0.]
We can see that the bias seems to converge to 0. In other words, s is asymptotically unbiased. The negative
bias that we observe is a consequence of Jensen's inequality: when f(x) is concave, E(f(x)) ≤ f(E(x)). In
this case, E(√S²) ≤ √E(S²) = σ, with equality only if S² is constant (not random).
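To see Jensen's inequality at work directly, here is a quick sketch (using the same χ² DGP with 8 degrees of freedom and n = 10): we compare the average of √S² across replications with the square root of the average of S².
set.seed(112233)
S2vec <- replicate(2000, var(rchisq(10, 8)))
c(meanOfSqrt = mean(sqrt(S2vec)), sqrtOfMean = sqrt(mean(S2vec)))
# The first number is smaller than the second, which is the source of the negative bias.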
The unbiasedness of S²
We can repeat the exercise for S² (the true variance is 16):
f2 <- function(n)
{
Tvec <- numeric()
for (i in 1:2000)
{
x <- rchisq(n, 8)
Tvec[i] <- var(x)
}
mean(Tvec)-16 ## estimated bias
}
n <- seq(10,200,by=5)
bias <- numeric()
for (i in 1:length(n))
bias[i] <- f2(n[i])
plot(n, bias, type='l', main=expression("Bias of"~S^2))
abline(h=0, lty=2, col="blue")
[Figure: Bias of S², plotted against the sample size n; the dashed blue line marks a bias of 0.]
We can see that we do not get exactly 0. Since we are estimating the bias, the measure is a random variable
with a different realization every time we re-run the simulation. For the simulation itself, the number of
replications N plays the role of the sample size, so the precision of the estimated bias increases with N.
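To see this, here is a small sketch (fN is a hypothetical variant of f2 that takes the number of replications N as a second argument): we re-run the bias estimation several times for a small and a large N and compare how much the estimates vary.
fN <- function(n, N)
{
Tvec <- numeric(N)
for (i in 1:N)
Tvec[i] <- var(rchisq(n, 8))
mean(Tvec)-16 ## estimated bias based on N replications
}
# Spread of the estimated bias across 20 independent simulations:
c(N200 = sd(replicate(20, fN(50, 200))),
N5000 = sd(replicate(20, fN(50, 5000))))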
Efficiency of estimators
To compare the efficiency of estimators, we use the same procedure, but we compute the standard deviation
of the simulated estimates instead. We report the standard deviation rather than the variance because it is
easier to interpret: it has the same units as the random variable. We saw that X̄ is BLUE. It means that the
variance of any other linear unbiased estimator of the form µ̃ = ∑ωᵢxᵢ, with weights summing to one (∑ωᵢ = 1),
is at least as large as the variance of X̄. Let's test that result.
For this simulation, the sample size is set to 30, X ∼ t₅ and N = 5000. We also set ωᵢ to 1/20 for i = 1, ..., 15
and to 1/60 for i = 16, ..., 30 (verify that the weights do sum to 1). In the following, I use the apply function:
set.seed(112233)
w <- c(rep(1/20,15), rep(1/60,15))
X <- matrix(rt(30*5000,5), 30, 5000)
Xbar <- apply(X, 2, mean)
muTilde <- apply(X, 2, function(x) sum(x*w))
c(Xbar=sd(Xbar), muTilde=sd(muTilde))
## Xbar muTilde
## 0.2371467 0.2636798
We can see that the standard deviation of X̄ is smaller than that of µ̃; therefore, X̄ is more efficient. Of
course, this only demonstrates the superiority of X̄ for this particular vector of weights and this particular distribution.
Mean Squared Error (MSE)
When we compare estimators, especially if one or both of them are biased, we have to compare the MSE. In
fact, we usually compute the root mean squared error (RMSE) for the same reason we report the standard
error: it is simply the square root of the MSE. In a simulation, it is computed as
RMSE = √( (1/N) ∑ᵢ (θ̂ᵢ − θ)² ),
where θ̂ᵢ is the i-th simulated realization of the estimator and θ is the population value. We want to compare the following estimators of the variance:
S² = (1/(n − 1)) ∑ᵢ (xᵢ − x̄)²
σ̂² = (1/n) ∑ᵢ (xᵢ − x̄)²
For this simulation, the sample size is set to 20, the DGP is a chi-square distribution with 5 degrees of freedom and N = 5000. The population variance is therefore
10. We first compute the two estimators:
set.seed(112233)
X <- matrix(rchisq(20*5000, 5), 20, 5000)
S2 <- apply(X, 2, var)
sigma2 <- apply(X, 2, function(x) mean((x-mean(x))^2))
Then, we compare the RMSE:
rmse1 <- sqrt(mean((S2-10)^2))
rmse2 <- sqrt(mean((sigma2-10)^2))
c(S2=rmse1, Sigma2=rmse2)
## S2 Sigma2
## 8.752529 8.462327
We can see that σ̂² is better than S² because its RMSE is smaller. This is an example of a biased estimator
that dominates an unbiased one. This is hard to prove analytically because the variance of the estimators
depends on the distribution of X. We know that the variance of σ̂² is smaller, but we do not know how
much smaller it is.
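To see the variance part of the trade-off, we can look at the simulated standard deviations of the two estimators (a quick check using the S2 and sigma2 vectors computed above):
c(S2 = sd(S2), sigma2 = sd(sigma2))
# sigma2 equals (n-1)/n times S2 here, so its standard deviation is smaller by construction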
Consistency of estimators
Proving consistency is harder, because it is a property when n goes to infinity. It is impossible to have such
large sample size in practice. However, we can observe some trends as n increases.
Consistency of X̄
Let's analyze the properties for n = 1, ..., 2000 and N = 2000. We can do it using the cumulative sum function
cumsum:
set.seed(112233)
X <- sapply(1:2000, function(i) rnorm(2000))
Xbar <- apply(X,2,cumsum)
Xbar <- Xbar/(1:2000)
Here, I use two new concepts. The sapply function is like a loop: it goes from 1 to 2000, computes rnorm(2000) each time and
puts the results side by side. It is similar to the previous approach of using matrix, but I wanted to teach you
another way. I also use an element-by-element operation between a matrix and a vector, which is allowed in R.
Since the length of the vector is equal to the number of rows of the matrix, the element-by-element operation
is applied to each column. Therefore, each column contains the 2000 X̄'s for n = 1, ..., 2000, and we have 2000
such columns.
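Here is a tiny sketch of that recycling rule (with a made-up 3 × 2 matrix), just to see which dimension the vector is applied to:
M <- matrix(1:6, 3, 2)   # a 3 x 2 matrix
M/(1:3)                  # each column is divided element by element by c(1, 2, 3)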
First, let’s compare the sample means for a few n’s:
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(Xbar[20,], main="Illustrating the consistency of the sample mean",
ylab="", xlab="Iterations")
points(Xbar[200,], col=2, pch=21, bg=2)
points(Xbar[1000,], col=3, pch=21, bg=3)
points(Xbar[2000,], col=4, pch=21, bg=4)
legend("topright", c("n=20", "n=200", "n=1000", "n=2000"), pt.bg = 1:4,
col=1:4, pch=c(1,21,21,21), inset=c(-0.22,0), bty='n')
[Figure: Illustrating the consistency of the sample mean; X̄ across the 2000 replications for n = 20, 200, 1000 and 2000.]
The graph shows that the values of X̄ get more and more concentrated around the true mean, which is what we
mean by convergence in probability: the probability of |X̄ − µ| > ε gets smaller and smaller as n increases.
We saw that it is sufficient to prove that the variance goes to zero, so another way to show convergence is to
plot the variance as a function of the sample size:
v <- apply(Xbar, 1, var)
plot(v[-(1:10)], type='l', main="Variance of the sample mean", ylab="variance",
xlab="n") # I remove n=1,...,10 to better see
[Figure: Variance of the sample mean as a function of n.]
We can also estimate the rate of convergence of the variance. If the rate is n^α, or O(n^α), we should have the
relationship V = Cn^α, which implies log(V) = α₀ + α log(n). This can be done using a regression:
lm(log(v)~log(1:2000))
##
## Call:
## lm(formula = log(v) ~ log(1:2000))
##
## Coefficients:
## (Intercept) log(1:2000)
## -0.003359 -0.996205
It shows that the variance converges to 0 at the same rate as n⁻¹. Some students were asking why we always
write √n(X̄ − µ). We do that when we want to prevent the variance from going to 0: the variance of
this expression is nVar(X̄), and since Var(X̄) is O(n⁻¹), nVar(X̄) is O(1), which means that it does not
converge to 0. We do it when we want the asymptotic distribution. The asymptotic distribution of (X̄ − µ) alone
is a "degenerate" distribution because it is equal to 0 with probability 1. We can see what happens using a graph:
v <- apply(Xbar, 1, var)*(1:2000)
plot(v[-(1:10)], type='l', main="Variance of the sample mean times n", ylab="variance",
xlab="n") # I remove n=1,...,10 to better see
[Figure: Variance of the sample mean times n, as a function of n.]
As a last exercise, you can use the definition of convergence in probability and compute the probability of
|X̄ − µ| > ε for a few values of ε:
p1 <- rowMeans(abs(Xbar)>0.1)
p2 <- rowMeans(abs(Xbar)>0.08)
p3 <- rowMeans(abs(Xbar)>0.05)
p4 <- rowMeans(abs(Xbar)>0.03)
matplot(1:2000, cbind(p1,p2,p3,p4), type='l', col=1:4, xlab="n", ylab="Prob.",
main="Probability of the mean being greater than e", lty=1)
legend("topright", c("e=0.1","e=0.08","e=0.05","e=0.03"), col=1:4, lty=1, bty='n')
[Figure: Probability of the mean being greater than e, for e = 0.1, 0.08, 0.05 and 0.03.]
We see that the probability converges to 0 for all four values of ε, which is a necessary condition for convergence
in probability. In a simulation, if you set ε to too small a value, you won't observe the convergence because n will
not be high enough.
Convergence in Distribution
Testing the CLT is a little harder, because we have to compare densities. We could use normality tests to
get a sense of how close the distribution is to the normal, but we won't do it here. I just want to present a
few visualization methods.
Distribution of the sample mean
We know that X̄ is N(µ, σ²/n) when the sample comes from a normally distributed population with mean
µ and variance σ². We can verify this by comparing the densities. In the following, µ = 5, σ² = 4, n = 20 and
N = 5000.
X <- sapply(1:5000, function(i) rnorm(20, 5, 2))
Xbar <- apply(X,2,mean)
hist(Xbar, freq=FALSE) # We want the density, not the frequency
curve(dnorm(x, 5, 2/sqrt(20)), 3, 7, add=TRUE, col=2, lty=2, lwd=2)
[Figure: Histogram of Xbar with the N(5, 4/20) density overlaid.]
The histogram does look normal. We can also use a smoothed density:
plot(density(Xbar), main="Density of Xbar")
curve(dnorm(x, 5, 2/sqrt(20)), 3, 7, add=TRUE, col=2, lty=2, lwd=2)
[Figure: Smoothed density of Xbar (N = 5000, bandwidth = 0.07412) with the N(5, 4/20) density overlaid.]
Or we can use a QQ-plot, which compares the quantiles of X̄ with the quantiles of the normal distribution.
If it is normal, the points follow the straight line:
qqnorm(Xbar)
qqline(Xbar)
[Figure: Normal Q-Q plot of Xbar.]
You can also compare some quantiles of the normal with the empirical quantiles of X̄:
q <- c(.01,.05,.1,.99,.95,.99)
rbind(quantile(Xbar, q), qnorm(q, 5, 2/sqrt(20)))
## 1% 5% 10% 99% 95% 99%
## [1,] 3.962516 4.262441 4.428501 6.057918 5.746840 6.057918
## [2,] 3.959626 4.264399 4.426873 6.040374 5.735601 6.040374
We can see how close they are to each other.
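If you prefer a formal check to a visual one (the normality tests mentioned at the beginning of this section), here is a quick sketch using the Shapiro-Wilk test on the simulated X̄'s:
shapiro.test(Xbar)   # a large p-value is consistent with normality
# Note: shapiro.test accepts at most 5000 observations, which is exactly our N.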
For non-normal observations, we need to rely on the asymptotic distribution. In that case, we use the CLT,
which states that √n(X̄ − µ) → N(0, σ²). It implies that X̄ ≈ N(µ, σ²/n). There is a more useful version of
this result. Since σ̂² converges in probability to σ²,
√n(X̄ − µ)/σ̂ → (1/σ) N(0, σ²) = N(0, 1).
This follows from Slutsky's theorem: if cₙ converges in probability to c and xₙ converges in distribution to ξ, then cₙxₙ converges in
distribution to cξ. We use this result to justify the approximation
√n(X̄ − µ)/σ̂ ≈ N(0, 1).
How good is the approximation? Let’s try with n=20 and xi = 1 with probability 0.5 and xi = 0 with
probability 0.5:
X <- sapply(1:5000, function(i) rbinom(20, size=1, prob=0.5))
z <- apply(X, 2, function(x) sqrt(20)*(mean(x)-0.5)/sd(x))
q <- c(.01,.05,.1,.99,.95,.99)
rbind(quantile(z, q), qnorm(q, 0, 1))
## 1% 5% 10% 99% 95% 99%
## [1,] -2.516611 -1.902379 -1.370810 2.516611 1.902379 2.516611
## [2,] -2.326348 -1.644854 -1.281552 2.326348 1.644854 2.326348
We can see that the quantiles are different when n=20. If we try with n=100 we get the following:
X2 <- sapply(1:5000, function(i) rbinom(100, size=1, prob=0.5))
z2 <- apply(X2, 2, function(x) sqrt(100)*(mean(x)-0.5)/sd(x))
q <- c(.01,.05,.1,.99,.95,.99)
rbind(quantile(z2, q), qnorm(q, 0, 1))
## 1% 5% 10% 99% 95% 99%
## [1,] -2.459865 -1.612757 -1.202676 2.459865 1.612757 2.459865
## [2,] -2.326348 -1.644854 -1.281552 2.326348 1.644854 2.326348
The quantiles get a little closer to those of the N(0,1) when the sample size increases. In fact, the approximation is not bad at all even
with such a small sample. Let's look at the QQ-plot:
qqnorm(z2)
qqline(z2)
[Figure: Normal Q-Q plot of the standardized sample mean z2 (n = 100).]
In the graph, we see that the quantiles deviate from the normal distribution in the tails. This is important,
because when we perform hypothesis tests, we are mostly interested in the tails.