ESS 116
Introduction to Data Analysis in Earth Science
Image Credit: NASA
Instructor: Mathieu Morlighem
E-mail: mmorligh@uci.edu (include ESS116 in subject line)
Office Hours: 3218 Croul Hall, Friday 2:00 pm – 3:00 pm
This content is protected and may not be shared, uploaded, or distributed
Part 1: 60 min “in class” next week
May 12th, 2:00 pm
cheat sheet (Letter paper, double sided, hand written)
mix of multiple choice and short answer questions
Everything from Lecture 1 to 5 (no hypothesis testing)
Bring a scientific calculator
Part 2: 2 hours “in lab”
May 13th during your regular lab session
Open Book
Similar to what you have experienced with your labs
No make up exam for the midterm
Midterm exam
Lecture 5 quick review
Lecture 6 – Hypothesis testing
Sampling Distribution of the sample mean
Central Limit Theorem (CLT)
Confidence intervals
Hypothesis Testing
t-test (Comparing means)
χ2-test (Goodness of fit)
Today’s lecture
Population: the actual properties of the real world
Sample: set of values imperfectly representing the population
Parameters: refer to the population (e.g., μ and σ)
Statistics: refer to the sample (e.g., x̄ and s)
Accuracy: quality of being close to the true value
Precision: number of significant digits in a numerical value (measurements or calculation)
Lecture 5 – review
Sample visualization
Frequency Table
Cumulative Frequency
Histogram
Rules for a good histogram
number of bins ≈ √(sample size)
histogram takes either a number of bins or a list of bin edges
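A minimal MATLAB sketch of the two calling forms of histogram. It assumes the common square-root rule of thumb for the number of bins, and the data vector is synthetic and purely illustrative.

```matlab
% Synthetic sample (illustrative only, not course data)
rng(1);                          % fix the seed so the result is reproducible
data = 20 + 5*randn(100,1);      % 100 values, roughly normal

% Form 1: give histogram a number of bins (square-root rule of thumb)
nbins = round(sqrt(numel(data)));
figure; histogram(data, nbins);

% Form 2: give histogram a list of bin edges (nbins+1 edges make nbins bins)
edges = linspace(min(data), max(data), nbins+1);
figure; histogram(data, edges);
```

Passing edges is useful when several histograms must share exactly the same bins.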
Lecture 5 – review
Central Tendency:
Mean (average)
Median (50% higher, 50% lower)
Mode(s) (peak value(s))
Dispersion:
Range (max – min)
Standard deviation (average distance to mean)
Shape:
Skewness (positive: tail to the right, negative: tail to the left)
Kurtosis (<3: flat, >3: peaked)
Know how they relate to visual features on a histogram
Statistical parameters
what you need to remember
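The statistical parameters listed above can be computed as in the sketch below. The sample values are hypothetical, and skewness and kurtosis are assumed to be available from the Statistics and Machine Learning Toolbox (the same toolbox that provides normcdf).

```matlab
% Small illustrative sample (hypothetical values)
x = [2 3 3 4 5 5 5 6 9];

m   = mean(x);            % central tendency: mean
med = median(x);          % central tendency: median
mo  = mode(x);            % central tendency: most frequent value
r   = max(x) - min(x);    % dispersion: range
s   = std(x);             % dispersion: sample standard deviation (n-1)
sk  = skewness(x);        % shape: > 0 means a longer tail to the right
ku  = kurtosis(x);        % shape: equals 3 for a normal distribution

fprintf('mean=%g median=%g mode=%g range=%g std=%.3f\n', m, med, mo, r, s);
fprintf('skewness=%.3f kurtosis=%.3f\n', sk, ku);
```

Here the single large value 9 pulls the mean above most of the data and makes the skewness positive.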
Histograms: empirical frequency distribution of our sample.
A histogram for N → ∞ and an infinitely small bin size will produce a Probability Density Function (PDF), f(x)
The probability that x is between x1 and x2 is: $P(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\,dx$
Examples of theoretical Distributions:
Normal distribution (2 parameters: μ and σ)
Z distribution (0 parameters)
Student’s t distribution (1 parameter: Φ, the number of degrees of freedom)
Probability Density Functions
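As a sketch of what "area under the PDF" means in practice, the probability that x falls between x1 and x2 can be obtained either by integrating the PDF numerically or by differencing the cumulative distribution. The standard normal and the bounds ±1 are arbitrary choices for illustration.

```matlab
mu = 0; sigma = 1;        % standard normal, chosen for illustration
x1 = -1; x2 = 1;

% Area under the PDF between x1 and x2, by numerical integration
p_int = integral(@(x) normpdf(x, mu, sigma), x1, x2);

% Same probability from the cumulative distribution function
p_cdf = normcdf(x2, mu, sigma) - normcdf(x1, mu, sigma);

fprintf('integral: %.4f, cdf difference: %.4f\n', p_int, p_cdf);
```

Both numbers are about 0.68, the familiar share of a normal population within one standard deviation of the mean.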
Normal (μ,σ)
Given x0, find p0
>> p0 = normcdf(x0,mu,sigma);
Given p0, find x0
>> x0 = norminv(p0,mu,sigma);
Z-distribution
>> p0 = normcdf(x0);
>> x0 = norminv(p0);
t-distribution
>> p0 = tcdf(x0,V);
>> x0 = tinv(p0,V);
χ2-distribution
>> p0 = chi2cdf(x0,V);
>> x0 = chi2inv(p0,V);
MATLAB theoretical distributions
p0 = P(x < x0), e.g.: 0.88 = P(x < 1.17)
ESS 116 grades follow a Normal distribution of mean 800 with a standard deviation of 100. What is the probability of having a grade below 500?
1 – normcdf(500,800,100);
1 – norminv(500,800,100);
normcdf(500,800,100);
norminv(500,800,100);
i>Clicker question
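The cdf and inv functions are inverses of each other: one goes from a value to a cumulative probability, the other goes back. A short sketch using the numbers quoted above (the variable names are illustrative):

```matlab
% Z-distribution: cumulative probability at x0, then back again
x0 = 1.17;
p0 = normcdf(x0);                    % P(x < 1.17), about 0.88
x0_back = norminv(p0);               % recovers 1.17

% Same round trip for a normal distribution with mu and sigma
mu = 800; sigma = 100;               % values from the grade example
p500 = normcdf(500, mu, sigma);      % P(grade < 500)
g_back = norminv(p500, mu, sigma);   % recovers 500

fprintf('P(x < %.2f) = %.4f, P(grade < 500) = %.5f\n', x0, p0, p500);
```

A grade of 500 sits three standard deviations below the mean, so the probability is small (roughly 0.1%).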
Lecture 6 – Hypothesis testing
Central Limit Theorem
Population distribution
Sampling distribution of the sample mean
Sample size
Central Limit Theorem in Action
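One way to see the theorem at work is a small simulation: draw many samples from a population that is not normal, and look at the distribution of their means. The sketch below assumes a uniform population on [0, 1]; the sample size and number of repetitions are arbitrary.

```matlab
rng(0);                          % reproducible random numbers
nSamples = 5000;                 % number of repeated samples (arbitrary)
n = 40;                          % size of each sample (arbitrary)

% Population: uniform on [0, 1], so mu = 0.5 and sigma = sqrt(1/12)
mu = 0.5; sigma = sqrt(1/12);
X = rand(n, nSamples);           % each column is one sample
xbar = mean(X, 1);               % one sample mean per column

fprintf('mean of sample means: %.4f (theory %.4f)\n', mean(xbar), mu);
fprintf('std  of sample means: %.4f (theory %.4f)\n', std(xbar), sigma/sqrt(n));
figure; histogram(xbar, round(sqrt(nSamples)));   % roughly bell-shaped
```

Even though the population is flat, the histogram of sample means is close to a normal curve centered on μ with spread σ/√n.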
The average male drinks 2 L of water when active outdoors (with a standard deviation of 0.7 L). You are planning a full day in nature with 50 men and will bring 110 L of water. What is the probability that you run out?
Example
Sampling distribution of the Sample mean
P(run out) = P(average water use > 110/50 L)
= P(average water use > 2.2 L)
= P(x̄ > 2.2)
= 1 – P(x̄ < 2.2)
= 1 – normcdf(2.2,2,0.7/sqrt(50))
= 0.0217
The probability of running out of water is 2.17%
Population distribution
Confidence Intervals
Confidence Intervals: provide statistical limits for your mean values based on a degree of statistical confidence.
Ex: We can say with 95% confidence that the average temperature in Irvine is within [18°C 24°C] or 21 ±3°C
How to calculate this interval?
Set the level of significance α (α = 0.05 for a 95% Confidence Interval)
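Find Δx̄ such that P(μ within Δx̄ of x̄) = 1 – α
Confidence interval in the mean
Using the Central Limit Theorem
[Figure: Normal Distribution of x̄ centered on μ, with area 1- α (95%) between μ – Δx̄ and μ + Δx̄ and α/2 in each tail]
Find Δx̄ such that: P(μ – Δx̄ < x̄ < μ + Δx̄) = 1 – α
Normalizing (subtract μ, divide by σ/√n)
[Figure: Z-Distribution (mean 0, standard deviation 1), with area 1- α (95%) between –Δx̄/(σ/√n) and +Δx̄/(σ/√n) and α/2 in each tail]
Find Δx̄ such that: P(–Δx̄/(σ/√n) < Z < +Δx̄/(σ/√n)) = 1 – α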
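The effect of normalizing can be checked numerically: a probability computed with the normal distribution of the sample mean equals the one computed with the Z-distribution after subtracting the mean and dividing by the standard deviation. The sketch below borrows the numbers of the water example and assumes σ is known; the variable names are illustrative.

```matlab
% Numbers borrowed from the water example earlier in the lecture
mu = 2; sigma = 0.7; n = 50;
se = sigma/sqrt(n);                 % standard deviation of the sample mean

% Normalizing: the same probability with and without rescaling
x0 = 2.2;
p_direct = normcdf(x0, mu, se);
p_z      = normcdf((x0 - mu)/se);   % Z-distribution (mean 0, std 1)

% Half-width of the (1 - alpha) interval when sigma is known
alpha  = 0.05;
zcrit  = norminv(1 - alpha/2);      % about 1.96
deltax = zcrit*se;
fprintf('P(xbar < %.1f) = %.4f, deltax = %.4f\n', x0, p_direct, deltax);
```

The half-width is simply the critical z value scaled back by σ/√n.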
What if we don’t know σ, can we say σ ≈ s ?
If the sample size n > 30: Yes
If the sample size n < 30: Yes but
the distribution of X needs to be roughly normal
we pay a penalty: use a t-distribution instead of a z-distribution (fatter tails)
Distribution of the sample mean
Set the level of significance α (α = 0.05 for a 95% CI)
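The normalized sample mean (x̄ – μ)/(s/√n) follows a:
z-distribution if n>30
t-distribution if n<30 (Φ = n-1)
Find Δx̄ such that: the area between –Δx̄/(s/√n) and +Δx̄/(s/√n) is 1 – α
Confidence interval in the mean
[Figure: Sampling distribution of the Sample mean – “Normalized”, centered on 0, with area 1- α between –Δx̄/(s/√n) and +Δx̄/(s/√n) and α/2 in each tail]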
if n < 30
    deltax = -tinv(alpha/2,n-1)*s/sqrt(n);
    % or (equivalent)
    deltax = tinv(1-alpha/2,n-1)*s/sqrt(n);
else
    deltax = -norminv(alpha/2)*s/sqrt(n);
    % or (equivalent)
    deltax = norminv(1-alpha/2)*s/sqrt(n);
end
You sample 36 apples from your farm’s harvest of over 200,000 apples. The mean weight of the sample is 112 grams (with s = 40 grams).
What is the probability that the mean weight of the 200,000 apples is within 100 and 124 grams?
(i.e. mean is 112±12 g)
Example n>30
P(μ within 12 of x̄) = P(x̄ within 12 of μ)
= P(x̄ within 12 of μ_X̄)
= P(–12/σ_X̄ < (x̄ – μ_X̄)/σ_X̄ < +12/σ_X̄), with σ_X̄ ≈ s/√n = 40/6
= normcdf(12/(40/6)) – normcdf(-12/(40/6))
= 0.9281
Example n>30
We have a 92.8% chance that the actual mean is within 12 grams of our sample mean
[Figure: Sampling distribution of the Sample mean – “Normalized”, centered on 0, with the bounds –12/σ_X̄ and +12/σ_X̄ marked]
7 patients’ blood pressures have been measured after having been given a new drug for 3 months. They had blood pressure increases of 1.5, 2.9, 0.9, 3.9, 3.2, 2.1 and 1.9
Construct the 95% Confidence Interval (CI) for the true expected blood pressure increase for all patients in a population.
Example n<30
Sampling distribution
of the Sample mean
(Student’s t distribution with
Φ = n-1 = 6)
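Here are our statistics for n = 7: x̄ = 2.3429 and s = 1.0422
What is our 95% confidence interval?
[Figure: t-distribution centered on 0, with 95% of the area between –Δz and Δz and 2.5% in each tail]
-Δz = tinv(0.025,7-1) = -2.4469
Δz = tinv(0.975,7-1) = +2.4469
Δx̄ = 2.4469 × s/√n = 0.9639
There is a 95% chance that the mean, μ, is within 2.3429 ± 0.9639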
%Blood pressure increases of the 7 patients (example above)
data = [1.5, 2.9, 0.9, 3.9, 3.2, 2.1, 1.9];
%Compute sample size, sample mean and standard deviation
n = length(data);
xbar = mean(data);
s = std(data);
%Set level of significance (95% CI)
alpha = 0.05;
%Depending on the sample size, we either use a z or t distrib.
if n<30
    deltax = tinv(1-alpha/2,n-1)*s/sqrt(n);
else
    deltax = norminv(1-alpha/2)*s/sqrt(n);
end
fprintf('Population mean is %g ± %g (95%% CI)\n',xbar,deltax);
MATLAB code for confidence Interval
Hypothesis Testing
You read that, on average, a volcanic eruption lasts 7 weeks (μ=7). But we suspect that this number is wrong and should be higher (μ>7).
How can we prove this, for a given level of significance (α=0.05)?
We look at the past n=100 eruptions and find x̄ = 7.2 and s = 1 week.
Introduction
Assuming that μ = 7, there is only a 2.3% chance of finding a mean of at least 7.2 weeks, so we can reject μ=7
Conclusion: μ>7
Assuming that μ=7:
P(x̄ ≥ 7.2) = 1 – P(x̄ < 7.2) = 1 – normcdf(7.2,7,1/10) = 0.0228 < α
Testing one population mean
Null Hypothesis H0: μ = 7
Alternative Hypothesis H1: μ > 7
p-value: assuming that μ = 7, P(x̄ ≥ 7.2) = 0.0228
Assuming that μ = 7, there is only a 2.3% chance of finding a mean of at least 7.2 weeks, so we can reject μ=7
Conclusion: μ>7
Null and alternative hypotheses:
Prepare a statement about a fact for which it is possible to calculate its probability of occurrence (e.g.: μ=7, μ1> μ2, etc.)
This statement is the null hypothesis, H0, and its counterpart is the alternative hypothesis, H1.
H0 is often the reverse of what the experimenter actually believes for tactical reasons.
Level of significance (α): if the p-value is smaller than α, we reject H0
p-value: the probability, assuming H0 is true, of obtaining a result at least as extreme as the one observed
If p-value > α, H0 is not rejected
The lower the p-value, the stronger is the evidence provided by the data against the null hypothesis.
Hypothesis testing
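The volcanic eruption example can be written as a short script. This is a sketch of a one-tailed test on a single mean, assuming n > 30 so that the sample mean can be treated as normal with standard deviation s/√n; the variable names are illustrative.

```matlab
% Values from the volcanic eruption example
mu0 = 7;          % mean under H0 (weeks)
xbar = 7.2;       % sample mean (weeks)
s = 1; n = 100;   % sample standard deviation and size
alpha = 0.05;     % level of significance

% n > 30, so the sample mean is treated as normal with std s/sqrt(n)
se = s/sqrt(n);
pvalue = 1 - normcdf(xbar, mu0, se);   % one-tailed: H1 is mu > mu0

if pvalue < alpha
    fprintf('p = %.4f < %.2f: reject H0\n', pvalue, alpha);
else
    fprintf('p = %.4f >= %.2f: H0 is not rejected\n', pvalue, alpha);
end
```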
t-test “Comparing Means”
Unpaired t-test
the two samples are from independent populations
Ex1: Are tropical fish larger than temperate fish?
Ex2: Are the temperatures in Long Beach and Death Valley significantly different?
Paired vs Unpaired t-test
Paired t-test
the two samples are from the same population
Ex1: Do fish get larger as they age?
Ex2: Is the annual temperature in Death Valley significantly higher in the last 5 years than in the earlier 5 years?
Unpaired t-test (independent populations)
Sample 1: size n1, mean m1 and standard deviation s1
Sample 2: size n2, mean m2 and standard deviation s2
$t_{stat} = \dfrac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$, with Φ = n1 + n2 – 2
Paired t-test (same population)
Both samples should have the same size, n
We look at the differences between all n pairs, with mean x̄d and standard deviation sd:
$t_{stat} = \dfrac{\bar{x}_d}{s_d/\sqrt{n}}$, with Φ = n – 1
If the population means are the same (Null Hypothesis):
The statistic tstat follows a t-distribution (i.e. it should be close to 0!)
Paired vs Unpaired t-test
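Vector of paired differences: $d = [a_1 - b_1,\ a_2 - b_2,\ \ldots,\ a_n - b_n]^T$
Paired t-test
Sample 1: $[a_1, a_2, \ldots, a_n]^T$
Sample 2: $[b_1, b_2, \ldots, b_n]^T$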
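A sketch of both statistics in MATLAB, assuming the usual forms: the unpaired statistic divides the difference of the two means by sqrt(s1^2/n1 + s2^2/n2), and the paired statistic divides the mean of the differences by sd/sqrt(n). The two samples are hypothetical values chosen so the arithmetic is easy to follow.

```matlab
% Hypothetical samples (illustrative values only)
a = [5 7 9];
b = [4 5 6];

% Unpaired: uses each sample's own mean, variance and size
n1 = numel(a); n2 = numel(b);
t_unpaired   = (mean(a) - mean(b)) / sqrt(var(a)/n1 + var(b)/n2);
phi_unpaired = n1 + n2 - 2;          % degrees of freedom

% Paired: works on the vector of differences
d = a - b;
n = numel(d);
t_paired   = mean(d) / (std(d)/sqrt(n));
phi_paired = n - 1;                  % degrees of freedom

fprintf('unpaired: t = %.4f (phi = %d)\n', t_unpaired, phi_unpaired);
fprintf('paired:   t = %.4f (phi = %d)\n', t_paired, phi_paired);
```

With the same numbers, the paired statistic is larger here because the differences vary much less than the samples themselves.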
Student’s t test (2-tailed)
Is tstat statistically distinguishable from zero? Example (90% conf)
[Figure: t-distribution density centered on 0, with the critical values – tcrit and tcrit marked]
There is a 90% chance of finding tstat between – tcrit and tcrit by chance
There is only a 5% probability of finding tstat higher than tcrit purely by chance…
There is only a 5% probability of finding tstat lower than – tcrit purely by chance…
Two-tailed t-test
Null-Hypothesis: (H0): μ1 = μ2
Population means are not statistically different
Alternative Hypothesis: (H1): μ1≠ μ2
Population means are statistically different
two-tailed vs one-tailed t-test
One-tailed t-test
Null-Hypothesis: (H0): μ1 ≤ μ2
μ1 is not statistically significantly greater than μ2 (use H0: μ1= μ2 to test H1)
Alternative Hypothesis: (H1): μ1> μ2
μ1 is statistically significantly greater than μ2
two-tailed vs one-tailed t-test
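Distribution of tstat TWO-tailed test
[Figure: t-distribution centered on 0, with area 1- α between -tcrit and +tcrit and α/2 in each tail]
tcrit = – tinv(alpha/2,phi)
OR
tcrit = tinv(1-alpha/2,phi)
Distribution of tstat ONE-tailed test
[Figure: t-distribution centered on 0, with area 1- α below tcrit and α in the upper tail]
tcrit = tinv(1-alpha,phi)
OR (depending on H0)
[Figure: t-distribution centered on 0, with area α in the lower tail below tcrit and 1- α above it]
tcrit = tinv(alpha,phi)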
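The critical values for the three cases can be compared side by side. In this sketch, alpha and phi are arbitrary illustrative choices.

```matlab
alpha = 0.1;   % level of significance (90% confidence), chosen for illustration
phi   = 5;     % degrees of freedom, chosen for illustration

% Two-tailed: alpha is split between both tails
tcrit_two = tinv(1 - alpha/2, phi);    % same as -tinv(alpha/2, phi)

% One-tailed, upper tail (H1: mu1 > mu2)
tcrit_upper = tinv(1 - alpha, phi);

% One-tailed, lower tail (H1: mu1 < mu2); this value is negative
tcrit_lower = tinv(alpha, phi);

fprintf('two-tailed: %.4f, upper: %.4f, lower: %.4f\n', ...
        tcrit_two, tcrit_upper, tcrit_lower);
```

For the same α, the two-tailed critical value is farther from zero than the one-tailed one, because only α/2 is left in each tail.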
Is it a paired (same population) or unpaired test ?
Is it a one-tailed (e.g. μ1> μ2) or two-tailed test (e.g. μ1 ≠ μ2) ?
Decide upon a level of significance α.
e.g. 99% and 95% are typical (α = 0.01 or 0.05)
Find tcrit using tinv (one- vs. two-tailed)
Find tstat from your sample (paired vs. unpaired)
Compare tstat and tcrit
If |tstat| > tcrit: the difference is significant (you can reject H0)*
else: the difference is not significant (you cannot reject H0)
Optional: determine the p-value (using tcdf of your tstat)
*This example is for a two-tailed test.
Summary
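For the optional last step, the p-value depends on whether the test is one- or two-tailed. A minimal sketch, using a hypothetical tstat and number of degrees of freedom:

```matlab
% Hypothetical test statistic and degrees of freedom
tstat = 2.5;
phi   = 10;
alpha = 0.05;

% One-tailed (H1: mu1 > mu2): area to the right of tstat
p_one = 1 - tcdf(tstat, phi);

% Two-tailed (H1: mu1 ~= mu2): area beyond |tstat| in both tails
p_two = 2*(1 - tcdf(abs(tstat), phi));

% Comparing p with alpha gives the same decision as comparing tstat with tcrit
reject_one = p_one < alpha;
reject_two = p_two < alpha;
fprintf('one-tailed p = %.4f, two-tailed p = %.4f\n', p_one, p_two);
```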
We are interested in ocean acidification. We measure the pH of ocean water at the pier of Newport Beach at two different dates:
In 1994: 8.03, 8.08, 7.99, 8.00, 7.93, 7.98
In 2004: 7.99, 8.02, 7.92, 7.94, 8.01, 7.93
From our two samples, we have:
In 1994: m1 = 8.0017 and s1 = 0.0504
In 2004: m2 = 7.9683 and s2 = 0.0406
Does the difference between the two means show a significant decrease or is it likely caused just by chance?
Example
What kind of t-test should you perform?
Paired t-test
Unpaired t-test
I don’t know
i>Clicker question
What kind of t-test should you perform?
One-tailed t-test
Two-tailed t-test
I don’t know
i>Clicker question
Choose a level of significance α = 0.1 (CI 90%)
This is a “one tailed” test (H1: m2 < m1)
Number of degrees of freedom: Φ = 6 – 1 = 5
tcrit = tinv(1-0.1,5) = 1.4759
Now that we have our critical value, what is our statistic?
d = [8.03, 8.08, 7.99, 8.00, 7.93, 7.98] - [7.99, 8.02, 7.92, 7.94, 8.01, 7.93];
tstat = mean(d)/(std(d)/sqrt(length(d))); % = 1.4464
tstat < tcrit: we cannot reject H0
Optional: p-value = 1-tcdf(tstat,5) = 0.1039
Example
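As a cross-check of the hand calculation, the same paired, one-tailed test can be run with MATLAB's built-in ttest function. This is a sketch that assumes the Statistics and Machine Learning Toolbox is installed; ttest is not used in the slides, and the variable names are illustrative.

```matlab
pH1994 = [8.03, 8.08, 7.99, 8.00, 7.93, 7.98];
pH2004 = [7.99, 8.02, 7.92, 7.94, 8.01, 7.93];

% Paired test of H1: mean(pH1994 - pH2004) > 0, at alpha = 0.1
[h, p, ~, stats] = ttest(pH1994, pH2004, 'Alpha', 0.1, 'Tail', 'right');

fprintf('tstat = %.4f, df = %d, p = %.4f, reject H0: %d\n', ...
        stats.tstat, stats.df, p, h);
```

h = 0 means that H0 is not rejected at the chosen level of significance.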
χ2-test “Goodness of fit”
We want to compare an observed frequency distribution to a theoretical distribution
Ex: we want to show that the yearly averaged rainfall in Irvine follows a normal distribution
Ex: we want to make sure that a die is not loaded
χ2-test “Goodness of fit”
We decompose the number of observations (n) over k intervals (or bins, or classes)
k must satisfy n/k ≥ 5
k ≥ 10
So n ≥ 50
The Expected number of counts in any cell is Ei
The Observed number of counts is Oi
χ2 statistic
$\chi^2_{stat} = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i}$
χ2stat measures the mismatch between the Expected and the Observed distributions
χ2stat = 0: perfect fit
χ2stat large: poor fit
Our statistic χ2stat follows a χ2 -distribution !
Formulate a null and alternative hypothesis:
H0: The data are consistent with a specified distribution
H1: The data are not consistent with a specified distribution
Choose a Significance level: α = 0.05 (5%)
Use MATLAB’s chi2inv function to find χ2crit
Analyze Sample data
Degrees of freedom = k-1
Calculate the expected frequency counts Ei
Calculate the test statistic
Interpret the results
Conducting a χ2 test
Example
Is our die loaded? Compare to a uniform distribution
Value   Observed freq.   Expected freq.   (O-E)^2/E
1       16               10               3.6
2       5                10               2.5
3       9                10               0.1
4       7                10               0.9
5       6                10               1.6
6       17               10               4.9
Total   60               60               13.6
For alpha = 0.02:
chi2crit = chi2inv(1-0.02,6-1) = 13.3882
χ2stat = 13.6 > χ2crit: the die is loaded (98% confidence)
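The die example can be reproduced in a few lines. This sketch takes the observed counts from the table and assumes a fair die gives uniform expected counts; the variable names are illustrative.

```matlab
O = [16 5 9 7 6 17];                 % observed counts for faces 1 to 6
k = numel(O);                        % number of classes
E = sum(O)/k * ones(1, k);           % fair die: uniform expected counts

chi2stat = sum((O - E).^2 ./ E);     % mismatch between observed and expected
alpha    = 0.02;
chi2crit = chi2inv(1 - alpha, k - 1);

if chi2stat > chi2crit
    fprintf('chi2stat = %.1f > %.4f: reject H0\n', chi2stat, chi2crit);
else
    fprintf('chi2stat = %.1f <= %.4f: H0 is not rejected\n', chi2stat, chi2crit);
end
```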
Lab 6: Hypothesis testing
Lecture 7: Curve Fitting and interpolation
Midterm Part 1 in class next Tuesday
Midterm Part 2 in lab Wednesday next week
What’s next?
Formulas
Sample mean: $\bar{x}$
Histogram rule: number of bins $\approx \sqrt{\text{sample size}}$
PDF limit: $N \to \infty$
Probability between $x_1$ and $x_2$: $P(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\,dx$
Central Limit Theorem: $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
Water example: $\mu = 2$ L, $\sigma = 0.7$ L, $\mu_{\bar{X}} = \mu = 2$ L, $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{N}} = \dfrac{0.7}{\sqrt{50}}$, and we want $P(\bar{x} > 2.2\ \text{L})$
Confidence interval: find $\Delta\bar{x}$ such that $P(\mu \text{ within } \Delta\bar{x} \text{ of } \bar{x}) = P(\bar{x} \text{ within } \Delta\bar{x} \text{ of } \mu) = P(\mu - \Delta\bar{x} < \bar{x} < \mu + \Delta\bar{x}) = 1 - \alpha$
Normalized bounds: $\pm\dfrac{\Delta\bar{x}}{\sigma/\sqrt{n}}$ when σ is known, $\pm\dfrac{\Delta\bar{x}}{s/\sqrt{n}}$ when σ is estimated by s
Apples example: $P\left(\dfrac{-12}{\sigma_{\bar{X}}} < \dfrac{\bar{x} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} < \dfrac{+12}{\sigma_{\bar{X}}}\right)$
Normalized sample mean: $\bar{Z} = \dfrac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} \simeq \dfrac{\bar{X} - \mu_{\bar{X}}}{s/\sqrt{n}}$
Blood pressure example: $\bar{x} = 2.3429$, $s = 1.0422$, $\dfrac{\Delta\bar{x}}{s/\sqrt{n}} = 2.4469$, so $\Delta\bar{x} = 2.4469\,\dfrac{s}{\sqrt{n}} = 0.9639$
Volcano example: $\mu = 7$, $\bar{x} = 7.2$, $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} \simeq \dfrac{s}{\sqrt{n}}$
Unpaired t-test: $t_{stat} = \dfrac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
Paired t-test: with $(\bar{x}_d, s_d)$ the mean and standard deviation of the differences, $t_{stat} = \dfrac{\bar{x}_d}{s_d/\sqrt{n}}$
Vector of paired differences: $d = [a_1 - b_1,\ a_2 - b_2,\ a_3 - b_3,\ \ldots,\ a_{n-1} - b_{n-1},\ a_n - b_n]^T$ for samples $[a_1, \ldots, a_n]^T$ and $[b_1, \ldots, b_n]^T$
[Figure: t-distribution density for values from -4 to 4 (Density axis from 0 to 0.4), with the Critical Value marked]
χ2 statistic: $\chi^2_{stat} = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i}$