ESS116
Introduction to Data Analysis in Earth Science
Image Credit: NASA
Instructor: Mathieu Morlighem
E-mail: mmorligh@uci.edu (include ESS116 in subject line)
Office Hours: 3218 Croul Hall, Friday 2:00 pm – 3:00 pm
This content is protected and may not be shared, uploaded, or distributed
Overall: things seem to be going (mostly) well
Lots of good advice for next year…
I will keep:
quick reviews, quizzes, videos broken into chunks, and pdf/ppt posted on Canvas
working with real datasets
pre-recorded lecture, self-paced lab (a couple of students did not like that though)
What I will change:
Go slower
Make datasets used in lecture available
Open new modules on Saturday instead of Monday (no change in deadlines)
More hands-on examples during the lecture
Go over a “typical final exam” in week 10 (“live” review session)
Make lab instructions easier to follow
What I cannot change:
Less statistics
Zoom is a bit awkward; it’s awkward for me too!
Midterm evaluation results
Lecture 6 quick review
Central Limit Theorem
Confidence Interval in the mean
Student’s t test “Comparing means”
Lecture 7 – Curve fitting and Interpolation
Polynomials
Polynomial fit
Interpolation and Extrapolation (linear, spline, polynomials)
Midterm part 1…
Today’s lecture
Dice question
histogram(round(6*rand(1000)))
histogram(round(0.5+6*rand(1000)))
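The difference between the two expressions above can be checked by counting how often each integer comes up. This is a minimal sketch: the seed, the number of rolls, and the variable names are arbitrary choices made for illustration.

```matlab
% Compare two ways of simulating a six-sided die with rand
rng(1);                          % fixed seed (arbitrary) so the run is repeatable
r = rand(1, 60000);              % uniform random numbers in (0,1)

rollsA = round(6*r);             % gives 0..6; 0 and 6 each cover only half a bin
rollsB = round(0.5 + 6*r);       % gives 1..6, each covering a full unit-wide bin

% Count occurrences of each integer from 0 to 6
edges = -0.5:1:6.5;
countsA = histcounts(rollsA, edges);
countsB = histcounts(rollsB, edges);

% Rows: value, counts for round(6*r), counts for round(0.5+6*r)
disp([0:6; countsA; countsB]);
```

With the first expression, 0 and 6 should appear about half as often as 1 to 5, so it does not behave like a fair die; the second expression should give roughly equal counts for 1 to 6.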
Central Limit Theorem
Population distribution
Sample size
Sampling distribution of the sample mean
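A quick simulation can illustrate the sampling distribution of the sample mean. The sketch below assumes a uniform population on (0,1), for which μ = 0.5 and σ = sqrt(1/12); the sample size, number of samples, and seed are arbitrary.

```matlab
% Draw many samples from a non-normal population and look at their means
rng(2);                          % fixed seed (arbitrary)
n = 50;                          % sample size
nSamples = 5000;                 % number of repeated samples
mu = 0.5;                        % mean of a uniform(0,1) population
sigma = sqrt(1/12);              % standard deviation of a uniform(0,1) population

samples = rand(n, nSamples);     % each column is one sample of size n
xbar = mean(samples, 1);         % one sample mean per column

fprintf('mean of sample means = %.4f (population mean = %.4f)\n', mean(xbar), mu);
fprintf('std of sample means  = %.4f (sigma/sqrt(n)   = %.4f)\n', std(xbar), sigma/sqrt(n));

histogram(xbar);                 % should look close to a bell curve
```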
We generally don’t know the population parameters!
If the sample size n>30
Normal distribution with σ ≈ s
If the sample size n<30
Student’s t distribution with Φ = n - 1
Use data
Normalization with σ ≈ s:
Central Limit Theorem
Confidence Intervals: provide statistical limits for your mean values based on a degree of statistical confidence (e.g. ±3 °C)
How to calculate this interval?
Set the level of significance α (α = 0.05 for a 95% CI)
Use a “Normalized” sample distribution of the sample mean
Confidence interval in the mean
Follows a z-distribution (or t if n<30)
[Figure: normalized distribution centered on 0; the central area between -Δz and Δz is 1 - α (e.g., 95%), with α/2 in each tail]
Δz = tinv(1-alpha/2,n-1) if n<30
Δz = norminv(1-alpha/2) if n>30
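The confidence interval recipe above can be sketched as follows. The temperature values are made up for illustration, and tinv and norminv require the Statistics and Machine Learning Toolbox.

```matlab
% 95% confidence interval in the mean of a small sample
x = [14.2 15.1 13.8 16.0 15.4 14.9 15.7 14.4 15.0 16.3];  % hypothetical temperatures (deg C)
n = numel(x);
alpha = 0.05;                    % level of significance (95% CI)

xbar = mean(x);                  % sample mean
s = std(x);                      % sample standard deviation (sigma is approximated by s)

% Half-width of the interval in normalized units
if n < 30
    dz = tinv(1 - alpha/2, n - 1);   % Student's t with n-1 degrees of freedom
else
    dz = norminv(1 - alpha/2);       % standard normal
end

dx = dz * s / sqrt(n);           % half-width in data units
ci = [xbar - dx, xbar + dx];
fprintf('mean = %.2f, %d%% CI = [%.2f, %.2f]\n', xbar, round(100*(1-alpha)), ci(1), ci(2));
```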
For a given level of significance α, the width of the confidence interval
increases as n increases
decreases as n increases
Don’t know
i>Clicker question
We have two samples of size n, and we would like to know if the difference between the two means is only due to chance
Null Hypothesis H0: μ1 = μ2
Alternative Hypothesis H1: μ1 ≠ μ2
Choose a level of significance α (the probability of rejecting H0 even though it is true; typically α = 0.01 or 0.05)
We create a vector of paired differences d and compute its mean and standard deviation
Assuming that H0 is true, the statistic tstat follows a Student’s t distribution with Φ = n – 1
Compute tcrit using α and tinv(1-alpha/2,n-1)*
Compare tstat and tcrit and conclude (Can we reject H0?)
*this expression is for a 2-tailed t-test…
Student’s t-test: paired test
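The paired test steps above can be sketched in a few lines. The two measurement vectors are hypothetical, and tinv requires the Statistics and Machine Learning Toolbox.

```matlab
% Paired, 2-tailed Student's t-test on hypothetical before/after measurements
before = [10.1  9.8 11.2 10.5  9.9 10.7 10.3 11.0];
after  = [10.6 10.1 11.9 10.9 10.2 11.4 10.5 11.6];

d = after - before;              % vector of paired differences
n = numel(d);

% Test statistic: mean of the differences over its standard error
tstat = mean(d) / (std(d)/sqrt(n));

alpha = 0.05;                    % level of significance
tcrit = tinv(1 - alpha/2, n - 1);% critical value, n-1 degrees of freedom

rejectH0 = abs(tstat) > tcrit;   % true: the difference is unlikely to be due to chance
fprintf('tstat = %.2f, tcrit = %.2f, reject H0: %d\n', tstat, tcrit, rejectH0);
```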
We have two samples of size n1 and n2, and we would like to know if the difference between the two means is only due to chance
Null Hypothesis H0: μ1 = μ2
Alternative Hypothesis H1: μ1 ≠ μ2
Choose a level of significance α (the probability of rejecting H0 even though it is true; typically α = 0.01 or 0.05)
Assuming that H0 is true, the statistic tstat follows a Student’s t distribution with Φ = n1 + n2 – 2
Compute tcrit using α and tinv(1-alpha/2,n1+n2-2)*
Compare tstat and tcrit and conclude (Can we reject H0?)
*this expression is for a 2-tailed t-test…
Student’s t-test: unpaired test
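The unpaired test can be sketched the same way, using the difference of the two sample means divided by sqrt(s1^2/n1 + s2^2/n2) and Φ = n1 + n2 – 2 as on this slide. The two samples are hypothetical, and tinv requires the Statistics and Machine Learning Toolbox.

```matlab
% Unpaired, 2-tailed Student's t-test on two hypothetical samples
siteA = [12.1 11.8 12.6 12.3 11.9 12.4 12.0];
siteB = [12.2 11.7 12.5 12.4 12.0 12.1];

n1 = numel(siteA);  n2 = numel(siteB);
m1 = mean(siteA);   m2 = mean(siteB);
s1 = std(siteA);    s2 = std(siteB);

% Test statistic: difference of the means over its standard error
tstat = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2);

alpha = 0.05;                            % level of significance
tcrit = tinv(1 - alpha/2, n1 + n2 - 2);  % critical value

rejectH0 = abs(tstat) > tcrit;
fprintf('tstat = %.3f, tcrit = %.3f, reject H0: %d\n', tstat, tcrit, rejectH0);
```

Here the two means are nearly identical, so the test should not reject H0.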
Lecture 7 – Curve Fitting and Interpolation
Fitting curves to data is very common in Earth sciences
Has applications in virtually all subdisciplines
Two things to keep in mind:
Data is noisy
Data is discrete (non-continuous)
Curve fitting can help overcome these issues!
MATLAB provides several built-in functions to fit curves
Many require the “Curve Fitting Toolbox”, or other toolboxes.
We will only use the basic curve fitting functions that are part of standard MATLAB
We will focus on: polyfit and polyval
Curve Fitting in Earth Sciences
Extrapolations of Future Global Warming, IPCC (2007)
Polynomials
Polynomials come in different orders or degrees
0th order: a single constant value
Examples: y = 4, y = 2.75, y = -12.1
1st order: a linear equation (independent var is to 1st power)
Examples: y = 4x, y = 3.2x + 7, y = -8.2x – 21.3
2nd order: a quadratic equation
Examples: y = 5x^2, y = 2.9x^2 + 7, y = -1.8x^2 – 7.4x + 1.4
3rd order: a cubic equation
Examples: y = 7x^3, y = 4.6x^3 + 2, y = 2.4x^3 + 3.5x^2 + 3.2x + 7.3
nth order: a polynomial where “n” is the largest exponent
Can be represented as a row vector in MATLAB
Interpreted by polyval as coefficients of a polynomial
Math Refresher: Polynomials
Let’s make data for:
Using Polynomials: polyval
Using polyval requires a lot less typing (Saves time!)
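As a small illustration, the sketch below evaluates the 4th order polynomial y = 2.7x^4 + 4x^3 - x^2 + 1.8x - 12 both by typing it out and with polyval. The x range is an arbitrary choice.

```matlab
% Evaluate a polynomial two ways: written out term by term, and with polyval
x = -2:0.5:2;                    % arbitrary evaluation points

% Long form: every term typed by hand (note the element-wise .^)
yLong = 2.7*x.^4 + 4*x.^3 - x.^2 + 1.8*x - 12;

% Short form: coefficients as a row vector, highest power first
p = [2.7 4 -1 1.8 -12];
yShort = polyval(p, x);

% Both approaches should agree to within round-off
maxDiff = max(abs(yLong - yShort));
fprintf('largest difference between the two = %g\n', maxDiff);
```

The degree of the polynomial is set by the length of the coefficient vector: 5 coefficients give a 4th order polynomial, provided the leading coefficient is not zero.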
What is the degree of the polynomial represented by the following vector:
p = 0:3:6
1st order (linear)
2nd order (quadratic)
3rd order (cubic)
6th order
don’t know
i>Clicker question
Fitting Polynomials to data
%Load data points
data = load('datafile.txt');
%Get coeffs of the best linear fit
p = polyfit(data(:,1),data(:,2),1);
%Get the y's from the linear fit
yFit = polyval(p,data(:,1));
%Plot data points as red circles
plot(data(:,1),data(:,2),'ko','MarkerFaceColor','r');
%Plot best linear fit
hold on
plot(data(:,1),yFit,'-k','LineWidth',2);
title(['Equation of best fit y = ' num2str(p(1)) 'x + ' num2str(p(2))])
Use polyfit to perform a least squares fit of a 1st order polynomial (i.e. a linear fit)
Fitting Data With polyfit
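To try polyfit without a data file, one option is to fit synthetic data whose underlying line is known. The sketch below assumes a true line y = 3.2x + 7 with Gaussian noise of standard deviation 0.5; the noise level and seed are arbitrary.

```matlab
% Linear least-squares fit to synthetic noisy data with a known answer
rng(3);                                    % fixed seed (arbitrary)
x = 0:10;
y = 3.2*x + 7 + 0.5*randn(size(x));        % true line plus random noise

p = polyfit(x, y, 1);                      % p(1) = slope, p(2) = intercept
yFit = polyval(p, x);                      % fitted values at the data locations

fprintf('fitted slope = %.2f (true 3.2), intercept = %.2f (true 7)\n', p(1), p(2));

plot(x, y, 'ko', 'MarkerFaceColor', 'r');  % data points
hold on
plot(x, yFit, '-k', 'LineWidth', 2);       % best linear fit
hold off
```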
%Decide the degree of the polynomial
order = 1;
%Generate data points (sqrt: non linear)
x = 0:0.15:5;
y = 5*sqrt(x)-1.5*x;
%Fit using the order provided
p = polyfit(x,y,order);
yFit = polyval(p,x);
%Plot data points as green triangles
plot(x,y,'k^','MarkerFaceColor',[0 0.7 0.1]);
%Plot best polynomial fit
hold on
plot(x,yFit,'-k','LineWidth',2);
title(['Fit with polynomials of degree ' num2str(order)]);
Fitting Non-Linear Data
Increasing the order of the polynomial allows for more complex curves to be fit
Be careful to not over-fit your data!
You should not go higher than 4 or 5
MATLAB will give a warning if the result is poorly conditioned
Fitting Non-Linear Data
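One way to see the effect of the polynomial degree is to compare the misfit for several degrees on the same data. This sketch reuses the sqrt-based curve from the example above; the RMS misfit is just one possible measure.

```matlab
% RMS misfit of polynomial fits of degree 1 to 5 on the same data
x = 0:0.15:5;
y = 5*sqrt(x) - 1.5*x;           % non-linear data

orders = 1:5;
rmsErr = zeros(size(orders));
for k = 1:numel(orders)
    p = polyfit(x, y, orders(k));                     % least-squares coefficients
    rmsErr(k) = sqrt(mean((y - polyval(p, x)).^2));   % misfit at the data points
    fprintf('degree %d: RMS misfit = %.4f\n', orders(k), rmsErr(k));
end
```

The misfit at the data points cannot increase as the degree goes up, which is exactly why a small misfit alone does not show that a high-degree fit is a good one.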
Interpolation and Extrapolation
What if I want to get an estimate in-between data points?
Data are discrete: you will rarely get a measurement at the exact location/time that you may be interested in
Interpolation: The process of estimating values in between data points
What if data is limited in range?
How could you estimate data beyond your data range?
Extrapolate values
Very prone to errors. Should always be done with extreme caution
Extrapolation: The process of estimating values beyond the bounds of your data
Interpolating and Extrapolating
[Figure: temperature (°C) versus time (year); interpolation inside the data range, extrapolation on either side]
%Load data
data = load('unevenData.dat');
%Get coeffs of the best polynomial fit (degree 7)
p = polyfit(data(:,1),data(:,2),7);
%Create a new set of X variables that are evenly spaced
xI = linspace(0,10,40);
%Get interpolated values from the fitted polynomial
yI = polyval(p,xI);
%Plot data points as red squares
clf
plot(data(:,1),data(:,2),'rs','MarkerFaceColor','r','MarkerEdgeColor','k','MarkerSize',7)
%Plot best fit
hold on
plot(xI,yI,'co-','MarkerFaceColor','c','MarkerEdgeColor','k','MarkerSize',4);
axis([0 10 0 30])
Interpolating: Using a Best Fit Curve
You can use a best fit curve to interpolate
Make sure the curve fits data well
Best fit curves tend to smooth data
Will not honor your collected data points!
Be careful about extrapolating!
%Load data
data = load('unevenData.dat');
x = data(:,1);
y = data(:,2);
%Create a new set of X variables that are evenly spaced
xI = linspace(min(x),max(x),40);
%Get interpolation using interp1 (linear)
yI = interp1(x,y,xI,'linear');
%Plot data points as red squares
clf
plot(data(:,1),data(:,2),'rs','MarkerFaceColor','r','MarkerEdgeColor','k','MarkerSize',7)
%Plot interpolated values
hold on
plot(xI,yI,'co-','MarkerFaceColor','c','MarkerEdgeColor','k','MarkerSize',4);
axis([0 10 0 30])
interp1: interpolates 1D data
See also interp2 and interp3 for 2D/3D
interp1 has several options
Read the documentation
We will only use linear or spline methods
Linear Interpolation
%Load data
data = load('unevenData.dat');
x = data(:,1);
y = data(:,2);
%Create a new set of X variables that are evenly spaced
xI = linspace(min(x),max(x),40);
%Get interpolation using interp1 (spline)
yI = interp1(x,y,xI,'spline');
%Plot data points as red squares
clf
plot(data(:,1),data(:,2),'rs','MarkerFaceColor','r','MarkerEdgeColor','k','MarkerSize',7)
%Plot interpolated values
hold on
plot(xI,yI,'co-','MarkerFaceColor','c','MarkerEdgeColor','k','MarkerSize',4);
axis([0 10 0 30])
Linear Interpolation
Resultant data is boxy
Min/Max will not exceed the original data
Interpolation using splines
Resultant data has smooth curves
Min/Max may exceed the original data
Both methods honor y-vals at original data
Interpolation With Splines
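The properties listed above can be checked numerically on a small data set. The x and y values below are invented for illustration; only interp1 from standard MATLAB is used.

```matlab
% Compare linear and spline interpolation on hypothetical, unevenly spaced data
x = [0 1 2 4 5 7 8 10];
y = [2 9 14 12 20 25 18 22];

xI = linspace(min(x), max(x), 200);      % evenly spaced query points
yLin = interp1(x, y, xI, 'linear');
ySpl = interp1(x, y, xI, 'spline');

% Both methods should reproduce the y values at the original x locations
errLin = max(abs(interp1(x, y, x, 'linear') - y));
errSpl = max(abs(interp1(x, y, x, 'spline') - y));
fprintf('max misfit at data points: linear %g, spline %g\n', errLin, errSpl);

% Linear interpolation stays within the data range; splines may overshoot it
fprintf('data range:   [%.2f, %.2f]\n', min(y), max(y));
fprintf('linear range: [%.2f, %.2f]\n', min(yLin), max(yLin));
fprintf('spline range: [%.2f, %.2f]\n', min(ySpl), max(ySpl));
```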
What gives you the “best” interpolation?
Polynomial fit
Linear interpolation
Spline interpolation
It depends
i>Clicker question
Unevenly sampled equation to make synthetic data
In this case, splines work best, but the polynomial fit is not bad
Interpolation: Linear vs Spline vs Polyfit
Unevenly sampled equation with some random noise (± 2) added
In this case, linear interp is not bad, but the linear fit is best
interp1 can be used to extrapolate beyond input data limits: use the 'extrap' option
interp1(x,y,xI,'linear','extrap')
Use with great caution!
Extrapolation is highly prone to errors
Extrapolation should only be a last resort
Which method worked best?
None! Extrapolation is a bad idea
If you have to do it, only go very slightly beyond your data limits
Extrapolation
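A short experiment shows how quickly extrapolated values can drift apart. This sketch samples the sqrt-based curve used earlier in the lecture on x = 0 to 5 and extrapolates to x = 7; the sampling, the target point, and the cubic degree are arbitrary choices.

```matlab
% Extrapolate the same data with three methods and compare with the known curve
f = @(x) 5*sqrt(x) - 1.5*x;      % known curve, so the true value can be checked
x = 0:5;
y = f(x);
xOut = 7;                        % query point beyond the data range

yNoExtrap = interp1(x, y, xOut, 'linear');            % NaN: outside the data limits
yLinExt   = interp1(x, y, xOut, 'linear', 'extrap');  % continues the last segment
ySplExt   = interp1(x, y, xOut, 'spline', 'extrap');  % continues the last spline piece
yPolyExt  = polyval(polyfit(x, y, 3), xOut);          % cubic best fit evaluated outside

fprintf('true value         : %.3f\n', f(xOut));
fprintf('linear, no extrap  : %.3f\n', yNoExtrap);
fprintf('linear with extrap : %.3f\n', yLinExt);
fprintf('spline with extrap : %.3f\n', ySplExt);
fprintf('cubic polyfit      : %.3f\n', yPolyExt);
```

The three extrapolated values generally disagree with each other and with the true value, and the disagreement tends to grow the further xOut lies from the data.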
Exams…
Lab 6 due next week
Lecture 8: Image Processing
What’s next?
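Central Limit Theorem: $\mu_{\bar{X}} = \mu$, $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
Normalization with σ ≈ s: $\dfrac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} \approx \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$
Confidence interval in the mean: $P(\mu \text{ within } \Delta\bar{x} \text{ of } \bar{x}) = P(\bar{x} \text{ within } \Delta\bar{x} \text{ of } \mu) = P(-\Delta\bar{x} < \bar{x} - \mu < \Delta\bar{x}) = P\left(-\dfrac{\Delta\bar{x}}{\sigma_{\bar{X}}} < \dfrac{\bar{x} - \mu}{\sigma_{\bar{X}}} < \dfrac{\Delta\bar{x}}{\sigma_{\bar{X}}}\right) = 1 - \alpha$
Normalized variable: $\bar{Z} = \dfrac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}$
Half-width of the confidence interval: $\Delta\bar{x} = \Delta z \times s/\sqrt{n}$
Student’s t-test, paired, with $(\bar{x}_d, s_d)$ computed from the paired differences d: $t_{stat} = \dfrac{\bar{x}_d}{s_d/\sqrt{n}}$
Student’s t-test, unpaired: $t_{stat} = \dfrac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
Polynomial as a row vector: $[3 \;\; 2.7 \;\; 1 \;\; -5.7]$ represents $3x^3 + 2.7x^2 + x - 5.7$
Polynomial used for polyval: $y = 2.7x^4 + 4x^3 - x^2 + 1.8x - 12$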