ESS 116
Introduction to Data Analysis in Earth Science
Image Credit: NASA
Instructor: Mathieu Morlighem
E-mail: mmorligh@uci.edu (include ESS116 in subject line)
Office Hours: 3218 Croul Hall, Friday 2:00 pm – 3:00 pm
This content is protected and may not be shared, uploaded, or distributed
Exceptionally:
No office hours this Friday
Monday May 6 3:00-4:15pm
Announcement
Lecture 4 quick review
Lecture 5 – Descriptive statistics
Sampling data
Descriptive Statistics
Histograms
Theoretical distribution
Distributions and probabilities
Sampling distribution (time permitting…)
Today’s lecture
Relational/Boolean expression
Examples:
var1=10; var2=5;
(var1>=5 && var2>7 ) → false
(var1<5 || var2>3 ) → true
(var1==10 && var2~=5) → false
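The three expressions above can be checked directly at the MATLAB prompt. This is a minimal sketch; the names r1, r2 and r3 are introduced here only for illustration.

```matlab
var1 = 10; var2 = 5;
r1 = (var1 >= 5 && var2 > 7);     % true AND false
r2 = (var1 < 5 || var2 > 3);      % false OR true
r3 = (var1 == 10 && var2 ~= 5);   % true AND false
% Logical values are printed as 1 (true) or 0 (false)
fprintf('r1 = %d, r2 = %d, r3 = %d\n', r1, r2, r3);
```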
The If-Elseif-Else Statement
if condition1
action(s)
elseif condition2
action(s)
else
action(s)
end
Only executed if condition1 is true
Only executed if condition1 is FALSE and condition2 is TRUE
Only executed if condition1 and condition2 are BOTH FALSE
if condition
action(s)
else
action(s)
end
Actions only executed if condition is TRUE
Actions only executed if condition is FALSE
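As an illustration of the if-elseif-else structure, the sketch below classifies a temperature. The value of T, the thresholds and the variable name label are made up for this example.

```matlab
T = 25;   % hypothetical air temperature in degrees C
if T > 30
    label = 'hot';     % only executed if T > 30
elseif T > 15
    label = 'mild';    % only executed if T <= 30 and T > 15
else
    label = 'cold';    % only executed if both conditions are false
end
fprintf('T = %d C is %s\n', T, label);
```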
Used with a pre-defined number of iterations of the loop variable
The For Loop
for loopVar = range
action(s)
end
The “for” keyword
The loop variable, which can be any valid variable name
Traditionally, we use i, j, k, or other single letters
The range or list of values the loopVar will take on
Can be any vector, but typically the colon operator is used
The action(s) that will be performed during each iteration
The “end” keyword. Signifies that the code should go back to the beginning of the loop statement if more iterations remain
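A short sketch of a for loop that adds up the integers from 1 to 5. The variable names total and k are chosen for this example only.

```matlab
total = 0;
for k = 1:5               % k takes the values 1, 2, 3, 4, 5 in turn
    total = total + k;    % action performed at each iteration
end
fprintf('Sum of 1 to 5: %d\n', total);
```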
What is the value of A at the end of the program?
A = 2
A = 4
A = 6
A = 8
I don’t know
i>Clicker question
A = 1;
for i=1:3,
A=A*2;
end
Lecture 5 – Descriptive Statistics
Population and Sample
Data variables in Earth science are usually continuous
But samples of data are inherently discrete
Consequence: most datasets in Earth Sciences have a limited sample size and contain a number of uncertainties
e.g., air temperature, precipitation, topography, …
Sampling
World weather stations
Argo array (monitor ocean chemistry)
Population: the actual properties of the real world
Sample: set of values imperfectly representing the population
WARNING: Unfortunately, the term “sample” is employed with different meanings in geology and statistics
Population vs Sample
Geology       Statistics
Collection    Sample
Sample        Observation
Sample size: number of measurements in a sample
We want to predict the characteristics of a larger population from a much smaller sample
A sample is an acceptable representation of the population only if it is very large
Parameters: refer to the population (e.g., μ and σ)
Statistics: refer to the sample (e.g., x̄ and s)
Population vs Sample
Which is the most accurate representation of π ?
3.2
3.2415926536
i>Clicker question
Which is the most precise representation of π ?
3.2
3.2415926536
i>Clicker question
Accuracy: quality of being close to the true value
Ex: π = 3.1415926536
3.2 is a more accurate representation of π than 3.24159265
Precision:
Mathematics: number of significant digits in a numerical value (measurements or calculation)
Experiments: related to instrumental resolution (ability to replicate measurements)
Ex: π = 3.1415926536
3.2 is a less precise representation of π than 3.24159265
Accuracy vs Precision
While precision is obvious to assess, accuracy is not. To a large extent, what is behind statistics is an effort to evaluate accuracy.
Accuracy vs Precision
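The accuracy of the two representations of π used above can be compared by computing their absolute errors. A minimal sketch (the names a, b, errA and errB are introduced here for illustration):

```matlab
a = 3.2;             % few digits (low precision)
b = 3.2415926536;    % many digits (high precision)
errA = abs(a - pi);  % distance to the true value
errB = abs(b - pi);
fprintf('Error of %.1f: %.4f\n', a, errA);
fprintf('Error of %.10f: %.4f\n', b, errB);
```

The value with more digits is more precise, but the smaller error decides which one is more accurate.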
Sample Visualization
We divide the interval of variation of the data into bins (generally of the same length)
All observations are assigned to their corresponding bin
Result: count or relative frequency of the classes
Frequency Table
Example: Maximum temperature in SNA Aug 2014 to Aug 2015 (from NOAA: http://www.ncdc.noaa.gov)
Frequency Table
[Figure: temperature axis with bin edges at 14, 16, 18, 20 and 22 °C and bin centers at 15, 17, 19 and 21 °C. For each bin: how many values are in this bin?]
Frequency Table
Bin   Class      Count   Frequency (%)
1     12-14 °C   0       0
2     14-16 °C   1       0.27
3     16-18 °C   10      2.74
4     18-20 °C   38      10.41
5     20-22 °C   48      13.15
6     22-24 °C   69      18.90
7     24-26 °C   46      12.60
8     26-28 °C   67      18.36
9     28-30 °C   43      11.78
10    30-32 °C   14      3.84
11    32-34 °C   21      5.75
12    34-36 °C   4       1.10
13    36-38 °C   3       0.82
14    38-40 °C   1       0.27
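A frequency table like the one above can be built with histcounts. The sketch below uses a small made-up vector (tmaxDemo), not the NOAA record, so the counts differ from the table.

```matlab
tmaxDemo = [13.5 15.2 17.8 18.1 19.9 21.0 21.4 23.7];  % made-up values in degrees C
edges  = 12:2:24;                         % bin edges (2-degree-wide bins)
counts = histcounts(tmaxDemo, edges);     % number of values in each bin
freq   = 100 * counts / numel(tmaxDemo);  % relative frequency in percent
for b = 1:numel(counts)
    fprintf('%2d-%2d C: %d values (%.2f %%)\n', ...
        edges(b), edges(b+1), counts(b), freq(b));
end
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.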
A histogram is a graphical representation of a frequency table
Histograms
Rules for choosing histogram bins
Number of bins large enough to bring out the shape of the data
But not so large that empty bins are left across the histogram
Rule of thumb:
number of bins ≈ √(number of data values)
All data values must be covered
Bin widths should be equal if possible.
Histograms
Effect of bin spacing
Histograms in MATLAB
>> histogram(tmax);
If only one argument is given to “histogram”…
MATLAB automatically chooses the number of bins and their properties
To better control the bins, you can optionally provide a second argument to “histogram”. Either:
A number, to specify a different number of bins
>> histogram(tmax,20);
>> histogram(tmax,round(sqrt(length(tmax))));
Or a vector of length N+1 specifying the desired bin edges (where N is the number of bins)
>> histogram(tmax,12:2:40);
Display each observation xi versus the proportion of the sample that is not larger than xi.
Cumulative Frequency
88% of the temperature measurements are <30˚C
MATLAB: use “ecdf” (Empirical Cumulative Distribution Function)
>> [cf,x] = ecdf(tmax);
>> plot(x,cf*100,'+')
Cumulative Frequency
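A single point of the cumulative frequency curve can also be computed by hand, as the percentage of observations below a threshold. This sketch uses made-up temperatures (tDemo), not the data shown on the slide.

```matlab
tDemo = [18 21 23 24 26 27 29 31 33 35];   % made-up temperatures in degrees C
isBelow = tDemo < 30;                      % logical vector: 1 where T < 30
pctBelow30 = 100 * mean(isBelow);          % proportion of ones, in percent
fprintf('%.0f%% of the values are below 30 C\n', pctBelow30);
```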
Descriptive Statistics
We have collected N measurements xi from a specific object (e.g., ocean pH, glacier thickness, etc)
Size of x might be large: we use descriptive statistics to summarize the characteristics of the data:
Median
Mean
Standard Deviation
etc
Statistical description of the data
Outliers: values so markedly different from the rest of the sample that they raise the suspicion that they may be from a different population or that they may be in error, doubts that are frequently hard to clarify. In any sample, outliers are always few, if any.
Outliers
Outlier
Summary Statistics
The histogram displays graphically the essential properties of univariate data:
Central tendency (or location): “average” position of the whole distribution along the scale (mean, median, mode)
Dispersion: the extent to which the distribution is spread along the scale from the central value (range and standard deviation)
General shape: both the symmetry and the pattern of the frequency distribution (skewness and kurtosis)
Properties of Histograms
Arithmetic Mean (or mean):
Sensitive to outliers (extreme values that may be very different from the majority of the data)
Central Tendency: MEAN
Median: the x-value that is in the middle of the dataset
50% of the observations < median
50% of the observations > median
For a dataset sorted in ascending order:
Central Tendency: MEDIAN
If N is odd
If N is even
After sorting our temperature measurements:
T = [15.2 15.4 16.0 18.0 18.2 19.0 20.8 21.0]
What is the median of this dataset?
16.0
18.0
18.1
18.2
i>Clicker question
Robustness: ability of statistical methods to work well not only under ideal conditions, but in the presence of data problems, mild to moderate departures from assumptions, or both.
Example: in the presence of large errors, the median is a more robust statistic than the mean
Robustness
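The robustness of the median can be seen by corrupting one value in a small dataset. The numbers below are made up for illustration.

```matlab
clean = [15 16 17 18 19];   % made-up measurements
bad   = [15 16 17 18 99];   % same data, last value replaced by a gross error
fprintf('Mean:   %.1f -> %.1f\n', mean(clean), mean(bad));
fprintf('Median: %.1f -> %.1f\n', median(clean), median(bad));
```

The mean is pulled toward the erroneous value, while the median does not move.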
Mode: the most frequent x-value or the center of the class with the largest number of observations
e.g., no mode, unimodal, bimodal, trimodal, multimodal
Central Tendency: MODE
Central Tendency
Range: difference between the highest and lowest value in the dataset
Defined by two extreme data points: very sensitive to outliers
Dispersion: RANGE
Standard deviation: average deviation of each data point from the mean
Variance: square of the standard deviation
Dispersion
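The dispersion measures above can be computed with max, min, std and var. A minimal sketch on a made-up vector x (std and var use the N-1 normalization by default):

```matlab
x = [2 4 4 4 5 5 7 9];     % made-up data, mean is 5
r = max(x) - min(x);       % range
s = std(x);                % standard deviation (normalized by N-1)
v = var(x);                % variance = s^2
fprintf('range = %g, s = %.3f, s^2 = %.3f\n', r, s, v);
```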
Which of the following distributions has the highest standard deviation?
i>Clicker question
[Figure: two distributions labeled A and B; the axis ticks run from 0 to 300 and from 0 to 12]
Skewness: measure of the asymmetry of the tails of a distribution
Shape: Skewness
Which of the following histograms has a negative skew?
i>Clicker question
[Figure: two histograms labeled A and B]
Kurtosis: measure of whether the data are peaked or flat relative to a normal distribution
Shape: Kurtosis
Platykurtic: kurtosis < 3 (flat)
Mesokurtic: kurtosis = 3 (“normal”)
Leptokurtic: kurtosis > 3 (peaked)
Central Tendency:
Mean (average)
Median (50% higher, 50% lower)
Mode(s) (peak value(s))
Dispersion:
Range (max – min)
Standard deviation (average distance to mean)
Shape:
Skewness (positive: tail to the right, negative: tail to the left)
Kurtosis (<3: flat, >3: peaked)
Know how they relate to visual features on a histogram
Statistical parameters: what you need to remember
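The sign convention for skewness can be checked on a small made-up dataset with a long tail to the right. This sketch assumes the skewness function from the Statistics and Machine Learning Toolbox is available.

```matlab
x = [1 1 2 2 3 10];        % made-up data: one large value creates a tail to the right
skRight = skewness(x);     % expected to be positive (tail to the right)
skLeft  = skewness(-x);    % mirrored data: tail to the left, negative skewness
fprintf('skewness(x) = %.2f, skewness(-x) = %.2f\n', skRight, skLeft);
```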
Probability Density Function
Probability is a measure of the likelihood that an event, A, may occur. It is commonly denoted by P(A)
0 ≤ P(A) ≤ 1
P(A) = 0, A is an impossible event
P(A) = 1, A will happen with certainty
P(A) is a degree of belief in A, even if no random process is involved nor a count is possible
Probability
Histograms: empirical frequency distribution of our sample.
If we sample the variable sufficiently often and the output ranges are narrow, we obtain a very smooth version of the histogram
A histogram for N → ∞ and an infinitely small bin size will produce a Probability Density Function (PDF)
The probability that x is between x1 and x2 is:
Probability Density Functions
Theoretical Distributions
Normal distribution
A.k.a. Gaussian distribution
2 parameters:
Mean, μ
Standard deviation, σ
99.7% of its values lie in μ ±3σ
Theoretical Distributions: NORMAL
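The 99.7% rule can be verified by integrating the normal density numerically. In this sketch μ = 3 is taken from the example slide, while σ = 1 and the names f and p3 are assumptions made for illustration.

```matlab
mu = 3; sigma = 1;   % sigma = 1 is an assumed value
% Normal probability density function written out explicitly
f = @(x) exp(-0.5*((x - mu)/sigma).^2) / (sigma*sqrt(2*pi));
% Area under the curve between mu - 3*sigma and mu + 3*sigma
p3 = integral(f, mu - 3*sigma, mu + 3*sigma);
fprintf('P(mu-3sigma < x < mu+3sigma) = %.4f\n', p3);
```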
Example: μ = 3
Normal Distributions
For a distribution f, the probability that x ≤ x1 is the area under f from −∞ to x1
Used in hypothesis testing
Theoretical Distributions: t
χ2 distribution (Chi square):
Number of degrees of freedom: Φ
In this course: Φ = N-1
Properties:
Used in hypothesis testing
Theoretical Distributions: χ2
Normal (μ,σ)
Given x0, find p0
>> p0 = normcdf(x0,mu,sigma);
Given p0, find x0
>> x0 = norminv(p0,mu,sigma);
Z-distribution
>> p0 = normcdf(x0);
>> x0 = norminv(p0);
t-distribution
>> p0 = tcdf(x0,V);
>> x0 = tinv(p0,V);
χ2-distribution
>> p0 = chi2cdf(x0,V);
>> x0 = chi2inv(p0,V);
MATLAB theoretical distributions
p0 = P( x < x0)
e.g.: 0.88 = P(x < 1.17)
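The example above can be reproduced with the standard normal (Z) distribution. This is a small sketch, assuming the Statistics and Machine Learning Toolbox functions listed on this slide; xBack is a name introduced here.

```matlab
p0 = normcdf(1.17);      % P(x < 1.17) for the standard normal distribution
xBack = norminv(p0);     % inverse operation: recover x0 from p0
fprintf('P(x < 1.17) = %.2f, norminv(p0) = %.2f\n', p0, xBack);
```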
The average concentration of salt in seawater is 35 ‰ (standard deviation 3 ‰)
What is the maximum salinity you should be able to measure to cover 90% of the possible values?
Example 1
Example 1
Distribution of seawater salinity
μ = 35 ppt
σ = 3 ppt
P(can measure) = P(S < S_max) = 0.90, where S_max is the maximum salinity that can be measured
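Example 1 can be worked out numerically with norminv. This sketch assumes that salinity follows a normal distribution with the mean and standard deviation given above; the name Smax is introduced here for the maximum salinity.

```matlab
mu = 35;      % mean salinity (ppt)
sigma = 3;    % standard deviation (ppt)
% Salinity below which 90% of the possible values fall
Smax = norminv(0.90, mu, sigma);
fprintf('Maximum salinity to cover 90%% of values: %.2f ppt\n', Smax);
```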
What would happen to the standard deviation of the sample mean (right) if we increased the number of rolls for all samples (n = 20 or n = 100)?
The standard deviation would increase
The standard deviation would decrease
The standard deviation would remain unchanged
Don’t know….
Variability of X is measured by the standard deviation
There might be a “gap” between the sample mean x̄ and the population mean μ
Standard Error: variability in the sample mean
Standard Error (SE)
Population standard deviation
Sample size
Decreases as the sample size increases (more precise)
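The standard error is the population standard deviation divided by the square root of the sample size. A minimal sketch with an assumed σ = 3 and the two sample sizes mentioned earlier:

```matlab
sigma = 3;               % assumed population standard deviation
n = [20 100];            % two sample sizes
SE = sigma ./ sqrt(n);   % standard error of the mean for each sample size
fprintf('n = %3d: SE = %.3f\n', [n; SE]);
```

Because fprintf cycles through its arguments column by column, each line shows one sample size with its standard error.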
MATLAB Commands to remember
Lab 5: Descriptive statistics
DUE: one week after the lab starts (canvas)
Bring a USB drive
Lecture 6: Hypothesis testing
What’s next?
Open until next lecture (0.5% Extra Credit!!)
• What should be improved?
• What can I do to help you learn, or is there something that isn’t working?
• Are the quick reviews useful?
• Are the labs useful? Should they be longer?
• Do you like working with real data or would you like simpler formats?
• Do you want more MATLAB, more stats, or is this a good balance?
Midterm Evaluation
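Dataset: $x = (x_1, x_2, x_3, \ldots, x_N)$
Arithmetic mean: $\bar{x} = \frac{x_1 + x_2 + \ldots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$
Median, if N is odd: $\tilde{x} = x_{(N+1)/2}$
Median, if N is even: $\tilde{x} = \frac{x_{(N/2)} + x_{(N/2)+1}}{2}$
Range: $\Delta x = x_{\max} - x_{\min}$
Variance: $s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$
Standard deviation: $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$
Skewness: $\text{skewness} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^3}{s^3}$
Kurtosis: $\text{kurtosis} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{s^4}$
Limit used for the Probability Density Function: $N \to \infty$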
Probability that x is between x1 and x2: $P(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\,dx$
Normal distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$
Cumulative probability: $P(x \le x_1) = \int_{-\infty}^{x_1} f(x)\,dx$
Example: $P(x \le 1.17) = \int_{-\infty}^{1.17} f(z)\,dz = 0.88$
Z-distribution: $f(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$ with $Z = \frac{X-\mu}{\sigma}$
t-distribution: $f(x) = \frac{\Gamma\left(\frac{\Phi+1}{2}\right)}{\Gamma\left(\frac{\Phi}{2}\right)} \frac{1}{\sqrt{\Phi\pi}} \frac{1}{\left(1+\frac{x^2}{\Phi}\right)^{\frac{\Phi+1}{2}}}$ (as $\Phi \to \infty$, the t-distribution approaches the normal distribution)
χ2-distribution: $f(x) = \frac{1}{2^{\Phi/2}\,\Gamma(\Phi/2)}\, x^{\frac{\Phi-2}{2}} e^{-\frac{x}{2}}$
Standard error: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
ESS116: MATLAB Cheat Sheet
1 Path and file operations
cd Change Directory (followed by absolute or relative path of a directory)
cd ../../Shared (relative path)
cd /Users/Shared (absolute path)
pwd display current directory’s absolute path (Print Working Directory)
ls display list of files and directories in the current directory (can be followed by a path and/or file name pattern with *)
ls ../file*mat
ls *.txt
ls /Users/mmorligh/Desktop/
copyfile copy an existing file into a new directory and/or under a new name
copyfile('/Users/Shared/foo.txt','.');
copyfile('foo.txt','bar.txt');
mkdir create a directory
mkdir Lab1
2 Fundamental MATLAB classes
double floating point number (1.52, pi, ...) → MATLAB’s default type
int8 Integer between -128 and 127 (8 bits, saves memory)
uint8 Unsigned integer between 0 and 255 (used primarily for images)
int16 Integer between -32768 and 32767 (16 bits)
logical true/false
string data type for text (str = 'This is a string';)
cell cell array, used by textscan
3 Matrices
Use square brackets [] to create a matrix, and ; to separate rows
A=[1 2 3;4 5 6;7 8 9];
ones, zeros create a matrix full of ones or zeros
A=ones(5,2);
' transpose a matrix
B=A';
length returns the length of a vector (do not use for matrices)
size returns the size of a matrix (number of rows then columns, then 3rd dimension if 3D, etc.)
[nrows,ncols]=size(A);
[nrows,ncols,nlayers]=size(A3D);
linspace and : to create vectors
A=2:3:100;
A=linspace(2,100,10);
find returns the linear indices where a condition on the elements of a matrix is met
pos=find(A==-9999);
pos=find(A>100);
Extract the first 10 even columns of a matrix
B=A(:,2:2:20);
Removing elements: use empty brackets
A(:,2)= [];
Concatenate matrices
A='This is '; B=[A 'an example'];
Replacing elements in a matrix (use either linear or row,col notation)
A(10,3)=5.5;
pos=find(A==-9999);
A(pos)= 0;
Element-by-element operation: use a dot (.) before the operator
A= C.*D;
4 I/O
load loads a MATLAB file (*.mat) into the workspace, or a text file with only numbers
and consistent number of columns
load('data.mat');
data=load('data.txt');
textscan loads a text file into a cell array (as many elements as there are columns in the file)
Use %d for integers, %f for floating point numbers, %s for strings
fid = fopen('filename');
data = textscan(fid,'%d %f %s %s','HeaderLines',5);
fclose(fid);
%Put first column in A, and second column in B
A = data{1}; B = data{2};
5 fprintf
fprintf print text (and variables) to the screen. First argument is a string with placeholders.
fprintf('The radius is %7.2f and A = %d !!\n',EarthRadius,10);
– Special characters: \n (new line) %% (percent sign) '' (apostrophe)
– Variable specifiers: %s (string) %d (integer) %e (exponential) %f (float)
– %010.3f: leading 0, 10 total spaces, 3 decimals. Ex: 000003.142
6 Visualization
plot displays a list of points (x,y)
plot(x,y,'-r');
plot(x,y,'r+:','MarkerFaceColor','g','MarkerSize',5,'LineWidth',2);
axis controls x and y axes
axis([xmin xmax ymin ymax]);
axis equal tight
legend adds a legend to previously plotted curves
legend('First curve','second curve')
figure creates a new figure window
figure(2)
xlabel/ylabel/title control x/y axis labels and plot title
xlabel('Distance (km)');
hold on keep current plot so that whatever follows is plotted on the same plot
subplot divide figure into several subplots
subplot(2,3,1)
histogram make a histogram for a vector
histogram(tmax,20);
histogram(tmax,round(sqrt(length(tmax))));
7 Relational and Logical operators
== equal to, ~= not equal to, > greater than, >= greater than or equal to,
< less than, <= less than or equal to.
&& and, || or, ~ not.
A=( (1>10)|| (3~=4));
8 If/elseif/else and for loops (examples)
Counting algorithm
%Initialize counters
counter1 = 0;
counter2 = 0;
%Go over all of the elements of T and increment counters when a condition is met
for i=1:length(T)
if T(i)>100
counter1 = counter1+1;
elseif T(i)<0
counter2 = counter2+1;
end
end
fprintf('Found %d days with T>%f, and %d with T<%f\n',counter1,100,counter2,0)
Extracting (after counting!)
%You first need to count how many times T>100 (for example), then: allocate memory
hotdays = zeros(counter1,1);
%Go through T, again, and store temperatures>100 in hotdays
count = 1;
for i=1:length(T)
if T(i)>100
hotdays(count)=T(i);
count = count+1;
end
end
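As an alternative to the two loops above, the same extraction can be written with logical indexing, which does not require counting first. This is a sketch on a made-up vector T, shown as a different technique from the loop-based one above.

```matlab
T = [72 101 -3 95 104 88];   % made-up temperatures
isHot = T > 100;             % logical vector: 1 where the condition is met
hotdays = T(isHot);          % keep only the elements where isHot is true
nHot = numel(hotdays);       % number of values found
fprintf('Found %d days with T>100\n', nHot);
```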
9 Statistics
mean computes mean of a vector
median computes median of a vector
std computes standard deviation
min returns minimum value in a vector
max returns maximum value in a vector
skewness returns skewness
kurtosis returns kurtosis
normcdf/tcdf/chi2cdf cumulative distribution function for normal, t and χ2 distributions
norminv/tinv/chi2inv inverse of the cumulative distribution function
p0 = normcdf(x0,mu,sigma);
x0 = norminv(p0,mu,sigma);
p0 = tcdf(x0,V);
x0 = tinv(p0,V);
10 Polynomials and interpolations
polyval returns the value of a polynomial (represented by its coefficients) for some x
coeff = [3 2.7 1 -5.7];
x=0:0.2:0.6;
y=polyval(coeff,x)
polyfit returns the coefficients of the polynomial that best fits data points
coeff = polyfit(datax,datay,3); %3 means cubic polynomial
interp1 interpolates between data points (spline or linear)
y1=interp1(datax,datay,xq,'linear'); %xq: points at which to interpolate
y2=interp1(datax,datay,xq,'spline');
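The three functions of this section can be tried together on a small made-up dataset. In this sketch the data follow y = 2x² + 1 exactly, and the query point 2.5 and the names yfit and ylin are chosen for illustration.

```matlab
datax = 0:4;
datay = 2*datax.^2 + 1;                        % made-up data on an exact parabola
coeff = polyfit(datax, datay, 2);              % fit a quadratic: close to [2 0 1]
yfit  = polyval(coeff, 2.5);                   % evaluate the fitted polynomial at x = 2.5
ylin  = interp1(datax, datay, 2.5, 'linear');  % straight line between the neighbors at x = 2 and x = 3
fprintf('polyval: %.2f, linear interpolation: %.2f\n', yfit, ylin);
```

The fitted polynomial follows the curvature of the data, while linear interpolation cuts straight across between the two neighboring points, so the two values differ.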
11 Image processing
imread loads an image (as a matrix) into the workspace
A=imread('image.png');
image display an image (2D or 3D)
imagesc display a 2D image and scale indices to use all the colors in the color map
colormap set a colormap (only for indexed images)
Indexed Images (2D)
Indexed image, need two matrices:
iMat a 2D matrix with indices
cMap an n×3 matrix with the RGB code for each index
To display this image:
image(iMat);
colormap(cMap);
If the colormap is not consistent with indices, you need to use imagesc(iMat).
True-color image (3D, RGB)
No need to prescribe a colormap:
iMat(:,:,1) Red matrix (between 0–255 if uint8, or 0–1 if double)
iMat(:,:,2) Green matrix
iMat(:,:,3) Blue matrix
To display this image: image(iMat).
12 Miscellaneous
rand returns a random floating point number between 0 and 1
x=rand
x=rand(10,2)
round round input to closest integer
A=round(rand*10);
whos displays list of all variables in MATLAB workspace
tic/toc displays the elapsed time for a chunk of code
sqrt square root
13 Functions
Calling a function: [output1,output2] = functionname(arg1,arg2);
Function header (top lines of the file that implements this function):
function [output1,output2] = functionname(arg1,arg2)
% H1 line: describe what the function does
ESS116, M. Morlighem, Updated: March 14, 2019