Statistical Inference STAT 431
Lecture 2: Summarizing Data – One Variable
Types of Variables / Data
• Categorical
– Classifies each sampled unit into one of several distinct categories
– Nominal: different categories are not ordered
• E.g., color (red / green / blue), gender (male / female), etc.
– Ordinal: categories are ordered
• E.g., rating (good / fair / bad), grade (A / B / C / …), etc.
• Numerical coding of ordinal variables can be misleading! (GPA?)
• Numerical
– Takes values from a set of numbers
– Discrete: e.g., number of siblings / number of cars in a household / …
– Continuous: e.g., time / distance / height / income / …
Concept: Empirical Distribution
• Recall: A distribution is used to describe what values the population takes and how
frequently these values occur.
• Analogously, an empirical distribution describes what values the sample takes and
how frequently these values occur.
• Different types of data require different ways to describe their empirical distributions.
Categorical Data
• An example of nominal data: Cause of Flight Delay
– Causes of delay for 70 arrival flights at an airport were recorded in Jan, 2010.
– Frequency table: shows the number of occurrences of each category
Cause of flight delay      Frequency
Traffic control delay      28
Cancelled or diverted      7
Weather delay              5
Aircraft arriving late     17
Air carrier delay          11
Security delay             2
• For ordinal data, the categories should be listed in their natural order
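A frequency table like the one for the flight-delay data can be built in a few lines. The sketch below is illustrative only: the counts are copied from the slide, while the function name `frequency_table` and the added relative-frequency column are assumptions, not part of the lecture.

```python
from collections import Counter

# Counts copied from the flight-delay example (70 delayed arrival flights).
delay_counts = Counter({
    "Traffic control delay": 28,
    "Cancelled or diverted": 7,
    "Weather delay": 5,
    "Aircraft arriving late": 17,
    "Air carrier delay": 11,
    "Security delay": 2,
})

def frequency_table(counts):
    """Return (category, frequency, relative frequency) rows, most frequent first."""
    total = sum(counts.values())
    return [(cat, freq, freq / total) for cat, freq in counts.most_common()]

for cat, freq, rel in frequency_table(delay_counts):
    print(f"{cat:<24}{freq:>4}{rel:>8.3f}")
```

Sorting by frequency is reasonable for nominal data; for ordinal data the rows would instead be kept in category order.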
Numerical Data
• The empirical distribution of numerical data can be described numerically and
graphically in terms of:
– Center: Around which value are most of the data located?
– Dispersion: How variable are the values?
– Shape:
• Is the distribution symmetric or skewed?
• Are there multiple peaks (modes) or just one, i.e., unimodal or multi-modal?
– Outliers: Are there certain values that seem surprisingly large or small?
Example: Pearson’s Father-Son Height Data
• K. Pearson (England, 1857-1936) measured the heights of 1078 fathers, and their sons at maturity.
[pearson.txt]
• Focus only on the heights of fathers.
• 1078 numbers are hard to grasp
Use numbers to summarize the empirical distribution of data!
Father’s height (inch)     Son’s height (inch)
65.05                      59.78
63.25                      63.21
64.95                      63.34
65.75                      62.79
……                         ……
71.33                      68.27
71.78                      69.31
70.74                      69.30
70.31                      67.02
Center & Dispersion: Mean & SD
• Suppose: sample size $n$, raw data $x_1, \dots, x_n$
• Center: sample mean
$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
• Dispersion: sample standard deviation
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$s^2$ is called the sample variance.
• Example
– 2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 7.0: mean = 4.10, SD = 2.21
– 2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 70: mean = 13.1, SD = 25.15
• Drawback: sample mean and sample SD are sensitive to extreme values in data
• Can we have more robust summaries of center and dispersion?
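The two formulas can be checked directly on the example data. This is a minimal sketch; the helper names `sample_mean` and `sample_sd` are chosen here for illustration.

```python
import math

def sample_mean(xs):
    """Sample mean: sum of the values divided by n."""
    return sum(xs) / len(xs)

def sample_sd(xs):
    """Sample standard deviation, using the n - 1 divisor."""
    xbar = sample_mean(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

clean = [2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 7.0]
contaminated = [2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 70]  # last value replaced by 70

for name, xs in [("clean", clean), ("contaminated", contaminated)]:
    print(f"{name}: mean = {sample_mean(xs):.2f}, SD = {sample_sd(xs):.2f}")
```

Changing a single observation from 7.0 to 70 moves both summaries substantially, which is the sensitivity the slide points to.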
Center & Dispersion: Robustness
More robust measures of center & dispersion can be defined by sample quantiles
[Figure: Schematic plot of sample quantile definition. The points $(x_{(i)}, \frac{i}{n+1})$ are plotted with x on the horizontal axis and percent (0 to 1) on the vertical axis; the sample quantile $\tilde{x}_{0.60}$ is marked.]
Center & Dispersion: Robustness
Precise definition of sample quantiles:
Raw data: $x_1, \dots, x_n$
1. Order the data values:
$$x_{\min} = x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} = x_{\max}$$
2. Define the $\frac{i}{n+1}$-th sample quantile:
$$\tilde{x}_{\frac{i}{n+1}} = x_{(i)}$$
3. Define all the other sample quantiles by linear interpolation
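The three steps can be sketched as a small function. The name `sample_quantile` is hypothetical, and raising an error for p outside [1/(n+1), n/(n+1)] is an assumption made here, since the definition does not say what to do beyond the extreme order statistics.

```python
import math

def sample_quantile(data, p):
    """p-th sample quantile: x_(i) sits at i/(n+1); linear interpolation in between."""
    xs = sorted(data)              # step 1: order the data
    n = len(xs)
    h = p * (n + 1)                # fractional rank; h = i means exactly x_(i)
    eps = 1e-12                    # tolerance for floating-point rounding
    if h < 1 - eps or h > n + eps:
        raise ValueError("p must lie between 1/(n+1) and n/(n+1)")
    h = min(max(h, 1.0), float(n))
    i = math.floor(h)
    if i == n:                     # step 2 at the largest order statistic
        return xs[-1]
    frac = h - i                   # step 3: interpolate between x_(i) and x_(i+1)
    return xs[i - 1] + frac * (xs[i] - xs[i - 1])

data = [2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 7.0]
for p in (0.25, 0.5, 0.6, 0.75):
    print(p, sample_quantile(data, p))
```

With n = 7, the probabilities 0.25, 0.5 and 0.75 correspond to ranks 2, 4 and 6, so no interpolation is needed for the quartiles of this sample.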
Center & Dispersion: Median & IQR
• Sample quartiles: $\tilde{x}_{0.25} = Q_1$, $\tilde{x}_{0.5} = Q_2$, $\tilde{x}_{0.75} = Q_3$ are the lower / middle / upper sample quartiles
• Center: median $m = Q_2$
• Dispersion: interquartile range (IQR) – range for the middle half of the data
$$\mathrm{IQR} = Q_3 - Q_1$$
• Example
– 2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 7.0: mean = 4.10, SD = 2.21, median = 4, IQR = 4.5
– 2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 70: mean = 13.1, SD = 25.15, median = 4, IQR = 4.5
• Median / IQR are more robust than mean / SD
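The robustness comparison can be reproduced with Python's standard `statistics` module. As far as its documentation describes, `statistics.quantiles` with `method="exclusive"` places the i-th ordered value at i/(n+1) and interpolates linearly, which agrees with the quantile definition used in these notes; the helper name `center_and_dispersion` is an assumption for illustration.

```python
import statistics

def center_and_dispersion(xs):
    """Return mean / SD and median / IQR for one sample."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="exclusive")
    return {
        "mean": statistics.mean(xs),
        "sd": statistics.stdev(xs),   # n - 1 divisor
        "median": q2,
        "iqr": q3 - q1,
    }

clean = [2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 7.0]
contaminated = [2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 70]

for name, xs in [("clean", clean), ("contaminated", contaminated)]:
    print(name, center_and_dispersion(xs))
```

The median and IQR are identical for the two samples, while the mean and SD change sharply.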
Summary Statistics for Father Height
mean      SD      median    IQR
67.69     2.74    67.76     3.82
From Numbers to Plots
• Mean & SD or median & IQR summarize center & dispersion of data
• But they say little about the shape of the empirical distribution or about outliers
• Histogram
[Figure: Histogram of father’s height. x-axis: fheight (values divided into bins); y-axis: frequency of each bin.]
Histogram
• Constructing a histogram
1. Divide the range of the data into class intervals (bins)
• One possible choice of the number of intervals: $\approx \sqrt{n}$
• Intervals can have equal or unequal widths
2. Count the number of data values in each interval
3. Make a plot
• x-axis: data value; y-axis: frequency / relative frequency
• On each class interval, draw a rectangle, with width = interval width, and area proportional to frequency!
• Using histograms
– Gain understanding about the shape of the distribution
• E.g., unimodal vs. multi-modal, symmetric vs. skewed
– Reveal outliers in data
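The construction steps can be sketched without any plotting library. The version below assumes equal-width bins and rounds the square root of n to choose their number; the function name `histogram` and the small sample are hypothetical, chosen only to illustrate the steps.

```python
import math

def histogram(data, num_bins=None):
    """Equal-width histogram: list of (left, right, count, height) per bin.

    height = count / (n * width), so each rectangle's area equals the
    relative frequency of its bin.
    """
    n = len(data)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))   # rule of thumb: about sqrt(n) bins
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    if width == 0:
        raise ValueError("all data values are identical")
    counts = [0] * num_bins
    for x in data:
        k = min(int((x - lo) / width), num_bins - 1)  # the maximum goes in the last bin
        counts[k] += 1
    return [(lo + k * width, lo + (k + 1) * width, c, c / (n * width))
            for k, c in enumerate(counts)]

sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # hypothetical data, not from the lecture
for left, right, count, height in histogram(sample):
    print(f"[{left:.2f}, {right:.2f}): count = {count}, height = {height:.3f}")
```

With unequal bin widths the same height formula applies per bin, which is why area rather than height should be proportional to frequency.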
Histogram: Examples
[Figure: example histograms (y-axis: Frequency) with panels labeled symmetric, positively skewed, negatively skewed, symmetric & bi-modal, and possible outlier; one panel shows orders$Orders (x-axis: Orders).]
Box Plot
• Fences
– Lower Fence = $Q_1 - 1.5 \times \mathrm{IQR}$
– Upper Fence = $Q_3 + 1.5 \times \mathrm{IQR}$
• Box
– Rectangle drawn between upper / lower quartiles (also called hinges of the box plot)
– Divided at median
• Whiskers
– Two lines extending the ends of the box to the extreme data values inside the fences
• Possible outliers marked by dots / asterisks / circles …
[Figure: annotated box plot (axis from 60 to 75) labeling possible outliers, smallest data value inside fences, lower quartile, median, upper quartile, inter-quartile range (IQR), and largest data value inside fences.]
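The quantities behind a box plot can be computed before anything is drawn. The sketch below is illustrative: `box_plot_summary` is a hypothetical helper, and it relies on `statistics.quantiles` with `method="exclusive"`, which (per its documentation) uses the i/(n+1) convention adopted for sample quantiles in these notes.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, fences, whisker ends and possible outliers for one sample."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        # whiskers end at the extreme data values inside the fences
        "whisker_low": min(inside), "whisker_high": max(inside),
        # anything beyond the fences is flagged as a possible outlier
        "outliers": sorted(x for x in data if x < lower_fence or x > upper_fence),
    }

summary = box_plot_summary([2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 70])
for key, value in summary.items():
    print(f"{key}: {value}")
```

For this sample the value 70 lies above the upper fence, so it would be drawn as a separate point rather than as part of the whisker.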
Box Plot
• Possible uses of box plots
– Outliers
– Symmetry of the data
– Long / short tails (whiskers)
– Comparing two or more samples (side-by-side box plots)
• Drawback
– Does not show whether the distribution has one peak or multiple peaks
[Figure: side-by-side box plots of fheight and sheight (axis from 60 to 75).]
Diagnostic Plot
• Many inference tools to appear in this course assume that the population distribution is normal.
• Is this assumption reasonable?
• Is there a tool to check this assumption?
• Possible solution: histogram
• Drawbacks
– Depends on how one groups the raw data
– Hard for the human eye to judge whether a plot looks like a normal density (bell-shaped) curve
[Figure: Histogram from 30 normal samples. x-axis: x; y-axis: Frequency.]
Normal Plot
• Normal plot (normal quantile-quantile plot / QQ plot)
– Compute z-scores: $z_i = \frac{x_i - \bar{x}}{s}$ (mean = 0, SD = 1)
– Plot the $\frac{i}{n+1}$-th theoretical quantile against the $\frac{i}{n+1}$-th sample quantile (standardized), i.e., $\Phi^{-1}\left(\frac{i}{n+1}\right)$ vs. $z_{(i)}$
• Normal approximation is reasonable $\iff$ points approximate the straight line $y = x$
[Figure: Histogram from 30 normal samples (x-axis: x; y-axis: Frequency) next to its Normal Q-Q plot (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles).]
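The coordinates of a normal plot can be computed with the standard library alone. This is a sketch under the conventions stated above (standardized order statistics against standard normal quantiles at i/(n+1)); the function name `normal_qq_points` is hypothetical, and the small sample is reused from the mean / SD example purely for illustration.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Return the points (Phi^{-1}(i/(n+1)), z_(i)) for i = 1, ..., n."""
    n = len(data)
    xbar, s = mean(data), stdev(data)           # stdev uses the n - 1 divisor
    z = sorted((x - xbar) / s for x in data)    # standardized order statistics
    std_normal = NormalDist()                   # mean 0, SD 1
    return [(std_normal.inv_cdf(i / (n + 1)), z[i - 1]) for i in range(1, n + 1)]

points = normal_qq_points([2.2, 4.0, 3.7, 0.9, 6.7, 4.2, 7.0])
for theoretical, observed in points:
    print(f"{theoretical:7.3f}  {observed:7.3f}")
```

Plotting the first coordinate on the x-axis and the second on the y-axis gives the normal plot; points close to the line y = x support the normal approximation.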
Normal Plot (Cont’d)
• We put the normal (theoretical) quantiles on the x-axis and the sample quantiles on the y-axis (the R convention)
• Standardizing the order statistics lets us compare the plotted points with the line $y = x$
• Otherwise (plotting the raw order statistics $x_{(i)} = \bar{x} + s \, z_{(i)}$), we compare the points with the line $y = \bar{x} + s \times x$
[Figure: Normal Q-Q plot with the reference line $y = x$ and plotted points $\left(\Phi^{-1}\left(\frac{i}{n+1}\right), z_{(i)}\right)$. x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.]
Normal Plot: Father Height
[Figure: Normal Q-Q plot of father’s height. x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.]
Data Transformation
• Sometimes, there is strong evidence that the sample data do not come from a
normal (population) distribution.
Example: costs for 30 randomly selected elderly patients receiving “geriatric team
care” in a hospital (p.122, Example 4.7 of the textbook)
[Figure: Normal Q-Q plot of the costs. x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.]
Data Transformation (Cont’d)
• However, the normal assumption might be reasonable for transformed data (in this example, try the logs of the costs)
[Figure: two Normal Q-Q plots, "Normal plot: original costs" and "Normal plot: log costs". x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.]
• Commonly used transformations:
– log / square root (for positively skewed data)
– exponential / square (for negatively skewed data)
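The effect of a log transformation on positively skewed data can be illustrated numerically. The sketch below is not the textbook example: the values are hypothetical, and the moment-based skewness coefficient is swapped in here as a quick numeric stand-in for inspecting a normal plot.

```python
import math

def skewness(xs):
    """Moment-based skewness m3 / m2^(3/2); positive means a longer right tail."""
    n = len(xs)
    xbar = sum(xs) / n
    m2 = sum((x - xbar) ** 2 for x in xs) / n
    m3 = sum((x - xbar) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed sample (each value doubles the previous one).
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(x) for x in raw]

print(f"skewness of raw data: {skewness(raw):.3f}")
print(f"skewness of log data: {skewness(logged):.3f}")
```

The raw values have a long right tail, while their logs are evenly spaced and therefore symmetric, which is the kind of improvement the log transformation aims for.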
Class Summary
• Key points of this class:
– Types of variables / data
– Summarizing categorical data
• frequency table
– Summarizing numerical data
• Summary statistics: mean / SD, median / IQR
• Graphical display: histogram / box plot
– Checking distribution assumption
• Normal QQ plot
• Data transformation
• Reading: Sections 4.1–4.3 of the textbook
• Next class: Summarizing Data – Multiple Variables (Section 4.4)