2 Representing sample data
It can be difficult to interpret a data set when it is specified as a raw list of
numbers. In order to help us understand the key features of a particular set of
numerical data, it is useful to be able to compute and present summaries. These
summaries can either be numerical or graphical.
2.1 Numerical summaries
2.1.1 Sample mean and median
Measures of location give the data analyst a sense of where the centre of the
data is located. The simplest measure of location is the sample mean, which
is the numerical average of the data. Suppose the n observations are x1, . . . , xn,
then the sample mean is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + \cdots + x_n}{n}.$$
The sample mean can be influenced by the presence of extreme values, or outliers,
in the data set. It may also be an inappropriate measure of location of the
distribution if the data are skewed (see Section 2.2.3). In either case, it may be
more appropriate to use the sample median defined below.
We define the sample median in terms of the order statistics, denoted
x(1), x(2), . . . , x(n). The order statistics are the sample values arranged in in-
creasing order. Thus, x(1) is the smallest value, x(2) is the second smallest value,
and so on.
In the case that n is odd, the sample median is the middle observation of
the data after arranging in increasing order. For even n, it is the average of the
middle two observations after arranging the data in increasing order. In terms
of the order statistics, the sample median is
$$\tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd,} \\[6pt] \tfrac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even.} \end{cases}$$
Example 2.1. For the car battery lifetime data from Chapter 1, n = 40 and so
the median is $\tilde{x} = \tfrac{1}{2}\left(x_{(20)} + x_{(21)}\right) = \tfrac{1}{2}(3.4 + 3.4) = 3.4$. The sample mean and
median can be computed in R easily as follows:
> battery <- read.table("battery.txt", header=TRUE)
> battlife <- battery$life   # these two commands read in the data
> mean(battlife)
[1] 3.4125
> median(battlife)
[1] 3.4
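To connect the built-in function with the definition above, the median can also be computed directly from the order statistics. A minimal sketch, assuming battlife has been read in as above:
> xs <- sort(battlife)           # order statistics x(1), ..., x(40)
> n <- length(xs)                # n = 40, which is even
> (xs[n/2] + xs[n/2 + 1]) / 2    # average of x(20) and x(21)
[1] 3.4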
2.1.2 Sample variance and standard deviation
As well as measures of location, it is important to be able to give measures of
the variability or spread of the data. One of the simplest and most commonly
used measures is the sample variance,
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right),$$
and sample standard deviation, $s = \sqrt{s^2}$, $s > 0$.
Importantly, the sample variance is given by the sum of the squared devia-
tions (xi− x̄)2 from the sample mean divided by the degrees of freedom n−1.
The reason for dividing by (n− 1) rather than n is somewhat technical and will
be discussed in further detail later in the course. For now, a brief explanation is
that there are only (n− 1) independent deviations from the mean. Specifically,
it is true that
$$\sum_{i=1}^{n}(x_i - \bar{x}) = 0,$$
and so once (x1 − x̄), . . . , (xn−1 − x̄) are known the value of (xn − x̄) is fixed.
The importance of this will become clearer in later chapters.
The sample variance is equal to zero only if all observations have the same
value, i.e. if xi = x̄ for i = 1, . . . , n. If the observations are more spread out,
then (xi − x̄)2 will become larger and so too will s2. Thus s2 is a reasonable
measure of spread. An alternative measure of variability is the interquartile
range, discussed in Section 2.1.5.
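In R the sample variance and standard deviation are given by the var() and sd() functions, both of which use the divisor n − 1. The following sketch, assuming the battery data are loaded as before, also verifies numerically that the deviations from the mean sum to zero:
> var(battlife)                                                 # sample variance s^2
> sd(battlife)                                                  # sample standard deviation s
> sum((battlife - mean(battlife))^2) / (length(battlife) - 1)   # agrees with var(battlife)
> sum(battlife - mean(battlife))                                # zero, up to rounding error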
2.1.3 Population quantiles
Suppose that X is a continuous random variable with strictly increasing c.d.f.
FX(x) and p.d.f. fX(x). The population quantile corresponding to probability
p ∈ [0, 1], also known as the population p quantile, is the value x such that
$$P(X \le x) = F_X(x) = p,$$
or equivalently such that
$$\int_{-\infty}^{x} f_X(t)\, dt = p.$$
We use the notation Q(p) to refer to the p quantile defined above, so that
$P\{X \le Q(p)\} = p$. Note that if $F_X^{-1}$ is the inverse function of the c.d.f., then
$$Q(p) = F_X^{-1}(p).$$
The quantile corresponding to p = 0.5 is referred to as the (population)
median, and the 0.25 and 0.75 quantiles are referred to as the population lower
quartile and upper quartile respectively.
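For standard distributions, the inverse c.d.f. $F_X^{-1}$, and hence the population quantile function Q(p), is available in R through the q-family of functions. A brief illustration, using distributions chosen purely as examples:
> qnorm(0.5)            # population median of the standard normal distribution: 0
> qnorm(0.25)           # lower quartile of the standard normal: about -0.674
> qexp(0.5, rate=1)     # median of the Exponential(1) distribution: log(2), about 0.693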
2.1.4 Sample quantiles
Given a random sample X1, . . . , Xn from the distribution FX(x), we wish to be
able to estimate the population quantile Q(p) using the corresponding sample
quantile Q̂(p). There is no single answer as to how to calculate a sample
quantile: Statisticians have proposed several different definitions, which all have
different theoretical properties.
In this course, we define the sample p quantile Q̂(p) to be the observation
in the p(n + 1)th position when the data are arranged in increasing order. If
p(n+ 1) is not an integer, linear interpolation is used to calculate the quantile.
More precisely, the value of Q̂(p) defined above can be calculated as follows using
the order statistics:
1. Calculate r = p× (n+ 1)
2. If r is an integer, set Q̂(p) = x(r)
3. If r < 1, set Q̂(p) = x(1)

4. Otherwise, set Q̂(p) = x(⌊r⌋) + (r − ⌊r⌋)(x(⌊r⌋+1) − x(⌊r⌋)), where ⌊k⌋ denotes the largest integer not exceeding k ∈ R; ⌊k⌋ is known as the floor of k.

Some important special cases are the sample lower quartile, median, and upper quartile, which are the sample 0.25 quantile, 0.5 quantile and 0.75 quantile respectively.

Example 2.2. Suppose that we have the following data set of size n = 19, arranged in ascending order for convenience:

0.04 0.05 0.08 0.20 0.32 0.40 0.43 0.44 0.54 0.62
0.72 0.74 0.74 0.88 0.89 0.90 0.90 0.97 1.00

It is desired to compute the sample lower quartile, which is the sample quantile corresponding to p = 0.25. We have that r = 0.25 × (19 + 1) = 5, which is an integer, and so Q̂(0.25) = x(5) = 0.32.

Sample quantiles can be easily computed in R using the quantile command. For the above example, use the following code:

> X <- c(0.04, 0.05, 0.08, 0.20, 0.32, 0.40, 0.43, 0.44, 0.54, 0.62,
         0.72, 0.74, 0.74, 0.88, 0.89, 0.90, 0.90, 0.97, 1.00)
> quantile(x=X, probs=0.25, type=6)
25%
0.32
R implements 9 different methods for computing the sample quantiles. If the
type=6 argument is omitted, then instead R computes the ‘Type 7’ p quantile,
which is the observation in the [p(n− 1) + 1]th position in the ordered sample,
again using linear interpolation if necessary. The numerical values of the two
types of quantiles are usually different, but will be similar when n is large.
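To make the procedure of Section 2.1.4 concrete, the following sketch implements the p(n + 1) interpolation rule directly and checks it against quantile() with type=6. The function name sample_quantile is our own, not part of base R; the sketch also caps r at 1 and at n so that very small or very large p cannot index outside the sample.
sample_quantile <- function(x, p) {
  xs <- sort(x)                   # order statistics x(1), ..., x(n)
  n <- length(xs)
  r <- p * (n + 1)
  if (r <= 1) return(xs[1])       # below the smallest order statistic
  if (r >= n) return(xs[n])       # above the largest order statistic
  lo <- floor(r)
  xs[lo] + (r - lo) * (xs[lo + 1] - xs[lo])   # linear interpolation; exact when r is an integer
}
sample_quantile(X, 0.25)          # 0.32, matching quantile(x=X, probs=0.25, type=6)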
2.1.5 Five number summary
A simple and convenient description of a dataset is the five number summary,
which is a list consisting of:
1. the sample minimum, x(1)
2. the sample lower quartile, Q̂(0.25)
3. the sample median, Q̂(0.5)
4. the sample upper quartile, Q̂(0.75)
5. the sample maximum, x(n)
The median and quartiles can be calculated using the methods described in the
previous section. Other methods also exist, and thus for a given dataset there is
no unique five number summary. The five number summary can be obtained in
R simply via either the summary() or quantile() commands, e.g.:
> summary(battlife)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.600 3.100 3.400 3.412 3.825 4.700
> quantile(battlife)
0% 25% 50% 75% 100%
1.600 3.100 3.400 3.825 4.700
The 0%, 25%, 50%, 75% and 100% quantiles correspond to the sample minimum,
lower quartile, median, upper quartile and maximum respectively. As before, by
default R computes the Type 7 quartiles. To obtain the Type 6 quartiles defined
in the previous section, use:
> quantile(battlife, type=6)
0% 25% 50% 75% 100%
1.600 3.100 3.400 3.875 4.700
In addition to the sample variance and standard deviation, another measure
of variability is the sample interquartile range, which is the difference between
the sample upper and lower quartiles,
IQR = Q̂(0.75)− Q̂(0.25) .
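In R the interquartile range can be obtained from the quartiles directly, or with the built-in IQR() function (which also accepts a type argument). For the battery data, both approaches give 3.875 − 3.1 = 0.775 when the Type 6 quartiles are used:
> q <- quantile(battlife, probs=c(0.25, 0.75), type=6)
> unname(q[2] - q[1])      # upper quartile minus lower quartile
[1] 0.775
> IQR(battlife, type=6)    # built-in equivalent
[1] 0.775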
2.2 Graphical summaries
2.2.1 Bar chart
A bar chart is used to display the distribution of a sample of values of a qualita-
tive or discrete variable. On the horizontal axis, the bar chart shows the distinct
values of the qualitative variable. The vertical axis shows either the number of
times that value occurs in the data (i.e. the frequency), or the proportion of
times that value occurred. To indicate that the values are different categories,
the bars are positioned so that they do not touch one another. As an example,
consider the opinion poll data from Chapter 1, which is plotted on the bar chart
below. A bar chart is appropriate as the variable of interest is qualitative. The
horizontal axis shows each of the different parties and the vertical axis shows the
number of individuals that intend to vote for that party.
Figure: Bar chart of the opinion poll data
Often it is of interest to estimate the proportion of individuals in the popula-
tion that support a particular party, e.g. Labour. If the sample can be considered
representative, then a reasonable way to estimate the population proportion is
simply to use the proportion of individuals supporting the corresponding party
in the sample. We will discuss the performance of this method further in later
chapters. However, in most real world polls it is clear that the sample obtained
is not representative. In the field of Survey Statistics, more sophisticated esti-
mation methods are used to attempt to adjust for this non-representativeness.
These more sophisticated methods are outside the scope of this course. The bar
chart above can be generated using the following R code:
polldata <- c(369, 314, 75, 118, 124)
names(polldata) <- c("Conservative", "Labour", "Lib Dem", "UKIP", "Other")
barplot(polldata, main="Bar chart of opinion poll data",
        col=c("blue", "red", "gold", "purple", "grey"))

2.2.2 Histogram
A histogram is used to display the distribution of a sample of values of a continuous quantitative variable. To construct a density histogram we use the following procedure:

1. Choose an origin t0 and a bin width h.

2. Use the chosen origin and bin width to define a regular mesh of equally spaced points
$$t_j = t_0 + jh, \qquad j \in \mathbb{Z}.$$
Note that tj is defined for both nonnegative and negative integers, i.e. j = −1, −2, ..., as well as j = 0, 1, 2, ....

3. Define bins Bk = (tk−1, tk], k ∈ Z. Note that the bins are of width h, are disjoint and cover the entire real line.

4. Define the height of the density histogram by
$$\mathrm{Hist}(x) = \frac{\nu_k}{nh}, \qquad \text{for } x \in B_k,$$
where νk is equal to the number of sample values that lie in the bin Bk, i.e. νk is the number of xj, j = 1, ..., n, satisfying xj ∈ Bk.

The histogram is plotted by drawing a rectangular bar corresponding to each interval Bk, with height given by Hist(x). Unlike a bar chart, the bars in a histogram are positioned so that they touch one another, reflecting the continuous nature of the data. Note that the area of a bar in the histogram is proportional to the number of observations in the corresponding bin. The total area enclosed by a density histogram is 1.

Example 2.3. Consider the battery lifetime data from Chapter 1. The density histogram shown below has origin at 3.0 and bin width h = 0.5.

Figure: Histogram of the battery lifetime data (horizontal axis: battery lifetime, 1.5 to 5.0; vertical axis: density, 0.0 to 0.6)

The figure can be obtained using the following R code:

hist(battlife, breaks=seq(from=1.5, to=5, by=0.5), freq=FALSE, col="grey",
     main="Histogram of battery lifetime data", xlab="Battery lifetime")

The breaks argument specifies the location of the points in the mesh, i.e. the bin endpoints. The argument freq=FALSE ensures a density histogram is plotted; otherwise the bar heights are multiplied by nh, giving a frequency histogram whose bar heights are the bin counts νk. The other arguments specify colour options, the title caption and axis labels.

2.2.3 Box plots
An alternative way of displaying a sample of values of a continuous variable is to use a box plot (or box-and-whisker plot). This plot is based on the five-number summary, and has three main components:

• ‘the box’, whose edges are at the sample lower and upper quartiles. The median is also indicated as a horizontal line inside the box.

• ‘the whiskers’, which are lines that extend from the box to the adjacent values, namely the maximum and minimum of the part of the sample remaining after any outliers have been removed.

• ‘outliers’, which are indicated by circles (◦).

Outliers are extreme observations that are considered to be very far from the main body of the data. In general in Statistics, outlier detection can be a challenging problem. For the purposes of drawing box plots, we identify outliers according to the following simple rule of thumb:

• An observation xi is classified as an outlier if xi < Q̂(0.25) − 1.5 × IQR or xi > Q̂(0.75) + 1.5 × IQR.
In other words an observation is classified as an outlier if it is further than
1.5 × IQR from the nearest edge of the box, where IQR is the interquartile
range, Q̂(0.75) − Q̂(0.25). Different software packages may implement different
rules of thumb. The different components of a box plot are labelled below:
Figure: General structure of a box plot
We now illustrate the calculations underlying box plots using an example.
Example 2.4. Consider the battery lifetime data from Chapter 1. The lower
and upper quartiles are 3.1 and 3.875 respectively. The interquartile range is
thus
IQR = 3.875− 3.1 = 0.775 .
Observations are thus classed as outliers if they lie above
Q̂(0.75) + 1.5× IQR = 3.875 + 1.5× 0.775 = 5.0375 ,
or alternatively if they lie below
Q̂(0.25)− 1.5× IQR = 3.1− 1.5× 0.775 = 1.9375 .
There are no data points lying above the upper threshold, and so the upper
adjacent value is equal to the sample maximum, 4.7. However, there are two
observations (1.6, 1.9) below the lower threshold, which are thus classified as
outliers. The lower adjacent value is equal to the minimum of the sample ex-
cluding these outliers, namely 2.2.
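These calculations can be reproduced in R; a short sketch using the Type 6 quartiles from earlier:
> q <- quantile(battlife, probs=c(0.25, 0.75), type=6)   # 3.1 and 3.875
> iqr <- unname(q[2] - q[1])                             # 0.775
> lower <- unname(q[1]) - 1.5 * iqr                      # 1.9375
> upper <- unname(q[2]) + 1.5 * iqr                      # 5.0375
> battlife[battlife < lower | battlife > upper]          # the two outliers, 1.6 and 1.9
> min(battlife[battlife >= lower])                       # lower adjacent value, 2.2
> max(battlife[battlife <= upper])                       # upper adjacent value, 4.7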
The box plot can be obtained in R using the command:
> boxplot(battlife, main="Box plot of battery life data")
The resulting figure is shown below, and agrees approximately with our calcula-
tions. We can see that the distribution of the data appears to be fairly symmetric
around its centre.
Figure: Box plot of the battery life data
It is worth noting that for box plots, R computes the quartiles using neither
the Type 6 nor Type 7 quantiles discussed previously, but instead using another
different method which we will not explain. Again the results should be similar
for large n.
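The values R actually uses when drawing a box plot can be inspected with boxplot.stats(); note that the hinge values it reports are computed by R's own rule and so may differ slightly from the Type 6 or Type 7 quartiles:
> bstats <- boxplot.stats(battlife)
> bstats$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end
> bstats$out     # observations plotted individually as outliers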
Example 2.5. For the income data, we have the box plot below. This indicates
that the distribution of the middle 50% of observations is fairly symmetric about
the median. However, the lower whisker is very short and the upper whisker is
much longer, suggesting that the distribution is skewed. The presence of a dozen
outliers is further evidence of skewness.
Figure: Box plot for the Manchester income data
Box plots can also be used to graphically compare the distribution of two or
more samples of measurements of the same variable. We illustrate this technique
with an example below.
Example 2.6. Consider a data set containing blood plasma β endorphin con-
centrations (pmol/l) for 22 runners who had taken part in the Tyneside Great
North Run one year. Measurements were obtained from 11 runners who success-
fully completed the race, and 11 runners who collapsed near the end of the race.
The data were as follows. The data have been rearranged in increasing order
within each group for convenience.
Group Measurements
Successful 14.2, 15.5, 20.2, 21.9, 24.1, 25.1, 29.6, 29.6, 34.6, 37.8, 46.2
Collapsed 66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414
In order to facilitate comparison it is helpful to put the box plots for the two
groups on the same axes. This can be achieved in R as follows:
> runners_grp <- read.table(file="runners_group.txt", header=TRUE)
> boxplot(pmol~group, data=runners_grp,
          ylab="Beta endorphin concentration (pmol/l)",
          main="Distribution of endorphin concentration in runners")
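If the file runners_group.txt is not to hand, an equivalent data frame can be built directly from the values in the table above; a minimal sketch, using the column names pmol and group assumed by the boxplot() call:
> successful <- c(14.2, 15.5, 20.2, 21.9, 24.1, 25.1, 29.6, 29.6, 34.6, 37.8, 46.2)
> collapsed <- c(66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414)
> runners_grp <- data.frame(pmol = c(successful, collapsed),
                            group = rep(c("Successful", "Collapsed"), each = 11))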
The box plot is shown below. It is clear that the endorphin concentration is
much higher and much more variable among the collapsed runners. In both
groups the distribution looks reasonably symmetric.
Figure: Endorphin concentration in the two groups of runners
Summary and discussion
The techniques described above can also be applied when comparing the distri-
bution of the same measurement from k different groups. Tables can be pre-
sented giving the summary statistics for each group, for example the 5-number
summaries. It may be of interest to compare, for example:
(i) the k medians as measures of the centre of the k distributions
(ii) the k interquartile ranges as measures of spread
(iii) the minimum, maximum and range.
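In R, such group-wise summaries can be computed conveniently with tapply(); for example, for the runners data of Example 2.6 (assuming runners_grp is available as above):
> tapply(runners_grp$pmol, runners_grp$group, median)    # group medians
> tapply(runners_grp$pmol, runners_grp$group, IQR)       # interquartile range within each group
> tapply(runners_grp$pmol, runners_grp$group, summary)   # min, quartiles, mean and max per group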
These features can also be assessed by examining box plots, which are graph-
ical representations of the five-number summary. It is also a good idea to com-
ment on the shape of the distributions, e.g. whether they are symmetric or
skewed. Within each group, symmetry and skewness can be assessed via:
• inspecting if the median is in the centre of the box
• comparing the lengths of the two whiskers.
It is also useful to identify any values highlighted as possible outliers.
We can also compare and contrast group means and standard deviations.
Large differences between the mean and median within a group can also indicate
skewness. Other graphical summaries such as histograms can also be compared
and contrasted across groups. It is useful as a check to confirm that the infor-
mation in the histograms agrees with the findings from the box plots.
The summaries we have discussed in this chapter are applied to a particular
data set (perhaps consisting of multiple groups). However, if we were to repeat
the experiment (or survey) we would obtain a different random sample from
the population, and so also obtain different numerical values for the summaries.
Thus, summary statistics such as the mean are also random variables and have
a sampling distribution. An important part of Statistics is developing an un-
derstanding of these sampling distributions, and harnessing this understanding
to be able to draw reliable inferences about a population from the sample. We
discuss these ideas further in subsequent chapters.