F71SM STATISTICAL METHODS
1 DATA SUMMARY
1.1 Introduction
Data: our data consist of a sample x = (x1, x2, . . . , xn) from some population (real or concep-
tual); in the case that every member of the population has the same chance of being included
in the sample we have a (simple) random sample.
For example, five students were selected at random from last year’s second year under-
graduate class and their ages (years) were recorded, giving us a random sample of five ages:
x = (21, 19, 18, 18, 23).
In some situations, the original data may be summarised in a frequency distribution, in
which the value $x_i$ occurs with frequency $f_i$, $i = 1, 2, \ldots, k$, with
$\sum_{i=1}^{k} f_i = n$. For example, here
is a frequency distribution for a random sample of 150 undergraduate students’ ages (years):
x   17   18   19   20   21   22   23   24   25   26
f    2   12   23   27   25   22   19   13    6    1
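As a small illustration, the sketch below tallies the five ages into a frequency distribution and checks that the frequencies in the table of 150 ages add up to n. The variable names (`ages1`, `ages2`, `freq1`) are our own choices for illustration.

```python
from collections import Counter

# Sample of five ages from the text.
ages1 = [21, 19, 18, 18, 23]

# Tally each distinct value x_i with its frequency f_i, in increasing order of x.
freq1 = dict(sorted(Counter(ages1).items()))
print(freq1)                    # {18: 2, 19: 1, 21: 1, 23: 1}

# Frequency distribution of the 150 ages, copied from the table in the text.
ages2 = {17: 2, 18: 12, 19: 23, 20: 27, 21: 25,
         22: 22, 23: 19, 24: 13, 25: 6, 26: 1}

# The frequencies must add up to the sample size n.
print(sum(ages2.values()))      # 150
```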
1.2 Visual display 1 – histogram
A histogram is a display in which we divide the range of the data into a number of intervals
(or ‘bins’) and draw rectangles whose areas are proportional to the frequency of data (number
of values) in the intervals.
We choose the number and widths of the intervals to give us a clear and informative display.
Too many intervals gives us a display which is too detailed and rough; too few intervals gives
us a display which is too crude and we lose too much information. See below for histograms
of the same set of ages (measured to greater accuracy than before): for most purposes, the
left-hand display is too detailed, the middle one is fine, the right-hand one is too crude.
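To make the 'area proportional to frequency' idea concrete, here is a minimal sketch that counts the observations in each interval and sets the height of each rectangle to frequency divided by interval width. It uses the whole-year ages from the frequency table (not the more accurate ages plotted in the histograms), and the bin edges are hypothetical, chosen only to show unequal widths.

```python
def histogram_bars(data, edges):
    """Return (counts, heights) for the intervals [edges[j], edges[j+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for j in range(len(counts)):
            if edges[j] <= x < edges[j + 1]:
                counts[j] += 1
                break               # values outside all intervals are ignored
    # Height = frequency / width, so that area = height * width = frequency.
    heights = [c / (edges[j + 1] - edges[j]) for j, c in enumerate(counts)]
    return counts, heights

ages2 = {17: 2, 18: 12, 19: 23, 20: 27, 21: 25,
         22: 22, 23: 19, 24: 13, 25: 6, 26: 1}
data = [x for x, f in ages2.items() for _ in range(f)]

edges = [17, 19, 21, 23, 27]        # hypothetical edges; the last bin is wider
counts, heights = histogram_bars(data, edges)
print(counts)                       # [14, 50, 47, 39]
print(heights)                      # [7.0, 25.0, 23.5, 9.75]
```

With equal-width intervals the heights are simply proportional to the frequencies; the division by width only matters when the widths differ.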
1.3 Summarising level (average, position)
The principal measures of this characteristic of a data set are mean, median, and mode.
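Sample mean $\bar{x} = \dfrac{\text{sum of observations}}{\text{sample size}} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$ or $\dfrac{1}{n}\sum x$
Frequency distribution format: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{k} f_i x_i$ or $\dfrac{1}{n}\sum fx$ or $\dfrac{\sum fx}{\sum f}$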
Sample median : middle observation of sorted data set (mean of middle two in the case that
n is even); that is, the median is the $\tfrac{1}{2}(n+1)$th sorted value, using linear interpolation
between values as necessary.
Sample mode : observation with greatest frequency (useful in certain circumstances only).
Sample of 5 ages (data set ages1)
mean = 99/5 = 19.8yrs, sorted data: 18, 18, 19, 21, 23 so median = 19yrs
Sample of 150 ages (data set ages2)
mean = 3161/150 = 21.07yrs, median is 75.5th sorted observation = 21yrs, mode = 20yrs
The sample mean is the value each observation would have if the total was the same and all
observations were equal. The sample median is the value with about 50% of the observations
lower than it and 50% of the observations higher than it. The sample mode is a ‘typical’ value
in the sense of being the ‘most common’ value.
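The calculations for ages1 and ages2 can be checked with a short sketch. The helper `sorted_value` is our own name for the 'kth sorted value with linear interpolation' rule; it assumes the position lies between 1 and n.

```python
def sorted_value(xs, position):
    """Value at a 1-based, possibly fractional, position in the sorted list xs,
    using linear interpolation between neighbouring values."""
    lower = int(position)
    frac = position - lower
    if frac == 0:
        return xs[lower - 1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

def median(data):
    xs = sorted(data)
    return sorted_value(xs, (len(xs) + 1) / 2)

def mode(freq):
    """Value with the greatest frequency (the first one found if there is a tie)."""
    return max(freq, key=freq.get)

ages1 = [21, 19, 18, 18, 23]
ages2_freq = {17: 2, 18: 12, 19: 23, 20: 27, 21: 25,
              22: 22, 23: 19, 24: 13, 25: 6, 26: 1}
ages2 = [x for x, f in ages2_freq.items() for _ in range(f)]

print(sum(ages1) / len(ages1), median(ages1))    # 19.8 19
print(round(sum(ages2) / len(ages2), 2))         # 21.07
print(median(ages2), mode(ages2_freq))           # 21.0 20
```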
1.4 Summarising spread (variability, dispersion, variation)
The principal measures of this characteristic of a data set are standard deviation, inter-
quartile range, and range.
Sample standard deviation s is a measure of variability about the mean.
The sum of squared deviations about the mean is $\sum_{i=1}^{n} (x_i - \bar{x})^2$;
sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$;
sample standard deviation s is the positive square root of $s^2$ and is in
the same units as the data.
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left(\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right)}$
Frequency distribution format:
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left(\sum f_i x_i^2 - \frac{1}{n}\left(\sum f_i x_i\right)^2\right)}$
Inter-quartile range (or midspread) IQR = Q3 − Q1, where Q1 and Q3 are the first and
third quartiles respectively (values with 25% and 75% of the observations lower than that
value, respectively).
The quartiles divide an ordered data set into 4 parts, each containing 25% of the values
in the data set. We take Q1 to be the $\tfrac{1}{4}(n+1)$th sorted observation and Q3 to be the
$\tfrac{3}{4}(n+1)$th sorted observation, using linear interpolation between values as necessary.
(Quartiles are special cases of quantiles, which divide an ordered data set into a specified
number of parts, each containing the same percentage of values in the data set. Per-
centiles divide the data set into 100 parts. Q1 and Q3 are the 25th and 75th percentiles,
respectively.)
Range = highest observation − lowest observation
The sample standard deviation is a measure of the average deviation of the values from the
mean – but it measures this in a very particular way – it is the square root of the average of
the squared deviations from the mean; hence an alternative name (which is rarely used), the
‘root mean square deviation from the mean.’
The range of values covered by the IQR (from the lower quartile to the upper quartile)
includes the central 50% of the observations.
Sample of 5 ages (data set ages1)
standard deviation $s = ((1979 - 99^2/5)/4)^{1/2} = 2.17$yrs, range = 23 − 18 = 5yrs
Sample of 150 ages (data set ages2)
standard deviation $s = ((67207 - 3161^2/150)/149)^{1/2} = 2.00$yrs,
lower quartile Q1 is the 37.75th sorted observation = 19.75
upper quartile Q3 is the 113.25th sorted observation = 23
IQR = 23 − 19.75 = 3.25yrs
range = 26 − 17 = 9yrs
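The spread measures for ages1 and ages2 can be reproduced with the sketch below. The function names are our own, and `sorted_value` assumes the requested position lies between 1 and n (which may fail for very small samples).

```python
from math import sqrt

def sample_sd(data):
    """Sample standard deviation with divisor n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

def sorted_value(xs, position):
    """Value at a 1-based, possibly fractional, position in the sorted list xs."""
    lower = int(position)
    frac = position - lower
    if frac == 0:
        return xs[lower - 1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

def quartiles(data):
    """Q1 and Q3 as the (n+1)/4th and 3(n+1)/4th sorted observations."""
    xs = sorted(data)
    n = len(xs)
    return sorted_value(xs, (n + 1) / 4), sorted_value(xs, 3 * (n + 1) / 4)

ages1 = [21, 19, 18, 18, 23]
ages2_freq = {17: 2, 18: 12, 19: 23, 20: 27, 21: 25,
              22: 22, 23: 19, 24: 13, 25: 6, 26: 1}
ages2 = [x for x, f in ages2_freq.items() for _ in range(f)]

print(round(sample_sd(ages1), 2), max(ages1) - min(ages1))   # 2.17 5
q1, q3 = quartiles(ages2)
print(round(sample_sd(ages2), 2), q1, q3, q3 - q1)           # 2.0 19.75 23.0 3.25
print(max(ages2) - min(ages2))                               # 9
```

Statistical packages use several slightly different quartile conventions, so library functions may return values that differ a little from the (n + 1) rule used here.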
1.5 Visual display 2 – boxplot
The 5 vertical lines indicate (from L to R) the minimum observation, lower quartile, median,
upper quartile, and maximum observation. The width of the central box is the IQR. The
position of the median line inside the central box and the relative lengths of the whiskers
leading out to the edges give an indication of the degree of symmetry or asymmetry present
in the data set (see skewness below).
1.6 Use of information and robustness
Both the mean x̄ and standard deviation s make full use of the data (they score well on use
of information), but both are highly susceptible to changes at the extremes of the sample (they
score poorly on data resistance – they are not robust to changes in the data). The median
and IQR score less well on use of information, but better on data resistance. The mode scores
very poorly on use of information, but well on data resistance. The range scores very poorly
against both criteria.
For example, add a student aged 66 to the set of data ages1:
new data set (sorted): 18, 18, 19, 21, 23, 66
new mean = 27.5 (which is totally unrepresentative and artificial)
new median = 20 (which is still a useful measure)
new standard deviation = 18.96 (greatly inflated by the single observation 66)
new range = 66− 18 = 48 (greatly inflated by the single observation 66).
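The effect of the extra observation can be checked with Python's standard `statistics` module, whose `stdev` uses the n − 1 divisor and whose `median` averages the two middle values when n is even. This is only a quick numerical check of the figures above.

```python
from statistics import mean, median, stdev

ages1 = [21, 19, 18, 18, 23]
with_outlier = ages1 + [66]         # add the student aged 66

for label, data in [("ages1", ages1), ("ages1 + 66", with_outlier)]:
    print(label,
          "mean", round(mean(data), 2),
          "median", median(data),
          "sd", round(stdev(data), 2),
          "range", max(data) - min(data))
# ages1 mean 19.8 median 19 sd 2.17 range 5
# ages1 + 66 mean 27.5 median 20.0 sd 18.96 range 48
```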
Since the sample mean scores well on use of information but poorly on data resistance, a
compromise measure of level is in use which scores better on data resistance. It is based on the
idea of trimming off the observations at the extremes of the data set and finding the mean of
the remaining observations.
The (5%) trimmed mean is the mean of the observations that remain after removing the
lowest 5% and highest 5% of the observations (approximating the number to be trimmed off
with the nearest integer if necessary).
In the case of the sample of 150 ages (data set ages2) we trim off the lowest 8 observations
(total value 142) and the highest 8 observations (total value 200), leaving 134 observations with
total value 3161− 200− 142 = 2819, so trimmed mean = 2819/134 = 21.04, which in this case
is very close to the (untrimmed) mean (= 21.07).
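A minimal sketch of the trimming rule follows. The function name and the choice to pass the trimming level as a whole-number percentage are our own; the number trimmed from each end is rounded to the nearest integer with halves rounded up, which matches the 7.5 → 8 used above.

```python
def trimmed_mean(data, percent=5):
    """Mean after removing the lowest and highest percent% of the observations."""
    xs = sorted(data)
    n = len(xs)
    # Number trimmed from each end: percent% of n, to the nearest integer
    # (halves rounded up), done in integer arithmetic to avoid float issues.
    k = (percent * n + 50) // 100
    kept = xs[k:n - k]              # assumes at least one observation remains
    return sum(kept) / len(kept)

ages2_freq = {17: 2, 18: 12, 19: 23, 20: 27, 21: 25,
              22: 22, 23: 19, 24: 13, 25: 6, 26: 1}
ages2 = [x for x, f in ages2_freq.items() for _ in range(f)]

print(round(trimmed_mean(ages2), 2))    # 21.04
```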
1.7 Change of origin and scale
A linear transformation corresponds to a change of origin and scale.
yi = a + bxi ⇒ ȳ = a + bx̄, sy = |b|sx ← verify these results
The change of origin (a) and scale (b) both affect the value of the mean. The change of
origin does not affect the standard deviation (which is a measure based on deviations from the
mean).
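One way to verify the two results is to work directly from the definitions of the sample mean and sample variance; a sketch of the algebra:

```latex
\begin{align*}
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} (a + b x_i)
         = a + b\,\frac{1}{n}\sum_{i=1}^{n} x_i = a + b\bar{x}, \\
y_i - \bar{y} &= (a + b x_i) - (a + b\bar{x}) = b\,(x_i - \bar{x}), \\
s_y^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2
       = \frac{b^2}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = b^2 s_x^2
\quad\Longrightarrow\quad s_y = |b|\, s_x .
\end{align*}
```

The constant a cancels in the deviations, which is why the change of origin leaves the standard deviation unchanged.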
For example, consider a set of salaries (x, £) ranging from £20,200 to £78,100, with mean
£36,200 and standard deviation £7,150. Suppose tax is payable at 40% on the amount of a
salary in excess of £12,500. The tax paid on a salary, y say, is given by y = 0.4(x − 12500) =
0.4x − 5000. The mean and standard deviation of the amounts of tax paid on the salaries are
therefore 0.4 × 36200 − 5000 = £9,480 and 0.4 × 7150 = £2,860, respectively. The salary net of
tax is x − y = 0.6x + 5000, so the mean and standard deviation of the salaries net of tax are
0.6 × 36200 + 5000 = £26,720 and 0.6 × 7150 = £4,290, respectively.
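The salary figures can be checked by applying the change-of-origin-and-scale rule to the summary statistics alone. The helper name `transform_summary` is our own; the net-of-tax line uses x − y = 0.6x + 5000.

```python
def transform_summary(a, b, mean_x, sd_x):
    """Mean and standard deviation of y = a + b*x, given those of x."""
    return a + b * mean_x, abs(b) * sd_x

mean_x, sd_x = 36200, 7150                       # salary summary from the text

tax_mean, tax_sd = transform_summary(-5000, 0.4, mean_x, sd_x)
net_mean, net_sd = transform_summary(5000, 0.6, mean_x, sd_x)

print(round(tax_mean, 2), round(tax_sd, 2))      # 9480.0 2860.0
print(round(net_mean, 2), round(net_sd, 2))      # 26720.0 4290.0
```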
1.8 Standardisation
How many standard deviations is xi above/below the mean? The answer gives the standard-
ised value of the observation xi , usually denoted zi.
Standardised value (or standard score): $z_i = \dfrac{x_i - \bar{x}}{s}$ (note $\bar{z} = 0$, $s_z = 1$)
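As a quick numerical check, the sketch below standardises the five ages in ages1 and confirms, up to floating-point rounding, that the standardised values have mean 0 and standard deviation 1.

```python
from statistics import mean, stdev

ages1 = [21, 19, 18, 18, 23]
xbar, s = mean(ages1), stdev(ages1)       # 19.8 and about 2.17

# Number of standard deviations each observation lies above/below the mean.
z = [(x - xbar) / s for x in ages1]
print([round(v, 2) for v in z])           # [0.55, -0.37, -0.83, -0.83, 1.48]

# Mean 0 and standard deviation 1, up to rounding error.
print(abs(mean(z)) < 1e-12, abs(stdev(z) - 1) < 1e-12)    # True True
```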
1.9 Weighted mean
Data: $x_i$ with weights $w_i \ (\geq 0)$, $i = 1, 2, \ldots, k$, where $\sum_{i=1}^{k} w_i = 1$.
Weighted mean = $\sum_{i=1}^{k} w_i x_i$ ← it is a generalisation of the sample mean introduced earlier
(for which $w_i = 1/n$)
The mean of a frequency distribution is an illustration of a weighted mean, with weights
wi = fi/n, the proportion of observations in each cell.
Weighting is often required when we average quantities that are themselves ratios.
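The sketch below illustrates both points. The function divides by the total weight, so frequencies can be passed directly as weights (an implementation convenience of ours, equivalent to using $w_i = f_i/n$). The pass-rate figures are hypothetical, invented only to show why ratios usually need weighting.

```python
def weighted_mean(values, weights):
    """Weighted mean; weights are rescaled so that they sum to 1."""
    total = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total

# Mean of the ages2 frequency distribution, with frequencies as weights.
ages = [17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [2, 12, 23, 27, 25, 22, 19, 13, 6, 1]
print(round(weighted_mean(ages, freqs), 2))        # 21.07

# Hypothetical example: pass rates (ratios) in two classes of different size.
rates = [0.8, 0.2]        # 40 of 50 pass, 30 of 150 pass
sizes = [50, 150]
simple = sum(rates) / len(rates)
overall = weighted_mean(rates, sizes)
print(round(simple, 4), round(overall, 4))         # 0.5 0.35
```

The simple average of the two rates (0.5) overstates the overall pass rate, which is 70 out of 200 (0.35); weighting by class size recovers it.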
1.10 Skewness
In addition to summarising the level and spread of a set of data we may want to summarise
the skewness (degree of asymmetry) of the data. A set of data with positive skew tails off to
the right (in the direction of higher values); a set of data with negative skew tails off to the left
(in the direction of lower values). Positive skew is more common than negative skew. A good
example of positive skew often occurs with insurance claim amounts, which are bounded below
but are often not bounded above.
One measure of skewness for a set of data is given by
measure of skewness = $\dfrac{3\,(\text{sample mean} - \text{sample median})}{\text{sample standard deviation}}$
Another measure (the ‘moment coefficient of skewness’) is
coefficient of skewness = $\dfrac{m_3}{m_2^{3/2}}$, where $m_k = \dfrac{1}{n}\sum (x - \bar{x})^k$
(Note: $m_k$ is called the kth moment about the mean).
These measures are invariant to changes of origin and scale (they have no units of measure-
ment, they are dimensionless). They are zero, positive, or negative according to the type of
skewness present. The histograms below illustrate data sets with (L to R) clear negative skew,
approximate symmetry, and clear positive skew, respectively.
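Both measures are easy to compute. The sketch below applies them to the five ages in ages1 and checks that they are unchanged by a change of origin and scale; the function names are our own, and the rescaling y = 10 + 2x is an arbitrary illustration with positive scale factor.

```python
from statistics import mean, median, stdev

def pearson_skewness(data):
    """3 (sample mean - sample median) / sample standard deviation."""
    return 3 * (mean(data) - median(data)) / stdev(data)

def moment_skewness(data):
    """m3 / m2^(3/2), with m_k the kth moment about the mean (divisor n)."""
    n = len(data)
    xbar = sum(data) / n
    m2 = sum((x - xbar) ** 2 for x in data) / n
    m3 = sum((x - xbar) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

ages1 = [21, 19, 18, 18, 23]
print(round(pearson_skewness(ages1), 2), round(moment_skewness(ages1), 2))   # 1.11 0.61

# Invariance under a change of origin and scale.
rescaled = [10 + 2 * x for x in ages1]
print(round(pearson_skewness(rescaled), 2), round(moment_skewness(rescaled), 2))   # 1.11 0.61
```

Both values are positive, reflecting the longer tail towards the higher ages in this small sample.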
1.11 Other visual displays
Here is a simple, but effective, dotplot of the sample of 150 ages (the data set ages2);
each ‘:’ represents two observations and each ‘.’ represents one.
                     .
                     :    .
                .    :    :
                :    :    :    :
                :    :    :    :    .
                :    :    :    :    :
                :    :    :    :    :
                :    :    :    :    :    .
           :    :    :    :    :    :    :
           :    :    :    :    :    :    :
           :    :    :    :    :    :    :
           :    :    :    :    :    :    :    :
           :    :    :    :    :    :    :    :
      :    :    :    :    :    :    :    :    :    .
-+---------+---------+---------+---------+---------+---
16.0      18.0      20.0      22.0      24.0      26.0   age
Next we have a display called a stem-and-leaf plot of a data set consisting of 200 claim
amounts, ranging from £2,004 to £7,886. Each observation is split into a stem unit (here
£1000) and a leaf unit (here £100) so, for example, the value £2004 is represented as stem
value 2 with leaf value 0. The observation £7,886 is represented as stem value 7 with leaf
value 9. Looked at under a 90° rotation it is very like a histogram, but with more detailed
information.
The decimal point is 3 digits to the right of the |
2 | 0
2 | 779
3 | 1123344
3 | 55666667777788889999
4 | 00001112222233333333344444444
4 | 55555555555666666666667777778888999999999
5 | 00000001111222222222222333333344444444
5 | 555555666666777777788888888899999
6 | 0000001134
6 | 555556667788899
7 | 11
7 | 9
The stems here (representing £2000, £3000, etc) have been split, allowing two rows of display
for each stem unit.
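The construction can be sketched in a few lines. Each value is rounded to the nearest leaf unit and then split into stem and leaf, with each stem split into two rows (leaves 0 to 4 and 5 to 9). The function name is our own, values are assumed to be whole numbers of pounds, and apart from £2,004 and £7,886 the amounts below are hypothetical.

```python
def stem_and_leaf(values, stem_unit=1000, leaf_unit=100):
    """Rows of a split-stem stem-and-leaf plot (empty rows are omitted)."""
    rows = {}
    for v in sorted(values):
        # Round to the nearest leaf unit (halves rounded up), e.g. 7886 -> 79.
        rounded = (v + leaf_unit // 2) // leaf_unit
        stem, leaf = divmod(rounded, stem_unit // leaf_unit)
        half = 0 if leaf < 5 else 1          # first or second row of the stem
        rows.setdefault((stem, half), []).append(leaf)
    return [f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
            for (stem, half), leaves in sorted(rows.items())]

claims = [2004, 2710, 2745, 2890, 7886]      # mostly hypothetical amounts
for row in stem_and_leaf(claims):
    print(row)
# 2 | 0
# 2 | 779
# 7 | 9
```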
1.12 Worked examples
1.1 A random sample of 15 motor windscreen claim amounts (£) is as follows:
121 107 139 72 123 114 215 156 100 136 169 89 115 153 111
For these data n = 15, $\sum x = 1920$, $\sum x^2 = 263214$
Sample mean = 1920/15 = £128.0
Data sorted:
72 89 100 107 111 114 115 121 123 136 139 153 156 169 215
Sample median is the 8th observation when sorted; this is £121
Trimmed mean = (1920− 72− 215)/13 = 1633/13 = £125.62
Lower quartile is 4th observation when sorted; this is £107
Upper quartile is 12th observation when sorted; this is £153
Interquartile range = 153− 107 = £46
Range = 215− 72 = £143
Standard deviation = $\left(\frac{1}{14}\left(263214 - \frac{1920^2}{15}\right)\right)^{1/2}$ = £35.31
1.2 The numbers of occupants in private cars travelling south over the Forth Road Bridge
between 08:00hrs and 09:00hrs one Monday morning were noted for each of a random
sample of 100 such cars. The results are given in the following frequency distribution.
number of occupants x     1    2    3    4    5
number of cars f         72   17    8    2    1
$\sum f = 100$, $\sum fx = 143$, $\sum fx^2 = 269$
Sample mean = 143/100 = 1.43
Sample standard deviation = $\left(\frac{1}{99}\left(269 - \frac{143^2}{100}\right)\right)^{1/2} = 0.81$
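The frequency-distribution formulae can be checked with a short sketch; the function name `freq_summary` is our own.

```python
from math import sqrt

def freq_summary(freq):
    """n, sum of fx, sum of fx^2, mean and sample standard deviation
    for a frequency distribution given as {x: f}."""
    n = sum(freq.values())
    sum_fx = sum(f * x for x, f in freq.items())
    sum_fx2 = sum(f * x * x for x, f in freq.items())
    mean = sum_fx / n
    sd = sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))
    return n, sum_fx, sum_fx2, mean, sd

occupants = {1: 72, 2: 17, 3: 8, 4: 2, 5: 1}
n, sum_fx, sum_fx2, mean, sd = freq_summary(occupants)
print(n, sum_fx, sum_fx2)         # 100 143 269
print(mean, round(sd, 2))         # 1.43 0.81
```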
1.3 Two groups of students sat the same exam. The first group of 64 students scored a mean
mark of 52.297 with standard deviation 7.521; the second group of 42 students scored a
mean mark of 46.571 with standard deviation 7.742.
We calculate the sample mean and standard deviation of the combined group of 106
students.
Note that $\sum x = n\bar{x}$ and $\sum x^2 = (n-1)s^2 + \left(\sum x\right)^2/n$
Group 1: n = 64, $\sum x = 64 \times 52.297 = 3347$, $\sum x^2 = 63 \times 7.521^2 + 3347^2/64 = 178601$
Group 2: n = 42, $\sum x = 42 \times 46.571 = 1956$, $\sum x^2 = 41 \times 7.742^2 + 1956^2/42 = 93551$
Combining the groups we get n = 106, $\sum x = 3347 + 1956 = 5303$, $\sum x^2 = 178601 + 93551 = 272152$
⇒ $\bar{x} = 5303/106 = 50.03$, $s = \left(\frac{1}{105}\left(272152 - \frac{5303^2}{106}\right)\right)^{1/2} = 8.078$
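The combination of two groups from their summary statistics can be sketched as follows. The function names are our own, and the code carries the sums without rounding them to whole numbers, so the last digit can differ slightly from a hand calculation with rounded sums.

```python
from math import sqrt

def sums_from_summary(n, mean, sd):
    """Recover sum of x and sum of x^2 from n, the mean and the sample sd."""
    sum_x = n * mean
    sum_x2 = (n - 1) * sd ** 2 + sum_x ** 2 / n
    return sum_x, sum_x2

def combine(groups):
    """groups: list of (n, mean, sd); returns (n, mean, sd) of the pooled data."""
    n = sum(g[0] for g in groups)
    sums = [sums_from_summary(*g) for g in groups]
    sum_x = sum(s[0] for s in sums)
    sum_x2 = sum(s[1] for s in sums)
    mean = sum_x / n
    sd = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
    return n, mean, sd

n, mean, sd = combine([(64, 52.297, 7.521), (42, 46.571, 7.742)])
print(n, round(mean, 2), round(sd, 2))    # 106 50.03 8.08
```

The combined standard deviation is larger than either group's own because the two group means differ by nearly 6 marks.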
1.4 The heights of 64 male students (in metres, to the nearest cm) are presented in summarised
form below in a grouped frequency distribution. We use the group mid-point to represent
the heights of the students in each group in the calculations and provide the values of fx
and fx2 for each group.
height class interval   mid-point x   frequency f        fx      fx^2
1.50 – 1.54                1.52            2           3.04    4.6208
1.55 – 1.59                1.57            3           4.71    7.3947
1.60 – 1.64                1.62            5           8.10   13.1220
1.65 – 1.69                1.67            8          13.36   22.3112
1.70 – 1.74                1.72           12          20.64   35.5008
1.75 – 1.79                1.77           14          24.78   43.8606
1.80 – 1.84                1.82           10          18.20   33.1240
1.85 – 1.89                1.87            6          11.22   20.9814
1.90 – 1.94                1.92            3           5.76   11.0592
1.95 – 1.99                1.97            1           1.97    3.8809
Total                                     64         111.78  195.8556
Sample mean x̄ = 111.78/64 = 1.747m
Sample standard deviation $s = \left(\frac{1}{63}\left(195.8556 - \frac{111.78^2}{64}\right)\right)^{1/2} = 0.0996$m
Note: for the original raw (ungrouped) data the mean is 1.746m and standard deviation
is 0.0987m.
Alternatively, we can change origin and scale; here we use y = (x− 1.52)/0.05
x         y     f     fy    fy^2
1.52      0     2      0       0
1.57      1     3      3       3
1.62      2     5     10      20
1.67      3     8     24      72
1.72      4    12     48     192
1.77      5    14     70     350
1.82      6    10     60     360
1.87      7     6     42     294
1.92      8     3     24     192
1.97      9     1      9      81
Total          64    290    1564
For the y data:
Sample mean ȳ = 290/64 = 4.5313
Sample standard deviation $s_y = \left(\frac{1}{63}\left(1564 - \frac{290^2}{64}\right)\right)^{1/2} = 1.9918$
Now x = 0.05y + 1.52 and so, for the original x data:
Sample mean x̄ = 0.05× (290/64) + 1.52 = 1.747m
Sample standard deviation s = 0.05× 1.9918 = 0.0996m
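The coded calculation can be reproduced with the sketch below: it works with the integer codes y, then converts back using x = 0.05y + 1.52. Variable names are our own.

```python
from math import sqrt

# Coded mid-points y = (x - 1.52) / 0.05 with their frequencies f.
freq_y = {0: 2, 1: 3, 2: 5, 3: 8, 4: 12, 5: 14, 6: 10, 7: 6, 8: 3, 9: 1}

n = sum(freq_y.values())
sum_fy = sum(f * y for y, f in freq_y.items())
sum_fy2 = sum(f * y * y for y, f in freq_y.items())

mean_y = sum_fy / n
sd_y = sqrt((sum_fy2 - sum_fy ** 2 / n) / (n - 1))

# Undo the change of origin and scale: x = a + b*y.
a, b = 1.52, 0.05
mean_x = a + b * mean_y
sd_x = abs(b) * sd_y

print(n, sum_fy, sum_fy2)                  # 64 290 1564
print(mean_y, round(sd_y, 4))              # 4.53125 1.9918
print(round(mean_x, 3), round(sd_x, 4))    # 1.747 0.0996
```

Working with small integer codes keeps the arithmetic simple by hand; in code the main benefit is that the sums are exact integers.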
1.5 A group of students comprises 58% men and 42% women. The mean height of the men
is 175.2cm and the mean height of the women is 163.7cm. The mean height of the group
overall is
0.58× 175.2 + 0.42× 163.7 = 170.4cm.