2 Representing sample data
It can be difficult to interpret a data set when it is specified as a raw list of
numbers. In order to help us understand the key features of a particular set of
numerical data, it is useful to be able to compute and present summaries. These
summaries can either be numerical or graphical.
2.1 Numerical summaries
2.1.1 Sample mean and median
Measures of location give the data analyst a sense of where the centre of the
data is located. The simplest measure of location is the sample mean, which
is the numerical average of the data. Suppose the n observations are x1, . . . , xn,
then the sample mean is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + \cdots + x_n}{n}.$$
The sample mean can be influenced by the presence of extreme values, or outliers,
in the data set. It may also be an inappropriate measure of location of the
distribution if the data are skewed (see Section 2.2.3). In either case, it may be
more appropriate to use the sample median defined below.
We define the sample median in terms of the order statistics, denoted
x(1), x(2), . . . , x(n). The order statistics are the sample values arranged in in-
creasing order. Thus, x(1) is the smallest value, x(2) is the second smallest value,
and so on.
In the case that n is odd, the sample median is the middle observation of
the data after arranging in increasing order. For even n, it is the average of the
middle two observations after arranging the data in increasing order. In terms
of the order statistics, the sample median is
$$\tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd,} \\[6pt] \tfrac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even.} \end{cases}$$
Example 2.1. For the car battery lifetime data from Chapter 1, n = 40 and so
the median is $\tilde{x} = \tfrac{1}{2}\left(x_{(20)} + x_{(21)}\right) = \tfrac{1}{2}(3.4 + 3.4) = 3.4$. The sample mean and
median can be computed in R easily as follows:
> battery <- read.table("battery.txt", header=TRUE)
> battlife <- battery$life   # these two commands read in the data
> mean(battlife)
[1] 3.4125
> median(battlife)
[1] 3.4
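To connect the built-in function with the definition above, the median can also be computed directly from the order statistics. A minimal sketch, assuming battlife has been read in as above:
> xs <- sort(battlife)           # order statistics x(1), ..., x(40)
> n <- length(xs)                # n = 40, which is even
> (xs[n/2] + xs[n/2 + 1]) / 2    # average of x(20) and x(21)
[1] 3.4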
2.1.2 Sample variance and standard deviation
As well as measures of location, it is important to be able to give measures of
the variability or spread of the data. One of the simplest and most commonly
used measures is the sample variance,
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right),$$
and sample standard deviation, $s = \sqrt{s^2}$, $s > 0$.
Importantly, the sample variance is given by the sum of the squared devia-
tions (xi− x̄)2 from the sample mean divided by the degrees of freedom n−1.
The reason for dividing by (n− 1) rather than n is somewhat technical and will
be discussed in further detail later in the course. For now, a brief explanation is
that there are only (n− 1) independent deviations from the mean. Specifically,
it is true that
$$\sum_{i=1}^{n}(x_i - \bar{x}) = 0,$$
and so once (x1 − x̄), . . . , (xn−1 − x̄) are known the value of (xn − x̄) is fixed.
The importance of this will become clearer in later chapters.
The sample variance is equal to zero only if all observations have the same
value, i.e. if xi = x̄ for i = 1, . . . , n. If the observations are more spread out,
then (xi − x̄)2 will become larger and so too will s2. Thus s2 is a reasonable
measure of spread. An alternative measure of variability is the interquartile
range, discussed in Section 2.1.5.
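In R the sample variance and standard deviation are given by the var() and sd() functions, both of which use the divisor n − 1. The following sketch, assuming the battery data are loaded as before, also verifies numerically that the deviations from the mean sum to zero:
> var(battlife)                                                 # sample variance s^2
> sd(battlife)                                                  # sample standard deviation s
> sum((battlife - mean(battlife))^2) / (length(battlife) - 1)   # agrees with var(battlife)
> sum(battlife - mean(battlife))                                # zero, up to rounding error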
2.1.3 Population quantiles
Suppose that X is a continuous random variable with strictly increasing c.d.f.
FX(x) and p.d.f. fX(x). The population quantile corresponding to probability
p ∈ [0, 1], also known as the population p quantile, is the value x such that
$$P(X \le x) = F_X(x) = p,$$
or equivalently such that
$$\int_{-\infty}^{x} f_X(t)\, dt = p.$$
We use the notation Q(p) to refer to the p quantile defined above, so that
$P\{X \le Q(p)\} = p$. Note that if $F_X^{-1}$ is the inverse function of the c.d.f., then
$$Q(p) = F_X^{-1}(p).$$
The quantile corresponding to p = 0.5 is referred to as the (population)
median, and the 0.25 and 0.75 quantiles are referred to as the population lower
quartile and upper quartile respectively.
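For standard distributions, the inverse c.d.f. $F_X^{-1}$, and hence the population quantile function Q(p), is available in R through the q-family of functions. A brief illustration, using distributions chosen purely as examples:
> qnorm(0.5)            # population median of the standard normal distribution: 0
> qnorm(0.25)           # lower quartile of the standard normal: about -0.674
> qexp(0.5, rate=1)     # median of the Exponential(1) distribution: log(2), about 0.693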
2.1.4 Sample quantiles
Given a random sample X1, . . . , Xn from the distribution FX(x), we wish to be
able to estimate the population quantile Q(p) using the corresponding sample
quantile Q̂(p). There is no single answer as to how to calculate a sample
quantile: Statisticians have proposed several different definitions, which all have
different theoretical properties.
In this course, we define the sample p quantile Q̂(p) to be the observation
in the p(n + 1)th position when the data are arranged in increasing order. If
p(n+ 1) is not an integer, linear interpolation is used to calculate the quantile.
More precisely, the value of Q̂(p) defined above can be calculated as follows using
the order statistics:
1. Calculate r = p× (n+ 1)
2. If r is an integer, set Q̂(p) = x(r)
3. If r < 1, set Q̂(p) = x(1)

4. Otherwise, set Q̂(p) = x(⌊r⌋) + (r − ⌊r⌋)(x(⌊r⌋+1) − x(⌊r⌋)), where ⌊k⌋ denotes the largest integer not exceeding k ∈ R; ⌊k⌋ is known as the floor of k.

Some important special cases are the sample lower quartile, median, and upper quartile, which are the sample 0.25 quantile, 0.5 quantile and 0.75 quantile respectively.

Example 2.2. Suppose that we have the following data set of size n = 19, arranged in ascending order for convenience:

0.04 0.05 0.08 0.20 0.32 0.40 0.43 0.44 0.54 0.62
0.72 0.74 0.74 0.88 0.89 0.90 0.90 0.97 1.00

It is desired to compute the sample lower quartile, which is the sample quantile corresponding to p = 0.25. We have that r = 0.25 × (19 + 1) = 5, which is an integer, and so Q̂(0.25) = x(5) = 0.32.

Sample quantiles can be easily computed in R using the quantile command. For the above example, use the following code:

> X <- c(0.04, 0.05, 0.08, 0.20, 0.32, 0.40, 0.43, 0.44, 0.54, 0.62,
         0.72, 0.74, 0.74, 0.88, 0.89, 0.90, 0.90, 0.97, 1.00)
> quantile(x=X, probs=0.25, type=6)
25%
0.32
R implements 9 different methods for computing the sample quantiles. If the
type=6 argument is omitted, then instead R computes the ‘Type 7’ p quantile,
which is the observation in the [p(n− 1) + 1]th position in the ordered sample,
again using linear interpolation if necessary. The numerical values of the two
types of quantiles are usually different, but will be similar when n is large.
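To make the procedure of Section 2.1.4 concrete, the following sketch implements the p(n + 1) interpolation rule directly and checks it against quantile() with type=6. The function name sample_quantile is our own, not part of base R; the sketch also caps r at 1 and at n so that very small or very large p cannot index outside the sample.
sample_quantile <- function(x, p) {
  xs <- sort(x)                   # order statistics x(1), ..., x(n)
  n <- length(xs)
  r <- p * (n + 1)
  if (r <= 1) return(xs[1])       # below the smallest order statistic
  if (r >= n) return(xs[n])       # above the largest order statistic
  lo <- floor(r)
  xs[lo] + (r - lo) * (xs[lo + 1] - xs[lo])   # linear interpolation; exact when r is an integer
}
sample_quantile(X, 0.25)          # 0.32, matching quantile(x=X, probs=0.25, type=6)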
2.1.5 Five number summary
A simple and convenient description of a dataset is the five number summary,
which is a list consisting of:
1. the sample minimum, x(1)
2. the sample lower quartile, Q̂(0.25)
3. the sample median, Q̂(0.5)
4. the sample upper quartile, Q̂(0.75)
5. the sample maximum, x(n)
The median and quartiles can be calculated using the methods described in the
previous section. Other methods also exist, and thus for a given dataset there is
no unique five number summary. The five number summary can be obtained in
R simply via either the summary() or quantile() commands, e.g.:
> summary(battlife)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.600 3.100 3.400 3.412 3.825 4.700
> quantile(battlife)
0% 25% 50% 75% 100%
1.600 3.100 3.400 3.825 4.700
The 0%, 25%, 50%, 75% and 100% quantiles correspond to the sample minimum,
lower quartile, median, upper quartile and maximum respectively. As before, by
default R computes the Type 7 quartiles. To obtain the Type 6 quartiles defined
in the previous section, use:
> quantile(battlife, type=6)
0% 25% 50% 75% 100%
1.600 3.100 3.400 3.875 4.700
In addition to the sample variance and standard deviation, another measure
of variability is the sample interquartile range, which is the difference between
the sample upper and lower quartiles,
IQR = Q̂(0.75)− Q̂(0.25) .
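In R the interquartile range can be obtained from the quartiles directly, or with the built-in IQR() function (which also accepts a type argument). For the battery data, both approaches give 3.875 − 3.1 = 0.775 when the Type 6 quartiles are used:
> q <- quantile(battlife, probs=c(0.25, 0.75), type=6)
> unname(q[2] - q[1])      # upper quartile minus lower quartile
[1] 0.775
> IQR(battlife, type=6)    # built-in equivalent
[1] 0.775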
2.2 Graphical summaries
2.2.1 Bar chart
A bar chart is used to display the distribution of a sample of values of a qualita-
tive or discrete variable. On the horizontal axis, the bar chart shows the distinct
values of the qualitative variable. The vertical axis shows either the number of
times that value occurs in the data (i.e. the frequency), or the proportion of
times that value occurred. To indicate that the values are different categories,
the bars are positioned so that they do not touch one another. As an example,
consider the opinion poll data from Chapter 1, which is plotted on the bar chart
below. A bar chart is appropriate as the variable of interest is qualitative. The
horizontal axis shows each of the different parties and the vertical axis shows the
number of individuals that intend to vote for that party.
Figure: Bar chart of the opinion poll data
Often it is of interest to estimate the proportion of individuals in the popula-
tion that support a particular party, e.g. Labour. If the sample can be considered
representative, then a reasonable way to estimate the population proportion is
simply to use the proportion of individuals supporting the corresponding party
in the sample. We will discuss the performance of this method further in later
chapters. However, in most real world polls it is clear that the sample obtained
is not representative. In the field of Survey Statistics, more sophisticated esti-
mation methods are used to attempt to adjust for this non-representativeness.
These more sophisticated methods are outside the scope of this course. The bar
chart above can be generated using the following R code:
polldata <- c(369, 314, 75, 118, 124)
names(polldata) <- c("Conservative", "Labour", "Lib Dem", "UKIP", "Other")
barplot(polldata, main="Bar chart of opinion poll data",
        col=c("blue", "red", "gold", "purple", "grey"))

2.2.2 Histogram
A histogram is used to display the distribution of a sample of values of a continuous quantitative variable. To construct a density histogram we use the following procedure:

1. Choose an origin t0 and a bin width h.

2. Use the chosen origin and bin width to define a regular mesh of equally spaced points
$$t_j = t_0 + jh, \qquad j \in \mathbb{Z}.$$
Note that tj is defined for both nonnegative and negative integers, i.e. j = −1, −2, ..., as well as j = 0, 1, 2, ....

3. Define bins Bk = (tk−1, tk], k ∈ Z. Note that the bins are of width h, are disjoint and cover the entire real line.

4. Define the height of the density histogram by
$$\mathrm{Hist}(x) = \frac{\nu_k}{nh}, \qquad \text{for } x \in B_k,$$
where νk is equal to the number of sample values that lie in the bin Bk, i.e. νk is the number of xj, j = 1, ..., n, satisfying xj ∈ Bk.

The histogram is plotted by drawing a rectangular bar corresponding to each interval Bk, with height given by Hist(x). Unlike a bar chart, the bars in a histogram are positioned so that they touch one another, reflecting the continuous nature of the data. Note that the area of a bar in the histogram is proportional to the number of observations in the corresponding bin. The total area enclosed by a density histogram is 1.

Example 2.3. Consider the battery lifetime data from Chapter 1. The density histogram shown below has origin at 3.0 and bin width h = 0.5.

Figure: Histogram of the battery lifetime data (horizontal axis: battery lifetime, 1.5 to 5.0; vertical axis: density, 0.0 to 0.6)

The figure can be obtained using the following R code:

hist(battlife, breaks=seq(from=1.5, to=5, by=0.5), freq=FALSE, col="grey",
     main="Histogram of battery lifetime data", xlab="Battery lifetime")

The breaks argument specifies the location of the points in the mesh, i.e. the bin endpoints. The argument freq=FALSE ensures a density histogram is plotted; otherwise the bar heights are multiplied by nh, giving a frequency histogram whose bar heights are the bin counts νk. The other arguments specify colour options, the title caption and axis labels.

2.2.3 Box plots
An alternative way of displaying a sample of values of a continuous variable is to use a box plot (or box-and-whisker plot). This plot is based on the five-number summary, and has three main components:

• ‘the box’, whose edges are at the sample lower and upper quartiles. The median is also indicated as a horizontal line inside the box.

• ‘the whiskers’, which are lines that extend from the box to the adjacent values, namely the maximum and minimum of the part of the sample remaining after any outliers have been removed.

• ‘outliers’, which are indicated by circles (◦).

Outliers are extreme observations that are considered to be very far from the main body of the data. In general in Statistics, outlier detection can be a challenging problem. For the purposes of drawing box plots, we identify outliers according to the following simple rule of thumb:

• An observation xi is classified as an outlier if xi < Q̂(0.25) − 1.5 × IQR or xi > Q̂(0.75) + 1.5 × IQR.
In other words an observation is classified as an outlier if it is further than
1.5 × IQR from the nearest edge of the box, where IQR is the interquartile
range, Q̂(0.75) − Q̂(0.25). Different software packages may implement different
rules of thumb. The different components of a box plot are labelled below:
Figure: General structure of a box plot
We now illustrate the calculations underlying box plots using an example.
Example 2.4. Consider the battery lifetime data from Chapter 1. The lower
and upper quartiles are 3.1 and 3.875 respectively. The interquartile range is
thus
IQR = 3.875− 3.1 = 0.775 .
Observations are thus classed as outliers if they lie above
Q̂(0.75) + 1.5× IQR = 3.875 + 1.5× 0.775 = 5.0375 ,
or alternatively if they lie below
Q̂(0.25)− 1.5× IQR = 3.1− 1.5× 0.775 = 1.9375 .
There are no data points lying above the upper threshold, and so the upper
adjacent value is equal to the sample maximum, 4.7. However, there are two
observations (1.6, 1.9) below the lower threshold, which are thus classified as
outliers. The lower adjacent value is equal to the minimum of the sample ex-
cluding these outliers, namely 2.2.
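These calculations can be reproduced in R; a short sketch using the Type 6 quartiles from earlier:
> q <- quantile(battlife, probs=c(0.25, 0.75), type=6)   # 3.1 and 3.875
> iqr <- unname(q[2] - q[1])                             # 0.775
> lower <- unname(q[1]) - 1.5 * iqr                      # 1.9375
> upper <- unname(q[2]) + 1.5 * iqr                      # 5.0375
> battlife[battlife < lower | battlife > upper]          # the two outliers, 1.6 and 1.9
> min(battlife[battlife >= lower])                       # lower adjacent value, 2.2
> max(battlife[battlife <= upper])                       # upper adjacent value, 4.7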
The box plot can be obtained in R using the command:
> boxplot(battlife, main="Box plot of battery life data")
The resulting figure is shown below, and agrees approximately with our calcula-
tions. We can see that the distribution of the data appears to be fairly symmetric
around its centre.
Figure: Box plot of the battery life data
It is worth noting that for box plots, R computes the quartiles using neither
the Type 6 nor Type 7 quantiles discussed previously, but instead using another
different method which we will not explain. Again the results should be similar
for large n.
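The values R actually uses when drawing a box plot can be inspected with boxplot.stats(); note that the hinge values it reports are computed by R's own rule and so may differ slightly from the Type 6 or Type 7 quartiles:
> bstats <- boxplot.stats(battlife)
> bstats$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end
> bstats$out     # observations plotted individually as outliers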
Example 2.5. For the income data, we have the box plot below. This indicates
that the distribution of the middle 50% of observations is fairly symmetric about
the median. However, the lower whisker is very short and the upper whisker is
much longer, suggesting that the distribution is skewed. The presence of a dozen
outliers is further evidence of skewness.
Figure: Box plot for the Manchester income data
Box plots can also be used to graphically compare the distribution of two or
more samples of measurements of the same variable. We illustrate this technique
with an example below.
Example 2.6. Consider a data set containing blood plasma β endorphin con-
centrations (pmol/l) for 22 runners who had taken part in the Tyneside Great
North Run one year. Measurements were obtained from 11 runners who success-
fully completed the race, and 11 runners who collapsed near the end of the race.
The data were as follows. The data have been rearranged in increasing order
within each group for convenience.
Group Measurements
Successful 14.2, 15.5, 20.2, 21.9, 24.1, 25.1, 29.6, 29.6, 34.6, 37.8, 46.2
Collapsed 66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414
In order to facilitate comparison it is helpful to put the box plots for the two
groups on the same axes. This can be achieved in R as follows:
> runners_grp <- read.table(file="runners_group.txt", header=TRUE)
> boxplot(pmol~group, data=runners_grp,
          ylab="Beta endorphin concentration (pmol/l)",
          main="Distribution of endorphin concentration in runners")
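If the file runners_group.txt is not to hand, an equivalent data frame can be built directly from the values in the table above; a minimal sketch, using the column names pmol and group assumed by the boxplot() call:
> successful <- c(14.2, 15.5, 20.2, 21.9, 24.1, 25.1, 29.6, 29.6, 34.6, 37.8, 46.2)
> collapsed <- c(66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414)
> runners_grp <- data.frame(pmol = c(successful, collapsed),
                            group = rep(c("Successful", "Collapsed"), each = 11))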
The box plot is shown below. It is clear that the endorphin concentration is
much higher and much more variable among the collapsed runners. In both
groups the distribution looks reasonably symmetric.
Figure: Endorphin concentration in the two groups of runners
Summary and discussion
The techniques described above can also be applied when comparing the distri-
bution of the same measurement from k different groups. Tables can be pre-
sented giving the summary statistics for each group, for example the 5-number
summaries. It may be of interest to compare, for example:
(i) the k medians as measures of the centre of the k distributions
(ii) the k interquartile ranges as measures of spread
(iii) the minimum, maximum and range.
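In R, such group-wise summaries can be computed conveniently with tapply(); for example, for the runners data of Example 2.6 (assuming runners_grp is available as above):
> tapply(runners_grp$pmol, runners_grp$group, median)    # group medians
> tapply(runners_grp$pmol, runners_grp$group, IQR)       # interquartile range within each group
> tapply(runners_grp$pmol, runners_grp$group, summary)   # min, quartiles, mean and max per group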
These features can also be assessed by examining box plots, which are graph-
ical representations of the five-number summary. It is also a good idea to com-
ment on the shape of the distributions, e.g. whether they are symmetric or
skewed. Within each group, symmetry and skewness can be assessed via:
• inspecting if the median is in the centre of the box
• comparing the lengths of the two whiskers.
It is also useful to identify any values highlighted as possible outliers.
We can also compare and contrast group means and standard deviations.
Large differences between the mean and median within a group can also indicate
skewness. Other graphical summaries such as histograms can also be compared
and contrasted across groups. It is useful as a check to confirm that the infor-
mation in the histograms agrees with the findings from the box plots.
The summaries we have discussed in this chapter are applied to a particular
data set (perhaps consisting of multiple groups). However, if we were to repeat
the experiment (or survey) we would obtain a different random sample from
the population, and so also obtain different numerical values for the summaries.
Thus, summary statistics such as the mean are also random variables and have
a sampling distribution. An important part of Statistics is developing an un-
derstanding of these sampling distributions, and harnessing this understanding
to be able to draw reliable inferences about a population from the sample. We
discuss these ideas further in subsequent chapters.