F71SM STATISTICAL METHODS
1 DATA SUMMARY
1.1 Introduction
Data: our data consist of a sample $x = (x_1, x_2, \ldots, x_n)$ from some population (real or conceptual); in the case that every member of the population has the same chance of being included in the sample we have a (simple) random sample.
For example, five students were selected at random from last year’s second year undergraduate class and their ages (years) were recorded, giving us a random sample of five ages: x = (21, 19, 18, 18, 23).
In some situations, the original data may be summarised in a frequency distribution, in which the value $x_i$ occurs with frequency $f_i$, $i = 1, 2, \ldots, k$, with $\sum_{i=1}^{k} f_i = n$. For example, here is a frequency distribution for a random sample of 150 undergraduate students’ ages (years):
x   17  18  19  20  21  22  23  24  25  26
f    2  12  23  27  25  22  19  13   6   1
1.2 Visual display 1 – histogram
A histogram is a display in which we divide the range of the data into a number of intervals (or ‘bins’) and draw rectangles whose areas are proportional to the frequency of data (number of values) in the intervals.
We choose the number and widths of the intervals to give us a clear and informative display. Too many intervals gives us a display which is too detailed and rough; too few intervals gives us a display which is too crude and we lose too much information. See below for histograms of the same set of ages (measured to greater accuracy than before): for most purposes, the left-hand display is too detailed, the middle one is fine, the right-hand one is too crude.
1.3 Summarising level (average, position)
The principal measures of this characteristic of a data set are mean, median, and mode.
Sample mean: $\bar{x} = \dfrac{\text{sum of observations}}{\text{sample size}} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$, or $\dfrac{1}{n}\sum x$
Frequency distribution format: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{k} f_i x_i$, or $\dfrac{1}{n}\sum fx$, or $\dfrac{\sum fx}{\sum f}$
Sample median: middle observation of the sorted data set (mean of the middle two in the case that n is even); that is, the median is the $\tfrac{1}{2}(n+1)$th sorted value, using linear interpolation between values as necessary.
Sample mode: observation with greatest frequency (useful in certain circumstances only).
Sample of 5 ages (data set ages1)
mean = 99/5 = 19.8yrs, sorted data: 18, 18, 19, 21, 23 so median = 19yrs
Sample of 150 ages (data set ages2)
mean = 3161/150 = 21.07yrs, median is 75.5th sorted observation = 21yrs, mode = 20yrs
The sample mean is the value each observation would have if the total was the same and all observations were equal. The sample median is the value with about 50% of the observations lower than it and 50% of the observations higher than it. The sample mode is a ‘typical’ value in the sense of being the ‘most common’ value.
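As a rough check of the figures above, the following Python sketch recomputes the measures of level for ages1 and ages2 using the standard-library statistics module. Expanding the frequency table into a list of 150 raw values is a convenience assumed here, not something the notes prescribe.

```python
from statistics import mean, median, mode

# Data set ages1: the five sampled ages.
ages1 = [21, 19, 18, 18, 23]

# Data set ages2: value x occurs with frequency f.
values = [17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [2, 12, 23, 27, 25, 22, 19, 13, 6, 1]

# Frequency-distribution form of the mean: sum(f * x) / sum(f).
n = sum(freqs)
total = sum(f * x for x, f in zip(values, freqs))
mean2 = total / n

# Expand the table into sorted raw observations for the median and mode.
ages2 = [x for x, f in zip(values, freqs) for _ in range(f)]

mean1 = mean(ages1)
median1 = median(ages1)
median2 = median(ages2)  # n is even, so this averages the 75th and 76th values
mode2 = mode(ages2)

print(mean1, median1)
print(n, total, round(mean2, 2), median2, mode2)
```

For an even sample size, statistics.median averages the two middle values, which agrees with the $\tfrac{1}{2}(n+1)$th-value rule used in these notes.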
1.4 Summarising spread (variability, dispersion, variation)
The principal measures of this characteristic of a data set are standard deviation, inter-quartile range, and range.
Sample standard deviation s is a measure of variability about the mean.
The sum of squared deviations about the mean is $\sum_{i=1}^{n} (x_i - \bar{x})^2$; the sample variance is $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$; the sample standard deviation s is the positive square root of $s^2$ and is in the same units as the data.
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]}$$
Frequency distribution format:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left[\sum f_i x_i^2 - \frac{1}{n}\left(\sum f_i x_i\right)^2\right]}$$
Inter-quartile range (or midspread) IQR = Q3 − Q1, where Q1 and Q3 are the first and third quartiles respectively (values with 25% and 75% of the observations lower than that value, respectively).
The quartiles divide an ordered data set into 4 parts, each containing 25% of the values in the data set. We take Q1 to be the $\tfrac{1}{4}(n+1)$th sorted observation and Q3 to be the $\tfrac{3}{4}(n+1)$th sorted observation, using linear interpolation between values as necessary.
(Quartiles are special cases of quantiles, which divide an ordered data set into a specified number of parts, each containing the same percentage of values in the data set. Percentiles divide the data set into 100 parts. Q1 and Q3 are the 25th and 75th percentiles, respectively.)
Range = highest observation − lowest observation
The sample standard deviation is a measure of the average deviation of the values from the mean – but it measures this in a very particular way – it is the square root of the average of the squared deviations from the mean; hence an alternative name (which is rarely used), the ‘root mean square deviation from the mean.’
The range of values covered by the IQR (from the lower quartile to the upper quartile) includes the central 50% of the observations.
Sample of 5 ages (data set ages1)
standard deviation $s = \left((1979 - 99^2/5)/4\right)^{1/2} = 2.17$yrs, range = 23 − 18 = 5yrs
Sample of 150 ages (data set ages2)
standard deviation $s = \left((67207 - 3161^2/150)/149\right)^{1/2} = 2.00$yrs,
lower quartile Q1 is the 37.75th sorted observation = 19.75
upper quartile Q3 is the 113.25th sorted observation = 23
IQR = 23 − 19.75 = 3.25yrs
range = 26 − 17 = 9yrs
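The spread calculations above can be sketched in Python as follows. The function names (sample_sd, value_at_position, quartiles) are illustrative choices rather than anything defined in the notes, and the interpolation helper assumes the requested position lies between 1 and n.

```python
from math import floor, sqrt

def sample_sd(data):
    """Sample standard deviation via the sum-of-squares shortcut (divisor n - 1)."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

def value_at_position(sorted_data, pos):
    """Value at 1-based position pos, interpolating linearly between neighbours."""
    lower = floor(pos)
    frac = pos - lower
    if frac == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)

def quartiles(data):
    """Q1 and Q3 as the (n+1)/4-th and 3(n+1)/4-th sorted observations."""
    ordered = sorted(data)
    n = len(ordered)
    return (value_at_position(ordered, (n + 1) / 4),
            value_at_position(ordered, 3 * (n + 1) / 4))

ages1 = [21, 19, 18, 18, 23]
values = [17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [2, 12, 23, 27, 25, 22, 19, 13, 6, 1]
ages2 = [x for x, f in zip(values, freqs) for _ in range(f)]

sd1 = sample_sd(ages1)
sd2 = sample_sd(ages2)
q1, q3 = quartiles(ages2)
iqr = q3 - q1
range2 = max(ages2) - min(ages2)

print(round(sd1, 2), round(sd2, 2), q1, q3, iqr, range2)
```

Statistical software uses several different quartile conventions, so other tools may return slightly different values for Q1 and Q3 on the same data.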
1.5 Visual display 2 – boxplot
The 5 vertical lines indicate (from L to R) the minimum observation, lower quartile, median, upper quartile, and maximum observation. The width of the central box is the IQR. The position of the median line inside the central box and the relative lengths of the whiskers leading out to the edges give an indication of the degree of symmetry or asymmetry present in the data set (see skewness below).
1.6 Use of information and robustness
Both the mean $\bar{x}$ and standard deviation s make full use of the data (they score well on use of information), but both are highly susceptible to changes at the extremes of the sample (they score poorly on data resistance – they are not robust to changes in the data). The median and IQR score less well on use of information, but better on data resistance. The mode scores very poorly on use of information, but well on data resistance. The range scores very poorly against both criteria.
For example, add a student aged 66 to the set of data ages1:
new data set (sorted): 18, 18, 19, 21, 23, 66
new mean = 27.5 (which is totally unrepresentative and artificial)
new median = 20 (which is still a useful measure)
new standard deviation = 18.96 (greatly inflated by the single observation 66)
new range = 66 − 18 = 48 (greatly inflated by the single observation 66).
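The effect of the extra observation can be reproduced with a short Python sketch. The summary helper is an illustrative name introduced here; stdev from the statistics module uses the n − 1 divisor, matching the definition of s in these notes.

```python
from statistics import mean, median, stdev

def summary(data):
    """Mean, median, sample standard deviation and range of a data set."""
    return {
        "mean": mean(data),
        "median": median(data),
        "sd": stdev(data),
        "range": max(data) - min(data),
    }

ages1 = [21, 19, 18, 18, 23]
with_outlier = ages1 + [66]  # add the student aged 66

before = summary(ages1)
after = summary(with_outlier)

for name in ("mean", "median", "sd", "range"):
    print(name, round(before[name], 2), "->", round(after[name], 2))
```

Comparing the two summaries side by side shows the median moving by one year while the mean, standard deviation and range all change substantially.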
Since the sample mean scores well on use of information but poorly on data resistance, a compromise measure of level is in use which scores better on data resistance. It is based on the idea of trimming off the observations at the extremes of the data set and finding the mean of the remaining observations.
The (5%) trimmed mean is the mean of the observations that remain after removing the lowest 5% and highest 5% of the observations (approximating the number to be trimmed off with the nearest integer if necessary).
In the case of the sample of 150 ages (data set ages2) we trim off the lowest 8 observations (total value 142) and the highest 8 observations (total value 200), leaving 134 observations with total value 3161 − 200 − 142 = 2819, so trimmed mean = 2819/134 = 21.04, which in this case is very close to the (untrimmed) mean (= 21.07).
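A minimal Python sketch of the trimmed mean follows. The helper names trim_count and trimmed_mean are illustrative, and rounding an exact half upwards (7.5 to 8) is an assumption chosen to match the count of 8 used above; the notes say only "nearest integer".

```python
def trim_count(n, percent=5):
    """Number of observations to trim from each end (nearest integer, halves up)."""
    return (percent * n + 50) // 100

def trimmed_mean(data, percent=5):
    """Mean of the observations left after trimming percent% from each end."""
    ordered = sorted(data)
    k = trim_count(len(ordered), percent)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

values = [17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [2, 12, 23, 27, 25, 22, 19, 13, 6, 1]
ages2 = [x for x, f in zip(values, freqs) for _ in range(f)]

k2 = trim_count(len(ages2))
tm2 = trimmed_mean(ages2)
print(k2, round(tm2, 2))
```

Integer arithmetic is used for the trim count so that the rounding does not depend on floating-point representation of 0.05.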
1.7 Change of origin and scale
A linear transformation corresponds to a change of origin and scale.
$y_i = a + b x_i \;\Rightarrow\; \bar{y} = a + b\bar{x}, \quad s_y = |b|\, s_x$ ← verify these results
The change of origin (a) and scale (b) both affect the value of the mean. The change of origin does not affect the standard deviation (which is a measure based on deviations from the mean).
For example, consider a set of salaries (x, £) ranging from £20,200 to £78,100, with mean £36,200 and standard deviation £7,150. Suppose tax is payable at 40% on the amount of a salary in excess of £12,500. The tax paid on a salary, y say, is given by y = 0.4(x − 12500) = 0.4x − 5000. The mean and standard deviation of the amounts of tax paid on the salaries are therefore 0.4 × 36200 − 5000 = £9,480 and 0.4 × 7150 = £2,860, respectively. The mean and standard deviation of the salaries net of tax are £26,720 and £4,290 respectively (check).
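The transformation rules can be checked numerically with a sketch like the one below. The function transform_summary is an illustrative name; the net-of-tax figures use x − y = 0.6x + 5000, which follows from y = 0.4x − 5000. The raw-data check uses ages1 with arbitrarily chosen constants a = 100 and b = −2 (assumptions made only to exercise the |b| in the formula).

```python
from statistics import mean, stdev

def transform_summary(mean_x, sd_x, a, b):
    """Mean and standard deviation of y = a + b*x, given those of x."""
    return a + b * mean_x, abs(b) * sd_x

# Tax paid: y = 0.4x - 5000.
tax_mean, tax_sd = transform_summary(36200, 7150, a=-5000, b=0.4)

# Salary net of tax: x - y = 0.6x + 5000.
net_mean, net_sd = transform_summary(36200, 7150, a=5000, b=0.6)

# Direct check on raw data with a negative scale factor.
x = [21, 19, 18, 18, 23]
a, b = 100, -2
y = [a + b * xi for xi in x]
rule_mean, rule_sd = transform_summary(mean(x), stdev(x), a, b)

print(round(tax_mean), round(tax_sd), round(net_mean), round(net_sd))
print(mean(y), rule_mean, stdev(y), rule_sd)
```

With b negative the standard deviation is still positive, which is why the rule uses |b| rather than b.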
1.8 Standardisation
How many standard deviations is $x_i$ above/below the mean? The answer gives the standardised value of the observation $x_i$, usually denoted $z_i$.
Standardised value (or standard score): $z_i = \dfrac{x_i - \bar{x}}{s}$ (note $\bar{z} = 0$, $s_z = 1$)
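As an illustration, the sketch below standardises the ages1 sample and checks numerically that the standardised values have mean 0 and standard deviation 1 (up to floating-point error). The variable names are illustrative.

```python
from statistics import mean, stdev

ages1 = [21, 19, 18, 18, 23]

x_bar = mean(ages1)
s = stdev(ages1)  # sample standard deviation, divisor n - 1

# Standardised value: number of standard deviations above (+) or below (-) the mean.
z = [(x - x_bar) / s for x in ages1]

z_mean = mean(z)
z_sd = stdev(z)

print([round(zi, 2) for zi in z])
print(abs(z_mean) < 1e-9, abs(z_sd - 1) < 1e-9)
```

Standardisation is the linear transformation with a = −x̄/s and b = 1/s, so the two properties follow from the change of origin and scale results.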
1.9 Weighted mean
Data: $x_i$ with weights $w_i\ (\ge 0)$, $i = 1, 2, \ldots, k$, where $\sum_{i=1}^{k} w_i = 1$.
Weighted mean = $\sum_{i=1}^{k} w_i x_i$ ← it is a generalisation of the sample mean introduced earlier (for which $w_i = 1/n$)
The mean of a frequency distribution is an illustration of a weighted mean, with weights $w_i = f_i/n$, the proportion of observations in each cell.
Weighting is often required when we average quantities that are themselves ratios.
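A small Python sketch, assuming the ages data from section 1.1, shows the frequency-distribution mean and the ordinary sample mean as two instances of the same weighted mean. The function weighted_mean and its tolerance on the weight total are illustrative choices.

```python
def weighted_mean(xs, ws):
    """Weighted mean of xs with non-negative weights ws that sum to 1."""
    if any(w < 0 for w in ws) or abs(sum(ws) - 1) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * x for x, w in zip(xs, ws))

# Frequency distribution (ages2): weights are the proportions f / n.
values = [17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [2, 12, 23, 27, 25, 22, 19, 13, 6, 1]
n = sum(freqs)
wm_freq = weighted_mean(values, [f / n for f in freqs])

# Raw sample (ages1): equal weights 1 / n recover the ordinary mean.
ages1 = [21, 19, 18, 18, 23]
wm_equal = weighted_mean(ages1, [1 / len(ages1)] * len(ages1))

print(round(wm_freq, 2), round(wm_equal, 2))
```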
1.10 Skewness
In addition to summarising the level and spread of a set of data we may want to summarise the skewness (degree of asymmetry) of the data. A set of data with positive skew tails off to the right (in the direction of higher values); a set of data with negative skew tails off to the left (in the direction of lower values). Positive skew is more common than negative skew. A good example of positive skew often occurs with insurance claim amounts, which are bounded below but are often not bounded above.
One measure of skewness for a set of data is given by
$$\text{measure of skewness} = \frac{3\,(\text{sample mean} - \text{sample median})}{\text{sample standard deviation}}$$
Another measure (the ‘moment coefficient of skewness’) is
$$\text{coefficient of skewness} = \frac{m_3}{m_2^{3/2}}, \quad \text{where } m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k$$
(Note: $m_k$ is called the kth moment about the mean.)
These measures are invariant to changes of origin and scale (they have no units of measurement, they are dimensionless). They are zero, positive, or negative according to the type of skewness present. The histograms below illustrate data sets with (L to R) clear negative skew, approximate symmetry, and clear positive skew, respectively.
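Both skewness measures can be sketched in Python as below, applied here to the ages2 data from section 1.1. The function names are illustrative. The invariance check uses an arbitrary transformation y = 2 + 3x with a positive scale factor (an assumption; a negative factor would reverse the sign of both measures).

```python
from statistics import mean, median, stdev

def pearson_skewness(data):
    """3 * (sample mean - sample median) / sample standard deviation."""
    return 3 * (mean(data) - median(data)) / stdev(data)

def central_moment(data, k):
    """k-th moment about the mean, with divisor n."""
    x_bar = mean(data)
    return sum((x - x_bar) ** k for x in data) / len(data)

def moment_skewness(data):
    """Moment coefficient of skewness m3 / m2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

values = [17, 18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [2, 12, 23, 27, 25, 22, 19, 13, 6, 1]
ages2 = [x for x, f in zip(values, freqs) for _ in range(f)]

skew_a = pearson_skewness(ages2)
skew_b = moment_skewness(ages2)

# Change of origin and (positive) scale leaves both measures unchanged.
rescaled = [2 + 3 * x for x in ages2]
skew_a_rescaled = pearson_skewness(rescaled)
skew_b_rescaled = moment_skewness(rescaled)

print(round(skew_a, 2), round(skew_b, 2))
```

Both values come out positive for ages2, consistent with a distribution that tails off towards the higher ages.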
1.11 Other visual displays
Here is a simple, but effective, dotplot of the sample of 150 ages (the data set ages2)
[Dotplot of the 150 ages (data set ages2); horizontal axis: age, marked at 16.0, 18.0, 20.0, 22.0, 24.0, 26.0]
Next we have a display called a stem-and-leaf plot of a data set consisting of 200 claim amounts, ranging from £2,004 to £7,886. Each observation is split into a stem unit (here £1000) and a leaf unit (here £100) so, for example, the value £2,004 is represented as stem value 2 with leaf value 0. The observation £7,886 is represented as stem value 7 with leaf value 9. Looked at under a 90° rotation it is very like a histogram, but with more detailed information.
The decimal point is 3 digits to the right of the |
2 | 0
2 | 779
3 | 1123344
3 | 55666667777788889999
4 | 00001112222233333333344444444
4 | 55555555555666666666667777778888999999999
5 | 00000001111222222222222333333344444444
5 | 555555666666777777788888888899999
6 | 0000001134
6 | 555556667788899
7 | 11
7 | 9
The stems here (representing £2000, £3000, etc) have been split, allowing two rows of display for each stem unit.
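The stem and leaf split described above can be sketched in Python as follows. Rounding to the nearest leaf unit is inferred from the £7,886 example (leaf 9 rather than 8); the function names are illustrative, and apart from the two endpoints quoted in the text the sample claim amounts are hypothetical values chosen only to demonstrate the layout.

```python
from collections import defaultdict

def stem_and_leaf(value, leaf_unit=100):
    """Split a whole-pound amount into (stem, leaf), rounding to the nearest leaf unit."""
    units = (value + leaf_unit // 2) // leaf_unit  # e.g. 7886 -> 79 hundreds
    return divmod(units, 10)                       # 79 -> stem 7, leaf 9

def split_stem_rows(amounts, leaf_unit=100):
    """Rows of a stem-and-leaf display with each stem split into leaves 0-4 and 5-9."""
    rows = defaultdict(list)
    for amount in sorted(amounts):
        stem, leaf = stem_and_leaf(amount, leaf_unit)
        rows[(stem, leaf >= 5)].append(leaf)
    return [f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
            for (stem, _), leaves in sorted(rows.items())]

# 2004 and 7886 are from the text; the other amounts are hypothetical.
sample_claims = [2004, 2650, 2710, 2949, 3120, 7886]

for row in split_stem_rows(sample_claims):
    print(row)
```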
1.12 Worked examples
1.1 A random sample of 15 motor windscreen claim amounts (£) is as follows:
121 107 139 72 123 114 215 156 100 136 169 89 115 153 111
For these data n = 15, $\sum x = 1920$, $\sum x^2 = 263{,}214$
Sample mean = 1920/15 = £128.0
Data sorted:
72 89 100 107 111 114 115 121 123 136 139 153 156 169 215
Sample median is the 8th observation when sorted; this is £121
Trimmed mean = (1920 − 72 − 215)/13 = 1633/13 = £125.62
Lower quartile is the 4th observation when sorted; this is £107
Upper quartile is the 12th observation when sorted; this is £153
Interquartile range = 153 − 107 = £46
Range = 215 − 72 = £143
Standard deviation = $\left(\frac{1}{14}\left(263214 - \frac{1920^2}{15}\right)\right)^{1/2}$ = £35.31
1.2 The numbers of occupants in private cars travelling south over the Forth Road Bridge between 08:00hrs and 09:00hrs one Monday morning were noted for each of a random sample of 100 such cars. The results are given in the following frequency distribution.
number of occupants x    1   2   3   4   5
number of cars f        72  17   8   2   1
$\sum f = 100$, $\sum fx = 143$, $\sum fx^2 = 269$
Sample mean = 143/100 = 1.43
Sample standard deviation = $\left(\frac{1}{99}\left(269 - \frac{143^2}{100}\right)\right)^{1/2} = 0.81$
1.3 Two groups of students sat the same exam. The first group of 64 students scored a mean mark of 52.297 with standard deviation 7.521; the second group of 42 students scored a mean mark of 46.571 with standard deviation 7.742.
We calculate the sample mean and standard deviation of the combined group of 106 students.
Note that $\sum x = n\bar{x}$ and $\sum x^2 = (n-1)s^2 + (\sum x)^2/n$
Group 1: n = 64, $\sum x = 64 \times 52.297 = 3347$, $\sum x^2 = 63 \times 7.521^2 + 3347^2/64 = 178{,}601$
Group 2: n = 42, $\sum x = 42 \times 46.571 = 1956$, $\sum x^2 = 41 \times 7.742^2 + 1956^2/42 = 93{,}551$
Combining the groups we get n = 106, $\sum x = 3347 + 1956 = 5303$, $\sum x^2 = 178601 + 93551 = 272{,}152$
⇒ $\bar{x} = 5303/106 = 50.03$, $s = \left(\frac{1}{105}\left(272152 - \frac{5303^2}{106}\right)\right)^{1/2} = 8.078$
1.4 The heights of 64 male students (in metres, to the nearest cm) are presented in summarised form below in a grouped frequency distribution. We use the group mid-point to represent the heights of the students in each group in the calculations and provide the values of $fx$ and $fx^2$ for each group.

height class interval   mid-point x   frequency f       fx        fx^2
1.50 – 1.54                 1.52            2           3.04      4.6208
1.55 – 1.59                 1.57            3           4.71      7.3947
1.60 – 1.64                 1.62            5           8.10     13.1220
1.65 – 1.69                 1.67            8          13.36     22.3112
1.70 – 1.74                 1.72           12          20.64     35.5008
1.75 – 1.79                 1.77           14          24.78     43.8606
1.80 – 1.84                 1.82           10          18.20     33.1240
1.85 – 1.89                 1.87            6          11.22     20.9814
1.90 – 1.94                 1.92            3           5.76     11.0592
1.95 – 1.99                 1.97            1           1.97      3.8809
Total                                      64         111.78    195.8556
Sample mean $\bar{x}$ = 111.78/64 = 1.747m
Sample standard deviation $s = \left(\frac{1}{63}\left(195.8556 - \frac{111.78^2}{64}\right)\right)^{1/2}$ = 0.0996m
Note: for the original raw (ungrouped) data the mean is 1.746m and standard deviation is 0.0987m.
Alternatively, we can change origin and scale; here we use y = (x − 1.52)/0.05
x       y    f    fy   fy^2
1.52    0    2     0      0
1.57    1    3     3      3
1.62    2    5    10     20
1.67    3    8    24     72
1.72    4   12    48    192
1.77    5   14    70    350
1.82    6   10    60    360
1.87    7    6    42    294
1.92    8    3    24    192
1.97    9    1     9     81
Total       64   290   1564
For the y data:
Sample mean $\bar{y}$ = 290/64 = 4.5313
Sample standard deviation $s_y = \left(\frac{1}{63}\left(1564 - \frac{290^2}{64}\right)\right)^{1/2} = 1.9918$
Now x = 0.05y + 1.52 and so, for the original x data:
Sample mean $\bar{x}$ = 0.05 × (290/64) + 1.52 = 1.747m
Sample standard deviation s = 0.05 × 1.9918 = 0.0996m
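The two routes through example 1.4 can be compared with a short Python sketch. The function grouped_mean_sd is an illustrative name; the mid-points and frequencies are those of the grouped height table.

```python
from math import sqrt

def grouped_mean_sd(points, freqs):
    """Mean and sample standard deviation from a (grouped) frequency table."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(points, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(points, freqs))
    mean = sum_fx / n
    sd = sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))
    return mean, sd

midpoints = [1.52, 1.57, 1.62, 1.67, 1.72, 1.77, 1.82, 1.87, 1.92, 1.97]
freqs = [2, 3, 5, 8, 12, 14, 10, 6, 3, 1]

# Direct calculation on the mid-points x.
mean_x, sd_x = grouped_mean_sd(midpoints, freqs)

# Coded calculation: y = (x - 1.52) / 0.05 takes the values 0, 1, ..., 9.
coded = list(range(10))
mean_y, sd_y = grouped_mean_sd(coded, freqs)

# Transform back using x = 0.05y + 1.52.
mean_back = 1.52 + 0.05 * mean_y
sd_back = 0.05 * sd_y

print(round(mean_x, 3), round(sd_x, 4))
print(round(mean_back, 3), round(sd_back, 4))
```

The coded version works with small integers, which is the practical reason for changing origin and scale when calculating by hand.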
1.5 A group of students comprises 58% men and 42% women. The mean height of the men is 175.2cm and the mean height of the women is 163.7cm. The mean height of the group overall is
0.58 × 175.2 + 0.42 × 163.7 = 170.4cm.