程序代写代做代考 Excel scheme STAT7001 STATISTICS FOR PRACTICAL COMPUTING —

STAT7001 STATISTICS FOR PRACTICAL COMPUTING —
ASSESSMENT 2 (2016/17 SESSION)

• Your solutions should be your own work and are to be submitted electronically to the
course Moodle page by 12 noon on MONDAY, 24TH APRIL 2017.

• Ensure that you electronically ‘sign’ the plagiarism declaration on the Moodle page
when submitting your work.

• Late submission will incur a penalty unless there are extenuating circumstances (e.g.
medical) supported by appropriate documentation and notified within one week of the
deadline above. Penalties, and the procedure in case of extenuating circumstances, are
set out in the latest editions of the Statistical Science Department student handbooks
which are available from the departmental web pages.

• Failure to submit this in-course assessment will mean that your overall examination
mark is recorded as “non-complete”, i.e. you will not obtain a pass for the course.

• Submitted work that exceeds the specified word count will be penalized. The penalties
are described in the detailed instructions below.

• Your solutions should be your own work. When uploading your scripts, you will be
required to electronically sign a statement confirming this, and that you have read the
Statistical Science department’s guidelines on plagiarism and collusion (see below).

• Any plagiarism or collusion can lead to serious penalties for all students involved,
and may also mean that your overall examination mark is recorded as non-complete.
Guidelines as to what constitutes plagiarism may be found in the departmental student
handbooks: the relevant extract is provided on the ‘In-course assessment 2’ tab on the
STAT7001 Moodle page. The Turn-It-In plagiarism detection system may be used to
scan your submission for evidence of plagiarism and collusion.

• You will receive feedback on your work via Moodle, and you will receive a provisional
grade. grades are provisional until confirmed by the Statistics Examiners’ Meeting in
June 2017.

Background and overview

On 23rd June 2016, a referendum was held in the UK to decide whether or not to remain
part of the European Union (EU). 72% of registered voters took part. Of those, 51.2% voted
to leave the EU, and 48.1% voted to remain.

This result was unexpected, and there has been extensive commentary on the reasons for
it. On 6th February 2017, the BBC News web site carried an article entitled Local voting
figures shed new light on EU referendum (the article is at http://www.bbc.co.uk/news/
uk-politics-38762034). The article is by Martin Rosenbaum, the BBC’s Freedom of
Information specialist. He obtained data from 1070 electoral wards,1 giving the numbers of
‘Leave’ and ‘Remain’ votes cast in each ward.

1An electoral ward is the smallest administrative division for election purposes in the UK, typically with
a population of around 5 500. There are almost 9 500 electoral wards in the UK.

http://www.bbc.co.uk/news/uk-politics-38762034
http://www.bbc.co.uk/news/uk-politics-38762034

In his article, Martin Rosenbaum goes on to calculate some statistical associations between
the proportion of ‘Leave’ votes in a ward, and some of its social, economic and demographic
characteristics according to the most recent UK census which was conducted in 2011. His
article looks at characteristics such as education, age and ethnicity taken individually. What
it does not do, however, is to investigate combinations of these characteristics. This could
be important, because there could be other variables that simultaneously influence (say)
education level and the propensity to vote ‘Leave’, and which thereby create the illusion of
a causal link between them.

The BBC web page provides the voting data that were used in Martin Rosenbaum’s analysis,
but not the census data. However, he has very kindly shared his census data with us — and,
for this in-course assessment, they have been supplemented with some additional information
as well.

The data are provided to you in the CSV file ReferendumResults.csv on the ‘In-course
assessment 2’ tab of the STAT7001 Moodle page. For each of the 1070 electoral wards, this
file provides values of around 45 variables that may be relevant in understanding why people
voted as they did (the Appendix to these instructions gives a full list of variables, along with
other metadata). In addition, for the first 803 wards in the data file, the numbers of ‘Leave’
votes are provided, as well as the total number of votes for ‘Leave’ and ‘Remain’ combined.
For the final 267 wards however, the numbers of ‘Leave’ votes are not provided to you: they
are given as ‘−1’ in the data file.
Your task in this assessment is to use the data on the first 803 wards, to build a statistical
model that will help you to:

• Understand the social, economic and demographic characteristics that are associated
with the voting outcome for a ward; and

• Estimate the proportion of ‘Leave’ votes in each of the 267 wards for which you don’t
have this information.

Detailed instructions

You may use either R or SAS for this assessment.

1. Read the data into your chosen software package, and carry out any necessary recoding
(e.g. to deal with the fact that ‘−1’ represents a missing value).

2. Carry out an exploratory analysis that will help you to start building a sensible sta-
tistical model to explain and predict the proportion of ‘Leave’ votes in a ward. This
analysis should aim to reduce the number of candidate variables to take into the subse-
quent modelling exercise, as well as to identify any important features of the data that
may have some implications for the modelling. You will need to consider the context
of the problem to guide your choice of exploratory analysis. See the ‘Hints’ below for
some ideas.

1

3. Using your exploratory analysis as a starting point, develop a statistical model that
enables you to predict the proportion of ‘Leave’ votes in a ward, based on (a subset of)
the ward characteristics. To be convincing, you will need to consider a range of models
and to use an appropriate suite of diagnostics to assess them. Ultimately however, you
are required to recommend a single model that is suitable for interpretation, and to
justify your recommendation. Your chosen model should be either a linear model, a
generalized linear model or a generalized additive model.

4. Use your chosen model to predict the proportion of ‘Leave’ votes for each of the 267
wards with missing voting data, and also to estimate the standard deviation of your
prediction errors.

Submission for this assessment is electronic, via the STAT7001 Moodle page. You are required
to submit three files, as follows:

• A report on your analysis, not exceeding 2 000 words of text (including titles, footnotes,
appendices etc.) plus two pages of graphs and / or tables. Your report should be in
three sections, as follows:

I Describe briefly what aspects of the problem context you considered at the out-
set, how you used these to start your exploratory analysis, and what were the
important points to emerge from this exploratory analysis.

II Describe briefly (without too many technical details) what models you considered
in step (3) and why you chose the model that you did.

III State your final model clearly, summarise what your model tells you about the
characteristics associated with the proportion of ‘Leave’ votes, and discuss any
potential limitations of the model.

Your report should not include any computer code. It should include some graphs and
/ or tables, but only those that support your main points.

You are free to choose your own file name for the electronic submission of your report.

• An R script or SAS program corresponding to your analysis and predictions. Your
script /program should run without user intervention on any computer with R or SAS
installed, providing the file ReferendumResults.csv is present in the current working
directory / current folder. When run, it should produce any results that are mentioned
in your report, together with the predictions and the associated standard deviations.
The script / program should be named ########_ICA2.r or ########_ICA2.sas as
appropriate, where ######## is your student ID number. For example, if your ID num-
ber is 150123456 and you are using R, your script should be named 150123456_ICA2.r.

• A text file containing your predictions for the 267 wards with missing voting data.
This file should be named ########_pred.dat, where ######## is your student ID
number. The file should contain three columns, separated by spaces and with no
header. The first column should be the ward identifier (corresponding to variable ID in

2

file ReferendumResults.csv; the second should be the predicted proportion of ‘Leave’
votes for that ward, and the third should be the standard deviation of your prediction
error.

Marking criteria

There are 75 marks for this exercise. These are broken down as follows:

Report: 40 marks. The marks here are for: displaying awareness of the context for the
problem and using this to inform the statistical analysis; good judgement in the choice
of exploratory analysis and in the model-building process; a clear and well-justified
argument; clear conclusions that are supported by the analysis; and appropriate choice
and presentation of graphs and / or tables. The mark breakdown is as follows:

Awareness of context: 5 marks.

Exploratory analysis: 10 marks. These marks are for (a) tackling the problem in
a sensible way that is justified by the context (b) carrying out analyses that are
designed to inform the subsequent modelling.

Model-building: 10 marks. The marks are for (a) starting in a sensible place that
is justified from the exploratory analysis (b) appropriate use of model output and
diagnostics to identify potential areas for improvement (c) awareness of different
modelling options and their advantages and disadvantages (d) consideration of
the social, economic and demographic context during the model-building process.

Quality of argument: 5 marks. The marks are for assembling a coherent ‘narra-
tive’, for example by drawing together the results of the exploratory analysis so
as to provide a clear starting point for model development, presenting the model-
building exercise in a structured and systematic way and, at each stage, linking
the development to what has gone before.

Clarity and validity of conclusions: 5 marks. These marks are for stating clearly
what you have learned about the social, economic and demographic characteris-
tics that are related to the voting outcome in a ward, and for ensuring that this
is supported by your analysis and modelling.

Graphs and / or tables: 5 marks. Graphs and / or tables need to be relevant,
clear and well presented (for example, with appropriate choices of symbols, line
types, captions, axis labels and so forth). There is a one-slide guide to ‘Using
graphics effectively’ in the slides / handouts for Lecture 1 of this year’s course.
Note that you will only receive credit for any graphs in your report if your
submitted script / program generates these graphs, appropriately labelled, when
it is run.

Note that you will be penalised if your report exceeds EITHER the specified 2 000-word
limit or the number of pages of graphs and / or tables. Following the UCL guidelines
at https: // www. ucl. ac. uk/ srs/ academic-manual/ c4/ failure/ word-count ,

3

https://www.ucl.ac.uk/srs/academic-manual/c4/failure/word-count

the maximum penalty is 7 marks, and no penalty will be imposed that takes the fi-
nal mark below 30/75 if it was originally higher. Subject to these conditions, penalties
are as follows:

• More than two pages of graphs and / or tables: zero marks for graphs and / or
tables, in the marking scheme given above.

• Exceeding the word count by 10% or less: mark reduced by 4.
• Exceeding the word count by more than 10%: mark reduced by 7.

In the event of disagreement between reported word counts on different software sys-
tems, the count used will be that from the examiner’s system.

Coding: 15 marks. There are 3 marks here for reading the data and handling missing
values correctly; 7 marks for effective use of your chosen software (e.g. programming
efficiently and correctly); and 5 marks for clarity of your code — commenting, layout,
choice of variable / object names and so forth.

Prediction quality: 20 marks. The remaining 20 marks are for the quality of your pre-
dictions. Note, however, that you will only receive credit for your predictions if they
correspond to results produced by your submitted script / program when it is run: if
this is not the case, your predictions will earn zero marks.

For these marks, you are competing against each other. Your predictions will be as-
sessed using the following score:

S =
267∑
i=1

[
log σi +

(Yi − p̂i)
2

2σ2i
,

]

where:

Yi is the actual proportion of ‘Leave’ votes for the ith prediction;

p̂i is your predicted proportion of ‘Leave’ votes for the ith prediction;

σi is your quoted prediction error standard deviation for the ith prediction.

The score S is an approximate version of a proper scoring rule, which is designed to
reward predictions that are close to the actual observation and are also accompanied by
an accurate assessment of uncertainty (this was discussed during the Week 10 lecture,
along with the rationale for using this score for the assessment). Low values are better.
The scores of all of the students in the class (and the lecturer) will be compared:
students with the lowest scores will receive all 20 marks, whereas those with the highest
scores will receive fewer marks. The precise allocation of marks will depend on the
distribution of scores in the class.

If you don’t supply standard deviations for your prediction errors, the value of σi will be
taken as 1/2 for all of your predictions: this is the largest possible standard deviation
for any random variable taking values between 0 and 1, and the value of S will be
correspondingly large so that you will receive few if any marks for your predictions.

4

STAT7001 Assessment 2 — Hints

1. There is not a single ‘right’ answer to this assignment. There is a huge range of options
available to you, and many of them will be sensible.

2. You are being assessed not only on your computing skills, but also on your ability to
carry out an informed statistical analysis: material from other statistics courses (in
particular STAT2002, for students who have taken it) will be relevant here. To earn
high marks, you need to take a structured and critical approach to the analysis and to
demonstrate appropriate judgement in your choice of material to present.

3. At first sight, the task will appear challenging to many of you. However, there is a lot
that we already know: Martin Rosenbaum’s article is an obvious starting point. You
may also want to search for other commentaries on the UK referendum result, to gain
some understanding of what kinds of relationships you might look for in the data.

4. When building your model, you have two main decisions to make. The first is: should
it be a linear, generalized linear or generalized additive model? The second is: which
covariates should you include? You might consider the following points:

Linear, generalized linear or generalized additive? This is best broken down into
two further questions, as follows:

• Conditional on the covariates, can the response variable be assumed to fol-
low a normal distribution with constant variance? In this assignment, the
response variable is a proportion and therefore cannot have exactly a normal
distribution. However, there are thousands of votes in each ward: the Central
Limit Theorem may apply, therefore, so that the response distribution has
approximately a normal distribution — in which case you may judge that
the approximation is adequate for your purposes. However, the ‘constant
variance’ assumption may be more problematic: given that the response is
a proportion, you might think that a binomial distribution would be appro-
priate, but the variance of a binomial proportion is p(1 − p)/n in an obvious
notation. Since this depends on p, and p varies between wards, the variance
cannot be constant. Whether this is a problem depends on how much the
‘Leave’ probability p varies: if it doesn’t vary much, then you may wish to
claim that the variance is approximately constant. If it varies a lot however,
then you could probably improve your predictions (and hence your score!) by
accounting for it. You might consider using your exploratory analysis to gain
some preliminary insights into this point.

• Are the covariate effects best represented parametrically or nonparametrically?
Again, your exploratory analysis can be used to gain some preliminary in-
sights into this. You may want to look at the material from week 6, for
examples of situations where a nonparametric approach is needed.

Which covariates? The data file contains around 45 potential covariates, and many
of them may be highly correlated. It will not be possible for you to carry out a

5

modelling exercise involving all of them, so you need to reduce this number before
you start. You have several options here. Some of them are as follows:

• Look at other published commentaries on the referendum result. What mea-
sures are considered useful? Can these be linked to covariates for which you
have information? Obviously, if you do this then you will need to acknowledge
your sources in your report.

• Define useful summary measures on contextual grounds, and work with these.
For example, 16 of the potential covariates in the data file are percentages of
the population in different age categories (0 to 4, 5 to 7, . . ., all the way up
to ‘90 plus’). You may decide just to work with ‘young voters’ (18 to 29 —
18 is the minimum voting age in the UK), ‘working age’ (30 to 64 say) and
‘retirement age’ (65 and above). Or, indeed, to adopt your own categories
— the results are unlikely to be sensitive to the precise definitions. Similar
comments apply to the potential covariates representing ethnicity, household
deprivation and so on.

• Define new variables based on the correlations between the existing variables,
and work with these. If several variables are highly correlated, then it is
difficult to disentangle their effects and it may be preferable to work with a
single ‘index’ that combines all of them. This is the basis of techniques such
as Principal Components Analysis, that were discussed during the Week 10
lecture (along with how to implement them in R and SAS).

You should not start to build any models until you have reduced the number
of candidate covariates to a manageable number. Your decisions here should be
guided by your exploratory analysis, as well as your understanding of the context.

5. Don’t forget to look for interactions! For example, one of the variables in the data set
is RegionName, which is a factor (i.e. categorical covariate) indicating the UK region
in which each ward is located. Possibly there is regional variation in the strength of
dependence between other characteristics and the proportion of ‘Leave’ votes. Look at
the analysis of the iris data from Workshop 2, for a similar kind of situation.

6. If you use a linear model, it is straightforward to obtain the standard deviations of
your prediction errors using either R or SAS (look at the material in Workshops 2 and 9
respectively, to find out how to do it). However, for generalized linear and generalized
additive models you need some additional computations. Specifically:

(a) Suppose p̂i is your predicted probability of voting ‘Leave’ for the ith ward, and
that Yi is the actual proportion of ‘Leave’ voters as above.

(b) Then your prediction error will be Yi − p̂i.
(c) Yi and p̂i are independent, because p̂i is computed using only information from

the first 803 wards and Yi relates to one of the ‘new’ wards.

(d) The variance of your prediction error is thus equal to Var (Yi) + Var (p̂i).

6

(e) You can calculate the standard error of p̂i in both R and SAS, when making
predictions for new observations — see Workshops 6 and 9. Squaring this standard
error gives you Var (p̂i).

(f) You can estimate Var (Yi) by plugging in the appropriate formula for your chosen
distribution — for example, if you’re using a binomial distribution then V̂ar (Yi) =
p̂i(1 − p̂i)/ni, where ni is the number of votes for the ith ward.

(g) Hence you can estimate the standard deviation of your prediction error as σ̂i =√
V̂ar (Yi) + Var (p̂i).

Appendix: the ReferendumResults.csv data set

Data sources

The data provided in ReferendumResults.csv are from several different sources, as follows:

Martin Rosenbaum’s BBC article (data source MR1)

The article at http://www.bbc.co.uk/news/uk-politics-38762034 provides an Excel spread-
sheet, containing localised voting data supplied to the BBC by councils which counted the
EU referendum. Results are provided for all individual wards where data were available at
this level of detail: there are 1 283 such wards, of a total of 9 291 wards in the UK. Reasons
for the figures not being available at the remaining wards are:

• Three councils did not respond to the BBC’s request.

• Some councils refused to give the information to the BBC.

• For some councils, ballot boxes were mixed before counting so it was not possible to
identify the precise numbers of votes in each ward.

Important caveat: in many wards, some postal votes were mixed in prior to counting.
The BBC spreadsheet states “Figures which include postal votes cannot be treated as exact.
However broad patterns can still be identified in the data.”

Martin Rosenbaum’s data set from the 2011 UK census (data source MR2)

The variables in this data set are those that form the basis for the analysis reported in Martin
Rosenbaum’s article. He provided the following information when supplying the data to us:

7

http://www.bbc.co.uk/news/uk-politics-38762034

‘All the 2011 census data was downloaded via selecting datasets at https: //
www. nomisweb. co. uk/ query/ select/ getdatasetbytheme. asp? opt= 3& theme=

&subgrp= . (I calculated adult mean age from the raw counts of adults of each
age in each ward). Please note that some areas have seen boundary changes to
wards since 2011, so some wards with referendum voting data do not figure in
this list.’

This spreadsheet contains data for 8 570 wards.

The UK Register of Geographic Codes (data source RGC)

Geographical information for each ward was obtained from the UK ‘Register of Geographic
Codes’, downloaded on 14th March 2017 and located by searching at http://geoportal.
statistics.gov.uk/.

UK age structure by ward, from the 2011 UK census (data source ASW)

Martin Rosenbaum’s census data contains information on the mean adult age in each ward,
but it is possible that more detailed information on age profiles would be useful. Percent-
agess of population in different age bands were obtained for the 2011 census, from https://
www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?collapse=yes under Cen-
sus 2011, Key Statistics and then Age Structure. This provides information on the same 8 570
wards that are present in data source MR2.

Data processing

The data sources have been combined in the following way to create ReferendumResults.csv:

1. The spreadsheets from sources MR1, MR2 and ASW were merged using the nine-digit
ward identification code (identified as WardCode in MR1). The Remain variable, giving
the number of ‘Remain’ votes in each ward, was replaced by an NVotes column giving
the total number of ‘Leave’ and ‘Remain’ votes. There are 1 070 wards remaining after
this merge: this decrease from the original 1 283 wards in MR1 is due to the exclusion
of wards for which the boundaries changed between the 2011 census and the 2016
referendum.

2. Source RGC was used to identify the administrative area type and region name for
each ward, again based on its nine-digit identification code. Some ward codes were
found to be duplicated in source RGC, but in all cases the administrative area type
and region name were identical for the duplicates.

3. The rows of the data table were randomly shuffled, so that the order of wards no
longer corresponds to that in any of the data sources. This was done in order to
prevent ‘cheating’ when making predictions.

8

https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?opt=3&theme=&subgrp=
https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?opt=3&theme=&subgrp=
https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?opt=3&theme=&subgrp=
http://geoportal.statistics.gov.uk/
http://geoportal.statistics.gov.uk/
https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?collapse=yes
https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?collapse=yes

4. A subset of 267 wards was identified for the ‘prediction’ part of the assessment. This
was done in such a way that the distributions of all of the covariates in this subset is
very similar to the distribution in the remaining 803 wards. Specifically:

(a) In each region, 25% of the wards were sampled at random as candidates for making
predictions. This sample will be referred to as ‘Group 2’ below, with ‘Group 1’
comprising the remaining wards.

(b) For each of the numeric covariates in the data set, a Kolmogorov-Smirnov test was
performed to test the null hypothesis that the underlying distributions in Groups
1 and 2 are the same.

(c) The prediction sample was accepted only if the p-values for all of the Kolmogorov-
Smirnov tests were greater than 0.5 (this is not a typo). Otherwise, a new candi-
date sample was drawn in step (a) and the procedure was repeated.

The Kolmogorov-Smirnov test is used here as a convenient way to measure whether two
distributions are similar: the use of a high p-value threshold is chosen to ensure that
the resulting Groups 1 and 2 are very well balanced with respect to all of the covariate
values. Note, however, that the voting numbers were not included in this balancing
exercise: this is because the performance of predictions would be artificially enhanced
if the voting numbers were included (for example, we would know that the distribution
of Group 2 voting proportions is similar to that of Group 1 proportions). Note also
that no attempt has been made to balance the groups in terms of combinations of the
covariates.

5. The ‘Group 2’ rows were placed at the end of the data table, with their ‘Leave’ vote
numbers set to −1; a new ID variable was created so that each ward has an ID number
between 1 and 1070; the original nine-digit ward identification code was removed so that
the wards cannot be identified easily using online information; a few other redundant
covariates were removed (e.g. where one variable was the sum of two or more others);
some variables were renamed for ease of interpretation; and the columns were re-ordered
for convenience.

6. Some (but not all) of the covariates were multiplied by random numbers close to 1. This
makes no difference to any models that you fit, because the corresponding regression
coefficients will scale correspondingly; but it makes it even more difficult for you to
identify the wards using online information.

Description of variables

This section gives a brief description of each of the variables in ReferendumResults.csv,
and an indication of which data source it came from. Descriptions are compiled on the basis
of correspondence with Martin Rosenbaum and online documentation, including the UK
Census User Guide (notably Part 4 of the ‘Variables and Classifications’ section, together
with the Glossary) at https://www.ons.gov.uk/census/2011census/2011censusdata/
2011censususerguide/.

9

https://www.ons.gov.uk/census/2011census/2011censusdata/2011censususerguide/
https://www.ons.gov.uk/census/2011census/2011censusdata/2011censususerguide/

Variable name Source Description

ID — Ward ID number, from 1 to 1 070

AreaType RGC

Type of administrative area in which the ward is situ-
ated. This takes one of four values:

E06: Unitary authorities

E07: Non-metropolitan districts

E08: Metropolitan districts

E09: London boroughs

This is included because E08 and E09 indicate a large
urban area.

RegionName RGC

Name of the region within which the ward is situ-
ated. The possible values are North East, North
West, Yorkshire and The Humber, East Midlands,
West Midlands, East of England, London, South
East and South West.

NVotes MR1 Total number of votes (‘Leave’ plus ‘Remain’)

Leave MR1 Number of ‘Leave’ votes

Postals MR1
Indicates whether postal votes were mixed in with the
data prior to counting. P indicates yes, NP indicates no.

Residents MR2 Number of permanent residents

Households MR2 Number of households

MeanAge ASW Mean age of permanent residents

AdultMeanAge MR2 Mean age of adult permanent residents

Age_0to4 ASW % of permanent residents aged 0–4

Age_5to7 ASW % of permanent residents aged 5–7

Age_8to9 ASW % of permanent residents aged 8–9

Age_10to14 ASW % of permanent residents aged 10–14

Age_15 ASW % of permanent residents aged 15

Age_16to17 ASW % of permanent residents aged 16–17

Age_18to19 ASW % of permanent residents aged 18–19

Age_20to24 ASW % of permanent residents aged 20–24

Age_25to29 ASW % of permanent residents aged 25–29

Age_30to44 ASW % of permanent residents aged 30–44

Continued on next page . . .

10

. . . continued from previous page
Variable name Source Description

Age_45to59 ASW % of permanent residents aged 45–59

Age_60to64 ASW % of permanent residents aged 60–64

Age_65to74 ASW % of permanent residents aged 65–74

Age_75to84 ASW % of permanent residents aged 75–84

Age_85to89 ASW % of permanent residents aged 85–89

Age_90plus ASW % of permanent residents aged 90 and above

White MR2 % of permanent residents self-identifying as white

Black MR2 % of permanent residents self-identifying as black

Asian MR2 % of permanent r