STAT7001 STATISTICS FOR PRACTICAL COMPUTING — ASSESSMENT 2 (2017/18 SESSION)

  • Your solutions should be your own work and are to be submitted electronically to the course Moodle page by 12 noon on MONDAY, 23RD APRIL 2018.
  • Ensure that you electronically ‘sign’ the plagiarism declaration on the Moodle page when submitting your work.
  • Late submission will incur a penalty unless there are extenuating circumstances (e.g. medical) supported by appropriate documentation and notified within one week of the deadline above. Penalties, and the procedure in case of extenuating circumstances, are set out in the latest editions of the Statistical Science Department student handbooks which are available from the departmental web pages.
  • Failure to submit this in-course assessment will mean that your overall examination mark is recorded as “non-complete”, i.e. you will not obtain a pass for the course.
  • Submitted work that exceeds the specified word count will be penalized. The penalties are described in the detailed instructions below.
  • Your solutions should be your own work. When uploading your scripts, you will be required to electronically sign a statement confirming this, and that you have read the Statistical Science department’s guidelines on plagiarism and collusion (see below).
  • Any plagiarism or collusion can lead to serious penalties for all students involved, and may also mean that your overall examination mark is recorded as non-complete. Guidelines as to what constitutes plagiarism may be found in the departmental student handbooks: the relevant extract is provided on the ‘In-course assessment 2’ tab on the STAT7001 Moodle page. The Turn-It-In plagiarism detection system may be used to scan your submission for evidence of plagiarism and collusion.
  • You will receive feedback on your work via Moodle, together with a provisional grade. Grades are provisional until confirmed by the Statistics Examiners’ Meeting in June 2018.

    Background and overview

    In the European Union (EU), car manufacturers are required to limit the carbon dioxide (CO2) emissions from the vehicles that they sell. This is due to the fact that CO2 is the most significant of the greenhouse gases contributing to human-induced climate change, and road vehicles are responsible for a substantial proportion of total emissions.1

    Since 2009, each EU member state has been required to provide emissions information on each new car registered in its territory. A set of annual datasets, compiled from this information, is available from the web site of the European Environment Agency (click the blue text to follow the link). These data are used by the European Commission to calculate the average emissions of CO2 from new passenger cars, and to set emissions targets for car manufacturers. Each manufacturer’s target is based on the total number of cars that they sell, thus allowing them to make a few high-emissions vehicles if they wish to do so, providing they offset this with a large number of low-emissions vehicles. For a useful summary of the regulation, see the UK Vehicle Certification Agency explanatory booklet (again, click on the blue text).

1 See http://www.dft.gov.uk/vca/fcb/cars-and-carbon-dioxide.asp for example.

Car manufacturers are also required to limit emissions of other pollutants including nitrogen oxides (NOx), with the emissions of each new vehicle calculated on the basis of laboratory tests. In September 2015, the manufacturer Volkswagen (which owns other makes including Audi, Porsche, Seat and Škoda) was found to have installed software in its new diesel engines that automatically limited NOx emissions to unrealistically low levels in laboratory tests (see the relevant Wikipedia article for more on this). Other car manufacturers were subsequently discovered to have been doing the same thing. Since this came to light, there has been pressure on manufacturers to disable such ‘defeat devices’, although the extent to which this has been implemented is not yet clear.

On the ‘In-course assessment 2’ tab of the STAT7001 Moodle page, you will find a CSV file called EmissionsData.csv which contains a slightly modified subset of the EU CO2 emissions data since 2010. The file contains 21 156 records (i.e. rows of data): each record contains data from one EU member state, on an individual vehicle type, for a specified year. The years are numbered 1, 2 and 3: these are in chronological order, with years 1 and 2 predating the Volkswagen emissions scandal and year 3 postdating it (I’m not telling you exactly which years they represent!). The first 14 104 records contain information on CO2 emissions (in grams per kilometre) along with other information about the vehicle manufacturer, model, mass, engine size and power, and other potentially relevant features: full details can be found in the Appendix to these instructions. For the final 7 052 records, however, the CO2 emissions figures are not provided: they are given as ‘−1’.

Your task in this assessment is to carry out some data preprocessing and then to use the data from the first 14 104 records, to build a statistical model that will help you to:

  • Understand the variation in CO2 emissions between vehicles and over time; and
  • Estimate the CO2 emissions for each of the 7 052 records where you don’t have this information.

    Detailed instructions

    You may use either R or SAS for this assessment.
    1. Read the data into your chosen software package, and carry out any necessary recoding and preprocessing. Some examples of this include:

    • For the CO2 variable, missing values are represented by −1.
    • One of the variables, TechnolType, contains character codes for any emissions-reducing ‘innovative technologies’ on a vehicle if these are present; this is supplemented by variable ITReduction which gives, for any vehicle with such innovative technologies, the expected reduction in CO2 emissions (in grams per kilometre). For vehicles with no innovative technologies, the variable TechnolType is blank and ITReduction is recorded as −1. You need to think about a sensible way to handle these two variables, if you plan to use them.

    • The variable FuelType gives the type of fuel for the vehicle. However, the data have been compiled from different sources and the fuel type has not been entered consistently. To see this, look at the following table (produced in R):

  > table(EmissionsData$FuelType)

        Biodiesel          diesel          Diesel          DIESEL
                5             189            5985            4472
  Diesel-electric             E85        Electric        ELECTRIC
                6              44              10               6
              LPG   NG-biomethane   NG-Biomethane   NG-BIOMETHANE
              229             125               1              33
           petrol          Petrol          PETROL Petrol-electric
              195            5665            4022              73
  Petrol-Electric PETROL-ELECTRIC
               23              73

      You will need to figure out what to do about this (there are other similar examples in the data set).

    • The data were originally compiled from reports submitted separately by each individual member state. It is possible, therefore, that different member states submitted reports for exactly the same car model in any given year: these will lead to identical records in the data unless the member states carry out their own emissions inspections, which could potentially lead to different values of the CO2 variable for the same car model. You should look into this, and consider how best to deal with any duplicated values. Note, however, that no car model appears in more than one year in the data provided to you: each model appears only for the earliest year in which it appeared.

    There may be other examples as well: you need to check all of the variables carefully and ensure that you understand them, before starting your analysis.2

    2. Carry out an exploratory analysis that will help you to start building a sensible statistical model to understand and predict the CO2 emissions for each vehicle. This analysis should aim to identify an appropriate set of candidate variables to take into the subsequent modelling exercise, as well as to identify any important features of the data that may have some implications for the modelling. You will need to consider the context of the problem to guide your choice of exploratory analysis. See the ‘Hints’ below for some ideas.

2 This preliminary ‘data screening’ is a vital part of any real-world statistical analysis. In case you think I should have done it for you: I just spent three full days cleaning up the original data downloaded from the EU web site, and I left only the easy bits for you to do!
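A minimal R sketch of the kind of recoding that step 1 calls for is given below. It uses a small made-up data frame in place of EmissionsData.csv. The column HasInnovTech, the choice of zero for ITReduction where no technology is fitted, and the rule used to flag duplicates are illustrative assumptions, not recommended answers.

```r
# Toy stand-in for EmissionsData.csv (all values are made up)
emis <- data.frame(
  ID          = 1:6,
  MemberState = c("DE", "FR", "DE", "IT", "DE", "FR"),
  FuelType    = c("diesel", "DIESEL", "Petrol", "PETROL-ELECTRIC",
                  "Petrol-electric", "NG-Biomethane"),
  CO2         = c(120, 120, 135, -1, 49, -1),
  TechnolType = c("", "", "e1 10 8", "", "", ""),
  ITReduction = c(-1, -1, 1.5, -1, -1, -1),
  stringsAsFactors = FALSE
)

# Recode the -1 placeholder in CO2 as a proper missing value
emis$CO2[emis$CO2 == -1] <- NA

# One possible treatment of the technology variables (an assumption):
# a yes/no indicator, and a reduction of zero where nothing is fitted
emis$HasInnovTech <- emis$TechnolType != ""
emis$ITReduction[!emis$HasInnovTech] <- 0

# Harmonise spelling variants of FuelType by trimming and upper-casing
emis$FuelType <- factor(toupper(trimws(emis$FuelType)))

# Flag records that agree on every column except ID and MemberState
key_cols <- setdiff(names(emis), c("ID", "MemberState"))
emis$IsDuplicate <- duplicated(emis[, key_cols])

table(emis$FuelType)
```

Whether flagged duplicates should be dropped, averaged or kept is a modelling decision that the sketch deliberately leaves open.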

  3. Using your exploratory analysis as a starting point, develop a statistical model that enables you to predict the CO2 emissions for any vehicle based on (a subset of) its characteristics, and also to understand the variation of CO2 between vehicles. To be convincing, you will need to consider a range of models and to use an appropriate suite of diagnostics to assess them. Ultimately however, you are required to recommend a single model that is suitable for interpretation, and to justify your recommendation. Your chosen model should be either a linear model, a generalized linear model or a generalized additive model.
  4. Use your chosen model to predict the CO2 emissions for each of the vehicles for which this information is missing, and also to estimate the standard deviation of your prediction errors.
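For the linear-model case, steps 3 and 4 might look roughly like the R sketch below. The data are simulated, the model formula is only an example, and the prediction-error standard deviation is taken as the square root of the residual variance plus the squared standard error of the fitted mean.

```r
set.seed(6)
# Simulated stand-in data (not the assessment data)
n     <- 400
fuel  <- factor(sample(c("DIESEL", "PETROL"), n, replace = TRUE))
eng   <- runif(n, 1000, 3000)
co2   <- 60 + ifelse(fuel == "PETROL", 0.05, 0.035) * eng + rnorm(n, sd = 8)
train <- data.frame(FuelType = fuel, EngineSize = eng, CO2 = co2)
new   <- data.frame(FuelType   = factor(c("DIESEL", "PETROL"), levels = levels(fuel)),
                    EngineSize = c(1600, 1600))

# Linear model with a fuel-by-engine-size interaction (example formula only)
fit <- lm(CO2 ~ FuelType * EngineSize, data = train)

# Predicted means with standard errors for the new records
p <- predict(fit, newdata = new, se.fit = TRUE)

# Prediction-error SD: residual variance plus uncertainty in the fitted mean
pred_sd <- sqrt(p$residual.scale^2 + p$se.fit^2)
data.frame(new, CO2hat = p$fit, pred_sd)
```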

Submission for this assessment is electronic, via the STAT7001 Moodle page. You are required to submit three files, as follows:

  • A report on your analysis, not exceeding 2 500 words of text plus two pages of graphs and / or tables. The word count includes titles, footnotes, appendices, references etc. — in fact, it includes everything except the two pages of graphs / tables. Your report should be in four sections, as follows:
    I.  Describe, clearly but briefly, the decisions that you made about preprocessing the data before starting your analysis. You don’t need to go into details of exactly how you did it; just summarise what you did and why.
    II.  Describe briefly what aspects of the problem context you considered at the outset, how you used these to start your exploratory analysis, and what were the important points to emerge from this exploratory analysis.
    III.  Describe briefly (without too many technical details) what models you considered in step (3) above, and why you chose the model that you did.
    IV.  State your final model clearly, summarise what your model tells you about the characteristics associated with variation of CO2 emissions between vehicles and over time, and discuss any potential limitations of the model.

    Your report should not include any computer code. It should include some graphs and / or tables, but only those that support your main points.

    Your report should be in PDF (recommended) or Word, and should be named as ########_rpt.pdf or ########_rpt.docx as appropriate, where ######## is your student ID number. For example, if your ID number is 150123456 and you are using PDF, your report should be named 150123456_rpt.pdf.

  • An R script or SAS program corresponding to your analysis and predictions. Your script / program should run without user intervention on any computer with R or SAS installed, providing the file EmissionsData.csv is present in the current working directory / current folder. When run, it should produce any results that are mentioned in your report, together with the predictions and the associated standard deviations. The script / program should be named ########_ICA2.r or ########_ICA2.sas as appropriate, where ######## is your student ID number. You may not create any additional input files that can be referenced by your script: if you use R however, you may use additional libraries if you wish (see ‘Hints’ below).

• A text file containing your predictions for the 7 052 records with missing CO2 emissions data. This file should be named ########_pred.dat, where ######## is your student ID number. The file should contain three columns, separated by spaces and with no header. The first column should be the record identifier (corresponding to variable ID in file EmissionsData.csv); the second should be the predicted CO2 emissions rates for that record, and the third should be the standard deviation of your prediction error.
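As an illustration of the required file layout, the R sketch below writes three space-separated columns with no header. The predictions are made-up numbers, the ID 150123456 is the example ID used earlier in these instructions, and the file is written to a temporary directory here only so that the sketch leaves no files behind.

```r
# Hypothetical predictions for three records (all numbers are made up)
pred <- data.frame(ID     = c(14105, 14106, 14107),
                   CO2hat = c(118.4, 142.9, 97.3),
                   SD     = c(9.8, 11.2, 10.1))

# Example student ID, used only to build the file name
student_id <- "150123456"
out_file   <- file.path(tempdir(), paste0(student_id, "_pred.dat"))

# Three space-separated columns, no header, no row names, no quotes
write.table(pred, file = out_file, sep = " ", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to confirm the layout
check <- read.table(out_file, header = FALSE)
check
```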

Marking criteria

There are 75 marks for this exercise. These are broken down as follows:

Report: 40 marks. The marks here are for: making sensible decisions about how to preprocess the data; displaying awareness of the context for the problem and using this to inform the statistical analysis; good judgement in the choice of exploratory analysis and in the model-building process; a clear and well-justified argument; clear conclusions that are supported by the analysis; and appropriate choice and presentation of graphs and / or tables. The mark breakdown is as follows:

Preprocessing decisions: 5 marks. These marks are for explaining clearly what preprocessing you did and why, and for displaying good awareness of what may be necessary in order to avoid errors or bias in the subsequent statistical analysis.

Awareness of context: 5 marks.

Exploratory analysis: 7 marks. These marks are for (a) tackling the problem in a sensible way that is justified by the context; and (b) carrying out analyses that are designed to inform the subsequent modelling.

Model-building: 8 marks. The marks are for (a) starting in a sensible place that is justified from the exploratory analysis; (b) appropriate use of model output and diagnostics to identify potential areas for improvement; (c) awareness of different modelling options and their advantages and disadvantages; and (d) consideration of the context (e.g. ensuring that your models can be interpreted with respect to what you know about vehicle emissions and EU legislation) during the model-building process.

Quality of argument: 5 marks. The marks are for assembling a coherent ‘narrative’, for example by drawing together the results of the exploratory analysis so as to provide a clear starting point for model development, presenting the model-building exercise in a structured and systematic way and, at each stage, linking the development to what has gone before.

Clarity and validity of conclusions: 5 marks. These marks are for stating clearly what you have learned about how and why CO2 emissions vary between vehicles and over time, and for ensuring that this is supported by your analysis and modelling.

Graphs and / or tables: 5 marks. Graphs and / or tables need to be relevant, clear and well presented (for example, with appropriate choices of symbols, line types, captions, axis labels and so forth). There is a one-slide guide to ‘Using graphics effectively’ in the slides / handouts for Lecture 1 of the course. Note that you will only receive credit for any graphs in your report if your submitted script / program generates and automatically saves these graphs, appropriately labelled, when it is run.

Note that you will be penalised if your report exceeds EITHER the specified 2 500-word limit or the number of pages of graphs and / or tables. Following the UCL guidelines at https://www.ucl.ac.uk/srs/academic-manual/c4/failure/word-count, the maximum penalty is 7 marks, and no penalty will be imposed that takes the final mark below 30/75 if it was originally higher. Subject to these conditions, penalties are as follows:

• More than two pages of graphs and / or tables: zero marks for graphs and / or tables, in the marking scheme given above.

• Exceeding the word count by 10% or less: mark reduced by 4.
• Exceeding the word count by more than 10%: mark reduced by 7.

In the event of disagreement between reported word counts on different software systems, the count used will be that from the examiner’s system. If you submit your report as a PDF file, we will use our own R function called PDFcount to determine the word count: this is available from the Moodle page in file PDFcount.r, if you want to be sure that you get the same count as us.

Coding: 15 marks. There are 5 marks here for reading the data, preprocessing and handling missing values correctly and efficiently; 5 marks for effective use of your chosen software in the exploratory analysis and modelling (e.g. programming efficiently and correctly); and 5 marks for clarity of your code: commenting, layout, choice of variable / object names and so forth.

Prediction quality: 20 marks. The remaining 20 marks are for the quality of your predictions. Note, however, that you will only receive credit for your predictions if your submitted ########_pred.dat file is identical to that produced by your script / program when it is run: if this is not the case, your predictions will earn zero marks.

For these marks, you are competing against each other. Your predictions will be assessed using the following score:3

$$S = \frac{1}{2} \sum_{i=1}^{7052} \left[ \log\left(\sigma_i^2 + 20\right) + \frac{(Y_i - \hat{\mu}_i)^2}{\sigma_i^2 + 20} \right],$$

where:

$Y_i$ is the actual rate of CO2 emissions for the ith prediction (which I know);
$\hat{\mu}_i = \hat{E}(Y_i)$ is your corresponding prediction;
$\sigma_i$ is your quoted standard deviation for the prediction error.

3 NOTE (6th April 2018): following some very smart feedback from students, the definition of the score has been changed slightly from the original instructions.

The score S is an approximate version of a proper scoring rule, which is designed to reward predictions that are close to the actual observation and are also accompanied by an accurate assessment of uncertainty (this was discussed during the Week 10 lecture, along with the rationale for using this score for the assessment). Low values are better. The scores of all of the students in the class (and the lecturer) will be compared: students with the lowest scores will receive all 20 marks, whereas those with the highest scores will receive fewer marks. The precise allocation of marks will depend on the distribution of scores in the class.

If you don’t supply standard deviations for your prediction errors, the values of the {σi} will be taken as zero: this means that your score will be −∞ if you predict every value perfectly (this is the smallest possible score, so you’ll get 20 marks in this case), and +∞ otherwise (this will earn you zero marks).
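To get a feel for how the score behaves, it can be computed directly in R. The sketch below assumes the revised form of S, read here as half the sum over records of log(σ² + 20) plus the squared error divided by (σ² + 20); the function name and the input values are made up for illustration.

```r
# Score S, assuming the revised definition:
# S = 0.5 * sum( log(sigma^2 + 20) + (y - mu)^2 / (sigma^2 + 20) )
score <- function(y, mu, sigma) {
  v <- sigma^2 + 20
  0.5 * sum(log(v) + (y - mu)^2 / v)
}

# Made-up values purely for illustration
y     <- c(120, 150, 95)    # actual emissions
mu    <- c(118, 155, 100)   # predictions
sigma <- c(10, 10, 10)      # quoted prediction-error SDs

score(y, mu, sigma)

# Quoting far too small or far too large an SD both give a worse (higher) score
score(y, mu, sigma = rep(0.1, 3))
score(y, mu, sigma = rep(100, 3))
```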

STAT7001 Assessment 2 — Hints

  1. There is not a single ‘right’ answer to this assignment. There is a huge range of options available to you, and many of them will be sensible.
  2. You are being assessed not only on your computing skills, but also on your ability to carry out an informed statistical analysis: material from other statistics courses (in particular STAT2002, for students who have taken it) will be relevant here. To earn high marks, you need to take a structured and critical approach to the analysis and to demonstrate appropriate judgement in your choice of material to present.
  3. At first sight, the task will appear challenging. However, there is a lot of information that can guide you: look at some of the web links earlier in these instructions, and for other commentaries on vehicle emissions and relevant policy, to gain some understanding of what kinds of relationships you might look for in the data.
  4. When building your model, you have two main decisions to make. The first is: should it be a linear, generalized linear or generalized additive model? The second is: which covariates should you include? You might consider the following points:

3 (continued) In the original instructions the score was defined as

$$\sum_{i=1}^{7052} \left[ \log \sigma_i + \frac{(Y_i - \hat{\mu}_i)^2}{2\sigma_i^2} \right].$$

The reason for the change is explained in the ‘important update’ document available from the ICA2 tab of the Moodle page.

Linear, generalized linear or generalized additive? This is best broken down into two further questions, as follows:

  • Conditional on the covariates, can the response variable be assumed to follow a normal distribution with constant variance? In this assignment, the response variable cannot be negative and therefore cannot have exactly a normal distribution. However, you may find that the residuals from a linear regression model are approximately normal, and you may judge that the approximation is adequate for your purposes. The ‘constant variance’ assumption may also be suspect: for positive-valued quantities, it is common for the variability to increase with the mean. If this is the case here, you need to decide whether it varies enough to matter: you need to think about whether the effect is big enough that you can improve your predictions (and hence your score!) by accounting for it. You might consider using your exploratory analysis to gain some preliminary insights into this point.
  • Are the covariate effects best represented parametrically or nonparametrically? Again, your exploratory analysis can be used to gain some preliminary insights into this. You may want to look at the material from week 6, for examples of situations where a nonparametric approach is needed.
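One rough way to look at the ‘constant variance’ question is to compare the spread of linear-model residuals across bands of the fitted values. The R sketch below does this on simulated data whose spread grows with the mean; the variable names, the quartile banding and the gamma GLM with log link used for comparison are all illustrative assumptions.

```r
set.seed(1)
# Simulated stand-in data (not the assessment data): a positive response
# whose standard deviation is proportional to its mean
n    <- 500
mass <- runif(n, 900, 2500)
co2  <- rgamma(n, shape = 50, rate = 50 / (40 + 0.07 * mass))

fit_lm <- lm(co2 ~ mass)

# Residual standard deviation within quartile bands of the fitted values
band       <- cut(fitted(fit_lm), quantile(fitted(fit_lm), 0:4 / 4),
                  include.lowest = TRUE)
sd_by_band <- tapply(resid(fit_lm), band, sd)
sd_by_band

# A gamma GLM is one alternative in which the variance grows with the mean
fit_glm <- glm(co2 ~ mass, family = Gamma(link = "log"))
AIC(fit_lm, fit_glm)
```

If the residual spread is similar in every band, the extra complexity of a non-normal model may not be worth it.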

    Which covariates? The data file contains quite a lot of information, both on the physical properties of the vehicles (engine size etc.) and also on the manufacturer. The variables ApprovalNo and TechnolType are not straightforward: they encode quite a lot of additional information which may or may not be relevant (see the Appendix for details). You have many choices here, and you will need to take a structured approach to the problem in order to avoid running into difficulties. The following are some potentially useful ideas:

  • Look at other literature on emissions. What are considered to be the most important characteristics controlling the CO2 emissions of a vehicle? Can these be linked to covariates for which you have information? Obviously, if you do this then you will need to acknowledge your sources in your report.
  • Define useful summary measures on contextual grounds, and work with these. For example, there are vehicles with a wide range of engine sizes in the database: you might decide to group these into ‘small’, ‘medium’, ‘large’ and ‘supercar’ categories instead of taking engine size as a continuous covariate. Alternatively you may decide to group the models of car for each manufacturer, for example using a clustering analysis as discussed in the Week 10 lecture.
  • Define new variables based on the correlations between the existing variables, and work with these. If several variables are highly correlated, then it is difficult to disentangle their effects and it may be preferable to work with a single ‘index’ that combines all of them. This is the basis of techniques such as Principal Components Analysis, that were discussed during the Week 10 lecture (along with how to implement them in R and SAS).

    You should not start to build any models until you have formed a fairly clear strategy for how to proceed. Your decisions should be guided by your exploratory analysis, as well as your understanding of the context.
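Two of the ideas above, grouping a continuous covariate into bands and combining correlated covariates into a single index, can be sketched in R as follows. The data are simulated, the cut points for the engine-size bands are arbitrary, and the names EngineBand and SizeIndex are hypothetical.

```r
set.seed(2)
# Simulated stand-in covariates (not the assessment data)
n      <- 200
engine <- round(runif(n, 900, 6000))
power  <- 0.05 * engine + rnorm(n, sd = 15)
mass   <- 800 + 0.25 * engine + rnorm(n, sd = 120)
cars   <- data.frame(EngineSize = engine, Power = power, Mass = mass)

# Group engine size into bands; the cut points are arbitrary illustrations
cars$EngineBand <- cut(cars$EngineSize,
                       breaks = c(0, 1400, 2000, 3500, Inf),
                       labels = c("small", "medium", "large", "supercar"))
table(cars$EngineBand)

# Combine correlated size-related variables into one index: the first
# principal component of the standardised variables
pca <- prcomp(cars[, c("EngineSize", "Power", "Mass")], scale. = TRUE)
cars$SizeIndex <- pca$x[, 1]

# Share of total variance captured by the first component
summary(pca)$importance["Proportion of Variance", 1]
```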

  5. Don’t forget to look for interactions! For example, one of the variables in the data set is FuelType, which is a factor (i.e. categorical covariate) indicating the type of fuel for a vehicle: another is EngineSize, which probably doesn’t make much sense for an electric vehicle but might have a big effect on emissions for petrol or diesel cars. Look at the analysis of the iris data from Workshop 2, for a similar kind of situation.
  6. If you use R for this assignment, you may load additional libraries if you wish. You should only do this, however, if you really understand what they are doing: overall, it is strongly recommended that you keep things fairly simple. See the feedback from last year’s assignment (available from the Moodle page) for more on this.
  7. If you use a linear model, it is straightforward to obtain the standard deviations of your prediction errors using either R or SAS (look at the material in Workshops 2 and 9 respectively, to find out how to do it). However, for generalized linear and generalized additive models you need some additional computations. Specifically:
    (a) Suppose $\hat{\mu}_i = \hat{E}(Y_i)$ is your ith predicted CO2 emission rate and that $Y_i$ is the corresponding actual value.
    (b) Then your prediction error will be $Y_i - \hat{\mu}_i$.
    (c) $Y_i$ and $\hat{\mu}_i$ are independent, because $\hat{\mu}_i$ is computed using only information from the first 14 104 records and $Y_i$ relates to one of the ‘new’ records.
    (d) The variance of your prediction error is thus equal to $\mathrm{Var}(Y_i) + \mathrm{Var}(\hat{\mu}_i)$.
    (e) You can calculate the standard error of $\hat{\mu}_i$ in both R and SAS, when making predictions for new observations (see Workshops 6 and 9). Squaring this standard error gives you $\mathrm{Var}(\hat{\mu}_i)$.
    (f) You can estimate $\mathrm{Var}(Y_i)$ by plugging in the appropriate formula for your chosen distribution. For example, if you’re using a gamma distribution (which is a possibility when using GLMs for non-negative response variables) then $\widehat{\mathrm{Var}}(Y_i) = \hat{\phi}\hat{\mu}_i^2$, where $\hat{\phi}$ is the estimated dispersion parameter for your model (see Section 2.1 of the notes for Workshop 6).
    (g) Hence you can estimate the standard deviation of your prediction error as $\hat{\sigma}_i = \sqrt{\widehat{\mathrm{Var}}(Y_i) + \mathrm{Var}(\hat{\mu}_i)}$.
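Steps (a) to (g) can be sketched in R for a gamma GLM as follows. The data are simulated and the model formula is only an example. The standard error of the predicted mean comes from predict() with se.fit = TRUE on the response scale, and the dispersion estimate is the one reported by summary().

```r
set.seed(3)
# Simulated stand-in data (not the assessment data)
n     <- 300
mass  <- runif(n, 900, 2500)
co2   <- rgamma(n, shape = 40, rate = 40 / exp(4 + 0.0004 * mass))
train <- data.frame(Mass = mass, CO2 = co2)
new   <- data.frame(Mass = c(1000, 1500, 2000))

fit <- glm(CO2 ~ Mass, family = Gamma(link = "log"), data = train)

# (e) Predicted means and their standard errors on the response scale
p      <- predict(fit, newdata = new, type = "response", se.fit = TRUE)
mu_hat <- p$fit
var_mu <- p$se.fit^2

# (f) Gamma variance function: Var(Y_i) = phi * mu_i^2
phi_hat <- summary(fit)$dispersion
var_y   <- phi_hat * mu_hat^2

# (g) Standard deviation of the prediction error
sigma_hat <- sqrt(var_y + var_mu)
data.frame(new, mu_hat, sigma_hat)
```

For a different response distribution only the line computing var_y would change, since that is where the variance function enters.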

Appendix: the EmissionsData.csv data set

Data sources and processing

The data provided in EmissionsData.csv are from three of the annual monitoring files supplied by the European Environment Agency. After resolving some problems relating to file corruption, the downloaded data have been processed in the following way to create EmissionsData.csv:

  1. The variable names were modified for ease of interpretation; and a Year variable was added.
  2. Leading and trailing spaces were removed from all character variables; this is to prevent, for example, ‘MAZDA’ and ‘MAZDA ’ being considered as different car manufacturers.4
  3. Records were removed if they had missing data for key variables, or if they corresponded to vehicles that are not passenger vehicles according to the classification system at https://www.transportpolicy.net/standard/eu-vehicle-definitions/.
  4. Some less relevant variables were removed: according to the naming convention on the European Environment Agency web site, these are variables Ct (vehicle type — this was just used to identify anything that wasn’t a passenger vehicle), Ve (‘version’ — an esoteric code used to distinguish between vehicles that are identical for most current purposes), T (‘type’ — ditto), z (Wh/km) (electric energy consumption — not very relevant for the present analysis), MAN (this is how the manufacturer referred to themselves on the ‘Original Equipment Manufacturer’ form, and doesn’t provide any useful information that can’t be found elsewhere in the data set) and MMS (manufacturer name as held in a registry — ditto).
  5. For each combination of year and member state, sets of records that obviously referred to the same model of car were identified and aggregated. The identification was done by comparing all of the variables in the record except the number of vehicle registrations, considering the numeric variables to be equal if they differed by less than 1 unit. The numbers of registrations in each set were totalled.
  6. Any record was removed if the same member state had reported for the same model of car in an earlier year. This is because in the statistical analysis, ideally each unique model of car should appear just once (although there is a possibility that some models may become available at different times in different countries, and that the specifications may be slightly different in this case). The rationale for removing the ‘duplicate records’ in later years, instead of (say) at random is that the Year variable in the resulting dataset becomes a proxy for ‘age of model’ (because only the more recent models will appear in the dataset for Year 3, for example).
  7. A sample of roughly 20% of the remaining records was selected for use in the ‘model building’ part of the assessment (this will be referred to as ‘Group 1’ below), and a further sample of around 10% for the ‘prediction’ part (‘Group 2’). This was done in such a way that the two samples were non-overlapping but had very similar distributions of all potential covariates. Specifically:
    (a) For each combination of year, member state and ‘manufacturer group’ (see Appendix), 20% of the available car models were sampled at random, without replacement, as candidates to use in Group 1; and 10% for use in Group 2.
    (b) For each of the numeric covariates in the data set, a Kolmogorov-Smirnov test was performed to test the null hypothesis that the underlying distributions in Groups 1 and 2 are the same.
    (c) The samples were accepted only if the p-values for all of the Kolmogorov-Smirnov tests were greater than 0.05. Otherwise, a new candidate sample was drawn in step (a) and the procedure was repeated.

The Kolmogorov-Smirnov test is used here as a convenient way to measure whether two distributions are similar. Note, however, that the CO2 emissions rates were not included in this balancing exercise: this is because the performance of predictions would be artificially enhanced if they were included (for example, we would know that the mean emissions rate for Group 2 is similar to that for Group 1). Note also that no attempt has been made to balance the groups in terms of combinations of the covariates.

4 Note, however: there remain several instances where an individual manufacturer’s name is presented in different ways (for example ‘Volkswagen’, ‘VOLKSWAGEN’, ‘VOLKSWAGEN – VW’, ‘VOLKSAWGEN; VW’, . . .).
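The accept/reject procedure in steps (a) to (c) can be illustrated with a simplified R sketch. It draws the two groups from a single simulated pool and omits the stratification by year, member state and manufacturer group; the function name draw_groups, the group sizes and the cap on the number of attempts are assumptions made for the example.

```r
set.seed(4)
# Simulated stand-in covariates (not the assessment data)
pool <- data.frame(Mass  = rnorm(1000, 1400, 250),
                   Power = rnorm(1000, 90, 30))

draw_groups <- function(pool, n1 = 200, n2 = 100, max_tries = 100) {
  for (try in seq_len(max_tries)) {
    # (a) Sample without replacement so that the groups cannot overlap
    idx <- sample(nrow(pool), size = n1 + n2)
    g1  <- pool[idx[seq_len(n1)], ]
    g2  <- pool[idx[n1 + seq_len(n2)], ]
    # (b) One Kolmogorov-Smirnov test per numeric covariate
    pvals <- sapply(names(pool),
                    function(v) ks.test(g1[[v]], g2[[v]])$p.value)
    # (c) Accept only if every p-value exceeds 0.05
    if (all(pvals > 0.05)) {
      return(list(g1 = g1, g2 = g2, pvals = pvals, tries = try))
    }
  }
  stop("no balanced split found within max_tries attempts")
}

res <- draw_groups(pool)
res$pvals
res$tries
```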

  8. The ‘Group 2’ records were placed at the end of the data table, with their CO2 emissions rates set to −1; and a new ID variable was created so that each record has an ID number between 1 and 21 156.
  9. Each of the numeric covariates was multiplied by a (different) random number close to 1. This makes no difference to any models that you fit, because the corresponding regression coefficients will scale correspondingly; but it makes it more difficult for you to match the data with any information that you can find online.
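The claim that rescaling a covariate leaves a fitted model unchanged can be checked on a simple linear regression. In the R sketch below the data are simulated and the scaling factor k is a made-up value: the slope is divided by k and the fitted values are identical.

```r
set.seed(5)
# Simulated data (not the assessment data)
x <- runif(50, 1000, 2000)
y <- 50 + 0.06 * x + rnorm(50, sd = 5)
k <- 1.013                      # hypothetical scaling factor close to 1

fit_orig   <- lm(y ~ x)
x_scaled   <- k * x
fit_scaled <- lm(y ~ x_scaled)

# The slope for the scaled covariate, multiplied by k, recovers the original
c(original = unname(coef(fit_orig)["x"]),
  rescaled = unname(coef(fit_scaled)["x_scaled"] * k))

# Fitted values are the same under both parameterisations
all.equal(fitted(fit_orig), fitted(fit_scaled))
```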

Description of variables

This section gives a brief description of each of the variables in EmissionsData.csv.

ID: Record ID, from 1 to 21 156.

Year: The reporting year (1, 2 or 3) from which this record is derived. Year 1 predates year 2, which predates the Volkswagen emissions scandal; year 3 postdates the emissions scandal.

MemberState: Two-character international standard country code for the EU member state to which the record corresponds: for example, DE is Germany.

MfrGroup: Manufacturer group (for example, manufacturers such as Audi and Porsche are part of the Volkswagen group of companies). Be aware that for some groups, the spelling and punctuation of this variable differs between years!

MfrHarmonised: ‘Harmonised’ manufacturer name. This is intended to provide a unique and unambiguous value for each manufacturer. There are some surprises here: for example, most of the Mercedes vehicles in the data set have the value DAIMLER AG for this variable.

ApprovalNo: This is the ‘type approval number’, which certifies the safety, environmental and production standards for the vehicle (see this link). A type approval number consists of several parts, separated by an asterisk ‘*’, for example e1*2001/116*0252*44. Details of the coding can be found at Annex VII of this web address. In the example just given, the e1 indicates that the approval was issued in Germany (e2 would be France, e3 would be Italy and so on); the 2001/116 is the number of the main EU directive to which the approval conforms (in this case, Directive 2001/116/EC of 20th December 2001, so the ‘2001’ encodes the year of the directive); and the remaining parts encode various amendments to the main directive. The reason for including it is that the year of the directive with which a vehicle complies may be a more accurate proxy for ‘age of this particular type of car’ than just the Year variable. It will be a considerable amount of effort to extract this information from the codes as provided in the data but, as with the variable CommercialName (see below), it may be helpful for diagnostic purposes in your statistical analysis, e.g. to see whether any outlying data points are unusual with respect to their regulatory approvals.

Make: This provides a finer degree of detail on the manufacturers’ brands: for example, ‘Cadillac’ and ‘Chevrolet’ are both makes of car from the General Motors Company (MfrHarmonised equal to GENERAL MOTORS COMPANY) which itself is part of the General Motors group of companies (MfrGroup equal to GENERAL MOTORS).

CommercialName: This is the name by which the car is usually known: for example Volkswagen Beetle, Renault Megane 1.2 1.6V. This is included primarily to help you visualise some of the cars. If you find some outlying data points for example, it may be useful for you to know what models they correspond to.

Registrations: This is the total number of new cars of this specific model that were registered in the specified year and member state.

CO2: Reported ‘Specific CO2 emissions’ for this model (units: g/km). This is your response variable.

Mass: Mass of this model (units: kg).

Wheelbase: Length of wheelbase i.e. distance between the centres of the front and rear wheels (units: mm).

SteeringAxle: Length of the steering axle, roughly corresponding to the width of the car (units: mm).

OtherAxle: Length of the other axle, again roughly corresponding to the width of the car (units: mm).

FuelType: Type of fuel. This should mostly be fairly self-explanatory: see the table in step 1 of the detailed instructions. E85 is ethanol; LPG is ‘liquefied petroleum gas’ such as propane or butane. NG-biomethane refers to a vehicle that can run on natural gas or biomethane. The UK Vehicle Certification Agency explanatory booklet gives a useful summary of different fuel types.

FuelMode: This is the ‘fuel mode’ under which the emissions test was carried out. It is mostly relevant only for dual- and flexi-fuel vehicles. It takes one of three values: M (‘mono’) indicates that the record corresponds to single-fuel mode, B (‘bi’) indicates that the record corresponds to bi-fuel mode (for vehicles that have more than one tank for different types of fuel) and F indicates that the record corresponds to flexi-fuel mode (for vehicles with a single tank that can run on mixtures of different types of fuel).

EngineSize: Engine capacity (units: cm³).

Power: Engine power (units: kW).

TechnolType: This is a code indicating “emissions reduction through innovative technologies”. Like ApprovalNo, it follows a rather esoteric coding system: for example, e1 10 8 means that the vehicle is fitted with technologies coded as numbers 10 and 8 by the German (e1, see above) approval authority. A blank entry here means that no innovative technologies are fitted.

ITReduction: This is the certified reduction in emissions from any innovative technologies recorded in TechnolType (units: g/km).
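The coded variables ApprovalNo and TechnolType can be taken apart with basic string functions. The R sketch below uses the example code e1*2001/116*0252*44 from the description above together with one made-up approval number; it assumes that the directive is always the second asterisk-separated part and that its year precedes the slash, which should be checked against the real data.

```r
# First value is the example from the variable description; the second is made up
approval <- c("e1*2001/116*0252*44", "e2*2007/46*0123*05")

# Split on the literal asterisk (fixed = TRUE avoids regex interpretation)
parts     <- strsplit(approval, "*", fixed = TRUE)
authority <- sapply(parts, `[`, 1)     # issuing authority, e.g. "e1"
directive <- sapply(parts, `[`, 2)     # main directive, e.g. "2001/116"

# Assumed layout: the directive year is the text before the slash
dir_year <- as.integer(sub("/.*$", "", directive))

data.frame(approval, authority, directive, dir_year)

# TechnolType example from the description: authority code, then technology codes
tech           <- "e1 10 8"
codes          <- strsplit(tech, " ", fixed = TRUE)[[1]]
tech_authority <- codes[1]
tech_numbers   <- as.integer(codes[-1])
```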