CVEN9407-Transport Modelling
Project Brief
Introduction
This document explains the final project of CVEN9407. The project is an individual project and group submissions are not accepted. Its purpose is to familiarise students with practical econometric analysis and to guide them in drawing statistical inferences. The project is worth 50% of the final grade. Students are evaluated based on their submitted progress report and their final report. This brief covers the data, the recommended software, the process of data analysis and model development, the format of the reports and the submission dates.
Guidance and assistance
Students are advised to self-monitor their progress on the project and seek assistance if needed. Students can gauge their performance based on the feedback on their progress report.
Students can use their workshop hours to discuss issues with the course demonstrator. If further assistance is needed, students can ask for consultation with the course coordinator.
Software
A statistical software package is required to complete this project. R is the recommended package for this study. R is a free statistical package which can be downloaded from this website. To make R easier to use, it is recommended to download RStudio as well. RStudio can be downloaded from this website.
A basic introduction to the software will be provided in the lectures, and sample code for completing most of the workshop questions will be provided to students.
Note that using R is not mandatory, and students can work with other statistical software packages if they wish. The one exception is the train/test split in step 2.1, which must be completed in R.
Data
The dataset of this study is obtained from the Household, Income and Labour Dynamics in Australia (HILDA) survey. HILDA is a longitudinal survey which started in 2001 and is planned to continue until 2021 (for more information about this survey refer to this website). HILDA contains socio-demographic information about respondents. It also contains respondents’ ratings of their satisfaction in different domains. The main purpose of this study is to investigate the impact of transport related variables on life satisfaction. HILDA is a confidential dataset and students must submit a request to DataVerse to obtain it. A separate document will be uploaded on Moodle to guide you through getting access to HILDA.
*** The very first step of this project is obtaining access to HILDA ***
After you obtain access to HILDA, the dataset of this project will be shared with you. Due to confidentiality issues, all personal information has been removed from this dataset.
Every student is supposed to focus on one aspect of life satisfaction in a specific year. To obtain your personalised dataset for this project, filter the dataset that is shared with you based on the allocated “year” and “variable of interest” given in a separate table. This table is posted on the Moodle page.
You must keep only one of the 9 life satisfaction variables in your dataset, which is going to be the dependent variable of your study. Note that the other life satisfaction variables should not be used as independent variables.
The variable of interest in this table is your dependent variable. Throughout the project, you will investigate the potential impact of the other explanatory variables on this variable.
Variable definitions
The definition of most of the variables is provided here. However, some of the fields in the processed dataset do not exist in the HILDA Data Dictionary. Below you can find the definition of these variables.
The last 40 variables in this list describe the land use characteristics of individuals’ residences. Four indexes are available which describe the socio-demographic condition of zones. These indexes are generated by the Australian Bureau of Statistics and are referred to as Socio-Economic Indexes for Areas (SEIFA). SEIFA variables include:
• The Index of Relative Socio-Economic Disadvantage (IRSD)
• The Index of Relative Socio-Economic Advantage and Disadvantage (IRSAD)
• The Index of Education and Occupation (IEO)
• The Index of Economic Resources (IER).
For more information please visit this webpage.
Variable: Definition
Female: Binary variable indicating gender (female = 1)
Married: Binary variable indicating marital status (married = 1)
ESL: Binary variable indicating if English is the individual's second language
Le_mar: Binary variable indicating if the individual experienced the life event of marriage in the last year
Le_sep: Binary variable indicating if the individual experienced the life event of separation in the last year
Le_job: Binary variable indicating if the individual experienced the life event of job change in the last year
Le_bth: Binary variable indicating if the individual experienced the life event of giving birth to a child in the last year
Le_prg: Binary variable indicating if the individual experienced the life event of becoming pregnant in the last year
Le_death: Binary variable indicating if the individual experienced the life event of death of a spouse/child/close friend/relative in the last year
Le_fni: Binary variable indicating if the individual experienced a major improvement in finances in the last year
Le_fnw: Binary variable indicating if the individual experienced a worsening in finances in the last year
Le_frd: Binary variable indicating if the individual was fired or made redundant in the last year
Le_prm: Binary variable indicating if the individual was promoted in the last year
Le_rtr: Binary variable indicating if the individual retired in the last year
Le_ins: Binary variable indicating if the individual had a serious personal injury in the last year
Mltpljob: Binary variable indicating if the individual is employed in multiple jobs
Manager: Binary variable indicating if the job type is managerial
Professional: Binary variable indicating if the job type is professional
Technician: Binary variable indicating if the job type is technician
ServiceWorker: Binary variable indicating if the job type is service work
Administrative: Binary variable indicating if the job type is administrative
SalesWorker: Binary variable indicating if the job type is sales worker
MachineryOperator: Binary variable indicating if the job type is machinery operator
Labour: Binary variable indicating if the job type is labour
FlxWork: Binary variable indicating if the individual has flexible working hours
HmWork: Binary variable indicating if the individual can work from home
PrtStudy: Binary variable indicating if the individual is doing part-time studies
FullStudy: Binary variable indicating if the individual is doing full-time studies
Postgrad: Binary variable indicating education level (postgraduate = 1)
Bachelor: Binary variable indicating education level (bachelor = 1)
CoupleWo: Binary variable indicating if the family structure is couple without children
CoupleW: Binary variable indicating if the family structure is couple with children
LoneW: Binary variable indicating if the family structure is single parent
Single: Binary variable indicating if the family structure is single person
Renter: Binary variable indicating if the individual is renting his/her living place
hhad10_k (k = 1, ..., 9; hhad10_1 to hhad10_9): Binary variable indicating if the ‘IRSAD’ index of the home zone is less than k
hhda10_k (k = 1, ..., 9; hhda10_1 to hhda10_9): Binary variable indicating if the ‘IRSD’ index of the home zone is less than k
hhec10_k (k = 1, ..., 9; hhec10_1 to hhec10_9): Binary variable indicating if the ‘IER’ index of the home zone is less than k
hhed10_k (k = 1, ..., 9; hhed10_1 to hhed10_9): Binary variable indicating if the ‘IEO’ index of the home zone is less than k
Analysis
1. Data analysis
1.1. The first step is to familiarise yourself with the data. For that purpose:
• Check the definition of variables
• Check for any missing values in the data
• Calculate the mean and the standard deviations of continuous variables
• Calculate the frequencies for discrete variables
• If needed, plot the data to see the variations in variables
• Check the range of variables and see if it makes sense to you
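The checks in step 1.1 can be sketched in base R roughly as follows. This is only an illustration on a made-up toy data frame: the values are invented, and the column name satisfaction is a hypothetical stand-in for your allocated dependent variable.

```r
# Hypothetical toy data standing in for the project dataset (values are made up).
dat <- data.frame(
  satisfaction = c(7, 8, NA, 5, 9, 6),
  lscom        = c(5, 10, 2.5, NA, 7.5, 0),
  Female       = c(1, 0, 1, 1, 0, 0)
)

# Number of missing values in each column.
missing_counts <- colSums(is.na(dat))

# Mean, standard deviation and range of the continuous variables.
continuous <- c("satisfaction", "lscom")
summary_stats <- t(sapply(dat[continuous], function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
}))

# Frequencies of a discrete variable.
female_freq <- table(dat$Female, useNA = "ifany")

print(missing_counts)
print(summary_stats)
print(female_freq)
```

Calling hist() or plot() on individual columns is one way to look at the variation in a variable when the summary numbers are not enough.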
1.2. The relationship between variables
• Calculate the correlation matrix for available variables
• Highlight the strong correlations in the matrix
• Justify your observation. Explain potential reasons behind strong correlations.
• Are there cases where you expected to see strong correlations, but the data show otherwise? Discuss these cases.
• Is there any variable that you expect to have a non-linear relationship with the dependent variable? If you are not sure, plot the dependent variable against it and see if you can detect any pattern.
• For the variables that you suspect have a non-linear relationship, define new independent variables with an appropriate transformation (logarithmic, exponential, second or third power, etc.).
• Include the new independent variables in the correlation matrix and discuss the results.
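A minimal sketch of step 1.2 in base R is given here, under stated assumptions: the data are simulated, the variable names income, fuel and commute are hypothetical, and the 0.7 cut-off for a "strong" correlation is an arbitrary choice for illustration, not a threshold set by this brief.

```r
# Simulated data; names and values are hypothetical.
set.seed(42)
n <- 200
income <- rlnorm(n, meanlog = 10, sdlog = 0.5)
dat <- data.frame(
  income  = income,
  fuel    = 0.02 * income + rnorm(n, sd = 100),
  commute = rexp(n, rate = 1 / 5)
)

# A transformed version of a variable suspected of a non-linear relationship.
dat$log_income <- log(dat$income)

# Correlation matrix, including the transformed variable.
corr <- cor(dat, use = "pairwise.complete.obs")

# List variable pairs whose absolute correlation exceeds the assumed threshold.
threshold <- 0.7
strong <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[strong[, "row"]],
  var2 = colnames(corr)[strong[, "col"]],
  r    = corr[strong]
)

print(round(corr, 2))
print(strong_pairs)
```

The upper.tri() mask keeps each pair once and drops the diagonal, so the listing is not cluttered by the trivial correlation of every variable with itself.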
2. Regression analysis
2.1. It is always recommended to divide the dataset into train and test sub-datasets. The train dataset, containing 80 percent of the records, is used to estimate the parameters of the model, and the test dataset, containing the remaining 20 percent, is used to validate the model.
• Use the sample() function in R to randomly divide the dataset into test and train datasets. Even if you choose to use another statistical package for this project, this step should be completed using R (this is because the marker will be using R to check your analysis).
• To make the random split reproducible, set the seed number to your student ID. With a fixed seed, the data are still divided randomly into test and train sub-datasets, but the same split is obtained every time the process is repeated. The command in R to fix the seed number is set.seed().
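The split described in step 2.1 could look like the sketch below. The student ID 1234567 and the toy data frame are placeholders; replace them with the numeric part of your own ID and your filtered dataset.

```r
# Placeholder values: use your own student ID and your own dataset.
student_id <- 1234567
dat <- data.frame(id = 1:100, y = (1:100) %% 11)

# Fix the seed so that the random split can be reproduced by the marker.
set.seed(student_id)

# Draw 80 percent of the row numbers at random for the train dataset.
train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

cat("train rows:", nrow(train), " test rows:", nrow(test), "\n")
```

Because set.seed() is called immediately before sample(), rerunning these lines in the same version of R gives the same rows in each sub-dataset.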
2.2. Selecting the set of explanatory variables to be included in the model
• The main purpose of this study is to examine the relationship between transport related variables and the level of satisfaction. The dependent variable is the level of satisfaction and the remaining variables form the set of independent variables.
• The available transport related variables in this study are:
  o lscom: Travel time to/from paid work per week
  o hxymvfi: Household annual expenditure on motor vehicle fuel ($)
  o hxymvri: Household annual expenditure on motor vehicle repairs/maintenance ($)
  o hxyncri: Household annual expenditure on new motor vehicles, motorbikes or other vehicles
  o hxypbti: Household annual expenditure on public transport and taxis
• For each transport related variable, run a separate regression model with only that variable and discuss the estimated coefficient.
• For the 31 combinations of transport related variables (every non-empty subset of the five variables, 2^5 - 1 = 31), run a regression model and select the best model. The best model has the highest goodness-of-fit among the models in which all the included variables are statistically significant.
• Use the forward stepwise method to add other independent variables to the model.
  o Use the Bayesian Information Criterion (BIC) as the improvement criterion in the stepwise method. In each step, add one variable to the model. This variable should be statistically significant and improve BIC the most (that is, lower it the most).
  o Continue the process until either all the variables are exhausted or none of the remaining variables can improve BIC any further.
• The model that you have developed so far is the result of a mechanical process in which theory did not play a role. At this stage you should examine the model to see if it is consistent with existing theories in the field. There are two issues to be taken into consideration. First, exploring the theories on life satisfaction is out of the scope of this subject. So, as a simplification, we rely only on our common sense (note that in a real project our reference must be accepted theories). Second, from this point the process becomes somewhat subjective. In previous steps, BIC and adjusted R-squared could help you with selecting the best model and making modifications to it. However, from this point you need to use your judgment to decide how much goodness-of-fit can be compromised to include or exclude variables based on your expectations (or theories). Different modellers have different judgments and different approaches to implementing their opinions. So, get ready to grow your own modelling judgment.
  o Justify the included variables and the signs of their coefficients. Is there any variable that you cannot justify, or whose sign is counterintuitive?
  o On the other hand, is there any remaining variable that you expected to be included in your model?
  o Improve your model by putting aside unreasonable variables and including new variables from the leftovers that you expected to be included. Most likely, this practice deteriorates the model's goodness-of-fit. This is where you should decide how much you are willing to compromise the goodness-of-fit to improve justifiability.
  o Note that in this study you want to investigate the relationship between transport related variables and the level of satisfaction. So transport related variables should have a higher priority to be included in the model.
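The mechanical part of step 2.2 can be sketched in base R as follows. Everything here is illustrative: the data are simulated, satisfaction, Female and Renter are stand-ins for your own dependent variable and candidate regressors, adjusted R-squared is used as the goodness-of-fit measure, and 0.05 is an assumed significance level. The built-in step() function with k = log(n) selects by BIC but does not check significance, so each added variable still has to be inspected by hand.

```r
# Simulated stand-in for the train dataset (all values are made up).
set.seed(1)
n <- 500
dat <- data.frame(
  lscom   = rexp(n, rate = 1 / 5),
  hxymvfi = rgamma(n, shape = 2, rate = 1 / 1000),
  hxymvri = rgamma(n, shape = 2, rate = 1 / 500),
  hxyncri = rgamma(n, shape = 1, rate = 1 / 3000),
  hxypbti = rgamma(n, shape = 1, rate = 1 / 300),
  Female  = rbinom(n, 1, 0.5),
  Renter  = rbinom(n, 1, 0.3)
)
dat$satisfaction <- 8 - 0.08 * dat$lscom - 0.5 * dat$Renter + rnorm(n)

transport <- c("lscom", "hxymvfi", "hxymvri", "hxyncri", "hxypbti")

# All non-empty subsets of the five transport variables: 2^5 - 1 = 31.
subsets <- unlist(
  lapply(seq_along(transport),
         function(k) combn(transport, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one regression per subset.
fits <- lapply(subsets, function(v) {
  lm(reformulate(v, response = "satisfaction"), data = dat)
})

# Adjusted R-squared, and whether every slope is significant at the 5% level.
adj_r2  <- vapply(fits, function(f) summary(f)$adj.r.squared, numeric(1))
all_sig <- vapply(fits, function(f) {
  all(summary(f)$coefficients[-1, 4] < 0.05)
}, logical(1))

# Best model: highest adjusted R-squared among the all-significant models.
eligible  <- which(all_sig)
best_idx  <- eligible[which.max(adj_r2[eligible])]
best_vars <- subsets[[best_idx]]

# Refit the chosen model so that its call stores the formula itself.
base <- do.call("lm", list(
  formula = reformulate(best_vars, response = "satisfaction"),
  data    = quote(dat)
))

# Forward stepwise search over the other regressors, using BIC (k = log(n)).
others      <- c("Female", "Renter")
upper_scope <- reformulate(c(best_vars, others))
final <- step(base, scope = upper_scope, direction = "forward",
              k = log(nrow(dat)), trace = 0)

print(formula(final))
cat("BIC before:", BIC(base), " BIC after:", BIC(final), "\n")
```

Recording BIC after each step of the search gives the series needed for a plot of BIC against the number of included variables.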
2.3. Testing the assumptions of the Classical Linear Regression Model (CLRM)
• List all the assumptions behind the CLRM and the statistical test that you prefer to use to validate each assumption.
• Test your model to see if it satisfies all the assumptions.
• If your model does not satisfy one or more of the assumptions, double check the set of your independent variables. Sometimes excluding unnecessary variables solves the issue.
• If the problem still exists, use standard methods to rectify the problem.
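As an illustration of step 2.3, the sketch below runs a few common residual diagnostics using base R only. The choice of tests is an assumption made for this example (Shapiro-Wilk for normality, a hand-computed studentized Breusch-Pagan statistic for constant variance, a variance inflation factor for multicollinearity, and the Durbin-Watson statistic for serial correlation). It is neither a complete list of CLRM assumptions nor the set of tests prescribed by this brief, and the data are simulated.

```r
# Simulated data and model (values and variable names are made up).
set.seed(7)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)
res <- residuals(fit)

# Normality of the residuals: Shapiro-Wilk test.
normality <- shapiro.test(res)

# Constant variance: studentized (Koenker) Breusch-Pagan test by hand.
# Regress the squared residuals on the regressors; n * R^2 is compared
# with a chi-squared distribution with one degree of freedom per regressor.
dat$res_sq <- res^2
aux     <- lm(res_sq ~ x1 + x2, data = dat)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

# Multicollinearity: variance inflation factor of x1, 1 / (1 - R^2) from
# regressing x1 on the remaining regressors.
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2, data = dat))$r.squared)

# Serial correlation: Durbin-Watson statistic (values near 2 suggest none).
dw <- sum(diff(res)^2) / sum(res^2)

cat("Shapiro-Wilk p-value:", normality$p.value, "\n")
cat("Breusch-Pagan p-value:", bp_p, "\n")
cat("VIF(x1):", vif_x1, " Durbin-Watson:", dw, "\n")
```

Calling plot(fit) produces the standard residual plots, which are a useful visual complement to the formal tests.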
2.4. Validation.
• To validate the accuracy of the model, simulate the dependent variable for the test dataset and compare the results with the observed values. Discuss the model's predictive ability.
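One possible way to carry out the validation in step 2.4 is sketched below on simulated data. The error measures used here (root mean squared error and mean absolute error) are assumptions chosen for illustration; the brief does not prescribe particular measures.

```r
# Simulated data (values are made up), split 80/20 into train and test.
set.seed(2024)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 3 + 0.5 * dat$x + rnorm(n)

train_idx <- sample(n, size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Estimate the model on the train dataset only.
fit <- lm(y ~ x, data = train)

# Predict the dependent variable for the test dataset.
pred <- predict(fit, newdata = test)

# Compare predictions with the observed values.
rmse <- sqrt(mean((test$y - pred)^2))
mae  <- mean(abs(test$y - pred))

cat("RMSE:", rmse, " MAE:", mae, "\n")
```

A scatter plot of the observed values against the predictions, for example plot(pred, test$y), often makes systematic over- or under-prediction easier to see than a single number.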
2.5. Regarding your report: as you can see, there is a long process behind developing a regression model. However, you do not need to report all the work you have done. Think about what would be interesting for readers to learn from your endeavour and how to convey the highlights of your study efficiently. For instance, you can provide a plot of the BIC variations in step 2.2, which summarises the stepwise process. Your report should include the final model, which satisfies all the CLRM assumptions, and your justification for the coefficients and their signs.
3. Discrete choice analysis
3.1. Selecting the right model specification
• The first step in developing a discrete choice model is deciding on the model specification. The initial decision on model specification is mainly based on the dependent variable. Note that this decision might change along the way.
3.2. Defining choices and setting up the utility functions
• Discrete choice models, as the name implies, are developed to model the outcome of selecting one option out of multiple available alternatives. The output of a discrete choice model is the probability of selecting each of the alternatives. In this study, we extend the application of discrete choice models to the probability of belonging to a category, rather than selecting a category. In our study people do not make a decision about their level of satisfaction; rather, they feel that they belong to a certain category. Although this does not resemble a choice setup, by modifying our definition of the utility function we can still use discrete choice models in this context.
• To simplify the model, aggregate the range of your dependent variable into three categories: unsatisfied, moderate, and satisfied. The dependent variable varies from 0 to 10. Assume values below 5 indicate dissatisfaction and values above 7 indicate complete satisfaction, with the values in between (5 to 7) forming the moderate category. Based on this assumption, define a new dependent variable with three levels. Then calculate the “market share” of each of the categories for the test dataset, train dataset and overall.
• Based on the nature of available independent variables, discuss your alternative specific variables and generic variables, then derive a mathematical formulation for the utility functions.
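The three-level dependent variable and its "market shares" from step 3.2 can be built in base R as in the sketch below. The scores are made up, and the break points assume integer scores from 0 to 10, so that "below 5" means 0 to 4 and "above 7" means 8 to 10.

```r
# Made-up satisfaction scores on the 0 to 10 scale.
satisfaction <- c(0, 3, 4, 5, 6, 7, 7, 8, 9, 10)

# Intervals are closed on the right: (-Inf, 4], (4, 7], (7, Inf).
level <- cut(
  satisfaction,
  breaks = c(-Inf, 4, 7, Inf),
  labels = c("unsatisfied", "moderate", "satisfied"),
  ordered_result = TRUE
)

# Share of records in each category.
shares <- prop.table(table(level))
print(shares)
```

Applying the same two calls to the train rows, the test rows and the full dataset gives the three sets of shares asked for above.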
3.3. Estimating the parameters of the model
• According to the selected model specification and the defined utility functions, run a discrete choice model with the same set of independent variables that you arrived at in your regression model.
• Check the model’s goodness-of-fit, the statistical significance of the coefficients and their interpretation.
• Exclude insignificant variables from the model one by one. Each time you exclude a variable, run the model again and check the significance of the remaining variables.
• When you no longer have any insignificant variable, check if you can include any other variables that you expected to have an impact on your dependent variable.
• Similar to the regression modelling of this project, this process is also subjective and there is no single correct solution. Remember to prioritise transport related variables, aim for higher goodness-of-fit, and keep an eye on the significance of coefficients.
3.4. Examining the assumptions behind the selected model specification
• At this stage you should verify that your model satisfies all the assumptions behind your model specification. First, list all the assumptions that need to be tested and provide a legitimate statistical test to validate each assumption.
• If your model does not satisfy one or more of the assumptions, double check the list of independent variables. Sometimes excluding an unimportant variable fixes the issue.
• If the problem still exists, use standard methods to rectify the problem.
3.5. Validation.
• To validate the accuracy of the model, simulate the dependent variable for the test dataset and compare the results with the observed values. Calculate the average share of each category predicted by the model and compare it with the observed shares. Discuss the model's predictive ability.
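The share comparison in step 3.5 can be sketched as follows, assuming your fitted model returns one predicted probability per category for every test record. The probability matrix and the observed categories below are made up purely for illustration; in practice they come from your own model and test dataset. The hit rate at the end is an extra measure added for this example, not one required by the brief.

```r
categories <- c("unsatisfied", "moderate", "satisfied")

# Hypothetical predicted probabilities for five test records (rows sum to 1).
prob <- matrix(
  c(0.2, 0.5, 0.3,
    0.1, 0.3, 0.6,
    0.6, 0.3, 0.1,
    0.1, 0.4, 0.5,
    0.2, 0.5, 0.3),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, categories)
)

# Hypothetical observed categories for the same five records.
observed <- factor(
  c("moderate", "satisfied", "unsatisfied", "moderate", "moderate"),
  levels = categories
)

# Average predicted share of each category versus the observed share.
predicted_share <- colMeans(prob)
observed_share  <- prop.table(table(observed))

comparison <- data.frame(
  category  = categories,
  observed  = as.numeric(observed_share),
  predicted = as.numeric(predicted_share)
)
print(comparison)

# Share of records whose most likely category matches the observed one.
predicted_class <- categories[max.col(prob, ties.method = "first")]
hit_rate <- mean(predicted_class == as.character(observed))
cat("Hit rate:", hit_rate, "\n")
```

Averaging the predicted probabilities, rather than counting the most likely category of each record, is what makes the predicted shares comparable with the observed market shares.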
Deliverables
This project is an individual project and no group submission is accepted. Students are required to submit one progress report and one final report. All the reports should be typed and submitted to Moodle as a PDF file. Late submission is accepted but 10% of the mark will be deducted for each day of late submission.
The details of each report and their due dates are provided in the following table.
Report: Progress report
Items to be covered: Data analysis; Regression analysis
Details: A report of at most ten pages (excluding the cover page and reference page if necessary) presenting the progress made on the specified items.
Due date: Fri 05 July, 16:00
Report: Final report
Items to be covered: All the required items according to the analysis section
Details: A concise report on the project findings. The report should include introduction, data analysis, modelling practice, discussion and conclusion. The report should not exceed 30 pages (excluding the cover page, table of contents and reference page if necessary).
Due date: Fri 16 August, 16:00
Assessment
The project is worth 50% of the final grade. Students’ performance is evaluated based on their submitted reports. The progress report is worth 40% of the project’s total mark (20% of final grade) and the final report is worth 60% of the project’s total mark (30% of final grade). The reports will be assessed based on the following criteria.
Progress report (total credit: 40 points)
• The structure of the report. Satisfying page limitation while addressing all the required items (2 points)
• Providing standard descriptive statistics for dependent and independent variables (5 points)
• Identifying potential issues with data (3 points)
• Providing the correlation matrix (2 points)
• Discussing correlation between variables, identifying correlated and uncorrelated variables and justifying (5 points)
• Investigating transformed versions of variables (5 points)
• Reporting the selected multi-variable regression model and discussing the findings (8 points)
• Validating the assumptions of CLRM (7 points)
• Validating the model against test dataset (3 points)
Final report (total credit: 60 points)
• The structure of the report. Satisfying page limitation while addressing all the required items (3 points)
• Extended discussion on the range and other descriptive statistics of explanatory variables (2 points)
• Explaining potential variable transformation (2 points)
• Providing correlation matrix and justifying poor and strong correlations (3 points)
• Explaining the process of selecting the best regression model (3 points)
• Validation of CLRM assumptions (2 points)
• Discussing the findings in the regression analysis (3 points)
• Validating the model (2 points)
• Selecting a suitable discrete choice model specification and the utility functions (7 points)
• Reporting the final model and discussing the estimated parameters (13 points)
• Validating the assumptions behind the selected model specification (5 points)
• Validating the model (5 points)
• Conclusion of the study on the relationship between variables (10 points)
Total: 100 points