Assignment STAT-6108, 2018-2019 Analysis of Hierarchical Data
Important
Your coursework must be submitted using the TurnItIn link provided in Blackboard before May 9th, 2019, 3pm. For more information about the submission process, please check section 3 d) of the Module Outline.
Remember that the University places the highest importance on maintaining academic integrity and expects all students to do the same. Please make sure you are familiar with the Regulations governing Academic Integrity, which are available at http://www.calendar.soton.ac.uk/sectionIV/academic-integrity-regs.html.
General information
This assignment is divided into 2 parts, each further split into questions. Please provide your answers in a document organised according to this structure. The marks allocated to each part or question are indicated in square brackets. Your document should be written using a minimum font size of 12, line spacing of 1.5, and left and right margins of 2.5. Your document should not exceed 4000 words. Anything beyond the word limit will not be marked.
For Task 1 you must use MLwiN to fit all required models. You are free to use MLwiN or any other software to verify the model assumptions.
Up to 5 marks will be allocated to the general presentation of your report.
Task 1 [70 marks]
Research has suggested that participation in youth organisations has a positive effect on the well-being of teenagers. As a way to gather evidence for the formulation of relevant public policy, the Department for Education has commissioned a survey with the aim of understanding the relationship between a teenager's well-being and the number of hours he or she allocates to those activities, as well as finding whether differences can be found for different types of organisation (Sports, Arts or Volunteering organisations).
The data collected is presented in the file yorg.wsz. A sample of 87 youth associations was selected and all their members between the ages of 12 and 16 were interviewed. Teenagers who are members of more than one association or those who do not participate actively, as measured by not participating at least four hours per week during the last month, will not be taken into consideration. For the remaining individuals, the following variables are available:
ID.org: Organisation identifier
ID.indiv: Individual identifier
WB.index: Standardised index of well-being, measured on a continuous scale from 0 to 5. Higher values indicate better perceived well-being
Age: In completed years
H.week: Average number of hours per week spent on association activities during the last month
Type: 1 = Sports; 2 = Arts; 3 = Volunteering
1) Describe how you would proceed to analyse this dataset using an aggregated (group analysis) approach. Discuss in your own words the main potential issue (if any) of using this approach to answer the research question of interest (Max. 200 words).
[7 marks]
2) Fit a random intercepts model to study the relationship between an individual's well-being and his/her number of hours allocated per week.
a) Use an appropriate statistical test to decide, at the 5% significance level, whether to include a quadratic term in your model. Clearly state the null and alternative hypotheses of the test and your conclusion.
[7 marks]
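As an illustration only, one possible route for a test of this kind is a likelihood ratio test that compares the deviance (-2*loglikelihood) reported for two nested models; a Wald test on the coefficient is another option. The sketch below assumes both models were fitted by maximum likelihood and differ by `df` parameters. The function names `chi2_sf` and `lrt_p_value` and the deviance values in the usage note are hypothetical placeholders, not results from yorg.wsz, and the models themselves must still be fitted in MLwiN.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum of half-integer powers of x/2
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def lrt_p_value(deviance_null, deviance_alt, df):
    """Return the likelihood ratio statistic and its chi-square p-value.

    deviance_null: -2*loglikelihood of the simpler (null) model
    deviance_alt:  -2*loglikelihood of the larger (alternative) model
    df:            number of extra parameters in the larger model
    """
    stat = deviance_null - deviance_alt
    return stat, chi2_sf(stat, df)
```

For example, `lrt_p_value(110.0, 104.0, 1)` (placeholder deviances) returns the statistic 6.0 together with its p-value, which would then be compared with the chosen significance level.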
b) Write the fitted equations of the model (with or without quadratic term depending on the results of your test) including the fixed and random parts. Please do not simply copy the MLwiN output. Use plots, predicted values or the estimates of the regression coefficients to explain in simple words the relationship between those two variables.
[7 marks]
c) Write the equation you would use to predict the expected well-being index for an individual attending the organisation with identifier number 4.
[5 marks]
3) Use appropriate statistical tests to assess, at the 5% significance level, whether to include the variables Age and Type in your current model. Make sure to state your hypotheses and conclusions. Write the fitted equations of your final model including the fixed and random parts.
[4 marks]
4) Check the level 1 and 2 residuals of your final model in 3) and comment on the validity of the model assumptions. Use plots and tests to decide whether to include a contextual variable of the "group mean" type in your model. If you decide to do so, include it and write the equations of your final model.
[15 marks]
5) Use an appropriate statistical test to assess, at the 10% significance level, whether it is necessary to include a random slope for Age in your final model in 4). Regardless of your decision, obtain the Level 2 variance for this model and discuss your results, contrasting them with those of the corresponding model with only random intercepts.
[10 marks]
6) Using your final model in 4) as a starting point, use an appropriate statistical test to assess, at the 5% significance level, whether the relationship between hours and well-being varies depending on the type of organisation. Explain your findings and summarize the relationship between these variables and well-being.
[8 marks]
7) Summarize the conclusions of your analysis in non-technical language (Max. 200 words).
[7 marks]
Task 2 [25 marks]
Your answers to questions in this task should refer to the ideas contained in: Bell, A., Fairbrother, M., and Jones, K. "Fixed and random effects models: making an informed choice." Quality & Quantity (2018): 1-24.
A researcher is interested in studying the relationship between an individual's income and his/her educational level using data from a cross-sectional survey conducted in 25 countries. The data have two levels of hierarchy: country (25 countries) and individual (between 1500 and 2500 observations per country). As income is usually a highly skewed variable, the researcher will model the logarithm of income, log(income), as a function of individual characteristics such as Sex, Age and Level of Education, and of contextual variables such as the proportion of individuals with higher education in the country. This piece of research has the following aims:
Aim 1. To quantify the association between an individual's expected log(income) and his/her educational level, and to establish whether this association is the same for countries where the proportion of people with higher education is large and for those where it is low.
Aim 2. To build a ranking of countries according to their expected average log(income).
Aim 3. To identify whether the between-country heterogeneity in log(income) is larger than the within-country heterogeneity (having controlled for the relevant covariates).
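As a hedged illustration of how Aim 3 is often expressed in a two-level random intercepts model, the comparison can be summarised by the variance partition coefficient. The symbols below are introduced here for illustration and are not prescribed by the assignment: \sigma_u^2 denotes the between-country (level 2) variance and \sigma_e^2 the within-country (level 1) variance, both conditional on the covariates in the model.

```latex
\mathrm{VPC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2},
\qquad
\sigma_u^2 > \sigma_e^2 \iff \mathrm{VPC} > 0.5
```

Under this sketch, a VPC above 0.5 would indicate that the between-country heterogeneity exceeds the within-country heterogeneity.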
1) According to Bell et al. (2018), what are the defining characteristics that differentiate the FE and RE models for the analysis of hierarchical data? (Max. 150 words)
[7 marks]
2) For each one of the models FE, RE, REWB and OLS defined in Bell et al. (2018), identify one advantage or disadvantage of its use in the context of this particular research (taking into account the research aims). What would be the model of your choice in this case and why? (Max. 400 words)
[12 marks]
3) A linear random intercepts model using the logarithm of income as the response variable seems to be the most appropriate model for this dataset. Although the residuals seem approximately normal, there are a few outliers. Explain how this situation may (or may not) affect the quality of your conclusions with respect to each of the three research aims above (Max. 250 words).
[6 marks]
End of the coursework