ECO220Y5Y: Quantitative Methods in Economics
Final Assignment
Replacement for Exam Assessment on Regression
1 Interactive Regression Exercise
1.1 Motivation
Econometrics is best understood by doing rather than by reading about what someone else has done. There are difficult choices and many pitfalls on the way to the ‘correct’ model. Sometimes the existing theory underlying the relationships in your model seems a bit off, and you could build a much ‘better’ model by including a different set of variables or transforming them (this gets at the internal validity of the model). Sometimes choosing the model with the best fit means making decisions that are ideal only for your sample and would not apply well to data outside your sample time period or group of individuals (this gets at the external validity of the model). As a researcher, you need to trade off internal validity against external validity. As a result, econometrics can sometimes feel more like an art than a science. However, you will be asked to follow a scientific approach in making these modelling decisions and to justify them in a scientific way. This assignment requires that you make independent choices on specification, analyse the consequences of these choices, and adjust your choices to narrow in on a final model. You will be asked to justify the model you have selected and then provide some feedback on its economic implications.
1.2 Overview & Data
The dependent variable for this interactive assignment is the Provincial Achievement Test (PAT) score earned by students in an Alberta high school. The data set contains 70 observations on PAT scores and a number of possible causal factors, randomly drawn from a pool of approximately 750 students over approximately one decade. The literature on PAT scores indicates that scores are determined not only by ability and training but also by various socio-economic factors. Please see the attached article by James Fallows, ‘The Tests and the Brightest: How Fair Are the College Boards?’, for a summary of views in the literature on how SAT performance in the USA might be affected by various socio-economic factors (PAT scores and SAT scores should be similarly determined). The measures of ability and training included here are the cumulative high school grade point average (GPA) and participation in advanced placement math and English courses (APMATH and APENG). Advanced placement courses may help students perform better on the PAT. The data set also includes a number of dummy variables measuring qualitative socio-economic factors such as a student’s gender (MALE), ethnicity (WHITE), and native language (ENG), as well as a dummy variable indicating whether or not a student has attended a PAT preparation class (PREP). A further variable (YEAR) records the year in which the student’s PAT score and other information were recorded. Finally, there are several variables created as the product of two other variables.
Here is a detailed description of all variables in this assignment:
• PATi = the Provincial Achievement Test score of the ith student on a scale from 0 to 100
• GPAi = the grade point average of the ith student on a scale from 0 to 5
• APMATHi = a dummy variable equal to 1 if the ith student has taken AP Math, 0 otherwise
• APENGi = a dummy variable equal to 1 if the ith student has taken AP English, 0 otherwise
• APi = a dummy variable equal to 1 if the ith student has taken AP Math, AP English, or both, 0 otherwise
• MALEi = a dummy variable equal to 1 if the ith student is Male, 0 if Female
• WHITEi = a dummy variable equal to 1 if the ith student is Caucasian, 0 otherwise
• ENGi = a dummy variable equal to 1 if the ith student’s first language is English, 0 otherwise
• PREPi = a dummy variable equal to 1 if the ith student has attended a PAT preparation course, 0 otherwise
• YEARi = the year the Provincial Achievement Test was taken for the ith student recorded from 2007 to 2018
• GPAMALEi = (GPAi)(MALEi)
• GPAWHITEi = (GPAi)(WHITEi)
• GPAENGi = (GPAi)(ENGi)
• WHITEMALEi = (WHITEi)(MALEi)
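As an illustration of how the derived variables above relate to the underlying ones, the sketch below rebuilds AP and the four interaction terms for a single made-up student record. The record, the function name `derived_variables`, and the dictionary layout are assumptions for illustration only; they are not part of the assignment data.

```python
def derived_variables(student):
    """Return the derived variables for one student record.

    `student` is a dict holding GPA (0 to 5) and the 0/1 dummies
    APMATH, APENG, MALE, WHITE and ENG, as defined in the list above.
    """
    return {
        # AP is 1 if the student took AP Math, AP English, or both.
        "AP": 1 if (student["APMATH"] == 1 or student["APENG"] == 1) else 0,
        # Each interaction term is the product of its two components.
        "GPAMALE": student["GPA"] * student["MALE"],
        "GPAWHITE": student["GPA"] * student["WHITE"],
        "GPAENG": student["GPA"] * student["ENG"],
        "WHITEMALE": student["WHITE"] * student["MALE"],
    }


# Hypothetical record (not taken from the assignment data set).
example = {"GPA": 3.5, "APMATH": 0, "APENG": 1, "MALE": 1, "WHITE": 0, "ENG": 1}
print(derived_variables(example))
```

Because the dummies take only the values 0 and 1, an interaction such as GPAMALE equals GPA for male students and 0 otherwise, so its coefficient measures how the slope on GPA differs for males.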
1.3 Summary Statistics
Included below are the means, standard deviations, and correlation coefficients for the variables in this assignment.
Means and Standard Deviations:
Correlation Coefficients:
2 Section A: Building a Model of PAT Scores
2.1 Choosing the best specification
In this section you will choose the specification you would like to estimate from the list below, find the regression number of that specification, and then look at the regression results for your chosen specification in the appendix at the end. You can base your initial decision on the literature provided regarding potential discrimination in standardised testing design, and also on the summary statistics and correlation coefficients for the variables. You should then decide whether you are satisfied with your model selection based on the results. If you are not satisfied, you can use the information from the regression you ran to decide how to adjust the specification. You can then repeat the process until you settle on a final selection of the ‘best’ specification. Once you have decided on your preferred specification, answer the questions found below the regression model options.
Regression Models:
1. Model 1: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + εi
2. Model 2: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4ENGi + εi
3. Model 3: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4MALEi + εi
4. Model 4: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4PREPi + εi
5. Model 5: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4WHITEi + εi
6. Model 6: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4ENGi + β5MALEi + εi
7. Model 7: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4ENGi + β5PREPi + εi
8. Model 8: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4ENGi + β5WHITEi + εi
9. Model 9: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4MALEi + β5PREPi + εi
10. Model 10: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4MALEi + β5WHITEi + εi
11. Model 11: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4PREPi + β5WHITEi + εi
12. Model 12: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4ENGi + β5MALEi + β6PREPi + εi
13. Model 13: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4ENGi + β5MALEi + β6WHITEi + εi
14. Model 14: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4ENGi + β5PREPi + β6WHITEi + εi
15. Model 15: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4MALEi + β5PREPi + β6WHITEi + εi
16. Model 16: PATi = β0 + β1GPAi + β2APMATHi + β3APENGi + β4ENGi + β5MALEi + β6PREPi + β7WHITEi + εi
17. Model 17: PATi = β0 + β1GPAi + β2APi + β3PREPi + εi
18. Model 18: PATi = β0 + β1GPAi + β2APi + β3PREPi + β4WHITEi + εi
19. Model 19: PATi = β0 + β1GPAi + β2APi + β3ENGi + β4PREPi + β5WHITEi + εi
20. Model 20: PATi = β0 + β1GPAi + β2APi + εi
Section A Questions:
1. Write out the estimated model for your preferred specification including coefficients and standard errors.
2. Evaluate your estimation results with respect to its economic meaning, overall model fit, and the signs and significance of the individual coefficients.
3. What specification problems (omitted variables, irrelevant variables, multicollinearity) might your regression have? Why?
4. Do you have any possible suggestions to improve the model that you were not able to choose based on the models provided?
3 Section B: Correcting a Model of PAT Scores
3.1 Understanding and correcting issues
In this section you will assess the model you selected in the last section for heteroskedasticity and serial correlation, and determine how to interpret and correct for these issues. Using your chosen model from Section A, its residual plot in the Section A appendix, and the scatter plots in the Section B appendix, answer the questions below. Provide a few sentences to justify your answers.
Section B Questions:
1. Do you believe there might be a problem of heteroskedasticity in your chosen model? Do you believe it is pure or impure?
2. Do you believe there might be a problem of serial correlation in your chosen model? Do you believe it is pure or impure?
3. Based on the answers you gave to the two questions above, what would you suggest you do to improve the estimated model and why?
4 Section C: Interpreting a Model of PAT Scores
4.1 Deciding what you can learn from the model
In this section you will assume that a professional econometrician ran two models (Model A and Model B) and determined, based on underlying theory, that the best specification is Model B. It is not your job in this case to question the model but rather to interpret the results. Based on the regression results for Model B, answer all of the questions below, providing your rough work in calculations and at least a few sentences to support your argument. Note that LNPAT is the natural log of PAT scores. Both models are given in the appendix under Section C.
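As a reminder of the standard formulas this section relies on, here is a sketch in generic notation (it does not use the Model B results). It assumes that Model B has LNPAT as its dependent variable, as the note on LNPAT suggests, and that the model has K slope coefficients estimated on n = 70 observations. The symbols t_c, K, and D_i are notation introduced here for illustration.

```latex
\[
\hat{\beta}_k \;\pm\; t_c \cdot SE(\hat{\beta}_k),
\qquad t_c = t_{\alpha/2,\; n-K-1}, \quad \alpha = 0.02 \text{ for a 98\% interval}
\]
\[
\ln(PAT_i) = \beta_0 + \beta_k D_i + \dots + \varepsilon_i
\;\Longrightarrow\;
\%\Delta PAT \approx 100\,\beta_k
\quad \bigl(\text{exactly } 100\,(e^{\beta_k}-1)\bigr)
\text{ when the dummy } D_i \text{ changes from 0 to 1}
\]
```

If a dummy also appears in an interaction term (for example MALE in GPAMALE), its total effect depends on the value of the interacting variable as well, so the second expression applies only to the part of the effect captured by that single coefficient.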
Section C Questions:
1. Calculate the 98% two-sided confidence interval for the coefficient on MALE. Interpret this coefficient and what the confidence interval you calculated implies for your interpretation.
2. Test whether the absolute value of the coefficient on GPAWHITE is greater than the absolute value of the coefficient on GPAENG. Explain the meaning of this test result in terms of PAT scores.
3. Draw and indicate the slope and intercept of the estimated models (lines of best fit) relating GPA to the natural log of PAT scores for white males vs. non-white females conditional on them having taken Advanced Placement classes and speaking English as their first language. Interpret the two estimated lines in words.
4. Solve for the impact on PAT scores of a student having a GPA of 2 rather than a GPA of 0, given they did not take AP courses, are non-white, male and do not speak English as a first language. Show all your work in this calculation.
5. Based on inference using the results from Model B but also taking into account both models, do you believe there is potential evidence of discrimination/bias in the way PATs are designed or administered?
5 Section D: Working on a Model of PAT Scores in Stata
5.1 Show you can generate your own results using code
In this section you will indicate the code you would plan to use in Stata to achieve some basic tasks. This draws on the knowledge contained in the labs, lectures, the data project, and the Stata help session, whose code you can refer back to. For each question below, provide some basic Stata code that could be run and would achieve the results requested. There are often multiple correct ways to approach the coding, some more efficient than others, but the only consideration will be whether the desired outcome is achieved. Note that you do not need to actually run the code on a data set; just indicate what you believe to be a correct approach. You can assume that the variables indicated in this assignment are already loaded and ready in Stata.
Section D Questions:
1. Transform the GPA variable into a new variable measuring the natural log of GPA called LNGPA
2. Run a regression of LNPAT on LNGPA
3. Scatter LNPAT against LNGPA and display the line of best fit (linear regression line) for the model you just estimated
4. Calculate the residuals and create a new variable for them called RES
5. Calculate the fitted values and create a new variable for them called YHAT
6. Scatter the residuals (RES) against the fitted values (YHAT) to check for any issues
7. Run a new regression of LNPAT on LNGPA, AP, MALE, ENG
8. At the 1% level of significance, test whether the true coefficient on MALE could be equal to the true coefficient on ENG
9. Test for specification error in the regression you ran
10. Test for heteroskedasticity in the regression you ran
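For reference, the objects named in questions 1 to 6 can be written as follows. This is a notational sketch only, not Stata code; the hats and the symbol e_i are generic notation assumed here. YHAT holds the fitted values from the simple regression of LNPAT on LNGPA, and RES holds the corresponding residuals.

```latex
\[
LNGPA_i = \ln(GPA_i),
\qquad
\widehat{LNPAT}_i = \hat{\beta}_0 + \hat{\beta}_1\, LNGPA_i \quad (\text{YHAT}),
\qquad
e_i = LNPAT_i - \widehat{LNPAT}_i \quad (\text{RES})
\]
```

Since GPA is measured on a scale from 0 to 5, note that the natural log is undefined for a GPA of exactly 0, so any such observation would have no value for LNGPA.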
6 Appendix:
6.1 Section A Estimated Models
Regression Models 1 to 20: [for each model, the estimated regression output and a residual plot of Residuals against Fitted values]
6.2 Section B Scatter Plots
[Scatter plots of PAT against each of GPA, AP, ENG, APMATH, MALE, PREP, APENG, WHITE, and YEAR]
6.3 Section C Additional Model Estimations
Regression Model A:
Regression Model B: