National University of Singapore
Department of Statistics and Applied Probability
ST5213 Categorical Data Analysis II
Assignment 2
Important: This assignment accounts for 20% of your final grade. Your work must be submitted to the lecturer by 7 p.m. on Tuesday, 9 April 2019. Any student who fails to submit by the deadline will receive a late-submission penalty (20% per day late) unless the lecturer is advised as soon as possible of any extenuating circumstances. You may submit before 9 April by dropping your work into the lecturer’s pigeonhole (S16 Level 7), or during the lecture on 9 April. Submissions by email will not be accepted or marked.
Plagiarism: The work that you submit must be your sole effort (i.e. not copied from anyone else). You may be severely penalized if found guilty of plagiarism.
The two assignment tasks involve the analysis of some data and you should hand in a formal report for each task.
Format: The report for each task should not exceed three pages of A4 paper, including any relevant figures or tables. R code and output can be attached as an appendix. Please use Times New Roman font (11-12 point), print your assignment double-sided on A4 paper and use the cover page provided. You may be marked down for exceeding the page limit or for not following any of the above instructions.
The aim of the report is to convey the aims, methodology and results of your data analysis in a concise, readable fashion with appropriate use of figures or tables for summarization. It is strongly recommended that you structure your report into sections along the following lines.
• Introduction: Summarize the data and the aims of the analysis.
• Methodology: Describe clearly your strategy for selecting a model and the statistical methods that you use, with evidence for support.
• Results: Describe the results of your analysis and their interpretations.
• Discussion: Draw conclusions (based on your results) as necessary with supporting evidence.
Marks for each mini-report will be awarded for
• Exposition: your mini-report should be well-organized. You should aim to write in a concise, yet readable, manner.
• Statistical content: marks will be awarded for the correct use of appropriate statistical techniques, and for the correct interpretation of results from these techniques.
Note: You will be assessed based on the report alone. I will refer to your R code and output only if it is not clear from your report what the results of your analysis are, or if I wish to locate a source of error.
Class index number:
NATIONAL UNIVERSITY OF SINGAPORE
ST5213 Categorical Data Analysis II Assignment 2
(Semester 2: AY 2018/2019)
Name:
Matriculation number:
Task 1:
Task 2:
Total:
Task 1.
(a) Consider the 2 × 2 contingency table below, where the observations of Y at each setting of X are independent, and the margins n1+ and n2+ are fixed. Then n11 ∼ Bin(n1+, π1) and n21 ∼ Bin(n2+, π2) independently, where π1 = P(Y = 1 | X = 1) and π2 = P(Y = 1 | X = 2).

                 Y
            1       2       Total
X     1     n11     n12     n1+
      2     n21     n22     n2+
Total       n+1     n+2     n
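Write down the maximum likelihood estimates, $\hat{\pi}_1$ and $\hat{\pi}_2$, of $\pi_1$ and $\pi_2$ respectively. Let $d = \pi_1 - \pi_2$. Then $\hat{d} = \hat{\pi}_1 - \hat{\pi}_2$ is a natural estimate of $d$. Find an expression for $\mathrm{var}(\hat{d})$ and hence show that the (Wald) 95% confidence interval for $d$ is given by

$$\hat{\pi}_1 - \hat{\pi}_2 \pm 1.96 \sqrt{\frac{\hat{\pi}_1(1 - \hat{\pi}_1)}{n_{1+}} + \frac{\hat{\pi}_2(1 - \hat{\pi}_2)}{n_{2+}}}.$$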
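As a numerical illustration of the interval in (a), the following Python sketch evaluates it for arbitrary counts. The function name `wald_ci_diff` and the example counts are hypothetical and are not part of the assignment data; the same calculation can be done in R.

```python
from math import sqrt


def wald_ci_diff(n11, n1_total, n21, n2_total, z=1.96):
    """Wald interval for pi1 - pi2 from two independent binomial samples.

    n11 / n1_total and n21 / n2_total are the sample proportions;
    z = 1.96 gives an approximate 95% interval.
    """
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    # Estimated standard error of the difference of two independent proportions
    se = sqrt(p1 * (1 - p1) / n1_total + p2 * (1 - p2) / n2_total)
    d_hat = p1 - p2
    return d_hat - z * se, d_hat + z * se


# Hypothetical counts (NOT from the assignment): 30 of 50 versus 20 of 50.
lower, upper = wald_ci_diff(30, 50, 20, 50)
print(f"95% Wald CI for d: ({lower:.4f}, {upper:.4f})")
```

Note that the interval is not constrained to lie in [−1, 1], and the estimated variance of a sample proportion equal to 0 or 1 is zero, so the interval can behave poorly for small samples.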
(b) In the case “Sheehan vs. Daily Racing Form”, the plaintiff, Jim Sheehan, alleged that his discharge (after a corporate acquisition) was discriminatory, as it was based solely on age. He offered as evidence a list showing that while 9 of the 11 employees aged 48 or more were discharged, none of the six employees under the age of 48 were discharged.
i. Use Pearson’s χ² test of independence to decide whether these data provide evidence of age discrimination.
ii. Using the results in (a), construct a 95% confidence interval for the difference between the probability that an older worker is discharged and the probability that a younger worker is discharged. Does this interval lead to the same conclusion as the test in (b) i.?
iii. Use Fisher’s exact test to assess the evidence. Does the test support the possibility of age discrimination?
iv. Two additional older employees affected by the acquisition, who were not discharged, were not on the original list. Does including these two employees change the results in parts i.–iii.?
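The two tests named in parts i. and iii. can be sketched in plain Python as follows. This is only a minimal sketch: the function names are hypothetical, the Pearson statistic is computed without a continuity correction, and the two-sided Fisher p-value uses the convention of summing the probabilities of all tables no more likely than the observed one (other conventions exist). The counts are those described in (b), arranged as rows (older, younger) and columns (discharged, not discharged).

```python
from math import comb, erfc, sqrt


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) and its
    chi-square(1) p-value for the 2x2 table [[a, b], [c, d]].
    Assumes no row or column total is zero."""
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
    p_value = erfc(sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, c1, n = a + b, a + c, a + b + c + d
    # Possible values of the (1,1) cell given the fixed margins
    k_min, k_max = max(0, c1 - (n - r1)), min(r1, c1)
    # Hypergeometric probability of each possible table
    probs = {k: comb(r1, k) * comb(n - r1, c1 - k) / comb(n, c1)
             for k in range(k_min, k_max + 1)}
    p_obs = probs[a]
    # Small relative tolerance guards against floating-point ties
    return sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))


# Counts from (b): older 9 discharged / 2 not; younger 0 discharged / 6 not.
chi2, p_chi2 = pearson_chi2_2x2(9, 2, 0, 6)
p_fisher = fisher_exact_2x2(9, 2, 0, 6)
print(f"Pearson chi-square = {chi2:.3f}, p = {p_chi2:.4f}")
print(f"Fisher exact two-sided p = {p_fisher:.4f}")
```

With samples this small the chi-square approximation may be unreliable, which is one reason to compare it with the exact test. For part iv., the same functions can be called again with the older row changed to 9 discharged and 4 not discharged.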
Task 2. The possibility of racial discrimination in the application of the death penalty in the United States is a highly controversial issue, and various studies of the patterns of death penalty application have been undertaken to investigate it. The following tables are two examples of such studies. The first table cross-classifies blacks who had been convicted of murder in Georgia by whether they received the death penalty, race of the victim, and aggravation level of the crime from 1 (least aggravated) to 6 (most aggravated).
Aggravation level   Race of victim   Death penalty
                                     Yes     No
1                   White              2     60
                    Black              1    181
2                   White              2     15
                    Black              1     21
3                   White              6      7
                    Black              2      9
4                   White              9      3
                    Black              2      4
5                   White              9      0
                    Black              4      3
6                   White             17      0
                    Black              4      0
Table 1: Cross-classification of blacks convicted of murder in Georgia.
The second cross-classifies homicide cases in North Carolina where the death penalty was possible by whether the defendant received the death penalty, race of the defendant, and race of the victim.
Race of victim   Race of defendant   Death penalty
                                     Yes     No
Nonwhite         Nonwhite             29    587
                 White                 4     76
White            Nonwhite             33    251
                 White                33    541
Table 2: Cross-classification of homicide cases in North Carolina.
The question of interest is whether the death penalty is applied in an unfair way. For the first study, “fairness” would imply (conditional) independence between death penalty application and the race of the victim, while in the second it would imply independence between death penalty application and both the race of the victim and the race of the defendant. Are these forms of independence consistent with the observed tables? What do you think these studies say about the fairness of application of the death penalty in Georgia and North Carolina, respectively?
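For the second study, the independence described above corresponds to death penalty being independent of the four victim-by-defendant race combinations jointly. A minimal Python sketch of the likelihood-ratio statistic G² for that model is given below, using the Table 2 counts as read from the table; the function and variable names are hypothetical, and the sketch illustrates one candidate model rather than a complete model search.

```python
from math import log

# Table 2 counts keyed by (race of victim, race of defendant):
# (death penalty yes, death penalty no)
table2 = {
    ("Nonwhite", "Nonwhite"): (29, 587),
    ("Nonwhite", "White"): (4, 76),
    ("White", "Nonwhite"): (33, 251),
    ("White", "White"): (33, 541),
}


def g2_joint_independence(table):
    """G^2 and degrees of freedom for the model in which the two-level
    outcome is independent of the combined row classification."""
    n = sum(yes + no for yes, no in table.values())
    total_yes = sum(yes for yes, _ in table.values())
    col_totals = (total_yes, n - total_yes)
    g2 = 0.0
    for counts in table.values():
        row_total = sum(counts)
        for obs, col_total in zip(counts, col_totals):
            fitted = row_total * col_total / n  # closed-form fitted value
            if obs > 0:  # a zero count contributes nothing to G^2
                g2 += 2 * obs * log(obs / fitted)
    df = (len(table) - 1) * (2 - 1)
    return g2, df


g2, df = g2_joint_independence(table2)
print(f"G^2 = {g2:.2f} on {df} df")
```

The statistic would be compared with a chi-square distribution on the stated degrees of freedom; a large value indicates that the joint independence model fits poorly, and less restrictive loglinear models should then be examined.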
Address the above questions by searching for the loglinear model that best explains the association patterns in each table. Use an association graph to represent the conditional independence structure in the selected model and interpret the model. Explain whether the zero cells in Table 1 affect your analysis.
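For Table 1, one candidate model is conditional independence of death penalty and race of victim given aggravation level, whose fitted values have a closed form within each level. The Python sketch below computes the G² contribution of each level and shows one way the zero cells can matter. The names are hypothetical, and the degrees-of-freedom rule used (a level with an empty row or column total contributes no degrees of freedom) is an assumption reflecting one common convention, not a prescribed method.

```python
from math import log

# Table 1 counts per aggravation level:
# ((white victim: yes, no), (black victim: yes, no))
table1 = {
    1: ((2, 60), (1, 181)),
    2: ((2, 15), (1, 21)),
    3: ((6, 7), (2, 9)),
    4: ((9, 3), (2, 4)),
    5: ((9, 0), (4, 3)),
    6: ((17, 0), (4, 0)),
}


def g2_conditional_independence(strata):
    """Per-level G^2 contributions and adjusted df for independence of the
    row and column variables within each 2x2 stratum."""
    per_level = {}
    df = 0
    for level, rows in strata.items():
        n = sum(sum(row) for row in rows)
        row_totals = [sum(row) for row in rows]
        col_totals = [sum(row[j] for row in rows) for j in range(2)]
        g2 = 0.0
        for row, row_total in zip(rows, row_totals):
            for obs, col_total in zip(row, col_totals):
                if obs > 0:  # zero counts contribute nothing to G^2
                    # fitted value is row_total * col_total / n
                    g2 += 2 * obs * log(obs * n / (row_total * col_total))
        per_level[level] = g2
        # ASSUMPTION: a stratum with an empty margin adds no df
        if all(total > 0 for total in row_totals + col_totals):
            df += 1
    return per_level, df


per_level, df = g2_conditional_independence(table1)
for level, value in per_level.items():
    print(f"level {level}: G^2 contribution = {value:.3f}")
print(f"total G^2 = {sum(per_level.values()):.3f} on {df} adjusted df")
```

At level 6 no defendant avoided the death penalty, so the fitted counts in the “No” column are zero and that level carries no information about the association; this is the kind of effect the question about zero cells asks you to consider.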