Queen's University STAT 486, Winter 2019
Final Examination
The final exam is due on April 19, 2019 by 4pm.
Please submit a hard copy of your solutions and data analysis report to the mailbox of Jeffery 412.
You must work independently for this final exam.
If you need to cite existing work (books, articles, on-line resources), please indicate the
sources in a list of references.
Part I. Problems (15%)
Define your notation if necessary and give sufficient details when answering the problems.
1. [15 marks] The data set "BoneMarrowDFTime.txt" posted below is extracted from a study of bone marrow transplant for leukemia patients reported by Copelan et al. (1991). The 137 leukemia patients were recruited from four hospitals in the United States and Australia. The patients were treated with a preparative regimen and then given the transplantation. They were grouped into 3 risk categories based on their disease status at the time of transplantation. They were then followed up for up to 7 years. The data set has 3 columns:
1) disease-free survival time (DFtime), which is the time (in days) to relapse, death or end of study,
2) indicator of disease-free status (DFstatus), which =1 if a patient died or relapsed, and =0 if the patient was alive and disease-free, and
3) disease group, which categorizes patients into 3 groups including
acute lymphoblastic leukemia (ALL), identified by Group=1,
acute myelocytic leukemia (AML) and low-risk first remission (Group=2), and
AML high-risk (Group=3).
When answering the following questions, fit a log-normal model (or models) only.
(a) Describe a procedure and carry it out on the data, to test if the 3 groups have the same disease-free survival distribution.
(b) Describe the three graphical approaches of Section 3.3.1 specifically for log-normal models and apply them to the data, for checking the fit of your final model (or models). Notice that these approaches all make use of the estimated survival functions (ESF) or properties or functions of ESF. Try to present your graphs so that it is easy to check the model fit visually and to compare the three groups.
(c) Describe two types of residuals and apply the corresponding residual analysis to the data, for assessing the fit of the log-normal model with the Group factor as the covariate. Which type or types of residuals are more appropriate in the context of analyzing this data set?
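As a reminder of how censoring enters a log-normal fit, the following is a minimal Python sketch (standard library only) of the right-censored log-likelihood: an observed time contributes the log density, a censored time contributes the log survival probability. The function names, the parameter values and the toy data are illustrative assumptions and are not taken from "BoneMarrowDFTime.txt"; this is not a prescribed solution method.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def lognormal_survival(t, mu, sigma):
    """S(t) = 1 - Phi((log t - mu) / sigma) for t > 0."""
    return 1.0 - STD_NORMAL.cdf((math.log(t) - mu) / sigma)


def censored_loglik(times, events, mu, sigma):
    """Log-likelihood of right-censored data under a log-normal model.

    events[i] == 1: time observed exactly, contributes log f(t).
    events[i] == 0: right-censored, contributes log S(t).
    """
    total = 0.0
    for t, d in zip(times, events):
        z = (math.log(t) - mu) / sigma
        if d == 1:
            # log density of T: log phi(z) - log(sigma) - log(t)
            total += math.log(STD_NORMAL.pdf(z)) - math.log(sigma) - math.log(t)
        else:
            total += math.log(1.0 - STD_NORMAL.cdf(z))
    return total


# Made-up toy values, NOT taken from BoneMarrowDFTime.txt.
toy_times = [50.0, 120.0, 400.0, 900.0]
toy_events = [1, 1, 0, 0]
print(censored_loglik(toy_times, toy_events, mu=6.0, sigma=1.5))
```

In practice the parameters would be estimated by maximizing this quantity with a statistical package rather than by hand.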
Part II. Analysis of Breast Cancer Data (85%)
The data set "ma8_surv.txt" is posted on the OnQ course site. The data are obtained from a randomized Phase III comparative study of vinorelbine combined with doxorubicin (new treatment) versus doxorubicin alone (standard treatment) on patients with disseminated metastatic/recurrent breast cancer.
The primary interest of the clinical trial is to study whether the new treatment extends patients' survival time compared to the standard treatment.
Some other variables are also collected on the patients. Detailed descriptions of the variables are given on the last 2 pages of this document. It is also important to conduct exploratory analysis to study how these variables affect the survival time and build an appropriate model (or models) to describe their association.
In your analysis please give priority to semi-parametric and non-parametric methods such as Kaplan-Meier estimates, (weighted) log-rank tests and the Cox regression model. Only apply the parametric methods if the semi-parametric and non-parametric methods work poorly for the data. Support your analysis with appropriate graphical exploration, model checking and residual analysis. Write a report to describe the data, the problems you investigate, and the statistical models and methods you apply; explain and summarize your data analysis; and give clear interpretation and conclusions. You can include in the report, for example, the description of the important variables, interpretation of your models and the relevant parameters…
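For reference, the Kaplan-Meier (product-limit) estimate mentioned above can be sketched in a few lines of Python using only the standard library. The function name and the toy data are illustrative assumptions and are not taken from "ma8_surv.txt"; the sketch uses the usual convention that, at a tied time, deaths are counted before censored observations leave the risk set.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time, in increasing order.

    events[i] == 1 marks a death, 0 marks a censored observation.
    """
    deaths = Counter(t for t, d in zip(times, events) if d == 1)
    leaving = Counter(times)  # every subject leaves the risk set at its own time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        if deaths[t] > 0:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1.0 - deaths[t] / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]
    return curve


# Made-up toy values, NOT taken from ma8_surv.txt.
toy_survival = [5, 8, 8, 12, 15, 20]
toy_dead = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(toy_survival, toy_dead):
    print(t, round(s, 4))
```

A real analysis would normally rely on an established survival-analysis routine in R or SAS, which also provides standard errors and confidence bands.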
Please aim for a clear and concise report. The suggested length is no more than 5 pages of text in the typed report. Tables and/or figures should be inserted in the report (but do not count for length). Please attach R (or SAS) code at the back of your report as an appendix. But your report should be clear and self-contained without referring to the raw code or output. If you need to present extensive numerical results and plots, please summarize and put them in tables and figures and discuss them in the text.
Marking Scheme:
Total marks: 85;
45 marks on statistical analysis; 40 marks on report writing.
The dataset “ma8_surv.txt” contains the data on the survival times and some
baseline characteristics (i.e., information collected at the time when the
patient entered the study) of the patients in a clinical trial: a Phase III
Comparative Study of Vinorelbine Combined with Doxorubicin versus
Doxorubicin Alone in Disseminated Metastatic/recurrent Breast Cancer (CCTG MA.8).
The definitions of these variables are:
Id: The identifier of the patient (character variable);
Survival: Number of days from randomization to death if the patient died, or
from randomization to the last contact if the patient was still alive or was
lost to follow-up at the time of the final analysis;
Dead: =1 if the patient died; =0 if the patient was still alive or was lost to
follow-up (censored);
Arm: =0 if treated by Doxorubicin alone; =1 if treated by Vinorelbine combined
with Doxorubicin;
Age: Age at randomization (in years);
Perform: Performance status of the patient at randomization (=0 if fully active;
=1 if restricted in physically strenuous activity but ambulatory and able to
carry out work of a light or sedentary nature; =2 if ambulatory and capable of
all self care but unable to carry out any work activities; =3 if capable of only
limited self care; =4 if completely disabled);
Meno: =0 if the patient was pre-menopausal at randomization; =1 if post-menopausal;
Measure: =0 if the tumour was measurable; =1 if the tumour was evaluable;
Numsites: number of body sites with tumours observed;
Bone: =1 if the location of the tumour was in bone; =0 if not;
Nodal: =0 if nodal involvement of the tumour at 1st diagnosis was negative;
=1 if positive;
Er: =0 if the estrogen receptor of the tumour was negative; =1 if positive;
Timdiag: time since the tumour was originally diagnosed (in days);
Disfree: time since the tumour was removed (in days);
Global: The score of the patient's global quality of life at the beginning of
the study (0-100);
Adv: =1 if the patient experienced an adverse event at the beginning of the study;
=0 if not;
Chemmet: =1 if the patient was treated with chemotherapy for metastatic cancer at
the beginning of the study; =0 if not;
Regimen: =0 if no prior regimens were taken before randomization; =1 if at least
one prior regimen;
WBC: Blood WBC count at the beginning of the study;
Gran: Blood granulocyte count at the beginning of the study;
Platelet: Blood platelet count at the beginning of the study.
******
Some variables have the value . for some patients; this means the true information
is unknown or missing.
Suggestions about dealing with missing values:
Do not take . as a level of the covariate (factor).
For a variable with missing values, focus on this variable only and check
if it is important in explaining survival time distributions.
If not important, then do not consider the variable in the rest of your
analysis (as if this variable (column) is removed from the data set).
If important, check how many patients have this variable missing,
if only a few (say <2%), then do not include these patients in
your analysis (as if these patients (rows) are removed from the data set).
Imputation is a typical approach for handling missing data, when there is a
noticeable proportion of missing values. It is not covered in this course.
Please clearly describe how you deal with the missing values;
feel free to explore and try any approaches you think are reasonable.
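The bookkeeping suggested above (treat . as missing rather than as a factor level, check the proportion missing, and drop rows only when that proportion is small) can be sketched as follows in Python with the standard library. The whitespace-delimited layout with a header line, the helper names and the toy rows are all assumptions made for illustration; the real file may be laid out differently.

```python
# Made-up rows in an ASSUMED whitespace-delimited layout with a header line;
# the real ma8_surv.txt may be laid out differently.
toy_text = """Id Survival Dead Arm Er
1 300 1 0 1
2 450 0 1 .
3 120 1 1 0
4 800 0 0 1
"""


def read_rows(text):
    """Parse whitespace-delimited text, turning "." into None (missing)."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, body = lines[0], lines[1:]
    return [
        {name: (None if value == "." else value) for name, value in zip(header, row)}
        for row in body
    ]


def missing_fraction(rows, column):
    """Proportion of rows in which the given column is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


def drop_missing(rows, column):
    """Keep only the rows in which the given column is observed."""
    return [row for row in rows if row[column] is not None]


rows = read_rows(toy_text)
print(missing_fraction(rows, "Er"))      # proportion of patients with Er missing
print(len(drop_missing(rows, "Er")))     # rows kept after dropping those patients
```

Whether dropping rows is appropriate for a given variable still depends on the checks described above (importance of the variable and the proportion missing).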
******