University of Western Ontario
CS2035 (Jan-May 2020)
Assignment 4
Due Date: 2020-04-02 at 11:55pm
Marking Scheme
This assignment counts for 10% of the total course mark. The table below shows the percentage contributed by each exercise:
Exercise      Contribution
Exercise 1    3%
Exercise 2    3%
Exercise 3    4%
Total         10%
Submitting your assignment
Submit your assignment on OWL, in the “Assignments” section, as well-commented Matlab script files. The file names must follow this convention:
A4_STUDENTNUMBER_EXn.m, where STUDENTNUMBER is your 9-digit student number and n is the exercise number (hence n is either 1, 2 or 3). Functions (if any) can either be in the same script (“all-in-one” style) or in separate files (with the file name the same as the function name).
In any case, make sure you submit all the files the grader needs to run your programs. This includes the CSV data files (the ones provided with this assignment).
Your grade will be based on what the grader can run. A program that does not run (i.e., stops because of any error) will receive a grade of zero.
One way to check this is to copy what you plan to submit into a new folder and run your scripts from this new folder (and/or send it to a friend and ask them to run it).
While you can work with other students, your submission must reflect your own work. Programs that are suspiciously similar will receive a grade of zero.
Exercise 1 – Ovarian Cancer Detection
Serum proteomic pattern diagnostics use protein mass spectrometry to differentiate biological samples from patients with and without cancer. In this exercise, our goal is to build a classifier that can distinguish between cancerous and normal biopsies from the mass spectrometry data. The same biological sample is analyzed by two different spectrometric instruments (called A and B) for each patient.
The data from instrument A are in the file ovarian_A.csv and the data from instrument B are in ovarian_B.csv. In both files, a given row corresponds to the same patient, and the columns represent the spectrometric measurements (their meaning is not relevant here).
The file ovarian_diagnostic.csv indicates, for each of the 216 patients, whether the sampled tumour was cancerous (Cancer) or not (Normal).
We want to determine which of the two instruments (A or B) is better at detecting ovarian cancer.
a) Using a multivariate logistic regression (on all the measurement variables for each instrument) and ROC analysis, determine the better instrument for detecting ovarian cancer in this group of patients. State explicitly the criteria you used to make your decision. Generate a ROC figure that illustrates your findings.
b) We want a true positive rate (TPR) of 90% with the better instrument. What is the false positive rate (FPR) in that case? What would the FPR be with the worse instrument, for the same TPR (90%)?
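As a rough illustration of the workflow described above (not a model solution), the following sketch fits a logistic regression per instrument and compares the ROC curves. It runs on synthetic data: the variables XA, XB and isCancer are hypothetical stand-ins, and using the AUC as the comparison criterion is an assumption, one of several reasonable choices. It relies on fitglm, predict and perfcurve from the Statistics and Machine Learning Toolbox, and it evaluates each model on the same data used for fitting.

```matlab
% Sketch on synthetic data. XA, XB and isCancer are hypothetical stand-ins
% for the two instruments' measurements and the diagnostic labels.
rng(1);                                   % reproducible random numbers
n = 200;
isCancer = double(rand(n, 1) < 0.5);      % 1 = Cancer, 0 = Normal
XA = randn(n, 3) + 1.5 * isCancer;        % instrument A: strong class shift
XB = randn(n, 3) + 0.5 * isCancer;        % instrument B: weak class shift

% Multivariate logistic regression on all columns of each instrument.
mdlA = fitglm(XA, isCancer, 'Distribution', 'binomial');
mdlB = fitglm(XB, isCancer, 'Distribution', 'binomial');
pA = predict(mdlA, XA);                   % fitted probability of Cancer
pB = predict(mdlB, XB);

% ROC curves and AUC; the third argument names the positive class.
[fprA, tprA, ~, aucA] = perfcurve(isCancer, pA, 1);
[fprB, tprB, ~, aucB] = perfcurve(isCancer, pB, 1);

% perfcurve returns TPR in non-decreasing order, so the first point with
% TPR >= 0.90 has the smallest FPR that reaches the target.
fprAt90A = fprA(find(tprA >= 0.90, 1, 'first'));
fprAt90B = fprB(find(tprB >= 0.90, 1, 'first'));

% One figure with both ROC curves and the chance diagonal.
figure;
plot(fprA, tprA, 'b-', fprB, tprB, 'r-', [0 1], [0 1], 'k--');
xlabel('False positive rate');
ylabel('True positive rate');
legend(sprintf('A (AUC = %.3f)', aucA), sprintf('B (AUC = %.3f)', aucB), ...
       'Chance', 'Location', 'southeast');
fprintf('FPR at TPR >= 90%%: A = %.3f, B = %.3f\n', fprAt90A, fprAt90B);
```

With real data the same steps would start from matrices loaded from the CSV files instead of the randn calls; the synthetic class shifts above are chosen only so that the two curves differ visibly.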
Exercise 2 – Pulsars
Pulsars are a rare type of neutron star that produces radio emission. As pulsars rotate, their emission beam sweeps across the sky. When this beam crosses the line of sight of large radio telescopes, it produces a detectable pattern of broadband radio emission. Each pulsar produces a slightly different emission pattern. In practice, most detections are caused by radio frequency interference and noise, making legitimate signals hard to find.
The dataset pulsars.csv contains eight measurements from 2,000 stars obtained with radio telescopes. The description of those eight variables is provided, for information only, in the file pulsars-info.txt. The ninth variable indicates whether the star is a pulsar (1) or not (0). This dataset has been checked by human annotators.
a) Using multi-dimensional scaling (MDS) with the Sammon minimization criterion, show in a single figure that pulsars cluster together when projected onto a two-dimensional space. Write a short comment that helps the reader locate the pulsars in the figure.
b) Evaluate the pulsar detection ability of a multivariate logistic regression (using all eight variables) by calculating its AUC. Is it an effective method? Why?
c) (This question is not directly related to b).) Can you find a clustering method (among the ones seen in class) that successfully identifies a cluster composed mostly of pulsars in the first dataset (pulsars.csv)? You must show your work for this question: if you find one, write the code that identifies the pulsar cluster and produce a single figure. If you do not find a satisfactory clustering method, show what your failed attempts produced in a single multi-panel figure.
d) The file pulsar2.csv contains new measurements from 500 other stars, but their pulsar status has not been checked by a human annotator. Determine which of those 500 stars are pulsars. Aim for a 90% true positive rate. How many new pulsars have you found in this new dataset?
(Hint: use the logistic regression fitted on the first dataset to infer the probability that a star in the second dataset is a pulsar. Then find the threshold that gives a TPR of 90% on the first dataset and use it to classify the candidate pulsars of the second dataset.)
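The procedure in the hint can be sketched as follows, on synthetic data only. All variable names (XTrain, yTrain, XNew and so on) are hypothetical, and the two-column random data merely stand in for a labelled and an unlabelled dataset. The sketch assumes the Statistics and Machine Learning Toolbox (fitglm, predict, perfcurve).

```matlab
% Sketch on synthetic data; every name below is hypothetical.
rng(2);                                        % reproducible random numbers
nTrain = 400;
yTrain = double(rand(nTrain, 1) < 0.2);        % 1 = pulsar, 0 = not a pulsar
XTrain = randn(nTrain, 2) + 2 * yTrain;        % labelled measurements
XNew = randn(100, 2) + 2 * (rand(100, 1) < 0.2);  % unlabelled measurements

% Fit the logistic regression on the labelled dataset only.
mdl = fitglm(XTrain, yTrain, 'Distribution', 'binomial');
pTrain = predict(mdl, XTrain);

% ROC analysis on the labelled dataset. thr(i) is the score threshold
% that produces the point (fpr(i), tpr(i)).
[fpr, tpr, thr, auc] = perfcurve(yTrain, pTrain, 1);

% Largest threshold whose TPR reaches 90% on the labelled dataset.
idx = find(tpr >= 0.90, 1, 'first');
threshold = thr(idx);

% Apply the same model and the same threshold to the unlabelled dataset.
pNew = predict(mdl, XNew);
isPulsarNew = (pNew >= threshold);
nNewPulsars = sum(isPulsarNew);
fprintf('AUC = %.3f, threshold = %.3f, candidates flagged = %d\n', ...
        auc, threshold, nNewPulsars);
```

Note that the 90% TPR is only guaranteed on the labelled data; on the new data it is an estimate that holds to the extent that both datasets follow the same distribution.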
Exercise 3 – Wheat Seeds
The file seeds.csv is a dataset that records seven measurements of kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, coded 1, 2 and 3 respectively in the eighth column. The description of the seven measurements is provided, for information only, in the file seeds-info.txt.
We want to find a clustering method that identifies the Canadian wheat variety reasonably well based on the seven measurements.
a) Perform a classical multi-dimensional scaling (MDS) that projects this dataset onto a 2-dimensional space.
b) Run these four clustering algorithms on the projected MDS points:
• agglomerative complete
• agglomerative centroid
• k-means
• spectral
Plot a four-panel figure in which each panel shows the MDS map. Every data point must be annotated with text indicating its variety (that is 1, 2, or 3), and the colour of each data point must indicate its cluster.
c) (This question is difficult.) For each clustering method, calculate the true positive and false positive rates for the classification of the Canadian variety. Plot those four pairs of rates in a ROC-type figure (i.e., FPR on the x-axis and TPR on the y-axis). Which of the four clustering methods is the best, assuming a true positive rate of at least 95% is required?
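One possible way to turn cluster labels into a TPR/FPR pair is sketched below on synthetic data, for two of the four methods only. Cluster numbers are arbitrary, so the sketch makes an explicit assumption: the cluster that contains the largest share of the target class is treated as the "target" cluster. Other matching rules are possible. All names are hypothetical, and the functions used (pdist, squareform, cmdscale, kmeans, linkage, cluster) come from the Statistics and Machine Learning Toolbox.

```matlab
% Sketch on synthetic data; every name below is hypothetical.
rng(3);                                         % reproducible random numbers
variety = repelem((1:3)', 50);                  % true classes 1, 2, 3
centres = [0 0 0 0; 3 0 0 0; 0 3 0 0];          % one row per class
X = centres(variety, :) + randn(150, 4);        % four synthetic measurements

% Classical MDS on Euclidean distances, keeping two coordinates.
Y = cmdscale(squareform(pdist(X)));
Y = Y(:, 1:2);

% Two clustering methods, each asked for three clusters.
names = {'k-means', 'agglomerative complete'};
clusters = {kmeans(Y, 3, 'Replicates', 5), ...
            cluster(linkage(Y, 'complete'), 'MaxClust', 3)};

% Assumption: the cluster holding most points of the target class is
% taken as the predicted target cluster.
target = 3;
isTarget = (variety == target);
tpr = zeros(1, numel(clusters));
fpr = zeros(1, numel(clusters));
for m = 1:numel(clusters)
    c = clusters{m};
    predicted = (c == mode(c(isTarget)));
    tpr(m) = sum(predicted & isTarget) / sum(isTarget);
    fpr(m) = sum(predicted & ~isTarget) / sum(~isTarget);
end

% ROC-type figure: one point per clustering method.
figure;
plot(fpr, tpr, 'o');
text(fpr + 0.02, tpr, names);
axis([0 1 0 1]);
xlabel('False positive rate');
ylabel('True positive rate');
```

Because each clustering yields a single hard assignment, each method contributes one point to the figure rather than a full curve.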
END OF ASSIGNMENT 4
Document saved on 2020-03-23 07:30:47-04:00