Assignment 1
ENN543, Data Analytics and Optimisation, Semester 2, 2020
This document sets out the three (3) questions you are to complete for ENN543 Assignment 1. The assignment is worth 25% of the overall subject grade. Weights for individual questions are indicated throughout the document. Students should submit their answers as a single, separate document (either a PDF or a Word document) and upload it to TurnItIn.
Further Instructions:
1. Data required for this assessment is available on Blackboard alongside this document in ENN543 Assessment 1 Data.zip. Please refer to the individual questions to see which data to use for each question.
2. Answers should be submitted via the TurnItIn submission system, linked from Blackboard. In the event that TurnItIn is down, or you are unable to submit via TurnItIn, please email your responses to enn543query@qut.edu.au.
3. MATLAB code or scripts (or equivalent materials for other languages) should be submitted as supplementary material (i.e. additional files) or appendices. Note that this material will not be directly marked (i.e. marks will not be assigned for code quality). Figures and outputs/results that are critical to question answers should be included in the main question response, and should not appear only in an appendix or supplementary file.
4. Students who require an extension should lodge their extension application with HiQ (see http://external-apps.qut.edu.au/studentservices/concession/). Please note that teaching staff (including the unit coordinator) cannot grant extensions.
Problem 1. Linear Regression (20%). Prediction of the residuary resistance of sailing yachts at the initial design stage is of great value for evaluating the ship's performance and for estimating the required propulsive power. Essential inputs include the basic hull dimensions and the boat velocity. The Delft data set comprises 308 full-scale experiments, which were performed at the Delft Ship Hydromechanics Laboratory for that purpose. The results of these experiments are in the file yacht.dat. These experiments include 22 different hull forms, derived from a parent form closely related to the Standfast designed by Frans Maas.
The columns correspond to the following variables (in order):
• Residuary resistance per unit weight of displacement, adimensional;
• Longitudinal position of the center of buoyancy, adimensional;
• Prismatic coefficient, adimensional;
• Length-displacement ratio, adimensional;
• Beam-draught ratio, adimensional;
• Length-beam ratio, adimensional;
• Froude number, adimensional.
Using this data:
1. Fit a model to predict the resistance per unit weight of displacement as a function of the other variables. Discuss if this is a valid model.
2. Given the above model as a starting point, investigate how it can be improved. In this you should consider:
(a) The use of training and validation datasets. The data should be divided such that the split between these two sets is approximately 80% for training and 20% for validation.
(b) Are all variables important for the model?
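The split described in item 2(a) can be sketched as follows. This is a minimal illustration in Python with NumPy, not a model answer: the data is a synthetic stand-in for yacht.dat (308 rows, six predictors), and the names `add_intercept`, `rmse` and `true_coefs` are hypothetical. It shuffles the rows, keeps roughly 80% for training, fits ordinary least squares on the training rows only, and reports the error on both sets.

```python
import numpy as np

# Synthetic stand-in for yacht.dat (an assumption: the real file is not read here).
rng = np.random.default_rng(0)
n_samples, n_features = 308, 6
X = rng.normal(size=(n_samples, n_features))
true_coefs = np.array([0.5, -1.0, 0.0, 2.0, 0.0, 3.0])  # hypothetical values
y = 1.5 + X @ true_coefs + rng.normal(scale=0.1, size=n_samples)

# Shuffle the row indices, then keep roughly 80% for training, 20% for validation.
indices = rng.permutation(n_samples)
n_train = int(round(0.8 * n_samples))
train_idx, val_idx = indices[:n_train], indices[n_train:]


def add_intercept(features):
    """Prepend a column of ones so the model includes a constant term."""
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares, fitted on the training rows only.
beta, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)


def rmse(features, targets, coefs):
    """Root-mean-square error of the linear model on the given rows."""
    residuals = targets - add_intercept(features) @ coefs
    return float(np.sqrt(np.mean(residuals ** 2)))


train_rmse = rmse(X[train_idx], y[train_idx], beta)
val_rmse = rmse(X[val_idx], y[val_idx], beta)
print(f"train RMSE = {train_rmse:.3f}, validation RMSE = {val_rmse:.3f}")
```

A validation error much larger than the training error would suggest over-fitting. Coefficients that stay close to zero relative to their uncertainty are one possible starting point for item 2(b).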
Problem 2. Regularised Regression (40%). Web pages collect large volumes of data on page views, page links, etc., to monitor readership. For commercial ventures, this can help inform publishing and layout decisions, as well as advertising. The BlogFeedback dataset contains data on blog readership, and can be used to predict page views in the next 24 hours based on past readership data.
You have been supplied with two variants of this data:
1. Files named blogData noBow train.csv and blogData noBow test.csv contain features that capture the average readership information for the blog, and information for the specific post (see blogData Variables.txt for further information);
2. Files named blogData train.csv and blogData test.csv contain all the features of the noBow files alongside 200 bag-of-words features (see footnote 1) that capture the blog post content.
Note that the testing data contains examples from later in time than the training data, simulating a real-world case where the model is trained on historic data to predict the future.
Using this data:
1. Fit models using linear, Ridge and LASSO regression on the noBow data. With these models consider the following:
(a) Determine the best value of λ to use in the Ridge model to obtain the best predictive model.
(b) Determine the best value of λ to use in the LASSO model to obtain the best predictive model.
2. Fit models using linear, Ridge and LASSO regression on the data containing the bag-of-words features. With these models consider the following:
(a) Determine the best value of λ to use in the Ridge model to obtain the best predictive model.
(b) Determine the best value of λ to use in the LASSO model to obtain the best predictive model.
3. Compare the performance of the two sets of linear, Ridge and LASSO models (with and without the bag-of-words features). You should consider factors such as the errors of the models, the R² and adjusted R², and the model validity in your discussion. Which, if any, models are suitable for use? Justify your response.
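One way to approach the choice of λ and the R² comparison is sketched below in Python with NumPy. It is an illustration under stated assumptions, not a prescribed method: the data is synthetic rather than the BlogFeedback files, the features are assumed to be on comparable scales, the λ grid is arbitrary, and the function names (`ridge_fit`, `predict`, `r_squared`, `adjusted_r_squared`) are hypothetical. The last 20% of the training rows are held out as a validation set so that the test set plays no part in choosing λ. Only Ridge is shown because it has a closed-form solution; the same grid search could be applied with a LASSO solver.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge fit on centred data, so the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coefs = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ coefs, coefs


def predict(X, intercept, coefs):
    return intercept + X @ coefs


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise R^2 for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Synthetic data (an assumption): 20 features, only the first five informative.
rng = np.random.default_rng(1)
n, p = 400, 20
X = rng.normal(size=(n, p))
coefs_true = np.zeros(p)
coefs_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = 2.0 + X @ coefs_true + rng.normal(scale=1.0, size=n)

# Hold out the last 20% of the training rows for validation (keeps time order).
n_fit = int(0.8 * n)
X_fit, y_fit, X_val, y_val = X[:n_fit], y[:n_fit], X[n_fit:], y[n_fit:]

# Evaluate each candidate lambda on the validation rows.
lambdas = np.logspace(-3, 3, 13)
val_rmse = []
for lam in lambdas:
    b0, b = ridge_fit(X_fit, y_fit, lam)
    errors = y_val - predict(X_val, b0, b)
    val_rmse.append(float(np.sqrt(np.mean(errors ** 2))))
best_lam = float(lambdas[int(np.argmin(val_rmse))])

# Refit with the chosen lambda and report R^2 and adjusted R^2.
b0, b = ridge_fit(X_fit, y_fit, best_lam)
r2 = r_squared(y_val, predict(X_val, b0, b))
adj_r2 = adjusted_r_squared(r2, len(y_val), p)
print(f"best lambda = {best_lam:g}, R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Plotting the validation error against λ on a log scale is a common way to check that the minimum does not sit at the edge of the grid.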
Footnote 1: Bag-of-words features capture the number of instances of particular words in a document. An introduction to bag-of-words can be found at https://machinelearningmastery.com/gentle-introduction-bag-words-model/. Note, however, that an understanding of bag-of-words is not needed for this question or subject.
Problem 3. Clustering (40%). Sensors such as accelerometers and gyroscopes are becoming increasingly common in wearable and mobile devices. From these signals, it is possible to detect different activities, and potentially even different people. You have been supplied with a set of data collected using the accelerometer on a smartphone that captures acceleration as people go about daily office tasks. The data has five columns as follows:
• X, which indicates acceleration in the X direction;
• Y, which indicates acceleration in the Y direction;
• Z, which indicates acceleration in the Z direction;
• ActivityID, a categorical variable indicating which of 7 activities is being performed;
• SubjectID, a categorical variable indicating which of 10 subjects the sample corresponds to.
Using this data, you are to investigate if the classes of activity and the users can be separated by clustering the three acceleration variables. In particular you are to:
1. Cluster the data using a GMM with the aim of:
(a) Separating the data into the 7 activity classes. Using the provided ground truth, evaluate the accuracy of the clustering result.
(b) Separating the data into the 10 identity classes. Using the provided ground truth, evaluate the accuracy of the clustering result.
(c) Separating the data into 70 clusters such that each cluster corresponds to a particular individual performing a particular activity. Using the provided ground truth, evaluate the accuracy of the clustering result.
2. Repeat the three clustering tasks using HAC and DBSCAN, and compare the performance of the clustering results obtained using the GMM, HAC and DBSCAN. Comment on any differences observed between the three methods, and which method is most suitable in this situation. Your discussion should consider not just performance, but the suitability of each approach given the information available in the task, and the model hyper-parameters that need to be set.
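Cluster indices carry no inherent meaning, so they must be related to the ground-truth labels before an accuracy can be computed. The sketch below, in Python with NumPy, shows one possible interpretation (an assumption, not the required measure): each cluster is assigned the most common true label among its members, and accuracy is the fraction of samples that match their cluster's label. The function names are hypothetical, and the optional `noise_label` argument assumes the clustering tool marks unassigned points with a special ID, as some DBSCAN implementations do. The cluster assignments here are toy values; fitting the GMM, HAC or DBSCAN models is not shown.

```python
import numpy as np


def majority_vote_accuracy(cluster_ids, true_labels, noise_label=None):
    """Fraction of samples whose true label equals their cluster's majority label.

    Samples whose cluster ID equals noise_label (if given) count as incorrect.
    """
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    correct = 0
    for cluster in np.unique(cluster_ids):
        if noise_label is not None and cluster == noise_label:
            continue  # noise points are never counted as correct
        members = true_labels[cluster_ids == cluster]
        _, counts = np.unique(members, return_counts=True)
        correct += int(counts.max())
    return correct / len(true_labels)


def pair_labels(activity_ids, subject_ids):
    """Combine activity and subject IDs into one label per sample (70-class case)."""
    return np.array([f"{a}-{s}" for a, s in zip(activity_ids, subject_ids)])


# Toy example: three clusters compared against three true classes.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
truth = [1, 1, 2, 2, 2, 3, 3, 1]
accuracy = majority_vote_accuracy(clusters, truth)
print(accuracy)  # 0.75: two members of each cluster match its majority label

# Toy example with one point flagged as noise (cluster ID -1).
noisy_accuracy = majority_vote_accuracy([0, 0, -1, 1], [1, 1, 1, 2], noise_label=-1)
print(noisy_accuracy)  # 0.75: the noise point counts as an error
```

Note that this measure rises as the number of clusters grows, since small clusters tend to be purer, so it should be read alongside the number of clusters each method produces.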