Machine Learning and Data Mining in Business Semester 1, 2022
Final Exam
General instructions:
• This exam requires five submissions: written answers (as a PDF file), two Jupyter Notebooks (ipynb files), and HTML versions of the two Jupyter Notebooks.
• Type your answers in the answer document (provided as a separate Word file).
• Make your answers as concise as possible while still fully answering the questions.
• You can answer all questions entirely in plain English, but feel free to use equations if you like.
• Enter your code in the Jupyter Notebook templates. Please follow the structure of the notebooks without modification. Keep the notebooks as organised and readable as possible.
• To generate an HTML copy of a Jupyter Notebook, click on File/Download As/HTML.
• You can use a notebook server such as Google Colab to run your code, provided that you submit your code and output according to the template.
• The marking will consider the exam duration: focus on doing your best within the available time.
• Tip: I strongly suggest you start by doing the minimum necessary to have an end-to-end version of the notebooks as quickly as possible. Then work on improving them.
• Tip: reduce the size of the training set when prototyping so that the code runs faster.
• You can use code from the tutorials without citation. You can also borrow code snippets from other sources with citations. Beyond that, the exam must be entirely your original work.
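As an illustration of the prototyping tip above, the following is a minimal sketch of one possible way to draw a smaller training set while keeping the class proportions roughly intact. The helper `subsample_indices` and the toy labels are hypothetical and are not part of the notebook templates; NumPy is assumed to be available.

```python
import numpy as np

def subsample_indices(y, fraction, seed=0):
    """Return sorted row indices of a stratified random subsample of the labels y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Keep at least one example per class so that no class disappears.
        n_keep = max(1, int(round(fraction * idx.size)))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Toy labels standing in for a real training set (hypothetical data).
y_train = np.array([0] * 800 + [1] * 200)
small_idx = subsample_indices(y_train, fraction=0.1)
print(len(small_idx), y_train[small_idx].mean())
```

The returned indices can be used to slice both the features and the labels. Remember to switch back to the full training set before reporting final results.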
Question 1 (20 marks)
Read the “What went wrong with Zillow” news story included with the exam material.
Zillow is an American online real-estate marketplace that developed a home-flipping business called Zillow Offers. Their strategy was to use machine learning algorithms to estimate the value of residential properties and make offers according to the output of their models. Zillow would then renovate the houses it bought and sell the properties as quickly as possible.
Unfortunately for the company, Zillow paid too much for many houses, leading it to report $304 million in losses in the third quarter of 2021 and forcing the company to reduce its workforce by 25%. Now discontinued, Zillow Offers became one of the most prominent business failures related to machine learning.
Suppose that you are a data scientist for a company developing a service based on supervised learning. Identify two strategies that your data science team can adopt to avoid a failure like Zillow Offers. Justify your answers.
Question 2: Tabular Data (50 marks)
This question uses the Adult census income dataset available from the UCI Machine Learning repository, a popular tabular dataset for benchmarking classification methods. The task is to predict whether an individual’s income is above 50,000 US dollars.
2.1 Exploratory Data Analysis
Computational task (Jupyter Notebook):
Perform exploratory data analysis.
Question (answer document):
List and describe the two most interesting findings from your exploratory data analysis. Give preference to results that helped you to build better models. (10 marks)
2.2 Measuring performance
Computational task (Jupyter Notebook):
• Write a Python function that takes a model or list of models as input and estimates the predictive performance of each model according to relevant metrics. The function should return the result as a table.
• Test the function using a simple baseline model.
Question (answer document):
Justify your model selection approach and choice of metrics. Identify the trade-offs in terms of computational cost and accuracy. If relevant, explain how you could improve your estimation strategy if more computing time were available. (10 marks)
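For orientation only, the sketch below shows one possible shape for such a function, assuming scikit-learn and pandas are available. The function name `evaluate_models`, the choice of 5-fold stratified cross-validation, and the three metrics are illustrative assumptions, not the expected answer. The data come from `make_classification` rather than the Adult dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

def evaluate_models(models, X, y, n_splits=5, seed=0):
    """Cross-validate one model or a list of models; return a table of mean scores."""
    if not isinstance(models, (list, tuple)):
        models = [models]
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scoring = ["accuracy", "roc_auc", "neg_log_loss"]
    records = []
    for i, model in enumerate(models):
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
        records.append({
            "model": f"{i}_{type(model).__name__}",
            "accuracy": scores["test_accuracy"].mean(),
            "roc_auc": scores["test_roc_auc"].mean(),
            # scikit-learn reports the negated log loss, so flip the sign back.
            "log_loss": -scores["test_neg_log_loss"].mean(),
            "fit_time_s": scores["fit_time"].mean(),
        })
    return pd.DataFrame(records).set_index("model")

# Synthetic, imbalanced binary data (hypothetical stand-in for the real task).
X, y = make_classification(n_samples=500, weights=[0.75, 0.25], random_state=0)

# A baseline that always predicts the class proportions seen during training.
table = evaluate_models(DummyClassifier(strategy="prior"), X, y)
print(table)
```

More folds or repeated cross-validation would reduce the variance of the estimates at a proportional increase in computing time, which is the trade-off the question asks you to discuss.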
2.3 Linear models
Computational task (Jupyter Notebook):
• Fit a baseline linear model that uses all the predictors. This model should be as simple as possible.
• Build two linear models that progressively improve upon your baseline model. If no improvement is possible, it’s sufficient to experiment with two interesting approaches.
• Build the best possible linear model that you can.
• The notebook should clearly report the performance of all your linear models.
Questions (answer document):
(a) Describe your baseline linear model and results. (2 marks)
(b) Describe and justify the two improvements you made to your baseline model. If there was no improvement, describe your two experiments. Discuss the results. (4 marks)
(c) Describe and justify your best linear model. Discuss the results. What can you conclude from this part? (4 marks)
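As a rough illustration of what a simple linear baseline on mixed numerical and categorical predictors might look like, the sketch below combines standardisation, one-hot encoding, and logistic regression in a scikit-learn pipeline. The column names and the synthetic data are hypothetical and do not correspond to the Adult dataset. The choice of logistic regression is an assumption, not a prescribed method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic table with two numerical columns and one categorical column (hypothetical).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "hours": rng.integers(10, 60, size=n),
    "sector": rng.choice(["private", "public", "self"], size=n),
})
y = ((X["age"] + X["hours"] + rng.normal(0, 10, size=n)) > 85).astype(int)

numeric_cols = ["age", "hours"]
categorical_cols = ["sector"]

# Scale numerical predictors and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
baseline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Hold out a validation set so the score is not computed on the training rows.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
baseline.fit(X_train, y_train)
valid_accuracy = baseline.score(X_valid, y_valid)
print(f"Validation accuracy: {valid_accuracy:.3f}")
```

Keeping the preprocessing inside the pipeline means it is refit on each training split, which avoids leaking validation data into the scaler or encoder.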
2.4 Nonlinear models
Computational task (Jupyter Notebook):
• Fit a baseline model based on your preferred nonlinear method. Keep the implementation simple.
• Using the same learning algorithm, build two models that progressively improve upon your baseline model. If no improvement is possible, it’s sufficient to experiment.
• Build the best possible nonlinear model that you can.
• The notebook should clearly report the performance of all your nonlinear models.
Questions (answer document):
(a) Describe and justify your baseline model and its implementation. Discuss the results. (2 marks)
(b) Describe and justify the two improvements you made to your baseline model. If there was no improvement, describe and explain your two experiments. Discuss the results. (4 marks)
(c) Describe and justify your best nonlinear model and its implementation. Discuss the results. Why do you think this learning algorithm performed better than other methods? What can you conclude from this part? (4 marks)
2.5 Final Model and evaluation
Computational task (Jupyter Notebook):
(a) Develop or select your final model for predicting test data.
(b) Evaluate the performance of the final model, best linear model, and best nonlinear model on the test set. The final model may be the same as the best linear or nonlinear model, in which case you’d only compare two models.
Questions (answer document):
(a) Describe and justify your final model and its implementation. Why did you choose this model? (5 marks)
(b) Discuss the results. What can you conclude from the entire analysis? (5 marks)
Question 3: Deep Learning (30 marks)
This question uses the Fashion MNIST dataset, containing 28×28 grayscale images of fashion items and associated labels from ten classes. The objective is to classify each image as showing a t-shirt, trousers, pullover, dress, coat, sandal, shirt, sneaker, bag, or ankle boot. There are 60,000 images in the training set and 10,000 in the test set.
The Jupyter Notebook for this question downloads the dataset and shows some of the images.
Computational task (Jupyter Notebook):
(a) Implement and train the baseline model specified by the diagram below. Please note that interpreting the diagram is part of the assessment. Evaluate the model on the test set. You must implement the model using the corresponding PyTorch layers.
(b) Using the baseline model as a starting point, find two ways to improve accuracy on the test set by modifying the architecture or adding a suitable technique to its implementation or training. The idea here is to stay somewhat close to the baseline network, instead of using a completely different architecture. If no improvement is possible, it’s sufficient to experiment.
(c) Build the best image classifier that you can while keeping the computing time reasonable. You can use any architecture and implementation in this task.
Questions (answer document):
The following questions refer to the corresponding computational task.
(a) Describe and justify your implementation choices for the baseline model. Discuss the results.
(b) Describe and justify the two improvements. What is the test accuracy? Discuss the results.
(c) Describe and justify your final image classifier and its implementation. Discuss the results.
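The sketch below illustrates, under stated assumptions, the general mechanics of training a PyTorch classifier for one epoch and measuring its accuracy. The network is a placeholder (a single linear layer) and is NOT the baseline from the exam diagram. The data are random tensors with the same shape as Fashion MNIST images, not the real dataset, and the helper names `train_one_epoch` and `accuracy` are hypothetical.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random tensors shaped like Fashion MNIST batches (hypothetical stand-in data).
images = torch.rand(256, 1, 28, 28)
labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

# Placeholder network for illustration only, not the model in the diagram.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_one_epoch(model, loader):
    """Run one pass of mini-batch gradient descent over the loader."""
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

@torch.no_grad()
def accuracy(model, loader):
    """Return the fraction of correctly classified examples in the loader."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

train_one_epoch(model, loader)
acc = accuracy(model, loader)
print(f"Accuracy on the toy data: {acc:.3f}")
```

With real data, the accuracy reported as the test accuracy should be computed on a loader built from the 10,000 test images, which must not be used for training or model selection.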
Additional guidance (Questions 2 and 3)
You don’t have to write a lot to get full marks; you just have to be precise. The marking will be based on the following:
• Whether your models achieve the expected level of performance.
• The quality, thoroughness, and correctness of the computational work that supports your answers.
• The extent to which your written answers demonstrate that you used your knowledge of machine learning to build better models.
• The extent to which your written answers and computational work demonstrate that you applied data analysis skills to build better models and obtain insights.
• The quality of your conclusions (within the scope of the task) and whether the reasoning behind them is correct.