University of Toronto Scarborough
Introduction to Machine Learning
and Data Mining
CSCC11H3
Fall 2021
Take-home Final Exam
Due December 21, 2021 at 11:59 pm
Analysis of the Stock Market Fluctuations, Anomalies
and Fear Index Using Data-driven Methodologies
Overview
Throughout this take-home exam, you will use your machine learning and analytic skills by building an
algorithm to investigate the driving forces behind anomalous events in stock markets. In particular, you
will characterize the extent to which the interplay among economic, political, psychological and
behavioural forces can leave an impact on stock market movements by resorting to applications of
various machine learning concepts and techniques. The data provided to you cover diverse sources of
information mainly retrieved from the public news and social media, multiple sectors of the
government and international organizations, balance sheets of many companies, etc. The ultimate goal
will be to construct a classifier for predicting the future direction of the VIX index (also known as the
fear index) and hence the stock market, which in turn will provide us with a significant amount of
valuable information. Furthermore, you will use probability mixture models, clustering algorithms,
neural networks, logistic regression and class conditionals, among several other topics.
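As a minimal sketch of what "direction" could mean as a classification target, the snippet below turns a sequence of index levels into binary labels (1 if the next value is higher, 0 otherwise). The function name `direction_labels` and the toy values are assumptions made purely for illustration; they are not part of the provided data, and you are free to define the target differently.

```python
def direction_labels(closes):
    """Return 1 where the next value is strictly higher than the current one, else 0.

    The result has one fewer element than `closes`, because the last
    observation has no "next" value to compare against.
    """
    return [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]


# Made-up index levels, for illustration only (not real VIX data).
toy_closes = [18.2, 19.5, 19.1, 21.0, 21.0, 20.3]
labels = direction_labels(toy_closes)
print(labels)
```

Note that an unchanged value is labelled 0 here; whether ties should count as "down", "up" or a separate class is a modelling decision worth stating in your report.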
Why this exam?
The take-home exam will teach you the end-to-end process of investigating data through a machine
learning lens. It will teach you how to model complex phenomena based on your knowledge acquired
during the semester and develop algorithms to extract salient features that best represent your data.
Furthermore, you will gain first-hand experience with some of the powerful machine learning
algorithms and, more importantly, have invaluable experience evaluating the performance of your
implemented algorithms and validating the accuracy of their results.
What will I learn?
By the end of this exam, you will be able to:
– Deal with an imperfect, real-world dataset.
– Validate a machine learning result using test data, your critical thinking, and other analytic capabilities.
– Evaluate the accuracy of your machine learning algorithm results using different quantitative metrics.
– Create, select and transform features to compare the performance of different machine learning
algorithms for their fine-tuning.
– Communicate your ideas, methodologies and results concisely, coherently and formally.
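To make "different quantitative metrics" concrete, here is a small sketch that computes accuracy, precision, recall and F1 for a binary classifier from its true and predicted labels. The helper names and the toy label vectors are hypothetical and chosen only for illustration; any equivalent library routine may be used instead.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1; undefined ratios are reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting several metrics side by side matters when the two classes are unbalanced, since accuracy alone can look good for a classifier that always predicts the majority class.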
Pep Talk
The exam description might sound a bit complicated and time-consuming on a first reading. However,
I can assure you that it has been designed to be completed in 5-8 hours, and I believe most of you
will enjoy working on the problem once you get engaged in it. Besides, I am scheduling bi-daily
office hours to help you out along this path, and of course, you are always welcome to send me an
email and book a personal appointment.
Outline of the Written Report
(Please note that, in general, I am not expecting a work of literature; it is sufficient to
convey your main points and interpret your results fruitfully. Still, the following items make up
the structure of most formal reports that you will write in your future career, hence I mention
some of them.)
1. Jupyter Notebook: You are required to prepare your report using a Jupyter notebook and
submit a PDF or HTML version of the final report on Crowdmark. Please try to organize your
report in different sections and use the various available headings and typesetting to embellish
your work.
2. Title and Abstract: Please choose an appropriate title and provide a brief abstract (at a very
high level) summarizing the entire work, its main points and the most significant findings. You
may aim for at most 200 words.
3. Introduction: This part introduces the reader to the dataset and to the area to which it pertains.
You should describe why this is an important problem to investigate and give the reader a
review of pertinent background information about the underlying problem. Basically, introduce
the reader to the problem and why it is meritorious of investigation. The introduction should be
written at a very basic level (i.e., no mathematics or notation), and remember that your reader
may not know anything about the area in which you are writing. You may aim for 1000 words
maximum. The introduction is also the best place to communicate all the important and critical
questions which you tried to tackle and answer, regardless of the final output. For
instance, you may address the following questions (these are just a few examples and the list is
not exhaustive):
1. Is it even possible to forecast the stock market fear index? What are the reasons in
favour of and against such a possibility?
2. Are the past data available to you sufficient for constructing a predictive model to
forecast the direction of the stock market movements in the future?
3. Are there other factors which might have a great influence on the behaviour that
one observes in stock markets? If so, is it possible to omit these additional elements
from our analysis and simplify the problem for the purpose of making its study more
practical? If not, what would be the main drawbacks if such simplifications were
made anyway?
.
.
.
4. Model Specification: In this section, you want to describe, in clear detail, the data analysis used
to specify your candidate models. Imagine that you are taking the reader by the hand and
leading them through the thought process behind your model selections. In doing
this, however, try not to overdo the first-person writing. It can sound less professional and less
authoritative if you continually write things like, “I tried this, and then I tried that …”.
5. Fitting and Diagnostics: This part of the report should describe the model fitting and
diagnostics techniques you used, with the goal of identifying a “final” model for forecasting.
Identify also what possible deficiencies your final model has. Remember, no model is perfect.
You will also need to be relatively detailed (as much as your time allows) and use your critical
thinking to come up with and answer questions related to the possible obstacles.
6. Forecasting: This section should describe the techniques you used to forecast future
observations. Why is forecasting important? What impacts could forecasting the VIX have?
7. Discussion: Here you want to offer a summary of what you have done and draw your main
conclusions. Also, it is a good idea to discuss here other issues related to the data analysis. For
example, does your analysis have any shortcomings or lack of generalization? What were the
main problems you encountered? It is OK if your final model is not picture-perfect as real-life
data analysis is often more difficult than textbook problems.
8. Bibliography: Cite all the references (if necessary).
9. Appendices: Use appendices to catalogue extra graphics/plots/output. Basically, it is a good
idea to use an appendix to house information that you want the reader to have access to, but that
you feel would interrupt the flow of the main body of the report.
General Points and Grading Scale
1. You are permitted to use any library which would facilitate your understanding of the problem.
2. There is no specific target length for your report. You should do enough to provide an analysis
of this dataset, with attention paid to each section listed above. You may decide to include
descriptions of the methods of analysis which have shed light on some fundamental questions
you tried to answer.
3. Your reports will be read very carefully by my senior PhD assistant and under my supervision
in order to ensure fair assessment of your understanding of the course material and the abilities
you have obtained throughout the course.
4. Your report will be graded out of a total of 100 points, based on your understanding of the
context, your analysis and your writing (this last component is not mandatory given the current
circumstances):
1. Context: Were the questions answered in terms of the variables of the dataset? Have you
attempted to frame your conclusions and interpretations in a subject-matter context? Have
you provided some background information about the data set and why it is of interest?
2. Analysis: Were the chosen models, graphs, and data analyses appropriate for the problem?
Were the analyses carried out correctly? Were your conclusions about the data sensible and
clearly justified by numerical or graphical evidence?
3. Writing: How organized, clearly written and comprehensible is the report? Would the client
reading this report be confident that it was written by an educated, well-trained computer
scientist (of course by considering the restriction pertaining to the short period of time to
write it)?