PROJECT 3: ASSESS LEARNERS
Table of Contents
About the Project
Your Implementation
Contents of Report
Testing Recommendations
Submission Requirements
Grading Information
Development Guidelines
Optional Resources
This assignment is subject to change up until 3 weeks prior to the due date. We do not anticipate changes; any changes will be logged in this section.
08/26/2023: Updated section 3.8 to more clearly define the requirement to use randomly selected data.
1 OVERVIEW
In this assignment, you will implement four supervised machine learning algorithms from an algorithmic family called Classification and Regression Trees (CARTs). You will also conduct several experiments to evaluate the behavior and performance of the learners as you vary one of their hyperparameters. You will submit the code for the project to Gradescope SUBMISSION. You will also submit to Canvas a report where you discuss your experimental findings.
1.1 Learning Objectives
The specific learning objectives for this assignment are focused on the following areas:
Supervised Learning: Demonstrate an understanding of supervised learning, including learner training, querying, and assessing performance.
Programming: The assignments build upon one another. The techniques developed here regarding supervised learning and CARTs will play important roles in future projects.
Decision Tree Module: The decision tree(s) implemented in this project will be used in at least one future project.
2 ABOUT THE PROJECT
Implement and evaluate four CART regression algorithms in object-oriented Python: a “classic” Decision Tree learner, a Random Tree learner, a Bootstrap Aggregating learner (i.e., a “bag learner”), and an Insane Learner. As regression learners, the goal for your learner is to return a continuous numerical result (not a discrete result). You will use techniques introduced in the course lectures. However, this project may require readings or additional research to ensure an understanding of supervised learning, linear regression, learner performance, performance metrics, and CARTs (i.e., decision trees).
3 YOUR IMPLEMENTATION
You will implement four CART learners as regression learners: DTLearner, RTLearner, BagLearner, and InsaneLearner. Each of the learners must implement this API specification, where LinRegLearner is replaced by DTLearner, RTLearner, BagLearner, or InsaneLearner, as necessary. In addition, each learner’s constructor will need to be revised to align with the instantiation examples provided below.
This project has two main components: First, you will write code for each of the learners and for the experiments required for the report. You must write your own code for this project. You are NOT allowed to use other people’s code or packages to implement these learners. Second, you will produce a report that summarizes the observation and analysis of several experiments. The experiments, analysis, and report should leverage the experimental techniques introduced in Project 1.
For the task below, you will mainly be working with the Istanbul data file. This file includes the returns of multiple worldwide indexes for several days in history. In this task, the overall objective is to predict what the return for the MSCI Emerging Markets (EM) index will be based on the other index returns. Y in this case is the last (rightmost) column of the Istanbul.csv file, while the X values are the remaining columns to the left (except the first column). As part of reading the data file, your code should handle any data cleansing that is required. This includes dropping header rows and date-time columns (e.g., the first column of data in the Istanbul file, which should be ignored). Note that the local test script does this automatically for you, but you will have to handle it yourself when developing your implementation.
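The cleansing step described above can be sketched as follows. This is a minimal illustration, assuming one header row and a date column in the first position; the function name `load_features_and_target`, the column names, and the inline sample rows are hypothetical and not part of the provided framework.

```python
import io
import numpy as np

def load_features_and_target(file_obj):
    """Read a CSV, drop the header row and the first (date) column, split X and Y."""
    # skip_header=1 drops the header row. The date strings cannot be parsed
    # as floats, so they become NaN and are discarded by the column slice.
    data = np.genfromtxt(file_obj, delimiter=",", skip_header=1)
    data = data[:, 1:]
    x = data[:, :-1]  # every remaining column except the last is a feature
    y = data[:, -1]   # the last column is the target
    return x, y

# Hypothetical sample in the assumed layout: date, features..., target
sample = io.StringIO(
    "date,ISE,SP,EM\n"
    "5-Jan-09,0.03,-0.004,0.02\n"
    "6-Jan-09,0.02,0.007,0.01\n"
)
x, y = load_features_and_target(sample)
```

In practice a file path would be passed instead of the in-memory sample; `np.genfromtxt` accepts either.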
The Istanbul data is also available here: Istanbul.csv
When the grading script tests your code, it randomly selects 60% of the data to train on and uses the other 40% for testing.
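A random 60/40 split of this kind might look like the sketch below. It is not the grading script's code; the helper name `split_train_test` and the `seed` parameter are assumptions made for illustration.

```python
import numpy as np

def split_train_test(x, y, train_fraction=0.6, seed=None):
    """Randomly assign rows to a training set and a testing set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(x.shape[0])        # shuffled row indices
    cutoff = int(train_fraction * x.shape[0])  # number of training rows
    train_idx, test_idx = order[:cutoff], order[cutoff:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]

# Small synthetic dataset: 10 rows, 2 features
x = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
x_train, y_train, x_test, y_test = split_train_test(x, y, seed=0)
```

Shuffling the row indices once and slicing keeps each X row paired with its Y value and guarantees that no row lands in both sets.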
The other files, besides Istanbul.csv, are there as alternative sets for you to test your code on. Each data file contains N+1 columns: X1, X2, … XN, (collectively called the features), and Y (referred to as the target).
Before the deadline, make sure to pre-validate your submission using Gradescope TESTING. Once you are satisfied with the results in testing, submit the code to Gradescope SUBMISSION. Only code submitted to Gradescope SUBMISSION will be graded. If you submit your code to Gradescope TESTING and have not also submitted your code to Gradescope SUBMISSION, you will receive a zero (0).
3.1 Getting Started
To make it easier to get started on the project and focus on the concepts involved, you will be given a starter framework. This framework assumes you have already set up the local environment and ML4T Software. The framework for Project 3 can be obtained from: Assess_Learners_2023Fall.zip.
Extract its contents into the base directory (e.g., ML4T_2023Fall). This will add a new folder called “assess_learners” to the course directory structure: The framework for Project 3 can be obtained in the assess_learners folder alone. Within the assess_learners folder are several files:
./Data (folder)
LinRegLearner.py
testlearner.py
grade_learners.py
The data files that your learners will use for this project are contained in the Data folder. (Note the distinction between the “data” folder created as part of the local environment and the “Data” folder within the assess_learners folder that will be used in this assignment.)
LinRegLearner is available for your use and must not be modified. However, you can use it as a template for implementing your learner classes. The testlearner.py file contains a simple testing scaffold that you can use to test your learners, which is useful for debugging. It must also be modified to run the experiments. The grade_learners.py file is a local pre-validation script that mirrors the script used in the Gradescope TESTING environment.
You will need to create the learners using the following names: DTLearner.py, RTLearner.py, BagLearner.py, and InsaneLearner.py. In the assess_learners/Data directory you will find several datasets:
3_groups.csv
ripple_.csv
simple.csv
winequality-red.csv
winequality-white.csv
winequality.names.txt
Istanbul.csv
In these files, we have provided test data for you to use in determining the correctness of the output of your learners. Each data file contains N+1 columns: X1, X2, … XN, and Y.
3.2 Task & Requirements
You will implement the following files:
DTLearner.py – Contains the code for the regression Decision Tree class.
RTLearner.py – Contains the code for the regression Random Tree class.
BagLearner.py – Contains the code for the regression Bag Learner (i.e., a BagLearner containing Random Trees).
InsaneLearner.py – Contains the code for the regression Insane Learner of Bag Learners.
testlearner.py – Contains the code necessary to run your experiments and perform additional independent testing.
All your code must be placed into one of the above files. No other code files will be accepted. All files must reside in the assess_learners folder. The testlearner.py file that is used to conduct your experiments is run using the following command:
3.3 Implement the DT and RT Learners (15 points each)
Implement a Decision Tree learner class named DTLearner in the file DTLearner.py. For this part of the project, your code should build a single tree only (not a forest). You should follow the algorithm outlined in the presentation here: decision tree slides.
We define “best feature to split on” as the feature (Xi) that has the highest absolute value correlation with Y.
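One way to express this selection rule is sketched below using `np.corrcoef`. The function name `best_feature_index` and the handling of constant columns (treated as zero correlation) are assumptions, not requirements stated in the assignment.

```python
import numpy as np

def best_feature_index(x, y):
    """Return the index of the column of x with the highest |correlation| to y."""
    correlations = np.zeros(x.shape[1])
    for i in range(x.shape[1]):
        # Correlation is undefined for a constant column; leave it at 0 (assumption).
        if np.std(x[:, i]) > 0 and np.std(y) > 0:
            correlations[i] = np.corrcoef(x[:, i], y)[0, 1]
    return int(np.argmax(np.abs(correlations)))

# Feature 0 is perfectly (negatively) correlated with y; feature 1 is not.
x = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 1.0]])
y = np.array([-2.0, -4.0, -6.0, -8.0])
print(best_feature_index(x, y))  # prints 0
```

Taking the absolute value matters here: a strongly negative correlation is just as informative for splitting as a strongly positive one.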
The algorithm outlined in those slides is based on the paper by JR Quinlan which you may also want to review as a reference. Note that Quinlan’s paper is focused on creating classification trees, while we are creating regression trees here, so you will need to consider the differences.
You will also implement a Random Tree learner class named RTLearner in the file RTLearner.py. The RTLearner should be implemented exactly like your DTLearner, except that the choice of feature to split on should be made randomly (i.e., pick a random feature then split on the median value of that feature). You should be able to accomplish this by revising a few lines from DTLearner (those that compute the correlation) and replacing the line that selects the feature with a call to a random number generator.
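The random choice and median split can be sketched in isolation as follows. This shows only the split step, not a full tree builder; the helper name `random_split` and the use of a NumPy `Generator` are assumptions.

```python
import numpy as np

def random_split(x, rng):
    """Pick a random feature and split the rows on that feature's median."""
    feature = int(rng.integers(0, x.shape[1]))  # random column index
    split_val = float(np.median(x[:, feature]))
    left_mask = x[:, feature] <= split_val      # rows that go to the left subtree
    # If every row satisfies the mask (e.g., all values are equal), the right
    # side is empty and the caller would need to stop and create a leaf.
    return feature, split_val, left_mask

rng = np.random.default_rng(1)
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
feature, split_val, left_mask = random_split(x, rng)
```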
The DTLearner and RTLearner will each be evaluated against 5 test cases (4 using Istanbul.csv and 1 using another data set from the assess_learners/Data folder). We will assess the absolute correlation between the predicted and actual results for the in-sample data and out-of-sample data with a leaf size of 1, and in-sample data with a leaf size of 50.
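The correlation check described above, together with the RMS error listed in the hints, could be computed along these lines. The helper names `abs_correlation` and `rmse` are hypothetical, and the arrays are made-up values rather than learner output.

```python
import numpy as np

def abs_correlation(y_true, y_pred):
    """Absolute Pearson correlation between actual and predicted values."""
    return float(abs(np.corrcoef(y_pred, y_true)[0, 1]))

def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 3.5, 4.5])  # every prediction is off by 0.5
corr = abs_correlation(y_true, y_pred)
error = rmse(y_true, y_pred)
```

The two metrics answer different questions: here the predictions track the actual values perfectly (correlation 1) while still carrying a constant error of 0.5.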
3.3.1 Example
The following example illustrates how the DTLearner class methods will be called:
The following example illustrates how the RTLearner class methods will be called:
The DTLearner and RTLearner constructors take two arguments: leaf_size and verbose. “leaf_size” is a hyperparameter that defines the maximum number of samples to be aggregated at a leaf. If verbose is True, your code can generate output to a screen for debugging purposes. When the tree is constructed recursively, if there are leaf_size or fewer elements at the time of the recursive call, the data should be aggregated into a leaf. Xtrain and Xtest should be NDArrays (Numpy objects) where each row represents an X1, X2, X3… XN set of feature values. The columns are the features and the rows are the individual example instances. Ypred and Ytrain are single dimension NDArrays. Ypred is the prediction based on the given feature dataset. You need to follow the algorithms described above. Do not use a Node Implementation. Your internal representations of the trees must be NDArrays.
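To illustrate what an NDArray tree representation could look like, the sketch below queries a small hand-built table. The row layout (feature index, split value, relative offset to the left child, relative offset to the right child) and the `-1` leaf marker are assumptions made for this example; the assignment does not prescribe a specific column layout. The hand-typed literal stands in for a tree that a learner would construct during training.

```python
import numpy as np

LEAF = -1.0  # assumed marker: a row whose feature column is -1 is a leaf

# Hypothetical tree: split on feature 0 at 2.5; the left leaf predicts 1.0
# and the right leaf predicts 4.0. For a leaf, column 1 holds the prediction.
tree = np.array([
    [0.0, 2.5, 1.0, 2.0],
    [LEAF, 1.0, np.nan, np.nan],
    [LEAF, 4.0, np.nan, np.nan],
])

def query_point(tree, point):
    """Walk the table from the root row until a leaf row is reached."""
    row = 0
    while tree[row, 0] != LEAF:
        feature = int(tree[row, 0])
        if point[feature] <= tree[row, 1]:
            row += int(tree[row, 2])  # relative jump to the left child
        else:
            row += int(tree[row, 3])  # relative jump to the right child
    return tree[row, 1]

x_test = np.array([[1.0, 9.0], [3.0, 9.0]])
y_pred = np.empty(x_test.shape[0])
for i in range(x_test.shape[0]):
    y_pred[i] = query_point(tree, x_test[i])
```

Relative offsets make subtrees position-independent, so a subtree's rows can be stacked under a new root without renumbering.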
3.4 Implement BagLearner (20 points)
Implement Bootstrap Aggregation as a Python class named BagLearner. Your BagLearner class should be implemented in the file BagLearner.py. It should support the API EXACTLY as illustrated in the example below. This API is designed so that the BagLearner can accept any learner (e.g., RTLearner, LinRegLearner, even another BagLearner) as input and use it to generate a learner ensemble. Your BagLearner should support the following function/method prototypes:
The BagLearner constructor takes five arguments: learner, kwargs, bags, boost, and verbose. The learner points to the learning class that will be used in the BagLearner. The BagLearner should support any learner that aligns with the API specification. The “kwargs” are keyword arguments that are passed on to the learner’s constructor and they can vary according to the learner (see example below). The “bags” argument is the number of learners you should train using Bootstrap Aggregation. If boost is true, then you should implement boosting (optional implementation). If verbose is True, your code can generate output; otherwise, the code should be silent.
As an example, if we wanted to make a random forest of 20 Decision Trees with leaf_size 1 we might call BagLearner as follows: As another example, if we wanted to build a bagged learner composed of 10 LinRegLearners we might call BagLearner as follows:
Note that each bag should be trained on a different subset of the data. You will be penalized if this is not the case.
Boosting is an optional topic and not required. There is a citation in the Resources section that outlines a method of implementing boosting.
If the training set contains n data items, each bag should contain n items as well. Note that because you should sample with replacement, some of the data items will be repeated.
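Drawing one bag of n items with replacement can be sketched as follows. The function name `bootstrap_sample` and the `rng` argument are assumptions for illustration.

```python
import numpy as np

def bootstrap_sample(x, y, rng):
    """Draw n rows with replacement from an n-row training set."""
    n = x.shape[0]
    idx = rng.integers(0, n, size=n)  # indices may repeat
    return x[idx], y[idx]

rng = np.random.default_rng(0)
x = np.arange(10, dtype=float).reshape(5, 2)
y = np.arange(5, dtype=float)
x_bag, y_bag = bootstrap_sample(x, y, rng)
```

Using the same index array for both `x` and `y` keeps each feature row aligned with its target value.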
This code should not generate statistics or charts. If you want to create charts and statistics, you can modify testlearner.py. You can use code like the below to instantiate several learners with the parameters listed in kwargs:
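A minimal sketch of that instantiation pattern is shown here. `StubLearner` is a hypothetical stand-in for any class that follows the learner API, and the values of `kwargs` and `bags` are arbitrary.

```python
class StubLearner:
    """Hypothetical stand-in for any learner that follows the API."""

    def __init__(self, leaf_size=1, verbose=False):
        self.leaf_size = leaf_size
        self.verbose = verbose

learner = StubLearner      # the class itself, not an instance
kwargs = {"leaf_size": 5}  # forwarded to each learner's constructor
bags = 3

learners = []
for _ in range(bags):
    # ** unpacks the dictionary into keyword arguments
    learners.append(learner(**kwargs))
```

Because the class and its arguments are passed separately, the same loop works for any learner whose constructor accepts the keys in `kwargs`.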
3.5 Implement InsaneLearner (Up to 10-point penalty)
Your BagLearner should be able to accept any learner object so long as the learner obeys the defined API. We will test this in two ways: 1) By calling your BagLearner with an arbitrarily named class and 2) By having you implement InsaneLearner as described below. If your code dies in either case, you will lose 10 points. Note that the grading script performs only a rudimentary check, so we will also manually inspect your code for correct implementation and grade accordingly.
Using your BagLearner class and the provided LinRegLearner class, implement InsaneLearner as follows:
InsaneLearner should contain 20 BagLearner instances where each instance is composed of 20 LinRegLearner instances. We should be able to call your InsaneLearner using the following API:
The InsaneLearner constructor takes one argument: verbose. If verbose is True, your code can generate output; otherwise, the code should be silent. The code for InsaneLearner must be 20 lines or less.
Each “;” in the code counts as one line.
Every line that appears in the file (except comments and empty lines) will be counted.
You cannot use the “exec()” statement or strings of Python commands concatenated together using any variant of a newline (‘\n’).
Hint: Only include methods necessary to run the assignment tasks and the author method.
There is no credit for this, but a penalty if it is not implemented correctly. Comments, if included, must appear at the end of the file. Note: We recommend avoiding blank lines and comments in your InsaneLearner implementation.
3.6 Implement author() (Up to 10-point penalty)
All learners (DT, RT, Bag, Insane) must implement a method called author() that returns your Georgia Tech user ID as a string. It is not your 9-digit student number. The method must be explicitly implemented within each individual file and cannot be included through the use of inheritance. Here is an example of how you might implement author() within a learner object:
And here’s an example of how it could be called from a testing program:
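A minimal sketch of both the implementation and the call is shown below. `ExampleLearner` is a hypothetical class used only for illustration, and `"gburdell3"` is a placeholder to be replaced with your own Georgia Tech user ID.

```python
class ExampleLearner:
    """Hypothetical learner showing only the author() method."""

    def __init__(self, verbose=False):
        self.verbose = verbose

    def author(self):
        # Placeholder user ID (assumption); return your own GT username here.
        return "gburdell3"

# Calling it from a testing program
learner = ExampleLearner()
print(learner.author())
```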
Note: No points are awarded for implementing the author function, but a penalty will be applied if not present.
3.7 Boosting (Optional Learning Activity – 0 points)
This is a personal enrichment activity and will not be awarded any points. Conversely, there is no deduction if boosting is not implemented. Implement boosting as part of BagLearner. How does boosting affect performance compared to not boosting? Does overfitting occur as the number of bags with boosting increases? Create your own dataset for which overfitting occurs as the number of bags with boosting increases.
Submit your report regarding boosting as report-boosting.pdf
3.8 Implement testlearner.py
Implement testlearner.py to perform the experiment and report analysis as required. This is intended to give a central location to complete the experiments, and produce all necessary outputs, in a single standardized file.
Data provided by the testlearner to the learner must be randomly selected.
It is the ONLY file submitted that produces outputs (e.g., charts, stats, calculations).
It is the ONLY file allowed to read data with the provided data reading routine (i.e., it is not allowed to use util.py for data reading).
It MUST run in under 10 minutes.
It MUST execute all experiments, charts, and data used for the report in a single run.
You are able to use a dataset other than Istanbul.csv if not explicitly stated but must adhere to the following run command EXACTLY regardless of the dataset used (you will have to read the other datasets internally):
3.9 Technical Requirements
The following technical requirements apply to this assignment:
1. You must use a NumPy array (i.e., NDArray()), as shown in the course videos, to represent the decision tree. You may not use another data structure (e.g., node-based trees) to represent the decision tree.
2. You may not use Pandas or python list variables within the DTLearner or within the RTLearner.
3. You may not cast a variable of another data type into an NDArray.
4. Although Istanbul.csv is time-series data, you should ignore the date column and treat the dataset as a non-time series dataset. In this project, we ignore the time-ordered aspect of the data. In a later project, we will consider time-series data.
5. Files must be read using one of two approaches: 1) The open/readline functions (an example of which is provided in the testlearner.py file) or 2) using NumPy’s genfromtxt function (an example of which is used in the grade_learners.py file). Do not use util.py to read the files for this project.
6. The charts must be generated as .png files (when executed using the command given above in these instructions) and should include any desired annotations. Charts must be properly annotated with legible and appropriately named labels, titles, and legends. Image files cannot be post-processed to add annotations or otherwise change the image prior to their inclusion into the report. Charts must be saved to the project directory.
7. The learner code files (i.e., DTLearner, RTLearner, BagLearner, InsaneLearner) should not generate any output to the screen/terminal/display or directly produce any charts. testlearner.py is the only file that should generate charts.
8. Your learners should be able to handle any number of feature dimensions in X from 2 to N.
9. Performance:
DTLearner tests must complete in 10 seconds each (e.g., 50 seconds for 5 tests)
RTLearner tests must complete in 3 seconds each (e.g., 15 seconds for 5 tests)
BagLearner tests must complete in 10 seconds each (e.g., 100 seconds for 10 tests)
InsaneLearner must complete in 10 seconds each
10. If the “verbose” argument is True, your code can print out the information for debugging. If verbose = False your code must not generate ANY output other than the required charts. The implementation must not display any text (except for “warning” messages) on the screen/console/terminal when executed in Gradescope SUBMISSION.
11. Any necessary text, tables, or statistics can be saved in a file called p3_results.txt or p3_results.html.
12. Watermarked Charts (i.e., where the GT Username appears over the lines) can be shared in the designated pinned (e.g., “Project 3 – Student Charts”) thread alone. Charts presented in reports or submitted for grading must not contain watermark.
3.10 Hints and Resources
“Official” course-based materials:
How to use a decision tree if you have one (video)
How to build a decision tree & Random Trees (video)
How-to-learn-a-decision-tree.pdf: Balch slides on decision trees
Decision-tree-example.xlsx: Example tabular version of decision tree
Additional supporting materials:
You may be interested in looking at ’s slides on instance-based learning.
A definition of correlation is used to assess the quality of the learning.
Bootstrap Aggregating
numpy corrcoef
numpy argsort
RMS error
If after submission for grading you are not entirely satisfied with the implementation, you are encouraged to continue to improve the learner(s) as they can play a role in future projects.
4 CONTENTS OF REPORT
In addition to submitting your code to Gradescope, you will also produce a report. The report will be a maximum of 7 pages (excluding references) and be written in JDF Format. Any content beyond 7 pages will not be considered for a grade. At a minimum, the report must contain the following sections:
Abstract
First, include an abstract that briefly introduces your work and gives context behind your investigation. Ideally, the abstract will fit into 50 words, but should not be more than 100 words.
Introduction
The report should briefly describe the paper’s justification. While the introduction may assume that the reader has some domain knowledge, it should assume that the reader is unfamiliar with the specifics of the assignment. The introduction should also present an initial hypothesis (or hypotheses).
Discuss the setup of the experiment(s) in sufficient detail that an informed reader (someone with the familiarity of the field