EE 559 Course Project Assignment
Posted: Sat., 4/9/2022; Due: Mon., 5/2/2022, 11:59 PM
v1.1: trivial system performance for Dataset 2 corrected on page 11; pandas added to the allowed-libraries list on page 2; update to libraries allowed for non-EE559 topics, also on page 2.
Introduction
The goal of this project is to develop one (or more) machine learning systems that operate on the given real-world dataset(s). In doing so, you will apply tools and techniques that we have covered in class. You may also confront and solve issues that occur in practice and that have not been covered in class. Discussion sessions and past homework problems will provide you with some tips and pointers for this; also internet searches and piazza discussions will likely be helpful.
Projects can be individual (one student working on the project) or a team (2 students working together on one project).
You will have significant freedom in designing what you will do for your project. You must cover various topics that are listed below (as “Required Elements”); the methods you use, and the degree of depth you go into each topic, are up to you. And, you are encouraged to do more than just the required elements (or to dive deeper than required on some of the required elements).
Everyone will choose their project topics based on the three topic datasets listed below.
Collaboration and comparing notes on piazza may be helpful, and looking for pertinent information on the internet can be useful. However each student or team is required to do their own work, coding, and write up their own results.
Topic datasets:
There are 3 topic datasets to choose from:
(i) Power consumption of Tetouan City. Problem type: regression
(ii) Student performance (in Portuguese schools). Problem type: classification or regression
(iii) Algerian forest fires. Problem type: classification
These datasets are described in the appendix, below.
Note: for each topic dataset, you are required to use the training and test sets provided on D2L. Some of the features have been preprocessed (e.g., normalization, transformation, deletion, noise removal or addition). Additionally, we want everyone to use the same training set and test set, so that everyone’s system can be compared by the same criteria.
Which topic datasets can you use?
Individual projects should choose any one dataset. If it is the student performance dataset, then pick either the classification or regression version.
Team projects should choose any one dataset. If it is the student performance dataset, then do both the classification and regression versions. If it is the forest fire dataset or the power consumption dataset, then you are expected to explore the topic in more depth than a typical individual project. For example, this could mean doing more feature engineering, trying more (nonlinear) feature expansion/reduction, or trying more models.
Computer languages and available code
You must use Python as your primary language for the project. You may use Python and its built-in functions, NumPy, scikit-learn, pandas, and matplotlib. You may find LibSVM or SVMLight useful, and may use them as well. Additionally, you may use imblearn (imbalanced-learn) functions for undersampling or oversampling of your data if that would be useful for your project; you may use pandas only for reading, writing, and parsing csv files; and you may use a function or class for RBF network implementation (e.g., scipy.interpolate.Rbf). Within these guidelines, you may use any library functions or methods, and you may write your own code, functions, and methods. Please note that for library routines (functions, methods, classes) you use, it is your responsibility to know what the routine does, what its parameters mean, and that you are setting the parameters and options correctly for what you want the routine to do.
As posted on piazza at https://piazza.com/class/kyatv32xz9f9i?cid=359, if parts of your project use techniques outside of EE 559 topics, you may, where necessary, use other appropriate libraries for those parts. Please see the piazza post for guidelines on the descriptions and information that must be included in your report in this case.
Use of C/C++ is generally discouraged (for your own time commitments and because we didn’t cover it in relation to ML). However, there could be some valid reasons to use C/C++ for some portions of your code, e.g. for faster runtime. If you want to use it, we recommend you check with the TAs or instructor first.
Be sure to state in your project report what languages and toolboxes/libraries you used; what you coded yourself specifically for this class project; and of course any code you use from other sources must be credited as such.
Required elements
• The items below give the minimal set of items you are required to include in your project, for each dataset you report on. Note that you are welcome and encouraged to do more than the minimal required elements (for example, where you are required to use one method, you are welcome to try more than one method and compare the results). Doing more work will increase your workload score,
might increase your interpretation score, and might improve your final system’s performance.
EE 559 content
o The majority of your work must use algorithms or methods from EE 559 (covered in any part of the semester).
o You may also (optionally) try algorithms and methods that were not covered in EE 559 for comparison; and describe the method and results in your report.
Consider preprocessing: use if appropriate [Discussion 7, 8, 9]
Tip: you might find it easier to let Python (using pandas) handle csv parsing.
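For example, a minimal sketch (the file name train.csv is a placeholder for whichever provided file you are loading):

```python
import pandas as pd

# Parse the csv into a DataFrame; pandas infers column names from the header row.
# "train.csv" is a placeholder for the provided file you want to load.
df = pd.read_csv("train.csv")
print(df.head())    # quick sanity check of the parsed columns
print(df.dtypes)    # verify which columns are numeric vs. string (object)
```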
Normalization or standardization. It is generally good practice to consider these; they are often beneficial when different features have significantly different ranges of values. Normalization or standardization doesn't have to be applied to all features or none; for example, binary variables are typically not standardized.
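As an illustration, a minimal sketch using scikit-learn's StandardScaler (the arrays below are placeholders; in practice you would pass your numeric training and test feature columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder arrays standing in for the numeric (non-binary) feature columns.
X_train_num = np.array([[20.1, 55.0], [25.3, 60.2], [18.7, 48.9]])
X_test_num = np.array([[22.0, 52.5]])

# Fit the scaler on the training set only, then apply the same transformation
# to the test set; never fit scaling statistics on test data.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_num)
X_test_std = scaler.transform(X_test_num)
```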
For classification problems, if the dataset is significantly unbalanced, then some methods for dealing with that should be included. If it's moderately unbalanced, then it might be worth trying some methods to see if they improve the performance. Some approaches to this work by preprocessing the data.
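For instance, a minimal sketch of random oversampling with the allowed imblearn package (placeholder data; SMOTE is another option in the same package):

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Placeholder imbalanced training data: class 0 heavily outnumbers class 1.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.array([0] * 90 + [1] * 10)

# Resample the training data only; the test set keeps its natural distribution.
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print(np.bincount(y_res))  # both classes now have 90 samples
```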
Representation of categorical (ordinal or cardinal) input data should be considered. Non-binary categorical-valued features usually should be changed to a different representation.
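As one possibility, a minimal sketch using scikit-learn's OneHotEncoder (placeholder data; binary categoricals can instead be mapped directly to {0, 1}):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Placeholder multi-valued categorical column (a hypothetical "job" feature).
X_cat = np.array([["teacher"], ["health"], ["other"], ["teacher"]])

# Fit on training data; handle_unknown="ignore" avoids errors if the test set
# contains a category never seen during training.
enc = OneHotEncoder(handle_unknown="ignore")
X_onehot = enc.fit_transform(X_cat).toarray()
print(enc.categories_)
print(X_onehot)
```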
Consider feature-space dimensionality adjustment: use if appropriate
You can use a method to try reducing and/or expanding the dimensionality, and to choose a good dimensionality. Use the d.o.f. and constraints as an initial guide on what range of dimensionality to try.
In addition to feature-reduction methods we covered in EE 559, feel free to try others (some others are mentioned in the forest fires dataset description).
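For illustration, a minimal sketch of one covered method, PCA, with scikit-learn (placeholder data; the 95% variance threshold is just one possible criterion, and you can instead sweep an integer dimensionality and select by cross-validation):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder feature matrix (N samples x d features).
rng = np.random.default_rng(0)
X = rng.random((200, 10))

# Keep enough principal components to explain ~95% of the variance;
# alternatively, set an integer n_components and choose it by cross-validation.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```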
Cross validation or validation
o Generally it’s best to use cross validation for choosing parameter values, comparing different models or classifiers, and/or for dimensionality adjustment. If you have lots of data and you’re limited by computation time, you might instead use validation without cross-validation.
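For example, a minimal sketch of k-fold cross-validation for parameter selection (placeholder data; SVC stands in for whatever model you are tuning):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder training data.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 2, size=100)

# 5-fold cross-validation for each candidate parameter value; keep the value
# with the best mean score, and note the standard deviation for reporting.
for C in [0.1, 1.0, 10.0]:
    scores = cross_val_score(SVC(C=C), X, y, cv=5)
    print(C, scores.mean(), scores.std())
```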
Training and prediction
Individual projects should try at least 3 different classification/regression techniques that we have covered (or will cover) in class. Team projects should cover at least 4 classification/regression techniques, at least 3 of which are covered in EE 559. Beyond this, feel free to optionally try other methods.
Note that the required trivial and baseline systems don’t count toward the 3 or 4 required classifiers or regressors, unless substantial additional work is done to optimize one of them so that it becomes one of your chosen systems.
Proper dataset (and subset) usage
Final test set (as given), training set, validation sets, cross validation.
Interpretation and analysis
Explain the reasoning behind your approach. For example, how did you decide whether to do normalization and standardization? And if you did use it, what difference did it make in system performance? Can you explain why?
Analyze and interpret intermediate results and final results, especially if some results seem surprising or weren’t what you expected. Can you explain (or hypothesize reasons for) what you observe?
(Optional) If you hypothesize a reason, what could you run to verify or refute the hypothesis? Try running it and see.
(Optional) Consider what would be helpful if one were to collect new data, or collect additional data, to make the prediction problem give better results. Or, what else could be done to potentially improve the performance. Suggest this in your report and justify why it could be helpful.
Reference systems and comparison
At least one trivial system and one baseline system are required. Each dataset description states what to use for these systems.
Run, and then compare with, the baseline system(s). The goal is to see how much your systems can improve over the baseline system’s performance.
Also run, and compare with, the trivial system. The trivial system doesn’t look at the input values 𝑥 when making predictions; this comparison helps you assess whether your systems have learned anything at all.
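As one convenient implementation route, scikit-learn’s dummy estimators can serve as trivial systems (a sketch with placeholder data; check your dataset’s description for the exact trivial system required):

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Placeholder training data.
rng = np.random.default_rng(0)
X_train = rng.random((50, 3))
y_class = rng.integers(0, 2, size=50)   # labels for a classification problem
y_reg = rng.random(50)                  # targets for a regression problem
X_test = rng.random((10, 3))

# Trivial classifier: ignores x and always predicts the most frequent class.
triv_clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_class)

# Trivial regressor: ignores x and always outputs the training-set mean.
triv_reg = DummyRegressor(strategy="mean").fit(X_train, y_reg)

print(triv_clf.predict(X_test))
print(triv_reg.predict(X_test))
```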
Performance evaluation
o Report on the cross-validation (or validation) performance (mean and standard deviation); it is recommended to use one or more of the required performance measures stated in your dataset’s description (in the Appendix, below). If you use other measure(s), justify in your report why.
o Report the test-set performance of your final (best-model) system.
▪ For your final-system test-set performance, you may use the best parameters found from model selection, and re-train your final system using all the training data to get the final weight vector(s).
▪ Report on all required performance measures listed in the dataset description.
o You may also report other measures if you like. Use appropriate measures for your dataset and problem.
Written final report and code
Submit by uploading your report and code, in the formats specified below, to D2L.
General Tips
1. Be careful to keep your final test set uncorrupted, by using it only to evaluate performance of your final system(s).
2. If you find the computation time too long using the provided dataset(s), you could try the following: check that you are using the routines and code efficiently, consider using other classifiers or regressors, or down-sample the training dataset further to use a smaller N for your most repetitive work. In the latter case, once you have narrowed your options down, you can make your final choices, or at least do your final training, using the full (not down-sampled) training dataset.
3. Wherever possible, it can be helpful to consider the degrees of freedom and number of constraints, as discussed in class. However, this is easier to evaluate for some classifiers than for others; and for still others, such as SVM and SVR, it doesn’t directly apply.
Grading criteria
Please note that your project will not be graded like a homework assignment. There is no set of problems with pre-defined completion points, and for many aspects of your project there is no one correct answer. The project is open-ended, and the work you do is largely up to you. You will be graded on the following aspects:
• Workload and difficulty of the problem (effort in comparison with required elements, additional work beyond the required minimum, progress in comparison with difficulty of the problem);
• Approach (soundness and quality of work);
• Performance of final system on the provided test set;
• Analysis (understanding and interpretation);
• Quality of written code (per guidelines given in Discussion 6 and related materials posted)
• Final report write-up (clarity, completeness, conciseness).
In each category above, you will get a score, and the weighted sum of scores will be your total project score.
For team projects, both members of a team will usually get the same score. Exceptions are when the quantity and quality of each team member’s effort are clearly different.
Final report [detailed guidelines (and template) to be posted later]
In your final report you will describe the work that you have done, including the problem statement, your approach, your results, and your interpretation and understanding.
Note that where you describe the work of others, or include information taken from elsewhere, it must be cited and referenced as such; similarly, any code that is taken from elsewhere must be cited in comments in your code file. Instructions for citing and referencing will be included with the final report instructions.
Plagiarism (copying information from elsewhere without crediting the source) of text, figures, or code will cause substantial penalty.
For team projects, each team will submit one final report; the final report must also describe which tasks were done by each team member.
You will submit one pdf of your (or your team’s) report, one pdf of all code, and a zip file with all your code as you run it.
Please note that your report should include all relevant information that will allow us to understand what you did and evaluate your work. Information that is in your code but not in your report might be missed. The purpose of turning in your code is so that we can check various other things (e.g., random checks for proper dataset usage, checking for plagiarism, verifying what you coded yourself (as stated in the report), and running the code when needed to check your reported results).
Appendix: Overview of Project Datasets
For all datasets
Use only the provided training and test sets on D2L. In the provided datasets, some of the features may have been preprocessed, and thus are incompatible with the feature representation of the dataset that is posted on the UCI website. You are required to use only the provided (D2L) datasets for your course project work, so that everyone on a given topic is graded on the same datasets.
1. Power Consumption dataset
Summary description
The goal of this problem is to predict the power consumption in three different zones of Tetouan city, Morocco. The measurements were taken every 10 minutes over an entire year.
Problem type: regression
Data Description
The dataset contains 6 features, 5 of which are numerical, and one is categorical. The numerical features are related to the weather and the categorical feature describes the 10-min interval during which the measurements were taken. In detail:
1. Date Time (date is in month/day/year format)
2. Temperature
3. Humidity
4. Wind Speed
5. General diffuse flows
6. Diffuse flows
Note on training, test, and validation sets
You are provided with a training and a test set. The test set was created by randomly selecting 73 (20%) of the 365 available days. To avoid overfitting during the validation or cross-validation procedures, we suggest that you follow a similar approach. In other words, you should create a day-of-the-year feature, with range [1, 365], and draw the validation set(s) as M random days out of the 292 training days. You might want to use sklearn’s GroupKFold class to achieve this more easily. And remember: you cannot use the test set to make any decisions about your models!
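For example, a minimal sketch of day-based cross-validation with GroupKFold (placeholder data shaped like 10 days of 144 ten-minute samples each):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: 10 days of data, 144 ten-minute samples per day.
n_days, per_day = 10, 144
rng = np.random.default_rng(0)
X = rng.random((n_days * per_day, 5))
y = rng.random(n_days * per_day)
groups = np.repeat(np.arange(1, n_days + 1), per_day)  # day-of-year label per sample

# Each fold holds out whole days, so no day is split between train and validation.
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(set(groups[val_idx]))
```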
Further discussion (not required reading for the project): when dealing with time-varying data, we cannot create training and test sets completely at random. We want to avoid the situation where our system makes predictions with knowledge of the future. A realistic approach to this problem is to use the first M days for training and the last 365−M days for testing, but this approach would create difficulties when also applied to validation or cross-validation processes.
Therefore, we chose this day-based approach, which presents a good trade-off between the realistic problem (respecting history vs. future) and the original problem (treating every 10-minute period as a separate (independent) data point).
Required performance measures to report
- Coefficient of determination (R-squared, $R^2$):

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

where $\hat{y}_i$ is the predicted value and $\bar{y}$ is the mean value of all $y_i$.
Note that $-\infty < R^2 \le 1$. Higher values of $R^2$ indicate models with better performance.
Values below zero mean the system performs worse than the trivial system.
You can code your own implementation or use sklearn’s implementation.
- Root Mean Squared Error, RMSE $= \sqrt{\mathrm{MSE}}$. Note that sklearn’s implementation can compute either MSE or RMSE based on the squared parameter.
The performance will be evaluated mainly in terms of the $R^2$ score. However, you must provide values for both the $R^2$ score and RMSE in your report.
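Both measures are available in scikit-learn, for example (placeholder values):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Placeholder true and predicted output values.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.1, 2.9, 6.6])

r2 = r2_score(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred, squared=False)  # squared=False gives RMSE
print(r2, rmse)
```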
Feature engineering will likely play an important role in this project. You are already required to create one feature (a numeric day-of-the-year feature derived from the “Date Time” feature) before performing any validation procedure, but you are encouraged to create others. Examples: month, weekday or weekend, hour of day, minute of day. You can also consider nonlinear transformations or combinations of features. Note that a nonlinear transformation does not have to be a global transformation over all original feature variables; you can be more selective and transform just 1, 2, or 3 variables of your choosing, for example a nonlinear function of the minute and/or hour of the day, added as a set of new features. A sketch of such date-derived features follows below.
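For example, a minimal sketch of deriving such features with Python’s standard datetime module (the exact time format of the provided csv is an assumption here; adjust the format string to match your file):

```python
from datetime import datetime

# Placeholder date-time strings; the dataset's dates are month/day/year, but
# the exact time format in the provided csv is an assumption.
raw = ["1/1/2017 0:10", "6/15/2017 14:30"]

for s in raw:
    t = datetime.strptime(s, "%m/%d/%Y %H:%M")
    day_of_year = t.timetuple().tm_yday        # required for day-based validation
    month = t.month
    is_weekend = int(t.weekday() >= 5)         # Saturday or Sunday
    minute_of_day = t.hour * 60 + t.minute
    print(day_of_year, month, is_weekend, minute_of_day)
```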
Required reference systems
You must code, evaluate and report the test performance of the following systems.
- Trivial system: a system that always outputs the mean of the training data output values.
- Baseline system: linear regression using only the 5 numerical features (ignoring date time).
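A minimal sketch of both reference systems (placeholder arrays standing in for the 5 numerical features and one zone’s power-consumption output):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Placeholder arrays standing in for the 5 numerical weather features
# and one zone's power-consumption target.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 5)), rng.random(100)
X_test, y_test = rng.random((20, 5)), rng.random(20)

# Trivial system: always outputs the mean of the training output values.
trivial = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Baseline system: linear regression on the 5 numerical features only.
baseline = LinearRegression().fit(X_train, y_train)

print("trivial R^2: ", r2_score(y_test, trivial.predict(X_test)))
print("baseline R^2:", r2_score(y_test, baseline.predict(X_test)))
```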
Estimated difficulty level (1-5, 1 is easy, 5 is difficult):
2-3 (easy to moderate)
References
[1] Dataset information: https://archive.ics.uci.edu/ml/datasets/Power+consumption+of+Tetouan+city
[2] Relevant paper: https://ieeexplore.ieee.org/document/8703007
2. Student Performance dataset
Summary description
The problem addressed is to predict students’ academic performance in secondary schools (and to then use the results to intervene and help at-risk students do better). The data is from students in secondary education of two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features; the data was collected using school reports and questionnaires. The dataset provided for this project measures the students’ performance in the subject of Portuguese language (por).
Problem type: Classification (5-class)* or Regression.
For individual projects, choose either regression or classification to work on. For team projects, do both regression and classification.
Data Description
The full dataset [1] is posted on the UCI website (it includes both mathematics and Portuguese topics). For this project we will use only the Portuguese dataset. The number of input attributes per data point is 30, not counting the grade features (Gx). 13 of the attributes are integer-valued, 13 are binary-valued categorical, and 4 are multi-valued categorical. There are no missing values. In detail:
1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)