End of Year Assessment Summary: You have to participate in the Kaggle competition and submit a 2-page report (using the provided template at the end of this description) together with your implementation code. Note that you must make at least one submission to the Kaggle competition to get a non-zero weight!
Your report should describe the classifier or combination of classifiers you use, how you handle issues specific to the competition data set such as its high dimensionality (large number of features), how you do model selection (training-validation split or cross-validation), and how you do further investigation to take into account the three pieces of extra information: the additional but incomplete labelled training data, the test label proportions, and the annotation confidence on labels.
Details of Research Report
You are expected to write a 2-page report detailing your solution to the Kaggle competition problem. Please use the provided LaTeX or Word template (see the end part of this description). Your report should include the following components (you are allowed to combine descriptions #2 and #3, but make sure we can easily identify them).
1. APPROACH (Maximum mark: 10)
You should present a high-level description and explanation of the machine learning approach (e.g. support vector machine, logistic regression, or a combination thereof) you have adopted. Try to cover how the method works and the notable assumptions on which the approach depends. Pay close attention to characteristics of the data set, for example its high dimensionality.
2. METHODOLOGY (Maximum mark: 25)
Describe how you did training and testing of the classifier of your choice. This should include model selection (Did you do model selection? What was it meant for? What were you selecting?) and feature pre-processing or feature selection if you chose to do it. Feature pre-processing could be in the form of (a short illustrative sketch follows this list):
• Standardisation: to remove the mean and scale the variance for each feature, that is, to make each feature have 0 mean and 1 standard deviation.
• Normalisation: to scale individual observations or data points to have a unit norm, be it L1 or L2 norm.
• Binarisation: to threshold numerical feature values to get boolean values.
• Scaling: to scale features to lie between minimum and maximum values, for example to lie in [0,1] or [-1,1].
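For concreteness, the following is a minimal sketch (not a required implementation) of how these pre-processing options could be wired into a classifier and compared with cross-validation in scikit-learn. The arrays X_train and y_train are random placeholders standing in for the competition features and labels, and logistic regression is used only as an example classifier.

import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer, Binarizer, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the competition features and labels.
rng = np.random.default_rng(0)
X_train = rng.random((200, 50))
y_train = rng.integers(0, 2, size=200)

# Each pre-processing option listed above has a scikit-learn transformer:
#   StandardScaler -> zero mean, unit standard deviation per feature
#   Normalizer     -> unit L1/L2 norm per data point
#   Binarizer      -> threshold features to boolean values
#   MinMaxScaler   -> scale features into [0, 1] (or another range)
preprocessors = {
    "standardise": StandardScaler(),
    "normalise_l2": Normalizer(norm="l2"),
    "binarise": Binarizer(threshold=0.5),
    "scale_0_1": MinMaxScaler(feature_range=(0, 1)),
}

# Compare pre-processing choices via 5-fold cross-validation,
# one simple way of doing model selection.
for name, prep in preprocessors.items():
    pipe = Pipeline([("prep", prep), ("clf", LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")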
Feature selection methods include, for example: filter methods such as univariate feature selection based on the chi-squared statistic, wrapper methods such as recursive feature elimination, and L1-norm penalisation for sparse solutions. You are provided with two types of features: CNN features and GIST features. Are they equally important? Describe any creative solutions with respect to the additional characteristics of the competition data set, such as how to incorporate the extra information: the additional training data with many missing features, the test label proportions, and the training label confidence.
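As an illustration only, the sketch below shows the three kinds of feature selection mentioned above in scikit-learn, again on random placeholder data. The chi-squared filter assumes non-negative features, so it is applied to min-max scaled inputs; a linear SVM is used as an example base estimator.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Placeholder feature matrix and labels standing in for the competition data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = rng.integers(0, 2, size=200)

# Filter method: univariate selection with the chi-squared statistic
# (chi2 requires non-negative inputs, hence the min-max scaling).
X_nonneg = MinMaxScaler().fit_transform(X)
X_filter = SelectKBest(chi2, k=20).fit_transform(X_nonneg, y)

# Wrapper method: recursive feature elimination around a linear SVM.
rfe = RFE(LinearSVC(C=1.0, max_iter=5000), n_features_to_select=20)
X_wrapper = rfe.fit_transform(X, y)

# Embedded method: L1 penalisation gives a sparse weight vector; features
# with (near-)zero weight are effectively discarded.
sparse_clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000).fit(X, y)
n_kept = int(np.sum(np.abs(sparse_clf.coef_) > 1e-6))
print(X_filter.shape, X_wrapper.shape, n_kept)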
Reference to appropriate literature may be included.
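How you exploit the three pieces of extra information is left open. Purely as an illustration, the sketch below (with random placeholder arrays, not the real competition files) shows two generic mechanisms that could be adapted: mean imputation of missing feature values in the additional data, and using annotation confidence as per-example sample weights; the 0.4 positive proportion used for threshold calibration is a hypothetical value.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Placeholder arrays: X_extra has missing entries (NaN) and each row comes
# with an annotator confidence in [0, 1] for its label.
rng = np.random.default_rng(0)
X_extra = rng.normal(size=(150, 50))
X_extra[rng.random(X_extra.shape) < 0.3] = np.nan   # roughly 30% missing values
y_extra = rng.integers(0, 2, size=150)
confidence = rng.random(150)

# One simple way to use the incomplete additional data: impute missing
# feature values (here with per-feature means) before training.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_extra)

# One simple way to use label confidence: pass it as a per-example weight,
# so low-confidence labels contribute less to the training loss.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_imputed, y_extra, sample_weight=confidence)

# Test label proportions could, for example, calibrate the decision threshold
# so that the predicted positive rate matches the given proportion.
probs = clf.predict_proba(X_imputed)[:, 1]
target_positive_rate = 0.4                      # hypothetical proportion
threshold = np.quantile(probs, 1 - target_positive_rate)
y_pred = (probs >= threshold).astype(int)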
3. RESULTS AND DISCUSSION (Maximum mark: 25)
The main thing is to present the results sensibly and clearly. Present the results of your model selection. There are different ways this can be done (one possible way of producing such plots is sketched after this list):
• Use a table or plot to show how the choice of classifier hyper-parameters affects the performance of the classifier on the validation set (refer to the lectures in week 9). Classifier hyper-parameters are, for example, the regularisation values in support vector machines and in logistic regression.
• Use graphs to show how performance changes for different training set sizes (learning curve; refer to the lectures in week 9), if you choose to do that.
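One way such validation and learning curves could be produced is sketched below with scikit-learn and matplotlib; the data are random placeholders and the linear SVM is only an example classifier.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve, learning_curve
from sklearn.svm import LinearSVC

# Placeholder data standing in for the competition features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 2, size=300)

# Validation curve: performance versus the regularisation hyper-parameter C,
# estimated with 5-fold cross-validation.
C_range = np.logspace(-3, 2, 6)
train_sc, val_sc = validation_curve(
    LinearSVC(max_iter=5000), X, y,
    param_name="C", param_range=C_range, cv=5)
plt.semilogx(C_range, train_sc.mean(axis=1), marker="s", label="training")
plt.semilogx(C_range, val_sc.mean(axis=1), marker="o", label="validation")
plt.xlabel("C (regularisation)")
plt.ylabel("accuracy")
plt.legend()
plt.savefig("validation_curve.png")

# Learning curve: performance as a function of training-set size.
sizes, tr_sc, va_sc = learning_curve(
    LinearSVC(max_iter=5000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)
plt.figure()
plt.plot(sizes, tr_sc.mean(axis=1), marker="s", label="training")
plt.plot(sizes, va_sc.mean(axis=1), marker="o", label="validation")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.savefig("learning_curve.png")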
Where applicable, provide an analysis of the usefulness of taking into account the provided additional incomplete training data, the test label proportions, and the training label confidence.
You should also take the opportunity to discuss any ways you can think of to improve the work you have done. If you think that there are ways of getting better performance, then explain how. If you feel that you could have done a better job of evaluation, then explain how. What lessons, if any, have been learnt? Were your goals achieved? Is there anything you now think you should have done differently?
Details of Code (Maximum mark: 40)
You must also submit your implementation code. Please make sure we will be able to run your code as is. High-quality code with a good structure and comments will be marked favourably. As mentioned earlier, the code component will be weighted based on your performance in the Kaggle competition; no submission to the competition means a weight of 0.0.