MSCBDT5002_FINAL_EXAM
Q1. Supervised Outlier Detection (15 points)
In this question, you need to use a supervised classification model to find outliers from our given image data set. The data set will contain two types of tags: outliers and inliers. And the main content of the data set is some random scenes with text as the main body.
Data Descriptions:
- AllthedataisinData_Q1.
- Folder Outlier_train contains all training data labeled as outlier.
- Folder Inlier_train contains all training data labeled as inlier.
- Folder test contains all the testing data.
Submissions:
- Pleasewriteyourmainexperimentalstepsandthemethodstoareport in Q1_readme.pdf. If your code refer to any blog, github, paper and so on, please write the their links in it.
- Output your results in Q1_output.csv. Your .csv file should contain 2 columns as shown below. In “Result”, 0 represents negative and 1 represents positive.
- PackallcodefilesinfolderQ1_code.
- Packallfiles/foldersaboveinfolderQ1likebelow:
Notes:
- Because the number of outliers and inlier is extremely uneven, you need to deal with the problem of data imbalance in the given dataset.
- You are allowed to use any of the methods we mentioned in class or methods and libraries you searched from the Internet.
- We will grade according to the code, the experiment steps and methods you mentioned in the report and the recall and precision of the your model’s prediction.
Q2. Grid-Based Outlier Discovery Approach (8 points)
In this question, you should implement a grid-based outlier detection method
to find outliers in a large data set.
Data Descriptions:
1. RelevantdataisinfolderData_Q2. 2. X.csv:Testingdata,asinput.
ID | Result |
0 | 0 |
1 | 1 |
… | … |
n | 1 |
submissionSample.csv: sample of submission, 0 indicate inlier, 1 indicate outlier.
Requirements:
1. No relevant third-party packages, you must implement the algorithm by
yourself. Submissions:
- Please report your main experimental steps in Q2_readme.pdf. If your codes refer to any blog, github, paper and so on, please report their links in it.
- Output your results in Q2_output.csv. The format refer to submissionSample.csv or below. Note that the .csv file should contain one column.
- PackallcodefilesinfolderQ2_code.
- Packallfiles/foldersaboveinfolderQ2.
Notes:
We will grade according to the code, efficiency of your method, the experiment steps and methods you mentioned in the report and the recall and precision of the your model’s prediction.
Q3. Data Augmentation (5 points)
We all know that adequate training data is a precondition for training machine learning models. But in real-world problems, the data that can be used to train the model is often not enough. Suppose you are doing a classification task and your training dataset is extremely insufficient. Please explain how you will expand the amount of data.
Notes:
You do NOT need to code in this question, but you need to answer in detail.
Please give at least two specific examples to illustrate, such as image
classification, text classification and so on. You can also refer to other
materials to answer this question, if you do so, please also list your
references.
Submissions:
- Put your answer and references in Q3_readme.pdf, and put it in folder Q3.
- Nopagelimitfortheanswer.
result |
0 |
1 |
… |
1 |
Q4. Expectation-Maximization Algorithm (8 points)
In this question, you are required to code by yourself to complete the EM
algorithm.
Data Descriptions:
- ThedataisinData_Q4folder.
- The test data is shown in Q4_Data.csv. There are 6 attributes, which are ‘A’,’B’…’F’, and totally 626 instances in the dataset. You need to cluster all the instances into two classes. Assume the initial centers are c1=(0,0,0,0,0,0) and c2=(1,1,1,1,1,1).
Requirements:
- Report the updated centers and SSE for the first two iterations.
- Report the overall iteration step when your algorithm terminates.
- Report the final converged centers for each cluster.
Submissions:
1. PutallreportsinrequirementsinQ4_readme.pdf.
2. SubmityoursourcecodeinfolderQ4_code.
3. Putfiles/folderaboveinfolderQ4.
Notes:
Please use the terminate condition below:
Terminate condition: the EM algorithm will terminate when:
1). The sum of L1-distance for each pair of old-new center
∑ ‖Cold − Cnew‖1 each center
is smaller than 0.0001, or
2). The iteration step is greater than the maximum iteration step 100.
Q5. Sentiment Analysis and Opinion Mining (18
points)
Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author or speaker), or the intended emotional communication (that is to say, the emotional effect intended by the author or interlocutor).
Recently, the birth of genetically edited babies has created a huge controversy. People have different opinions on the development of genetic technology. Now you are asked to do a Sentiment Analysis Task based on topics such as “gene editing”, “genetic engineering”, and “transgene”.
In this task, you need to implement a series of processes from background investigation to collecting data to determining the solution to implementing the
algorithm to get the results.
Requirements:
➢ About training:
1. You can use any algorithm that you know, supervised learning and
unsupervised learning are both ok.
2. You can use any data resource. You need to find your own data resources
such as some corpus or lexical resource.
3. You can not directly use complete models that others have already trained
to do classification without any detailed process.
4. You can use some basic word vector models to build your algorithm, such
as word2vec.
➢ About testing:
1. You need to collect 100 pieces of news/comments/articles related to the
above topic, then use your algorithm or model to divide them into two
categories——positive or negative. (You may need some knowledge of
Crawler, in Python, BeautifulSoup is a very useful crawler tool.)
2. You can get the test text from any website or social media.
3. The text you collect must be in English.
Submissions:
1. Please write down your algorithm details and all links of the model/data
resources you used in the Q5_readme.pdf. If your code refer to any blog,
github, paper and so on, please write the their links in it.
2. Please put all the code of this question in the Q5_code folder.
3. You need submit Q5_output.csv. Your .csv file should contain 3 columns as shown below. In “Result”, 0 represents negative and 1 represents positive.
4.Put all files/folders above in folder Q5.
Notes:
1. Crawler is not required and will be not included in the scoring criteria. You
can also get the text manually or by other tools.
2. Your grade will be based on your report, code and accuracy of the results.
ID | Contents | Result |
0 | text0 | 0 |
1 | text1 | 1 |
… | … | … |
99 | text99 | 1 |
Q6. Short Video Classification (18 points)
Short video applications are becoming more and more popular among the young. In reality, internet companies generally use automatic classification algorithms to process large amounts of short video uploaded by users. Now you are asked to implement a short video classification algorithm.
Data Descriptions:
1. DataisinData_Q6folder:
2. In our data set, there are a total of 2063 training videos (in the
“train_video” folder) and 896 test videos (in the “test_video” folder). They belong to the following 15 categories:
Label ID | Video Content |
0 | dog |
1 | boy selfie |
2 | seafood |
3 | snack |
4 | doll catching |
5 | Ballroom dance |
6 | origami |
7 | weave |
8 | ceramic art |
9 | Zheng playing |
10 | fitness |
11 | parkour |
12 | diving |
13 | billiards |
14 | eye makeup |
“train_tag.txt” stores the label information. For example, in the line
“873879927.mp4,3”, “873879927.mp4” represents the file name of the video,
“3” is the label of the video.
Requirements:
➢ About training:
1. You can use any algorithm that you know.
2. You can not directly use complete models that others have already trained
to do classification without any detailed process.
➢ About grading rule
Your grade will be based on your report, code and accuracy of the results.
Submissions:
1. Please write down your algorithm details in the Q6_readme.pdf. If your
code refer to any blog, github, paper and so on, please write the their links in
it.
2. Please put all the code of this question in the Q6_code folder.
3. You need submit Q6_output.csv. Your .csv file should contain 2 columns as shown below.
file_name | label |
861108106.mp4 | 0 |
… | … |
801454381_11_21.mp4 | 13 |
4. Put all files/folders in Q6 folder.
Q7. Selective Materialization Problem (10 points)
(1) Can you select a set V of k views such that Gain (V U {top view}, {top
view}) is maximized? Set k=3. Please give your answer. (7 points)
(2) The lecture note shows how greedy algorithm perform badly. Please give
a complete proof of the lower bound of this greedy algorithm. (Maybe you
need some references.) (3 points)
Requirements:
1. For(1),youmustcodebyyourselfratherthancalculatebyhand. Submissions:
1. PutyourcodesinQ7_codefolder.
2. For(1),youshouldgivetheanswerinQ7_readme.pdf.
3. For(2),youshouldgivetheproofinQ7_readme.pdf.
4. Putallfiles/foldersinQ7folder.
Q8. Recommendation System (18 points)
You have learned some basic models including user-based and item-based collaborative filtering methods in class. However, some features of items or users can also help to improve the performance of recommendation system. In this question, you are given a movie rating dataset which contains basic rating information, movie titles, movie genres and user information. You should try to figure out how to utilize these features to construct a recommendation system.
You need to:
Based on rating_train.csv and other relevant data in this question, build a recommendation system to predict user ratings for movies in rating_test.csv. Data Descriptions:
1. DataisinData_Q8folder.
2. DatadescriptionsareshowninData_Q8. Submissions:
1. Put all you codes in Q8_code folder.
2. Your prediction result named as Q8_output.csv. (Notes: Each line
represents the user’s rating of the movie, which means your final
output should contain 3 columns: ‘UserID’, ‘MovieID’ and ‘Rating’)
Bonus:
There will be some bonus score if you use some creative or the state-of-arts models . Please report the advantages of your methods and list all your references in Q8_readme.pdf.